Why clean income at all
Self-reported household income in the LISS panel contains a familiar family of artefacts: misplaced decimal points, monthly amounts entered in an annual field, personal income entered in the household field, placeholder values, sign errors, and residual SPSS missing codes. Left in place, a single misplaced zero changes a value tenfold, enough to move a household across several of the income brackets an analysis uses.
liss_clean_income() turns the cleaning procedures that
used to live in ad hoc analysis scripts into a rule-driven, fully
audited stage of the lissr pipeline. Every decision rule is declared in
a YAML ruleset, every applied decision lands in a ledger with its
evidence and a plain-language justification, the original values stay in
the returned data, and a generated report lets any reader reconstruct
and contest each modification.
The ruleset
The default ruleset ships with the package and is the single source of truth for what the cleaner may do:
library(lissr)
rs <- liss_cleaning_ruleset()
print(rs)
system.file("cleaning", "income_cleaning_rules.yml", package = "lissr")
Rules are grouped into four sections. Preparation rules (P01 to P06)
resolve columns, attach background demographics, expand bracket codes,
rectify sign errors, and sweep residual SPSS missing codes. Detection
rules (D01 to D11) identify implausible cells, from unrecoverable
placeholder values through household-level scale errors and robust
statistical outliers to dataset-level extremes. Correction rules (C01 to
C06) generate candidate replacement values. The finalization rule (F01)
voids anything still outside the plausible range. Each rule carries an
id, a description, a rationale, optional literature references,
parameters, and an enabled switch.
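As a rough illustration of the rule fields just listed, the sketch below holds two rules as plain R lists and picks out the enabled ones. The field names follow the list above, but the nesting, the placeholder rationale text, and the parameter value are assumptions made for illustration; the packaged YAML is the authoritative form.

```r
# Hypothetical in-memory form of two rules. Field names follow the
# description in the text; layout and values are illustrative only.
rules <- list(
  list(
    id = "D06",
    description = "household-level scale error",
    rationale = "illustrative placeholder text",
    references = character(0),
    params = list(volatility_min = 0.5),  # illustrative value, not the default
    enabled = TRUE
  ),
  list(
    id = "D10",
    description = "extreme-z net",
    rationale = "illustrative placeholder text",
    references = character(0),
    params = list(),
    enabled = FALSE
  )
)

# Keep only the ids of rules whose enabled switch is on
is_enabled <- vapply(rules, function(r) isTRUE(r$enabled), logical(1))
enabled_ids <- vapply(rules[is_enabled], function(r) r$id, character(1))
print(enabled_ids)
```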
Quick start: dry run first, then correct
The cleaner operates on merged module output. A dry run in
"flag" mode performs the full procedure without touching a
single value; the simulated result arrives in
nethh_proposed and every would-be decision in the
ledger:
result <- merge_liss_module(liss_recipe("ci"), data_dir = "data/ci",
                            output_dir = "output/ci")
dry <- liss_clean_income(result, mode = "flag")
head(dry$decisions)
summary(dry)
Once the proposals look right, run in the default
"correct" mode and write the audit artifacts:
cleaned <- liss_clean_income(result, output_dir = "output/ci")
cleaned$data$nethh # cleaned values
cleaned$data$nethh_observed # untouched input, always preserved
table(cleaned$data$nethh_clean_status, useNA = "ifany")
A third mode, "na_only", voids detected cells without
imputing anything, for analyses that prefer missingness over model-based
corrections.
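The difference between the three modes can be pictured with a toy function on a plain numeric vector. This is only a sketch of the semantics described above, not the package's implementation; apply_mode() and its arguments are hypothetical names.

```r
# Toy illustration of the three modes; apply_mode() is a hypothetical helper.
apply_mode <- function(observed, flagged, proposed,
                       mode = c("correct", "flag", "na_only")) {
  mode <- match.arg(mode)
  cleaned <- observed
  if (mode == "correct") {
    # replace flagged cells with their proposed values
    cleaned[flagged] <- proposed[flagged]
  } else if (mode == "na_only") {
    # void flagged cells without imputing anything
    cleaned[flagged] <- NA_real_
  }
  # "flag" falls through: the values are returned untouched
  cleaned
}

observed <- c(24000, 258000, 31000)
flagged  <- c(FALSE, TRUE, FALSE)
proposed <- c(NA, 25800, NA)

flag_out    <- apply_mode(observed, flagged, proposed, "flag")
correct_out <- apply_mode(observed, flagged, proposed, "correct")
na_out      <- apply_mode(observed, flagged, proposed, "na_only")
```

In the real cleaner the proposal is also kept alongside the data in "flag" mode (the nethh_proposed column); the toy function leaves that out for brevity.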
Reading a decision
Every modified cell has at least one ledger row. A scale-error correction looks like this:
d <- cleaned$decisions
d[d$rule_id == "D06", c("person_id", "wave", "observed", "corrected",
                        "candidate_source", "evidence")]
# the justification column carries the full sentence, for example:
# "detected by scale_error; replaced with the household_median candidate
# 25800 (closest of 4 admissible candidate(s) to the household_median
# 25800; constrained to [8000, 150000])"
The candidates column lists every admissible candidate
the selection saw, so a reader can verify that the applied value was the
closest to the anchor, and the
valid_min/valid_max columns show the
constraint window (category bounds where reported, global limits
otherwise).
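The selection logic that the ledger lets a reader verify (filter candidates to the constraint window, then take the one closest to the anchor) can be sketched in a few lines of base R. The helper select_candidate() and the candidate generators below are assumptions for illustration; they are not the packaged correction rules.

```r
# Hypothetical helper: choose the admissible candidate closest to an anchor.
select_candidate <- function(candidates, anchor, valid_min, valid_max) {
  # admissible = not missing and inside the constraint window
  ok <- !is.na(candidates) & candidates >= valid_min & candidates <= valid_max
  admissible <- candidates[ok]
  if (length(admissible) == 0L) return(NA_real_)
  # ties resolve to the first candidate in the vector
  admissible[which.min(abs(admissible - anchor))]
}

observed <- 258000
# Illustrative candidates only (assumed, not the C01 to C06 rules)
candidates <- c(div10 = observed / 10,
                div12 = observed / 12,
                div100 = observed / 100)

chosen <- select_candidate(candidates, anchor = 25800,
                           valid_min = 8000, valid_max = 150000)
none   <- select_candidate(candidates, anchor = 25800,
                           valid_min = 500000, valid_max = 600000)
```

Returning NA when nothing is admissible mirrors the idea that a cell with no defensible replacement is voided rather than guessed.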
The report
liss_cleaning_report() renders three artifacts: a
markdown report whose methodology section is generated from the ruleset
itself (every rule with its description, rationale, parameters, and
references), the complete decision ledger as CSV, and an engine-shaped
JSONL audit log:
liss_cleaning_report(cleaned, output_dir = "output/ci")
Because the methodology section is generated from the ruleset that actually ran, it cannot drift from the rules that were applied.
Disagreeing with a rule
Every rule can be switched off or re-parameterised per run, and the overrides are recorded in the report:
# a stricter volatility requirement for scale-error detection,
# no extreme-z net, and a higher plausibility cap
cleaned2 <- liss_clean_income(
  result,
  income_cap = 175000,
  disable = c("D10"),
  params = list(D06 = list(volatility_min = 0.7))
)
For a lasting policy, copy the packaged YAML, edit it, and pass the
path; validate_cleaning_ruleset() enforces the schema:
cleaned3 <- liss_clean_income(result, ruleset = "my_income_rules.yml")
Attaching background demographics
Donor-pool matching (correction rule C05) works best with household size, position, occupation, age, and education attached. Rule P01 aligns a monthly background file to the annual wave scale, keeps one row per person-year, and joins on the person id, never on the household id:
background <- haven::read_sav("data/avars_201801_EN_1.0p.sav")
cleaned <- liss_clean_income(result, background = background)
Equivalised income
For cross-household comparison, liss_equivalise_income()
provides the scale the source analysis pipelines used, alongside the
OECD-modified and square-root alternatives:
library(magrittr)
panel <- cleaned$data %>%
  dplyr::mutate(
    stand_inc = liss_equivalise_income(nethh, aantalhh, aantalki)
  )
Where the rules came from
The ruleset consolidates the income-cleaning procedures of two
production analysis projects and corrects several latent defects found
while porting them. INCOME_CLEANING_DESIGN.md in the
repository maps every legacy construct to its rule, documents each
deviation, and describes how to propose new rules or parameter changes
through pull requests. The methodology can thus improve under community
review while the audit trail keeps every historical run
reproducible.