Changelog
Source: NEWS.md
lissr 1.2.0
Feature release: a rule-driven income-cleaning framework for merged LISS data. Every behavior below is exercised by the regression tests in tests/testthat/test-clean-income.R and by a seeded end-to-end smoke run (inst/scripts/verification/income_cleaning_smoke.R); the architecture, rule catalog, and the mapping to the legacy analysis scripts live in the repository’s INCOME_CLEANING_DESIGN.md.
Income cleaning
- New `liss_clean_income()` detects, evaluates, and corrects implausible household-income values under a declarative YAML ruleset (inst/cleaning/income_cleaning_rules.yml, schema 1.0.0) of 24 rules: six preparation rules (P01-P06), eleven detectors (D01-D11), six correction-candidate generators (C01-C06), and one finalizer (F01). Rules dispatch by action, evaluate in ruleset order, and every rule carries a description, rationale, parameters, and literature references that the generated report reproduces.
- Full decision transparency. Original values are preserved unchanged in `<target>_observed`; every modified cell is marked in `<target>_clean_status` with its final action and rule; and every decision, applied or proposed, lands in an 18-column typed ledger with the responsible rule, the evidence, the admissible candidate set with sources, the anchor, the valid range, and a plain-language justification. `liss_cleaning_report()` renders the methodology directly from the ruleset plus a decision appendix, and writes the ledger as CSV and an engine-shaped JSONL audit log.
- Three modes. `correct` applies corrections; `flag` is a true dry run whose `<target>_proposed` column is identical to what `correct` would write while the data remain untouched; `na_only` voids detected cells without imputing. Re-running on already-cleaned data aborts (the `*_observed` column acts as a double-cleaning guard).
- Researcher control. `income_cap`, `min_income`, `disable`, `enable_only`, per-rule `params`, a `variables` mapping, and custom ruleset files are honored and recorded in the run metadata and the report, so a reviewer can reproduce any configured run from its report alone.
- Seeded smoke evidence: on 400 synthetic households (2,168 rows) with 158 planted errors across seven families, recovery is 100 percent in every family (65/65 decimal shifts, 33/33 extra zeros, 9/9 cap blowouts, 22/22 personal-income echoes, 11/11 tiny junk values, 11/11 sign flips, 7/7 residual sentinels), 0 of 2,010 clean cells are falsely modified, 83 percent of scale corrections land within 10 percent of the true value (median relative error 3.8 percent, at the simulation’s noise floor), and the run completes in about half a second.
- The modified-z detector (D10) gained a `min_relative_deviation` gate (default 0.3) after the smoke run exposed that the ungated legacy criterion rewrote 12.09 percent of clean cells in tight households, where a tiny MAD inflates the z-score of ordinary variation. The gate removes every false positive with recall unchanged; the tight-household case is pinned in the regression suite.
- `liss_equivalise_income()` converts household income to a per-equivalent-adult scale (`weighted_sqrt`, matching the source pipelines’ `stand_inc` formula, plus `oecd_modified` and `sqrt`).
- New exports: `liss_clean_income()`, `liss_cleaning_ruleset()`, `validate_cleaning_ruleset()`, `liss_cleaning_report()`, and `liss_equivalise_income()`, with print and summary methods for run results and rulesets. A new vignette, `income-cleaning`, walks through the workflow.
- No new dependencies. Temporal smoothing uses a native weighted moving average (equivalent to `imputeTS::na_ma` with linear weighting, including window widening), and the numeric kernels are base R.
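The effect of the D10 gate can be illustrated with a minimal base R sketch. This is not the package's implementation: the function name `flag_modified_z`, the 3.5 cutoff, and the 0.6745 scaling constant are assumptions based on the conventional modified z-score, and only the 0.3 default for `min_relative_deviation` comes from the entry above.

```r
# Illustrative sketch only: a modified z-score with a relative-deviation gate.
# The cutoff (3.5) and the 0.6745 constant are conventional values assumed
# here; they are not read from the package ruleset.
flag_modified_z <- function(x, z_threshold = 3.5, min_relative_deviation = 0.3) {
  med <- median(x, na.rm = TRUE)
  mad_raw <- median(abs(x - med), na.rm = TRUE)  # unscaled median absolute deviation
  z <- 0.6745 * (x - med) / mad_raw              # modified z-score
  rel_dev <- abs(x - med) / abs(med)             # deviation relative to the median
  ungated <- !is.na(z) & abs(z) > z_threshold    # legacy-style criterion
  gated <- ungated & rel_dev >= min_relative_deviation
  list(ungated = ungated, gated = gated)
}

# A tight household: the MAD is tiny, so a 4 percent bump gets a large z-score.
tight <- flag_modified_z(c(2500, 2510, 2505, 2495, 2600))

# A household with a genuine scale error (an extra zero on the last value).
shifted <- flag_modified_z(c(2500, 2510, 2505, 2495, 25000))
```

In the tight household the ungated criterion flags the last value while the gated one flags nothing; in the second household both criteria flag the inflated value, which is the behavior the entry describes as removing false positives with recall unchanged.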
Corrections to the source cleaning logic
The framework supersedes the income-cleaning blocks of the two analysis scripts it was distilled from. Eleven behaviors were deliberately changed, each documented in INCOME_CLEANING_DESIGN.md and pinned by a regression test, among them: the donor pool no longer offers the flagged row as its own donor; the power-of-ten kernel returns a full-length vector so zeros and negatives cannot desynchronize magnitudes from rows; the target variable resolves by explicit name and alias instead of a net|brut pattern match that could capture personal-income columns; blanket abs() on the target became the ledgered sign-rectification rule P03; bound violations rank by deviation ratio rather than first index; households process in wave order rather than file order; and residual SPSS user-missing codes are swept by the declared metadata (P06) rather than trusted to upstream reads.
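The full-length-vector correction to the power-of-ten kernel can be sketched as follows. The helper name `power_of_ten` and the choice of `NA` for non-positive inputs are assumptions made for illustration; the design document, not this sketch, defines the actual kernel.

```r
# Hypothetical helper: order of magnitude per row, always the same length
# as the input. Zeros, negatives, and missing values yield NA instead of
# being dropped.
power_of_ten <- function(x) {
  out <- rep(NA_integer_, length(x))
  ok <- !is.na(x) & x > 0
  out[ok] <- as.integer(floor(log10(x[ok])))
  out
}

income <- c(2500, 0, -1800, 25000, NA, 310)
mag <- power_of_ten(income)

# For contrast: a kernel that filters first returns a shorter vector, so its
# positions no longer line up with the rows of `income`.
mag_dropped <- floor(log10(income[!is.na(income) & income > 0]))
```

Because `mag` has one element per row, it can be indexed or column-bound against the data directly, whereas `mag_dropped` would have to be re-aligned by hand.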
Tests
- 44 new test blocks in `tests/testthat/test-clean-income.R`: kernel units against hand-computed values, one fixture household per detector, ledger invariants, mode contracts, determinism and row-order independence, override paths, alias and fallback resolution, background attachment, report artifacts, and equivalisation.
- Full suite in installed-package context: 74 test blocks, 208 passing expectations, 0 failures, 6 skips (5 empirical gates behind `LISSR_VERIFICATION_DIR`, 1 CRAN skip).
lissr 1.1.0
Correctness release. Every fix below was verified against real LISS Panel files; the empirical evidence, per-wave counts, and methodology live in the repository’s lissr-verification-report.md and in the regression tests under tests/testthat/test-engine-regressions.R.
Merge engine
- Value recodes now use snapshot semantics. `value_recode`, `recode_to_na`, and `recode` masks are built against the column as it stood when the rule started, so overlapping maps can no longer chain. On real data the old sequential loop misclassified 86 of 2,992 answered respondents (2.9 percent) in cr08a and 312 of 2,222 (14.0 percent) in cr14g once the religion crosswalks were made to run; the snapshot engine reproduces the verified target distributions exactly.
- `recode_to_na` accepts the `recode:` alias for its sentinel map and honors wave-scoped `exclude:` blocks (skip cells whose suffix and wave both match). This activates the cv module’s HR01 through HR04 sentinel rules, which previously validated but never executed.
- Superseded-release protection. When several primary files match one wave’s `file_pattern`, the engine ranks release versions parsed from the file names, keeps the highest, and warns; unrankable candidates abort. Auxiliary files declared via `aux_files` resolve independently of the primary pattern and must be disjoint from the primary file on the id variable; any shared respondent id aborts. A duplicate-id gate runs on every wave regardless. Previously all pattern matches were stacked, which duplicated all 3,626 cd10c respondents when both the 1.0p and the superseding 1.1p release were on disk, and resurrected a field the 1.1p release had redacted.
- `read_wave_file` reads by extension from a whitelist (.sav, .zsav, .dta, .csv) and aborts on anything else instead of parsing it as CSV. SPSS files are read with `user_na = TRUE`, so declared DK/refusal codes reach the recipes as values instead of being silently converted to NA at read.
- Value labels and user-missing declarations round-trip under `labelled_policy: to_numeric`. Metadata is stashed per wave at read time and restored at write time where provably safe (identical metadata across waves, every observed value accounted for); columns that cannot be restored pass through a per-wave residual sweep that converts codes their own wave declared user-missing to NA, honoring recipe `exclude` blocks as a veto. Outputs regain their value labels (75 labelled columns in the cv verification merge) and no DK or refusal code can leak into the output as a substantive value.
- Rules that resolve zero target columns write a `NO_TARGETS` entry to the JSONL log, so mis-keyed or mis-scoped rules are visible in the audit trail.
- Validation checks that name an unimplemented type report `SKIP` with `passed = NA` instead of PASS; summaries report `n_pass`, `n_fail`, and `n_skip`. New `uniqueness` check family (aliases `assert_unique`, `n_duplicates`).
- New `strict` argument on `merge_liss_module()` and `merge_liss_modules()`: failed checks of severity `error` abort before any output is written. Default `FALSE` preserves 1.0.0 behavior.
- Two-factor authentication returns `NULL` on failure instead of a half-initialized session, so `liss_login()` can no longer cache an unauthenticated session as logged in.
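The difference between the old sequential loop and snapshot semantics for value recodes can be sketched in base R. The two functions below are hypothetical stand-ins written for illustration, not the engine's code, and the map is a toy example rather than one of the religion crosswalks.

```r
# Sequential application: each pair rewrites the column in place, so a cell
# recoded by an earlier pair can be caught again by a later pair (chaining).
recode_sequential <- function(x, map) {
  for (i in seq_along(map)) {
    from <- as.numeric(names(map)[i])
    x[!is.na(x) & x == from] <- map[[i]]
  }
  x
}

# Snapshot application: every mask is built against the column as it stood
# before the rule started, so each cell is recoded at most once.
recode_snapshot <- function(x, map) {
  original <- x
  for (i in seq_along(map)) {
    from <- as.numeric(names(map)[i])
    x[!is.na(original) & original == from] <- map[[i]]
  }
  x
}

# Overlapping map: the target of the first pair (2) is the source of the second.
map <- c("1" = 2, "2" = 3)
answers <- c(1, 2, 3, NA, 1)

chained <- recode_sequential(answers, map)  # 3 3 3 NA 3
snapshot <- recode_snapshot(answers, map)   # 2 3 3 NA 2
```

Under the sequential loop, respondents who answered 1 end up coded 3; under snapshot semantics they stay at 2, which is what the map states.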
Recipes
- cr: HR10, HR11, HR12, HR20, HR21 scoped with `suffixes` (the singular `suffix` key was not read, so the entire three-era religion harmonization silently never ran). All 40 crosswalk code assignments were verified against the wave value labels and codebooks. HR12’s instrument description corrected to 14 values plus -9.
- cd: cd10c loads the superseding 1.1p release only; the stacking rule is retired and the overlap claim corrected (both releases contain the identical 3,626 respondents; 1.1p removes the redacted open-text dwelling-location item cd10c059). CHK06 and CHK07 re-pointed at the executable `uniqueness` check.
- cp: the A1 DK recodes scope with `suffixes: ["010", "011", "019"]` (the `items` key was not read, so the recode fell back to every numeric column, destroying legitimate duration paradata).
- cs: A1_dk_recode rewritten as an executable `value_recode` (999 to -9) on the verified DK suffixes 001, 002, and 283 for the pre-cs20m waves.
- cv: schema_version 1.1.0; HR01 through HR04 now execute through the engine’s `recode` alias and `exclude` support (no rule changes needed beyond comments).
- Revised recipes carry `recipe_version: 1.1.0`.
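How an unread scoping key can widen a rule to every numeric column, as in the cp entry above, can be sketched like this. The resolver `resolve_targets`, the rule lists, and the column `cp08a193` are hypothetical illustrations; the real engine's target resolution is not reproduced here.

```r
# Hypothetical target resolver: it reads only the `suffixes` key. A rule
# scoped under any other key (for example `items`) resolves as unscoped and
# falls back to every numeric column.
resolve_targets <- function(data, rule) {
  numeric_cols <- names(data)[vapply(data, is.numeric, logical(1))]
  suffixes <- rule[["suffixes"]]            # exact match, no partial matching
  if (is.null(suffixes)) return(numeric_cols)
  pattern <- paste0("(", paste(suffixes, collapse = "|"), ")$")
  numeric_cols[grepl(pattern, numeric_cols)]
}

# Toy wave: two DK-coded items and one duration column (assumed names).
wave <- data.frame(
  cp08a010 = c(1, 999),
  cp08a011 = c(999, 3),
  cp08a193 = c(999, 120)   # a legitimate duration that happens to equal 999
)

mis_keyed <- list(action = "value_recode", items = c("010", "011"))
scoped <- list(action = "value_recode", suffixes = c("010", "011"))

targets_mis_keyed <- resolve_targets(wave, mis_keyed)  # all three columns
targets_scoped <- resolve_targets(wave, scoped)        # only the two DK items
```

With the mis-keyed rule the duration column is swept into the recode; with `suffixes` it is left alone.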
lissr 1.0.0
First public release.
- Recipe-driven merge engine. Longitudinal LISS waves are merged from declarative YAML recipes that conform to the canonical schema (`CANONICAL_SCHEMA.md`, schema version 1.0.0). A recipe captures every merge-relevant decision for a module: wave file patterns, variable harmonization, boundary handling, comparability contracts, and validation checks.
- Controlled action vocabulary with fail-fast validation. Recipes are validated before any merge runs via `validate_recipe()` (also called by `load_recipe()` and `merge_liss_module()`); unknown actions and malformed rules are rejected up front.
- Authoring-time check for unrecognized rule keys. `validate_recipe()` emits a non-fatal warning listing any rule-level key that the merge engine neither consults nor sanctions as documentation, so mis-named keys are surfaced at load time rather than ignored silently. The check is warning-only; every recipe still loads and merges unchanged. The recognized set and the documentation allow-list are both documented in `CANONICAL_SCHEMA.md`.
- Audit-grade JSONL logging, with a per-run summary artifact.
- Ten built-in module recipes: Assets (ca), Housing (cd), Family and Household (cf), Health (ch), Economic Integration (ci), Personality (cp), Religion and Ethnicity (cr), Culture and Sports (cs), Politics and Values (cv), and Work and Schooling (cw).
- Authentication against the LISS Data Archive with two-factor verification; credentials stored via the system keyring.
- Interactive browse, select, and download workflow (`liss_modules()`, `liss_wave_matrix()`, `liss_select()`, `liss_download()`).
- New-wave onboarding via `onboard_new_wave()` to extend an existing recipe to a newly released wave.
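The warning-only check for unrecognized rule keys could look roughly like the sketch below. The key sets and the function `check_rule_keys` are hypothetical placeholders; the actual recognized set and documentation allow-list are the ones defined in `CANONICAL_SCHEMA.md`, and this is not the body of `validate_recipe()`.

```r
# Hypothetical key sets, for illustration only.
recognized_keys <- c("action", "suffixes", "waves", "recode", "exclude")
documentation_keys <- c("description", "rationale")

# Warn about rule-level keys that are neither consulted nor allowed as
# documentation, and return them invisibly. The rule itself is not modified
# or rejected.
check_rule_keys <- function(rule_id, rule) {
  unknown <- setdiff(names(rule), c(recognized_keys, documentation_keys))
  if (length(unknown) > 0) {
    warning(
      sprintf("Rule %s: unrecognized key(s): %s",
              rule_id, paste(unknown, collapse = ", ")),
      call. = FALSE
    )
  }
  invisible(unknown)
}

# A rule that uses the singular `suffix` where `suffixes` is expected.
rule <- list(action = "value_recode", suffix = "010", description = "DK recode")

caught <- NULL
unknown <- withCallingHandlers(
  check_rule_keys("HR10", rule),
  warning = function(w) {
    caught <<- conditionMessage(w)   # keep the message for inspection
    invokeRestart("muffleWarning")   # non-fatal: execution continues
  }
)
```

Keeping the check non-fatal means existing recipes keep loading while the mis-named key is still reported to the author.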