Overview

lissr includes a recipe-driven merge engine that processes YAML specifications conforming to the Canonical Schema v1.0.0. Each recipe encodes all merge-relevant decisions for a module: file patterns, variable harmonization, boundary handling, comparability contracts, and validation checks.

All merged output is written in SPSS .sav format to preserve variable labels, value labels, and user-defined missing values.
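
The round trip below is a minimal sketch of what "preserving" these attributes means when reading .sav files with haven. It does not use lissr output: the variable q1, its labels, and the missing code 9 are made up for illustration. Note that haven::read_sav() only keeps user-defined missing values as tagged values when user_na = TRUE is set.

```r
library(haven)

# Toy variable: 9 is declared a user-defined missing value (all values made up).
q1 <- labelled_spss(
  c(1, 2, 9),
  labels    = c(Yes = 1, No = 2, Refused = 9),
  na_values = 9,
  label     = "Toy yes/no question"
)

tmp <- tempfile(fileext = ".sav")
write_sav(tibble::tibble(q1 = q1), tmp)

# user_na = TRUE keeps 9 as a tagged value; the default (FALSE) turns it into NA.
kept    <- read_sav(tmp, user_na = TRUE)
dropped <- read_sav(tmp)

attr(kept$q1, "label")      # variable label
attr(kept$q1, "labels")     # value labels
attr(kept$q1, "na_values")  # user-defined missing codes
```

If the distinction between "refused" and "not asked" matters downstream, reading merged files with user_na = TRUE is the safer choice.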

Single module merge

library(lissr)

recipe <- liss_recipe("ch")
result <- merge_liss_module(
  recipe,
  data_dir   = "liss/ch",
  output_dir = "./output"
)

This produces up to four files:

  • ch_merged.sav — merged data (SPSS format, preserving all labels)
  • ch_merge_log.jsonl — audit-grade structured log
  • ch_merge_summary.json — per-run summary (if enabled in recipe)
  • ch_merge_report.txt — human-readable report

Batch merge

modules <- c("ch", "cv", "cd", "cf", "cw", "cp", "cs", "ci")
recipe_paths <- purrr::map_chr(modules, ~ {
  system.file("recipes", paste0(.x, "_merge_recipe.yml"), package = "lissr")
})

results <- merge_liss_modules(
  recipe_paths,
  data_dir   = "liss",
  output_dir = "./output"
)
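
One thing to watch for when building recipe paths this way: system.file() returns an empty string, not an error, when a file is not found. The sketch below is an optional pre-flight check and not part of lissr; the name missing_recipes is introduced here for illustration only.

```r
modules <- c("ch", "cv", "cd", "cf", "cw", "cp", "cs", "ci")

# Build the paths with base R; the result is named by module code.
recipe_paths <- vapply(
  modules,
  function(m) system.file("recipes", paste0(m, "_merge_recipe.yml"), package = "lissr"),
  character(1)
)

# system.file() yields "" for anything it cannot locate.
missing_recipes <- names(recipe_paths)[recipe_paths == ""]

if (length(missing_recipes) > 0) {
  message("No bundled recipe found for: ", paste(missing_recipes, collapse = ", "))
}
```

Running this before the batch merge makes a missing or misnamed recipe visible up front.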

Cross-module panel merge

After merging individual modules, combine them into one wide dataset. Columns are prefixed with the module code to avoid collisions (e.g. ch_s004, cv_s004).

panel <- merge_liss_panel(results, write_to = "./output/liss_panel.sav")

# only respondent-years present in all modules
panel_inner <- merge_liss_panel(results, join_type = "inner")
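
To make the prefixing and the effect of the join type concrete, here is a toy sketch in plain dplyr. It is not how merge_liss_panel() is implemented: the key column wave_year, the helper prefix_module(), and all values are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-ins for two merged modules that both contain a column named s004.
ch <- tibble(nomem_encr = c(1, 2, 3), wave_year = 2020, s004 = c(10, 20, 30))
cv <- tibble(nomem_encr = c(2, 3, 4), wave_year = 2020, s004 = c(5, 6, 7))

keys <- c("nomem_encr", "wave_year")  # assumed respondent-year key

# Prefix every non-key column with the module code (hypothetical helper).
prefix_module <- function(df, code) {
  rename_with(df, ~ paste0(code, "_", .x), .cols = -all_of(keys))
}

ch_p <- prefix_module(ch, "ch")
cv_p <- prefix_module(cv, "cv")

# Keep every respondent-year seen in any module ...
wide_all <- full_join(ch_p, cv_p, by = keys)

# ... or only those present in all modules.
wide_inner <- inner_join(ch_p, cv_p, by = keys)

names(wide_all)
nrow(wide_all)
nrow(wide_inner)
```

In this toy data the full join keeps respondents 1 to 4, while the inner join keeps only respondents 2 and 3, who appear in both modules.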

Validate recipes

recipe <- liss_recipe("ch")
validate_recipe(recipe, "ch_merge_recipe.yml")

Onboard a new wave

When a new wave is released, the onboarding helper automates most of the checklist: variable diffs, candidate wave_index entry, expected-presence checks, and boundary alerts.

onboard_new_wave(
  recipe_path  = system.file("recipes", "ch_merge_recipe.yml", package = "lissr"),
  new_file     = "ch25r_EN_1.0p.sav",
  prev_wave_id = "ch24q"
)
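
For intuition, a variable diff between two waves can be sketched in base R by stripping the wave prefix and comparing what remains. This is only an illustration of the idea, not the logic inside onboard_new_wave(); the variable names below are made up, and the new wave id "ch25r" is inferred from the file name above.

```r
# Made-up variable names for the previous and the new wave.
prev_vars <- c("nomem_encr", "ch24q_m", "ch24q004", "ch24q005")
new_vars  <- c("nomem_encr", "ch25r_m", "ch25r004", "ch25r006")

# Remove the wave prefix so that items can be compared across waves.
strip_wave <- function(x, wave_id) sub(paste0("^", wave_id), "", x)

prev_stems <- strip_wave(prev_vars, "ch24q")
new_stems  <- strip_wave(new_vars, "ch25r")

added   <- setdiff(new_stems, prev_stems)  # items only in the new wave
dropped <- setdiff(prev_stems, new_stems)  # items only in the previous wave

added
dropped
```

Here item 006 is new and item 005 has disappeared; both are the kind of change a recipe update would need to account for.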

Background variables

The background variables module (CA) is a monthly snapshot used as the linkage backbone for all other modules. When merging background variables with survey data:

  1. Use only nomem_encr as the join key (never nohouse_encr).
  2. Match the fieldwork month of the survey, not the calendar year.

The merged output includes a fieldwork_ym column (a YYYYMM integer) that is derived automatically from the LISS variable with the _m suffix. Background variable files carry the period in their filename rather than as a column, so tag each file with its period before stacking:

# example: merge Health survey with background variables
survey <- haven::read_sav("output/ch_merged.sav")

# read avars files and tag each with YYYYMM from the filename
bg_files <- list.files("data/avars/", pattern = "\\.sav$", full.names = TRUE)
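bg_data  <- purrr::map_dfr(bg_files, function(f) {
  # the first six-digit run in the filename is taken as the YYYYMM period
  ym <- as.integer(stringr::str_extract(basename(f), "\\d{6}"))
  if (is.na(ym)) stop("No YYYYMM period found in filename: ", basename(f))
  haven::read_sav(f) |> dplyr::mutate(fieldwork_ym = ym)
})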

merged <- dplyr::left_join(
  survey, bg_data,
  by = c("nomem_encr", "fieldwork_ym")
)
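
After a join like this it is worth checking two things: that the background data has at most one row per person-month (otherwise the left join duplicates survey rows), and which survey rows found no snapshot for their fieldwork month. The sketch below uses toy data frames in place of the real files; the columns q1 and bg_var and all values are made up.

```r
library(dplyr)

# Toy stand-ins for the merged survey and the stacked background files.
survey <- tibble(
  nomem_encr   = c(1, 1, 2),
  fieldwork_ym = c(202011L, 202111L, 202011L),
  q1           = c(3, 4, 2)
)
bg_data <- tibble(
  nomem_encr   = c(1, 2, 2),
  fieldwork_ym = c(202011L, 202011L, 202012L),
  bg_var       = c(1, 2, 2)
)

keys <- c("nomem_encr", "fieldwork_ym")

# Person-months that occur more than once in the background data.
dup_keys <- bg_data |>
  count(nomem_encr, fieldwork_ym) |>
  filter(n > 1)

# Survey rows without a background snapshot for their fieldwork month.
unmatched <- anti_join(survey, bg_data, by = keys)

nrow(dup_keys)
unmatched
```

In the toy data there are no duplicate keys, and the one unmatched row is respondent 1 in 202111, for whom no background snapshot from that month exists.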