When to write a custom recipe
The built-in recipes cover the eight core LISS modules as distributed by the archive. You might need a custom recipe when:
- You are working with a special-purpose LISS study (not one of the eight core modules) that has no built-in recipe.
- You want to subset waves or add project-specific harmonization on top of the standard recipe.
- You have downloaded data in a non-standard format (e.g. Stata instead of SPSS, or English vs Dutch labels).
- You want to change the sentinel-code policy, perhaps keeping “don’t know” as a distinct category rather than recoding to NA.
Anatomy of a recipe
A recipe is a YAML file with these top-level sections:
| Section | Required | Purpose |
|---|---|---|
| meta | yes | module name, version, covered waves |
| global | yes | id variable, labelled policy, sentinel policy |
| wave_index | yes | one entry per wave: id, year, file pattern |
| variable_rules | no | prefix stripping, type coercion, renaming |
| harmonization_rules | no | sentinel recoding, value mapping, label fixes |
| boundary_rules | no | era flags, split variables, structural NA |
| drop_retain_rules | no | columns to drop or force-keep |
| derived_variables | no | new columns computed from existing ones |
| validation_checks | no | post-merge assertions |
| logging | yes | log file names, summary artifact toggle |
Every rule must have a unique rule_id, a non-empty action from the controlled vocabulary, and a description.
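The three requirements above can be checked by hand before handing a recipe to the package validator. What follows is a minimal base-R sketch, not lissr's own validator: `check_rules()` is a hypothetical helper, and the `allowed` vector is only an illustrative subset of the harmonization vocabulary.

```r
# Hypothetical helper (not part of lissr): checks one rule section against the
# three requirements and returns a character vector describing any problems.
check_rules <- function(rules, allowed_actions) {
  problems <- character(0)

  # collect rule ids, using NA for rules that have none
  ids <- vapply(rules, function(r) {
    if (is.null(r$rule_id)) NA_character_ else as.character(r$rule_id)
  }, character(1))

  if (anyNA(ids)) {
    problems <- c(problems, "rule with missing rule_id")
  }
  # rule ids must be unique within the section
  dup <- unique(ids[!is.na(ids) & duplicated(ids)])
  if (length(dup) > 0) {
    problems <- c(problems, paste("duplicate rule_id:", dup))
  }

  for (r in rules) {
    label <- if (is.null(r$rule_id)) "<no rule_id>" else r$rule_id
    # action must be non-empty and drawn from the allowed vocabulary
    if (is.null(r$action) || !nzchar(r$action)) {
      problems <- c(problems, paste0(label, ": empty action"))
    } else if (!(r$action %in% allowed_actions)) {
      problems <- c(problems, paste0(label, ": unknown action '", r$action, "'"))
    }
    # description must be present and non-empty
    if (is.null(r$description) || !nzchar(r$description)) {
      problems <- c(problems, paste0(label, ": missing description"))
    }
  }
  problems
}

# illustrative input: the second rule reuses an id, misspells the action,
# and has an empty description
rules <- list(
  list(rule_id = "H01", action = "recode_to_na", description = "recode sentinels"),
  list(rule_id = "H01", action = "recode_na", description = "")
)
allowed <- c("recode_to_na", "value_recode", "fix_label")

problems <- check_rules(rules, allowed)
print(problems)
```

A check like this is only a convenience while drafting; `validate_recipe()` remains the authoritative test.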
Scenario A: fork and modify a built-in recipe
The most common path is to start from a built-in recipe and adapt it.
library(lissr)
# load the built-in Health recipe as a list
recipe <- liss_recipe("ch")
# inspect its structure
str(recipe, max.level = 1)
#> List of 10
#>  $ meta               :List of 8
#>  $ global             :List of 6
#>  $ wave_index         :List of 17
#>  $ variable_rules     :List of 5
#>  $ harmonization_rules:List of 8
#>  $ boundary_rules     :List of 4
#>  $ drop_retain_rules  :List of 3
#>  $ derived_variables  :List of 2
#>  $ validation_checks  :List of 4
#>  $ logging            :List of 3
Subset to specific waves
# keep only waves from 2015 onward for an analysis of recent trends
recipe$wave_index <- purrr::keep(
  recipe$wave_index,
  ~ as.integer(.x$year) >= 2015
)
# update the metadata to reflect the change
recipe$meta$covered_waves <- purrr::map_chr(recipe$wave_index, "id")
recipe$meta$notes <- "forked from built-in ch recipe; restricted to 2015+"
Change the labelled policy
By default, the recipe converts haven-labelled columns to numeric. If your analysis needs factor levels (e.g. for ordinal logistic regression in R where factor ordering matters), switch the policy:
recipe$global$labelled_policy <- "to_factor"
Add a custom harmonization rule
Suppose you want to recode self-rated health (suffix 001) from its original 1–5 scale to a binary indicator (1–2 = poor/moderate vs 3–5 = good/very good/excellent):
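A sketch of what such a rule could look like, modelled on the `value_recode` rule used in Scenario C (same `suffixes`, `mapping`, and `waves` fields). The rule id `CUSTOM_H01_srh_binary` and the 0/1 coding are illustrative choices, and whether `value_recode` accepts a many-to-one mapping is an assumption, so run the result through `validate_recipe()` before relying on it. To keep the snippet runnable without the package, `recipe` is a stand-in list here; in a real session it would come from `liss_recipe("ch")`.

```r
# Stand-in for the object returned by liss_recipe("ch") so that the snippet
# runs on its own; only the section touched here is included.
recipe <- list(harmonization_rules = list())

# Hypothetical rule: collapse self-rated health (suffix 001) to a 0/1 indicator.
srh_binary_rule <- list(
  rule_id     = "CUSTOM_H01_srh_binary",
  action      = "value_recode",
  description = "recode self-rated health: 1-2 to 0, 3-5 to 1",
  suffixes    = list("001"),
  mapping     = list("1" = 0, "2" = 0, "3" = 1, "4" = 1, "5" = 1),
  waves       = "all"
)

# wrap the rule in an outer list so append() adds it as a single element
# instead of splicing its fields into the rule list one by one
recipe$harmonization_rules <- append(
  recipe$harmonization_rules,
  list(srh_binary_rule)
)
```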
Validate and run
# the schema validator catches typos in action names, missing fields, etc.
validate_recipe(recipe, "custom_ch_recipe")
# run the merge
result <- merge_liss_module(recipe, data_dir = "data/ch", output_dir = "output/custom")
Save the modified recipe to disk
yaml::write_yaml(recipe, "my_ch_recipe.yml")
# later, reload it
recipe <- load_recipe("my_ch_recipe.yml")
Scenario B: write a recipe from scratch
For a non-core LISS study (e.g. a special-purpose module on COVID attitudes), you write the recipe from scratch.
my_recipe <- list(
  meta = list(
    module = "covid",
    module_label = "COVID Attitudes Special Study",
    recipe_version = "1.0.0",
    created = format(Sys.Date()),
    source_spec = "covid_codebook_EN.pdf",
    covered_waves = list("covid20a", "covid20b", "covid21c"),
    schema_version = "1.0.0"
  ),
  global = list(
    id_variable = "nomem_encr",
    wave_variable = "wave_id",
    year_variable = "wave_year",
    labelled_policy = "to_numeric",
    missing_variable_policy = "warn_and_create_na",
    strip_label_whitespace = TRUE
  ),
  wave_index = list(
    list(id = "covid20a", year = 2020, file_pattern = "covid20a_*"),
    list(id = "covid20b", year = 2020, file_pattern = "covid20b_*"),
    list(id = "covid21c", year = 2021, file_pattern = "covid21c_*")
  ),
  variable_rules = list(
    list(
      rule_id = "V01_strip_prefix",
      action = "strip_prefix",
      description = "remove wave prefix from all columns"
    )
  ),
  harmonization_rules = list(
    list(
      rule_id = "H01_sentinel_recode",
      action = "recode_to_na",
      description = "recode -9 (DK) and -8 (PNTS) to NA",
      scope = "all_numeric",
      codes = list(-9, -8)
    )
  ),
  boundary_rules = list(),
  drop_retain_rules = list(),
  derived_variables = list(),
  validation_checks = list(),
  logging = list(
    log_file = "covid_merge_log.jsonl",
    report_file = "covid_merge_report.txt",
    summary_artifact = list(enabled = TRUE)
  )
)
# validate before first use
validate_recipe(my_recipe, "covid_recipe.yml")
# save to disk
yaml::write_yaml(my_recipe, "covid_merge_recipe.yml")
Scenario C: keep sentinel codes as distinct values
Some analyses treat “don’t know” and “prefer not to say” as
informative categories (e.g. in survey methodology research studying
item non-response). In that case, skip the harmonization rules that
recode sentinels to NA.
recipe <- liss_recipe("ch")
# remove all recode_to_na rules
recipe$harmonization_rules <- purrr::discard(
  recipe$harmonization_rules,
  ~ .x$action == "recode_to_na"
)
# optionally add a rule that recodes sentinels to explicit positive codes
# instead of dropping them
recipe$harmonization_rules <- append(
  recipe$harmonization_rules,
  list(list(
    rule_id = "CUSTOM_keep_dk",
    action = "value_recode",
    description = "recode -9 to 97 (DK) and -8 to 98 (PNTS) for explicit modelling",
    suffixes = list("001"),
    mapping = list("-9" = 97, "-8" = 98),
    waves = "all"
  ))
)
validate_recipe(recipe, "ch_keep_sentinels")
result <- merge_liss_module(recipe, data_dir = "data/ch", output_dir = "output/sentinels")
Scenario D: add comparability contracts to your custom rules
If your custom recipe introduces a boundary that future users should know about, annotate it with a comparability contract:
recipe <- liss_recipe("cs")
recipe$boundary_rules <- append(
  recipe$boundary_rules,
  list(list(
    rule_id = "CUSTOM_B01_sport_freq_redesign",
    action = "add_period_flag",
    description = "sport frequency question redesigned in cs20m",
    flag_column = "sport_freq_era",
    waves_early = list("cs08a", "cs09b", "cs10c", "cs11d", "cs12e",
                       "cs13f", "cs14g", "cs15h", "cs16i", "cs17j",
                       "cs18k", "cs19l"),
    waves_late = list("cs20m", "cs21n", "cs22o", "cs23p", "cs24q"),
    early_label = "open_numeric",
    late_label = "categorical",
    comparability = list(
      status = "non_comparable",
      method = "no_pool",
      rationale = paste(
        "pre-cs20m used open numeric entry conditional on participation;",
        "post-cs20m uses categorical scale asked unconditionally.",
        "do not pool without explicit period interaction."
      )
    )
  ))
)
validate_recipe(recipe, "cs_with_custom_boundary")
Controlled action vocabulary reference
Consult the canonical schema for the full list of allowed actions. The validator rejects any action not in this vocabulary:
# inspect the vocabulary programmatically
lissr:::VALID_ACTIONS
#> $variable_rules
#> [1] "strip_prefix"          "type_coerce"           "rename"
#> [4] "set_label"             "apply_labelled_policy" "strip_value_labels"
#> [7] "note_only"
#>
#> $harmonization_rules
#> [1] "recode_to_na"        "value_recode"        "fix_label"
#> [4] "crosswalk"           "strip_question_stem" "lowercase_labels"
#> [7] "flag_only"           "note_only"
#>
#> $boundary_rules
#> [1] "add_era_flag"     "add_flag"         "add_period_flag"
#> [4] "split_variable"   "structural_na"    "filter_rows"
#> [7] "crosswalk_rename" "stack_aux_files"  "note_only"
#>
#> $drop_retain_rules
#> [1] "drop"                    "retain"
#> [3] "retain_if_present"       "retain_as_metadata_only"
#> [5] "note_only"
Sharing recipes with collaborators
If you publish an analysis using LISS data, include your recipe YAML
in the replication package. Anyone with lissr installed can
then reproduce your exact merge pipeline:
# in the collaborator's R session
recipe <- lissr::load_recipe("custom_ch_recipe.yml")
result <- lissr::merge_liss_module(recipe, data_dir = "data/ch", output_dir = "output")
The recipe file is the single source of truth for every data-cleaning decision, replacing the typical sequence of ad-hoc R scripts that are hard to audit and easy to break.