read_sav() exposes internal auxiliary variables of long string columns instead of reconstructing them #798

@jplavulca

Description

When reading a .sav file that contains long string variables (> 255 bytes), read_sav() exposes the internal auxiliary segmentation variables as separate columns in the dataframe, instead of reconstructing the original variable with its full content.

SPSS handles this correctly: it reads the same file and displays each variable as a single column with its full length. The issue is specific to read_sav().


Reproducible Example

library(haven)

# Create a dataframe with:
# - Q2:  a long string (> 255 bytes), which haven segments internally
#        generating auxiliaries Q21, Q22, Q23, Q24, Q25, Q26
# - Q23: another long string (> 255 bytes), whose name collides
#        with auxiliary Q23 generated for Q2

long_string_q2  <- paste(rep("a", 800), collapse = "")  # 800 bytes
long_string_q23 <- paste(rep("b", 300), collapse = "")  # 300 bytes

df <- data.frame(
  Q2  = long_string_q2,
  Q23 = long_string_q23,
  stringsAsFactors = FALSE
)

# Write to SAV — SPSS opens this file correctly,
# showing Q2 and Q23 as single columns with full content
haven::write_sav(df, "test.sav")

# Read back with haven
df_read <- haven::read_sav("test.sav")
names(df_read)
# Expected: c("Q2", "Q23")
# Actual:   c("Q2", "Q23", "Q231")  ← unintended auxiliary variable exposed

# Q23 is also truncated to 255 bytes instead of 300
vapply(df_read, function(x) attr(x, "format.spss"), character(1))
# Q2  → A400  (correct)
# Q23 → A255  (incorrect, should be A300)
# Q231 → A55  (internal auxiliary, should not be exposed)

Expected behavior

read_sav() should reconstruct long string variables from their internal segments and return a dataframe with the original variables at their full length, as SPSS does.


Actual behavior

  • Internal auxiliary variables (e.g. Q231) are exposed as separate columns
  • The original variable (Q23) is truncated to 255 bytes
  • The dataframe has more columns than the original
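
The symptoms listed above can be checked programmatically after a round trip. The following is a minimal sketch in base R; `check_roundtrip()` is a hypothetical helper (not part of haven), and the `read_back` data frame is simulated here to mirror the reported symptoms rather than produced by an actual read_sav() call.

```r
# Hypothetical helper (not part of haven): compare a data frame with the
# version obtained after writing it to .sav and reading it back.
check_roundtrip <- function(original, read_back) {
  # Columns that appear only after reading back (e.g. exposed auxiliaries)
  extra_columns <- setdiff(names(read_back), names(original))

  # Character columns present in both data frames
  shared <- intersect(names(original), names(read_back))
  is_chr <- vapply(shared, function(nm) is.character(original[[nm]]), logical(1))
  chr_cols <- shared[is_chr]

  # Longest value in bytes, ignoring missing values
  max_bytes <- function(x) {
    x <- as.character(x)
    x <- x[!is.na(x)]
    if (length(x) == 0) 0L else max(nchar(x, type = "bytes"))
  }

  before <- vapply(chr_cols, function(nm) max_bytes(original[[nm]]), integer(1))
  after  <- vapply(chr_cols, function(nm) max_bytes(read_back[[nm]]), integer(1))

  list(
    extra_columns = extra_columns,
    truncated     = chr_cols[after < before]
  )
}

original <- data.frame(
  Q2  = strrep("a", 800),
  Q23 = strrep("b", 300),
  stringsAsFactors = FALSE
)

# Simulated read-back that mimics the reported symptoms (assumption:
# this is NOT real read_sav() output, only an illustration)
read_back <- data.frame(
  Q2   = strrep("a", 800),
  Q23  = strrep("b", 255),
  Q231 = strrep("b", 45),
  stringsAsFactors = FALSE
)

result <- check_roundtrip(original, read_back)
print(result)
```

With haven installed, `read_back` could instead come from haven::read_sav() on a file produced by haven::write_sav(original, ...), which would turn the helper into a regression check for this issue.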

Context

This was discovered while comparing metadata across monthly SAV files. One month had Q23 with a maximum of 391 bytes, while another had 633 bytes. The 633-byte version was written and read back correctly. The 391-byte version produced Q23 truncated to 255 bytes and an extra Q231 column when read back with read_sav().

After extensive debugging, the root cause was identified: Q2 (1482 bytes) generates internal auxiliaries Q21...Q26. Since Q23 exists as an original variable and also exceeds 255 bytes, read_sav() cannot correctly distinguish between the original Q23 and the auxiliary Q23 generated for Q2, resulting in the corrupted read.
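
The collision described above can be illustrated with a small sketch that predicts auxiliary names and flags those that match real variable names. This is not haven's or ReadStat's actual algorithm: the 252-byte segment size, the 255-byte threshold, and the "name followed by segment index" rule are assumptions taken from the naming pattern reported here, and `predict_aux_names()` and `find_aux_collisions()` are hypothetical helpers. The exact number of auxiliaries may differ from what haven generates.

```r
# Illustrative sketch only. Assumptions (not confirmed haven behavior):
#   - strings longer than 255 bytes are split into 252-byte segments
#   - auxiliary segments are named <name><index>, starting at index 1
predict_aux_names <- function(name, n_bytes, segment_bytes = 252L) {
  if (n_bytes <= 255) {
    return(character(0))  # short strings are stored in a single variable
  }
  n_segments <- ceiling(n_bytes / segment_bytes)
  paste0(name, seq_len(n_segments - 1))
}

# byte_lengths: named numeric vector, variable name -> maximum byte length
find_aux_collisions <- function(byte_lengths, segment_bytes = 252L) {
  var_names <- names(byte_lengths)
  collisions <- list()
  for (nm in var_names) {
    aux <- predict_aux_names(nm, byte_lengths[[nm]], segment_bytes)
    # An auxiliary name collides if another real variable already uses it
    hit <- intersect(aux, setdiff(var_names, nm))
    if (length(hit) > 0) {
      collisions[[nm]] <- hit
    }
  }
  collisions
}

byte_lengths <- c(Q2 = 800, Q23 = 300)
collisions <- find_aux_collisions(byte_lengths)
print(collisions)
```

Under these assumptions, a check like this could be run before write_sav() to spot variable names that are at risk, so they can be renamed as a temporary workaround.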

The file itself is valid, since SPSS reads it correctly. The issue lies in how read_sav() reassembles segmented long string variables when naming conflicts exist.


Session Info

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 26200)

Matrix products: default

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] haven_2.5.4

loaded via a namespace (and not attached):
[1] R6_2.5.1 utf8_1.2.4 tzdb_0.4.0 magrittr_2.0.3 glue_1.7.0 tibble_3.2.1
[7] pkgconfig_2.0.3 lifecycle_1.0.4 readr_2.1.5 cli_3.6.2 fansi_1.0.6 vctrs_0.6.5
[13] compiler_4.3.2 forcats_1.0.0 rstudioapi_0.15.0 tools_4.3.2 hms_1.1.3 pillar_1.9.0
[19] rlang_1.1.3
