Description
When reading a .sav file that contains long string variables (> 255 bytes), read_sav() exposes the internal auxiliary segmentation variables as separate columns in the dataframe, instead of reconstructing the original variable with its full content.
SPSS handles this correctly — it reads the same file and displays the variable as a single column with its full length. The issue is specific to read_sav().
Reproducible Example
library(haven)
# Create a dataframe with:
# - Q2: a long string (> 255 bytes), which haven segments internally
# generating auxiliaries Q21, Q22, Q23, Q24, Q25, Q26
# - Q23: another long string (> 255 bytes), whose name collides
# with auxiliary Q23 generated for Q2
long_string_q2 <- paste(rep("a", 800), collapse = "") # 800 bytes
long_string_q23 <- paste(rep("b", 300), collapse = "") # 300 bytes
df <- data.frame(
  Q2 = long_string_q2,
  Q23 = long_string_q23,
  stringsAsFactors = FALSE
)
# Write to SAV — SPSS opens this file correctly,
# showing Q2 and Q23 as single columns with full content
haven::write_sav(df, "test.sav")
# Read back with haven
df_read <- haven::read_sav("test.sav")
names(df_read)
# Expected: c("Q2", "Q23")
# Actual: c("Q2", "Q23", "Q231") ← unintended auxiliary variable exposed
# Q23 is also truncated to 255 bytes instead of 300
vapply(df_read, function(x) attr(x, "format.spss"), character(1))
# Q2 → A400 (correct)
# Q23 → A255 (incorrect, should be A300)
# Q231 → A55 (internal auxiliary, should not be exposed)
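The symptoms above (extra columns, truncated values) can be checked programmatically rather than by eye. The following is a minimal base-R sketch, not part of haven: `check_roundtrip()` is a hypothetical helper name introduced here for illustration. It compares the data frame that was written with the one that was read back.

```r
# Hypothetical helper (not a haven function): compare a data frame
# before and after a write/read round trip.
check_roundtrip <- function(original, read_back) {
  # Columns that appear only after reading back
  extra <- setdiff(names(read_back), names(original))
  shared <- intersect(names(original), names(read_back))

  # Longest value in a column, measured in bytes
  max_width <- function(x) max(nchar(as.character(x), type = "bytes"), 0L)

  widths <- data.frame(
    variable = shared,
    bytes_written = unname(vapply(original[shared], max_width, integer(1))),
    bytes_read = unname(vapply(read_back[shared], max_width, integer(1))),
    stringsAsFactors = FALSE
  )
  # TRUE where the read-back value is shorter than what was written
  widths$truncated <- widths$bytes_read < widths$bytes_written

  list(extra_columns = extra, widths = widths)
}
```

With the objects from the example above, `check_roundtrip(df, df_read)` should list any exposed auxiliary columns under `extra_columns` and mark shortened variables in `widths$truncated`.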
Expected behavior
read_sav() should reconstruct long string variables from their internal segments and return a dataframe with the original variables at their full length, as SPSS does.
Actual behavior
- Internal auxiliary variables (e.g. Q231) are exposed as separate columns
- The original variable (Q23) is truncated to 255 bytes
- The dataframe has more columns than the original
Context
This was discovered while comparing metadata across monthly SAV files. One month had Q23 with a maximum of 391 bytes, while another had 633 bytes. The 633-byte version was written and read back correctly. The 391-byte version produced Q23 truncated to 255 bytes and an extra Q231 column when read back with read_sav().
After extensive debugging, the root cause was identified: Q2 (1482 bytes) generates internal auxiliaries Q21...Q26. Since Q23 exists as an original variable and also exceeds 255 bytes, read_sav() cannot correctly distinguish between the original Q23 and the auxiliary Q23 generated for Q2, resulting in the corrupted read.
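To screen other files for the same kind of conflict, the naming collision described above can be approximated in base R. This is only a sketch under loudly stated assumptions: that each segment holds 252 bytes of data, and that auxiliary names are formed by appending the segment index to the variable name (as in Q21...Q26 above). Neither detail has been confirmed against haven's source, and `segment_names()` and `find_segment_collisions()` are hypothetical helpers, not haven functions.

```r
# ASSUMPTION: 252 data bytes per segment, and auxiliary names built as
# <variable name><segment index>. Both are guesses based on the
# behaviour described in this report, not on haven's actual code.
segment_names <- function(name, width_bytes, segment_bytes = 252L) {
  # Strings of 255 bytes or fewer are not segmented
  if (width_bytes <= 255L) return(character(0))
  n_segments <- ceiling(width_bytes / segment_bytes)
  paste0(name, seq_len(n_segments))
}

# widths: named integer vector mapping variable name -> maximum byte width
find_segment_collisions <- function(widths) {
  hits <- lapply(names(widths), function(v) {
    # Auxiliary names for v that coincide with another real variable
    intersect(segment_names(v, widths[[v]]), setdiff(names(widths), v))
  })
  names(hits) <- names(widths)
  # Keep only variables with at least one colliding auxiliary name
  hits[lengths(hits) > 0]
}

# Widths taken from the failing monthly file described above
find_segment_collisions(c(Q2 = 1482L, Q23 = 391L))
```

Under these assumptions the call flags Q2 as producing an auxiliary name that clashes with the real variable Q23. The sketch only identifies candidate name clashes; it does not explain why the 633-byte version read back correctly.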
The file itself is valid — SPSS reads it correctly. The issue is in how read_sav() handles the reassembly of segmented long string variables when naming conflicts exist.
Session Info
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 26200)
Matrix products: default
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] haven_2.5.4
loaded via a namespace (and not attached):
[1] R6_2.5.1 utf8_1.2.4 tzdb_0.4.0 magrittr_2.0.3 glue_1.7.0 tibble_3.2.1
[7] pkgconfig_2.0.3 lifecycle_1.0.4 readr_2.1.5 cli_3.6.2 fansi_1.0.6 vctrs_0.6.5
[13] compiler_4.3.2 forcats_1.0.0 rstudioapi_0.15.0 tools_4.3.2 hms_1.1.3 pillar_1.9.0
[19] rlang_1.1.3