read_sav() exposes internal auxiliary variables of long string columns instead of reconstructing them #798

## Description

When reading a `.sav` file that contains long string variables (> 255 bytes), `read_sav()` exposes the internal auxiliary segmentation variables as separate columns in the dataframe, instead of reconstructing the original variable with its full content.

SPSS handles this correctly: it reads the same file and displays each variable as a single column at its full length. The issue is specific to `read_sav()`.


---

## Reproducible Example

```r
library(haven)

# Create a dataframe with:
# - Q2:  a long string (> 255 bytes), which haven segments internally
#        generating auxiliaries Q21, Q22, Q23, Q24, Q25, Q26
# - Q23: another long string (> 255 bytes), whose name collides
#        with auxiliary Q23 generated for Q2

long_string_q2  <- paste(rep("a", 800), collapse = "")  # 800 bytes
long_string_q23 <- paste(rep("b", 300), collapse = "")  # 300 bytes

df <- data.frame(
  Q2  = long_string_q2,
  Q23 = long_string_q23,
  stringsAsFactors = FALSE
)

# Write to SAV — SPSS opens this file correctly,
# showing Q2 and Q23 as single columns with full content
haven::write_sav(df, "test.sav")

# Read back with haven
df_read <- haven::read_sav("test.sav")
names(df_read)
# Expected: c("Q2", "Q23")
# Actual:   c("Q2", "Q23", "Q231")  ← unintended auxiliary variable exposed

# Q23 is also truncated to 255 bytes instead of 300
vapply(df_read, function(x) attr(x, "format.spss"), character(1))
# Q2  → A400  (correct)
# Q23 → A255  (incorrect, should be A300)
# Q231 → A55  (internal auxiliary, should not be exposed)
```

---

## Expected behavior

`read_sav()` should reconstruct long string variables from their internal segments and return a dataframe with the original variables at their full length, as SPSS does.

---

## Actual behavior

- Internal auxiliary variables (e.g. `Q231`) are exposed as separate columns
- The original variable (`Q23`) is truncated to 255 bytes
- The dataframe has more columns than the original
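
The three symptoms above can be checked programmatically after a round trip. The following is a minimal sketch in base R; `check_roundtrip()` is a hypothetical helper written for this report, not a haven function, and it only compares column names and maximum byte lengths of character columns.

```r
# Hypothetical helper (not part of haven): compare a data frame before and
# after a write/read round trip and report the symptoms listed above.
check_roundtrip <- function(original, reread) {
  # Longest value in a character column, measured in bytes;
  # NA for non-character columns, 0 if every value is missing.
  max_bytes <- function(x) {
    if (!is.character(x)) return(NA_integer_)
    x <- x[!is.na(x)]
    if (length(x) == 0) return(0L)
    max(nchar(x, type = "bytes"))
  }

  shared <- intersect(names(original), names(reread))
  before <- vapply(original[shared], max_bytes, integer(1))
  after  <- vapply(reread[shared], max_bytes, integer(1))

  list(
    # Columns that appear only after reading (e.g. exposed auxiliaries)
    extra_columns   = setdiff(names(reread), names(original)),
    # Columns that were lost on the way
    missing_columns = setdiff(names(original), names(reread)),
    # Character columns that came back shorter than they went in
    truncated       = shared[!is.na(before) & !is.na(after) & after < before]
  )
}
```

With the objects from the reproducible example, `check_roundtrip(df, df_read)` would be expected to list `Q231` under `extra_columns` and `Q23` under `truncated` if the behavior described here occurs.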

---

## Context

This was discovered while comparing metadata across monthly SAV files. One month had `Q23` with a maximum of 391 bytes, while another had 633 bytes. The 633-byte version was written and read back correctly. The 391-byte version produced `Q23` truncated to 255 bytes and an extra `Q231` column when read back with `read_sav()`.

After extensive debugging, the root cause was identified: `Q2` (1482 bytes) generates internal auxiliaries `Q21`...`Q26`. Since `Q23` exists as an original variable and also exceeds 255 bytes, `read_sav()` cannot correctly distinguish between the original `Q23` and the auxiliary `Q23` generated for `Q2`, resulting in the corrupted read.

The file itself is valid, since SPSS reads it correctly. The issue is in how `read_sav()` reassembles segmented long string variables when a real variable name collides with a segment name.
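
To find other variables that might be affected before writing a file, a rough pre-check can look for the naming pattern described above. This is a heuristic sketch only: it assumes auxiliary names are the long variable's name followed by digits (as with `Q2` and `Q21`...`Q26` in this report), which is an inference from the observed behavior and not a documented haven or ReadStat rule. `find_possible_segment_collisions()` and its `threshold` argument are hypothetical names.

```r
# Heuristic, based only on the naming pattern described in this issue
# (a long string named "Q2" producing auxiliaries such as "Q21", "Q22", ...).
# Hypothetical helper, not part of haven.
find_possible_segment_collisions <- function(df, threshold = 255L) {
  # Longest non-missing value in a character column, in bytes
  max_bytes <- function(x) {
    x <- x[!is.na(x)]
    if (length(x) == 0) return(0L)
    max(nchar(x, type = "bytes"))
  }

  is_chr <- vapply(df, is.character, logical(1))
  widths <- vapply(df[is_chr], max_bytes, integer(1))
  long_names <- names(widths)[widths > threshold]

  hits <- list()
  for (base in long_names) {
    others <- setdiff(names(df), base)
    # Keep other columns whose name is exactly `base` followed by digits
    suffix <- substring(others, nchar(base) + 1L)
    clash <- others[startsWith(others, base) & grepl("^[0-9]+$", suffix)]
    if (length(clash) > 0) hits[[base]] <- clash
  }
  hits
}
```

For the data frame in the reproducible example, this returns a list with one entry, `Q2`, containing `"Q23"`. The check over-reports on purpose: it flags any digit suffix, whether or not that many segments would actually be generated.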

---

## Session Info

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 26200)

Matrix products: default

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] haven_2.5.4

loaded via a namespace (and not attached):
 [1] R6_2.5.1          utf8_1.2.4        tzdb_0.4.0        magrittr_2.0.3    glue_1.7.0        tibble_3.2.1     
 [7] pkgconfig_2.0.3   lifecycle_1.0.4   readr_2.1.5       cli_3.6.2         fansi_1.0.6       vctrs_0.6.5      
[13] compiler_4.3.2    forcats_1.0.0     rstudioapi_0.15.0 tools_4.3.2       hms_1.1.3         pillar_1.9.0     
[19] rlang_1.1.3
