Skip to content

Latest commit

 

History

History
184 lines (141 loc) · 7.75 KB

File metadata and controls

184 lines (141 loc) · 7.75 KB

Contents


Column Selection

Wide CSV files often contain dozens or hundreds of columns, but a given application typically only needs a handful of them. The headers: { only: } and headers: { except: } options let you declare upfront which columns you want, so SmarterCSV skips allocation and hash insertion for everything else — both in the Ruby path and in the C-accelerated hot path.

Options

Option Default Description
headers: { only: } nil Keep only the listed columns in each result hash
headers: { except: } nil Remove the listed columns from each result hash

You cannot use both options at the same time — doing so raises SmarterCSV::ValidationError.

Basic usage

# Keep only two columns out of a wide file
data = SmarterCSV.process('big.csv', headers: { only: [:id, :email] })
# => [{id: 1, email: "alice@example.com"}, ...]

# Keep everything except one noisy column
data = SmarterCSV.process('big.csv', headers: { except: [:internal_notes] })

Input flexibility

Both options accept an Array of symbols or strings, or a single symbol or string — anything that makes sense as a column name. All values are normalized to symbols internally.

headers: { only: :id }                     # single symbol — same as [:id]
headers: { only: 'id' }                    # single string — normalized to :id
headers: { only: [:id, :email] }           # array of symbols
headers: { only: ['id', 'email'] }         # array of strings — normalized to symbols

Names refer to post-mapping keys

headers: { only: } and headers: { except: } use the post-mapping column name — the symbol that actually appears in the result hash after key_mapping: has been applied. You never need to know the original CSV header spelling.

# CSV has header "First Name"; key_mapping renames it to :given_name
data = SmarterCSV.process('contacts.csv',
  key_mapping:  { first_name: :given_name },
  headers:      { only: [:given_name] },   # post-mapping name
)
# => [{given_name: "Alice"}, ...]

Interaction with with_line_numbers:

:csv_line_number is added to each hash after column selection runs, so it is always present when with_line_numbers: true — even if it is not listed in headers: { only: }.

data = SmarterCSV.process('data.csv',
  headers:           { only: [:name] },
  with_line_numbers: true,
)
data.each { |row| puts "#{row[:csv_line_number]}: #{row[:name]}" }

Interaction with strict:

strict: true raises SmarterCSV::HeaderSizeMismatch when a data row contains more fields than the header row. This check runs before column selection, so schema validation still catches malformed rows even when headers: { only: } is active.

# Raises HeaderSizeMismatch on the row with extra fields, regardless of headers: { only: }
SmarterCSV.process('data.csv', headers: { only: [:name] }, strict: true)

Extra columns without strict:

When strict: is false (the default) and a data row has more fields than the header, the extra columns are silently dropped — they cannot be in the headers: { only: } set, so the filter discards them naturally.

Important: missing_headers: :auto (auto-generating names like column_7, column_8 for extra data columns) does not work in combination with headers: { only: }. headers: { only: } is a performance improvement that causes the parser to stop scanning a row once all requested columns have been found — any extra columns beyond the header count are never visited, so no auto-names are generated for them. If you need to capture auto-named overflow columns, do not use headers: { only: } at the same time.

Unknown column names are silently ignored

Listing a column name that doesn't exist in the file is not an error. The column simply never appears in any row hash.

# :nonexistent_column is not in the file — no error, just absent from results
data = SmarterCSV.process('data.csv', headers: { only: [:id, :nonexistent_column] })

Performance

Both options are implemented in the C extension (when acceleration is enabled). Excluded columns are skipped entirely inside the C parsing loop — no Ruby string is allocated, no numeric conversion runs, and no rb_hash_aset call is made for fields the caller doesn't need. This makes column selection a genuine performance option for wide CSV files, not just a post-processing filter.

The Ruby fallback path applies the same filter via hash.select! / hash.reject! after parsing, giving correct results on all platforms.

headers: { only: } vs headers: { except: } — performance asymmetry

headers: { only: } enables early exit. Once every requested column has been parsed, the parser stops scanning the current row entirely — the remaining fields are never visited. For a 500-column file where you only need 5 columns near the start, this can be 10–14× faster than parsing all columns.

headers: { except: } cannot have early exit. To know which columns to keep, the parser must scan every field in the row to the end. Skipping just a few columns out of many saves very little work, so benchmark results for headers: { except: } are typically flat (0.7×–1.0× vs full parse).

Rule of thumb:

  • Use headers: { only: } when you want a small subset of a wide file — this is the fast path.
  • Use headers: { except: } only when you want almost everything and excluding a known noisy column is more convenient than listing all the ones you want.
  • Avoid headers: { except: } as a performance tool on wide files — it provides no speed benefit.

headers: { only: } vs remove_unmapped_keys:

If you are already using key_mapping: to rename headers, the remove_unmapped_keys: true option lets you implicitly drop everything not in the map — without listing each unwanted column explicitly. This is a convenient alternative to headers: { only: } when renaming and selecting go hand in hand:

# With key_mapping + remove_unmapped_keys: convenient when renaming
SmarterCSV.process('data.csv',
  key_mapping:          { col_a: :name, col_b: :email },
  remove_unmapped_keys: true,
)

# With headers: { only: }: better for pure selection — C-path early exit applies
SmarterCSV.process('data.csv',
  headers: { only: [:col_a, :col_b] },
)

headers: { only: } is the faster choice for wide files since unneeded fields are skipped inside the C parser before any Ruby objects are created. remove_unmapped_keys: is a post-parse filter — all fields are parsed first, then the unwanted keys are deleted. See Header Transformations for more details.


PREVIOUS: Header Validations | NEXT: Data Transformations | UP: README