# Real-World CSV Files in Production

CSV is the most common data exchange format in enterprise software — and also one of the most inconsistently implemented. This page documents what you will actually encounter when processing production CSV files, and how SmarterCSV handles each case.

## Status Legend

| Symbol | Meaning |
|--------|---------|
| ✅ | Handled automatically — no configuration needed |
| 🔘 | Handled — but requires the user to specify an option |
| ❌ | Not handled — caller must pre-process or work around |

## Encoding & BOM

Real-world files come from dozens of different systems, each with its own default encoding. Excel in particular is notorious for writing UTF-8 files with a Byte Order Mark (BOM) that trips up many parsers.

| Issue | Status | Notes |
|-------|--------|-------|
| UTF-8 with BOM (`\xEF\xBB\xBF`) | ✅ | Stripped automatically from the first line. Excel always writes this. |
| `\r\n` CRLF line endings | ✅ | Auto-detected. The default on Windows and most enterprise exports. |
| `\r` only (classic Mac) | ✅ | Auto-detected. Rare today, but still seen in legacy pipelines. |
| Windows-1252 / Latin-1 | 🔘 | Specify `file_encoding: 'windows-1252'`. Common in European financial exports, older SAP systems, QuickBooks. |
| UTF-16 LE with BOM | 🔘 | Specify `file_encoding: 'utf-16le'`. Some Microsoft SQL Server and Access exports default to this. |
| Shift-JIS / EUC-JP | 🔘 | Specify `file_encoding: 'shift_jis'` or `'euc-jp'`. Japanese ERP and POS systems. |
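
In plain Ruby, the transcoding that `file_encoding:` triggers looks like this (a minimal sketch; the `SmarterCSV.process` call in the comment is illustrative, and the file name is hypothetical):

```ruby
# Pure-Ruby sketch of the transcoding behind the file_encoding: option.
# 0xE9 is "é" in Windows-1252; the same byte is invalid as UTF-8.
sample = String.new("caf\xE9", encoding: 'Windows-1252')
utf8   = sample.encode('UTF-8')   # => "café"

# Illustrative SmarterCSV call (not executed here):
#   SmarterCSV.process('export.csv', file_encoding: 'windows-1252')
```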

## Quoting & Escaping

Two competing quoting conventions are common in the wild: RFC 4180 (used by Excel) and backslash escaping (used by MySQL and PostgreSQL). SmarterCSV defaults to `:auto` mode, which tries backslash escaping first and falls back to RFC 4180.

| Issue | Status | Notes |
|-------|--------|-------|
| RFC 4180 double-quote escaping (`""`) | ✅ | The Excel standard. Handled by `quote_escaping: :auto` (default). |
| Backslash escaping (`\"`) | ✅ | Used by MySQL `SELECT INTO OUTFILE`, `mysqldump`, PostgreSQL `COPY`. Handled by the `:auto` default. |
| Newlines inside quoted fields | ✅ | Multi-line field stitching. Common in address fields, notes, and CRM comment exports. |
| Mid-field quote characters (`5'10"`, inch marks, apostrophes) | ✅ | `quote_boundary: :standard` (default since 1.16.0) only recognizes quotes at field boundaries. Mid-field quotes are treated as literal characters. |
| Semicolon-delimited files mislabeled as CSV | ✅ | `col_sep: :auto` (default) detects the actual separator. Common in European locales where the comma is the decimal separator. |
| Tab-delimited TSV files | ✅ | `col_sep: :auto` detects tabs. Common in bioinformatics and some government data portals. |
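
The two conventions encode an embedded quote differently. Ruby's stdlib `CSV`, which speaks only RFC 4180, shows the doubled-quote form:

```ruby
require 'csv'

# RFC 4180: a quote inside a quoted field is written twice.
row = CSV.parse_line(%q{height,"5'10"" tall"})
# row => ["height", "5'10\" tall"]

# The backslash convention writes the same field as "5'10\" tall";
# SmarterCSV's quote_escaping: :auto (default) accepts either form.
```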

## Header Quirks

Headers in production files are rarely as clean as you'd expect. They carry units, source-system field names, BOM characters, duplicates, and sometimes no headers at all.

| Issue | Status | Notes |
|-------|--------|-------|
| BOM on first header field | ✅ | Stripped automatically. Without this, the first key would be `:"\xEF\xBB\xBFname"` instead of `:name`. |
| Duplicate headers | ✅ | Disambiguated using `duplicate_header_suffix` (default `'_'`) → `:email`, `:email_2`, `:email_3`. |
| Empty or whitespace-only headers | ✅ | Auto-named using `missing_header_prefix` (default `'column_'`) → `:column_1`, `:column_2`. Values are never silently dropped. |
| Trailing comma on header row (phantom empty column) | ✅ | The phantom column is auto-named just like any other empty header. |
| Headers with spaces and special characters (`Revenue (USD)`) | ✅ | Spaces and dashes are normalized to underscores → `:"revenue_(usd)"`. Parentheses, slashes, etc. are preserved. |
| Extra data columns beyond the header row | ✅ | Auto-generates `column_N` names for extra fields. Controlled by the `missing_headers:` option. |
| No header row at all | 🔘 | Use `headers_in_file: false, user_provided_headers: [:col1, :col2, ...]`. Common in raw database dumps and fixed-format legacy exports. |
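
A rough approximation of the header normalization described above, for illustration only (this is not SmarterCSV's actual implementation):

```ruby
# Simplified sketch: downcase, collapse spaces and dashes to underscores,
# keep parentheses and slashes, and symbolize the result.
def normalize_header(raw)
  raw.strip.downcase.gsub(/[\s-]+/, '_').to_sym
end

normalize_header('Revenue (USD)')  # => :"revenue_(usd)"
normalize_header('First Name')     # => :first_name
```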

## Numeric & Data Type Landmines

Numeric conversion is one of the most common sources of data loss. SmarterCSV converts values that look like numbers by default — which is correct for most cases — but certain fields must be excluded explicitly.

| Issue | Status | Notes |
|-------|--------|-------|
| Integer and float conversion | ✅ | `convert_values_to_numeric: true` (default). `"42"` → `42`, `"3.14"` → `3.14`. |
| Currency symbols in values (`$1,234.56`, `€1.234,56`) | ✅ / 🔘 | Won't match the numeric pattern — safely left as a string. Use `value_converters` if a numeric value is needed. |
| Percentage values (`12.5%`) | ✅ / 🔘 | Won't match the numeric pattern — safely left as a string. Use `value_converters` if a numeric value is needed. |
| Leading zeros (ZIP codes, phone numbers, SKUs, account numbers) | 🔘 | `convert_values_to_numeric: { except: [:zip, :phone, :sku] }`. Without this, `"01234"` becomes `1234`. One of the most common silent data-loss bugs in CSV processing; US ZIP codes can have leading zeros. |
| NULL / empty value variants (`NULL`, `\N`, `N/A`, `(null)`, `#N/A`) | 🔘 | Use `nil_values_matching: /\A(NULL\|\\N\|N\/A\|#N\/A\|\(null\))\z/i`. Without configuration, these are left as literal strings. |
| Date values (`2023-01-15`, `01/02/2023`, `Jan 2, 2023`) | 🔘 | Use `value_converters` with a date-parsing lambda. SmarterCSV does not auto-convert dates — format ambiguity (is `01/02/2023` Jan 2 or Feb 1?) makes auto-conversion unsafe. |
| Boolean variants (`Y/N`, `Yes/No`, `TRUE/FALSE`, `1/0`, `X`/blank in SAP) | 🔘 | Use `value_converters` for the relevant columns. |
| European number format (`1.234,56` meaning `1234.56`) | 🔘 | Use a value converter that swaps dot and comma before parsing. Common in German, French, Italian, and Spanish exports. |
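
A hedged sketch of a European-number converter: delete the thousands dots, then turn the decimal comma into a dot. The `:amount` column name in the comment is a hypothetical example:

```ruby
# Convert "1.234,56" (European format) to 1234.56.
european_number = lambda do |value|
  value.to_s.delete('.').tr(',', '.').to_f
end

european_number.call('1.234,56')  # => 1234.56

# Illustrative use with SmarterCSV (column name is hypothetical):
#   SmarterCSV.process(file, value_converters: { amount: european_number })
```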

## File Size & Structure

| Issue | Status | Notes |
|-------|--------|-------|
| Millions of rows | ✅ | Use `chunk_size: N` for batch processing. SmarterCSV streams the file and never loads it entirely into memory. |
| Gigabyte-sized files | ✅ | Streaming architecture. Memory usage is proportional to chunk size, not file size. |
| Single-column CSV files | ✅ | `col_sep: :auto` handles files with no detected separator gracefully (fixed in issue #222). |
| Ragged rows — fewer fields than headers | ✅ | Missing trailing fields produce no key in the hash. Combined with `remove_empty_values: true` (default), short rows are handled cleanly. |
| Ragged rows — more fields than headers | ✅ | Extra columns are auto-named `column_N` via `missing_header_prefix`. |
| Empty or whitespace-only file | ✅ | Raises `SmarterCSV::EmptyFileError` with a clear message instead of a cryptic internal error. |
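
The chunked-processing pattern can be sketched as follows. The `SmarterCSV.process` call, file name, and `Model` in the comment are illustrative assumptions; `each_slice` shows the batching idea in plain Ruby:

```ruby
# Illustrative chunked call (file name and Model are hypothetical):
#   SmarterCSV.process('huge.csv', chunk_size: 500) do |chunk|
#     Model.insert_all(chunk)   # chunk is an array of up to 500 row-hashes
#   end

# The batching idea itself: slice a stream into fixed-size groups.
rows    = (1..7).map { |i| { id: i } }
batches = rows.each_slice(3).to_a
batches.map(&:size)  # => [3, 3, 1]
```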

## Enterprise & Application-Specific Patterns

### Databases & Data Warehouses

| Source | Issue | Status | Notes |
|--------|-------|--------|-------|
| MySQL `SELECT INTO OUTFILE` | Backslash quote escaping | ✅ | `quote_escaping: :auto` default. |
| PostgreSQL `COPY TO` | Backslash quote escaping, `\N` for NULL | ✅ / 🔘 | Escaping handled automatically; `\N` as nil requires `nil_values_matching`. |
| SEC EDGAR | Pipe-delimited, UTF-8, clean format | ✅ | `col_sep: :auto` detects the pipe separator. |
| UNIX DB dumps† | CTRL-A column separator, CTRL-B row separator, `#` comment lines | 🔘 | `col_sep: "\cA", row_sep: "\cB", comment_regexp: /^#/` |

### ERP & CRM

| Source | Issue | Status | Notes |
|--------|-------|--------|-------|
| SAP ALV / IDOC exports | Space-padded fixed-width fields | ✅ | `strip_whitespace: true` (default) trims all field values. |
| SAP BW/BEx | Very wide exports (300–500+ columns) | ✅ | No column-count limit. |
| Salesforce reports | Trailing empty columns, quoted address fields with newlines | ✅ | Both handled by default. |

### Spreadsheets

| Source | Issue | Status | Notes |
|--------|-------|--------|-------|
| Excel "Save As CSV" | UTF-8 BOM, RFC 4180 quoting, 1,048,576-row limit | ✅ | BOM stripped, quoting handled. The row limit is an Excel constraint — SmarterCSV will parse whatever Excel wrote. |

### Finance & Banking

| Source | Issue | Status | Notes |
|--------|-------|--------|-------|
| Stripe / Coinbase / modern fintechs | Clean UTF-8 CSV, ISO 8601 dates, no BOM | ✅ | No special configuration needed. |
| Bank statement exports (Chase, Wells Fargo, Barclays, …) | Metadata preamble rows before the header (account number, date range, institution name) | 🔘 | Use `skip_lines: N` to skip the preamble. N varies by bank and may change with format updates. |
| QuickBooks, Xero, SAP, and most bank exports | Accounting negative notation: `(1,234.56)` instead of `-1234.56` | 🔘 | Use a `value_converters` lambda: `->(v) { v&.match?(/\A\(.*\)\z/) ? -v.gsub(/[(),]/, '').to_f : v }` |
| PayPal transaction exports | Preamble rows, mixed currency/amount columns, locale-specific date format | 🔘 | Use `skip_lines:` for the preamble; use `value_converters` for dates and signed amounts. |
| Bloomberg / Refinitiv terminal exports | `\|` separator, `N.A.` for nulls, proprietary date formats | 🔘 | `col_sep: "\|"`, `nil_values_matching: /\AN\.A\.\z/`, `value_converters` for dates. |
| QuickBooks exports | Windows-1252 encoding, currency-formatted values | 🔘 | Specify `file_encoding: 'windows-1252'`. Currency values like `"$1,234.56"` stay as strings. |
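
The accounting-notation lambda from the table can be expanded into a standalone converter. A sketch; which column it is attached to is application-specific:

```ruby
# Convert accounting-style negatives, e.g. "(1,234.56)" -> -1234.56.
# Values that are not parenthesized pass through unchanged.
accounting = lambda do |v|
  v&.match?(/\A\(.*\)\z/) ? -v.gsub(/[(),]/, '').to_f : v
end

accounting.call('(1,234.56)')  # => -1234.56
accounting.call('99.00')       # => "99.00" (unchanged)
accounting.call(nil)           # => nil (safe navigation handles nil)
```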

### Government & Public Data

| Source | Issue | Status | Notes |
|--------|-------|--------|-------|
| Government open-data portals | Semicolon separator, Latin-1, inconsistent quoting | ✅ / 🔘 | `col_sep: :auto` handles semicolons; specify `file_encoding:` if non-UTF-8. |
| US Census Bureau | Very large files (millions of rows), heavily coded values | ✅ | Use `chunk_size:` for memory-efficient processing. |
| US Treasury / USASpending.gov | Large files, many empty columns, dollar amounts as plain strings | ✅ | Works out of the box; `remove_empty_values: true` (default) drops empty columns. |
| World Bank / IMF data exports | 4–5 preamble rows (title, source, notes) before the header | 🔘 | `skip_lines: N` to skip the preamble. N is typically 4 for World Bank, 5 for IMF. |
| Australian ABS (Bureau of Statistics) | UTF-8 BOM, preamble metadata rows before the header | 🔘 | BOM stripped automatically; use `skip_lines: N` for the preamble. |

### Healthcare & Life Sciences

| Source | Issue | Status | Notes |
|--------|-------|--------|-------|
| HL7 / FHIR flattened exports | Very wide files (100+ columns), many empty fields, cryptic column names (`component_0_valueQuantity_value`) | ✅ | Parses fine. `remove_empty_values: true` (default) drops empty fields automatically. |
| Epic / Cerner EHR exports | Windows-1252 encoding, locale-specific date formats | 🔘 | `file_encoding: 'windows-1252'`; use `value_converters` for date columns. |
| Lab instrument exports (Roche, Abbott, Siemens) | Semicolon separator (European instruments), preamble rows with instrument metadata | 🔘 | `col_sep: :auto` detects the separator; `skip_lines: N` for the preamble. |
| DICOM-SR flattened to CSV | Nested structured-report data squashed into column names | ✅ | Parses fine. The data model is messy, but no special configuration is needed. |
| FDA adverse event / MedWatch exports | Pipe-delimited, `null` literal strings, long free-text fields with embedded newlines | 🔘 | `col_sep: "\|"`, `nil_values_matching: /\Anull\z/i`; embedded newlines handled automatically. |
| Bioinformatics (VCF-derived) | Thousands of columns (one sample per column) | ✅ | No column-count limit in the parsing hot path. |

### E-commerce & Survey Tools

| Source | Issue | Status | Notes |
|--------|-------|--------|-------|
| Shopify / WooCommerce | Pipe-delimited values within a field (`tag1\|tag2\|tag3`) | 🔘 | Use `value_converters` to split on `\|` for the relevant column. |
| Qualtrics / SurveyMonkey | 200–800 columns, multi-row headers, HTML in values | 🔘 | Multi-row headers require pre-processing; HTML in values is left as-is (use `value_converters` to strip). |
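
A hypothetical converter for the Shopify-style tag field; the `:tags` column name is an assumption:

```ruby
# Split a pipe-delimited tag field into an array of tags.
split_tags = ->(v) { v.to_s.split('|') }

split_tags.call('tag1|tag2|tag3')  # => ["tag1", "tag2", "tag3"]

# Illustrative use (column name is hypothetical):
#   SmarterCSV.process(file, value_converters: { tags: split_tags })
```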

### Legacy & Unusual Formats

| Source | Issue | Status | Notes |
|--------|-------|--------|-------|
| Apple iTunes DB export† | CTRL-A column separator, CTRL-B row separator, `#` comment lines | 🔘 | `col_sep: "\cA", row_sep: "\cB", comment_regexp: /^#/` |

### I/O Patterns

| Source | Issue | Status | Notes |
|--------|-------|--------|-------|
| Gzipped CSV (`.csv.gz`) | Compressed file | 🔘 | Decompress and pass the resulting IO object: `SmarterCSV.process(Zlib::GzipReader.open(path))`. |
| HTTP streaming | Parsing from a live HTTP response | 🔘 | Pass any IO-compatible object that responds to `#gets`. |

†: Legacy Apple DB dumps and older UNIX data dumps use ASCII control characters as delimiters:

```ruby
col_sep        = "\x01"  # CTRL-A
row_sep        = "\x02"  # CTRL-B
comment_prefix = "#"
```

This is a clever design: since CTRL-A and CTRL-B never appear in normal text, fields never need quoting or escaping — eliminating an entire class of parsing ambiguity.


## Pathological Cases ❌

These formats have structural problems that no CSV parser can transparently resolve. Pre-processing the file before passing it to SmarterCSV is the only reliable solution.

| Issue | Why it breaks | Workaround |
|-------|---------------|------------|
| Mixed line endings within one file | The row separator is detected once from the first N bytes. A file mixing `\r\n` and `\n` will produce rows with a stray `\r` on some values. | Pre-process with `dos2unix` or equivalent. |
| Mixed encodings within one file | Happens when CSVs are concatenated from multiple sources. `force_utf8: true` with `invalid_byte_sequence: ''` is the best available mitigation, but true mixed-encoding files cannot be reliably fixed by any parser. | Identify and re-encode each source file before concatenating. |
| Unquoted fields containing the column separator | Malformed CSV — the field will be split incorrectly, and there is no way to recover the original value. | Fix upstream at the data source. |
| Repeated header row mid-file | Happens when files are assembled with `cat chunk_1.csv chunk_2.csv`. The repeated header lands as a data row: `{name: "name", age: "age"}`. | Strip repeated header lines before parsing, or post-filter rows where all values equal their key names. |
| Trailer / summary rows | Totals or citation rows at the end of the file have no consistent marker. | Pre-process to remove, or post-filter with a sentinel check: `rows.reject { \|r\| r[:date].nil? }`. |
| REDCap (clinical trial data) | Two-row header: a field-names row plus a field-labels row. The labels row lands as the first data row. | Drop post-parse: `rows.drop(1)`, or pre-process to remove the labels row. |
| IRS / SOI Tax Stats — footnote rows mixed into data | Footnote rows (e.g. `* Data suppressed`) appear mid-file with no consistent column structure. There is no option to filter mid-file rows by pattern. | Pre-process to strip footnote lines before parsing. |
| UK ONS (Office for National Statistics) — multi-row headers | A title row and a unit row precede the actual header row. SmarterCSV reads one header row; the extra rows land as data. | Pre-process to collapse or remove the extra header rows. |
| UK ONS — footer footnotes (`[note]`, `[x]`) | Footnote rows at the end of the file use inline markers with no consistent structure. | Pre-process to strip footer lines, or post-filter rows where key fields are nil. |
| World Bank / IMF — footer with source citation | The last 1–3 lines contain source-attribution text, not data. | Pre-process to strip, or use `rows[0...-N]` to drop the last N rows post-parse. |
| Australian ABS — merged-cell artifacts | Excel merged cells export as a value in the first occurrence and blanks in subsequent rows; the blank column becomes `:column_1` with empty values. | Post-process: forward-fill the blank column from the previous non-empty value. |
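
For the repeated-header case above, the post-filter can be sketched in plain Ruby on the array of hashes SmarterCSV returns (the sample data here is invented for illustration):

```ruby
# Drop rows where every value equals its own key name — the signature
# of a header row that landed mid-file after `cat`-ing CSV chunks.
rows = [
  { name: 'Alice', age: '30' },
  { name: 'name',  age: 'age' },   # stray repeated header
  { name: 'Bob',   age: '25' },
]
clean = rows.reject { |r| r.all? { |k, v| v == k.to_s } }
clean.map { |r| r[:name] }  # => ["Alice", "Bob"]
```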

## Quick Reference: Common Option Combinations

```ruby
# Legacy enterprise export (Windows, Latin-1, BOM, CRLF)
SmarterCSV.process(file, file_encoding: 'windows-1252')

# MySQL dump (backslash escaping, \N for NULL)
SmarterCSV.process(file,
  quote_escaping: :backslash,
  nil_values_matching: /\A\\N\z/)

# Financial data (preserve leading zeros, no numeric conversion on key fields)
SmarterCSV.process(file,
  convert_values_to_numeric: { except: [:account_number, :zip, :routing_number] })

# SAP wide export with duplicate column names
SmarterCSV.process(file,
  duplicate_header_suffix: '_',
  strip_whitespace: true)

# Survey export with boolean and N/A values
SmarterCSV.process(file,
  nil_values_matching: /\A(N\/A|NA|n\/a)\z/,
  value_converters: {
    completed: ->(v) { v&.upcase == 'Y' }
  })

# Gzipped CSV
require 'zlib'
SmarterCSV.process(Zlib::GzipReader.open('data.csv.gz'))

# HTTP streaming
require 'open-uri'
SmarterCSV.process(URI.open('https://example.com/data.csv'))
```

PREVIOUS: Examples | NEXT: SmarterCSV over the Years | UP: README