CSV is the most common data exchange format in enterprise software — and also one of the most inconsistently implemented. This page documents what you will actually encounter when processing production CSV files, and how SmarterCSV handles each case.
Status Legend
Symbol
Meaning
✅
Handled automatically — no configuration needed
🔘
Handled — but requires the user to specify an option
❌
Not handled — caller must pre-process or work around
Encoding & BOM
Real-world files come from dozens of different systems, each with their own default encoding. Excel in particular is notorious for writing UTF-8 files with a Byte Order Mark (BOM) that trips up many parsers.
Issue
Status
Notes
UTF-8 with BOM (\xEF\xBB\xBF)
✅
Stripped automatically from the first line. Excel always writes this.
\r\n CRLF line endings
✅
Auto-detected. The default on Windows and most enterprise exports.
\r only (classic Mac)
✅
Auto-detected. Rare today but still seen in legacy pipelines.
Windows-1252 / Latin-1
🔘
Specify file_encoding: 'windows-1252'. Common in European financial exports, older SAP systems, QuickBooks.
UTF-16 LE with BOM
🔘
Specify file_encoding: 'utf-16le'. Some Microsoft SQL Server and Access exports default to this.
Shift-JIS / EUC-JP
🔘
Specify file_encoding: 'shift_jis' or 'euc-jp'. Japanese ERP and POS systems.
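The mechanics behind the table above can be sketched in plain Ruby (no gem required). This is an illustration of what the file_encoding option and BOM stripping do conceptually, not SmarterCSV's internal code; to_utf8 and strip_bom are hypothetical helper names.

```ruby
# Bytes are read in the declared encoding, then transcoded to UTF-8 —
# conceptually what file_encoding: 'windows-1252' does for you.
def to_utf8(bytes, source_encoding)
  bytes.force_encoding(source_encoding).encode('UTF-8')
end

# "Süd" as a Windows-1252 system writes it (0xFC is ü in that code page):
to_utf8("S\xFCd".b, 'Windows-1252')   # => "Süd"

# BOM stripping, applied to the first line of the file only:
def strip_bom(line)
  line.delete_prefix("\uFEFF")
end

strip_bom("\uFEFFname,age")           # => "name,age"
```

Without the BOM strip, the leading three bytes would silently become part of the first header name.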
Quoting & Escaping
Two competing quoting conventions are common in the wild: RFC 4180 (used by Excel) and backslash escaping (used by MySQL, PostgreSQL). SmarterCSV defaults to :auto mode, which tries backslash first and falls back to RFC 4180.
Issue
Status
Notes
RFC 4180 double-quote escaping ("")
✅
The Excel standard. Handled by quote_escaping: :auto (default).
Backslash escaping (\")
✅
Used by MySQL SELECT INTO OUTFILE, mysqldump, PostgreSQL COPY. Handled by :auto default.
Newlines inside quoted fields
✅
Multi-line field stitching. Common in address fields, notes, and CRM comment exports.
Mid-field quote characters (5'10", inch marks, apostrophes)
✅
quote_boundary: :standard (default since 1.16.0) only recognizes quotes at field boundaries. Mid-field quotes are treated as literal characters.
Semicolon-delimited files mislabeled as CSV
✅
col_sep: :auto (default) detects the actual separator. Common in European locales where comma is the decimal separator.
Tab-delimited TSV files
✅
col_sep: :auto detects tabs. Common in bioinformatics and some government data portals.
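The difference between the two escaping conventions is easiest to see side by side. A plain-Ruby illustration: the stdlib CSV parser handles RFC 4180 natively, while the backslash unescape below is a naive sketch for demonstration, not SmarterCSV's actual implementation.

```ruby
require 'csv'

# RFC 4180 (Excel): an embedded quote is doubled inside a quoted field.
rfc4180_row = %Q{1,"He said ""hi""",x}
rfc_fields = CSV.parse_line(rfc4180_row)
# rfc_fields[1] == 'He said "hi"'

# Backslash style (MySQL SELECT INTO OUTFILE, PostgreSQL COPY):
# the embedded quote is escaped with a backslash instead.
backslash_row = %Q{1,"He said \\"hi\\"",x}

# Ruby's stdlib CSV cannot parse this form directly. Naive normalization
# to RFC 4180 for illustration only (\" becomes ""):
mysql_fields = CSV.parse_line(backslash_row.gsub('\\"', '""'))
```

Both parses recover the same field value, which is exactly what quote_escaping: :auto gives you without any configuration.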
Header Quirks
Headers in production files are rarely as clean as you'd expect. They carry units, source system field names, BOM characters, duplicates, and sometimes no headers at all.
Issue
Status
Notes
BOM on first header field
✅
Stripped automatically. Without this, the first key would be :\xEF\xBB\xBFname instead of :name.
Duplicate headers
✅
Disambiguated using duplicate_header_suffix, e.g. '_' → :email, :email_2, :email_3.
Empty or whitespace-only headers
✅
Auto-named using missing_header_prefix (default column_) → :column_1, :column_2. Values are never silently dropped.
Trailing comma on header row (phantom empty column)
✅
The phantom column is auto-named just like any other empty header.
Headers with spaces and special characters (Revenue (USD))
✅
Spaces and dashes normalized to underscores → :revenue_(usd). Parentheses, slashes, etc. are preserved.
Extra data columns beyond the header row
✅
Auto-generates column_N names for extra fields, using the missing_header_prefix option.
No header row at all
🔘
Use headers_in_file: false, user_provided_headers: [:col1, :col2, ...]. Common in raw database dumps and fixed-format legacy exports.
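Taken together, the header transformations above can be sketched in a few lines of plain Ruby. This is an illustrative approximation of the documented behavior, not SmarterCSV's implementation; normalize_headers is a hypothetical helper.

```ruby
# Strip BOM, downcase, convert spaces/dashes to underscores, auto-name
# empty headers, and disambiguate duplicates with a suffix counter.
def normalize_headers(headers, duplicate_suffix: '_', missing_prefix: 'column_')
  seen = Hash.new(0)
  headers.each_with_index.map do |h, i|
    name = h.to_s.delete_prefix("\uFEFF").strip
    name = name.empty? ? "#{missing_prefix}#{i + 1}" : name.downcase.gsub(/[\s-]+/, '_')
    seen[name] += 1
    seen[name] > 1 ? :"#{name}#{duplicate_suffix}#{seen[name]}" : name.to_sym
  end
end

normalize_headers(["\uFEFFName", 'Email', 'Email', '', 'Revenue (USD)'])
# => [:name, :email, :email_2, :column_4, :"revenue_(usd)"]
```

Note that no column is ever dropped: empty and duplicate headers still produce distinct keys, so every value in the row remains addressable.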
Numeric & Data Type Landmines
Numeric conversion is one of the most common sources of data loss. SmarterCSV converts values that look like numbers by default — which is correct for most cases — but certain fields must be excluded explicitly.
Issue
Status
Notes
Percentage values (12.5%)
✅ / 🔘
Won't match the numeric pattern — safely left as a string. Use value_converters if numeric value is needed.
Leading zeros (ZIP codes, phone numbers, SKUs, account numbers)
🔘
convert_values_to_numeric: { except: [:zip, :phone, :sku] }. Without this, "01234" becomes 1234, one of the most common silent data-loss bugs in CSV processing: US ZIP codes legitimately begin with zero.
NULL / empty value variants (NULL, \N, N/A, (null), #N/A)
🔘
Use nil_values_matching: /\A(NULL|\\N|N\/A|#N\/A|\(null\))\z/i. Without configuration these are left as literal strings.
Date values (2023-01-15, 01/02/2023, Jan 2, 2023)
🔘
Use value_converters with a date parsing lambda. SmarterCSV does not auto-convert dates — format ambiguity (01/02/2023 = Jan 2 or Feb 1?) makes auto-conversion unsafe.
Boolean variants (Y/N, Yes/No, TRUE/FALSE, 1/0, X/ in SAP)
🔘
Use value_converters for the relevant columns.
European number format (1.234,56 meaning 1234.56)
🔘
Use a value_converter that swaps dot and comma before parsing. Common in German, French, Italian, and Spanish exports.
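The converter lambdas mentioned in the rows above are short to write. These are illustrative examples of the kind of logic you would register via value_converters (the formats and field semantics are assumptions; adapt them to your data, and check your SmarterCSV version's documentation for the exact converter interface it expects):

```ruby
require 'date'

# Date: explicit format, so there is no 01/02/2023 ambiguity; falls back
# to the raw string if the value does not match.
date_converter = ->(v) { Date.strptime(v, '%Y-%m-%d') rescue v }

# European number format: drop thousands dots, turn decimal comma into a dot.
euro_number    = ->(v) { v.delete('.').tr(',', '.').to_f }

# Boolean Y/N variant:
boolean_yn     = ->(v) { v.to_s.strip.upcase == 'Y' }

date_converter.call('2023-01-15')  # => #<Date 2023-01-15>
euro_number.call('1.234,56')       # => 1234.56
boolean_yn.call('y')               # => true
```

Keeping each converter to a single, explicit rule is deliberate: the ambiguity that makes auto-conversion unsafe disappears once you state the expected format per column.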
File Size & Structure
Issue
Status
Notes
Millions of rows
✅
Use chunk_size: N for batch processing. SmarterCSV streams the file and never loads it entirely into memory.
Gigabyte-sized files
✅
Streaming architecture. Memory usage is proportional to chunk size, not file size.
Single-column CSV files
✅
col_sep: :auto handles files with no detected separator gracefully (fixed in issue #222).
Ragged rows — fewer fields than headers
✅
Missing trailing fields produce no key in the hash. Combined with remove_empty_values: true (default), short rows are handled cleanly.
Ragged rows — more fields than headers
✅
Extra columns are auto-named column_N via missing_header_prefix.
Empty or whitespace-only file
✅
Raises SmarterCSV::EmptyFileError with a clear message instead of a cryptic internal error.
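The chunking idea behind chunk_size: N can be illustrated in plain stdlib Ruby: stream rows one at a time and hand them to the caller in fixed-size batches, so memory is bounded by the chunk size rather than the file size. A minimal sketch (each_chunk is a hypothetical helper, not SmarterCSV's API):

```ruby
require 'csv'
require 'stringio'

# Yield arrays of row hashes, at most chunk_size rows per batch.
def each_chunk(io, chunk_size)
  chunk = []
  CSV.new(io, headers: true, header_converters: :symbol).each do |row|
    chunk << row.to_h
    if chunk.size >= chunk_size
      yield chunk
      chunk = []
    end
  end
  yield chunk unless chunk.empty?   # flush the final partial batch
end

data = "name,age\nAnna,1\nBen,2\nCara,3\n"
chunks = []
each_chunk(StringIO.new(data), 2) { |c| chunks << c }
# chunks => [[{name: "Anna", age: "1"}, {name: "Ben", age: "2"}],
#            [{name: "Cara", age: "3"}]]
```

With SmarterCSV itself, the equivalent is simply SmarterCSV.process(file, chunk_size: 2) { |chunk| ... }.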
Enterprise & Application-Specific Patterns
Databases & Data Warehouses
Source
Issue
Status
Notes
MySQL SELECT INTO OUTFILE
Backslash quote escaping
✅
quote_escaping: :auto default.
PostgreSQL COPY TO
Backslash quote escaping, \N for NULL
✅ / 🔘
Escaping handled automatically; \N as nil requires nil_values_matching.
SEC EDGAR
Pipe-delimited, UTF-8, clean format
✅
col_sep: :auto detects the pipe separator.
UNIX DB Dumps†
CTRL-A col separator, CTRL-B row separator, # comment lines
🔘
Specify col_sep: "\x01", row_sep: "\x02", comment_regexp: /\A#/.
† This is a clever design: since CTRL-A and CTRL-B never appear in normal text, fields never need quoting or escaping — eliminating an entire class of parsing ambiguity.
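Why this format needs no quoting is easy to demonstrate: because the separators cannot occur inside field values, a plain split recovers every field exactly. A minimal plain-Ruby reader for such a dump (parse_ctrl_dump is a hypothetical helper; the sample data is invented):

```ruby
# Rows are delimited by CTRL-B (\x02), fields by CTRL-A (\x01),
# and lines starting with '#' are comments.
def parse_ctrl_dump(data)
  data.split("\x02")
      .reject { |line| line.start_with?('#') }
      .map { |line| line.split("\x01", -1) }   # -1 keeps trailing empty fields
end

dump = "# exported 2024\x02alice\x01a@example.com\x02bob\x01b@example.com\x02"
parse_ctrl_dump(dump)
# => [["alice", "a@example.com"], ["bob", "b@example.com"]]
```

No escaping logic appears anywhere in the reader, which is the whole point of the format.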
Pathological Cases ❌
These formats have structural problems that no CSV parser can transparently resolve. Pre-processing the file before passing it to SmarterCSV is the only reliable solution.
Issue
Why it breaks
Workaround
Mixed line endings within one file
Row separator is detected once from the first N bytes. A file mixing \r\n and \n will produce rows with stray \r on some values.
Pre-process with dos2unix or equivalent.
Mixed encodings within one file
Happens when CSVs are concatenated from multiple sources. force_utf8: true with invalid_byte_sequence: '' is the best available mitigation, but true mixed-encoding files cannot be reliably fixed by any parser.
Identify and re-encode each source file before concatenating.
Unquoted fields containing the column separator
Malformed CSV — the field will be split incorrectly and there is no way to recover the original value.
Fix upstream at the data source.
Repeated header row mid-file
Happens when files are assembled with cat chunk_1.csv chunk_2.csv. The repeated header lands as a data row: {name: "name", age: "age"}.
Strip repeated header lines before parsing, or post-filter rows where all values equal their key names.
Trailer / summary rows
Totals or citation rows at end of file have no consistent marker.
Pre-process to remove, or post-filter with a sentinel check: rows.reject { |r| r[:date].nil? }.
REDCap (clinical trial data)
Two-row header: field names row + field labels row. The labels row lands as the first data row.
Drop post-parse: rows.drop(1), or pre-process to remove the labels row.
IRS / SOI Tax Stats — footnote rows mixed into data
Footnote rows (e.g. * Data suppressed) appear mid-file with no consistent column structure. No option to filter mid-file rows by pattern.
Pre-process to strip footnote lines before parsing.
UK ONS (Office for National Statistics) — multi-row headers
Title row + unit row before the actual header row. SmarterCSV reads one header row; the extra rows land as data.
Pre-process to collapse or remove the extra header rows.
UK ONS — footer footnotes ([note], [x])
Footnote rows at end of file use inline markers with no consistent structure.
Pre-process to strip footer lines, or post-filter rows where key fields are nil.
World Bank / IMF — footer with source citation
Last 1–3 lines contain source attribution text, not data.
Pre-process to strip, or use rows[0..-N] to drop the last N rows post-parse.
Australian ABS — merged cell artifacts
Excel merged cells export as a value in the first occurrence and blank in subsequent rows. The blank column becomes :column_1 with empty values.
Post-process: forward-fill the blank column from the previous non-empty value.
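Two of the post-parse cleanups suggested above can be written in a few lines each. A plain-Ruby sketch operating on the array of hashes SmarterCSV returns (the field names and sample rows are hypothetical):

```ruby
rows = [
  { region: 'EMEA',   sales: 100 },
  { region: nil,      sales: 120 },       # merged-cell artifact: blank region
  { region: 'region', sales: 'sales' },   # repeated header row from `cat`
  { region: 'APAC',   sales: 90 },
]

# Repeated mid-file headers: drop rows where every value equals its key name.
rows = rows.reject { |r| r.all? { |k, v| v.to_s == k.to_s } }

# Merged-cell blanks: forward-fill from the previous non-empty value.
last = nil
rows.each { |r| r[:region] = (last = r[:region] || last) }
```

Both fixes are deliberately applied after parsing: the parser's job is to deliver every row faithfully, and domain-specific repair belongs in your own code where the rules are explicit.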
Quick Reference: Common Option Combinations
# Legacy enterprise export (Windows, Latin-1, BOM, CRLF)
SmarterCSV.process(file, file_encoding: 'windows-1252')

# MySQL dump (backslash escaping, \N for NULL)
SmarterCSV.process(file,
  quote_escaping: :backslash,
  nil_values_matching: /\A\\N\z/)

# Financial data (preserve leading zeros, no numeric conversion on key fields)
SmarterCSV.process(file,
  convert_values_to_numeric: { except: [:account_number, :zip, :routing_number] })

# SAP wide export with duplicate column names
SmarterCSV.process(file,
  duplicate_header_suffix: '_',
  strip_whitespace: true)

# Survey export with boolean and N/A values
SmarterCSV.process(file,
  nil_values_matching: /\A(N\/A|NA|n\/a)\z/,
  value_converters: { completed: ->(v) { v&.upcase == 'Y' } })

# Gzipped CSV
require 'zlib'
SmarterCSV.process(Zlib::GzipReader.open('data.csv.gz'))

# HTTP streaming
require 'open-uri'
SmarterCSV.process(URI.open('https://example.com/data.csv'))