# Bad Row Quarantine
Real-world CSV files are often malformed. By default, SmarterCSV raises an exception on the
first bad row it encounters. The `on_bad_row` option lets you keep processing and handle bad
rows in whatever way suits your application.

A row is treated as bad when any of the following occurs:

- Malformed CSV (unclosed quoted fields, unterminated multiline rows)
- A field that exceeds `field_size_limit` (see Limiting Field Size)
- Extra columns when running in `strict: true` mode
- Any `SmarterCSV::Error` or `EOFError` raised during row parsing
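For illustration, a hypothetical file like the following would trigger two of these conditions: an extra column on the third data row (bad under `strict: true`) and an unclosed quote on the fourth:

```csv
name,age,city
John,30,NYC
Jane,25,Boston,EXTRA_DATA
Bob,40,"this quote never closes
```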
| Option | Default | Description |
|---|---|---|
| `on_bad_row` | `:raise` | How to handle a bad row: `:raise`, `:skip`, `:collect`, or a callable |
| `collect_raw_lines` | `true` | Include `raw_logical_line` in the error record |
| `bad_row_limit` | `nil` | Raise `SmarterCSV::TooManyBadRows` after this many bad rows |
Current behavior — the exception propagates and processing stops:

```ruby
SmarterCSV.process('data.csv')
# => raises SmarterCSV::MalformedCSV on the first bad row
```

The `on_bad_row` option controls what happens when a bad row is encountered:

- `on_bad_row: :raise` (default) fails fast.
- `on_bad_row: :collect` quarantines them — error records available via `SmarterCSV.errors` or `reader.errors`.
- `on_bad_row: ->(rec) { ... }` calls your lambda per bad row — works with both `SmarterCSV.process` and `SmarterCSV::Reader`.
- `on_bad_row: :skip` discards bad rows silently — count available via `SmarterCSV.errors` or `reader.errors`.
## `on_bad_row: :collect`

Continue processing and store a structured error record for each bad row.
Error records are available via `SmarterCSV.errors[:bad_rows]` (class-level API)
or `reader.errors[:bad_rows]` (Reader API).

```ruby
# Class-level API — use SmarterCSV.errors after the call
good_rows = SmarterCSV.process('data.csv', on_bad_row: :collect)
good_rows.each { |row| MyModel.create!(row) }

SmarterCSV.errors[:bad_rows].each do |rec|
  Rails.logger.warn "Bad row at line #{rec[:csv_line_number]}: #{rec[:error_message]}"
  Rails.logger.warn "Raw content: #{rec[:raw_logical_line]}"
end
```

```ruby
# Reader API — use when you also need access to headers or other reader state
reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
result = reader.process
result.each { |row| MyModel.create!(row) }

reader.errors[:bad_rows].each do |rec|
  Rails.logger.warn "Bad row at line #{rec[:csv_line_number]}: #{rec[:error_message]}"
  Rails.logger.warn "Raw content: #{rec[:raw_logical_line]}"
end
```

## `on_bad_row:` with a Callable

Pass any object that responds to `#call`. It is invoked once per bad row with the
error record hash, then processing continues. Because the lambda receives errors
inline, this works with both `SmarterCSV.process` and `SmarterCSV::Reader` —
you do not need a Reader instance to handle bad rows.
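Beyond lambdas, a small handler class works too, since only `#call` is required. A minimal sketch (the handler class is illustrative, not part of SmarterCSV; it assumes only the documented error-record keys):

```ruby
# Illustrative dead-letter handler: any object responding to #call works
class DeadLetterHandler
  attr_reader :records

  def initialize
    @records = []
  end

  # Invoked once per bad row with the error record hash
  def call(rec)
    @records << "line #{rec[:csv_line_number]}: #{rec[:error_message]}"
  end
end

handler = DeadLetterHandler.new
# Would be wired up as:
#   SmarterCSV.process('data.csv', on_bad_row: handler)
# and invoked per bad row like this:
handler.call(csv_line_number: 7, error_message: "unclosed quoted field")
handler.records  # => ["line 7: unclosed quoted field"]
```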
```ruby
# Works with SmarterCSV.process — no Reader instance needed
bad_rows = []
good_rows = SmarterCSV.process('data.csv',
  on_bad_row: ->(rec) { bad_rows << rec })
```

```ruby
# Log to a dead-letter file
quarantine = File.open('quarantine.csv', 'w')
SmarterCSV.process('data.csv',
  on_bad_row: ->(rec) { quarantine.puts(rec[:raw_logical_line]) })
quarantine.close
```

```ruby
# Send to a monitoring system
SmarterCSV.process('data.csv',
  on_bad_row: ->(rec) { Metrics.increment('csv.bad_rows', tags: { error: rec[:error_class].name }) })
```

## `on_bad_row: :skip`

Silently skip bad rows and continue. The count of skipped rows is available via
`SmarterCSV.errors[:bad_row_count]` (class-level API) or `reader.errors[:bad_row_count]`
(Reader API). No error records are stored.
```ruby
# Class-level API — use SmarterCSV.errors after the call
SmarterCSV.process('data.csv', on_bad_row: :skip)
puts "Skipped: #{SmarterCSV.errors[:bad_row_count] || 0} bad rows"
```

```ruby
# Reader API — access reader.errors directly
reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :skip)
result = reader.process
puts "Processed: #{result.size} good rows"
puts "Skipped: #{reader.errors[:bad_row_count] || 0} bad rows"
```

## Error Record Format

Each error record is a Hash:
```ruby
{
  csv_line_number: 3,         # logical row (counting header as row 1)
  file_line_number: 3,        # physical file line where the row started
  file_lines_consumed: 1,     # physical lines spanned (>1 for multiline)
  error_class: SmarterCSV::HeaderSizeMismatch,      # exception class object
  error_message: "extra columns detected ...",      # exception message string
  raw_logical_line: "Jane,25,Boston,EXTRA_DATA\n",  # present when collect_raw_lines: true (default)
}
```

`collect_raw_lines: true` (default) — `raw_logical_line` is always included in the error
record. Set to `false` if you want to reduce memory usage and don't need the raw content:
```ruby
reader = SmarterCSV::Reader.new('data.csv',
  on_bad_row: :collect,
  collect_raw_lines: false,
)
```

For multiline rows (quoted fields spanning several physical lines), `raw_logical_line` contains
the fully stitched content — it may include embedded newline characters. The
`file_lines_consumed` field tells you how many physical lines were read.
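As a concrete illustration (the record values are hypothetical), a quoted field spanning three physical lines yields one stitched string containing one newline per physical line consumed:

```ruby
# Hypothetical error record for a multiline row that started on file line 42
rec = {
  csv_line_number: 42,
  file_line_number: 42,
  file_lines_consumed: 3,   # the row spanned 3 physical lines
  raw_logical_line: "42,\"Customer wrote:\nsecond line\nthird line\"\n",
}

# One "\n" per physical line consumed
rec[:raw_logical_line].count("\n")  # => 3
```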
## `bad_row_limit`

To abort processing after too many failures, set `bad_row_limit`. This works with `:skip`,
`:collect`, and callable modes:
```ruby
reader = SmarterCSV::Reader.new('data.csv',
  on_bad_row: :collect,
  bad_row_limit: 10,
)
begin
  result = reader.process
rescue SmarterCSV::TooManyBadRows => e
  puts "Aborting: #{e.message}"
  puts "Collected so far: #{reader.errors[:bad_rows].size} bad rows"
end
```

## Accessing Bad Row Data

There are two ways to access bad row data after processing.

### Class-level: `SmarterCSV.errors`

`SmarterCSV.errors` returns the errors from the most recent call to `process`, `parse`,
`each`, or `each_chunk` on the current thread. It is cleared at the start of each new call.
```ruby
SmarterCSV.process('data.csv', on_bad_row: :skip)
puts SmarterCSV.errors[:bad_row_count]   # => 3

SmarterCSV.process('data.csv', on_bad_row: :collect)
puts SmarterCSV.errors[:bad_row_count]   # => 3
puts SmarterCSV.errors[:bad_rows].size   # => 3
```

> **Note:** `SmarterCSV.errors` only surfaces errors from the most recent run on the current thread. In a multi-threaded environment (Puma, Sidekiq), each thread maintains its own error state independently. If you call `SmarterCSV.process` twice in the same thread, the second call's errors replace the first's. For long-running or complex pipelines where you need to aggregate errors across multiple files, use the Reader API.

> ⚠️ **Fibers:** `SmarterCSV.errors` uses `Thread.current` for storage, which is shared across all fibers running in the same thread. If you process CSV files concurrently in fibers (e.g. with `Async`, `Falcon`, or manual `Fiber` scheduling), `SmarterCSV.errors` may return stale or wrong results. Use `SmarterCSV::Reader` directly — errors are scoped to the reader instance and are always correct regardless of fiber context.
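The per-thread scoping described above can be seen with plain Ruby thread variables. This is an illustrative sketch of the mechanism only; the variable name is hypothetical and is not SmarterCSV's actual storage key:

```ruby
# Thread-level storage is scoped per thread...
Thread.current.thread_variable_set(:csv_errors, { bad_row_count: 3 })

other = Thread.new { Thread.current.thread_variable_get(:csv_errors) }.value
other   # => nil (a different thread sees its own, empty state)

# ...but it IS visible to fibers running in the same thread,
# which is the pitfall the warning above describes:
shared = Fiber.new { Thread.current.thread_variable_get(:csv_errors) }.resume
shared  # => { bad_row_count: 3 }
```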
### Reader API: `SmarterCSV::Reader`

For full control — including access to headers, raw headers, and errors from a specific
call — use `SmarterCSV::Reader` directly:
| Attribute | Description |
|---|---|
| `reader.errors[:bad_row_count]` | Total bad rows encountered (all modes) |
| `reader.errors[:bad_rows]` | Array of error records (`:collect` mode only) |
```ruby
reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
reader.process
puts reader.errors[:bad_row_count]
puts reader.headers.inspect
```

## Chunked Processing

Bad row quarantine works seamlessly with `chunk_size`. Skipped rows are simply not added to the
current chunk — chunk sizes remain consistent:
```ruby
reader = SmarterCSV::Reader.new('large_file.csv',
  chunk_size: 500,
  on_bad_row: :collect,
)
reader.process do |chunk, index|
  MyModel.import(chunk)
end
puts "Bad rows: #{reader.errors[:bad_row_count]}"
```

## Limiting Field Size

Real-world CSV files sometimes contain unexpectedly large fields — either intentionally (a DoS attempt) or accidentally (a forgotten closing quote, a JSON blob in a cell, a notes field that ran away). Without a limit, SmarterCSV will happily stitch together physical lines until it either finds the closing quote or reaches end-of-file, potentially consuming hundreds of megabytes.

`field_size_limit` sets a hard cap (in bytes) on the size of any individual extracted field.
The default is `nil` (no limit). When a field exceeds the limit, a
`SmarterCSV::FieldSizeLimitExceeded` exception is raised — and because it inherits from
`SmarterCSV::Error`, the `on_bad_row` option handles it exactly like any other parse error.
1. **Huge inline field** — a single-line field containing a large payload (e.g. a JSON blob, a base64-encoded file, or a runaway notes column):

   ```
   id,payload
   1,"{... 500 KB of JSON ...}"
   ```

2. **Quoted field spanning many embedded newlines** — a legitimate multiline field in a poorly exported file that happens to be enormous:

   ```
   ticket_id,notes
   42,"Customer wrote:
   ... (thousands of lines of chat history) ...
   "
   ```

3. **Never-closing quoted field** — a missing closing quote causes the parser to stitch every subsequent physical line into one logical row until EOF:

   ```
   id,comment
   1,"this quote never closes
   2,this entire row is now inside the field
   3,and this one too ...
   ```

Without `field_size_limit`, case 3 reads the entire rest of the file into memory. With the
limit set, the stitch loop raises `FieldSizeLimitExceeded` as soon as the accumulating buffer
crosses the threshold.
```ruby
# Raise immediately on any oversized field (default on_bad_row: :raise)
SmarterCSV.process('data.csv', field_size_limit: 1_000_000)   # 1 MB per field

# Skip oversized rows and continue
SmarterCSV.process('data.csv', field_size_limit: 1_000_000, on_bad_row: :skip)

# Collect oversized rows for inspection
reader = SmarterCSV::Reader.new('data.csv',
  field_size_limit: 1_000_000,
  on_bad_row: :collect,
)
result = reader.process
reader.errors[:bad_rows].each do |rec|
  Rails.logger.warn "Oversized field on row #{rec[:csv_line_number]}: #{rec[:error_message]}"
end
```

The limit is checked against `String#bytesize` (raw byte count), not character count.
For ASCII content they are identical. For multi-byte UTF-8 content (e.g. CJK characters)
`bytesize` is larger than the character count — so the limit is a memory cap, not a
character cap, which is what matters for DoS protection.
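A quick illustration of the bytesize-versus-length distinction in plain Ruby (the 20-byte limit is an arbitrary example value):

```ruby
ascii = "hello"
ascii.length     # => 5
ascii.bytesize   # => 5 (identical for ASCII)

cjk = "日本語テキスト"    # 7 characters
cjk.length       # => 7
cjk.bytesize     # => 21 (3 bytes per character in UTF-8)

# A 20-byte limit would flag the CJK field even though it is only 7 characters:
limit = 20
cjk.bytesize > limit   # => true
```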
`field_size_limit` is zero-overhead when not set (the default `nil` short-circuits all
checks). When set, a single integer comparison is performed per logical row; the per-field
scan only runs when the raw line is large enough to potentially contain an oversized field.
Normal rows (where the entire line fits within the limit) bypass per-field checking entirely.