SmarterCSV Introduction

smarter_csv is a Ruby gem for fast & convenient importing and exporting of CSV files. It has intelligent defaults and auto-discovery of column and row separators. Importing returns Rails-ready hashes — suitable for direct use with ActiveRecord, Sidekiq, parallel processing, or S3 workflows. Exporting takes hashes or arrays of hashes and writes properly formatted CSV.

Why another CSV library?

Inconvenient. Ruby's built-in csv library returns arrays of arrays, which means your application code must handle column indexing, header normalization, type conversion, and whitespace stripping manually. It also has no built-in support for chunked or parallel processing of large files.
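As a rough illustration of that manual work, here is a minimal sketch using only Ruby's standard `csv` library. The sample data and the normalization rules (strip, downcase, underscore, integer conversion) are assumptions chosen for this example, not a description of how any library does it internally.

```ruby
require "csv"

# Hypothetical sample input with mixed-case headers and padded values.
data = "First Name, Age\nAlice, 30\nBob, 25\n"

# Without options, CSV.parse returns an array of arrays of strings,
# with the header row as the first element.
rows = CSV.parse(data)
header, *records = rows

# Everything below is boilerplate the application has to supply itself:
# header normalization, whitespace stripping, and type conversion.
keys = header.map { |h| h.strip.downcase.gsub(/\s+/, "_").to_sym }

people = records.map do |fields|
  values = fields.map(&:strip)
  values = values.map { |v| v.match?(/\A\d+\z/) ? v.to_i : v }
  keys.zip(values).to_h
end

p rows.first   # raw header row, padding still attached
p people       # array of hashes after manual clean-up
```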

Hidden failure modes. CSV.read has 10 ways to silently corrupt or lose data — no exception, no warning, no log line. Duplicate headers, blank header cells, extra columns, BOMs, whitespace, inconsistent empty-field representation, runaway quoted fields, and encoding issues all fail silently. See Ruby CSV Pitfalls for reproducible examples and the SmarterCSV fix for each.
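Two of these cases can be reproduced with the standard library alone. The sketch below uses `CSV.parse` on inline strings rather than `CSV.read` on a file, purely to keep the example self-contained, and the sample data is made up for illustration.

```ruby
require "csv"

# Duplicate header: the file has two "name" columns.
dup_table = CSV.parse("id,name,name\n1,Alice,Bob\n", headers: true)
dup_row   = dup_table.first
dup_hash  = dup_row.to_h   # duplicate keys collapse; no error is raised

puts "fields parsed: #{dup_row.fields.size}, keys in hash: #{dup_hash.size}"

# Whitespace in a header cell: the header is " name", not "name".
ws_table = CSV.parse("id, name\n1, Alice\n", headers: true)
ws_row   = ws_table.first

p ws_row["name"]    # lookup by the expected header finds nothing
p ws_row[" name"]   # the value is only reachable with the padded header
```

In both cases parsing succeeds without an exception or warning, so the problem only surfaces later as missing or `nil` data.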

Slow. On top of everything else, Ruby's csv library is up to 129× slower than SmarterCSV for equivalent end-to-end work.

[Figure: SmarterCSV 1.16.0 vs Ruby CSV 3.3.5 speedup]

SmarterCSV was created to solve exactly these problems: nightly imports of large datasets that needed to be upserted into a database, processed in parallel, and remain robust against real-world variations in input data.

Benefits of using SmarterCSV

  • Performance: SmarterCSV's C extension accelerates the full ingestion pipeline — parsing, hash construction, and value conversions — not just tokenization. Real-world benchmarks against CSV.table (the closest equivalent) show 7×–129× faster end-to-end throughput.

  • Rails-ready output: Each CSV row is returned as a Ruby hash with symbol keys, numeric conversion, and whitespace stripping applied automatically. No post-processing boilerplate needed — records can be passed directly to ActiveRecord, insert_all, Sidekiq, message queues, or JSON serializers.

  • Intelligent defaults and robustness: SmarterCSV auto-detects row and column separators, handles BOMs, strips extra whitespace, and tolerates common real-world inconsistencies — all without manual configuration. This makes imports robust against data you don't fully control, such as user-uploaded files or third-party exports.

  • Flexible header and value transformations: Headers are automatically downcased, symbolized, and normalized. You can remap or drop columns with key_mapping, override headers entirely with user_provided_headers, and apply per-field value converters for custom type coercion (dates, booleans, currency, etc.).

  • Batch and streaming processing: chunk_size enables memory-efficient batch processing of arbitrarily large files; each chunk is an array of hashes ready for insert_all, Sidekiq, or other data sinks. The Reader class defines each and includes Enumerable, giving you lazy evaluation, each_slice, select, map, and more.

  • Bad row quarantine: Malformed rows can be collected or skipped instead of crashing the entire import. on_bad_row: :collect lets you inspect and log bad rows after processing completes.

Additional Features

  • Header validation: Use required_keys to raise an error before any data rows are processed if expected columns are missing. Works with post-transformation key names, so it's safe to combine with key_mapping. See Header Validations.

  • Instrumentation hooks: on_start, on_chunk, and on_complete callbacks give you visibility into import progress — useful for logging, progress bars, and alerting in long-running jobs. See Instrumentation Hooks.

  • Resumable imports: The chunk_index parameter pairs naturally with Rails 8.1's ActiveJob::Continuable for jobs that can pause and resume mid-import without reprocessing already-completed chunks. See Examples.

  • CSV writing: SmarterCSV.generate writes arrays of hashes to CSV, with support for header renaming and value converters on output. See The Basic Write API.


NEXT: Migrating from Ruby CSV | UP: README