Replace single-threaded scatter with proper multi-threaded implementation
using the same header-peek pattern as the single-barcode collate.
Multi-threaded scatter:
- Main thread reads raw chunk bytes, pushes to ArrayQueue
- N worker threads pop chunks, use two-pass record reading:
1. from_bytes_collatable_header() — reads only header (barcodes, UMI)
2. Skip alignment data for unmatched records (fast)
3. from_bytes_with_header_retain_ori() — reads alignments only for
matched records with orientation filtering
- Thread-local buffers with periodic flush to temp bucket files
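The queue-driven worker loop above can be sketched with std primitives alone. This is a minimal illustration, not the actual implementation: the real code uses crossbeam's ArrayQueue and parses RAD chunks, so `Chunk` and `scatter_total` here are stand-ins.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in for raw RAD chunk bytes.
type Chunk = Vec<u8>;

fn scatter_total(n_chunks: usize, n_workers: usize) -> usize {
    // The main thread fills the shared queue with raw chunk bytes.
    let queue: Arc<Mutex<VecDeque<Chunk>>> = Arc::new(Mutex::new(
        (0..n_chunks).map(|i| vec![0u8; i + 1]).collect(),
    ));

    let mut handles = Vec::new();
    for _ in 0..n_workers {
        let queue = Arc::clone(&queue);
        handles.push(thread::spawn(move || {
            let mut processed = 0usize;
            loop {
                // Workers pop chunks until the queue is drained.
                let popped = queue.lock().unwrap().pop_front();
                match popped {
                    // Two-pass record reading would happen here: peek the
                    // header, then parse alignments only for matched records.
                    Some(chunk) => processed += chunk.len(),
                    None => break,
                }
            }
            processed
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 8 chunks of sizes 1..=8 sum to 36 bytes regardless of worker count.
    println!("{}", scatter_total(8, 4));
}
```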
Tiered cell barcode correction:
- Tier 1 (identity): HashSet of barcodes where raw == corrected in ALL
samples. Single lookup, covers >99% of valid barcodes.
- Tier 2 (per-sample): Vec<HashMap> of barcodes that correct to a
DIFFERENT barcode. Small maps, only consulted on identity miss.
- Handles sample-dependent correction correctly (same raw BC may correct
to different targets in different samples).
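A minimal sketch of the two-tier lookup follows. The function and the plain `HashSet`/`HashMap` tiers are illustrative names, not the actual alevin-fry types.

```rust
use std::collections::{HashMap, HashSet};

// Tier 1: barcodes where raw == corrected in all samples (one shared set).
// Tier 2: small per-sample maps, consulted only on an identity miss.
fn correct_cell_bc(
    identity: &HashSet<u64>,
    per_sample: &[HashMap<u64, u64>],
    sample_idx: usize,
    raw_bc: u64,
) -> Option<u64> {
    if identity.contains(&raw_bc) {
        return Some(raw_bc); // tier 1: self-correcting
    }
    // tier 2: this barcode corrects to a different target in this sample
    per_sample[sample_idx].get(&raw_bc).copied()
}

fn main() {
    let identity: HashSet<u64> = [10u64, 11].into_iter().collect();
    let mut s0 = HashMap::new();
    s0.insert(99u64, 10u64); // corrects to a different barcode in sample 0
    let per_sample = vec![s0, HashMap::new()];
    println!("{:?}", correct_cell_bc(&identity, &per_sample, 0, 99));
}
```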
Performance on real data:
- Dataset 1 (114M reads): 15s total (was 60s single-threaded, 4x speedup)
- Dataset 2 (1.1B reads): 9min total (was 19min, 2.1x speedup)
Scatter: 77s (was 14min), Gather: ~100s
All 15 tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eliminate the bottleneck of loading and scanning the massive permit_map.bin (2.2GB / 139M entries for unfiltered-pl mode). Instead:
1. Load only permit_freq.bin per sample (26MB for 1.6M valid BCs)
2. Identity tier: HashSet of all valid barcodes (self-correcting, covers >99% of real reads). Built in milliseconds from permit_freq keys.
3. Correction tier: per-sample BarcodeLookupMap built from valid barcodes. On-the-fly 1-edit neighbor lookup during scatter, no pre-materialized HashMap. BarcodeLookupMap uses prefix-indexed binary search, which is fast and memory-efficient.
Also cleaned up dead code from the old worker loop.
Performance on 1.1B read dataset:
- Before (single-threaded, HashMap): 19 min
- Previous (multi-threaded, HashMap): 9 min (2.5 min map loading)
- Now (multi-threaded, LookupMap): 2m40s (<1s map loading)
- Map loading: <1s (was 2.5 min, 150x faster)
- Scatter: 58s (7 workers)
- Gather: 99s
7.2x total speedup from the original single-threaded version.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
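For illustration, this is what enumerating the 3k single-substitution neighbors of a 2-bit-packed barcode looks like (A=0, C=1, G=2, T=3). Note that the real BarcodeLookupMap avoids materializing this list and instead binary-searches a prefix-indexed sorted array of valid barcodes; the function below is only a sketch of the search space.

```rust
// Enumerate all 1-edit (substitution) neighbors of a 2-bit-packed
// barcode of length `k`. Illustrative only; BarcodeLookupMap does not
// materialize neighbors like this.
fn one_edit_neighbors(bc: u64, k: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(3 * k);
    for pos in 0..k {
        let shift = 2 * pos;
        let cur = (bc >> shift) & 0b11;
        for base in 0..4u64 {
            if base != cur {
                // clear the 2-bit slot at `pos`, then write the new base
                out.push((bc & !(0b11 << shift)) | (base << shift));
            }
        }
    }
    out
}

fn main() {
    // a 16-mer has 3 * 16 = 48 single-substitution neighbors
    println!("{}", one_edit_neighbors(0xDEAD, 16).len());
}
```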
The valid barcode set differs per sample: the same barcode may be a true cell in one sample but absent in another. Using a global union of all valid barcodes for identity correction is incorrect: a barcode that is valid in sample A but not in sample B would incorrectly self-correct in sample B instead of being looked up for 1-edit correction.
Fix: replace the global HashSet<u64> identity tier with a Vec<HashSet<u64>> indexed by sample_idx. The worker checks per_sample_identity[sample_idx].contains(&cell_bc) before falling back to the per-sample BarcodeLookupMap.
Performance unchanged (~2m50s on 1.1B reads), since the per-sample lookup is the same cost as the global lookup (Vec index + HashSet contains).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
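The sample-indexed identity check can be sketched in a few lines; names are illustrative rather than the actual code.

```rust
use std::collections::HashSet;

// A barcode self-corrects only if it is valid in *this* sample.
fn self_corrects(per_sample_identity: &[HashSet<u64>], sample_idx: usize, cell_bc: u64) -> bool {
    per_sample_identity[sample_idx].contains(&cell_bc)
}

fn main() {
    let sample_a: HashSet<u64> = [42u64].into_iter().collect();
    let sample_b: HashSet<u64> = HashSet::new();
    let tiers = vec![sample_a, sample_b];
    // valid in sample 0, so it self-corrects there...
    println!("{}", self_corrects(&tiers, 0, 42)); // true
    // ...but in sample 1 it must fall through to 1-edit correction
    println!("{}", self_corrects(&tiers, 1, 42)); // false
}
```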
The gather phase was the remaining bottleneck: processing 470 temp buckets sequentially took 105s (61% of total runtime). Now uses the same multi-threaded pattern as the single-barcode collate: workers pop temp buckets from an ArrayQueue, call collate_temporary_bucket_twopass_generic in parallel, and delete the temp files.
The manifest is built after gather completes, from the tsv_map composite keys (which already encode sample→cell mappings).
Performance on 1.1B read dataset:
- Gather: 20s (was 105s, 5.3x faster)
- Total: 1m35s (was 2m51s, 1.8x faster)
- Overall vs original single-threaded: 12x speedup (19min → 1.5min)
Breakdown:
- Setup + map loading: <1s
- Scatter (7 workers): 71s
- Gather (7 workers): 20s
- Total: 95s wall clock
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
10x Flex probe barcodes have 8 rotation variants per sample (from circular probe ligation products read at different register positions). Previously, each rotation was treated as a separate sample, splitting cell counts across 8 entries per true sample. Now, when the barcode file has 3 columns (observed_bc, canonical_bc, sample_name), all rotation variants are mapped to their canonical form during generate-permit-list. This produces 32 canonical samples instead of 256 rotation entries.
Changes:
- New SampleBarcodeInfo struct holds canonical barcodes, rotation->canonical mapping, and canonical->name mapping
- load_sample_barcode_list() parses the 3-column TSV format and deduplicates to canonical barcodes
- build_sample_permit_map() maps all rotation barcodes to canonical form
- get_sample_names() extracts names from the barcode file (column 3)
- Backward compatible: 1-column and 2-column formats still work
Comparison with CellRanger on the 4-plex Human Colorectal/Kidney dataset:
- Per-gene total count correlation: 1.0000
- Per-cell gene correlation median: 0.9948
- AF/CR UMI ratio median: 1.012
- Exact element match: 88.4%
- Cells with per-cell r > 0.95: 94.5%
- Shared barcodes: 511,470 / 530,480 (96.4%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
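A sketch of parsing the 3-column format into the two mappings described above. This is illustrative only and does not reproduce the actual load_sample_barcode_list signature or its 1- and 2-column handling.

```rust
use std::collections::HashMap;

// Each line: observed_bc \t canonical_bc \t sample_name.
// Builds rotation->canonical and canonical->name maps; inserting the
// canonical/name pair once per rotation row deduplicates naturally.
fn parse_barcode_table(
    lines: &[&str],
) -> (HashMap<String, String>, HashMap<String, String>) {
    let mut rotation_to_canonical = HashMap::new();
    let mut canonical_to_name = HashMap::new();
    for line in lines {
        let mut it = line.split('\t');
        if let (Some(obs), Some(canon), Some(name)) = (it.next(), it.next(), it.next()) {
            rotation_to_canonical.insert(obs.to_string(), canon.to_string());
            canonical_to_name.insert(canon.to_string(), name.to_string());
        }
    }
    (rotation_to_canonical, canonical_to_name)
}

fn main() {
    // two rotation variants of one canonical probe barcode
    let (rot, names) = parse_barcode_table(&[
        "ACGT\tACGT\tBC001",
        "CGTA\tACGT\tBC001",
    ]);
    println!("{} rotations -> {} canonical sample(s)", rot.len(), names.len());
}
```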
Replace unsafe raw pointer slicing with the safe split_at_mut pattern (the same one the single-barcode collate uses). The backing buffer is progressively split into non-overlapping mutable slices, each wrapped in a Cursor for per-bucket writes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
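A minimal sketch of that pattern (`split_into_cursors` is an illustrative name): mem::take the remaining tail each step so the split slices borrow from the original buffer rather than a local reborrow.

```rust
use std::io::{Cursor, Write};

// Progressively split one backing buffer into non-overlapping mutable
// slices, wrapping each in a Cursor for independent per-bucket writes.
fn split_into_cursors<'a>(buf: &'a mut [u8], sizes: &[usize]) -> Vec<Cursor<&'a mut [u8]>> {
    let mut rest = buf;
    let mut cursors = Vec::with_capacity(sizes.len());
    for &sz in sizes {
        // take `rest` by value so head/tail keep the original lifetime
        let (head, tail) = std::mem::take(&mut rest).split_at_mut(sz);
        cursors.push(Cursor::new(head));
        rest = tail;
    }
    cursors
}

fn main() {
    let mut backing = vec![0u8; 5];
    {
        let mut cursors = split_into_cursors(&mut backing, &[2, 3]);
        // per-bucket writers land in disjoint regions of the one buffer
        cursors[0].write_all(&[1, 1]).unwrap();
        cursors[1].write_all(&[2, 2, 2]).unwrap();
    }
    println!("{:?}", backing); // [1, 1, 2, 2, 2]
}
```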
Add correct_unmapped_counts_multi_bc(), which reads the raw unmapped barcode count file (keyed by composite sample_bc|cell_bc from piscem), applies both sample BC correction (via sample_permit_map) and cell BC correction (via tiered identity + BarcodeLookupMap), and writes unmapped_bc_count_collated.bin keyed by the corrected cell barcode.
The collated file is keyed by cell BC (not the composite), so quant's existing lookup by collate_key() works directly. If the same cell BC appears in multiple samples, their unmapped counts are summed; this is acceptable since the mapping rate is a per-cell metric.
This replaces the workaround of creating an empty unmapped file.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update correct_unmapped_counts_multi_bc to use libradicl's UnmappedBcFormat reader instead of parsing raw (u64, u32) pairs. The new format has a self-describing header declaring the number and widths of barcode fields. The reader:
1. Reads the header to determine per-field widths
2. Reads structured records with per-field barcode values
3. Corrects the sample BC (via sample_permit_map) and cell BC (via tiered identity + BarcodeLookupMap)
4. Writes the collated file as a bincode HashMap<u64, u32> keyed by corrected cell BC (for quant compatibility)
Handles empty files and format errors gracefully.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
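To make the "self-describing header" idea concrete, here is a reader for a hypothetical layout invented for this sketch (NOT libradicl's actual UnmappedBcFormat): a header of `[n_fields: u8][width_in_bytes: u8 × n_fields]`, then records of n_fields little-endian barcode values followed by a u32 count.

```rust
use std::io::{Cursor, Read};

// Hypothetical self-describing record format, for illustration only.
fn read_unmapped_counts(bytes: &[u8]) -> std::io::Result<Vec<(Vec<u64>, u32)>> {
    let mut rdr = Cursor::new(bytes);
    let mut nf = [0u8; 1];
    rdr.read_exact(&mut nf)?; // number of barcode fields
    let mut widths = vec![0u8; nf[0] as usize];
    rdr.read_exact(&mut widths)?; // per-field widths in bytes
    let mut records = Vec::new();
    'outer: loop {
        let mut fields = Vec::with_capacity(widths.len());
        for (i, &w) in widths.iter().enumerate() {
            let mut buf = [0u8; 8];
            if rdr.read_exact(&mut buf[..w as usize]).is_err() {
                if i == 0 {
                    break 'outer; // clean EOF between records
                }
                return Err(std::io::ErrorKind::UnexpectedEof.into());
            }
            fields.push(u64::from_le_bytes(buf));
        }
        let mut cnt = [0u8; 4];
        rdr.read_exact(&mut cnt)?;
        records.push((fields, u32::from_le_bytes(cnt)));
    }
    Ok(records)
}

fn main() {
    // one 2-field record: sample bc = 5, cell bc = 9 (4 bytes each), count = 3
    let mut bytes = vec![2u8, 4, 4];
    bytes.extend_from_slice(&5u32.to_le_bytes());
    bytes.extend_from_slice(&9u32.to_le_bytes());
    bytes.extend_from_slice(&3u32.to_le_bytes());
    println!("{:?}", read_unmapped_counts(&bytes).unwrap());
}
```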
Replace ad-hoc HashMap<u64, u32> and bincode serialization with libradicl's CollatedUnmappedCounts for unmapped barcode frequency data.
collate.rs:
- correct_unmapped_counts (single-barcode): reads the new self-describing format, falling back to legacy (u64, u32) pairs for old files. Writes CollatedUnmappedCounts::Single via the self-describing format.
- correct_unmapped_counts_multi_bc: produces CollatedUnmappedCounts::Multi keyed by (corrected_sample, corrected_cell) for per-sample accuracy.
quant.rs:
- Loads unmapped_bc_count_collated.bin via CollatedUnmappedCounts::read_from_file
- The worker uses get_single(cell_bc) for lookup (zero overhead for single-barcode)
- Multi-barcode: get_single falls back to a sum across samples (acceptable for mapping-rate computation; full per-sample lookup via get_multi is available once quant gains sample-aware chunk processing)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three main optimizations that together reduce quant time from >30 min to ~28s on a 5.6M cell × 19K gene 4-plex Flex dataset:
1. Sparse fast path for small cells (<100 reads): the new quantify_small_cell_sparse() implements full cr-like winner-take-all UMI resolution without any HashMap or dense-vector overhead. It outputs sparse (gene, umi) pairs via sorted-vec voting, then builds expressed_vec/expressed_ind directly via run-length counting, avoiding both the O(num_genes) dense vector zeroing and the O(num_genes) scan to extract non-zero entries per cell.
2. HashMap capacity management: eqid_map and gene_eqc use capacity-based recreation (replace with a fresh HashMap when capacity exceeds 256) instead of clear(), preventing O(capacity) iteration costs from accumulating across cells. gene_eqc clearing is also skipped entirely for fast-path cells that never touch it.
3. Thread-local MTX triplet accumulation: triplets are collected in per-worker Vec buffers and merged after all workers finish, replacing the previous approach of calling trimat.add_triplet() under a global mutex. Also caps the initial trimat capacity at 256M entries to prevent overallocation on large datasets.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
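The sorted-vec voting plus run-length counting idea can be sketched as below. This is illustrative, not the real quantify_small_cell_sparse: here a multi-gene tie simply keeps the first (lowest-id) gene, whereas the real path applies splicing-aware tie handling.

```rust
// Winner-take-all per UMI via sorted-vec voting, then run-length
// counting of the winners into sparse (gene, count) pairs.
fn sparse_counts(mut hits: Vec<(u32 /* umi */, u32 /* gene */)>) -> Vec<(u32, f32)> {
    // sort groups records by UMI, with each UMI's genes in sorted runs
    hits.sort_unstable();
    let mut winners: Vec<u32> = Vec::new();
    let mut i = 0;
    while i < hits.len() {
        let umi = hits[i].0;
        let (mut best_gene, mut best_n) = (hits[i].1, 0u32);
        let (mut run_gene, mut run_n) = (hits[i].1, 0u32);
        while i < hits.len() && hits[i].0 == umi {
            if hits[i].1 == run_gene {
                run_n += 1;
            } else {
                run_gene = hits[i].1;
                run_n = 1;
            }
            if run_n > best_n {
                best_n = run_n;
                best_gene = run_gene;
            }
            i += 1;
        }
        winners.push(best_gene); // most-supported gene for this UMI
    }
    // run-length count winners: no dense num_genes-sized vector needed
    winners.sort_unstable();
    let mut out: Vec<(u32, f32)> = Vec::new();
    for g in winners {
        match out.last_mut() {
            Some((lg, n)) if *lg == g => *n += 1.0,
            _ => out.push((g, 1.0)),
        }
    }
    out
}

fn main() {
    // UMI 1 votes 2:1 for gene 5; UMI 2 has only gene 7
    let counts = sparse_counts(vec![(1, 5), (1, 5), (1, 7), (2, 7)]);
    println!("{:?}", counts);
}
```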
The sparse fast path now handles USA (spliced/unspliced/ambiguous) mode by applying the same splicing-aware resolution as extract_counts in utils.rs. When cr-like voting produces a multi-gene tie, the tied gene set is resolved: same-gene S+U pairs go to the ambiguous slot, cross-gene ties prefer the spliced variant, and truly ambiguous ties are discarded. This matches the behavior of the full gene_eqc path exactly, verified on both Flex (non-USA) and 10x v3 (USA) datasets.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
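The three tie rules can be sketched as a small decision function. This is a reading of the rules stated above, not the extract_counts code; the `Status` encoding and `resolve_tie` name are assumptions.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    Spliced,
    Unspliced,
    Ambiguous,
}

// Resolve a multi-gene tie from cr-like voting, per the rules:
// same-gene S+U -> ambiguous; cross-gene with a unique spliced
// variant -> that spliced gene; anything else is discarded (None).
fn resolve_tie(tied: &[(u32, Status)]) -> Option<(u32, Status)> {
    let genes: HashSet<u32> = tied.iter().map(|&(g, _)| g).collect();
    if genes.len() == 1 {
        // same gene appearing as both S and U: ambiguous slot
        return Some((tied[0].0, Status::Ambiguous));
    }
    let spliced: Vec<u32> = tied
        .iter()
        .filter(|&&(_, s)| s == Status::Spliced)
        .map(|&(g, _)| g)
        .collect();
    if spliced.len() == 1 {
        return Some((spliced[0], Status::Spliced)); // prefer spliced
    }
    None // truly ambiguous: discard
}

fn main() {
    let r = resolve_tie(&[(7, Status::Spliced), (7, Status::Unspliced)]);
    println!("{:?}", r); // Some((7, Ambiguous))
}
```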
The multi-barcode permit list generation was not using the external cell barcode whitelist (--unfiltered-pl). All observed cell barcodes were treated as valid, inflating cell counts ~10x (1.5M vs 165K for the Colorectal sample) and degrading concordance with CellRanger.
Now, when --unfiltered-pl is provided:
- The first pass counts only cell BCs present in the whitelist
- Non-matching BCs are collected per sample for 1-edit rescue
- A per-sample BarcodeLookupMap corrects unmatched BCs to their nearest whitelist neighbor (the same algorithm as the single-barcode path)
Results on the 4-plex Flex dataset (BC001/Colorectal sample):
- UMI ratio AF/CR: 1.18 -> 1.05
- Per-gene Pearson: 0.9998 -> 0.99996
- Per-cell Pearson: 0.81 -> 0.92
- Per-cell median Pearson: 1.0 (perfect for most cells)
- Zero AF-only barcodes (all are now in the CellRanger whitelist)
- GPL runtime: 194s -> 104s (no more 1-edit maps on 1.5M junk BCs)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first pass that reads 809M records from the RAD file to count per-sample cell barcodes is now parallelized using ParallelChunkReader with worker threads. Each worker processes MetaChunks independently, updating shared DashMap histograms and thread-local unmatched-BC buffers (flushed after completion to reduce lock contention).
GPL time on the 4-plex Flex dataset: 104s -> 30s (3.5x speedup). CPU utilization: 99% -> 746% (good parallel scaling on 8 threads). Results are identical to the sequential version.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When quantifying multi-barcode (e.g. Flex) data, featureDump.txt now includes a sample_name column after CB, identifying which sample each cell belongs to. This is populated during the barcode rewrite step using the collation manifest's sample-to-chunk mapping.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, barcode labels and the featureDump sample_name were added by a post-processing rewrite step that assumed output rows matched the collated file order. This assumption relies on workers acquiring the output mutex in file order, which is not guaranteed with multi-threaded processing.
Now, each worker looks up the sample name from a manifest-derived cell_num → sample_idx map at write time. The cell_num (chunk index in the collated file) is deterministic regardless of thread scheduling. This removes the rewrite_barcodes_with_sample_prefix function entirely.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EDS (Efficient Dense Storage) output is no longer used. This removes:
- All EDS write paths in quant.rs (sce::eds calls, eds_file, eds_bytes)
- The sce crate dependency from Cargo.toml
- The use_mtx conditional throughout the codebase (MTX is now the default)
- The BufferedGzFile type and EDS-specific bootstrap file creation
The --use-eds flag is kept as a hidden flag that prints a deprecation error and exits. The --use-mtx flag is kept as a no-op for backwards compatibility.
Note: bootstrap output was previously written in EDS format. Bootstrap computation still runs, but its output is not yet written in the new format. This will be addressed in a follow-up.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bootstrap summary stats (mean and variance across replicates) are now written as bootstraps_mean.mtx and bootstraps_var.mtx in the same sparse MTX format as the main count matrix. Each worker accumulates bootstrap triplets thread-locally; they are merged and written after all workers complete.
When --summary-stat is not set, full per-replicate counts are computed but only the summary statistics are written. A note is printed informing users that full per-replicate output will be available in a future release via AnnData/h5ad layers, which provide efficient storage for the 3D (cells x genes x replicates) array.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
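Per entry, the summary statistics reduce a replicate vector to two numbers. A minimal sketch, assuming the population-variance convention (the actual code's convention is not stated here):

```rust
// Mean and (population) variance of one (cell, gene) count across
// bootstrap replicates; the real writer emits these as MTX triplets.
fn bootstrap_mean_var(replicates: &[f64]) -> (f64, f64) {
    let n = replicates.len() as f64;
    let mean = replicates.iter().sum::<f64>() / n;
    let var = replicates.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    (mean, var)
}

fn main() {
    let (m, v) = bootstrap_mean_var(&[4.0, 6.0, 8.0]);
    println!("mean={m} var={v}");
}
```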
Add multi-barcode RAD pipeline support and release/dependency cleanup
Merge develop-refactor into develop
Summary
This PR merges `develop` into `master`. Note: this repository uses `master` as the primary release branch rather than `main`, so this PR targets `master`.

Compared with `master`, `develop` is currently a strict superset and includes a large set of changes across five broad areas.

Major change areas
1. scATAC pipeline work
Relevant files include:
- `src/atac/cellfilter.rs`
- `src/atac/collate.rs`
- `src/atac/deduplicate.rs`
- `src/atac/sort.rs`
- `src/atac/run.rs`
- `src/atac/prog_opts.rs`

2. Generic quantification / refactor / long-read work

Generalizes `cellfilter`, `collate`, and `quant` over record / payload types.

Key files touched include:
- `src/quant.rs`
- `src/em.rs`
- `src/eq_class.rs`
- `src/pugutils.rs`
- `src/utils.rs`
- `src/convert.rs`

3. Multi-barcode / multi-sample RAD support
Multi-barcode support in `generate-permit-list`, including sample barcode handling and per-sample permit generation. Adds `sample_name` to `featureDump.txt` for multi-barcode outputs.

Correctness / performance follow-ups in this series include:
- `BarcodeLookupMap` tiered correction
- `CollatedUnmappedCounts` and self-describing unmapped-count formats across collate / quant
- `num_chunks` backpatching
- `--use-eds` deprecation

4. Release / packaging / CI updates
Adds `bump_and_publish.sh` for version bumping, dry-run validation, optional publish, tagging, and pushing. Also touches `dist-workspace.toml` and the `piscem` flow.

Relevant files include:
- `bump_and_publish.sh`
- `dist-workspace.toml`
- `.cargo/config.toml`
- `.cargo/config-portable.toml`
- `.github/build-setup.yml`
- `.github/workflows/release.yml`

5. Dependency and build modernization
- Bumps the `noodles` / `petgraph` dependency lines
- Updates `rand` usage to 0.10 and `rand_distr` to 0.6
- Drops `statrs` usage by replacing the normal log-density delta with its exact closed-form expression
- Drops `nalgebra` usage by replacing bootstrap count vectors with plain `Vec`s
- Removes the `rand 0.8` path from the resolved dependency graph

Validation
Validated locally on `develop` with:
- `cargo check`
- `cargo clippy --all-targets --all-features -- -D warnings`
- `dist generate --mode=ci --allow-dirty`
- `dist manifest --artifacts=all --output-format=json --no-local-paths --allow-dirty`

Scope
Relative to `master`, the current diff is substantial. At the time of PR creation, `develop` is strictly ahead of `master`.