
Merge develop into master #161

Merged
rob-p merged 86 commits into master from develop
Mar 19, 2026
Conversation

rob-p (Contributor) commented Mar 19, 2026

Summary

This PR merges develop into master.

Note: this repository uses master as the primary release branch rather than main, so this PR targets master.

Compared with master, develop is currently a strict superset and includes a large set of changes across four broad areas:

  1. scATAC support and related pipeline work
  2. generic / refactored quantification and long-read support
  3. multi-barcode / multi-sample RAD support for Flex-style data
  4. release-flow, packaging, CI, and dependency modernization

Major change areas

1. scATAC pipeline work

  • Add ATAC-specific modules and command-line support.
  • Implement ATAC cell filtering, collation, deduplication, sorting, and runtime glue.
  • Add ATAC documentation updates.
  • Improve compression / parallel-read handling and related diagnostics.

Relevant files include:

  • src/atac/cellfilter.rs
  • src/atac/collate.rs
  • src/atac/deduplicate.rs
  • src/atac/sort.rs
  • src/atac/run.rs
  • src/atac/prog_opts.rs

2. Generic quantification / refactor / long-read work

  • Refactor quantification internals and EQ-class payload handling.
  • Generalize parts of cellfilter, collate, and quant over record / payload types.
  • Add long-read EM support and supporting read-record changes.
  • Rework probability-map / stride handling and related tests.
  • Improve organization and reduce redundancy in long-read and quant code.

Key files touched include:

  • src/quant.rs
  • src/em.rs
  • src/eq_class.rs
  • src/pugutils.rs
  • src/utils.rs
  • src/convert.rs

3. Multi-barcode / multi-sample RAD support

  • Add multi-barcode record-type detection and CLI surface for 10x Flex-style data.
  • Implement multi-barcode generate-permit-list, including sample barcode handling and per-sample permit generation.
  • Add hierarchical / two-round multi-barcode collation and sample-aware grouping.
  • Add multi-barcode quantification dispatch and combined matrix output support.
  • Write sample-prefixed barcodes inline during quantification.
  • Add sample_name to featureDump.txt for multi-barcode outputs.
  • Add synthetic integration tests for the multi-barcode pipeline.

Correctness / performance follow-ups in this series include:

  • external whitelist filtering for multi-barcode cell barcodes
  • probe-barcode rotation collapsing to canonical samples
  • per-sample identity sets plus BarcodeLookupMap tiered correction
  • restored / improved multi-threaded scatter and parallelized gather / first-pass stages
  • USA support in the sparse fast path for small cells
  • CollatedUnmappedCounts and self-describing unmapped-count formats across collate / quant
  • removal of unsafe buffer slicing in two-round collation
  • fixes to barcode lookup and num_chunks backpatching
  • replacement of EDS bootstrap output with sparse MTX summary-stat output, deprecating --use-eds

4. Release / packaging / CI updates

  • Add bump_and_publish.sh for version bumping, dry-run validation, optional publish, tagging, and pushing.
  • Move cargo-dist configuration into dist-workspace.toml.
  • Add source tarball + SHA generation for release artifacts.
  • Add portable cargo config and CI build setup modeled on the piscem flow.
  • Update the generated release workflow to the newer cargo-dist/actions layout.
  • Update release/sanitize automation and related packaging files.

Relevant files include:

  • bump_and_publish.sh
  • dist-workspace.toml
  • .cargo/config.toml
  • .cargo/config-portable.toml
  • .github/build-setup.yml
  • .github/workflows/release.yml

5. Dependency and build modernization

  • Upgrade the codebase to Rust 2024 edition.
  • Update core dependencies, including newer noodles / petgraph lines.
  • Upgrade direct rand usage to 0.10 and rand_distr to 0.6.
  • Remove direct statrs usage by replacing the normal log-density delta with its exact closed-form expression.
  • Remove direct nalgebra usage by replacing bootstrap count vectors with plain Vecs.
  • This also removes the old rand 0.8 path from the resolved dependency graph.
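The exact closed-form expression replacing the statrs call is presumably the standard normal log-density; a minimal sketch (the precise delta used in the EM code may differ in parameterization, so treat this as illustrative):

```rust
/// Closed-form normal log-density:
///   ln N(x; mu, sigma) = -0.5 * ln(2*pi*sigma^2) - (x - mu)^2 / (2*sigma^2)
/// A direct expression like this removes the need for a distribution
/// crate when only the log-pdf (or a delta of two log-pdfs) is needed.
fn normal_log_pdf(x: f64, mu: f64, sigma: f64) -> f64 {
    let var = sigma * sigma;
    -0.5 * (2.0 * std::f64::consts::PI * var).ln() - (x - mu).powi(2) / (2.0 * var)
}
```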

Validation

Validated locally on develop with:

  • cargo check
  • cargo clippy --all-targets --all-features -- -D warnings
  • dist generate --mode=ci --allow-dirty
  • dist manifest --artifacts=all --output-format=json --no-local-paths --allow-dirty

Scope

Relative to master, the current diff is substantial and includes:

  • ATAC pipeline additions
  • generic quant / long-read refactor work
  • multi-barcode / multi-sample RAD support
  • release and CI modernization
  • dependency cleanup and packaging updates

At the time of PR creation, develop is strictly ahead of master.

rob-p and others added 29 commits March 16, 2026 14:11
Replace single-threaded scatter with proper multi-threaded implementation
using the same header-peek pattern as the single-barcode collate.

Multi-threaded scatter:
- Main thread reads raw chunk bytes, pushes to ArrayQueue
- N worker threads pop chunks, use two-pass record reading:
  1. from_bytes_collatable_header() — reads only header (barcodes, UMI)
  2. Skip alignment data for unmatched records (fast)
  3. from_bytes_with_header_retain_ori() — reads alignments only for
     matched records with orientation filtering
- Thread-local buffers with periodic flush to temp bucket files

Tiered cell barcode correction:
- Tier 1 (identity): HashSet of barcodes where raw == corrected in ALL
  samples. Single lookup, covers >99% of valid barcodes.
- Tier 2 (per-sample): Vec<HashMap> of barcodes that correct to a
  DIFFERENT barcode. Small maps, only consulted on identity miss.
- Handles sample-dependent correction correctly (same raw BC may correct
  to different targets in different samples).

Performance on real data:
- Dataset 1 (114M reads): 15s total (was 60s single-threaded, 4x speedup)
- Dataset 2 (1.1B reads): 9min total (was 19min, 2.1x speedup)
  Scatter: 77s (was 14min), Gather: ~100s

All 15 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
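The two-tier correction above can be sketched as follows. This is a simplified, hypothetical form: the real worker also threads through orientation filtering and record parsing, but the lookup order is the same.

```rust
use std::collections::{HashMap, HashSet};

/// Tiered cell-barcode correction sketch.
/// Tier 1: identity set of barcodes where raw == corrected in ALL samples.
/// Tier 2: small per-sample maps, consulted only on an identity miss,
/// covering barcodes that correct to a DIFFERENT target per sample.
fn correct_barcode(
    raw_bc: u64,
    identity: &HashSet<u64>,
    per_sample: &[HashMap<u64, u64>],
    sample_idx: usize,
) -> Option<u64> {
    if identity.contains(&raw_bc) {
        return Some(raw_bc); // single lookup covers the common case
    }
    // identity miss: fall back to the sample-specific correction map
    per_sample.get(sample_idx)?.get(&raw_bc).copied()
}
```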
Eliminate the bottleneck of loading and scanning the massive permit_map.bin
(2.2GB / 139M entries for unfiltered-pl mode). Instead:

1. Load only permit_freq.bin per sample (26MB for 1.6M valid BCs)
2. Identity tier: HashSet of all valid barcodes — self-correcting, covers
   >99% of real reads. Built in milliseconds from permit_freq keys.
3. Correction tier: per-sample BarcodeLookupMap built from valid barcodes.
   On-the-fly 1-edit neighbor lookup during scatter, no pre-materialized
   HashMap. BarcodeLookupMap uses prefix-indexed binary search — fast and
   memory-efficient.

Also cleaned up dead code from old worker loop.

Performance on 1.1B read dataset:
- Before (single-threaded, HashMap):  19 min
- Previous (multi-threaded, HashMap):  9 min (2.5min map loading)
- Now (multi-threaded, LookupMap):     2m40s (<1s map loading)
  - Map loading: <1s (was 2.5min — 150x faster)
  - Scatter: 58s (7 workers)
  - Gather: 99s

7.2x total speedup from the original single-threaded version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
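The on-the-fly 1-edit lookup can be sketched on 2-bit-encoded barcodes as below. This is a stand-in for BarcodeLookupMap's prefix-indexed binary search: it enumerates single-substitution neighbors and accepts a rescue only when it is unambiguous.

```rust
use std::collections::HashSet;

/// 1-edit (single-substitution) rescue for a 2-bit-encoded barcode of
/// bc_len bases. Returns the corrected barcode only if exactly one
/// valid neighbor exists; ambiguous barcodes are not rescued.
fn one_edit_rescue(bc: u64, bc_len: usize, valid: &HashSet<u64>) -> Option<u64> {
    let mut hit = None;
    for pos in 0..bc_len {
        let shift = 2 * pos;
        for base in 0..4u64 {
            // substitute `base` at position `pos`
            let neighbor = (bc & !(0b11 << shift)) | (base << shift);
            if neighbor != bc && valid.contains(&neighbor) {
                if hit.is_some() {
                    return None; // more than one candidate: give up
                }
                hit = Some(neighbor);
            }
        }
    }
    hit
}
```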
The valid barcode set differs per sample — the same barcode may be a
true cell in one sample but absent in another. Using a global union
of all valid barcodes for identity correction is incorrect: a barcode
that is valid in sample A but not in sample B would incorrectly
self-correct in sample B instead of being looked up for 1-edit
correction.

Fix: replace the global HashSet<u64> identity tier with a
Vec<HashSet<u64>> indexed by sample_idx. The worker checks
per_sample_identity[sample_idx].contains(&cell_bc) before
falling back to the per-sample BarcodeLookupMap.

Performance unchanged (~2m50s on 1.1B reads) since the per-sample
lookup is the same cost as the global lookup (Vec index + HashSet
contains).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
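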
The gather phase was the remaining bottleneck — processing 470 temp
buckets sequentially took 105s (61% of total runtime). Now uses the
same multi-threaded pattern as the single-barcode collate: workers
pop temp buckets from an ArrayQueue, call
collate_temporary_bucket_twopass_generic in parallel, delete temp files.

Manifest is built after gather completes, from the tsv_map composite
keys (which already encode sample→cell mappings).

Performance on 1.1B read dataset:
- Gather: 20s (was 105s — 5.3x faster)
- Total: 1m35s (was 2m51s — 1.8x faster)
- Overall vs original single-threaded: 12x speedup (19min → 1.5min)

Breakdown:
  Setup + map loading:  <1s
  Scatter (7 workers):  71s
  Gather (7 workers):   20s
  Total:                95s wall clock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
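The worker pattern above can be sketched with std primitives. The real code uses crossbeam's lock-free ArrayQueue and calls collate_temporary_bucket_twopass_generic per bucket; here a Mutex-guarded Vec and a trivial per-bucket action stand in.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Parallel gather sketch: workers pop bucket indices from a shared
/// queue until it drains, then exit. Returns the processed bucket ids
/// (sorted for determinism in this toy version).
fn parallel_gather(num_buckets: usize, num_workers: usize) -> Vec<usize> {
    let queue = Arc::new(Mutex::new((0..num_buckets).collect::<Vec<_>>()));
    let done = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for _ in 0..num_workers {
        let (q, d) = (Arc::clone(&queue), Arc::clone(&done));
        handles.push(thread::spawn(move || loop {
            let bucket = match q.lock().unwrap().pop() {
                Some(b) => b,
                None => break, // queue drained: worker exits
            };
            // real code: collate the temp bucket, then delete its file
            d.lock().unwrap().push(bucket);
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let mut out = done.lock().unwrap().clone();
    out.sort_unstable();
    out
}
```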
10x Flex probe barcodes have 8 rotation variants per sample (from
circular probe ligation products read at different register positions).
Previously, each rotation was treated as a separate sample, splitting
cell counts across 8 entries per true sample.

Now, when the barcode file has 3 columns (observed_bc, canonical_bc,
sample_name), all rotation variants are mapped to their canonical form
during generate-permit-list. This produces 32 canonical samples instead
of 256 rotation entries.

Changes:
- New SampleBarcodeInfo struct holds canonical barcodes, rotation->canonical
  mapping, and canonical->name mapping
- load_sample_barcode_list() parses 3-column TSV format, deduplicates
  to canonical barcodes
- build_sample_permit_map() maps all rotation barcodes to canonical form
- get_sample_names() extracts names from barcode file (column 3)
- Backward compatible: 1-column and 2-column formats still work

Comparison with CellRanger on 4-plex Human Colorectal/Kidney dataset:
  Per-gene total count correlation:  1.0000
  Per-cell gene correlation median:  0.9948
  AF/CR UMI ratio median:           1.012
  Exact element match:              88.4%
  Cells with per-cell r > 0.95:     94.5%
  Shared barcodes:                  511,470 / 530,480 (96.4%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
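The 3-column parsing with 1- and 2-column fallback can be sketched as below. This is a simplified form of load_sample_barcode_list(): barcodes stay as strings here rather than the 2-bit encodings the real code uses.

```rust
use std::collections::HashMap;

/// Parse (observed_bc [, canonical_bc [, sample_name]]) TSV lines into
/// rotation->canonical and canonical->name maps. Missing columns fall
/// back so older 1- and 2-column files still work.
fn load_sample_barcodes(
    tsv: &str,
) -> (HashMap<String, String>, HashMap<String, String>) {
    let mut rot_to_canonical = HashMap::new();
    let mut canonical_to_name = HashMap::new();
    for line in tsv.lines().filter(|l| !l.trim().is_empty()) {
        let fields: Vec<&str> = line.split('\t').collect();
        let observed = fields[0];
        // 1-column files: the observed barcode is its own canonical form
        let canonical = fields.get(1).copied().unwrap_or(observed);
        // 2-column files: no name column, use the canonical barcode
        let name = fields.get(2).copied().unwrap_or(canonical);
        rot_to_canonical.insert(observed.to_string(), canonical.to_string());
        canonical_to_name.insert(canonical.to_string(), name.to_string());
    }
    (rot_to_canonical, canonical_to_name)
}
```

Deduplication to canonical samples falls out of the maps: every rotation variant keys to one canonical barcode, and canonical_to_name has one entry per true sample.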
Replace unsafe raw pointer slicing with the safe split_at_mut pattern
(same as the single-barcode collate uses). The backing buffer is
progressively split into non-overlapping mutable slices, each wrapped
in a Cursor for per-bucket writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
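The split_at_mut pattern looks roughly like this (illustrative sizes; the real code derives per-bucket byte budgets from the collation pass):

```rust
use std::io::{Cursor, Write};

/// Progressively split one backing buffer into non-overlapping mutable
/// slices, each wrapped in a Cursor for per-bucket writes. Safe Rust
/// guarantees the slices cannot alias, which is exactly what the old
/// raw-pointer version had to promise by hand.
fn split_into_cursors<'a>(
    buf: &'a mut [u8],
    sizes: &[usize],
) -> Vec<Cursor<&'a mut [u8]>> {
    let mut cursors = Vec::with_capacity(sizes.len());
    let mut rest = buf;
    for &sz in sizes {
        // take `rest` by value so the split inherits the full lifetime
        let (head, tail) = std::mem::take(&mut rest).split_at_mut(sz);
        cursors.push(Cursor::new(head));
        rest = tail;
    }
    cursors
}
```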
Add correct_unmapped_counts_multi_bc() which reads the raw unmapped
barcode count file (keyed by composite sample_bc|cell_bc from piscem),
applies both sample BC correction (via sample_permit_map) and cell BC
correction (via tiered identity + BarcodeLookupMap), and writes
unmapped_bc_count_collated.bin keyed by corrected cell barcode.

The collated file is keyed by cell BC (not composite) so quant's
existing lookup by collate_key() works directly. If the same cell BC
appears in multiple samples, their unmapped counts are summed — this
is acceptable since the mapping rate is a per-cell metric.

This replaces the workaround of creating an empty unmapped file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update correct_unmapped_counts_multi_bc to use libradicl's
UnmappedBcFormat reader instead of parsing raw (u64, u32) pairs.

The new format has a self-describing header declaring the number and
widths of barcode fields. The reader:
1. Reads the header to determine per-field widths
2. Reads structured records with per-field barcode values
3. Corrects sample BC (via sample_permit_map) and cell BC (via
   tiered identity + BarcodeLookupMap)
4. Writes collated file as bincode HashMap<u64, u32> keyed by
   corrected cell BC (for quant compatibility)

Handles empty files and format errors gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ad-hoc HashMap<u64, u32> and bincode serialization with
libradicl's CollatedUnmappedCounts for unmapped barcode frequency data.

collate.rs:
- correct_unmapped_counts (single-barcode): reads new self-describing
  format, falls back to legacy (u64, u32) pairs for old files.
  Writes CollatedUnmappedCounts::Single via self-describing format.
- correct_unmapped_counts_multi_bc: produces CollatedUnmappedCounts::Multi
  keyed by (corrected_sample, corrected_cell) for per-sample accuracy.

quant.rs:
- Loads unmapped_bc_count_collated.bin via CollatedUnmappedCounts::read_from_file
- Worker uses get_single(cell_bc) for lookup — zero-overhead for single-barcode
- Multi-barcode: get_single falls back to sum across samples (acceptable
  for mapping rate computation; full per-sample lookup via get_multi
  available when quant gains sample-aware chunk processing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three main optimizations that together reduce quant time from >30 min
to ~28s on a 5.6M cell × 19K gene 4-plex Flex dataset:

1. Sparse fast path for small cells (< 100 reads): new
   quantify_small_cell_sparse() implements full cr-like winner-take-all
   UMI resolution without any HashMap or dense vector overhead. Outputs
   sparse (gene, umi) pairs via sorted-vec voting, then builds
   expressed_vec/expressed_ind directly via run-length counting —
   avoiding both the O(num_genes) dense vector zeroing and the
   O(num_genes) scan to extract non-zero entries per cell.

2. HashMap capacity management: eqid_map and gene_eqc use
   capacity-based recreate (replace with fresh HashMap when capacity
   exceeds 256) instead of clear(), preventing O(capacity) iteration
   costs from accumulating across cells. gene_eqc clearing is also
   skipped entirely for fast-path cells that never touch it.

3. Thread-local MTX triplet accumulation: triplets are collected in
   per-worker Vec buffers and merged after all workers finish, replacing
   the previous approach of calling trimat.add_triplet() under a global
   mutex. Also caps initial trimat capacity at 256M entries to prevent
   overallocation on large datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
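The sorted-vec voting at the heart of the fast path can be sketched as below. This is a simplified stand-in for quantify_small_cell_sparse(): ties are simply dropped here, whereas the real code applies further (e.g. USA-aware) resolution.

```rust
/// Winner-take-all UMI resolution without a HashMap: sort (umi, gene)
/// vote pairs, then scan runs. For each UMI, the gene with the most
/// votes wins; exact ties are discarded. Returns sparse (gene, umi)
/// pairs suitable for run-length counting into expressed_vec/ind.
fn resolve_umis(mut votes: Vec<(u64, u32)>) -> Vec<(u32, u64)> {
    votes.sort_unstable();
    let mut out = Vec::new();
    let mut i = 0;
    while i < votes.len() {
        let umi = votes[i].0;
        let (mut best_gene, mut best_n, mut tied) = (0u32, 0usize, false);
        let mut j = i;
        while j < votes.len() && votes[j].0 == umi {
            let gene = votes[j].1;
            let mut k = j;
            while k < votes.len() && votes[k].0 == umi && votes[k].1 == gene {
                k += 1; // run of identical (umi, gene) votes
            }
            let n = k - j;
            if n > best_n {
                best_gene = gene;
                best_n = n;
                tied = false;
            } else if n == best_n {
                tied = true;
            }
            j = k;
        }
        if !tied {
            out.push((best_gene, umi));
        }
        i = j;
    }
    out
}
```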
The sparse fast path now handles USA (spliced/unspliced/ambiguous) mode
by applying the same splicing-aware resolution as extract_counts in
utils.rs. When cr-like voting produces a multi-gene tie, the tied gene
set is resolved: same-gene S+U pairs go to the ambiguous slot,
cross-gene ties prefer the spliced variant, and truly ambiguous ties
are discarded. This matches the behavior of the full gene_eqc path
exactly, verified on both Flex (non-USA) and 10x v3 (USA) datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The multi-barcode permit list generation was not using the external
cell barcode whitelist (--unfiltered-pl). All observed cell barcodes
were treated as valid, inflating cell counts ~10x (1.5M vs 165K for
the Colorectal sample) and degrading concordance with CellRanger.

Now, when --unfiltered-pl is provided:
- First pass counts only cell BCs present in the whitelist
- Non-matching BCs are collected per-sample for 1-edit rescue
- Per-sample BarcodeLookupMap corrects unmatched BCs to their nearest
  whitelist neighbor (same algorithm as single-barcode path)

Results on 4-plex Flex dataset (BC001/Colorectal sample):
- UMI ratio AF/CR: 1.18 -> 1.05
- Per-gene Pearson: 0.9998 -> 0.99996
- Per-cell Pearson: 0.81 -> 0.92
- Per-cell median Pearson: 1.0 (perfect for most cells)
- Zero AF-only barcodes (all now in CellRanger whitelist)
- GPL runtime: 194s -> 104s (no more 1-edit maps on 1.5M junk BCs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first pass that reads 809M records from the RAD file to count
per-sample cell barcodes is now parallelized using ParallelChunkReader
with worker threads. Each worker processes MetaChunks independently,
updating shared DashMap histograms and thread-local unmatched BC
buffers (flushed after completion to reduce lock contention).

GPL time on 4-plex Flex dataset: 104s -> 30s (3.5x speedup).
CPU utilization: 99% -> 746% (good parallel scaling on 8 threads).
Results are identical to the sequential version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
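The thread-local-then-flush pattern can be sketched with std primitives. The real code reads MetaChunks via ParallelChunkReader and merges into shared DashMaps; here a Mutex-guarded HashMap stands in for the shared histogram.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// First-pass barcode counting sketch: each worker histograms its share
/// into a thread-local map, then flushes once at the end, so the shared
/// lock is taken once per worker instead of once per record.
fn parallel_count(records: Vec<u64>, num_workers: usize) -> HashMap<u64, u64> {
    let shared = Arc::new(Mutex::new(HashMap::new()));
    let chunk_sz = ((records.len() + num_workers - 1) / num_workers).max(1);
    let mut handles = Vec::new();
    for chunk in records.chunks(chunk_sz).map(|c| c.to_vec()) {
        let shared = Arc::clone(&shared);
        handles.push(thread::spawn(move || {
            let mut local: HashMap<u64, u64> = HashMap::new();
            for bc in chunk {
                *local.entry(bc).or_insert(0) += 1; // no locking here
            }
            let mut global = shared.lock().unwrap(); // single flush
            for (bc, n) in local {
                *global.entry(bc).or_insert(0) += n;
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    Arc::try_unwrap(shared).unwrap().into_inner().unwrap()
}
```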
When quantifying multi-barcode (e.g. Flex) data, featureDump.txt now
includes a sample_name column after CB, identifying which sample each
cell belongs to. This is populated during the barcode rewrite step
using the collation manifest's sample-to-chunk mapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, barcode labels and featureDump sample_name were added by a
post-processing rewrite step that assumed output rows matched collated
file order. This assumption relies on workers acquiring the output
mutex in file order, which is not guaranteed with multi-threaded
processing.

Now, each worker looks up the sample name from a manifest-derived
cell_num → sample_idx map at write time. The cell_num (chunk index in
the collated file) is deterministic regardless of thread scheduling.
This removes the rewrite_barcodes_with_sample_prefix function entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EDS (Efficient Dense Storage) output is no longer used. This removes:
- All EDS write paths in quant.rs (sce::eds calls, eds_file, eds_bytes)
- The sce crate dependency from Cargo.toml
- The use_mtx conditional throughout the codebase (MTX is now default)
- The BufferedGzFile type and EDS-specific bootstrap file creation

The --use-eds flag is kept as a hidden flag that prints a deprecation
error and exits. The --use-mtx flag is kept as a no-op for backwards
compatibility.

Note: Bootstrap output was previously written in EDS format. Bootstrap
computation still runs but output is not yet written in the new format.
This will be addressed in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bootstrap summary stats (mean and variance across replicates) are now
written as bootstraps_mean.mtx and bootstraps_var.mtx in the same
sparse MTX format as the main count matrix. Each worker accumulates
bootstrap triplets thread-locally and they are merged and written
after all workers complete.

When --summary-stat is not set, full per-replicate counts are computed
but only the summary statistics are written. A note is printed
informing users that full per-replicate output will be available in a
future release via AnnData/h5ad layers, which provide efficient
storage for the 3D (cells x genes x replicates) array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
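The per-cell summary-stat computation can be sketched as below. The population variance over replicates is an assumption here; the exact estimator lives in quant.rs.

```rust
/// Bootstrap summary stats for one cell: given per-replicate count
/// vectors, emit sparse (gene, mean, var) triplets, skipping genes that
/// are zero in every replicate, matching the sparse MTX output shape.
fn bootstrap_summary(replicates: &[Vec<f64>]) -> Vec<(usize, f64, f64)> {
    let r = replicates.len() as f64;
    let num_genes = replicates[0].len();
    let mut out = Vec::new();
    for g in 0..num_genes {
        let mean = replicates.iter().map(|rep| rep[g]).sum::<f64>() / r;
        let var = replicates
            .iter()
            .map(|rep| (rep[g] - mean).powi(2))
            .sum::<f64>()
            / r;
        if mean > 0.0 || var > 0.0 {
            out.push((g, mean, var)); // sparse: all-zero genes are skipped
        }
    }
    out
}
```

Accumulating these triplets per worker and merging after all workers finish mirrors the thread-local MTX triplet pattern used for the main count matrix.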
Add multi-barcode RAD pipeline support and release/dependency cleanup
Merge develop-refactor into develop
@rob-p rob-p merged commit cfc1883 into master Mar 19, 2026
17 checks passed