
Merge develop into master #161

Merged
rob-p merged 86 commits into master from develop
Mar 19, 2026
Conversation

rob-p (Contributor) commented Mar 19, 2026

Summary

This PR merges develop into master.

Note: this repository uses master as the primary release branch rather than main, so this PR targets master.

Compared with master, develop is currently a strict superset and includes a large set of changes across four broad areas:

  1. scATAC support and related pipeline work
  2. generic / refactored quantification and long-read support
  3. multi-barcode / multi-sample RAD support for Flex-style data
  4. release-flow, packaging, CI, and dependency modernization

Major change areas

1. scATAC pipeline work

  • Add ATAC-specific modules and command-line support.
  • Implement ATAC cell filtering, collation, deduplication, sorting, and runtime glue.
  • Add ATAC documentation updates.
  • Improve compression / parallel-read handling and related diagnostics.

Relevant files include:

  • src/atac/cellfilter.rs
  • src/atac/collate.rs
  • src/atac/deduplicate.rs
  • src/atac/sort.rs
  • src/atac/run.rs
  • src/atac/prog_opts.rs

2. Generic quantification / refactor / long-read work

  • Refactor quantification internals and EQ-class payload handling.
  • Generalize parts of cellfilter, collate, and quant over record / payload types.
  • Add long-read EM support and supporting read-record changes.
  • Rework probability-map / stride handling and related tests.
  • Improve organization and reduce redundancy in long-read and quant code.

Key files touched include:

  • src/quant.rs
  • src/em.rs
  • src/eq_class.rs
  • src/pugutils.rs
  • src/utils.rs
  • src/convert.rs

3. Multi-barcode / multi-sample RAD support

  • Add multi-barcode record-type detection and CLI surface for 10x Flex-style data.
  • Implement multi-barcode generate-permit-list, including sample barcode handling and per-sample permit generation.
  • Add hierarchical / two-round multi-barcode collation and sample-aware grouping.
  • Add multi-barcode quantification dispatch and combined matrix output support.
  • Write sample-prefixed barcodes inline during quantification.
  • Add sample_name to featureDump.txt for multi-barcode outputs.
  • Add synthetic integration tests for the multi-barcode pipeline.

Correctness / performance follow-ups in this series include:

  • external whitelist filtering for multi-barcode cell barcodes
  • probe-barcode rotation collapsing to canonical samples
  • per-sample identity sets plus BarcodeLookupMap tiered correction
  • restored / improved multi-threaded scatter and parallelized gather / first-pass stages
  • USA support in the sparse fast path for small cells
  • CollatedUnmappedCounts and self-describing unmapped-count formats across collate / quant
  • removal of unsafe buffer slicing in two-round collation
  • fixes to barcode lookup and num_chunks backpatching
  • replacement of EDS bootstrap output with sparse MTX summary-stat output, deprecating --use-eds

4. Release / packaging / CI updates

  • Add bump_and_publish.sh for version bumping, dry-run validation, optional publish, tagging, and pushing.
  • Move cargo-dist configuration into dist-workspace.toml.
  • Add source tarball + SHA generation for release artifacts.
  • Add portable cargo config and CI build setup modeled on the piscem flow.
  • Update the generated release workflow to the newer cargo-dist/actions layout.
  • Update release/sanitize automation and related packaging files.

Relevant files include:

  • bump_and_publish.sh
  • dist-workspace.toml
  • .cargo/config.toml
  • .cargo/config-portable.toml
  • .github/build-setup.yml
  • .github/workflows/release.yml

5. Dependency and build modernization

  • Upgrade the codebase to Rust 2024 edition.
  • Update core dependencies, including newer noodles / petgraph lines.
  • Upgrade direct rand usage to 0.10 and rand_distr to 0.6.
  • Remove direct statrs usage by replacing the normal log-density delta with its exact closed-form expression.
  • Remove direct nalgebra usage by replacing bootstrap count vectors with plain Vecs.
  • This also removes the old rand 0.8 path from the resolved dependency graph.
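The exact closed-form expression replacing the statrs call is presumably the standard normal log-density; a minimal sketch (the precise delta used in the EM code may differ in parameterization, so treat this as illustrative):

```rust
/// Closed-form normal log-density:
///   ln N(x; mu, sigma) = -0.5 * ln(2*pi*sigma^2) - (x - mu)^2 / (2*sigma^2)
/// A direct expression like this removes the need for a distribution
/// crate when only the log-pdf (or a delta of two log-pdfs) is needed.
fn normal_log_pdf(x: f64, mu: f64, sigma: f64) -> f64 {
    let var = sigma * sigma;
    -0.5 * (2.0 * std::f64::consts::PI * var).ln() - (x - mu).powi(2) / (2.0 * var)
}
```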

Validation

Validated locally on develop with:

  • cargo check
  • cargo clippy --all-targets --all-features -- -D warnings
  • dist generate --mode=ci --allow-dirty
  • dist manifest --artifacts=all --output-format=json --no-local-paths --allow-dirty

Scope

Relative to master, the current diff is substantial and includes:

  • ATAC pipeline additions
  • generic quant / long-read refactor work
  • multi-barcode / multi-sample RAD support
  • release and CI modernization
  • dependency cleanup and packaging updates

At the time of PR creation, develop is strictly ahead of master.

rob-p and others added 29 commits March 16, 2026 14:11
Replace single-threaded scatter with proper multi-threaded implementation
using the same header-peek pattern as the single-barcode collate.

Multi-threaded scatter:
- Main thread reads raw chunk bytes, pushes to ArrayQueue
- N worker threads pop chunks, use two-pass record reading:
  1. from_bytes_collatable_header() — reads only header (barcodes, UMI)
  2. Skip alignment data for unmatched records (fast)
  3. from_bytes_with_header_retain_ori() — reads alignments only for
     matched records with orientation filtering
- Thread-local buffers with periodic flush to temp bucket files

Tiered cell barcode correction:
- Tier 1 (identity): HashSet of barcodes where raw == corrected in ALL
  samples. Single lookup, covers >99% of valid barcodes.
- Tier 2 (per-sample): Vec<HashMap> of barcodes that correct to a
  DIFFERENT barcode. Small maps, only consulted on identity miss.
- Handles sample-dependent correction correctly (same raw BC may correct
  to different targets in different samples).

Performance on real data:
- Dataset 1 (114M reads): 15s total (was 60s single-threaded, 4x speedup)
- Dataset 2 (1.1B reads): 9min total (was 19min, 2.1x speedup)
  Scatter: 77s (was 14min), Gather: ~100s

All 15 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
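The two-tier correction above can be sketched as follows. This is a simplified, hypothetical form: the real worker also threads through orientation filtering and record parsing, but the lookup order is the same.

```rust
use std::collections::{HashMap, HashSet};

/// Tiered cell-barcode correction sketch.
/// Tier 1: identity set of barcodes where raw == corrected in ALL samples.
/// Tier 2: small per-sample maps, consulted only on an identity miss,
/// covering barcodes that correct to a DIFFERENT target per sample.
fn correct_barcode(
    raw_bc: u64,
    identity: &HashSet<u64>,
    per_sample: &[HashMap<u64, u64>],
    sample_idx: usize,
) -> Option<u64> {
    if identity.contains(&raw_bc) {
        return Some(raw_bc); // single lookup covers the common case
    }
    // identity miss: fall back to the sample-specific correction map
    per_sample.get(sample_idx)?.get(&raw_bc).copied()
}
```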
Eliminate the bottleneck of loading and scanning the massive permit_map.bin
(2.2GB / 139M entries for unfiltered-pl mode). Instead:

1. Load only permit_freq.bin per sample (26MB for 1.6M valid BCs)
2. Identity tier: HashSet of all valid barcodes — self-correcting, covers
   >99% of real reads. Built in milliseconds from permit_freq keys.
3. Correction tier: per-sample BarcodeLookupMap built from valid barcodes.
   On-the-fly 1-edit neighbor lookup during scatter, no pre-materialized
   HashMap. BarcodeLookupMap uses prefix-indexed binary search — fast and
   memory-efficient.

Also cleaned up dead code from old worker loop.

Performance on 1.1B read dataset:
- Before (single-threaded, HashMap):  19 min
- Previous (multi-threaded, HashMap):  9 min (2.5min map loading)
- Now (multi-threaded, LookupMap):     2m40s (<1s map loading)
  - Map loading: <1s (was 2.5min — 150x faster)
  - Scatter: 58s (7 workers)
  - Gather: 99s

7.2x total speedup from the original single-threaded version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
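The on-the-fly 1-edit lookup can be sketched on 2-bit-encoded barcodes as below. This is a stand-in for BarcodeLookupMap's prefix-indexed binary search: it enumerates single-substitution neighbors and accepts a rescue only when it is unambiguous.

```rust
use std::collections::HashSet;

/// 1-edit (single-substitution) rescue for a 2-bit-encoded barcode of
/// bc_len bases. Returns the corrected barcode only if exactly one
/// valid neighbor exists; ambiguous barcodes are not rescued.
fn one_edit_rescue(bc: u64, bc_len: usize, valid: &HashSet<u64>) -> Option<u64> {
    let mut hit = None;
    for pos in 0..bc_len {
        let shift = 2 * pos;
        for base in 0..4u64 {
            // substitute `base` at position `pos`
            let neighbor = (bc & !(0b11 << shift)) | (base << shift);
            if neighbor != bc && valid.contains(&neighbor) {
                if hit.is_some() {
                    return None; // more than one candidate: give up
                }
                hit = Some(neighbor);
            }
        }
    }
    hit
}
```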
The valid barcode set differs per sample — the same barcode may be a
true cell in one sample but absent in another. Using a global union
of all valid barcodes for identity correction is incorrect: a barcode
that is valid in sample A but not in sample B would incorrectly
self-correct in sample B instead of being looked up for 1-edit
correction.

Fix: replace the global HashSet<u64> identity tier with a
Vec<HashSet<u64>> indexed by sample_idx. The worker checks
per_sample_identity[sample_idx].contains(&cell_bc) before
falling back to the per-sample BarcodeLookupMap.

Performance unchanged (~2m50s on 1.1B reads) since the per-sample
lookup is the same cost as the global lookup (Vec index + HashSet
contains).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
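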
The gather phase was the remaining bottleneck — processing 470 temp
buckets sequentially took 105s (61% of total runtime). Now uses the
same multi-threaded pattern as the single-barcode collate: workers
pop temp buckets from an ArrayQueue, call
collate_temporary_bucket_twopass_generic in parallel, delete temp files.

Manifest is built after gather completes, from the tsv_map composite
keys (which already encode sample→cell mappings).

Performance on 1.1B read dataset:
- Gather: 20s (was 105s — 5.3x faster)
- Total: 1m35s (was 2m51s — 1.8x faster)
- Overall vs original single-threaded: 12x speedup (19min → 1.5min)

Breakdown:
  Setup + map loading:  <1s
  Scatter (7 workers):  71s
  Gather (7 workers):   20s
  Total:                95s wall clock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
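The worker pattern above can be sketched with std primitives. The real code uses crossbeam's lock-free ArrayQueue and calls collate_temporary_bucket_twopass_generic per bucket; here a Mutex-guarded Vec and a trivial per-bucket action stand in.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Parallel gather sketch: workers pop bucket indices from a shared
/// queue until it drains, then exit. Returns the processed bucket ids
/// (sorted for determinism in this toy version).
fn parallel_gather(num_buckets: usize, num_workers: usize) -> Vec<usize> {
    let queue = Arc::new(Mutex::new((0..num_buckets).collect::<Vec<_>>()));
    let done = Arc::new(Mutex::new(Vec::new()));
    let mut handles = Vec::new();
    for _ in 0..num_workers {
        let (q, d) = (Arc::clone(&queue), Arc::clone(&done));
        handles.push(thread::spawn(move || loop {
            let bucket = match q.lock().unwrap().pop() {
                Some(b) => b,
                None => break, // queue drained: worker exits
            };
            // real code: collate the temp bucket, then delete its file
            d.lock().unwrap().push(bucket);
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let mut out = done.lock().unwrap().clone();
    out.sort_unstable();
    out
}
```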
10x Flex probe barcodes have 8 rotation variants per sample (from
circular probe ligation products read at different register positions).
Previously, each rotation was treated as a separate sample, splitting
cell counts across 8 entries per true sample.

Now, when the barcode file has 3 columns (observed_bc, canonical_bc,
sample_name), all rotation variants are mapped to their canonical form
during generate-permit-list. This produces 32 canonical samples instead
of 256 rotation entries.

Changes:
- New SampleBarcodeInfo struct holds canonical barcodes, rotation->canonical
  mapping, and canonical->name mapping
- load_sample_barcode_list() parses 3-column TSV format, deduplicates
  to canonical barcodes
- build_sample_permit_map() maps all rotation barcodes to canonical form
- get_sample_names() extracts names from barcode file (column 3)
- Backward compatible: 1-column and 2-column formats still work

Comparison with CellRanger on 4-plex Human Colorectal/Kidney dataset:
  Per-gene total count correlation:  1.0000
  Per-cell gene correlation median:  0.9948
  AF/CR UMI ratio median:           1.012
  Exact element match:              88.4%
  Cells with per-cell r > 0.95:     94.5%
  Shared barcodes:                  511,470 / 530,480 (96.4%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
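The 3-column parsing with 1- and 2-column fallback can be sketched as below. This is a simplified form of load_sample_barcode_list(): barcodes stay as strings here rather than the 2-bit encodings the real code uses.

```rust
use std::collections::HashMap;

/// Parse (observed_bc [, canonical_bc [, sample_name]]) TSV lines into
/// rotation->canonical and canonical->name maps. Missing columns fall
/// back so older 1- and 2-column files still work.
fn load_sample_barcodes(
    tsv: &str,
) -> (HashMap<String, String>, HashMap<String, String>) {
    let mut rot_to_canonical = HashMap::new();
    let mut canonical_to_name = HashMap::new();
    for line in tsv.lines().filter(|l| !l.trim().is_empty()) {
        let fields: Vec<&str> = line.split('\t').collect();
        let observed = fields[0];
        // 1-column files: the observed barcode is its own canonical form
        let canonical = fields.get(1).copied().unwrap_or(observed);
        // 2-column files: no name column, use the canonical barcode
        let name = fields.get(2).copied().unwrap_or(canonical);
        rot_to_canonical.insert(observed.to_string(), canonical.to_string());
        canonical_to_name.insert(canonical.to_string(), name.to_string());
    }
    (rot_to_canonical, canonical_to_name)
}
```

Deduplication to canonical samples falls out of the maps: every rotation variant keys to one canonical barcode, and canonical_to_name has one entry per true sample.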
Replace unsafe raw pointer slicing with the safe split_at_mut pattern
(same as the single-barcode collate uses). The backing buffer is
progressively split into non-overlapping mutable slices, each wrapped
in a Cursor for per-bucket writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
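The split_at_mut pattern looks roughly like this (illustrative sizes; the real code derives per-bucket byte budgets from the collation pass):

```rust
use std::io::{Cursor, Write};

/// Progressively split one backing buffer into non-overlapping mutable
/// slices, each wrapped in a Cursor for per-bucket writes. Safe Rust
/// guarantees the slices cannot alias, which is exactly what the old
/// raw-pointer version had to promise by hand.
fn split_into_cursors<'a>(
    buf: &'a mut [u8],
    sizes: &[usize],
) -> Vec<Cursor<&'a mut [u8]>> {
    let mut cursors = Vec::with_capacity(sizes.len());
    let mut rest = buf;
    for &sz in sizes {
        // take `rest` by value so the split inherits the full lifetime
        let (head, tail) = std::mem::take(&mut rest).split_at_mut(sz);
        cursors.push(Cursor::new(head));
        rest = tail;
    }
    cursors
}
```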
Add correct_unmapped_counts_multi_bc() which reads the raw unmapped
barcode count file (keyed by composite sample_bc|cell_bc from piscem),
applies both sample BC correction (via sample_permit_map) and cell BC
correction (via tiered identity + BarcodeLookupMap), and writes
unmapped_bc_count_collated.bin keyed by corrected cell barcode.

The collated file is keyed by cell BC (not composite) so quant's
existing lookup by collate_key() works directly. If the same cell BC
appears in multiple samples, their unmapped counts are summed — this
is acceptable since the mapping rate is a per-cell metric.

This replaces the workaround of creating an empty unmapped file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update correct_unmapped_counts_multi_bc to use libradicl's
UnmappedBcFormat reader instead of parsing raw (u64, u32) pairs.

The new format has a self-describing header declaring the number and
widths of barcode fields. The reader:
1. Reads the header to determine per-field widths
2. Reads structured records with per-field barcode values
3. Corrects sample BC (via sample_permit_map) and cell BC (via
   tiered identity + BarcodeLookupMap)
4. Writes collated file as bincode HashMap<u64, u32> keyed by
   corrected cell BC (for quant compatibility)

Handles empty files and format errors gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ad-hoc HashMap<u64, u32> and bincode serialization with
libradicl's CollatedUnmappedCounts for unmapped barcode frequency data.

collate.rs:
- correct_unmapped_counts (single-barcode): reads new self-describing
  format, falls back to legacy (u64, u32) pairs for old files.
  Writes CollatedUnmappedCounts::Single via self-describing format.
- correct_unmapped_counts_multi_bc: produces CollatedUnmappedCounts::Multi
  keyed by (corrected_sample, corrected_cell) for per-sample accuracy.

quant.rs:
- Loads unmapped_bc_count_collated.bin via CollatedUnmappedCounts::read_from_file
- Worker uses get_single(cell_bc) for lookup — zero-overhead for single-barcode
- Multi-barcode: get_single falls back to sum across samples (acceptable
  for mapping rate computation; full per-sample lookup via get_multi
  available when quant gains sample-aware chunk processing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three main optimizations that together reduce quant time from >30 min
to ~28s on a 5.6M cell × 19K gene 4-plex Flex dataset:

1. Sparse fast path for small cells (< 100 reads): new
   quantify_small_cell_sparse() implements full cr-like winner-take-all
   UMI resolution without any HashMap or dense vector overhead. Outputs
   sparse (gene, umi) pairs via sorted-vec voting, then builds
   expressed_vec/expressed_ind directly via run-length counting —
   avoiding both the O(num_genes) dense vector zeroing and the
   O(num_genes) scan to extract non-zero entries per cell.

2. HashMap capacity management: eqid_map and gene_eqc use
   capacity-based recreate (replace with fresh HashMap when capacity
   exceeds 256) instead of clear(), preventing O(capacity) iteration
   costs from accumulating across cells. gene_eqc clearing is also
   skipped entirely for fast-path cells that never touch it.

3. Thread-local MTX triplet accumulation: triplets are collected in
   per-worker Vec buffers and merged after all workers finish, replacing
   the previous approach of calling trimat.add_triplet() under a global
   mutex. Also caps initial trimat capacity at 256M entries to prevent
   overallocation on large datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
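The sorted-vec voting at the heart of the fast path can be sketched as below. This is a simplified stand-in for quantify_small_cell_sparse(): ties are simply dropped here, whereas the real code applies further (e.g. USA-aware) resolution.

```rust
/// Winner-take-all UMI resolution without a HashMap: sort (umi, gene)
/// vote pairs, then scan runs. For each UMI, the gene with the most
/// votes wins; exact ties are discarded. Returns sparse (gene, umi)
/// pairs suitable for run-length counting into expressed_vec/ind.
fn resolve_umis(mut votes: Vec<(u64, u32)>) -> Vec<(u32, u64)> {
    votes.sort_unstable();
    let mut out = Vec::new();
    let mut i = 0;
    while i < votes.len() {
        let umi = votes[i].0;
        let (mut best_gene, mut best_n, mut tied) = (0u32, 0usize, false);
        let mut j = i;
        while j < votes.len() && votes[j].0 == umi {
            let gene = votes[j].1;
            let mut k = j;
            while k < votes.len() && votes[k].0 == umi && votes[k].1 == gene {
                k += 1; // run of identical (umi, gene) votes
            }
            let n = k - j;
            if n > best_n {
                best_gene = gene;
                best_n = n;
                tied = false;
            } else if n == best_n {
                tied = true;
            }
            j = k;
        }
        if !tied {
            out.push((best_gene, umi));
        }
        i = j;
    }
    out
}
```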
The sparse fast path now handles USA (spliced/unspliced/ambiguous) mode
by applying the same splicing-aware resolution as extract_counts in
utils.rs. When cr-like voting produces a multi-gene tie, the tied gene
set is resolved: same-gene S+U pairs go to the ambiguous slot,
cross-gene ties prefer the spliced variant, and truly ambiguous ties
are discarded. This matches the behavior of the full gene_eqc path
exactly, verified on both Flex (non-USA) and 10x v3 (USA) datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The multi-barcode permit list generation was not using the external
cell barcode whitelist (--unfiltered-pl). All observed cell barcodes
were treated as valid, inflating cell counts ~10x (1.5M vs 165K for
the Colorectal sample) and degrading concordance with CellRanger.

Now, when --unfiltered-pl is provided:
- First pass counts only cell BCs present in the whitelist
- Non-matching BCs are collected per-sample for 1-edit rescue
- Per-sample BarcodeLookupMap corrects unmatched BCs to their nearest
  whitelist neighbor (same algorithm as single-barcode path)

Results on 4-plex Flex dataset (BC001/Colorectal sample):
- UMI ratio AF/CR: 1.18 -> 1.05
- Per-gene Pearson: 0.9998 -> 0.99996
- Per-cell Pearson: 0.81 -> 0.92
- Per-cell median Pearson: 1.0 (perfect for most cells)
- Zero AF-only barcodes (all now in CellRanger whitelist)
- GPL runtime: 194s -> 104s (no more 1-edit maps on 1.5M junk BCs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first pass that reads 809M records from the RAD file to count
per-sample cell barcodes is now parallelized using ParallelChunkReader
with worker threads. Each worker processes MetaChunks independently,
updating shared DashMap histograms and thread-local unmatched BC
buffers (flushed after completion to reduce lock contention).

GPL time on 4-plex Flex dataset: 104s -> 30s (3.5x speedup).
CPU utilization: 99% -> 746% (good parallel scaling on 8 threads).
Results are identical to the sequential version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
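The thread-local-then-flush pattern can be sketched with std primitives. The real code reads MetaChunks via ParallelChunkReader and merges into shared DashMaps; here a Mutex-guarded HashMap stands in for the shared histogram.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// First-pass barcode counting sketch: each worker histograms its share
/// into a thread-local map, then flushes once at the end, so the shared
/// lock is taken once per worker instead of once per record.
fn parallel_count(records: Vec<u64>, num_workers: usize) -> HashMap<u64, u64> {
    let shared = Arc::new(Mutex::new(HashMap::new()));
    let chunk_sz = ((records.len() + num_workers - 1) / num_workers).max(1);
    let mut handles = Vec::new();
    for chunk in records.chunks(chunk_sz).map(|c| c.to_vec()) {
        let shared = Arc::clone(&shared);
        handles.push(thread::spawn(move || {
            let mut local: HashMap<u64, u64> = HashMap::new();
            for bc in chunk {
                *local.entry(bc).or_insert(0) += 1; // no locking here
            }
            let mut global = shared.lock().unwrap(); // single flush
            for (bc, n) in local {
                *global.entry(bc).or_insert(0) += n;
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    Arc::try_unwrap(shared).unwrap().into_inner().unwrap()
}
```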
When quantifying multi-barcode (e.g. Flex) data, featureDump.txt now
includes a sample_name column after CB, identifying which sample each
cell belongs to. This is populated during the barcode rewrite step
using the collation manifest's sample-to-chunk mapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, barcode labels and featureDump sample_name were added by a
post-processing rewrite step that assumed output rows matched collated
file order. This assumption relies on workers acquiring the output
mutex in file order, which is not guaranteed with multi-threaded
processing.

Now, each worker looks up the sample name from a manifest-derived
cell_num → sample_idx map at write time. The cell_num (chunk index in
the collated file) is deterministic regardless of thread scheduling.
This removes the rewrite_barcodes_with_sample_prefix function entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EDS (Efficient Dense Storage) output is no longer used. This removes:
- All EDS write paths in quant.rs (sce::eds calls, eds_file, eds_bytes)
- The sce crate dependency from Cargo.toml
- The use_mtx conditional throughout the codebase (MTX is now default)
- The BufferedGzFile type and EDS-specific bootstrap file creation

The --use-eds flag is kept as a hidden flag that prints a deprecation
error and exits. The --use-mtx flag is kept as a no-op for backwards
compatibility.

Note: Bootstrap output was previously written in EDS format. Bootstrap
computation still runs but output is not yet written in the new format.
This will be addressed in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bootstrap summary stats (mean and variance across replicates) are now
written as bootstraps_mean.mtx and bootstraps_var.mtx in the same
sparse MTX format as the main count matrix. Each worker accumulates
bootstrap triplets thread-locally and they are merged and written
after all workers complete.

When --summary-stat is not set, full per-replicate counts are computed
but only the summary statistics are written. A note is printed
informing users that full per-replicate output will be available in a
future release via AnnData/h5ad layers, which provide efficient
storage for the 3D (cells x genes x replicates) array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
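The per-cell summary-stat computation can be sketched as below. The population variance over replicates is an assumption here; the exact estimator lives in quant.rs.

```rust
/// Bootstrap summary stats for one cell: given per-replicate count
/// vectors, emit sparse (gene, mean, var) triplets, skipping genes that
/// are zero in every replicate, matching the sparse MTX output shape.
fn bootstrap_summary(replicates: &[Vec<f64>]) -> Vec<(usize, f64, f64)> {
    let r = replicates.len() as f64;
    let num_genes = replicates[0].len();
    let mut out = Vec::new();
    for g in 0..num_genes {
        let mean = replicates.iter().map(|rep| rep[g]).sum::<f64>() / r;
        let var = replicates
            .iter()
            .map(|rep| (rep[g] - mean).powi(2))
            .sum::<f64>()
            / r;
        if mean > 0.0 || var > 0.0 {
            out.push((g, mean, var)); // sparse: all-zero genes are skipped
        }
    }
    out
}
```

Accumulating these triplets per worker and merging after all workers finish mirrors the thread-local MTX triplet pattern used for the main count matrix.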
Add multi-barcode RAD pipeline support and release/dependency cleanup
Merge develop-refactor into develop
@rob-p rob-p merged commit cfc1883 into master Mar 19, 2026
17 checks passed