
Merge develop-refactor into develop #160

Merged
rob-p merged 63 commits into develop from develop-refactor
Mar 19, 2026

Conversation

rob-p (Contributor) commented on Mar 19, 2026

Summary

This PR merges develop-refactor back into develop.

Compared with develop, this branch is currently a strict superset of it and includes two major bodies of work:

  1. The refactor / long-read / EQ-class probability series that originally landed on develop-refactor
  2. The merged multi-barcode / multi-sample RAD pipeline work from multi-sample-rad (PR #159, "Add multi-barcode RAD pipeline support and release/dependency cleanup"), plus related release-flow and dependency cleanup

Major change areas

1. Quant / EM / EQ-class refactor work

  • Refactor quantification internals and EQ-class payload handling.
  • Add long-read EM support and related code-path cleanup.
  • Rework probability-map / stride handling and related tests.
  • Improve code structure around generic EQ-class payloads and shared quant logic.
  • Update supporting dependencies such as noodles and petgraph along the way.

2. Multi-barcode / multi-sample RAD support

  • Add multi-barcode record-type detection and CLI surface for 10x Flex-style data.
  • Implement multi-barcode generate-permit-list, including sample-barcode handling and per-sample cell permit generation.
  • Add hierarchical / two-round multi-barcode collation and sample-aware grouping.
  • Add multi-barcode quantification dispatch and combined-matrix output support.
  • Add integration tests for the synthetic multi-barcode pipeline.

3. Correctness and performance improvements

  • Filter multi-barcode cell barcodes against external whitelists.
  • Collapse probe-barcode rotations to canonical sample identifiers.
  • Replace global-union cell correction with per-sample identity sets plus BarcodeLookupMap tiered correction.
  • Restore / improve multi-threaded scatter and parallelize gather / first-pass work where applicable.
  • Support USA mode in the sparse fast path for small cells.
  • Use CollatedUnmappedCounts and self-describing unmapped-count formats throughout collation / quant.
  • Remove unsafe buffer slicing in two-round collation and fix barcode lookup / num_chunks backpatch issues.
  • Replace EDS bootstrap output with sparse MTX summary-stat output and deprecate the old --use-eds path.
  • Write sample-prefixed barcodes inline during quantification instead of post-processing.
  • Add sample_name to featureDump.txt for multi-barcode outputs.

4. Release / packaging / CI updates

  • Add bump_and_publish.sh for version bumping, dry-run validation, optional publish, tagging, and pushing.
  • Move cargo-dist configuration into dist-workspace.toml.
  • Add source tarball + SHA generation for release artifacts.
  • Add portable cargo config and CI build setup modeled on the piscem flow.
  • Update the generated release workflow to the newer cargo-dist/actions layout.

5. Dependency cleanup

  • Upgrade direct rand usage to 0.10 and rand_distr to 0.6.
  • Remove direct statrs usage by replacing the normal log-density delta with its exact closed-form expression.
  • Remove direct nalgebra usage by replacing bootstrap count vectors with plain Vecs.
  • This removes the old rand 0.8 path from the resolved dependency graph.

Validation

Validated locally on this branch with:

  • cargo check
  • cargo clippy --all-targets --all-features -- -D warnings
  • dist generate --mode=ci --allow-dirty
  • dist manifest --artifacts=all --output-format=json --no-local-paths --allow-dirty

Notes

rob-p and others added 29 commits on March 15, 2026 at 18:10
Add do_collate_multi_bc() for hierarchical collation of multi-barcode
RAD files (e.g., 10x Flex with sample + cell barcodes).

Single-pass architecture with dual barcode correction:
1. Loads sample_permit_map.bin and per-sample cell permit maps
2. Scatter phase: reads chunks, corrects sample BC (level 0) and
   cell BC (last level) using per-sample correction maps
3. Routes records to temp buckets keyed by composite
   (sample_idx << 48 | cell_bc) for hierarchical ordering
4. Gather phase: uses existing collate_temporary_bucket_twopass_generic
   to merge temp buckets into final collated RAD file
5. Writes collation_manifest.bin sidecar with sample -> chunk range mapping

Output: single collated RAD file with chunks ordered as
  [sample_0/cell_0, sample_0/cell_1, ..., sample_1/cell_0, ...]
preserving multi-barcode record format for downstream processing.

Key design choices:
- Workers correct BOTH barcodes in one pass (no intermediate files)
- Per-sample cell permit maps loaded from generate-permit-list output
- Composite key ensures correct hierarchical ordering in output
- Reuses libradicl's collate_temporary_bucket_twopass_generic for gather

All 11 existing tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
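The composite bucket key described above (`sample_idx << 48 | cell_bc`) can be sketched as follows; this is an illustrative reconstruction, not the actual implementation:

```rust
// Sample index occupies the high bits; the cell barcode the low 48.
const CELL_BITS: u64 = 48;
const CELL_MASK: u64 = (1u64 << CELL_BITS) - 1;

fn composite_key(sample_idx: u64, cell_bc: u64) -> u64 {
    (sample_idx << CELL_BITS) | (cell_bc & CELL_MASK)
}

fn sample_of(key: u64) -> u64 {
    key >> CELL_BITS
}

fn cell_of(key: u64) -> u64 {
    key & CELL_MASK
}
```

Because the sample index sits in the high bits, sorting buckets by this key yields exactly the sample-major, cell-minor ordering shown in the output layout above.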
Connect multi-barcode collated RAD files to the existing quantification
pipeline via do_quantify<MultiBarcodeReadRecord>.

Since MultiBarcodeReadRecordT implements all required traits (MappedRecord,
CollatableMappedRecord, KnownSize, UmiTaggedRecord), and collate_key()
returns the cell barcode, the existing per-chunk quantification machinery
works directly on multi-barcode collated files.

Changes:
- quant.rs: dispatch RnaShortMultiBC to do_quantify with
  MultiBarcodeReadRecord and BasicEqClassPayload
- utils.rs: add OptionalAlignmentExtras impl for MultiBarcodeReadRecordT
  (returns None, same as standard short-read records)

Current behavior: all samples quantified into a single output directory.
TODO: Use collation_manifest.bin to split into per-sample directories
or produce combined matrices with composite labels, controlled by
--multi-sample-output flag.

All 11 existing tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-round collation (collate.rs):
- Add do_collate_multi_bc_two_round() implementing the modular approach:
  Round 1: scatter records into per-sample intermediate RAD files
  Round 2: per sample, run cell-level collation using existing
           do_collate_with_temp machinery via libradicl functions
- Reuses dump_corrected_cb_chunk_to_temp_file_generic and
  collate_temporary_bucket_twopass_generic for cell-level collation
- Writes collation_manifest.bin and cleans up intermediate files
- Rename existing single-pass to do_collate_multi_bc_fast()
- Dispatch chooses between fast (default) and two-round modes

Combined matrix output (quant.rs):
- After do_quantify completes for multi-barcode data, post-process
  quants_mat_rows.txt to add sample name prefixes
- rewrite_barcodes_with_sample_prefix() loads collation_manifest.bin,
  maps each cell's row to its sample group, rewrites labels as
  "sample_name_ACGTACGT" format
- Produces combined output where sample identity is encoded in
  barcode labels (the "combined" output mode)

All 11 existing tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three integration tests exercising the multi-barcode pipeline:

1. test_synthetic_rad_roundtrip: Creates a synthetic multi-barcode RAD
   file using RadFileWriter, reads it back, verifies header tags
   (num_barcodes, b0len, b1len), record counts, and per-sample read
   distribution.

2. test_multi_bc_generate_permit_list: End-to-end test of the
   generate-permit-list step — creates synthetic RAD + sample barcode
   list, runs generate_permit_list(), verifies output files
   (sample_permit_map.bin, sample_info.json, per-sample permit maps).

3. test_collation_manifest_roundtrip: Tests CollationManifest
   serialization/deserialization via write_to_file/read_from_file.

Test infrastructure:
- Synthetic multi-barcode RAD builder using libradicl's RadFileWriter
- Deterministic barcode/UMI generation from indices
- Helper functions: make_multi_bc_prelude(), create_synthetic_multi_bc_rad(),
  write_sample_bc_list(), packed_to_nuc(), write_tg_map()
- Uses tempfile for isolated test directories

All 14 tests pass (11 existing + 3 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…detection

Two critical fixes discovered during real 10x Flex data testing:

1. Collate entry point: detect multi-barcode mode early (via
   multi_barcode flag in generate_permit_list.json) to skip
   permit_freq.bin loading, which doesn't exist for multi-BC mode.

2. Scatter phase: use single-threaded direct Chunk::from_bytes reading
   instead of buffered worker threads. The multi-threaded approach had
   a bug where chunk data passed to workers via byte buffers caused
   record parsing failures. The single-threaded approach reads chunks
   correctly and completes in ~50s for 114M reads.

Successfully tested on real Human Kidney 4k 10x Flex dataset:
- 97.8% mapping rate (114.6M/117.2M reads)
- 99.5M records scattered, 34,297 output chunks
- ~60s total collation time

Also added test_read_real_flex_rad integration test that verifies
Chunk::from_bytes works on real multi-barcode RAD files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes discovered during real 10x Flex dataset end-to-end test:

1. quant.rs: handle multi-barcode file tag names. do_quantify expects
   "cblen" but multi-barcode files use "b1len". Now falls back to
   b1len/b0len when cblen is not found.

2. collate.rs: backpatch num_chunks in output file after gather phase.
   The header was copied from the input file with the wrong chunk count;
   now the actual output chunk count is written back.

3. collate.rs: create empty unmapped_bc_count_collated.bin during
   multi-barcode collation (quant expects this file).

Successfully tested full end-to-end pipeline on Human Kidney 4k
10x Flex dataset:
  piscem-rs map-scrna: 97.8% mapping (114.6M/117.2M reads)
  generate-permit-list: 256 samples, 66K cells
  collate: 99.5M records, 34,297 output chunks, ~60s
  quant: 34,297 cells x 19,068 genes, ~5s
  Output: 246MB MTX matrix with sample-prefixed barcode labels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
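The num_chunks backpatch in fix 2 amounts to seeking back into the already-written header and overwriting the count field. A minimal sketch — the field's offset and width (a little-endian u64 here) are assumptions for illustration, not the real RAD header layout:

```rust
use std::io::{Cursor, Seek, SeekFrom, Write};

// Rewrite the chunk count at a known header offset after the gather
// phase has determined the true number of output chunks.
fn backpatch_num_chunks<W: Write + Seek>(
    out: &mut W,
    num_chunks_offset: u64,
    num_chunks: u64,
) -> std::io::Result<()> {
    out.seek(SeekFrom::Start(num_chunks_offset))?;
    out.write_all(&num_chunks.to_le_bytes())
}
```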
Replace single-threaded scatter with proper multi-threaded implementation
using the same header-peek pattern as the single-barcode collate.

Multi-threaded scatter:
- Main thread reads raw chunk bytes, pushes to ArrayQueue
- N worker threads pop chunks, use two-pass record reading:
  1. from_bytes_collatable_header() — reads only header (barcodes, UMI)
  2. Skip alignment data for unmatched records (fast)
  3. from_bytes_with_header_retain_ori() — reads alignments only for
     matched records with orientation filtering
- Thread-local buffers with periodic flush to temp bucket files

Tiered cell barcode correction:
- Tier 1 (identity): HashSet of barcodes where raw == corrected in ALL
  samples. Single lookup, covers >99% of valid barcodes.
- Tier 2 (per-sample): Vec<HashMap> of barcodes that correct to a
  DIFFERENT barcode. Small maps, only consulted on identity miss.
- Handles sample-dependent correction correctly (same raw BC may correct
  to different targets in different samples).

Performance on real data:
- Dataset 1 (114M reads): 15s total (was 60s single-threaded, 4x speedup)
- Dataset 2 (1.1B reads): 9min total (was 19min, 2.1x speedup)
  Scatter: 77s (was 14min), Gather: ~100s

All 15 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eliminate the bottleneck of loading and scanning the massive permit_map.bin
(2.2GB / 139M entries for unfiltered-pl mode). Instead:

1. Load only permit_freq.bin per sample (26MB for 1.6M valid BCs)
2. Identity tier: HashSet of all valid barcodes — self-correcting, covers
   >99% of real reads. Built in milliseconds from permit_freq keys.
3. Correction tier: per-sample BarcodeLookupMap built from valid barcodes.
   On-the-fly 1-edit neighbor lookup during scatter, no pre-materialized
   HashMap. BarcodeLookupMap uses prefix-indexed binary search — fast and
   memory-efficient.

Also cleaned up dead code from old worker loop.

Performance on 1.1B read dataset:
- Before (single-threaded, HashMap):  19 min
- Previous (multi-threaded, HashMap):  9 min (2.5min map loading)
- Now (multi-threaded, LookupMap):     2m40s (<1s map loading)
  - Map loading: <1s (was 2.5min — 150x faster)
  - Scatter: 58s (7 workers)
  - Gather: 99s

7.2x total speedup from the original single-threaded version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
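The on-the-fly 1-edit rescue in the correction tier relies on enumerating single-base substitutions of a 2-bit-packed barcode. A sketch of that candidate generation (illustrative; the real BarcodeLookupMap resolves candidates via prefix-indexed binary search rather than materializing them like this):

```rust
// All 1-edit (substitution) neighbors of a 2-bit-packed barcode of
// length `k` bases: 3 alternative bases at each of k positions.
fn one_edit_neighbors(bc: u64, k: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(3 * k);
    for pos in 0..k {
        let shift = 2 * pos;
        let cur = (bc >> shift) & 0b11;
        for base in 0..4u64 {
            if base != cur {
                // Clear the 2 bits at this position, then set the new base.
                out.push((bc & !(0b11u64 << shift)) | (base << shift));
            }
        }
    }
    out
}
```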
The valid barcode set differs per sample — the same barcode may be a
true cell in one sample but absent in another. Using a global union
of all valid barcodes for identity correction is incorrect: a barcode
that is valid in sample A but not in sample B would incorrectly
self-correct in sample B instead of being looked up for 1-edit
correction.

Fix: replace the global HashSet<u64> identity tier with a
Vec<HashSet<u64>> indexed by sample_idx. The worker checks
per_sample_identity[sample_idx].contains(&cell_bc) before
falling back to the per-sample BarcodeLookupMap.

Performance unchanged (~2m50s on 1.1B reads) since the per-sample
lookup is the same cost as the global lookup (Vec index + HashSet
contains).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
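The per-sample two-tier lookup described above can be sketched as follows (illustrative names; the real code falls back to a per-sample BarcodeLookupMap for on-the-fly 1-edit rescue rather than a pre-built HashMap):

```rust
use std::collections::{HashMap, HashSet};

// Tier 1: per-sample identity sets (barcode valid as-is in this sample).
// Tier 2: per-sample correction maps (barcode corrects to a different
// target, possibly different per sample). Returns None if uncorrectable.
fn correct_cell_bc(
    cell_bc: u64,
    sample_idx: usize,
    per_sample_identity: &[HashSet<u64>],
    per_sample_correction: &[HashMap<u64, u64>],
) -> Option<u64> {
    if per_sample_identity[sample_idx].contains(&cell_bc) {
        Some(cell_bc) // valid in this sample: self-corrects
    } else {
        per_sample_correction[sample_idx].get(&cell_bc).copied()
    }
}
```

This is exactly the fix above: barcode 5 may self-correct in sample 0 yet correct to a different target (or nothing) in sample 1.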
The gather phase was the remaining bottleneck — processing 470 temp
buckets sequentially took 105s (61% of total runtime). Now uses the
same multi-threaded pattern as the single-barcode collate: workers
pop temp buckets from an ArrayQueue, call
collate_temporary_bucket_twopass_generic in parallel, delete temp files.

Manifest is built after gather completes, from the tsv_map composite
keys (which already encode sample→cell mappings).

Performance on 1.1B read dataset:
- Gather: 20s (was 105s — 5.3x faster)
- Total: 1m35s (was 2m51s — 1.8x faster)
- Overall vs original single-threaded: 12x speedup (19min → 1.5min)

Breakdown:
  Setup + map loading:  <1s
  Scatter (7 workers):  71s
  Gather (7 workers):   20s
  Total:                95s wall clock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
10x Flex probe barcodes have 8 rotation variants per sample (from
circular probe ligation products read at different register positions).
Previously, each rotation was treated as a separate sample, splitting
cell counts across 8 entries per true sample.

Now, when the barcode file has 3 columns (observed_bc, canonical_bc,
sample_name), all rotation variants are mapped to their canonical form
during generate-permit-list. This produces 32 canonical samples instead
of 256 rotation entries.

Changes:
- New SampleBarcodeInfo struct holds canonical barcodes, rotation->canonical
  mapping, and canonical->name mapping
- load_sample_barcode_list() parses 3-column TSV format, deduplicates
  to canonical barcodes
- build_sample_permit_map() maps all rotation barcodes to canonical form
- get_sample_names() extracts names from barcode file (column 3)
- Backward compatible: 1-column and 2-column formats still work

Comparison with CellRanger on 4-plex Human Colorectal/Kidney dataset:
  Per-gene total count correlation:  1.0000
  Per-cell gene correlation median:  0.9948
  AF/CR UMI ratio median:           1.012
  Exact element match:              88.4%
  Cells with per-cell r > 0.95:     94.5%
  Shared barcodes:                  511,470 / 530,480 (96.4%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
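The 1-/2-/3-column barcode-file handling described above can be sketched like this (illustrative; the actual load_sample_barcode_list() may differ in detail):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct SampleBarcodeInfo {
    // observed (rotation) barcode -> canonical barcode
    rotation_to_canonical: HashMap<String, String>,
    // canonical barcode -> sample name (3-column format only)
    canonical_to_name: HashMap<String, String>,
}

fn parse_sample_bc_tsv(tsv: &str) -> SampleBarcodeInfo {
    let mut info = SampleBarcodeInfo::default();
    for line in tsv.lines().filter(|l| !l.trim().is_empty()) {
        let fields: Vec<&str> = line.split('\t').collect();
        match fields.as_slice() {
            // 1-column: the observed barcode is its own canonical form
            [obs] => {
                info.rotation_to_canonical.insert(obs.to_string(), obs.to_string());
            }
            // 2-column: observed -> canonical
            [obs, canon] => {
                info.rotation_to_canonical.insert(obs.to_string(), canon.to_string());
            }
            // 3-column: observed -> canonical, plus canonical -> sample name
            [obs, canon, name] => {
                info.rotation_to_canonical.insert(obs.to_string(), canon.to_string());
                info.canonical_to_name.insert(canon.to_string(), name.to_string());
            }
            _ => {}
        }
    }
    info
}
```

Deduplication to canonical samples falls out naturally: eight rotation lines sharing one canonical barcode produce eight entries in `rotation_to_canonical` but a single entry in `canonical_to_name`.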
Replace unsafe raw pointer slicing with the safe split_at_mut pattern
(same as the single-barcode collate uses). The backing buffer is
progressively split into non-overlapping mutable slices, each wrapped
in a Cursor for per-bucket writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
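The safe pattern described above can be sketched as follows: the backing buffer is progressively split into non-overlapping mutable slices, each wrapped in a Cursor for independent per-bucket writes (an illustrative minimal version, not the actual collate code):

```rust
use std::io::{Cursor, Write};

// Carve `buf` into non-overlapping mutable regions of the given sizes.
// split_at_mut proves disjointness to the borrow checker, so no unsafe
// raw-pointer slicing is needed.
fn split_into_bucket_cursors<'a>(
    mut buf: &'a mut [u8],
    sizes: &[usize],
) -> Vec<Cursor<&'a mut [u8]>> {
    let mut cursors = Vec::with_capacity(sizes.len());
    for &sz in sizes {
        let (head, tail) = buf.split_at_mut(sz);
        cursors.push(Cursor::new(head));
        buf = tail;
    }
    cursors
}
```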
Add correct_unmapped_counts_multi_bc() which reads the raw unmapped
barcode count file (keyed by composite sample_bc|cell_bc from piscem),
applies both sample BC correction (via sample_permit_map) and cell BC
correction (via tiered identity + BarcodeLookupMap), and writes
unmapped_bc_count_collated.bin keyed by corrected cell barcode.

The collated file is keyed by cell BC (not composite) so quant's
existing lookup by collate_key() works directly. If the same cell BC
appears in multiple samples, their unmapped counts are summed — this
is acceptable since the mapping rate is a per-cell metric.

This replaces the workaround of creating an empty unmapped file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update correct_unmapped_counts_multi_bc to use libradicl's
UnmappedBcFormat reader instead of parsing raw (u64, u32) pairs.

The new format has a self-describing header declaring the number and
widths of barcode fields. The reader:
1. Reads the header to determine per-field widths
2. Reads structured records with per-field barcode values
3. Corrects sample BC (via sample_permit_map) and cell BC (via
   tiered identity + BarcodeLookupMap)
4. Writes collated file as bincode HashMap<u64, u32> keyed by
   corrected cell BC (for quant compatibility)

Handles empty files and format errors gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ad-hoc HashMap<u64, u32> and bincode serialization with
libradicl's CollatedUnmappedCounts for unmapped barcode frequency data.

collate.rs:
- correct_unmapped_counts (single-barcode): reads new self-describing
  format, falls back to legacy (u64, u32) pairs for old files.
  Writes CollatedUnmappedCounts::Single via self-describing format.
- correct_unmapped_counts_multi_bc: produces CollatedUnmappedCounts::Multi
  keyed by (corrected_sample, corrected_cell) for per-sample accuracy.

quant.rs:
- Loads unmapped_bc_count_collated.bin via CollatedUnmappedCounts::read_from_file
- Worker uses get_single(cell_bc) for lookup — zero-overhead for single-barcode
- Multi-barcode: get_single falls back to sum across samples (acceptable
  for mapping rate computation; full per-sample lookup via get_multi
  available when quant gains sample-aware chunk processing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three main optimizations that together reduce quant time from >30 min
to ~28s on a 5.6M cell × 19K gene 4-plex Flex dataset:

1. Sparse fast path for small cells (< 100 reads): new
   quantify_small_cell_sparse() implements full cr-like winner-take-all
   UMI resolution without any HashMap or dense vector overhead. Outputs
   sparse (gene, umi) pairs via sorted-vec voting, then builds
   expressed_vec/expressed_ind directly via run-length counting —
   avoiding both the O(num_genes) dense vector zeroing and the
   O(num_genes) scan to extract non-zero entries per cell.

2. HashMap capacity management: eqid_map and gene_eqc use
   capacity-based recreate (replace with fresh HashMap when capacity
   exceeds 256) instead of clear(), preventing O(capacity) iteration
   costs from accumulating across cells. gene_eqc clearing is also
   skipped entirely for fast-path cells that never touch it.

3. Thread-local MTX triplet accumulation: triplets are collected in
   per-worker Vec buffers and merged after all workers finish, replacing
   the previous approach of calling trimat.add_triplet() under a global
   mutex. Also caps initial trimat capacity at 256M entries to prevent
   overallocation on large datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
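The sorted-vec voting at the heart of the sparse fast path can be sketched as follows (a simplified illustration of the winner-take-all idea; the real quantify_small_cell_sparse() also tracks UMIs and the USA-mode tie resolution added later):

```rust
// Winner-take-all over gene votes for one UMI: sort, run-length count,
// pick the gene with the most votes. A tie for the top count returns
// None (ambiguous), to be resolved or discarded by the caller.
// No HashMap or dense per-gene vector is touched.
fn winner_take_all(mut votes: Vec<u32>) -> Option<u32> {
    votes.sort_unstable();
    let mut best: Option<(u32, usize)> = None;
    let mut tied = false;
    let mut i = 0;
    while i < votes.len() {
        let g = votes[i];
        let mut j = i + 1;
        while j < votes.len() && votes[j] == g {
            j += 1;
        }
        let n = j - i; // run length = vote count for gene g
        match best {
            None => best = Some((g, n)),
            Some((_, bn)) if n > bn => {
                best = Some((g, n));
                tied = false;
            }
            Some((_, bn)) if n == bn => tied = true,
            _ => {}
        }
        i = j;
    }
    if tied { None } else { best.map(|(g, _)| g) }
}
```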
The sparse fast path now handles USA (spliced/unspliced/ambiguous) mode
by applying the same splicing-aware resolution as extract_counts in
utils.rs. When cr-like voting produces a multi-gene tie, the tied gene
set is resolved: same-gene S+U pairs go to the ambiguous slot,
cross-gene ties prefer the spliced variant, and truly ambiguous ties
are discarded. This matches the behavior of the full gene_eqc path
exactly, verified on both Flex (non-USA) and 10x v3 (USA) datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The multi-barcode permit list generation was not using the external
cell barcode whitelist (--unfiltered-pl). All observed cell barcodes
were treated as valid, inflating cell counts ~10x (1.5M vs 165K for
the Colorectal sample) and degrading concordance with CellRanger.

Now, when --unfiltered-pl is provided:
- First pass counts only cell BCs present in the whitelist
- Non-matching BCs are collected per-sample for 1-edit rescue
- Per-sample BarcodeLookupMap corrects unmatched BCs to their nearest
  whitelist neighbor (same algorithm as single-barcode path)

Results on 4-plex Flex dataset (BC001/Colorectal sample):
- UMI ratio AF/CR: 1.18 -> 1.05
- Per-gene Pearson: 0.9998 -> 0.99996
- Per-cell Pearson: 0.81 -> 0.92
- Per-cell median Pearson: 1.0 (perfect for most cells)
- Zero AF-only barcodes (all now in CellRanger whitelist)
- GPL runtime: 194s -> 104s (no more 1-edit maps on 1.5M junk BCs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first pass that reads 809M records from the RAD file to count
per-sample cell barcodes is now parallelized using ParallelChunkReader
with worker threads. Each worker processes MetaChunks independently,
updating shared DashMap histograms and thread-local unmatched BC
buffers (flushed after completion to reduce lock contention).

GPL time on 4-plex Flex dataset: 104s -> 30s (3.5x speedup).
CPU utilization: 99% -> 746% (good parallel scaling on 8 threads).
Results are identical to the sequential version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When quantifying multi-barcode (e.g. Flex) data, featureDump.txt now
includes a sample_name column after CB, identifying which sample each
cell belongs to. This is populated during the barcode rewrite step
using the collation manifest's sample-to-chunk mapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, barcode labels and featureDump sample_name were added by a
post-processing rewrite step that assumed output rows matched collated
file order. This assumption relies on workers acquiring the output
mutex in file order, which is not guaranteed with multi-threaded
processing.

Now, each worker looks up the sample name from a manifest-derived
cell_num → sample_idx map at write time. The cell_num (chunk index in
the collated file) is deterministic regardless of thread scheduling.
This removes the rewrite_barcodes_with_sample_prefix function entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EDS (Efficient Dense Storage) output is no longer used. This removes:
- All EDS write paths in quant.rs (sce::eds calls, eds_file, eds_bytes)
- The sce crate dependency from Cargo.toml
- The use_mtx conditional throughout the codebase (MTX is now default)
- The BufferedGzFile type and EDS-specific bootstrap file creation

The --use-eds flag is kept as a hidden flag that prints a deprecation
error and exits. The --use-mtx flag is kept as a no-op for backwards
compatibility.

Note: Bootstrap output was previously written in EDS format. Bootstrap
computation still runs but output is not yet written in the new format.
This will be addressed in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bootstrap summary stats (mean and variance across replicates) are now
written as bootstraps_mean.mtx and bootstraps_var.mtx in the same
sparse MTX format as the main count matrix. Each worker accumulates
bootstrap triplets thread-locally and they are merged and written
after all workers complete.

When --summary-stat is not set, full per-replicate counts are computed
but only the summary statistics are written. A note is printed
informing users that full per-replicate output will be available in a
future release via AnnData/h5ad layers, which provide efficient
storage for the 3D (cells x genes x replicates) array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
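Per entry, the summary statistics above reduce each (cell, gene) count across replicates to a mean and a variance. A minimal sketch — whether the real output uses the population or the sample variance is an assumption here:

```rust
// Mean and (population) variance of one (cell, gene) count across
// bootstrap replicates; these become one entry each in
// bootstraps_mean.mtx and bootstraps_var.mtx.
fn bootstrap_mean_var(replicates: &[f64]) -> (f64, f64) {
    let n = replicates.len() as f64;
    let mean = replicates.iter().sum::<f64>() / n;
    let var = replicates
        .iter()
        .map(|x| (x - mean) * (x - mean))
        .sum::<f64>()
        / n;
    (mean, var)
}
```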
Add multi-barcode RAD pipeline support and release/dependency cleanup
rob-p merged commit 935a946 into develop on Mar 19, 2026
11 checks passed
