
Merge develop-refactor into develop #160

Merged
rob-p merged 63 commits into develop from develop-refactor
Mar 19, 2026

Conversation

rob-p (Contributor) commented on Mar 19, 2026

Summary

This PR merges develop-refactor back into develop.

Compared with develop, this branch is currently a strict superset of it and includes two major bodies of work:

  1. The refactor / long-read / EQ-class probability series that originally landed on develop-refactor
  2. The merged multi-barcode / multi-sample RAD pipeline work from multi-sample-rad (PR #159, "Add multi-barcode RAD pipeline support and release/dependency cleanup"), plus related release-flow and dependency cleanup

Major change areas

1. Quant / EM / EQ-class refactor work

  • Refactor quantification internals and EQ-class payload handling.
  • Add long-read EM support and related code-path cleanup.
  • Rework probability-map / stride handling and related tests.
  • Improve code structure around generic EQ-class payloads and shared quant logic.
  • Update supporting dependencies such as noodles and petgraph along the way.

2. Multi-barcode / multi-sample RAD support

  • Add multi-barcode record-type detection and CLI surface for 10x Flex-style data.
  • Implement multi-barcode generate-permit-list, including sample-barcode handling and per-sample cell permit generation.
  • Add hierarchical / two-round multi-barcode collation and sample-aware grouping.
  • Add multi-barcode quantification dispatch and combined-matrix output support.
  • Add integration tests for the synthetic multi-barcode pipeline.

3. Correctness and performance improvements

  • Filter multi-barcode cell barcodes against external whitelists.
  • Collapse probe-barcode rotations to canonical sample identifiers.
  • Replace global-union cell correction with per-sample identity sets plus BarcodeLookupMap tiered correction.
  • Restore / improve multi-threaded scatter and parallelize gather / first-pass work where applicable.
  • Support USA mode in the sparse fast path for small cells.
  • Use CollatedUnmappedCounts and self-describing unmapped-count formats throughout collation / quant.
  • Remove unsafe buffer slicing in two-round collation and fix barcode lookup / num_chunks backpatch issues.
  • Replace EDS bootstrap output with sparse MTX summary-stat output and deprecate the old --use-eds path.
  • Write sample-prefixed barcodes inline during quantification instead of post-processing.
  • Add sample_name to featureDump.txt for multi-barcode outputs.

4. Release / packaging / CI updates

  • Add bump_and_publish.sh for version bumping, dry-run validation, optional publish, tagging, and pushing.
  • Move cargo-dist configuration into dist-workspace.toml.
  • Add source tarball + SHA generation for release artifacts.
  • Add portable cargo config and CI build setup modeled on the piscem flow.
  • Update the generated release workflow to the newer cargo-dist/actions layout.

5. Dependency cleanup

  • Upgrade direct rand usage to 0.10 and rand_distr to 0.6.
  • Remove direct statrs usage by replacing the normal log-density delta with its exact closed-form expression.
  • Remove direct nalgebra usage by replacing bootstrap count vectors with plain Vecs.
  • This removes the old rand 0.8 path from the resolved dependency graph.

Validation

Validated locally on this branch with:

  • cargo check
  • cargo clippy --all-targets --all-features -- -D warnings
  • dist generate --mode=ci --allow-dirty
  • dist manifest --artifacts=all --output-format=json --no-local-paths --allow-dirty

Notes

rob-p and others added 29 commits on March 15, 2026 at 18:10
Add do_collate_multi_bc() for hierarchical collation of multi-barcode
RAD files (e.g., 10x Flex with sample + cell barcodes).

Single-pass architecture with dual barcode correction:
1. Loads sample_permit_map.bin and per-sample cell permit maps
2. Scatter phase: reads chunks, corrects sample BC (level 0) and
   cell BC (last level) using per-sample correction maps
3. Routes records to temp buckets keyed by composite
   (sample_idx << 48 | cell_bc) for hierarchical ordering
4. Gather phase: uses existing collate_temporary_bucket_twopass_generic
   to merge temp buckets into final collated RAD file
5. Writes collation_manifest.bin sidecar with sample -> chunk range mapping

Output: single collated RAD file with chunks ordered as
  [sample_0/cell_0, sample_0/cell_1, ..., sample_1/cell_0, ...]
preserving multi-barcode record format for downstream processing.

Key design choices:
- Workers correct BOTH barcodes in one pass (no intermediate files)
- Per-sample cell permit maps loaded from generate-permit-list output
- Composite key ensures correct hierarchical ordering in output
- Reuses libradicl's collate_temporary_bucket_twopass_generic for gather

All 11 existing tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
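The composite bucket key described above (`sample_idx << 48 | cell_bc`) can be sketched as follows; this is an illustrative reconstruction, not the actual implementation:

```rust
// Sample index occupies the high bits; the cell barcode the low 48.
const CELL_BITS: u64 = 48;
const CELL_MASK: u64 = (1u64 << CELL_BITS) - 1;

fn composite_key(sample_idx: u64, cell_bc: u64) -> u64 {
    (sample_idx << CELL_BITS) | (cell_bc & CELL_MASK)
}

fn sample_of(key: u64) -> u64 {
    key >> CELL_BITS
}

fn cell_of(key: u64) -> u64 {
    key & CELL_MASK
}
```

Because the sample index sits in the high bits, sorting buckets by this key yields exactly the sample-major, cell-minor ordering shown in the output layout above.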
Connect multi-barcode collated RAD files to the existing quantification
pipeline via do_quantify<MultiBarcodeReadRecord>.

Since MultiBarcodeReadRecordT implements all required traits (MappedRecord,
CollatableMappedRecord, KnownSize, UmiTaggedRecord), and collate_key()
returns the cell barcode, the existing per-chunk quantification machinery
works directly on multi-barcode collated files.

Changes:
- quant.rs: dispatch RnaShortMultiBC to do_quantify with
  MultiBarcodeReadRecord and BasicEqClassPayload
- utils.rs: add OptionalAlignmentExtras impl for MultiBarcodeReadRecordT
  (returns None, same as standard short-read records)

Current behavior: all samples quantified into a single output directory.
TODO: Use collation_manifest.bin to split into per-sample directories
or produce combined matrices with composite labels, controlled by
--multi-sample-output flag.

All 11 existing tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-round collation (collate.rs):
- Add do_collate_multi_bc_two_round() implementing the modular approach:
  Round 1: scatter records into per-sample intermediate RAD files
  Round 2: per sample, run cell-level collation using existing
           do_collate_with_temp machinery via libradicl functions
- Reuses dump_corrected_cb_chunk_to_temp_file_generic and
  collate_temporary_bucket_twopass_generic for cell-level collation
- Writes collation_manifest.bin and cleans up intermediate files
- Rename existing single-pass to do_collate_multi_bc_fast()
- Dispatch chooses between fast (default) and two-round modes

Combined matrix output (quant.rs):
- After do_quantify completes for multi-barcode data, post-process
  quants_mat_rows.txt to add sample name prefixes
- rewrite_barcodes_with_sample_prefix() loads collation_manifest.bin,
  maps each cell's row to its sample group, rewrites labels as
  "sample_name_ACGTACGT" format
- Produces combined output where sample identity is encoded in
  barcode labels (the "combined" output mode)

All 11 existing tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three integration tests exercising the multi-barcode pipeline:

1. test_synthetic_rad_roundtrip: Creates a synthetic multi-barcode RAD
   file using RadFileWriter, reads it back, verifies header tags
   (num_barcodes, b0len, b1len), record counts, and per-sample read
   distribution.

2. test_multi_bc_generate_permit_list: End-to-end test of the
   generate-permit-list step — creates synthetic RAD + sample barcode
   list, runs generate_permit_list(), verifies output files
   (sample_permit_map.bin, sample_info.json, per-sample permit maps).

3. test_collation_manifest_roundtrip: Tests CollationManifest
   serialization/deserialization via write_to_file/read_from_file.

Test infrastructure:
- Synthetic multi-barcode RAD builder using libradicl's RadFileWriter
- Deterministic barcode/UMI generation from indices
- Helper functions: make_multi_bc_prelude(), create_synthetic_multi_bc_rad(),
  write_sample_bc_list(), packed_to_nuc(), write_tg_map()
- Uses tempfile for isolated test directories

All 14 tests pass (11 existing + 3 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…detection

Two critical fixes discovered during real 10x Flex data testing:

1. Collate entry point: detect multi-barcode mode early (via
   multi_barcode flag in generate_permit_list.json) to skip
   permit_freq.bin loading, which doesn't exist for multi-BC mode.

2. Scatter phase: use single-threaded direct Chunk::from_bytes reading
   instead of buffered worker threads. The multi-threaded approach had
   a bug where chunk data passed to workers via byte buffers caused
   record parsing failures. The single-threaded approach reads chunks
   correctly and completes in ~50s for 114M reads.

Successfully tested on real Human Kidney 4k 10x Flex dataset:
- 97.8% mapping rate (114.6M/117.2M reads)
- 99.5M records scattered, 34,297 output chunks
- ~60s total collation time

Also added test_read_real_flex_rad integration test that verifies
Chunk::from_bytes works on real multi-barcode RAD files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes discovered during real 10x Flex dataset end-to-end test:

1. quant.rs: handle multi-barcode file tag names. do_quantify expects
   "cblen" but multi-barcode files use "b1len". Now falls back to
   b1len/b0len when cblen is not found.

2. collate.rs: backpatch num_chunks in output file after gather phase.
   The header was copied from the input file with the wrong chunk count;
   now the actual output chunk count is written back.

3. collate.rs: create empty unmapped_bc_count_collated.bin during
   multi-barcode collation (quant expects this file).

Successfully tested full end-to-end pipeline on Human Kidney 4k
10x Flex dataset:
  piscem-rs map-scrna: 97.8% mapping (114.6M/117.2M reads)
  generate-permit-list: 256 samples, 66K cells
  collate: 99.5M records, 34,297 output chunks, ~60s
  quant: 34,297 cells x 19,068 genes, ~5s
  Output: 246MB MTX matrix with sample-prefixed barcode labels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
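The num_chunks backpatch in fix 2 amounts to seeking back into the already-written header and overwriting the count field. A minimal sketch — the field's offset and width (a little-endian u64 here) are assumptions for illustration, not the real RAD header layout:

```rust
use std::io::{Cursor, Seek, SeekFrom, Write};

// Rewrite the chunk count at a known header offset after the gather
// phase has determined the true number of output chunks.
fn backpatch_num_chunks<W: Write + Seek>(
    out: &mut W,
    num_chunks_offset: u64,
    num_chunks: u64,
) -> std::io::Result<()> {
    out.seek(SeekFrom::Start(num_chunks_offset))?;
    out.write_all(&num_chunks.to_le_bytes())
}
```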
Replace single-threaded scatter with proper multi-threaded implementation
using the same header-peek pattern as the single-barcode collate.

Multi-threaded scatter:
- Main thread reads raw chunk bytes, pushes to ArrayQueue
- N worker threads pop chunks, use two-pass record reading:
  1. from_bytes_collatable_header() — reads only header (barcodes, UMI)
  2. Skip alignment data for unmatched records (fast)
  3. from_bytes_with_header_retain_ori() — reads alignments only for
     matched records with orientation filtering
- Thread-local buffers with periodic flush to temp bucket files

Tiered cell barcode correction:
- Tier 1 (identity): HashSet of barcodes where raw == corrected in ALL
  samples. Single lookup, covers >99% of valid barcodes.
- Tier 2 (per-sample): Vec<HashMap> of barcodes that correct to a
  DIFFERENT barcode. Small maps, only consulted on identity miss.
- Handles sample-dependent correction correctly (same raw BC may correct
  to different targets in different samples).

Performance on real data:
- Dataset 1 (114M reads): 15s total (was 60s single-threaded, 4x speedup)
- Dataset 2 (1.1B reads): 9min total (was 19min, 2.1x speedup)
  Scatter: 77s (was 14min), Gather: ~100s

All 15 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eliminate the bottleneck of loading and scanning the massive permit_map.bin
(2.2GB / 139M entries for unfiltered-pl mode). Instead:

1. Load only permit_freq.bin per sample (26MB for 1.6M valid BCs)
2. Identity tier: HashSet of all valid barcodes — self-correcting, covers
   >99% of real reads. Built in milliseconds from permit_freq keys.
3. Correction tier: per-sample BarcodeLookupMap built from valid barcodes.
   On-the-fly 1-edit neighbor lookup during scatter, no pre-materialized
   HashMap. BarcodeLookupMap uses prefix-indexed binary search — fast and
   memory-efficient.

Also cleaned up dead code from old worker loop.

Performance on 1.1B read dataset:
- Before (single-threaded, HashMap):  19 min
- Previous (multi-threaded, HashMap):  9 min (2.5min map loading)
- Now (multi-threaded, LookupMap):     2m40s (<1s map loading)
  - Map loading: <1s (was 2.5min — 150x faster)
  - Scatter: 58s (7 workers)
  - Gather: 99s

7.2x total speedup from the original single-threaded version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
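The on-the-fly 1-edit rescue in the correction tier relies on enumerating single-base substitutions of a 2-bit-packed barcode. A sketch of that candidate generation (illustrative; the real BarcodeLookupMap resolves candidates via prefix-indexed binary search rather than materializing them like this):

```rust
// All 1-edit (substitution) neighbors of a 2-bit-packed barcode of
// length `k` bases: 3 alternative bases at each of k positions.
fn one_edit_neighbors(bc: u64, k: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(3 * k);
    for pos in 0..k {
        let shift = 2 * pos;
        let cur = (bc >> shift) & 0b11;
        for base in 0..4u64 {
            if base != cur {
                // Clear the 2 bits at this position, then set the new base.
                out.push((bc & !(0b11u64 << shift)) | (base << shift));
            }
        }
    }
    out
}
```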
The valid barcode set differs per sample — the same barcode may be a
true cell in one sample but absent in another. Using a global union
of all valid barcodes for identity correction is incorrect: a barcode
that is valid in sample A but not in sample B would incorrectly
self-correct in sample B instead of being looked up for 1-edit
correction.

Fix: replace the global HashSet<u64> identity tier with a
Vec<HashSet<u64>> indexed by sample_idx. The worker checks
per_sample_identity[sample_idx].contains(&cell_bc) before
falling back to the per-sample BarcodeLookupMap.

Performance unchanged (~2m50s on 1.1B reads) since the per-sample
lookup is the same cost as the global lookup (Vec index + HashSet
contains).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
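The per-sample two-tier lookup described above can be sketched as follows (illustrative names; the real code falls back to a per-sample BarcodeLookupMap for on-the-fly 1-edit rescue rather than a pre-built HashMap):

```rust
use std::collections::{HashMap, HashSet};

// Tier 1: per-sample identity sets (barcode valid as-is in this sample).
// Tier 2: per-sample correction maps (barcode corrects to a different
// target, possibly different per sample). Returns None if uncorrectable.
fn correct_cell_bc(
    cell_bc: u64,
    sample_idx: usize,
    per_sample_identity: &[HashSet<u64>],
    per_sample_correction: &[HashMap<u64, u64>],
) -> Option<u64> {
    if per_sample_identity[sample_idx].contains(&cell_bc) {
        Some(cell_bc) // valid in this sample: self-corrects
    } else {
        per_sample_correction[sample_idx].get(&cell_bc).copied()
    }
}
```

This is exactly the fix above: barcode 5 may self-correct in sample 0 yet correct to a different target (or nothing) in sample 1.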
The gather phase was the remaining bottleneck — processing 470 temp
buckets sequentially took 105s (61% of total runtime). Now uses the
same multi-threaded pattern as the single-barcode collate: workers
pop temp buckets from an ArrayQueue, call
collate_temporary_bucket_twopass_generic in parallel, delete temp files.

Manifest is built after gather completes, from the tsv_map composite
keys (which already encode sample→cell mappings).

Performance on 1.1B read dataset:
- Gather: 20s (was 105s — 5.3x faster)
- Total: 1m35s (was 2m51s — 1.8x faster)
- Overall vs original single-threaded: 12x speedup (19min → 1.5min)

Breakdown:
  Setup + map loading:  <1s
  Scatter (7 workers):  71s
  Gather (7 workers):   20s
  Total:                95s wall clock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
10x Flex probe barcodes have 8 rotation variants per sample (from
circular probe ligation products read at different register positions).
Previously, each rotation was treated as a separate sample, splitting
cell counts across 8 entries per true sample.

Now, when the barcode file has 3 columns (observed_bc, canonical_bc,
sample_name), all rotation variants are mapped to their canonical form
during generate-permit-list. This produces 32 canonical samples instead
of 256 rotation entries.

Changes:
- New SampleBarcodeInfo struct holds canonical barcodes, rotation->canonical
  mapping, and canonical->name mapping
- load_sample_barcode_list() parses 3-column TSV format, deduplicates
  to canonical barcodes
- build_sample_permit_map() maps all rotation barcodes to canonical form
- get_sample_names() extracts names from barcode file (column 3)
- Backward compatible: 1-column and 2-column formats still work

Comparison with CellRanger on 4-plex Human Colorectal/Kidney dataset:
  Per-gene total count correlation:  1.0000
  Per-cell gene correlation median:  0.9948
  AF/CR UMI ratio median:           1.012
  Exact element match:              88.4%
  Cells with per-cell r > 0.95:     94.5%
  Shared barcodes:                  511,470 / 530,480 (96.4%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
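The 1-/2-/3-column barcode-file handling described above can be sketched like this (illustrative; the actual load_sample_barcode_list() may differ in detail):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct SampleBarcodeInfo {
    // observed (rotation) barcode -> canonical barcode
    rotation_to_canonical: HashMap<String, String>,
    // canonical barcode -> sample name (3-column format only)
    canonical_to_name: HashMap<String, String>,
}

fn parse_sample_bc_tsv(tsv: &str) -> SampleBarcodeInfo {
    let mut info = SampleBarcodeInfo::default();
    for line in tsv.lines().filter(|l| !l.trim().is_empty()) {
        let fields: Vec<&str> = line.split('\t').collect();
        match fields.as_slice() {
            // 1-column: the observed barcode is its own canonical form
            [obs] => {
                info.rotation_to_canonical.insert(obs.to_string(), obs.to_string());
            }
            // 2-column: observed -> canonical
            [obs, canon] => {
                info.rotation_to_canonical.insert(obs.to_string(), canon.to_string());
            }
            // 3-column: observed -> canonical, plus canonical -> sample name
            [obs, canon, name] => {
                info.rotation_to_canonical.insert(obs.to_string(), canon.to_string());
                info.canonical_to_name.insert(canon.to_string(), name.to_string());
            }
            _ => {}
        }
    }
    info
}
```

Deduplication to canonical samples falls out naturally: eight rotation lines sharing one canonical barcode produce eight entries in `rotation_to_canonical` but a single entry in `canonical_to_name`.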
Replace unsafe raw pointer slicing with the safe split_at_mut pattern
(same as the single-barcode collate uses). The backing buffer is
progressively split into non-overlapping mutable slices, each wrapped
in a Cursor for per-bucket writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
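The safe pattern described above can be sketched as follows: the backing buffer is progressively split into non-overlapping mutable slices, each wrapped in a Cursor for independent per-bucket writes (an illustrative minimal version, not the actual collate code):

```rust
use std::io::{Cursor, Write};

// Carve `buf` into non-overlapping mutable regions of the given sizes.
// split_at_mut proves disjointness to the borrow checker, so no unsafe
// raw-pointer slicing is needed.
fn split_into_bucket_cursors<'a>(
    mut buf: &'a mut [u8],
    sizes: &[usize],
) -> Vec<Cursor<&'a mut [u8]>> {
    let mut cursors = Vec::with_capacity(sizes.len());
    for &sz in sizes {
        let (head, tail) = buf.split_at_mut(sz);
        cursors.push(Cursor::new(head));
        buf = tail;
    }
    cursors
}
```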
Add correct_unmapped_counts_multi_bc() which reads the raw unmapped
barcode count file (keyed by composite sample_bc|cell_bc from piscem),
applies both sample BC correction (via sample_permit_map) and cell BC
correction (via tiered identity + BarcodeLookupMap), and writes
unmapped_bc_count_collated.bin keyed by corrected cell barcode.

The collated file is keyed by cell BC (not composite) so quant's
existing lookup by collate_key() works directly. If the same cell BC
appears in multiple samples, their unmapped counts are summed — this
is acceptable since the mapping rate is a per-cell metric.

This replaces the workaround of creating an empty unmapped file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update correct_unmapped_counts_multi_bc to use libradicl's
UnmappedBcFormat reader instead of parsing raw (u64, u32) pairs.

The new format has a self-describing header declaring the number and
widths of barcode fields. The reader:
1. Reads the header to determine per-field widths
2. Reads structured records with per-field barcode values
3. Corrects sample BC (via sample_permit_map) and cell BC (via
   tiered identity + BarcodeLookupMap)
4. Writes collated file as bincode HashMap<u64, u32> keyed by
   corrected cell BC (for quant compatibility)

Handles empty files and format errors gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ad-hoc HashMap<u64, u32> and bincode serialization with
libradicl's CollatedUnmappedCounts for unmapped barcode frequency data.

collate.rs:
- correct_unmapped_counts (single-barcode): reads new self-describing
  format, falls back to legacy (u64, u32) pairs for old files.
  Writes CollatedUnmappedCounts::Single via self-describing format.
- correct_unmapped_counts_multi_bc: produces CollatedUnmappedCounts::Multi
  keyed by (corrected_sample, corrected_cell) for per-sample accuracy.

quant.rs:
- Loads unmapped_bc_count_collated.bin via CollatedUnmappedCounts::read_from_file
- Worker uses get_single(cell_bc) for lookup — zero-overhead for single-barcode
- Multi-barcode: get_single falls back to sum across samples (acceptable
  for mapping rate computation; full per-sample lookup via get_multi
  available when quant gains sample-aware chunk processing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three main optimizations that together reduce quant time from >30 min
to ~28s on a 5.6M cell × 19K gene 4-plex Flex dataset:

1. Sparse fast path for small cells (< 100 reads): new
   quantify_small_cell_sparse() implements full cr-like winner-take-all
   UMI resolution without any HashMap or dense vector overhead. Outputs
   sparse (gene, umi) pairs via sorted-vec voting, then builds
   expressed_vec/expressed_ind directly via run-length counting —
   avoiding both the O(num_genes) dense vector zeroing and the
   O(num_genes) scan to extract non-zero entries per cell.

2. HashMap capacity management: eqid_map and gene_eqc use
   capacity-based recreate (replace with fresh HashMap when capacity
   exceeds 256) instead of clear(), preventing O(capacity) iteration
   costs from accumulating across cells. gene_eqc clearing is also
   skipped entirely for fast-path cells that never touch it.

3. Thread-local MTX triplet accumulation: triplets are collected in
   per-worker Vec buffers and merged after all workers finish, replacing
   the previous approach of calling trimat.add_triplet() under a global
   mutex. Also caps initial trimat capacity at 256M entries to prevent
   overallocation on large datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
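The sorted-vec voting at the heart of the sparse fast path can be sketched as follows (a simplified illustration of the winner-take-all idea; the real quantify_small_cell_sparse() also tracks UMIs and the USA-mode tie resolution added later):

```rust
// Winner-take-all over gene votes for one UMI: sort, run-length count,
// pick the gene with the most votes. A tie for the top count returns
// None (ambiguous), to be resolved or discarded by the caller.
// No HashMap or dense per-gene vector is touched.
fn winner_take_all(mut votes: Vec<u32>) -> Option<u32> {
    votes.sort_unstable();
    let mut best: Option<(u32, usize)> = None;
    let mut tied = false;
    let mut i = 0;
    while i < votes.len() {
        let g = votes[i];
        let mut j = i + 1;
        while j < votes.len() && votes[j] == g {
            j += 1;
        }
        let n = j - i; // run length = vote count for gene g
        match best {
            None => best = Some((g, n)),
            Some((_, bn)) if n > bn => {
                best = Some((g, n));
                tied = false;
            }
            Some((_, bn)) if n == bn => tied = true,
            _ => {}
        }
        i = j;
    }
    if tied { None } else { best.map(|(g, _)| g) }
}
```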
The sparse fast path now handles USA (spliced/unspliced/ambiguous) mode
by applying the same splicing-aware resolution as extract_counts in
utils.rs. When cr-like voting produces a multi-gene tie, the tied gene
set is resolved: same-gene S+U pairs go to the ambiguous slot,
cross-gene ties prefer the spliced variant, and truly ambiguous ties
are discarded. This matches the behavior of the full gene_eqc path
exactly, verified on both Flex (non-USA) and 10x v3 (USA) datasets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The multi-barcode permit list generation was not using the external
cell barcode whitelist (--unfiltered-pl). All observed cell barcodes
were treated as valid, inflating cell counts ~10x (1.5M vs 165K for
the Colorectal sample) and degrading concordance with CellRanger.

Now, when --unfiltered-pl is provided:
- First pass counts only cell BCs present in the whitelist
- Non-matching BCs are collected per-sample for 1-edit rescue
- Per-sample BarcodeLookupMap corrects unmatched BCs to their nearest
  whitelist neighbor (same algorithm as single-barcode path)

Results on 4-plex Flex dataset (BC001/Colorectal sample):
- UMI ratio AF/CR: 1.18 -> 1.05
- Per-gene Pearson: 0.9998 -> 0.99996
- Per-cell Pearson: 0.81 -> 0.92
- Per-cell median Pearson: 1.0 (perfect for most cells)
- Zero AF-only barcodes (all now in CellRanger whitelist)
- GPL runtime: 194s -> 104s (no more 1-edit maps on 1.5M junk BCs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first pass that reads 809M records from the RAD file to count
per-sample cell barcodes is now parallelized using ParallelChunkReader
with worker threads. Each worker processes MetaChunks independently,
updating shared DashMap histograms and thread-local unmatched BC
buffers (flushed after completion to reduce lock contention).

GPL time on 4-plex Flex dataset: 104s -> 30s (3.5x speedup).
CPU utilization: 99% -> 746% (good parallel scaling on 8 threads).
Results are identical to the sequential version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When quantifying multi-barcode (e.g. Flex) data, featureDump.txt now
includes a sample_name column after CB, identifying which sample each
cell belongs to. This is populated during the barcode rewrite step
using the collation manifest's sample-to-chunk mapping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, barcode labels and featureDump sample_name were added by a
post-processing rewrite step that assumed output rows matched collated
file order. This assumption relies on workers acquiring the output
mutex in file order, which is not guaranteed with multi-threaded
processing.

Now, each worker looks up the sample name from a manifest-derived
cell_num → sample_idx map at write time. The cell_num (chunk index in
the collated file) is deterministic regardless of thread scheduling.
This removes the rewrite_barcodes_with_sample_prefix function entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EDS (Efficient Dense Storage) output is no longer used. This removes:
- All EDS write paths in quant.rs (sce::eds calls, eds_file, eds_bytes)
- The sce crate dependency from Cargo.toml
- The use_mtx conditional throughout the codebase (MTX is now default)
- The BufferedGzFile type and EDS-specific bootstrap file creation

The --use-eds flag is kept as a hidden flag that prints a deprecation
error and exits. The --use-mtx flag is kept as a no-op for backwards
compatibility.

Note: Bootstrap output was previously written in EDS format. Bootstrap
computation still runs but output is not yet written in the new format.
This will be addressed in a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bootstrap summary stats (mean and variance across replicates) are now
written as bootstraps_mean.mtx and bootstraps_var.mtx in the same
sparse MTX format as the main count matrix. Each worker accumulates
bootstrap triplets thread-locally and they are merged and written
after all workers complete.

When --summary-stat is not set, full per-replicate counts are computed
but only the summary statistics are written. A note is printed
informing users that full per-replicate output will be available in a
future release via AnnData/h5ad layers, which provide efficient
storage for the 3D (cells x genes x replicates) array.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
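Per entry, the summary statistics above reduce each (cell, gene) count across replicates to a mean and a variance. A minimal sketch — whether the real output uses the population or the sample variance is an assumption here:

```rust
// Mean and (population) variance of one (cell, gene) count across
// bootstrap replicates; these become one entry each in
// bootstraps_mean.mtx and bootstraps_var.mtx.
fn bootstrap_mean_var(replicates: &[f64]) -> (f64, f64) {
    let n = replicates.len() as f64;
    let mean = replicates.iter().sum::<f64>() / n;
    let var = replicates
        .iter()
        .map(|x| (x - mean) * (x - mean))
        .sum::<f64>()
        / n;
    (mean, var)
}
```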
Add multi-barcode RAD pipeline support and release/dependency cleanup
rob-p merged commit 935a946 into develop on Mar 19, 2026
11 checks passed
