
feat(archon): add ZBVZeroBubble pipeline schedule support #916

Merged
garrett4wade merged 1 commit into main from rchardx/zbpp
Feb 10, 2026

Conversation

Collaborator

@rchardx rchardx commented Feb 10, 2026

Description

Add V-style (zero bubble) pipeline scheduling to ArchonEngine.
ZBVZeroBubble splits backward into input-grad and weight-grad steps, enabling near-zero pipeline bubbles with 2 stages per rank.
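
To make the split concrete, here is a minimal sketch in plain Python for a single linear layer `y = W x`. It is not ArchonEngine or PyTorch code, and the function names `input_grad` and `weight_grad` are hypothetical. The input gradient is what the previous pipeline stage is waiting for, so it is computed first; the weight gradient depends only on locally saved values and can be deferred to a slot that would otherwise be idle.

```python
# Illustrative only: backward of y = W @ x split into two independent steps.

def forward(W, x):
    """y[i] = sum_j W[i][j] * x[j]"""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def input_grad(W, grad_y):
    """dL/dx[j] = sum_i W[i][j] * grad_y[i]; sent to the previous stage right away."""
    n_out, n_in = len(W), len(W[0])
    return [sum(W[i][j] * grad_y[i] for i in range(n_out)) for j in range(n_in)]


def weight_grad(x, grad_y):
    """dL/dW[i][j] = grad_y[i] * x[j]; needs only saved x and grad_y, so it can run later."""
    return [[g * x_j for x_j in x] for g in grad_y]


W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, -1.0]
grad_y = [1.0, 1.0]  # gradient of a toy loss, loss = y[0] + y[1]

y = forward(W, x)
dx = input_grad(W, grad_y)   # step 1: unblocks the previous stage
dW = weight_grad(x, grad_y)  # step 2: can be scheduled into an idle slot
```

Neither step reads the other's result, which is what lets a schedule reorder them.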

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

Key changes:

  • V-style stage assignment in _get_stage_indices() (rank 0 gets first and last stages)
  • Schedule-aware _pp_last_stage_rank determination
  • Auto-disable torch.compile and op-level selective AC for V-style schedules (incompatible with split backward)
  • Generalize V-style guards to also cover ScheduleDualPipeV for forward compatibility
  • Add ZBV forward/backward distributed tests
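
As a rough illustration of the stage assignment described in the list above, the sketch below contrasts V-style and loop-style placement. The helper names `stage_indices` and `last_stage_rank` are hypothetical and are not the actual `_get_stage_indices()` implementation; the V-style formula assumes `2 * pp_size` stages laid out so that rank 0 holds the first and last stage.

```python
def stage_indices(rank: int, pp_size: int, stages_per_rank: int, v_style: bool) -> tuple:
    """Return the global stage indices owned by `rank` (illustrative helper)."""
    if v_style:
        # V-style schedules need exactly two stages per rank.
        if stages_per_rank != 2:
            raise ValueError("V-style schedules require exactly 2 stages per rank")
        num_stages = 2 * pp_size
        # Stages go down the ranks and then back up, forming a "V".
        return (rank, num_stages - 1 - rank)
    # Loop-style: stage s lives on rank s % pp_size.
    return tuple(rank + i * pp_size for i in range(stages_per_rank))


def last_stage_rank(pp_size: int, v_style: bool) -> int:
    """Rank that owns the final stage, i.e. where outputs and loss are produced."""
    return 0 if v_style else pp_size - 1


for r in range(4):
    print(r, stage_indices(r, pp_size=4, stages_per_rank=2, v_style=True))
```

With `pp_size=4`, rank 0 owns stages 0 and 7 under this layout, so the rank holding the last stage is rank 0 rather than rank 3.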

@gemini-code-assist
Contributor

Summary of Changes

Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the ZBVZeroBubble pipeline-parallel schedule into ArchonEngine. The schedule uses a V-style stage assignment in which rank 0 holds both the first and last pipeline stages. The PR also detects settings that conflict with the schedule (torch.compile, op-level selective Activation Checkpointing, and memory-budget Activation Checkpointing) and falls back to compatible ones. CLI options, tests, and documentation are updated accordingly.

Highlights

  • ZBVZeroBubble Pipeline Schedule Support: Added comprehensive support for the ZBVZeroBubble pipeline schedule, which utilizes a V-style stage assignment where rank 0 handles both the first and last pipeline stages to minimize bubbles.
  • Incompatibility Handling: Implemented auto-detection and handling for incompatibilities between ZBVZeroBubble and torch.compile, op-level selective Activation Checkpointing (AC), and memory-budget AC, ensuring graceful fallback to compatible configurations.
  • CLI and Documentation Updates: Updated the command-line interface (CLI) arguments and documentation to include 'ZBVZeroBubble' as a valid choice for pipeline parallel schedules.
  • Enhanced Testing: Introduced new unit tests for ZBVZeroBubble's Fully Qualified Name (FQN) generation and distributed tests for its forward and backward passes to validate correct functionality.
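
The fallback behavior in the "Incompatibility Handling" item can be sketched as a small config resolver. This is an illustration under assumptions, not the code in `archon_engine.py`: `ToyConfig`, its field names, the AC mode strings, and the schedule name strings in `V_STYLE_SCHEDULES` are all hypothetical.

```python
from dataclasses import dataclass, replace

# Schedule names assumed for illustration only.
V_STYLE_SCHEDULES = {"ZBVZeroBubble", "DualPipeV"}


@dataclass(frozen=True)
class ToyConfig:
    pp_schedule: str
    enable_compile: bool
    ac_mode: str  # "none" | "full" | "selective_op" | "memory_budget"


def resolve(cfg: ToyConfig):
    """Return a schedule-compatible config plus notes describing each change."""
    notes = []
    if cfg.pp_schedule not in V_STYLE_SCHEDULES:
        return cfg, notes
    if cfg.enable_compile:
        # Split backward is reported as incompatible with torch.compile.
        cfg = replace(cfg, enable_compile=False)
        notes.append("torch.compile disabled for V-style schedule")
    if cfg.ac_mode in ("selective_op", "memory_budget"):
        # Fall back to full activation checkpointing.
        cfg = replace(cfg, ac_mode="full")
        notes.append("activation checkpointing fell back to full")
    return cfg, notes


resolved, notes = resolve(ToyConfig("ZBVZeroBubble", True, "selective_op"))
```

Returning the notes alongside the resolved config makes it easy to log each automatic change so users are not surprised by it.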

Changelog
  • areal/api/cli_args.py
    • Added "ZBVZeroBubble" to the pp_schedule choices.
    • Updated the description for pp_layers_per_stage to include ZBVZeroBubble.
  • areal/experimental/engine/archon_engine.py
    • Imported ScheduleZBVZeroBubble and get_schedule_class.
    • Modified _pp_last_stage_rank initialization to defer its setup.
    • Implemented logic to disable torch.compile if ZBVZeroBubble is active.
    • Added logic to fall back op-level selective AC and memory-budget AC to full AC when ZBVZeroBubble is used.
    • Updated _apply_pipeline_parallelism to correctly determine _pp_last_stage_rank based on the pipeline schedule style (V-style for ZBV, loop-style otherwise).
  • areal/experimental/models/archon/pipeline_parallel.py
    • Imported ScheduleDualPipeV and ScheduleZBVZeroBubble.
    • Updated docstrings for pipeline_module_split and pipeline_llm to mention ZBVZeroBubble.
    • Refactored _get_stage_indices to support V-style stage assignment for ZBVZeroBubble and ScheduleDualPipeV.
    • Added validation to ensure V-style schedules require exactly two stages per rank.
    • Added a specific check for ZBVZeroBubble requiring two stages per rank in pipeline_llm.
  • areal/tests/experimental/archon/test_distributed_pp.py
    • Added new test descriptions for ZBVZeroBubble forward and backward tests.
    • Implemented test_pp_zbv_forward_2gpu to validate ZBVZeroBubble forward pass.
    • Implemented test_pp_zbv_backward_2gpu to validate ZBVZeroBubble backward pass.
  • areal/tests/experimental/archon/test_pipeline_parallel.py
    • Added TestZBVFqnGeneration class to verify FQN distribution for ZBV pipeline configurations.
  • areal/tests/experimental/archon/torchrun/run_pp_tests.py
    • Modified test_pp_forward to accept pp_schedule as an argument.
    • Updated test_pp_forward to dynamically calculate num_stages and n_layers based on the pp_schedule.
    • Implemented ZBV-specific forward pass logic using build_pipeline_schedule for ZBVZeroBubble.
    • Adjusted the broadcast of results in test_pp_forward to account for V-style schedules.
    • Modified test_pp_backward to accept pp_schedule as an argument.
    • Updated test_pp_backward to dynamically calculate num_stages and n_layers based on the pp_schedule.
    • Implemented ZBV-specific backward pass logic using build_pipeline_schedule for ZBVZeroBubble.
    • Added --pp_schedule argument parsing to the main function.
    • Passed pp_schedule argument to test_fn in main.
  • docs/cli_reference.md
    • Updated the pp_schedule parameter description to include ZBVZeroBubble as a choice.
    • Updated the pp_layers_per_stage parameter description to mention ZBVZeroBubble.
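
For readers unfamiliar with what an FQN distribution check like `TestZBVFqnGeneration` might look at, here is a toy sketch of splitting transformer layer names across ZBV stages and grouping them by rank. The helpers `split_layer_fqns` and `zbv_rank_fqns` are hypothetical, the even split is an assumption, and embeddings, final norm, and output head are deliberately ignored.

```python
def split_layer_fqns(n_layers: int, num_stages: int) -> list:
    """Evenly split 'layers.<i>' names into contiguous per-stage chunks."""
    if n_layers % num_stages != 0:
        raise ValueError("n_layers must be divisible by num_stages in this toy example")
    per_stage = n_layers // num_stages
    return [
        [f"layers.{i}" for i in range(s * per_stage, (s + 1) * per_stage)]
        for s in range(num_stages)
    ]


def zbv_rank_fqns(n_layers: int, pp_size: int) -> dict:
    """Map each rank to the layer names of its two V-style stages."""
    num_stages = 2 * pp_size  # two stages per rank
    stages = split_layer_fqns(n_layers, num_stages)
    return {
        rank: stages[rank] + stages[num_stages - 1 - rank]
        for rank in range(pp_size)
    }


print(zbv_rank_fqns(n_layers=4, pp_size=2))
```

With 4 layers on 2 ranks, rank 0 ends up with the first and last layer and rank 1 with the two middle layers.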
Activity
  • The pull request introduces a new feature and includes comprehensive changes across multiple files, including core engine logic, pipeline parallel configuration, and extensive testing. No specific human review comments or progress updates are available in the provided context.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the ZBVZeroBubble pipeline schedule in the ArchonEngine. The changes are comprehensive, covering updates to CLI arguments, engine initialization to handle schedule-specific incompatibilities, and modifications to the pipeline splitting logic for V-style stage assignment. The accompanying tests are thorough and have been updated to cover the new schedule. My review includes a few suggestions to improve code maintainability and address a latent issue in the test suite related to handling different pipeline schedules.

Comment thread areal/tests/experimental/archon/torchrun/run_pp_tests.py Outdated
Comment thread areal/tests/experimental/archon/torchrun/run_pp_tests.py Outdated
Comment thread areal/experimental/engine/archon_engine.py Outdated
Comment thread areal/tests/experimental/archon/torchrun/run_pp_tests.py Outdated
@rchardx rchardx added the safe-to-test label Feb 10, 2026
Collaborator

@garrett4wade garrett4wade left a comment

LGTM

@garrett4wade garrett4wade merged commit ab411da into main Feb 10, 2026
8 checks passed
@garrett4wade garrett4wade deleted the rchardx/zbpp branch February 10, 2026 08:01
@garrett4wade garrett4wade mentioned this pull request Mar 3, 2026
21 tasks
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026
…I#916)

Labels

safe-to-test Ready to run unit-tests in a PR.

2 participants