Refactor sharding dump to support custom mesh rule and diverse sharding alternatives #3639

Merged: copybara-service[bot] merged 5 commits into main from chengnuojin-no-exp5 on Apr 13, 2026
NuojCheng (Collaborator) commented on Apr 10, 2026

Description

TL;DR: This PR refactors the sharding dump test cases to reduce redundancy, eliminates the rigid Cartesian product structure, and adds support for testing custom_mesh_and_rule alongside explicit sharding-related flags.

Background

The current sharding dump test in MaxText is highly valuable: it dumps the logical and physical sharding specs produced by AOT compilation for a set of test cases. After subsequent code changes, AOT compilation is triggered again and the new specs are compared against the saved ones to ensure sharding behavior has not changed unintentionally.
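The dump-then-compare idea can be sketched in a few lines. This is only an illustration, not the MaxText implementation: the helper names, the file name `named_shardings.json`, and the parameter names and specs are all hypothetical, and the real test obtains its specs from AOT compilation rather than from hand-written dictionaries.

```python
import json
import tempfile
from pathlib import Path


def save_dump(path: Path, shardings: dict) -> None:
    """Write a sharding dump as sorted, indented JSON so diffs stay stable."""
    path.write_text(json.dumps(shardings, indent=2, sort_keys=True))


def diff_against_golden(golden_path: Path, current: dict) -> dict:
    """Return {param_name: (golden_spec, current_spec)} for every mismatch."""
    golden = json.loads(golden_path.read_text())
    mismatches = {}
    for name in sorted(set(golden) | set(current)):
        if golden.get(name) != current.get(name):
            mismatches[name] = (golden.get(name), current.get(name))
    return mismatches


# Hypothetical specs: one entry per parameter, one mesh-axis name (or None) per dimension.
golden_specs = {"decoder/mlp/wi": ["fsdp", None], "decoder/mlp/wo": [None, "fsdp"]}
changed_specs = {"decoder/mlp/wi": ["fsdp", None], "decoder/mlp/wo": ["expert", "fsdp"]}

with tempfile.TemporaryDirectory() as tmp:
    golden_file = Path(tmp) / "named_shardings.json"  # hypothetical file name
    save_dump(golden_file, golden_specs)
    unchanged = diff_against_golden(golden_file, golden_specs)
    regressions = diff_against_golden(golden_file, changed_specs)
```

An empty result means the sharding is unchanged; any entry in the result names a parameter whose spec differs from the saved dump.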

However, the existing test cases lack diversity and are unnecessarily redundant. We currently test a rigid Cartesian product of:

  • MODEL_NAMES = ["deepseek2-16b", "qwen3-0.6b", "gpt-oss-20b"]
  • TOPOLOGIES = ["tpu7x-16", "v6e-16", "v5p-16"]
  • SLICES = [1, 4]

Because these tests all run using the default sharding settings (full FSDP and DP_DCN), the Cartesian approach inflates test runtimes without actually expanding our coverage of different sharding strategies.
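To make the cost concrete, here is a rough sketch of how a Cartesian product over the three lists above expands. It assumes the cases were generated with something like `itertools.product`; the actual generation code is not shown in this PR description.

```python
import itertools

MODEL_NAMES = ["deepseek2-16b", "qwen3-0.6b", "gpt-oss-20b"]
TOPOLOGIES = ["tpu7x-16", "v6e-16", "v5p-16"]
SLICES = [1, 4]

# Every model is paired with every topology and every slice count.
cartesian_cases = list(itertools.product(MODEL_NAMES, TOPOLOGIES, SLICES))
num_cases = len(cartesian_cases)  # 3 models * 3 topologies * 2 slice counts
```

That is 18 AOT compilations, all under the same default sharding settings, so adding a model or topology multiplies the runtime without exercising any new sharding strategy.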

What this PR does

This PR drops the previous Cartesian product structure in favor of explicitly defined test combinations. This gives us the flexibility to test a much wider variety of sharding configurations. We now support the following degrees of freedom:

  • MODEL_NAMES = ["deepseek2-16b", "qwen3-0.6b", "gpt-oss-20b"]
  • TOPOLOGIES = ["tpu7x-16", "v6e-16", "v5p-16"]
  • SLICES = [1, 4]
  • custom_mesh_and_rule = ["" (default), "pure-fsdp", "pipeline-large-moe"]
  • Sharding strategy overrides (e.g., ici_fsdp_parallelism=2, ici_expert_parallelism=2, use_ring_of_experts=true)

Example of the new test case structure:

TEST_CASES = [
    # (model_name, topology, num_slice, custom_mesh_and_rule, sharding_strategy_overrides)
    ("deepseek2-16b", "tpu7x-16", 1, "", ()),
    ("deepseek2-16b", "tpu7x-16", 1, "pure-fsdp", ()),
    ("deepseek2-16b", "v6e-16", 1, "", ("ici_fsdp_parallelism=-1", "ici_expert_parallelism=2")),
    ("deepseek2-16b", "v6e-16", 1, "pipeline-large-moe", (
      "ici_fsdp_parallelism=-1", "ici_expert_parallelism=2", "use_ring_of_experts=true"
      )),
    ("qwen3-0.6b", "tpu7x-16", 1, "", ()),
    ("gpt-oss-20b", "tpu7x-16", 1, "", ()),
    ("gpt-oss-20b", "tpu7x-16", 1, "", ("ici_fsdp_parallelism=-1", "ici_expert_parallelism=2")),
    # Add your explicit combinations above this line
]

Moving to this explicit list makes it much easier to add targeted test cases in the future. For example, when a new custom mesh and rule is introduced, a dedicated test case can be appended to TEST_CASES so that it is protected against future regressions.
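One way an entry of TEST_CASES could be turned into config overrides is sketched below. The `ShardingCase` tuple and `to_overrides` helper are hypothetical names introduced for illustration, and the `model_name`, `compile_topology`, and `compile_topology_num_slices` keys are assumptions about how the fields map onto MaxText config options; the PR's actual wiring may differ.

```python
from typing import NamedTuple


class ShardingCase(NamedTuple):
    """Mirrors the five-field tuples in TEST_CASES (hypothetical wrapper)."""
    model_name: str
    topology: str
    num_slice: int
    custom_mesh_and_rule: str  # "" means the default mesh and rules
    overrides: tuple  # extra "key=value" sharding strategy overrides


def to_overrides(case: ShardingCase) -> list:
    """Build a list of "key=value" overrides for one test case."""
    args = [
        f"model_name={case.model_name}",
        f"compile_topology={case.topology}",  # assumed config key
        f"compile_topology_num_slices={case.num_slice}",  # assumed config key
    ]
    if case.custom_mesh_and_rule:
        # Only pass the option when a non-default mesh and rule is requested.
        args.append(f"custom_mesh_and_rule={case.custom_mesh_and_rule}")
    args.extend(case.overrides)
    return args


case = ShardingCase(
    "deepseek2-16b", "v6e-16", 1, "pipeline-large-moe",
    ("ici_fsdp_parallelism=-1", "ici_expert_parallelism=2", "use_ring_of_experts=true"),
)
args = to_overrides(case)
```

Because each case carries its own overrides, adding coverage for a new mesh and rule is a one-line change to the list rather than a new axis in a product.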

Since custom meshes and rules are covered by the sharding dump test after this change, this PR deprecates the previous custom_mesh_and_axes unit test.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Comment thread src/maxtext/configs/base.yml Outdated
Comment thread tests/utils/sharding_dump.py
NuojCheng force-pushed the chengnuojin-no-exp5 branch from c69196e to 640eecb on April 13, 2026 20:51
NuojCheng force-pushed the chengnuojin-no-exp5 branch from 640eecb to eed004f on April 13, 2026 20:52
copybara-service[bot] merged commit 6d86e74 into main on Apr 13, 2026
38 of 39 checks passed
copybara-service[bot] deleted the chengnuojin-no-exp5 branch on April 13, 2026 22:30