[Refactor] Support concurrent inference acorss tasks. #2403

Merged: mzr1996 merged 7 commits into main from streaming_infer on Mar 20, 2026

@mzr1996 (Collaborator) commented Feb 25, 2026

No description provided.

@mzr1996 changed the title from "[Refactor] Support concurrent inference accorss tasks." to "[Refactor] Support concurrent inference acorss tasks." on Feb 25, 2026
@zhulinJulia24 requested a review from Copilot on March 10, 2026, 05:50

Copilot AI left a comment

Pull request overview

This PR refactors OpenCompass inference/evaluation orchestration to enable concurrent inference across multiple datasets (tasks) and to allow evaluation to “watch” inference progress and start as soon as results are ready.

Changes:

  • Added filesystem-based inference status tracking and a heartbeat mechanism for coordinating infer/eval across processes.
  • Introduced OpenICLInferTaskConcurrent (concurrent dataset inference) and OpenICLEvalWatchTask (evaluation that monitors infer progress).
  • Added parallel (per-sample) chat/gen inferencers and updated OpenAI API wrappers + CLI flow to support concurrency.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| opencompass/utils/run.py | Minor formatting/line alignment change. |
| opencompass/utils/infer_status.py | New status manager with file-lock based read/write for infer progress. |
| opencompass/utils/heartbeat.py | New heartbeat writer/reader for eval-watch coordination. |
| opencompass/utils/__init__.py | Exposes HeartBeatManager / InferStatusManager from utils. |
| opencompass/tasks/openicl_infer_concurrent.py | New concurrent inference task that runs multiple datasets with shared API concurrency tokens. |
| opencompass/tasks/openicl_eval_watch.py | New eval task that waits for infer completion via status + heartbeat timeout. |
| opencompass/tasks/__init__.py | Registers the new tasks for discovery/import. |
| opencompass/openicl/icl_inferencer/icl_gen_inferencer_parallel.py | New per-sample parallel gen inferencer using a thread pool. |
| opencompass/openicl/icl_inferencer/icl_chat_inferencer_parallel.py | New per-sample parallel chat inferencer using a thread pool. |
| opencompass/openicl/icl_inferencer/__init__.py | Exports the new parallel inferencers. |
| opencompass/models/openai_streaming.py | Refactors streaming client creation to use a persistent OpenAI client and acquire/release gating. |
| opencompass/models/openai_api.py | Adds key-rotation lock and switches rate limiting to acquire/release for concurrency control. |
| opencompass/cli/main.py | Runs infer and eval-watch concurrently with a heartbeat thread when configured. |


Comment on lines +25 to +28
def safe_write(file: Path, content: str, work_dir: Path):
    sig = '--'.join(file.resolve().relative_to(work_dir.resolve()).parts)
    with SoftFileLock(work_dir / '.locks' / sig):
        file.write_text(content)

Copilot AI Mar 10, 2026

safe_write also uses a lock file in work_dir / '.locks' without ensuring the parent directory exists. This can cause status writes to fail (and silently stall eval-watch logic). Create work_dir/.locks before taking the lock.

Comment on lines +268 to +273
# Load next dataset with 5% overhead.
max_pending_samples = max(1, int(max_workers * 1.05))
remaining_total = self._remaining_total(running, max_pending_samples)
while pending and remaining_total < max_pending_samples:
    dataset_cfg = pending.pop(0)
    task_name = task_abbr_from_cfg({

Copilot AI Mar 10, 2026

The dataset scheduling logic uses progress.remaining() (total samples left) to decide whether to start another dataset. Since remaining is typically the full dataset size (often >> max_pending_samples), this effectively prevents multiple datasets from running concurrently until the current dataset is nearly finished, defeating the purpose of “concurrent inference across tasks”. Consider tracking expected in-flight capacity per dataset (e.g., min(max_workers, remaining)), or simply bounding concurrent datasets explicitly (e.g., max_parallel_ds = ...) and removing the remaining_total gating.
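
A minimal sketch of the first alternative suggested above, assuming each running dataset can keep at most `max_workers` samples in flight. The helper name `in_flight_capacity` and the plain list of remaining counts are hypothetical stand-ins for illustration; they are not part of the PR.

```python
from typing import List


def in_flight_capacity(remaining_per_dataset: List[int],
                       max_workers: int) -> int:
    """Estimate how many samples the running datasets can occupy at once.

    Hypothetical helper: each dataset contributes at most ``max_workers``
    samples, however many it still has left.
    """
    return sum(min(max_workers, remaining)
               for remaining in remaining_per_dataset)


max_workers = 32
# Same 5% overhead as in the quoted snippet: int(32 * 1.05) == 33.
max_pending_samples = max(1, int(max_workers * 1.05))

# One large dataset is running with 10,000 samples left.
running = [10_000]

# Gating on the total number of remaining samples blocks the next dataset.
blocked_by_total = sum(running) >= max_pending_samples
# Gating on the estimated in-flight capacity lets the next dataset start.
can_start_next = in_flight_capacity(running, max_workers) < max_pending_samples
```

With these assumed numbers, a second dataset is admitted while the first is still running, and a third is held back until capacity frees up.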

Comment on lines +239 to +243
abbr_counts = {}
for dataset_cfg in pending:
    abbr = dataset_cfg.get('abbr', 'task')
    abbr_counts[abbr] = abbr_counts.get(abbr, 0) + 1

Copilot AI Mar 10, 2026

abbr_counts is computed but never used. This adds noise and suggests incomplete logic; remove it or use it to disambiguate task names when there are duplicate dataset abbreviations.
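
If the counts are kept, one possible use is sketched below. The function name `unique_task_names` and the `abbr_0`, `abbr_1` suffix scheme are assumptions made for illustration, not the naming the PR uses.

```python
from collections import Counter
from typing import Dict, List


def unique_task_names(dataset_cfgs: List[Dict]) -> List[str]:
    """Return one name per dataset, suffixing duplicated abbreviations.

    Hypothetical scheme: an abbreviation stays as-is when it is unique and
    becomes 'abbr_0', 'abbr_1', ... when it appears more than once.
    """
    abbr_counts = Counter(cfg.get('abbr', 'task') for cfg in dataset_cfgs)
    seen: Counter = Counter()
    names = []
    for cfg in dataset_cfgs:
        abbr = cfg.get('abbr', 'task')
        if abbr_counts[abbr] > 1:
            # Duplicate abbreviation: append a running index.
            names.append(f'{abbr}_{seen[abbr]}')
            seen[abbr] += 1
        else:
            names.append(abbr)
    return names


pending = [{'abbr': 'gsm8k'}, {'abbr': 'mmlu'}, {'abbr': 'gsm8k'}, {}]
names = unique_task_names(pending)
```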

Comment on lines 257 to 262
if self.orgs:
    with Lock():
        self.org_ctr += 1
        if self.org_ctr == len(self.orgs):
            self.org_ctr = 0
    header['OpenAI-Organization'] = self.orgs[self.org_ctr]

Copilot AI Mar 10, 2026

org_ctr rotation is guarded by with Lock():, but Lock() creates a new lock instance each call, so this provides no mutual exclusion across threads. Under concurrent inference this can corrupt org_ctr updates. Define a dedicated instance lock (e.g., self._org_lock = Lock() in __init__) and use it here.
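
The suggested fix can be sketched as follows. `OrgRotator` and `next_org` are hypothetical names standing in for the relevant part of the API wrapper; only `orgs`, `org_ctr` and `_org_lock` come from the comment above.

```python
from threading import Lock, Thread
from typing import List


class OrgRotator:
    """Hypothetical stand-in for the organization rotation logic."""

    def __init__(self, orgs: List[str]):
        self.orgs = orgs
        self.org_ctr = 0
        # One lock per instance, shared by every thread that uses it.
        self._org_lock = Lock()

    def next_org(self) -> str:
        # Advance and read the counter under the same lock so that no
        # other thread can change it in between.
        with self._org_lock:
            self.org_ctr = (self.org_ctr + 1) % len(self.orgs)
            return self.orgs[self.org_ctr]


rotator = OrgRotator(['org-a', 'org-b', 'org-c'])
picked: List[str] = []


def worker() -> None:
    for _ in range(300):
        picked.append(rotator.next_org())


threads = [Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every call goes through the same lock, the 1,200 calls are spread evenly over the three organizations regardless of how the threads interleave.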

Comment on lines +13 to +15
from opencompass.models.openai_api import OpenAISDK
from opencompass.registry import ICL_INFERENCERS
from opencompass.utils import batched

Copilot AI Mar 10, 2026

OpenAISDK is imported but not used in this module. Removing the unused import avoids unnecessary dependencies and keeps linting clean.

Comment on lines +18 to +22
def safe_read(file: Path, work_dir: Path):
    sig = '--'.join(file.resolve().relative_to(work_dir.resolve()).parts)
    with SoftFileLock(work_dir / '.locks' / sig):
        content = file.read_text()
    return content

Copilot AI Mar 10, 2026

safe_read acquires a lock under work_dir / '.locks', but this directory is never created. If .locks doesn't exist, SoftFileLock(...) will fail and status reads will break. Ensure work_dir/.locks is created (e.g., (...).mkdir(parents=True, exist_ok=True)) before acquiring the lock, or create it lazily inside safe_read.

Comment on lines +10 to +16
def __init__(self, work_dir: str | Path, fname: str = 'infer_heartbeat'):
    self.hb_file = Path(work_dir) / fname
    self._lock = SoftFileLock(Path(work_dir) / '.locks' / fname)

def start_heartbeat(self, write_interval: float = 5.):
    Path(self.hb_file).parent.mkdir(exist_ok=True)
    stop_event = threading.Event()

Copilot AI Mar 10, 2026

HeartBeatManager creates a lock at work_dir/.locks/<fname> but start_heartbeat() only creates the heartbeat file's parent dir (and not .locks, nor with parents=True). If .locks doesn't exist, lock acquisition will fail and the heartbeat will never be written, causing OpenICLEvalWatchTask to time out. Create work_dir/.locks (parents=True) before using SoftFileLock.
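
The same `.locks` concern is raised for safe_read, safe_write and HeartBeatManager, so one shared helper could cover all three. The sketch below uses only the standard library; `lock_path_for` is a hypothetical name, and the signature string mirrors the quoted snippets.

```python
from pathlib import Path


def lock_path_for(file: Path, work_dir: Path) -> Path:
    """Return the lock path for ``file``, creating ``work_dir/.locks`` first.

    Hypothetical helper, not part of the PR.
    """
    lock_dir = work_dir / '.locks'
    # parents=True also covers a work_dir that does not exist yet;
    # exist_ok=True makes repeated calls harmless.
    lock_dir.mkdir(parents=True, exist_ok=True)
    sig = '--'.join(file.resolve().relative_to(work_dir.resolve()).parts)
    return lock_dir / sig
```

The returned path would then be passed to `SoftFileLock(...)` in place of the inline `work_dir / '.locks' / sig` expression.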

Comment on lines 103 to 107
  self.openai_client = OpenAI(
      base_url=self.openai_api_base,
-     api_key=current_key,
-     http_client=httpx.Client(**http_client_cfg,
-                              timeout=httpx.Timeout(self.timeout))
-     if http_client_cfg or True else None,
+     api_key=self.key,
+     http_client=http_client,
  )

Copilot AI Mar 10, 2026

OpenAISDKStreaming.__init__ builds an OpenAI client with api_key=self.key, but OpenAI/OpenAISDK store keys in self.keys and do not define self.key. This will raise AttributeError at runtime. Use the key argument (string), or select a key from self.keys (optionally with rotation) when constructing self.openai_client.

Comment on lines +30 to +36
def __init__(
    self,
    model,
    max_out_len: int,
    stopping_criteria: List[str] = [],
    max_seq_len: Optional[int] = None,
    min_out_len: Optional[int] = None,

Copilot AI Mar 10, 2026

stopping_criteria defaults to a mutable list ([]). If the list is ever mutated, instances can unintentionally share state. Use None as the default and replace with an empty list inside __init__.
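
A short illustration of the suggested pattern. `InferencerSketch` is a hypothetical class that keeps only the `stopping_criteria` argument from the quoted signature.

```python
from typing import List, Optional


class InferencerSketch:
    """Hypothetical stand-in showing only the ``stopping_criteria`` argument."""

    def __init__(self, stopping_criteria: Optional[List[str]] = None):
        # Build a fresh list per instance instead of sharing one default.
        self.stopping_criteria = (
            list(stopping_criteria) if stopping_criteria is not None else [])


a = InferencerSketch()
b = InferencerSketch()
# Mutating one instance's list leaves the other untouched.
a.stopping_criteria.append('</s>')
```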

@Myhs-phz (Collaborator) left a comment

LGTM

@mzr1996 merged commit 3cdd4c2 into main on Mar 20, 2026
11 checks passed
@mzr1996 deleted the streaming_infer branch on March 20, 2026, 09:40