解决 #138 的错误html识别报错#141
Conversation
Agent-Logs-Url: https://github.com/Mr-xn/MTranServer/sessions/f51bebab-7d96-4c04-9ff4-9cccaa57cc40 Co-authored-by: Mr-xn <18260135+Mr-xn@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Mr-xn/MTranServer/sessions/f51bebab-7d96-4c04-9ff4-9cccaa57cc40 Co-authored-by: Mr-xn <18260135+Mr-xn@users.noreply.github.com>
…ability Harden HTML translation against malformed advisory-style input
Agent-Logs-Url: https://github.com/Mr-xn/MTranServer/sessions/a753b62b-24bf-4ff5-9abe-249ffbff8fed Co-authored-by: Mr-xn <18260135+Mr-xn@users.noreply.github.com>
Add manual target selection for Docker image publishing
There was a problem hiding this comment.
Pull request overview
This PR aims to address issue #138 where “HTML mode” translation can fail on inputs that contain HTML-like characters (e.g., <=, &something[0]) but are not actually valid HTML, causing WASM/HTML parsing aborts.
Changes:
- Add HTML-likeness detection + stronger HTML sanitization in
TranslationEngine, and avoid forcing HTML mode for plain-text inputs. - Add a service-level retry path that falls back to plain text when an “HTML parse error” is detected, and refactor engine teardown into
destroyEngine. - Add Bun tests covering the new HTML/plain-text behavior; update Docker publishing workflow to support selectable publish targets on manual runs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
src/core/engine.ts |
Detects whether text “looks like HTML”, sanitizes stray &/<, adjusts placeholder tagging behavior, and wraps certain HTML-mode abort errors as “HTML parse error”. |
src/services/engine.ts |
Adds destroyEngine() helper and retries HTML parse failures as plain text. |
tests/engine-html.test.ts |
Adds regression tests for plain-text fallback and HTML sanitization behavior. |
.github/workflows/docker.yml |
Adds workflow_dispatch input to choose publish target(s) and conditionally logs into registries / sets images. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| function destroyEngine(fromLang: string, toLang: string, reason: string) { | ||
| const key = `${fromLang}-${toLang}`; | ||
| const info = engines.get(key); | ||
| if (!info) { | ||
| return; | ||
| } | ||
|
|
||
| logger.warn(`${reason}: ${key}`); | ||
| info.engine.destroy(); | ||
| engines.delete(key); | ||
| } |
There was a problem hiding this comment.
destroyEngine deletes the engine entry but doesn’t clear the idle timer stored on EngineInfo. That timeout will still fire later, keeping the engine object alive longer than necessary and potentially logging a second “idle timeout” shutdown after a crash/parse failure. Consider clearTimeout(info.timer) (and unset it) before destroying/deleting the engine.
| @@ -174,16 +187,15 @@ async function translateSingleLanguageText( | |||
| error.message.includes('abort') | |||
| ); | |||
|
|
|||
| if (isHTMLParseError) { | |||
| logger.warn(`HTML parsing failed during translation (${fromLang}->${toLang}), retrying as plain text`); | |||
| destroyEngine(fromLang, toLang, 'Destroying engine after HTML parse failure'); | |||
| return translateSingleLanguageText(fromLang, toLang, text, false); | |||
| } | |||
There was a problem hiding this comment.
isHTMLParseError is checked before isMemoryError, and core engine wraps any HTML-mode abort/aborted as HTML parse error. This means a genuine WASM abort/memory failure in HTML mode will be treated as a parse failure and immediately fall back to plain-text translation (skipping the retry loop). Consider tightening the parse-error signal (e.g., error code/name) or checking memory errors first when the message contains both.
| } catch (error: any) { | ||
| console.error(`WASM Error Context: TextLength=${cleanedText.length}, Options=${JSON.stringify(options)}`); | ||
| if (options.html && error?.message && /aborted|abort/i.test(error.message)) { | ||
| const wrappedError = new Error(`HTML parse error: ${error.message}`); | ||
| (wrappedError as Error & { cause?: unknown }).cause = error; | ||
| throw wrappedError; | ||
| } |
There was a problem hiding this comment.
The HTML error wrapping triggers for any abort/aborted when options.html is true. Since abort is also used for non-HTML failures (e.g., OOM/unreachable), this can mislabel unrelated failures as “HTML parse error” and change recovery behavior upstream. Consider narrowing this condition to a more specific HTML-parse signature from Bergamot, or attach a structured flag (name/code) instead of relying on the substring “abort”.
| private _tagPlaceholders( | ||
| text: string, | ||
| htmlEnabled: boolean = false | ||
| ): { taggedText: string; replacements: string[]; forceHtml: boolean } { | ||
| const placeholderRegex = /(\{\d+\}|\[\d+\])/g; | ||
| if (!placeholderRegex.test(text)) { | ||
| return { taggedText: text, replacements: [], forceHtml: false }; | ||
| } | ||
|
|
||
| const replacements: string[] = []; | ||
| const taggedText = text.replace(placeholderRegex, (match) => { | ||
| const index = replacements.length; | ||
| replacements.push(match); | ||
| return `<mt${index} />`; | ||
| if (htmlEnabled) { | ||
| return `<mt${index} />`; | ||
| } | ||
| return `MT_PLACEHOLDER_${index}_TOKEN`; | ||
| }); | ||
|
|
||
| return { taggedText, replacements, forceHtml: !htmlEnabled }; | ||
| return { taggedText, replacements, forceHtml: false }; | ||
| } |
There was a problem hiding this comment.
_tagPlaceholders still returns forceHtml but always sets it to false, so forceHtml can no longer influence effectiveOptions and is now dead state. Consider removing forceHtml from the return type and the translate() flow (or reintroducing the behavior intentionally) to reduce confusion.
| workflow_dispatch: | ||
| inputs: | ||
| publish_target: | ||
| description: Publish target | ||
| required: true | ||
| type: choice | ||
| default: both | ||
| options: | ||
| - docker | ||
| - github | ||
| - both |
There was a problem hiding this comment.
This workflow adds a new workflow_dispatch input and conditional publishing logic, which is unrelated to the PR’s stated purpose (fixing HTML parse/recognition errors). Consider splitting the workflow change into a separate PR (or update the PR description to explicitly include CI/publishing changes) to keep the fix reviewable and revertable independently.
我也遇到了同样的问题 #138 经过vibe coding 和自测 issues 里的翻译文本 目前正常。此pr 仅供参考