Configure Renovate
…, normalisation (#8)

* feat: replace setup.py with pyproject.toml

* fix: remove_intent dict key, no-match shape, duplicate guard, E741 rename

* docs: type hints and docstrings

* test: comprehensive test suite

* docs: rewrite README

* perf/fix: cache regexes, word-count penalty, fix plural hack, tie-breaking

- Pre-compile all regexes at add_intent() time; removed per-query re.compile()
- lru_cache on word_tokenize calls to avoid repeated tokenization
- Replace character-length remainder penalty with word-count fraction
- Fix plural candidate detection to use word-boundary regex instead of substring check (prevents "status" being dropped due to "statuses")
- Fix regex slot confidence: divide by n_required not len(matches)
- Add deterministic tie-breaking: lower remainder word count wins, then alphabetical intent name
- Update test expected values to match new (more accurate) confidence scores

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: context gating, keyword exclusion, normalisation, opm.py, intent_names

- bracket_expansion.py: add drop_apostrophes, normalize_whitespace, normalize_utterance, normalize_example; training samples and queries are now normalised identically at registration/match time
- __init__.py: apply normalize_example to training data at add_intent(), apply normalize_utterance to query in calc_intents(); add full context gating API (set/unset/require/unrequire/exclude/unexclude_context); add exclude_keywords() with word-boundary safety; add intent_names property
- opm.py: new OVOS ConfidenceMatcherPipeline plugin with lru_cache(128), session blacklist support, match_high/medium/low, bus listeners for padatious:register_intent, detach_intent, detach_skill, mycroft.skills.train
- pyproject.toml: add ovos optional-dependencies group, pipeline entry point
- test: 19 new tests covering normalisation, context gating, keyword exclusion, intent_names

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: replace plural hack with lemmatize() helper

Add lemmatize(word) to bracket_expansion.py: strips apostrophes entirely and removes trailing 's' (not 'ss') for language-agnostic plural matching. Apply in _match() (replaces the old regex-based plural/singular hack) and in get_utterance_remainder() (lemmatized token comparison so plural forms of matched keywords are consumed from the remainder). "lights" now matches training keyword "light", "what s" tokens (from apostrophe normalisation) match "whats" via shared stem "what".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: apostrophes → space in lemmatize, not empty string

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: accuracy engine, benchmark, normalisation, opm rewrite, CI workflows

- Three-pass keyword matching (contiguous → lemma-normalised → non-contiguous)
- Non-contiguous match quality 0.8 so direct hits always win
- Require all required slots to fire; eliminates partial-required FPs
- _score() helper: remainder penalty, coverage bonus, slot bonus; 4dp rounding
- lemma_query computed once per calc_intents call; fused required+optional loop
- lemmatize() exported; apostrophes → space before lemmatization
- normalize_utterance/normalize_example applied at registration and match time
- opm.py rewritten for Adapt bus events (register_vocab/register_intent)
- benchmark/ package: 284-case dataset, accuracy.py, compare.py (vs Adapt)
- README updated with benchmark table (TN/NM column, honest FP commentary)
- Standard CI workflows added

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: update workflows to standard: explicit secrets, lint, Python 3.13/3.14

- add lint.yml
- build-tests.yml: add Python 3.13/3.14, drop secrets: inherit
- release_workflow.yml: explicit PYPI_TOKEN/MATRIX_TOKEN, add permissions
- publish_stable.yml: push trigger, explicit secrets, publish_release/sync_dev
- coverage.yml: add test_path/install_extras/min_coverage, drop secrets: inherit
- license_check.yml, pip_audit.yml: drop secrets: inherit

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Update README.md

* Delete .github/workflows/python-support.yml

* fix: address PR #8 CodeRabbit feedback and align CI workflows with nebulento

- palavreado/__init__.py: read IntentCreator.name directly in remove_intent instead of calling .build() (avoids wasteful allocation)
- palavreado/builder.py: add inline Note to all four regex slot methods explaining the intentional empty-bucket design for partial_conf weighting
- palavreado/bracket_expansion.py: update expand_parentheses docstring to reflect actual str->List[str] signature (was stale list<str>->list<list<str>>)
- pyproject.toml: switch to SPDX license string, add license-files entry, and add explicit Python 3.9-3.13 classifiers to match requires-python
- README.md: add Breaking changes section documenting RuntimeError on duplicate add_intent and the remove_intent-first pattern
- test/test_palavreado.py: add test_remove_intent_via_creator to lock the IntentCreator overload contract
- .github/workflows: add missing opm-check.yml; remove spurious `secrets: inherit` from release-preview and repo-health (matches nebulento pattern); align changelog_max_issues to 50 in release_workflow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: remove deprecated license classifier and clean up builder docstrings

Drop the old-style `License :: OSI Approved :: Apache Software License` classifier from pyproject.toml: newer setuptools (PEP 639) rejects it when `license` and `license-files` fields are already present, causing all CI jobs (build, coverage, opm_check, license_check, pip_audit) to fail at the build step.

Remove the "Note:" sections from the four regex slot methods in builder.py (require_regex, optional_regex, require_autoregex, optional_autoregex) that described internal "empty-bucket design" details; the docstrings now only describe what each method does and its args/returns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: count misclassifications as both FN and FP in accuracy benchmark

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: correct five verified bugs from code review

- __init__.py: lemma_map keys now per-token lemmatized (fixes phrase misses in Pass 2)
- __init__.py: required slots check uses full .keys() not 'if s' guard
- opm.py: remove lru_cache from _match_intent and _calc_palavreado_intent (stale on mutable state)
- opm.py: _regexes keyed by lang+entity_type; wired into require_regex/optional_regex at intent registration; pruned in handle_detach_skill
- compare.py: count misclassifications as FP for predicted intent (same fix as accuracy.py)
- dataset.py: fix mislabeled cases: 'pause' expanded to 'pause the music'; 'put a timer on for lunch' and 'turn off the lights and set a timer' relabeled to set_timer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: update benchmark results after bug fixes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix+test: regex named-group slots, duplicate import, 16 new tests

Fixes:
- bracket_expansion.py: remove duplicate 'import re'
- __init__.py: regex slots with named groups now mark the slot name in matches so the required-check passes and conf credit fires; previously intents using require_regex with named-group patterns always returned None

New tests (16):
- TestRegexSlots: named groups fire + populate, slot name in keywords, missing regex = no match, combined regex + keyword slot
- TestOptionalOnlyIntent: optional-only intent never fires
- TestKeywordExclusionMultiword: blocks on phrase, passes without phrase, no partial-word false match
- TestTiebreaking: alphabetical tiebreaker, higher-confidence multi-slot wins
- TestScore: perfect score, remainder penalty, zero-word guard, clamping to [0,1], 4dp rounding

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* perf: eliminate redundant work in hot path

- Pre-compile excluded keyword regexes at exclude_keywords() call instead of compiling them fresh on every query in _filter
- _filter now iterates pre-compiled (kw_lower, rx|None) pairs with early break per intent; no closure allocation, no dynamic re.search pattern build
- Tokenize and lemmatize the query once in calc_intents; reuse the list for both the set (query_lemmas) and the string (lemma_query)
- Cache per-candidate lemma strings inside _match during the initial classification pass; Pass 2 lemma_map reuses that cache instead of re-lemmatizing every token a second time
- Pre-sort regex patterns by length at add_intent() time; matching loop iterates self._sorted_regex[name][slot] directly with no per-query sort
- Pre-compute matched-word count (_mw) and remainder word count (_rw) in the yield inside calc_intents; calc_intent reads them directly instead of recomputing via _matched_words() for each comparison

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
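For readers following the lemmatize() entries above, here is a minimal sketch of the behaviour the commit message describes: apostrophes become a space, then a trailing 's' (but not 'ss') is dropped from each token. The name `lemmatize_sketch`, the per-token handling, and the rule that a lone "s" is left untouched are assumptions made for illustration; this is not the code in bracket_expansion.py.

```python
import re

# Straight and typographic apostrophes (assumption: both are treated alike).
_APOSTROPHES = re.compile(r"['’]")


def lemmatize_sketch(word: str) -> str:
    """Hypothetical stand-in for the lemmatize() helper described above."""
    # Apostrophes become a space, so "what's" splits into "what" and "s".
    parts = _APOSTROPHES.sub(" ", word).split()
    stems = []
    for part in parts:
        # Drop one trailing "s" unless the token ends in "ss".
        # Assumption: single-character tokens are left as they are.
        if len(part) > 1 and part.endswith("s") and not part.endswith("ss"):
            part = part[:-1]
        stems.append(part)
    return " ".join(stems)


if __name__ == "__main__":
    for sample in ("lights", "light", "glass", "what's", "whats"):
        print(sample, "->", lemmatize_sketch(sample))
```

Under these assumptions "lights" and "light" share the stem "light", and "whats" and the first token of "what's" share the stem "what", which is the matching effect the commit message reports.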
Human review requested!