feat:regex corrections #7
Walkthrough
The pull request introduces significant enhancements to the Utterance Corrections plugin, focusing on improving speech-to-text (STT) output corrections. The changes expand README.md with detailed documentation of new features such as "Secret Speech", Shortcuts, and Word-Level Corrections. In the plugin's implementation, new attributes for regex corrections and matching strategies have been added.
Sequence Diagram
sequenceDiagram
participant User
participant Plugin as UtteranceCorrectionsPlugin
participant Regex as Regex Corrections
participant WordCorrections as Word Corrections
User->>Plugin: Provide utterance
Plugin->>Regex: Apply regex replacements
Regex-->>Plugin: Transformed utterance
Plugin->>WordCorrections: Apply word-level corrections
WordCorrections-->>Plugin: Final corrected utterance
Plugin->>User: Return corrected utterance
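The flow in the diagram can be sketched in a few lines of Python. This is only an illustration, not the plugin's actual code: the function name `correct_utterance`, the two dictionaries, and the sample data are all assumptions made for the example.

```python
import re
from typing import Dict


def correct_utterance(utterance: str,
                      regex_db: Dict[str, str],
                      words_db: Dict[str, str]) -> str:
    """Apply regex replacements first, then word-level corrections."""
    # Plugin ->> Regex: every pattern is applied to the running utterance.
    for pattern, replacement in regex_db.items():
        utterance = re.sub(pattern, replacement, utterance, flags=re.IGNORECASE)
    # Plugin ->> WordCorrections: replace whole words only.
    for wrong, right in words_db.items():
        utterance = re.sub(rf"\b{re.escape(wrong)}\b", right, utterance,
                           flags=re.IGNORECASE)
    return utterance


# Hypothetical sample data, not taken from the pull request.
result = correct_utterance(
    "turn of the lites",
    regex_db={r"\bturn of\b": "turn off"},
    words_db={"lites": "lights"},
)
print(result)
```

The order matters: word-level corrections see the output of the regex step, as in the diagram.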
Actionable comments posted: 1
🧹 Nitpick comments (6)
ovos_utterance_corrections_transformer/__init__.py (1)
38-38: Use the logging module instead of print
Use the logging module to provide better control over logging levels and outputs. Apply this diff to replace the print call:
+import logging
 ...
 except re.error as e:
-    print(f"Invalid regex pattern: {pattern} -> {e}")
+    logging.error(f"Invalid regex pattern: {pattern} -> {e}")
README.md (5)
5-5: Adjust heading level to maintain proper structure
The heading "Key Features:" is a level 3 heading (###), but it should be a level 2 heading (##) to follow the sequential order after the main title.
Apply this diff to correct the heading level:
-### Key Features:
+## Key Features:
🧰 Tools
🪛 Markdownlint (0.37.0)
5-5: Expected: h2; Actual: h3
Heading levels should only increment by one level at a time (MD001, heading-increment)
5-5: Punctuation: ':'
Trailing punctuation in heading (MD026, no-trailing-punctuation)
75-75: Fix loose punctuation in the explanation
There is a punctuation issue in the bullet point explanation.
Apply this diff to correct the punctuation:
-  - **`sch\\1`**: Replaces `sh` with `sch` and appends the rest of the word.
+  - **`sch\\1`**: Replaces `sh` with `sch` and appends the rest of the word.
[Note: The diff shows no change, which might indicate that the issue is minor or a formatting error.]
🧰 Tools
🪛 LanguageTool
[uncategorized] ~75-~75: Loose punctuation mark.
Context: ...sh at a word boundary. - sch\\1: Replaces sh with sch and appends th... (UNLIKELY_OPENING_PUNCTUATION)
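To make the quoted replacement concrete, here is a minimal sketch of how a `sch\\1` entry behaves once the JSON is loaded. Only the replacement string appears in the review; the pattern key `\\bsh(\\w*)` and the sample sentence are assumptions chosen to match the description ("`sh` at a word boundary", "appends the rest of the word").

```python
import json
import re

# Hypothetical regex_corrections.json content. Backslashes are doubled
# because JSON uses the backslash as its own escape character.
raw = r'{"\\bsh(\\w*)": "sch\\1"}'
regex_db = json.loads(raw)

pattern, replacement = next(iter(regex_db.items()))
# After JSON decoding the replacement is sch\1: "sch" plus capture group 1.
fixed = re.sub(pattern, replacement, "shalte das licht ein")
print(fixed)
```

The doubled backslash in the README is therefore a JSON escaping detail, not part of the regex itself.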
56-56: Correct grammatical errors in the use case description
There are grammatical mistakes in the sentence. Use "makes this mistake" instead of "does this mistake," and "its" (possessive) instead of "it's" (contraction of "it is").
Apply this diff to correct the sentence:
-> **use case**: whisper STT often does this mistake in it's transcriptions
+> **use case**: whisper STT often makes this mistake in its transcriptions
🧰 Tools
🪛 LanguageTool
[grammar] ~56-~56: Collocation: With the noun “mistake”, you should use the verb “make”.
Context: ...x"` > use case: whisper STT often does this mistake in it's transcriptions --...(MAKE_A_CHANGE)
[grammar] ~56-~56: Did you mean the possessive pronoun “its”?
Context: ... whisper STT often does this mistake in it's transcriptions --- ## 3. Regex-Based ...(IT_S_ITS)
84-84: Correct grammatical errors in the use case description
Similar to the previous comment, correct "does this mistake" to "makes this mistake," and "it's" to "its."
Apply this diff:
-> **use case**: citrinet german model often does this mistake in it's transcriptions
+> **use case**: citrinet german model often makes this mistake in its transcriptions
🧰 Tools
🪛 LanguageTool
[grammar] ~84-~84: Collocation: With the noun “mistake”, you should use the verb “make”.
Context: ...use case**: citrinet german model often does this mistake in it's transcriptions --...(MAKE_A_CHANGE)
[grammar] ~84-~84: Did you mean the possessive pronoun “its”?
Context: ...german model often does this mistake in it's transcriptions --- ## Configuration P...(IT_S_ITS)
106-106: Consider rephrasing for a more professional tone
The phrase "feel free to" is often overused. Consider rephrasing to make the writing more professional and engaging.
Apply this diff to improve the wording:
-Let us know how you're using this plugin, and feel free to contribute regex examples to this README or new use cases! 🚀
+We welcome your experiences with this plugin and encourage you to contribute regex examples or new use cases to this README! 🚀
🧰 Tools
🪛 LanguageTool
[style] ~106-~106: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ... know how you're using this plugin, and feel free to contribute regex examples to this READM...(FEEL_FREE_TO_STYLE_ME)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
README.md (1 hunks)
ovos_utterance_corrections_transformer/__init__.py (2 hunks)
🧰 Additional context used
🪛 LanguageTool
README.md
[grammar] ~56-~56: Collocation: With the noun “mistake”, you should use the verb “make”.
Context: ...x"` > use case: whisper STT often does this mistake in it's transcriptions --...
(MAKE_A_CHANGE)
[grammar] ~56-~56: Did you mean the possessive pronoun “its”?
Context: ... whisper STT often does this mistake in it's transcriptions --- ## 3. Regex-Based ...
(IT_S_ITS)
[uncategorized] ~75-~75: Loose punctuation mark.
Context: ...sh at a word boundary. - sch\\1: Replaces sh with sch and appends th...
(UNLIKELY_OPENING_PUNCTUATION)
[grammar] ~84-~84: Collocation: With the noun “mistake”, you should use the verb “make”.
Context: ...use case**: citrinet german model often does this mistake in it's transcriptions --...
(MAKE_A_CHANGE)
[grammar] ~84-~84: Did you mean the possessive pronoun “its”?
Context: ...german model often does this mistake in it's transcriptions --- ## Configuration P...
(IT_S_ITS)
[style] ~105-~105: This expression usually appears with a “please” in front of it.
Context: ...ommands with shorter phrases or slang. Let us know how you're using this plugin, and feel ...
(INSERT_PLEASE)
[style] ~106-~106: The phrase ‘feel free to’ is used quite frequently. Consider using a less frequent alternative to set your writing apart from others and make it sound more professional.
Context: ... know how you're using this plugin, and feel free to contribute regex examples to this READM...
(FEEL_FREE_TO_STYLE_ME)
🪛 Markdownlint (0.37.0)
README.md
5-5: Expected: h2; Actual: h3
Heading levels should only increment by one level at a time
(MD001, heading-increment)
5-5: Punctuation: ':'
Trailing punctuation in heading
(MD026, no-trailing-punctuation)
73-73: Punctuation: ':'
Trailing punctuation in heading
(MD026, no-trailing-punctuation)
77-77: Punctuation: ':'
Trailing punctuation in heading
(MD026, no-trailing-punctuation)
🪛 Ruff (0.8.2)
ovos_utterance_corrections_transformer/__init__.py
36-36: Undefined name re
(F821)
37-37: Undefined name re
(F821)
Actionable comments posted: 1
🧹 Nitpick comments (3)
ovos_utterance_corrections_transformer/__init__.py (3)
17-19: Consider making attributes configurable
The attributes regex_db, confidence_threshold, and match_strategy are currently hardcoded in the __init__ method. Making these attributes configurable through constructor parameters or a configuration file would enhance the flexibility and reusability of the plugin.
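One way to read this suggestion is to merge user-supplied settings over defaults in the constructor. The sketch below is hypothetical: the class name `CorrectionsSettings`, the config keys, and the `"default"` strategy placeholder are assumptions; only the file name `regex_corrections.json` and the 0.85 threshold appear elsewhere in this review.

```python
from typing import Optional


class CorrectionsSettings:
    """Hypothetical holder for the three attributes named in the comment."""

    DEFAULTS = {
        "regex_db": "regex_corrections.json",  # file name taken from the review
        "confidence_threshold": 0.85,          # default quoted in the review
        "match_strategy": "default",           # placeholder, not a real strategy name
    }

    def __init__(self, config: Optional[dict] = None):
        # User-supplied values override the defaults key by key.
        merged = {**self.DEFAULTS, **(config or {})}
        self.regex_db_path = merged["regex_db"]
        self.confidence_threshold = float(merged["confidence_threshold"])
        self.match_strategy = merged["match_strategy"]
        # Reject thresholds that can never match or always match by accident.
        if not 0.0 <= self.confidence_threshold <= 1.0:
            raise ValueError("confidence_threshold must be between 0 and 1")


settings = CorrectionsSettings({"confidence_threshold": 0.9})
print(settings.confidence_threshold, settings.regex_db_path)
```

Validating the threshold at construction time surfaces a bad configuration early instead of silently disabling corrections.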
26-29: Process all utterances for full replacements
Currently, only the first utterance is considered for full utterance replacement. If multiple utterances are provided, consider processing each one to ensure consistency in transformations.
Apply this change to process all utterances:
-        if utterances and self.db:
-            replacement, conf = match_one(
-                utterances[0], self.db, strategy=self.match_strategy
-            )
-            if conf >= self.confidence_threshold:
-                return [replacement], context
+        if utterances and self.db:
+            for idx, utterance in enumerate(utterances):
+                replacement, conf = match_one(
+                    utterance, self.db, strategy=self.match_strategy
+                )
+                if conf >= self.confidence_threshold:
+                    utterances[idx] = replacement
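A self-contained sketch of the per-utterance variant follows. Since `match_one` is not shown in this review, the example uses a hypothetical stand-in, `best_match`, built on the standard library's `difflib`; its scoring will differ from whatever the plugin uses, and the sample database is invented.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def best_match(query: str, db: Dict[str, str]) -> Tuple[str, float]:
    """Hypothetical stand-in for match_one: returns (replacement, score)."""
    best_key = max(db, key=lambda k: SequenceMatcher(None, query, k).ratio())
    return db[best_key], SequenceMatcher(None, query, best_key).ratio()


def replace_full_utterances(utterances: List[str], db: Dict[str, str],
                            threshold: float = 0.85) -> List[str]:
    """Replace every utterance whose best match clears the threshold."""
    corrected = list(utterances)  # do not mutate the caller's list
    if not db:
        return corrected
    for idx, utterance in enumerate(corrected):
        replacement, conf = best_match(utterance, db)
        if conf >= threshold:
            corrected[idx] = replacement
    return corrected


# Invented sample data for illustration.
db = {"turn of the lights": "turn off the lights"}
out = replace_full_utterances(["turn of the lights", "what time is it"], db)
print(out)
```

Note the behavioural difference from the removed lines: the original returned a single-element list as soon as the first utterance matched, whereas this form keeps the list length and leaves non-matching utterances untouched.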
34-40: Pre-compile regex patterns for efficiency and early validation
Pre-compiling regex patterns when loading them can improve performance and catch invalid patterns at initialization time, reducing runtime overhead and errors.
Apply this diff to pre-compile regex patterns during initialization:
+        # Pre-compile regex patterns
+        self.compiled_regex_patterns = {}
+        for pattern, replacement in self.regex_db.items():
+            try:
+                compiled_pattern = re.compile(pattern)
+                self.compiled_regex_patterns[compiled_pattern] = replacement
+            except re.error as e:
+                logging.error(f"Invalid regex pattern during initialization: {pattern} -> {e}")
And update the regex replacement loop:
-        for idx in range(len(utterances)):
-            for pattern, replacement in self.regex_db.items():
+        for idx, utterance in enumerate(utterances):
+            for pattern, replacement in self.compiled_regex_patterns.items():
                 try:
-                    utterances[idx] = re.sub(pattern, replacement, utterances[idx])
+                    utterances[idx] = pattern.sub(replacement, utterance)
                 except re.error as e:
-                    print(f"Invalid regex pattern: {pattern} -> {e}")
+                    logging.error(f"Error applying regex pattern: {pattern} -> {e}")
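The pre-compilation idea can be sketched as two small functions. The names `compile_patterns` and `apply_patterns` are hypothetical and the sample patterns are invented. One deliberate difference from the suggested loop: the sketch threads the running value through every pattern, because substituting on the unchanged loop variable would discard the result of earlier patterns.

```python
import logging
import re
from typing import Dict, List, Tuple

LOG = logging.getLogger(__name__)


def compile_patterns(regex_db: Dict[str, str]) -> List[Tuple[re.Pattern, str]]:
    """Compile once at load time; invalid patterns are logged and skipped."""
    compiled = []
    for pattern, replacement in regex_db.items():
        try:
            compiled.append((re.compile(pattern), replacement))
        except re.error as e:
            LOG.error("Invalid regex pattern during initialization: %s -> %s",
                      pattern, e)
    return compiled


def apply_patterns(utterances: List[str],
                   compiled: List[Tuple[re.Pattern, str]]) -> List[str]:
    """Apply every compiled pattern in order to each utterance."""
    result = []
    for utterance in utterances:
        for pattern, replacement in compiled:
            try:
                # Reassign so the next pattern sees the updated text.
                utterance = pattern.sub(replacement, utterance)
            except re.error as e:
                LOG.error("Error applying regex pattern: %s -> %s",
                          pattern.pattern, e)
        result.append(utterance)
    return result


# "(" is an intentionally invalid pattern to show the skip path.
compiled = compile_patterns({r"\bsh(\w*)": r"sch\1", "(": "x"})
print(apply_patterns(["shalte das licht ein"], compiled))
```

The `try` around `sub` is still useful after compilation, since a replacement template that references a missing group raises `re.error` only when it is applied.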
Actionable comments posted: 2
🧹 Nitpick comments (3)
ovos_utterance_corrections_transformer/__init__.py (3)
26-30: Consider documenting the confidence threshold
The threshold value of 0.85 is significant for matching quality. Consider adding a comment explaining why this value was chosen or documenting it in the configuration schema.
-            if conf >= self.config.get("thresh", 0.85):
+            # Minimum confidence threshold for utterance replacement (default: 0.85)
+            # Lower values may lead to false positives, higher values may miss valid corrections
+            if conf >= self.config.get("confidence_threshold", 0.85):
34-42: Consider caching compiled regex patterns
Compiling regex patterns in a loop can be inefficient. Consider caching the compiled patterns.
+    # Cache compiled patterns
+    _pattern_cache = {}
+
     if utterances and self.regex_db:
         flags = re.IGNORECASE if self.config.get("ignore_case", True) else 0
         for idx in range(len(utterances)):
             for pattern, replacement in self.regex_db.items():
                 LOG.debug(f"Applying regex pattern: {pattern}")
                 try:
+                    if pattern not in self._pattern_cache:
+                        self._pattern_cache[pattern] = re.compile(pattern, flags=flags)
+                    utterances[idx] = self._pattern_cache[pattern].sub(replacement, utterances[idx])
-                    utterances[idx] = re.sub(pattern, replacement, utterances[idx], flags=flags)
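A lazily filled cache could look like the sketch below. The class name `RegexCorrector` and its constructor arguments are assumptions for illustration. Two choices differ from the suggestion as quoted: the cache is created as an instance attribute, so that the `self._pattern_cache` lookups resolve, and it is keyed by pattern and flags together, so a change of `ignore_case` cannot return a pattern compiled with stale flags.

```python
import re
from typing import Dict, List, Tuple


class RegexCorrector:
    """Hypothetical helper that compiles each pattern at most once."""

    def __init__(self, regex_db: Dict[str, str], ignore_case: bool = True):
        self.regex_db = regex_db
        self.flags = re.IGNORECASE if ignore_case else 0
        # Instance-level cache keyed by (pattern text, flags).
        self._pattern_cache: Dict[Tuple[str, int], re.Pattern] = {}

    def _compiled(self, pattern: str) -> re.Pattern:
        key = (pattern, int(self.flags))
        if key not in self._pattern_cache:
            self._pattern_cache[key] = re.compile(pattern, flags=self.flags)
        return self._pattern_cache[key]

    def apply(self, utterances: List[str]) -> List[str]:
        corrected = []
        for utterance in utterances:
            for pattern, replacement in self.regex_db.items():
                utterance = self._compiled(pattern).sub(replacement, utterance)
            corrected.append(utterance)
        return corrected


corrector = RegexCorrector({r"\bSH(\w*)": r"sch\1"})
print(corrector.apply(["shalte", "Shule"]))
```

Python's `re` module already keeps a small internal cache of recently compiled patterns, so the practical gain of an explicit cache is likely modest for a handful of patterns.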
44-51: Reduce duplication of flags calculation
The flags calculation is duplicated from the regex replacement section. Consider extracting it to a method or calculating once at the start of transform.
+    def _get_regex_flags(self):
+        return re.IGNORECASE if self.config.get("ignore_case", True) else 0
+
     def transform(self, utterances: List[str], context: Optional[dict] = None) -> (list, dict):
         context = context or {}
+        flags = self._get_regex_flags()
         # Step 1: Replace full utterance
         ...
         # Step 2: Apply regex replacements
         if utterances and self.regex_db:
-            flags = re.IGNORECASE if self.config.get("ignore_case", True) else 0
             for idx in range(len(utterances)):
                 ...
         # Step 3: Replace individual words
         if utterances and self.words_db:
-            flags = re.IGNORECASE if self.config.get("ignore_case", True) else 0
             for idx in range(len(utterances)):
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
ovos_utterance_corrections_transformer/__init__.py (2 hunks)
🔇 Additional comments (1)
ovos_utterance_corrections_transformer/__init__.py (1)
18-18: Verify regex_corrections.json path and existence
Ensure the regex_corrections.json file is properly documented and created during installation.
✅ Verification successful
File path and usage are properly documented
The regex_corrections.json file is well documented in the README.md with:
- Clear path location (~/.local/share/mycroft/)
- Detailed explanation of its purpose for regex-based corrections
- Example configuration with explanation
- Usage scenarios and examples
The file is created on demand by JsonStorage when needed, which is a standard pattern for user configuration files. No explicit creation during installation is required because it is a user-configurable file.
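The create-on-demand pattern described here can be illustrated with the standard library alone. This is not JsonStorage's implementation or API; the function `load_regex_corrections` is a hypothetical stand-in, and the example writes into a temporary directory rather than `~/.local/share/mycroft/`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_regex_corrections(path: Path) -> Dict[str, str]:
    """Return the stored corrections, creating an empty file on first use."""
    if not path.exists():
        # Create parent directories and an empty JSON object on demand.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("{}", encoding="utf-8")
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "mycroft" / "regex_corrections.json"
    first_load = load_regex_corrections(target)
    created = target.exists()

print(first_load, created)
```

Because the file appears the first time it is needed, an installer does not have to ship or generate it, which matches the conclusion of the verification above.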
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if the path is documented in README or installation docs
rg -l "regex_corrections.json" --type md
# Check if there's any code that creates this file during setup
rg -l "regex_corrections.json.*create|write|initialize" --type python
Length of output: 153
Script:
#!/bin/bash
# Check documentation in README
cat README.md
# Search for any Python files mentioning regex_corrections.json
rg -l "regex_corrections.json" -g "*.py"
# Search for initialization or setup code
rg "regex_corrections.json" -g "*.py" -C 2
Length of output: 4002
Summary by CodeRabbit
New Features
Bug Fixes