Configurable confidence calculations with unit tests #234

Open · wants to merge 6 commits into base: dev
Conversation

@NeonDaniel (Member) commented Sep 17, 2024

  • Refactors __calc_confidence to calc_confidence so skills can optionally override the calculation (see the sketch after this list)
  • Adds scalar confidence values as instance variables for skill customization
  • Refactors the confidence calculation to simplify logic, document the process, and add logging
  • Adds unit tests for confidence calculations
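
As a minimal sketch of the override hook this PR introduces: the skill, its answer, and the calc_confidence signature below are assumptions based on the upstream CommonQuerySkill API, not code from this PR.

from ovos_workshop.skills.common_query_skill import CommonQuerySkill, CQSMatchLevel

class MyFactSkill(CommonQuerySkill):
    def CQS_match_query_phrase(self, phrase):
        # Hypothetical match: answer every query at CATEGORY level
        return phrase, CQSMatchLevel.CATEGORY, "An example answer."

    def calc_confidence(self, match, phrase, level, answer):
        # calc_confidence is now public, so a skill may replace or adjust
        # the default scoring
        if level == CQSMatchLevel.CATEGORY:
            return 0.9
        return super().calc_confidence(match, phrase, level, answer)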

Summary by CodeRabbit

  • New Features

    • Enhanced confidence calculation mechanism for improved accuracy in query responses.
    • Introduced configurable parameters for confidence scoring.
  • Bug Fixes

    • Corrected the comment for the GENERAL match level in the confidence calculation.
  • Tests

    • Added comprehensive tests for the new confidence calculation logic, covering various query types.
    • Improved tests for noise removal functionality to ensure accurate input normalization.

Commits:
  • Refactor confidence calculation to simplify logic, document process, and add logging
  • Outline unit tests for confidence calculations
coderabbitai bot (Contributor) commented Sep 17, 2024

Walkthrough

The changes enhance the confidence calculation mechanism in the CommonQuerySkill class by refactoring the calculation logic and introducing configurable parameters for improved accuracy. The method for calculating confidence has been renamed, and new test methods have been added to evaluate the functionality of noise removal and confidence levels across various query types, ensuring the logic behaves as expected.

Changes

Files | Change Summary
ovos_workshop/skills/common_query_skill.py | Refactored confidence calculation logic, renamed __calc_confidence to calc_confidence, and adjusted the relevance calculation to improve accuracy.
test/unittests/skills/test_common_query_skill.py | Added test_get_cq, test_remove_noise, and test_calc_confidence to cover noise removal and the new confidence calculation logic across various query types.

Poem

In the burrow where queries hop,
Confidence blooms, it won't stop.
With multipliers set just right,
Answers shine, oh what a sight!
A skill refined, like a soft, warm fur,
Hopping high, we now confer! 🐇✨



codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 9 lines in your changes missing coverage. Please review.

Project coverage is 49.85%. Comparing base (7c02206) to head (588c727).
Report is 14 commits behind head on dev.

Files with missing lines | Patch % | Lines
ovos_workshop/skills/common_query_skill.py | 66.66% | 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #234      +/-   ##
==========================================
- Coverage   53.41%   49.85%   -3.57%     
==========================================
  Files          37       37              
  Lines        4362     4379      +17     
==========================================
- Hits         2330     2183     -147     
- Misses       2032     2196     +164     


@NeonDaniel NeonDaniel marked this pull request as ready for review September 17, 2024 17:17
@NeonDaniel NeonDaniel requested a review from a team September 17, 2024 17:17
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (1)
ovos_workshop/skills/common_query_skill.py (1)

73-75: Address the TODO comment.

Please review the TODO comment and consider adjusting the default values for response_sentences_multiplier and response_words_multiplier. The current values may inflate the confidence score unnecessarily.

Do you want me to suggest alternative default values or open a GitHub issue to track this task?

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between e4b1677 and fab9c42.

Files selected for processing (2)
  • ovos_workshop/skills/common_query_skill.py (5 hunks)
  • test/unittests/skills/test_common_query_skill.py (1 hunks)
Additional comments not posted (6)
test/unittests/skills/test_common_query_skill.py (1)

47-63: Excellent test coverage for the confidence calculation logic!

The test_calc_confidence method effectively tests the calc_confidence function by evaluating the confidence levels for various query types. It covers different scenarios, including generic queries, specific queries, and low-confidence queries.

The assertions are well-defined and ensure that the confidence levels are calculated correctly based on the query type and match level. This test method enhances the overall test coverage and helps maintain the correctness and reliability of the confidence calculation mechanism.

Great job on implementing this comprehensive test method!

ovos_workshop/skills/common_query_skill.py (5)

29-29: LGTM!

The comment correction for the GENERAL level provides better clarity on its purpose.


65-70: Configurable parameters enhance flexibility.

The addition of instance variables for configurable confidence calculation parameters is a great improvement. It allows for fine-tuning the confidence scores based on the specific requirements of each skill.
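
A rough sketch of what such instance variables could look like; the attribute names follow those referenced elsewhere in this PR, while the import path and default values are illustrative assumptions:

from ovos_workshop.skills import OVOSSkill

class CommonQuerySkill(OVOSSkill):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Scalar weights a skill may override to tune its confidence scores
        self.input_consumed_multiplier = 0.1
        self.response_sentences_multiplier = 0.1
        self.response_words_multiplier = 0.05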


153-156: Improved confidence calculation and error handling.

The following changes enhance the confidence calculation process:

  • Renaming the method to calc_confidence allows for potential overrides by subclasses, promoting extensibility.
  • The added check and capping of the confidence value at 1.0 ensure the confidence remains within a valid range.
  • Logging a warning when the calculated confidence exceeds 1.0 aids in identifying and debugging potential issues.

Line range hint 201-261: Enhanced confidence calculation with configurable parameters.

The refactoring of the calc_confidence method significantly improves the confidence calculation process:

  • The method now considers multiple factors, such as the percentage of input words consumed, the number of sentences in the response, the relevance of the answer to the input query, and the word count of the response. This provides a more comprehensive assessment of the response's confidence.
  • The use of configurable weights for each factor allows for fine-tuning the impact of each factor on the final confidence score, enabling customization based on the specific requirements of each skill.
  • The modified relevance calculation, which focuses on the percentage of relevant words in the answer relative to the total word count, improves the accuracy of the confidence score by dynamically adjusting based on the actual content of the response.
  • The added logging statements provide valuable insights into the confidence calculation process, aiding in debugging and understanding how the final confidence score is derived.
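
Pieced together from the hunks quoted later in this conversation, the combination step looks roughly like the sketch below; the function name, base score, and default weights are illustrative assumptions rather than the PR's actual code:

import logging

LOG = logging.getLogger(__name__)

def combine_confidence(base, consumed_pct, num_sentences, wc_mod,
                       input_w=0.1, sentences_w=0.1, words_w=0.05):
    # Each factor contributes a weighted bonus on top of the base score
    confidence = (base
                  + consumed_pct * input_w
                  + num_sentences * sentences_w
                  + wc_mod * words_w)
    if confidence > 1.0:
        LOG.warning(f"Calculated confidence {confidence} exceeds 1.0; capping")
        confidence = 1.0
    return confidence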

258-258: LGTM!

The inclusion of the calculated bonuses in the final confidence score enhances the accuracy of the confidence assessment by considering multiple factors and their configured weights.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (1)
test/unittests/skills/test_common_query_skill.py (1)

Line range hint 75-77: Reminder: Implement the missing test methods.

The TODO comments indicate that the test methods for handle_question_query and handle_query_action are yet to be implemented. Please ensure that these methods are tested to improve the overall test coverage of the CommonQuerySkill class.

Do you want me to generate the missing test methods or open a GitHub issue to track this task?

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4bdf55f and 7d30340.

Files selected for processing (1)
  • test/unittests/skills/test_common_query_skill.py (2 hunks)
Additional comments not posted (4)
test/unittests/skills/test_common_query_skill.py (4)

40-49: LGTM!

The test_get_cq method thoroughly tests the __get_cq method by mocking the CQS_match_query_phrase method and verifying its behavior. It also tests the exception case to ensure graceful handling.


52-54: LGTM!

The test_remove_noise method verifies that the remove_noise method correctly normalizes a noisy input string, which is a common use case.


57-73: LGTM!

The test_calc_confidence method thoroughly tests the calc_confidence method by evaluating the confidence levels for different query types. It verifies that the confidence levels are calculated correctly and follow the expected order. The test method is well-structured and uses meaningful variable names and comments to improve readability.


2-2: LGTM!

The import statement for the Mock class is necessary for using it in the test methods and follows the Python convention of placing imports at the beginning of the file.

Support returned float confidence level in addition to enum
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (4)
ovos_workshop/skills/common_query_skill.py (4)

Line range hint 37-75: Configurable parameters for confidence calculation look good!

The introduction of configurable parameters allows for greater flexibility in tuning the confidence calculation. The default values seem reasonable.

However, please address the TODO comment regarding the default values for response_sentences_multiplier and response_words_multiplier. Consider adjusting these values based on empirical testing to ensure optimal confidence calculation.


153-162: Handling of directly reported confidence looks good!

Allowing skills to directly report confidence as a float value provides flexibility and allows skills to override the default confidence calculation if needed. Capping the confidence at 1.0 ensures that the confidence value remains within a valid range.

Please remove the extraneous f prefix from the log message at line 154 as suggested by the static analysis tool Ruff.

-                LOG.debug(f"Confidence directly reported by skill")
+                LOG.debug("Confidence directly reported by skill")


222-228: Sentence count approximation and noise word removal look good!

Approximating the number of sentences based on periods is a reasonable approach, and ensuring a minimum sentence count of 1 is a good fallback. Removing noise words from the matched portion helps focus on the meaningful part of the user's input.

In the future, consider exploring more robust sentence boundary detection techniques to handle edge cases like abbreviations and decimal numbers more accurately.
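
For comparison, a small sketch of both approaches: a max() floor is how the described "at least 1 sentence" fallback would normally be written, and quebra_frases.sentence_tokenize is the tokenizer suggested in the review comment further down.

import quebra_frases

answer = "The Eiffel Tower is 330 m tall. It was completed in 1889."

# Naive approximation: count periods, never reporting fewer than 1 sentence
num_sentences_naive = max(len(answer.split(".")) - 1, 1)

# Robust alternative: proper sentence boundary detection
num_sentences_robust = len(quebra_frases.sentence_tokenize(answer))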


231-251: Relevance calculation looks good!

Calculating relevance based on the percentage of relevant words in the answer is a reasonable approach to determine how well the answer addresses the user's input. Capping the answer size at a maximum value helps prevent overly long answers from receiving an unfair confidence boost.

Please address the TODO comment regarding removing SSML from the answer before calculating relevance. This will ensure that the relevance calculation is based on the actual text content of the answer.
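
A hedged sketch of the relevance idea described here; the function name, the word-set handling, and the 50-word cap are illustrative assumptions, not the PR's actual code:

def relevance(query_words: set, answer: str, max_words: int = 50) -> float:
    answer_words = answer.lower().split()
    answer_size = min(len(answer_words), max_words)  # cap overly long answers
    if answer_size == 0:
        return 0.0
    # Fraction of (capped) answer words that also appear in the user's query
    relevant = sum(1 for word in answer_words[:answer_size] if word in query_words)
    return relevant / answer_size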

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 7d30340 and 588c727.

Files selected for processing (2)
  • ovos_workshop/skills/common_query_skill.py (7 hunks)
  • test/unittests/skills/test_common_query_skill.py (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • test/unittests/skills/test_common_query_skill.py
Additional context used
Ruff
ovos_workshop/skills/common_query_skill.py

154-154: f-string without any placeholders

Remove extraneous f prefix

(F541)

Additional comments not posted (5)
ovos_workshop/skills/common_query_skill.py (5)

16-16: LGTM!

The import statement is necessary to support the type hint changes in the __get_cq and CQS_match_query_phrase methods.


29-29: LGTM!

The updated comment provides a clearer description of the GENERAL level in the CQSMatchLevel enum.


176-177: LGTM!

The updated return type hint accurately reflects the changes made to allow skills to directly report confidence as a float value.
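
A hypothetical example of what directly reporting a float would look like in a skill; the tuple shape is assumed from the upstream CQS_match_query_phrase contract:

from ovos_workshop.skills.common_query_skill import CommonQuerySkill

class MyCapitalSkill(CommonQuerySkill):
    def CQS_match_query_phrase(self, phrase):
        answer = "Paris is the capital of France."
        # Returning a float in place of a CQSMatchLevel reports the
        # confidence directly and bypasses the default calculation
        return phrase, 0.87, answer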


253-266: Bonus calculation and confidence capping look good!

Calculating bonuses based on multiple factors allows for a more nuanced confidence calculation that takes into account various aspects of the answer. Applying configurable weights to each bonus factor provides flexibility in tuning the confidence calculation.

Logging a warning and capping the confidence at 1.0 helps identify and handle cases where the calculated confidence exceeds the valid range.


306-306: LGTM!

The updated return type hint accurately reflects the changes made to allow skills to directly report confidence as a float value.

# Approximate the number of sentences in the answer. A trailing `.` will
# split, so reduce length by 1. If no `.` is present, ensure we count
# any response as at least 1 sentence
num_sentences = min(len(answer.split(".")) - 1, 1)
@JarbasAl (Member) commented Sep 20, 2024:

this is not very good, should use quebra_frases instead (already a dependency)

print(quebra_frases.sentence_tokenize(
    "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."))
#['Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.',
#'Did he mind?',
#"Adam Jones Jr. thinks he didn't.",
#"In any case, this isn't true...",
#"Well, with a probability of .9 it isn't."]

wc_mod = float(float(answer_size) / float(WORD_COUNT_DIVISOR)) * 2
# Calculate bonuses based on calculated values and configured weights
consumed_pct_bonus = consumed_pct * self.input_consumed_multiplier
num_sentences_bonus = num_sentences * self.response_sentences_multiplier
@JarbasAl (Member) commented:

Should this be part of the score at all? It's a voice assistant; do we prefer a skill reading a full Wikipedia page vs. giving a straight answer?

@JarbasAl (Member) commented:

IMHO we should drop this method and just map the enum to fixed float values, let ovos-core do the disambiguation
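
A minimal sketch of that suggestion, assuming the CQSMatchLevel members referenced in this PR and illustrative float values:

from ovos_workshop.skills.common_query_skill import CQSMatchLevel

LEVEL_CONFIDENCE = {
    CQSMatchLevel.EXACT: 0.9,
    CQSMatchLevel.CATEGORY: 0.6,
    CQSMatchLevel.GENERAL: 0.5,
}

def calc_confidence(self, match, phrase, level, answer):
    # Fixed per-level values; cross-skill disambiguation is left to ovos-core
    return LEVEL_CONFIDENCE.get(level, 0.5)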

@NeonDaniel (Member, Author) commented Sep 21, 2024:

> IMHO we should drop this method and just map the enum to fixed float values, let ovos-core do the disambiguation

I see this was added in #61. I'm not sure what the implementation looked like prior to that, but the original Mycroft logic (still used in Neon) still did some adjustment.

I think there needs to be some amount of logic to separate responses that return the same confidence enum, for predictable behavior; otherwise, response selection will be random for questions where multiple skills return a category match.

@JarbasAl (Member) commented Sep 21, 2024:

Not random; there's a reranker plugin that will do a much better job than this, and there is even a config option to ignore skill confidence completely. Both the BM25 and flashrank plugins are available and work well.

This logic came from the MK2, btw; it's also Mycroft, just more recent than what Neon forked.

@JarbasAl (Member) commented:

@NeonDaniel any updates on this one?
