-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cu 86c0p7y10 implement levenshtein distance comparison #336
base: dev
Are you sure you want to change the base?
Cu 86c0p7y10 implement levenshtein distance comparison #336
Conversation
WalkthroughThe changes in Changes
Possibly related PRs
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (2)
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java (2)
356-358
: Avoid magic numbers in similarity functionsIn
LevenshteinSimilarity
, returning0.5
for empty inputs is a magic number that may confuse readers.Define a constant to represent the default similarity score for empty inputs:
private static final double DEFAULT_EMPTY_INPUT_SCORE = 0.0;Then use it in your method:
if (StringUtils.isEmpty(left) || StringUtils.isEmpty(right)) { - return 0.5; + return DEFAULT_EMPTY_INPUT_SCORE; }
80-88
: Alphabetize theSimilarityFunctionName
enum constantsFor better readability and maintainability, consider alphabetizing the constants in the
SimilarityFunctionName
enum.public enum SimilarityFunctionName { - JARO_WINKLER_SIMILARITY, - JARO_SIMILARITY, - JACCARD_SIMILARITY, - SOUNDEX_SIMILARITY, - EXACT_SIMILARITY, - LEVENSHTEIN_SIMILARITY, - LEVENSHTEIN_SIMILARITY_PERCENTAGE + EXACT_SIMILARITY, + JACCARD_SIMILARITY, + JARO_SIMILARITY, + JARO_WINKLER_SIMILARITY, + LEVENSHTEIN_SIMILARITY, + LEVENSHTEIN_SIMILARITY_PERCENTAGE, + SOUNDEX_SIMILARITY }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java (5 hunks)
🧰 Additional context used
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java
Show resolved
Hide resolved
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java
Outdated
Show resolved
Hide resolved
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java
Outdated
Show resolved
Hide resolved
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java
Outdated
Show resolved
Hide resolved
…ackend/LinkerProbabilistic.java Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java (5 hunks)
🧰 Additional context used
🔇 Additional comments (3)
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java (3)
4-4
: LGTM: New imports and similarity function declarationsThe new imports for Soundex and LevenshteinDistance, along with the static field declarations for SoundexSimilarity, LevenshteinSimilarity, and LevenshteinSimilarityPercentage, are correctly added to support the new similarity functions.
Also applies to: 8-8, 32-34
348-363
:⚠️ Potential issueAdjust
LevenshteinSimilarity
to return a normalized similarity scoreThe
LevenshteinSimilarity
class currently returns the raw Levenshtein distance, which represents dissimilarity rather than similarity. To maintain consistency with other similarity functions, it should return a normalized similarity score between 0.0 and 1.0.Consider modifying the
apply
method as follows:- return Double.valueOf(levenshteinDistance.apply(left, right)); + int distance = levenshteinDistance.apply(left, right); + int maxLength = Math.max(left.length(), right.length()); + return 1.0 - ((double) distance / maxLength);This change ensures that a higher score indicates greater similarity, consistent with other similarity functions.
Likely invalid or redundant comment.
80-105
: 🛠️ Refactor suggestionRefactor
getSimilarityFunction
using aMap
for better scalabilityThe
SimilarityFunctionName
enum andgetSimilarityFunction
method update improve type safety and readability. However, the switch statement can be replaced with a more efficient and scalable Map-based approach.Consider refactoring the method as follows:
+private static final Map<SimilarityFunctionName, SimilarityScore<Double>> SIMILARITY_FUNCTION_MAP = Map.of( + SimilarityFunctionName.JARO_WINKLER_SIMILARITY, JARO_WINKLER_SIMILARITY, + SimilarityFunctionName.JARO_SIMILARITY, JARO_SIMILARITY, + SimilarityFunctionName.JACCARD_SIMILARITY, JACCARD_SIMILARITY, + SimilarityFunctionName.SOUNDEX_SIMILARITY, SOUNDEX_SIMILARITY, + SimilarityFunctionName.LEVENSHTEIN_SIMILARITY, LEVENSHTEIN_SIMILARITY, + SimilarityFunctionName.LEVENSHTEIN_SIMILARITY_PERCENTAGE, LEVENSHTEIN_SIMILARITY_PERCENTAGE, + SimilarityFunctionName.EXACT_SIMILARITY, EXACT_SIMILARITY +); -static SimilarityScore<Double> getSimilarityFunction(final SimilarityFunctionName func) { - switch (func) { - case JARO_WINKLER_SIMILARITY: - return JARO_WINKLER_SIMILARITY; - // Other cases... - default: - return EXACT_SIMILARITY; - } +static SimilarityScore<Double> getSimilarityFunction(final SimilarityFunctionName func) { + return SIMILARITY_FUNCTION_MAP.getOrDefault(func, EXACT_SIMILARITY); }This approach is more maintainable and performs better, especially as more similarity functions are added.
Likely invalid or redundant comment.
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java
Show resolved
Hide resolved
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java
Outdated
Show resolved
Hide resolved
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java
Outdated
Show resolved
Hide resolved
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java
Outdated
Show resolved
Hide resolved
…ackend/LinkerProbabilistic.java Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…ackend/LinkerProbabilistic.java Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java (2)
33-34
: Approve changes with suggestion for error handlingThe new static fields and changes to existing fields improve type safety by using the SimilarityFunctionName enum. However, there's a potential for IllegalArgumentException if the similarity score string doesn't match any enum constant.
Consider adding error handling:
-.map(f -> new ProbabilisticField(getSimilarityFunction(SimilarityFunctionName.valueOf(f.similarityScore())), f.comparisonLevels(), f.m(), f.u())) +.map(f -> { + SimilarityFunctionName functionName; + try { + functionName = SimilarityFunctionName.valueOf(f.similarityScore()); + } catch (IllegalArgumentException e) { + LOGGER.error("Invalid similarity function name: {}", f.similarityScore(), e); + functionName = SimilarityFunctionName.EXACT_SIMILARITY; // Fallback to a default + } + return new ProbabilisticField(getSimilarityFunction(functionName), f.comparisonLevels(), f.m(), f.u()); +})This change would provide more robust error handling and logging.
Also applies to: 44-44, 48-48, 52-52, 73-73
298-315
: Consider adjusting empty string handling in SoundexSimilarityThe SoundexSimilarity class is implemented correctly. However, returning 0.0 for empty or null strings might not be appropriate in all cases. Consider standardizing this approach across all similarity functions:
if (StringUtils.isEmpty(left) || StringUtils.isEmpty(right)) { - return 0.0; + return 0.5; }This change aligns with the ExactSimilarity implementation and provides a neutral similarity score when inputs are invalid.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java (6 hunks)
🧰 Additional context used
🔇 Additional comments (2)
JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/LinkerProbabilistic.java (2)
4-8
: LGTM: New imports added correctlyThe new imports for Soundex, LevenshteinDistance, and Map are correctly added and necessary for the new similarity functions being introduced.
Also applies to: 20-20
80-99
: LGTM: Improved structure with enum and mapThe introduction of the SimilarityFunctionName enum and SIMILARITY_FUNCTION_MAP, along with the updated getSimilarityFunction method, greatly improves the code structure, type safety, and maintainability. This change makes it easier to add new similarity functions in the future and reduces the likelihood of errors.
Summary by CodeRabbit