Adding POS tagging while building pattern for Spaczzruler #24
Comments
Hi @Ibrokhimsadikov, thanks for the kind words. I have gotten behind on spaczz maintenance and improvements lately but am hoping to get back on track in the near future. I believe implementing some form of POS constraints should be doable but I'm going to have to think about how I actually want to go about it. I will keep you updated here as that progresses. |
Hi @Ibrokhimsadikov, sorry there has not been much visible development on this issue yet. However, I did want to update you on where I am at with thinking/working through this. The ideal way to add this feature would be adding fuzzy matching support directly into spaCy's matcher, however because much of this is written in Cython, it is beyond my current coding capabilities. Accordingly, my original thought was to write a Python implementation very similar to spaCy's matcher. However, this quickly proved to be a massive undertaking that was mostly redundant. Therefore, I think the way I am going to attempt to incorporate this is by writing an abstraction that translates these "fuzzy" patterns to spaCy matcher compatible patterns. It would find the fuzzy matches then rewrite the patterns with the verbatim text found. For example:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The manager gave me acess to the database so now I can acces it.")
pattern = [{"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}]
# AbstractedMatcher.add(pattern)
# Under the hood would find fuzzy matches of "access" in the text and then use those to rewrite patterns
# that are compatible with spaCy's matcher.
[{"TEXT": "acess", "POS": "NOUN"}, {"TEXT": "acces", "POS": "NOUN"}]
# This would then only return the first misspelling of "access" - "acess" as it is the noun form.
This will still take some time to develop but I feel better about this direction. In the meantime I will post a more obtuse, but still useful, workaround that makes use of on-match callbacks with the FuzzyMatcher. I should get to that this evening. |
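The rewrite step described above can be sketched without spaCy at all. This is only an illustration of the idea, not how spaczz implements it: `expand_fuzzy_pattern` and `min_ratio` are hypothetical names, the standard library's `difflib.SequenceMatcher` stands in for whatever fuzzy scorer spaczz uses, and the tokens come from a plain whitespace split rather than a spaCy `Doc`. It emits one single-token pattern per verbatim spelling found, leaving the POS check to spaCy's matcher.

```python
from difflib import SequenceMatcher


def expand_fuzzy_pattern(token_texts, pattern, min_ratio=0.8):
    """Rewrite one single-token fuzzy pattern into exact-text patterns.

    Hypothetical helper: difflib is a stand-in scorer, not spaczz's own.
    """
    token_spec = pattern[0]
    target = token_spec["TEXT"]["FUZZY"]
    # Every key other than "TEXT" (e.g. "POS") is carried over unchanged.
    constraints = {k: v for k, v in token_spec.items() if k != "TEXT"}
    expanded = []
    seen = set()
    for text in token_texts:
        ratio = SequenceMatcher(None, target.lower(), text.lower()).ratio()
        if ratio >= min_ratio and text not in seen:
            seen.add(text)
            # One exact-text pattern per verbatim spelling found in the text.
            expanded.append([{"TEXT": text, **constraints}])
    return expanded


tokens = "The manager gave me acess to the database so now I can acces it .".split()
fuzzy_pattern = [{"TEXT": {"FUZZY": "access"}, "POS": "NOUN"}]
rewritten = expand_fuzzy_pattern(tokens, fuzzy_pattern)
print(rewritten)
```

Each inner list could then be registered as its own pattern with spaCy's matcher, which would apply the `"POS": "NOUN"` constraint and keep only the noun usage.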
Dear @gandersen101, first of all, thank you so much for not forgetting about me. I am very grateful for your effort, as this is the only library that integrates a fuzzy approach. With spaczz I was able to get more entities than with spaCy's matcher alone. As you know, one of the biggest issues in NER is building a dictionary/knowledge base, which usually comes with different variations of a string, or synonyms, and is a very time-consuming manual effort for custom NER. Spaczz is doing well, although at the expense of memory consumption when running inside the spaCy pipeline. Also, is AbstractedMatcher your custom pipeline component, similar to Spaczzruler? Thank you so much. I check in on this repo from time to time to see your updates. Looking forward to your "obtuse" :) solution so I can start testing it, as right now I am working with spaczz. |
Hi @Ibrokhimsadikov, thanks for the kind words. I'm very happy that you and others are finding this project useful. I certainly haven't forgotten about this request, I've just had less time than I would like to work on spaczz lately. Below is a workaround with the FuzzyMatcher you can use for now. It will only work as expected with single-token patterns and the flex argument set to 0. This is definitely a limited solution but you may be able to expand the idea.
import spacy
from spacy.tokens import Span
from spaczz.matcher import FuzzyMatcher

nlp = spacy.load("en_core_web_md")
text = "The manager gave me acess to the database so now I can acces it."
doc = nlp(text)


def add_ent(matcher, doc, i, matches):
    """Callback on match function. Adds entities to doc with name of label."""
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end, _ratio = matches[i]
    entity = Span(doc, start, end, label=match_id)
    # If Span already has entity assigned will skip rather than raising exception.
    try:
        doc.ents += (entity,)
    except AttributeError:
        pass


def select_nouns(matcher, doc, i, matches):
    """Callback on match function. Will continue passing matches that are nouns."""
    # This will only work with single-token patterns.
    # Also calling the above callback within this function to add entities to the doc.
    match_id, start, _end, _ratio = matches[i]
    if doc[start].pos_ == "NOUN":
        add_ent(matcher, doc, i, matches)


matcher = FuzzyMatcher(nlp.vocab, flex=0)
# Flex = 0 with single-token patterns will approximate token matching for now.
matcher.add("TEST", [nlp("access")], on_match=select_nouns)
matches = matcher(doc)

# Only the noun version of "access" was added to the doc.
for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))
Hope that helps for now! |
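For readers who want to see the filtering logic of that callback in isolation, here is a minimal sketch that runs without spaCy or spaczz. Everything in it is a stand-in: `select_by_pos` is a hypothetical helper, the POS tags are written by hand rather than produced by a model, and the match tuples only mimic the `(label, start, end, ratio)` shape unpacked in the workaround, with made-up ratio values.

```python
def select_by_pos(matches, pos_tags, allowed_pos=frozenset({"NOUN"})):
    """Keep single-token matches whose token carries an allowed POS tag.

    Hypothetical helper mirroring the select_nouns callback on plain data.
    """
    kept = []
    for match in matches:
        _label, start, end, _ratio = match
        # Like the workaround, this is only meaningful for single-token spans.
        if end - start == 1 and pos_tags[start] in allowed_pos:
            kept.append(match)
    return kept


# Hand-written tags for:
# The manager gave me acess to the database so now I can acces it .
pos_tags = [
    "DET", "NOUN", "VERB", "PRON", "NOUN", "ADP", "DET", "NOUN",
    "SCONJ", "ADV", "PRON", "AUX", "VERB", "PRON", "PUNCT",
]
# Illustrative fuzzy matches for "access"; the ratios are invented.
fuzzy_matches = [("TEST", 4, 5, 91), ("TEST", 12, 13, 91)]

noun_matches = select_by_pos(fuzzy_matches, pos_tags)
print(noun_matches)
```

Passing a larger `allowed_pos` set (for example nouns and proper nouns) is one way the single-tag check in the workaround might be expanded.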
Thank you so much for your response I will start using it. Immense thanks |
People interested in using the cython source may find this question of interest: |
Hi @ronyarmon thank you for keeping us updated with your research. I hope to eventually Cythonize the algorithmic components of spaczz and integrate them with spaCy Vocab objects but that is currently beyond my programming capabilities. It will be a fairly long-term process for me to develop my C/Cython skills enough to accomplish that so if you and/or others are able to accomplish that faster/better than I can you'll certainly have my full support! If the spaCy team decides to implement some of this functionality even better! Ultimately, I made spaczz to provide features I didn't see anywhere else in the current spaCy ecosystem but I know for sure they could be implemented better than they are now. In the meantime, I hope to have a new version of spaczz with this requested feature ready in the next couple weeks and will continue to provide updates here. |
So as of now I have broken the implementation of this feature into 5 distinct elements that I will be working on mostly sequentially.
Pull #35 completes the first task in this list. Hoping to have more done soon! |
More progress on this feature. Please see the roadmap below:
I am hoping to have this feature finished this week. @ronyarmon your Stack Overflow question received an interesting response that I will explore in the near future. Since I am close to implementing this feature in my pure-Python way, I will finish this before exploring expanding the spaCy Matcher. |
Thank you for sharing that @gandersen101 |
A few days overdue but this is closed by spaczz v0.4.0. Hopefully you all enjoy it. Please raise an issue if you run into any bugs! |
Thank you so much @gandersen101, I will definitely try that. Just FYI, I know the speed issues are already known, but I want to share my observations: processing 2 million reports averaging 150 words each took approximately 20 days with spaczz, versus 3 days with spaCy's EntityRuler, in production on an AWS ml.m5.12xlarge notebook instance. For pos Spaczz is amazing. Thank you once again, I will implement the POS tagging capability as well. |
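To put those figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the numbers reported above (2 million reports, 20 days versus 3 days) and assumes both runs processed continuously for the whole period, which the comment does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60
reports = 2_000_000


def reports_per_second(n_reports, days):
    """Average throughput, assuming uninterrupted processing."""
    return n_reports / (days * SECONDS_PER_DAY)


spaczz_rate = reports_per_second(reports, 20)
entity_ruler_rate = reports_per_second(reports, 3)
slowdown = entity_ruler_rate / spaczz_rate

print(f"spaczz: {spaczz_rate:.2f} reports/s")
print(f"EntityRuler: {entity_ruler_rate:.2f} reports/s")
print(f"spaczz is about {slowdown:.1f}x slower")
```

Under that assumption the fuzzy run works out to a little over one report per second, roughly 6.7 times slower than the EntityRuler run.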
Hey @Ibrokhimsadikov. Thank you for the speed profiling. Definitely a lot of room for improvement. Issue #41 turns into a performance discussion and I am planning on doing some (hopefully substantial) enhancements very soon. I will also try to keep track of major performance updates in issue #20 over the long-term. Let me know if you have questions on the token matcher. There is an example in the readme and more in spaczz document tests and test suite. |
Hello, I am really liking spaczz for fuzzy matching entity patterns.
Quick question: is there a way to add, for example, POS tagging constraints as well? For example, I want to extract only noun phrases of AS, but the fuzzy match is also getting me 'as' from "as above function".
In the pattern below, 'i' is each string from the list of vocab to match:
{'label': "ECHO", 'pattern': [{'TEXT': i, 'POS': 'NOUN'}], 'type': 'fuzzy'}
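As a rough sketch of how such patterns could be generated for a whole vocab list, the snippet below builds one dictionary per term. It combines the `label`/`pattern`/`type` keys from the request above with the `{"TEXT": {"FUZZY": ...}, "POS": ...}` token form proposed later in this thread. The helper name `build_pos_patterns`, the example vocab, and the default `"token"` value for `type` are assumptions for illustration; check the spaczz readme for the format your version actually accepts.

```python
def build_pos_patterns(terms, label, pos="NOUN", ruler_type="token"):
    """Build one fuzzy, POS-constrained, single-token pattern per vocab term.

    Hypothetical helper: the "type" value is an assumption, not confirmed API.
    """
    return [
        {
            "label": label,
            "pattern": [{"TEXT": {"FUZZY": term}, "POS": pos}],
            "type": ruler_type,
        }
        for term in terms
    ]


# Example vocab is made up for illustration.
vocab = ["AS", "stenosis"]
patterns = build_pos_patterns(vocab, label="ECHO")
for p in patterns:
    print(p)
```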