feat: adds scraper script #59
base: development
Unused files (1)
@sshivaditya2019, this task has been idle for a while. Please provide an update.
@0x4007, This is the base scraper logic. Should I write a script for adding the issues for all the users mentioned in the |
Yes, and please update the database with it. You can QA it first against task matchmaking scoring improvements and, as a second goal, against issue dedupe improvements, right?
QA: After Issue Scraper for Issue Matching
Previously 1%
Previously 1% & 0%
Previously 0%
This will not impact issue deduplication, since deduplication is restricted to issues within the same organization and repository.
Is there some kind of bias in the algorithm that makes 75% the peak of the bell curve?

talent referrals
I looked through a few of the results, and I think only matches above 80% are actually relevant. What are your thoughts? It might make sense to hide matches below 80%, while still always recommending at least two contributors.

issue deduplication
The markup seems very noisy, and from a quick look, matches below 80% seem largely irrelevant. What are your thoughts on this? For near-term testing purposes I think we should leave all the markup on, but I can see us needing to reduce the noise and hide anything below a certain threshold, like that 80% again.
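To make the proposal above concrete, here is a minimal sketch of "hide matches below 80%, but always recommend at least two contributors". The function name `select_matches`, the `(contributor, score)` tuple shape, and the default values are assumptions for illustration only; they are not taken from this PR's code.

```python
from typing import List, Tuple

Match = Tuple[str, float]  # (contributor login, similarity in [0, 1]); hypothetical shape


def select_matches(
    matches: List[Match],
    threshold: float = 0.80,  # assumed cutoff, taken from the 80% suggested above
    min_results: int = 2,     # "always recommend at least two contributors"
) -> List[Match]:
    """Return matches at or above the threshold, padded up to min_results."""
    # Rank best-first so any padding uses the strongest remaining matches.
    ranked = sorted(matches, key=lambda m: m[1], reverse=True)
    above = [m for m in ranked if m[1] >= threshold]
    if len(above) >= min_results:
        return above
    # Too few matches clear the threshold: fall back to the top min_results.
    return ranked[:min_results]


if __name__ == "__main__":
    candidates = [("alice", 0.91), ("bob", 0.78), ("carol", 0.60)]
    print(select_matches(candidates))
```

With this shape, the threshold only removes noise when enough strong matches exist; a sparse result set still yields two recommendations.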
I think we should include matches below 80%, as this would allow for a larger pool of contributors. We can always exclude them by removing
The current similarity search uses a weighted sum of cosine distance (weight 0.8) and L2 distance (weight 0.2). Without this weighting, that is, using cosine distance alone, the results tend to cluster around 90% similarity. With the weighted sum, they are more likely to cluster around 75%. This helps make the results more varied and accurate.
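For readers following along, the weighting described above can be sketched roughly as follows. This is not the PR's actual query: the function names are hypothetical, and converting the combined distance to a similarity as `1 - distance` is an assumption, since the thread does not say how the percentage is derived.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_similarity(
    a: Sequence[float],
    b: Sequence[float],
    w_cos: float = 0.8,  # weights as stated in the comment above
    w_l2: float = 0.2,
) -> float:
    """Blend both distances, then map to a similarity (assumed: 1 - distance)."""
    combined = w_cos * cosine_distance(a, b) + w_l2 * l2_distance(a, b)
    return 1.0 - combined


if __name__ == "__main__":
    a, b = [1.0, 0.0], [0.8, 0.6]  # two unit-length toy embeddings
    print("cosine only:", 1.0 - cosine_distance(a, b))
    print("weighted:   ", weighted_similarity(a, b))
```

For unit-length embeddings, the L2 distance equals the square root of twice the cosine distance, which is never smaller than the cosine distance itself, so mixing it in pulls every score down relative to cosine alone. That is consistent with the peak moving from around 90% to around 75%, though the exact shift depends on the real embeddings.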
Resolves #56
username
passed in.