feat: adds scraper script #59

Open
wants to merge 3 commits into
base: development

Conversation

sshivaditya2019
Collaborator

@sshivaditya2019 sshivaditya2019 commented Dec 8, 2024

Resolves #56

  • Scrapes issues based on the username passed in.
  • Reads the token either from user input or from the CLI.
  • Updates issues in the repo, even when an issue with the same node_id already exists.
  • Issue Dedup and Matchmaking Results.
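
As a rough illustration of the "update even when the same node_id exists" behaviour described above, here is a minimal in-memory sketch. It is not the code from this PR: the `ScrapedIssue` shape, the `upsertIssues` name, and the `Map` standing in for the database are all assumptions made for the example; only `node_id` comes from the description.

```typescript
// Hypothetical shape of a scraped issue; only node_id is named in the PR description.
interface ScrapedIssue {
  node_id: string;
  title: string;
  body: string;
}

// Insert new issues and overwrite stored ones that share a node_id.
// The Map is a stand-in for the real database table (an assumption).
function upsertIssues(
  store: Map<string, ScrapedIssue>,
  scraped: ScrapedIssue[]
): { inserted: number; updated: number } {
  let inserted = 0;
  let updated = 0;
  for (const issue of scraped) {
    if (store.has(issue.node_id)) {
      updated += 1; // same node_id already stored: replace it
    } else {
      inserted += 1; // first time this node_id is seen
    }
    store.set(issue.node_id, issue);
  }
  return { inserted, updated };
}
```

Keying on `node_id` means a re-run of the scraper refreshes stale rows instead of failing or creating duplicates.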

Contributor

github-actions bot commented Dec 8, 2024

Unused files (1)

src/handlers/issue-scraper.ts


@sshivaditya2019, this task has been idle for a while. Please provide an update.

@sshivaditya2019
Collaborator Author

@0x4007, this is the base scraper logic. Should I write a script that adds the issues for all the users listed in auth.users.json?

@0x4007
Member

0x4007 commented Dec 12, 2024

Yes, and please update the database with it. You can QA with some task matchmaking scoring improvements and, as a second goal, some issue dedupe improvements, right?

@sshivaditya2019 sshivaditya2019 marked this pull request as ready for review December 12, 2024 21:06
@0x4007 0x4007 requested review from rndquu and whilefoo December 13, 2024 00:23
@0x4007
Member

0x4007 commented Dec 13, 2024

Is there some type of bias in the algorithm to make 75% the peak of the bell curve?

talent referrals

I looked through a few of the results and I think above 80% seems actually relevant. What are your thoughts?

It might make sense to exclude showing matches below 80%?

And then always recommend at least two contributors still.
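
The rule proposed above (hide matches below 80%, but still recommend at least two contributors) could be sketched roughly as follows. This is only an illustration under assumptions: the `Match` shape, the `selectMatches` name, and the default values are hypothetical and are not taken from the plugin's code.

```typescript
// Hypothetical match record; similarity is assumed to be in the range [0, 1].
interface Match {
  login: string;
  similarity: number;
}

// Keep every match at or above the threshold. If fewer than minCount
// matches clear it, fall back to the top minCount matches overall.
function selectMatches(matches: Match[], threshold = 0.8, minCount = 2): Match[] {
  // Copy before sorting so the caller's array is not mutated.
  const ranked = [...matches].sort((a, b) => b.similarity - a.similarity);
  const aboveThreshold = ranked.filter((m) => m.similarity >= threshold);
  if (aboveThreshold.length >= minCount) {
    return aboveThreshold;
  }
  // Returns fewer than minCount only when fewer candidates exist at all.
  return ranked.slice(0, minCount);
}
```

With this shape, a task with a single strong match would still surface the next-best contributor, even if that second match scores below the threshold.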

issue deduplication

The markup seems very noisy, and based on my quick look, anything below 80% seems kind of irrelevant. What are your thoughts on this?

For near-term testing purposes I think we should leave all the markup on, but I can see us needing to reduce the noise and hide anything below a certain threshold, like that 80% again.

@sshivaditya2019
Collaborator Author

sshivaditya2019 commented Dec 13, 2024

> I looked through a few of the results and I think above 80% seems actually relevant. What are your thoughts?
>
> It might make sense to exclude showing matches below 80%?

I think we should include matches below 80%, as this would allow for a larger pool of contributors. We can always exclude them by removing alwaysRecommend and setting the jobMatchingThreshold to 0.8.

> Is there some type of bias in the algorithm to make 75% the peak of the bell curve?

The current similarity search uses a weighted sum of cosine distance (weight 0.8) and L2 distance (weight 0.2). Without this weighting, that is, using cosine distance alone, the results tend to cluster around 90% similarity [1]. With the weighted sum, they are more likely to cluster around 75%. This helps make the results more varied and accurate.
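
To make the weighting concrete, here is a minimal sketch of a 0.8/0.2 weighted sum of cosine distance and L2 distance over two embedding vectors. It is an illustration only: the function names are hypothetical, the real search presumably runs inside the database rather than in application code, and how the combined distance is turned into a percentage is not stated in this thread.

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Cosine distance: 1 - cosine similarity. Assumes neither vector is all zeros.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Euclidean (L2) distance between two equal-length vectors.
function l2Distance(a: number[], b: number[]): number {
  const diff = a.map((value, i) => value - b[i]);
  return Math.sqrt(dot(diff, diff));
}

// Weighted sum with the weights mentioned in the comment above (0.8 and 0.2).
function weightedDistance(a: number[], b: number[], cosineWeight = 0.8, l2Weight = 0.2): number {
  return cosineWeight * cosineDistance(a, b) + l2Weight * l2Distance(a, b);
}
```

For unit-length embeddings the L2 distance equals the square root of twice the cosine distance, so it grows faster than the cosine distance for close pairs. For example, a pair with cosine similarity 0.9 has cosine distance 0.1 and L2 distance of about 0.447, giving a weighted distance of about 0.17. If the score were reported as one minus that distance (an assumption), the pair would show up lower than 90%, which is consistent in direction with the shift described above.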

Footnotes

  1. https://docs.voyageai.com/discuss/660499a8c27dbb000f201a40

Development

Successfully merging this pull request may close these issues.

Scraper: Populate "Closed As Complete" Issue Specifications
3 participants