Add "google searcher" to populate label studio #102

Open
7 tasks
josh-chamberlain opened this issue Aug 16, 2024 · 0 comments

Contributor

josh-chamberlain commented Aug 16, 2024

Context

We have an Action to populate LabelStudio. It uses the Common Crawler script, which searches Common Crawl for pages related to selected keywords, then uploads what it finds to LabelStudio.

#74 may be referenced or continued for this work (related: #76).

Requirements

  • Rename the "Populate LabelStudio" action to "Common Crawler to LabelStudio", for specificity
  • Add a new action called "Google Searcher to LabelStudio"
    • Instead of Common Crawl, it should accept arguments for making targeted Google searches
      • accepts county in "Allegheny, PA" format (including the state!) or County FIPS code in "12345" format (these are in our DB!)
      • accepts custom, comma-separated keywords to apply to every agency in the county
        • pre-populate data portal, public records, documents
      • generates search terms by iterating through the agencies in the PDAP database and concatenating with keywords
        • agencies.submitted_name + "data portal"
        • agencies.submitted_name + "public records"
        • agencies.submitted_name + "documents"
    • It should generate a batch from the first 10 results of each search and send it to LabelStudio the same way the Common Crawler action does
      • i.e. combine the results, deduplicate them, and check for duplicates already in LabelStudio; log batches and cache
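
A rough sketch of the argument handling, search-term generation, and batch deduplication described above might look like the following. All function names here (`parse_county`, `parse_keywords`, `build_queries`, `build_batch`) are hypothetical and not taken from the existing codebase. The agency names stand in for `agencies.submitted_name` values; the database lookup, the Google search call, the LabelStudio upload, and the logging/caching are left out.

```python
import re
from typing import Dict, Iterable, List, Optional

# Keywords the issue asks to pre-populate.
DEFAULT_KEYWORDS = ["data portal", "public records", "documents"]


def parse_county(arg: str) -> Dict[str, str]:
    """Classify the county argument as a 5-digit FIPS code or "County, ST"."""
    arg = arg.strip()
    if re.fullmatch(r"\d{5}", arg):
        return {"fips": arg}
    match = re.fullmatch(r"(.+?),\s*([A-Za-z]{2})", arg)
    if match is None:
        raise ValueError(
            f'Expected "Allegheny, PA" or "12345" format, got {arg!r}'
        )
    return {"county": match.group(1).strip(), "state": match.group(2).upper()}


def parse_keywords(arg: Optional[str]) -> List[str]:
    """Split comma-separated keywords; fall back to the defaults if empty."""
    if not arg:
        return list(DEFAULT_KEYWORDS)
    return [kw.strip() for kw in arg.split(",") if kw.strip()]


def build_queries(agency_names: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Concatenate every agency name with every keyword."""
    keywords = list(keywords)
    return [f"{name} {kw}" for name in agency_names for kw in keywords]


def build_batch(
    results_per_query: Iterable[List[str]],
    known_urls: Iterable[str],
    limit: int = 10,
) -> List[str]:
    """Combine the first `limit` URLs of each search, dropping duplicates.

    `known_urls` is assumed to be the set of URLs already in LabelStudio.
    """
    seen = set(known_urls)
    batch = []
    for urls in results_per_query:
        for url in urls[:limit]:
            if url not in seen:
                seen.add(url)
                batch.append(url)
    return batch
```

For example, `build_queries(["Pittsburgh Police"], parse_keywords(None))` would produce one query per default keyword for that agency, and `build_batch` would then merge the per-query result lists into a single deduplicated batch in first-seen order.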
Labels
None yet
Projects
Status: Todo
Development

No branches or pull requests

1 participant