# Annotation Pipeline

This Python script automates the process of crawling for relevant URLs, scraping HTML content from those pages, formatting the data as Label Studio tasks, and uploading them to Label Studio for annotation.

## Features

- **Common Crawl Integration**: Initiates the Common Crawl script to crawl for relevant URLs based on specified parameters such as Common Crawl ID, URL type, keyword, and number of pages to process.

- **HTML Tag Collector**: Collects HTML tags from the crawled URLs using the tag collector script.

- **Label Studio Tasks**: Formats the collected data into tasks suitable for Label Studio annotation, including pre-annotation support for assumed record types.

- **Upload to Label Studio**: Uploads the tasks to Label Studio for review and annotation.
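To make the task-formatting step concrete, here is a minimal sketch of what a pre-annotated Label Studio task could look like. It follows Label Studio's general JSON task shape (`data` plus an optional `predictions` list), but the function name `build_task`, the data field names, and the `from_name`/`to_name` values are hypothetical; the real ones depend on the project's labeling configuration and on this repository's code, which is not shown here.

```python
import json


def build_task(url, html_tags, record_type=None):
    """Build one task dict (hypothetical helper, not the pipeline's own code)."""
    # "data" holds the fields the annotator sees; the keys here are assumed.
    task = {"data": {"url": url, **html_tags}}
    if record_type is not None:
        # A prediction pre-selects a choice so annotators only confirm or correct it.
        task["predictions"] = [
            {
                "model_version": "assumed-record-type",  # assumed label
                "result": [
                    {
                        "from_name": "record_type",  # assumed control tag name
                        "to_name": "url",  # assumed object tag name
                        "type": "choices",
                        "value": {"choices": [record_type]},
                    }
                ],
            }
        ]
    return task


tasks = [
    build_task(
        "https://example.gov/arrests",
        {"html_title": "Arrest Log"},
        record_type="Arrest Records",
    )
]
print(json.dumps(tasks, indent=2))
```

When no record type is given, the sketch omits `predictions` entirely, so the task arrives in Label Studio without a pre-selected choice.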
## Setup

1. Install Python dependencies:
   `pip install pandas argparse huggingface-hub`

2. Set up environment variables in `annotation_pipeline/dev.env`:
   - `LABEL_STUDIO_ACCESS_TOKEN=...`
   - `LABEL_STUDIO_PROJECT_ID=...`
   - `LABEL_STUDIO_ORGANIZATION=...`

   Set the following in `data_source_identification/.env` as well:
   - `HUGGINGFACE_ACCESS_TOKEN=...`
   - `LABEL_STUDIO_ACCESS_TOKEN=...`
   - `LABEL_STUDIO_PROJECT_ID=...`
   - `LABEL_STUDIO_ORGANIZATION=...`
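Before running the pipeline, it can help to confirm that an env file defines every variable listed above. The following is a minimal sketch using only the standard library; `load_env_file` and `missing_keys` are hypothetical helpers, and the pipeline itself may load these files differently (for example, through a dotenv library).

```python
import os
import tempfile

# Variable names taken from the Setup list for annotation_pipeline/dev.env.
REQUIRED = (
    "LABEL_STUDIO_ACCESS_TOKEN",
    "LABEL_STUDIO_PROJECT_ID",
    "LABEL_STUDIO_ORGANIZATION",
)


def load_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required=REQUIRED):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


# Demo against a throwaway file with placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "dev.env")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("LABEL_STUDIO_ACCESS_TOKEN=placeholder\n")
        handle.write("LABEL_STUDIO_PROJECT_ID=0\n")
    demo_values = load_env_file(demo_path)

print(missing_keys(demo_values))  # ['LABEL_STUDIO_ORGANIZATION']
```

Pointing `load_env_file` at `annotation_pipeline/dev.env` and checking that `missing_keys` returns an empty list is a quick sanity check before the first run.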
## Usage

`python annotation_pipeline.py common_crawl_id url keyword --pages num_pages [--record-type record_type]`

- `common_crawl_id`: ID of the Common Crawl corpus to search.
- `url`: Type of URL to search for (e.g. `*.gov` for all .gov domains).
- `keyword`: Keyword that must be matched in the full URL.
- `--pages num_pages`: Number of pages to search.
- `--record-type record_type` (optional): Assumed record type for pre-annotation.

e.g. `python annotation_pipeline.py CC-MAIN-2024-10 '*.gov' arrest --pages 2 --record-type 'Arrest Records'`
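The command-line interface above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the script's actual parser: `build_parser` is a hypothetical name, and treating `--pages` as a required integer is a guess based on the usage line.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(prog="annotation_pipeline.py")
    parser.add_argument("common_crawl_id", help="ID of the Common Crawl corpus to search")
    parser.add_argument("url", help="type of URL to search for, e.g. *.gov")
    parser.add_argument("keyword", help="keyword that must be matched in the full URL")
    # Assumption: --pages is required and numeric.
    parser.add_argument("--pages", type=int, required=True, help="number of pages to search")
    parser.add_argument("--record-type", default=None, help="assumed record type for pre-annotation")
    return parser


# Each list element is one shell argument, so "Arrest Records" stays a single value.
args = build_parser().parse_args(
    ["CC-MAIN-2024-10", "*.gov", "arrest", "--pages", "2", "--record-type", "Arrest Records"]
)
print(args.common_crawl_id, args.url, args.keyword, args.pages, args.record_type)
```

With a parser like this, a multi-word record type has to reach the script as one argument, so it needs quotes in the shell; unquoted, the second word would be rejected as an unrecognized argument. Quoting `'*.gov'` likewise keeps the shell from expanding the wildcard.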