Repo demonstrating NLP fundamentals on the Common Crawl News (CC-News) corpus
- Install the AWS CLI on the machine if it is not already present
- Create an environment using the provided environment.yml file
- Install PySpark and a Java runtime if they are missing (a quick sanity check is sketched below)
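A minimal sanity-check sketch for the setup above; it assumes a local Python environment with pyspark already installed, and only verifies that the prerequisites are reachable:

```python
# Quick environment sanity check -- a minimal sketch, not part of the notebooks.
import shutil
import subprocess

# Verify the AWS CLI is on PATH (needed to pull CC-News files from S3).
assert shutil.which("aws"), "AWS CLI not found -- install it first"

# Verify a Java runtime is available for Spark.
assert shutil.which("java"), "Java runtime not found -- install a JRE/JDK"
subprocess.run(["java", "-version"], check=True)

# Verify pyspark imports and a local session starts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()
print("Spark", spark.version, "is up")
spark.stop()
```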
Here are the highlights of each notebook:
Basic exploration of the files available in the CC-News dataset for December 2019:
- Extracted some high-level metadata from the corpus
- Performed basic EDA
- Pulled the text data and detected its language
- Extracted the English-language domains
- Classified these English domains, at a high level, into six main categories
- Parallelized the work to speed up the process (see the sketch after this list)
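A minimal sketch of the parallelized language-detection step. It assumes the langdetect package and placeholder input texts; the notebook may use a different detector or parallelization strategy.

```python
# Parallel language detection -- a hedged sketch, assuming langdetect.
from multiprocessing import Pool

from langdetect import detect

def detect_language(text):
    """Return an ISO 639-1 language code, or None when detection fails."""
    try:
        return detect(text)
    except Exception:  # langdetect raises on empty or ambiguous input
        return None

if __name__ == "__main__":
    # Placeholder inputs; in the notebook these are page texts pulled from WARC records.
    texts = ["The quick brown fox jumps over the lazy dog.",
             "Der schnelle braune Fuchs springt über den faulen Hund."]
    with Pool() as pool:
        languages = pool.map(detect_language, texts)
    english_texts = [t for t, lang in zip(texts, languages) if lang == "en"]
    print(english_texts)
```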
Then, mainly in Spark, we:
- Ran through the corpus with our classified English-domain list
- Filtered out domains that are most likely not news outlets (see the first sketch after this list)
- Scraped each page's title, description, keywords (where present), and body content
- Created a DataFrame for further NLP work
- Created class labels from the keywords in the pages' metadata tags (see the second sketch below)
- Performed basic preprocessing for modeling
- Split the data into train and test sets
- Persisted the data
- Created our input features for BERT in parallel (see the third sketch below)
- Fine-tuned a BERT classifier that performs reasonably well on the documents
- Extracted a subset of records to be used for topic modeling
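The first sketch illustrates the filter-and-scrape step. It assumes the pages have already been read out of the WARC files into a (url, html) DataFrame and uses BeautifulSoup for parsing; the column names, sample row, and domain list are illustrative, not the notebook's exact code.

```python
# Hedged sketch: filter to known news domains, then scrape page fields.
from urllib.parse import urlparse

from bs4 import BeautifulSoup
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Placeholder input: in the notebook, `pages` comes from the WARC files.
pages = spark.createDataFrame(
    [("https://www.example-news.com/a",
      "<html><head><title>Headline</title></head><body>Story text</body></html>")],
    ["url", "html"],
)

# Illustrative stand-in for the classified English news-domain list.
news_domains = ["example-news.com", "sample-times.org"]

@udf(StringType())
def domain_of(url):
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

page_schema = StructType([
    StructField("title", StringType()),
    StructField("description", StringType()),
    StructField("keywords", StringType()),
    StructField("body", StringType()),
])

@udf(page_schema)
def scrape(html):
    """Pull title, meta description/keywords, and body text from raw HTML."""
    soup = BeautifulSoup(html or "", "html.parser")
    def meta(name):
        tag = soup.find("meta", attrs={"name": name})
        return tag.get("content") if tag else None
    title = soup.title.get_text(strip=True) if soup.title else None
    return (title, meta("description"), meta("keywords"),
            soup.get_text(" ", strip=True))

news = (pages
        .filter(domain_of(col("url")).isin(news_domains))  # drop non-news domains
        .withColumn("page", scrape(col("html")))
        .select("url", "page.*"))
```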
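Continuing from the `news` DataFrame above, the second sketch covers labeling, preprocessing, splitting, and persisting. The keyword-to-label mapping (two classes shown; the notebook derives six) and the output paths are illustrative assumptions.

```python
# Hedged sketch: derive class labels from meta keywords, clean text, split, persist.
from pyspark.sql.functions import col, lower, regexp_replace, when

labeled = (news
    # Illustrative keyword -> label mapping; the real notebook builds 6 classes.
    .withColumn("label",
        when(lower(col("keywords")).contains("politic"), "politics")
        .when(lower(col("keywords")).contains("sport"), "sports")
        .otherwise("other"))
    # Basic cleanup: lowercase the body and strip non-alphanumeric characters.
    .withColumn("text", regexp_replace(lower(col("body")), "[^a-z0-9 ]", " "))
    .filter(col("label") != "other"))

train, test = labeled.randomSplit([0.8, 0.2], seed=42)  # train/test split
train.write.mode("overwrite").parquet("data/train.parquet")
test.write.mode("overwrite").parquet("data/test.parquet")
```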
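The third sketch covers BERT feature creation and fine-tuning, assuming the Hugging Face transformers and datasets libraries with the bert-base-uncased checkpoint; the notebook's actual checkpoint, hyperparameters, and parallelization scheme may differ.

```python
# Hedged sketch: tokenize in parallel and fine-tune a BERT classifier.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("parquet", data_files={"train": "data/train.parquet",
                                              "test": "data/test.parquet"})

label_names = sorted(set(dataset["train"]["label"]))
label2id = {name: i for i, name in enumerate(label_names)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True,
                    padding="max_length", max_length=256)
    enc["labels"] = [label2id[name] for name in batch["label"]]
    return enc

# num_proc tokenizes shards in parallel worker processes.
encoded = dataset.map(tokenize, batched=True, num_proc=4)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label_names))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-news",
                           num_train_epochs=2,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()
```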
Finally, performed LDA to find some of the most commonly discussed topics in the news within a main topic area such as "politics" (sketched below).
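A minimal sketch of that LDA step, assuming scikit-learn (the notebook may use Spark MLlib or gensim instead) and placeholder documents standing in for the records extracted above:

```python
# Hedged sketch: LDA over documents already filtered to a topic area.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative input; in the notebook these are the extracted "politics" records.
politics_docs = ["placeholder politics article text about an election campaign",
                 "another placeholder article about a parliamentary vote"]

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(politics_docs)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[-10:][::-1]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```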