Repo demonstrating NLP fundamentals on the Common Crawl News (CC-News) corpus
- Install the AWS CLI on the machine if it is not already present
- Create an environment using the provided environment.yml file
- Install PySpark and a Java runtime if they are missing (a quick sanity check is sketched below)
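A minimal sanity-check sketch for the setup above; it assumes a local Python environment with pyspark already installed, and only verifies that the prerequisites are reachable:

```python
# Quick environment sanity check -- a minimal sketch, not part of the notebooks.
import shutil
import subprocess

# Verify the AWS CLI is on PATH (needed to pull CC-News files from S3).
assert shutil.which("aws"), "AWS CLI not found -- install it first"

# Verify a Java runtime is available for Spark.
assert shutil.which("java"), "Java runtime not found -- install a JRE/JDK"
subprocess.run(["java", "-version"], check=True)

# Verify pyspark imports and a local session starts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()
print("Spark", spark.version, "is up")
spark.stop()
```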
Here are the highlights of each notebook:
Basic exploration of the files available in the CC-News dataset for December 2019:
- Extracted some high-level metadata from the corpus
- Performed basic EDA
- Pulled the text data and detected its language
- Extracted the English-language domains
- Classified these English domains, at a high level, into six main categories
- Parallelized the work to speed up the process (see the sketch after this list)
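A minimal sketch of the parallelized language-detection step. It assumes the langdetect package and placeholder input texts; the notebook may use a different detector or parallelization strategy.

```python
# Parallel language detection -- a hedged sketch, assuming langdetect.
from multiprocessing import Pool

from langdetect import detect

def detect_language(text):
    """Return an ISO 639-1 language code, or None when detection fails."""
    try:
        return detect(text)
    except Exception:  # langdetect raises on empty or ambiguous input
        return None

if __name__ == "__main__":
    # Placeholder inputs; in the notebook these are page texts pulled from WARC records.
    texts = ["The quick brown fox jumps over the lazy dog.",
             "Der schnelle braune Fuchs springt über den faulen Hund."]
    with Pool() as pool:
        languages = pool.map(detect_language, texts)
    english_texts = [t for t, lang in zip(texts, languages) if lang == "en"]
    print(english_texts)
```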
Then, mainly in Spark, we:
- Ran through the corpus with our classified English-domain list
- Filtered out domains that are most likely not news outlets (see the first sketch after this list)
- Scraped each page's title, description, keywords (where present), and body content
- Created a DataFrame for further NLP work
- Created class labels from the keywords in the pages' metadata tags (see the second sketch below)
- Performed basic preprocessing for modeling
- Split the data into train and test sets
- Persisted the data
- Created our input features for BERT in parallel (see the third sketch below)
- Fine-tuned a BERT classifier that performs reasonably well on the documents
- Extracted a subset of records to be used for topic modeling
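The first sketch illustrates the filter-and-scrape step. It assumes the pages have already been read out of the WARC files into a (url, html) DataFrame and uses BeautifulSoup for parsing; the column names, sample row, and domain list are illustrative, not the notebook's exact code.

```python
# Hedged sketch: filter to known news domains, then scrape page fields.
from urllib.parse import urlparse

from bs4 import BeautifulSoup
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Placeholder input: in the notebook, `pages` comes from the WARC files.
pages = spark.createDataFrame(
    [("https://www.example-news.com/a",
      "<html><head><title>Headline</title></head><body>Story text</body></html>")],
    ["url", "html"],
)

# Illustrative stand-in for the classified English news-domain list.
news_domains = ["example-news.com", "sample-times.org"]

@udf(StringType())
def domain_of(url):
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

page_schema = StructType([
    StructField("title", StringType()),
    StructField("description", StringType()),
    StructField("keywords", StringType()),
    StructField("body", StringType()),
])

@udf(page_schema)
def scrape(html):
    """Pull title, meta description/keywords, and body text from raw HTML."""
    soup = BeautifulSoup(html or "", "html.parser")
    def meta(name):
        tag = soup.find("meta", attrs={"name": name})
        return tag.get("content") if tag else None
    title = soup.title.get_text(strip=True) if soup.title else None
    return (title, meta("description"), meta("keywords"),
            soup.get_text(" ", strip=True))

news = (pages
        .filter(domain_of(col("url")).isin(news_domains))  # drop non-news domains
        .withColumn("page", scrape(col("html")))
        .select("url", "page.*"))
```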
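Continuing from the `news` DataFrame above, the second sketch covers labeling, preprocessing, splitting, and persisting. The keyword-to-label mapping (two classes shown; the notebook derives six) and the output paths are illustrative assumptions.

```python
# Hedged sketch: derive class labels from meta keywords, clean text, split, persist.
from pyspark.sql.functions import col, lower, regexp_replace, when

labeled = (news
    # Illustrative keyword -> label mapping; the real notebook builds 6 classes.
    .withColumn("label",
        when(lower(col("keywords")).contains("politic"), "politics")
        .when(lower(col("keywords")).contains("sport"), "sports")
        .otherwise("other"))
    # Basic cleanup: lowercase the body and strip non-alphanumeric characters.
    .withColumn("text", regexp_replace(lower(col("body")), "[^a-z0-9 ]", " "))
    .filter(col("label") != "other"))

train, test = labeled.randomSplit([0.8, 0.2], seed=42)  # train/test split
train.write.mode("overwrite").parquet("data/train.parquet")
test.write.mode("overwrite").parquet("data/test.parquet")
```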
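The third sketch covers BERT feature creation and fine-tuning, assuming the Hugging Face transformers and datasets libraries with the bert-base-uncased checkpoint; the notebook's actual checkpoint, hyperparameters, and parallelization scheme may differ.

```python
# Hedged sketch: tokenize in parallel and fine-tune a BERT classifier.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("parquet", data_files={"train": "data/train.parquet",
                                              "test": "data/test.parquet"})

label_names = sorted(set(dataset["train"]["label"]))
label2id = {name: i for i, name in enumerate(label_names)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True,
                    padding="max_length", max_length=256)
    enc["labels"] = [label2id[name] for name in batch["label"]]
    return enc

# num_proc tokenizes shards in parallel worker processes.
encoded = dataset.map(tokenize, batched=True, num_proc=4)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(label_names))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-news",
                           num_train_epochs=2,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
)
trainer.train()
```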
Finally, performed LDA to find some of the most commonly discussed topics in the news within a main topic area such as "politics" (sketched below).
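A minimal sketch of that LDA step, assuming scikit-learn (the notebook may use Spark MLlib or gensim instead) and placeholder documents standing in for the records extracted above:

```python
# Hedged sketch: LDA over documents already filtered to a topic area.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative input; in the notebook these are the extracted "politics" records.
politics_docs = ["placeholder politics article text about an election campaign",
                 "another placeholder article about a parliamentary vote"]

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(politics_docs)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
lda.fit(X)

# Print the top words for each discovered topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[-10:][::-1]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```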