🇳🇬 NaijaWeb

NaijaWeb is a web scraping project inspired by the FineWeb paper and by the WebText and OpenWebText datasets.

The scraping was performed on Google Colab because of its free memory and easy integration with Google Drive. To speed up the process, several notebooks (clean_download_data.ipynb, clean_download_data_403.ipynb, download_webpages.ipynb, webscrape_403.ipynb, and webscrape_nairaland.ipynb) were run across 9 Colab sessions (3 per Colab account). All sessions were linked to a single Google Drive folder, /content/drive/MyDrive/nairaland_webtext, where the files were saved.

The dataset is available on Hugging Face: NaijaWeb Dataset.

Notebooks and Process

webscrape_nairaland.ipynb

This notebook scrapes and extracts posts from Nairaland, one section at a time. For each section, it first collects all post links and saves them to a pickle file, then downloads the individual posts. Because Colab runtimes are limited, the long list of post links was split into smaller chunks and the downloads were distributed across 9 notebooks.
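The chunking step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names and the output file name pattern are hypothetical, and only the chunk count of 9 comes from the description above.

```python
import pickle
from pathlib import Path


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def save_chunks(links, out_dir, n_chunks=9):
    """Write one pickle file per chunk so each notebook can load its own share."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_into_chunks(links, n_chunks)):
        path = out_dir / f"post_links_part_{i}.pkl"  # hypothetical file name
        with open(path, "wb") as f:
            pickle.dump(chunk, f)
        paths.append(path)
    return paths
```

Each notebook would then load only its own part file and download the posts listed in it.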

extract_outboundlinks.ipynb

This notebook extracts all outbound links from the downloaded posts, filters out certain domains, performs basic cleaning (removing full stops at the end of links and consecutive full stops), and saves the cleaned links to a CSV file.
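A rough sketch of the cleaning and filtering described above, using only the standard library. The excluded domains listed here are placeholders, since the actual list is not given, and the function names and CSV column name are hypothetical.

```python
import csv
import re
from urllib.parse import urlparse

# Hypothetical blocklist; the README does not say which domains were filtered out.
EXCLUDED_DOMAINS = {"nairaland.com", "facebook.com"}


def clean_link(link):
    """Collapse consecutive full stops and drop full stops at the end of a link."""
    link = re.sub(r"\.{2,}", ".", link.strip())
    return link.rstrip(".")


def is_allowed(link, excluded=EXCLUDED_DOMAINS):
    """Return False when the host is missing, an excluded domain, or one of its subdomains."""
    host = (urlparse(link).hostname or "").lower()
    if not host:
        return False
    return not any(host == d or host.endswith("." + d) for d in excluded)


def clean_links(links, excluded=EXCLUDED_DOMAINS):
    """Clean every link, then keep only those on allowed domains."""
    cleaned = (clean_link(raw) for raw in links)
    return [link for link in cleaned if is_allowed(link, excluded)]


def save_links(links, path):
    """Save the cleaned links to a one-column CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])  # hypothetical column name
        writer.writerows([link] for link in links)
```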

download_webpages.ipynb

This notebook downloads webpages from the outbound links extracted in the previous step. The links are downloaded in batches of 1,000 and saved as pickle files. This process was also run across 9 notebooks to save time.
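The batching could look roughly like this. It is a sketch under assumptions: the record layout (url, status, html), the file naming, and the use of urllib are illustrative choices, and only the batch size of 1,000 and the pickle output come from the description above.

```python
import pickle
import urllib.error
import urllib.request
from pathlib import Path


def fetch(url, timeout=30):
    """Download one page and return its status code and HTML (None on failure)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="replace")
            return {"url": url, "status": response.status, "html": html}
    except urllib.error.HTTPError as exc:
        # Keep the status code so that 403 responses can be retried later.
        return {"url": url, "status": exc.code, "html": None}
    except (urllib.error.URLError, TimeoutError, ValueError):
        return {"url": url, "status": None, "html": None}


def download_in_batches(urls, out_dir, batch_size=1000, fetch_fn=fetch):
    """Download urls batch by batch, writing one pickle file per batch."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(urls), batch_size):
        batch = [fetch_fn(url) for url in urls[start:start + batch_size]]
        path = out_dir / f"batch_{start // batch_size:05d}.pkl"  # hypothetical name
        with open(path, "wb") as f:
            pickle.dump(batch, f)
        paths.append(path)
    return paths
```

Writing each batch to its own file means an interrupted Colab session loses at most one batch of work.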

clean_download_data.ipynb

This notebook uses Trafilatura (as inspired by the FineWeb paper) to extract and clean the downloaded webpages. Pages that returned a "403 Forbidden" response were saved for later handling.
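As an illustration, the two steps (setting aside 403 responses, then extracting text) might be written as below. The record layout with url, status, and html keys is an assumption, as are the function names; trafilatura.extract is the library's standard entry point for turning an HTML string into main-content text.

```python
def split_by_status(records):
    """Separate downloaded records into usable pages and URLs that returned 403."""
    ok, forbidden = [], []
    for record in records:
        if record.get("status") == 403:
            forbidden.append(record["url"])  # saved for a later retry
        elif record.get("html"):
            ok.append(record)
    return ok, forbidden


def extract_text(records):
    """Run Trafilatura on each page and keep only pages that yield text."""
    # Imported here so the helper above works without the third-party package.
    import trafilatura  # pip install trafilatura

    docs = []
    for record in records:
        text = trafilatura.extract(record["html"])  # None when nothing is extracted
        if text:
            docs.append({"url": record["url"], "text": text})
    return docs
```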

webscrape_403.ipynb

This notebook uses Cloudscraper to re-download the webpages that initially returned a 403 error.
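A minimal sketch of the retry, assuming the 403 URLs are available as a list. The function name and record layout are hypothetical; cloudscraper.create_scraper() returns a session object with the same get method as a requests session.

```python
def redownload_forbidden(urls, scraper=None, timeout=30):
    """Retry URLs that returned 403, using a Cloudscraper session by default."""
    if scraper is None:
        import cloudscraper  # third-party: pip install cloudscraper

        scraper = cloudscraper.create_scraper()

    results = []
    for url in urls:
        try:
            response = scraper.get(url, timeout=timeout)
            results.append(
                {"url": url, "status": response.status_code, "html": response.text}
            )
        except Exception as exc:  # network errors, timeouts, challenge failures
            results.append({"url": url, "status": None, "html": None, "error": str(exc)})
    return results
```

Passing a scraper argument makes it possible to swap in another session object, for example when testing without network access.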

clean_download_data_403.ipynb

This notebook extracts and cleans the data from the webpages that were re-downloaded after the initial 403 error.

fineweb_clean_data.ipynb

This notebook applies the same cleaning process used on the FineWeb dataset, following these steps:

🔻 FILTER: 😈 URL filter
🔻 FILTER: 👯 Gopher repetition
🔻 FILTER: 🥇 Gopher quality
🔻 FILTER: ⛰ C4 quality
🔻 FILTER: 🍷 FineWeb quality
🔢 TOKENIZER: 📊 Counter
💽 WRITER: 🐿 Jsonl
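The stage names above match the pipeline blocks of the datatrove library, which was used to build FineWeb. Assuming the notebook uses datatrove, the steps could be assembled roughly as follows. The function names, the JSONL input reader, the directory arguments, and the default filter settings are all assumptions rather than the notebook's actual configuration.

```python
def build_pipeline(input_dir, output_dir):
    """Assemble the filter stages in the order listed above (default settings)."""
    # Imported here so the function can be defined without datatrove installed.
    from datatrove.pipeline.filters import (
        C4QualityFilter,
        FineWebQualityFilter,
        GopherQualityFilter,
        GopherRepetitionFilter,
        URLFilter,
    )
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.tokens import TokensCounter
    from datatrove.pipeline.writers.jsonl import JsonlWriter

    return [
        JsonlReader(input_dir),    # assumed input format: JSONL with a "text" field
        URLFilter(),               # drop documents from blocklisted URLs
        GopherRepetitionFilter(),  # drop documents with heavy repetition
        GopherQualityFilter(),     # Gopher quality heuristics
        C4QualityFilter(),         # C4 quality heuristics
        FineWebQualityFilter(),    # FineWeb's additional heuristics
        TokensCounter(),           # count tokens for the pipeline statistics
        JsonlWriter(output_dir),   # write the surviving documents as JSONL
    ]


def run_pipeline(input_dir, output_dir, tasks=1):
    """Run the pipeline locally."""
    from datatrove.executor import LocalPipelineExecutor

    executor = LocalPipelineExecutor(
        pipeline=build_pipeline(input_dir, output_dir), tasks=tasks
    )
    executor.run()
```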

PII_formatter.ipynb

This notebook removes Personally Identifiable Information (PII) such as emails and IP addresses from the dataset.
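For illustration, email addresses and IPv4 addresses can be scrubbed with regular expressions as below. This is a simplified stand-in, not the notebook's code: the patterns, the function name, and the placeholder tokens are assumptions, and IPv6 addresses are not handled.

```python
import re

# Simplified patterns; real-world email and IP matching has more edge cases.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")


def remove_pii(text, email_token="<EMAIL>", ip_token="<IP>"):
    """Replace email addresses and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub(email_token, text)
    return IPV4_RE.sub(ip_token, text)
```

Passing empty strings as the tokens deletes the matches instead of replacing them.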

push_to_hub.ipynb

This notebook pushes the full dataset to Hugging Face and calculates the educational score of the dataset using the FineWeb EDU classifier. Note that the classifier's predictions may not be fully accurate due to the limited amount of Nigerian data the model was likely trained on.
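Scoring a single document could be sketched as below, following the usage pattern on the classifier's model card. The model name HuggingFaceFW/fineweb-edu-classifier refers to the published FineWeb EDU classifier; whether the notebook calls it this way is an assumption, and the function names are hypothetical.

```python
def to_int_score(score):
    """Clamp a raw classifier score to the 0-5 range and round it to an integer."""
    return int(round(max(0.0, min(score, 5.0))))


def score_text(text, model_name="HuggingFaceFW/fineweb-edu-classifier"):
    """Return the raw and integer educational scores for one document."""
    # Third-party: pip install transformers torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Loaded on every call for brevity; load once and reuse when scoring a dataset.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)
    logits = model(**inputs).logits.squeeze(-1).float().detach().numpy()
    score = logits.item()
    return {"score": score, "int_score": to_int_score(score)}
```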

extract_naijaweb_edu.ipynb

This notebook detects the language of the documents and creates two subsets of the dataset: NaijaWeb EDU and NaijaWeb EDU2, using the educational score. This is an attempt to recreate the FineWeb EDU dataset with Nigerian content.
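The subset construction might look like the sketch below. The thresholds are assumptions borrowed from FineWeb EDU, which keeps documents scoring at least 3 and publishes a larger variant with a threshold of 2; the README does not state the values used for NaijaWeb EDU and NaijaWeb EDU2. The record keys int_score and language, and the optional language filter, are likewise hypothetical.

```python
def build_edu_subsets(records, edu_threshold=3, edu2_threshold=2, language=None):
    """Split scored records into two subsets by educational score.

    records: iterable of dicts with an "int_score" key and, optionally, "language".
    language: when set, only records with that detected language are kept.
    """
    edu, edu2 = [], []
    for record in records:
        if language is not None and record.get("language") != language:
            continue
        score = record["int_score"]
        if score >= edu_threshold:
            edu.append(record)
        if score >= edu2_threshold:
            edu2.append(record)
    return edu, edu2
```

With these defaults the stricter subset is contained in the looser one, mirroring how the two FineWeb EDU variants relate.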

If you find this project helpful, consider giving the repo a star. Thank you!

saheedniyi02/Naijaweb
