Turkish Abstractive Text Summarization

Abstract

Text summarization can be defined as “the task of producing a concise and fluent summary while preserving key information content and overall meaning”. There are many studies on this task, but most of them target English. We aimed to build a Turkish abstractive text summarization pipeline from scratch (from crawler to deployment).

Work Plan

Equal Contribution

Action	Oguzhan Sahin	Nida Kapmaz
Crawler	x	x
Scraping		x
Data Preprocessing	x	x
Labelling		x
Encoder-Decoder Model	x
Flask	x	x
Deployment	x

Pipeline

Crawler
- Web crawling is a component of web scraping: the crawler logic finds the URLs to be processed by the scraper code.
- Used the Scrapy library for this task.
- Built a Scrapy crawler for Webtekno.com and collected about 18k news links.
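The link-discovery idea behind the crawler can be sketched without Scrapy. The example below is only an illustration using the Python standard library, not the project's spider: the names `LinkCollector` and `find_article_links` and the `/haber/` path prefix are hypothetical and do not reflect Webtekno's real URL scheme.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_article_links(html, base_url, path_prefix="/haber/"):
    """Return de-duplicated same-site links whose path starts with path_prefix.

    path_prefix is a hypothetical pattern chosen for illustration.
    """
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for link in collector.links:
        parts = urlparse(link)
        same_site = parts.netloc == host
        if same_site and parts.path.startswith(path_prefix) and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

A Scrapy spider would do the same filtering inside its `parse` callback and let the framework handle scheduling, retries, and export.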
Scraping
- Scraped the news text using the links obtained from the crawler.
- Used the requests and bs4 libraries for this task.
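As a rough illustration of the scraping step, the sketch below pulls paragraph text out of an HTML string. It deliberately swaps bs4 for the standard-library `html.parser` and leaves out the download so that it runs offline. The names `ParagraphText` and `extract_article_text` are hypothetical, and treating every `<p>` as article text is an assumption about the page layout.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._buffer = []
        self.paragraphs = []

    def _flush(self):
        # Collapse whitespace and store the paragraph if it is non-empty.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate an unclosed previous <p>
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_article_text(html):
    """Return the page's paragraph text, one paragraph per line."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)
```

With requests and bs4, the same step would typically fetch the page and then select the tags that hold the article body on the target site.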
Labelling
- Since text summarization is a supervised task, the news articles needed to be labelled (summarized).
- Extracted a summary for every article using the TF-IDF method.
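One common way to build extractive labels with TF-IDF is to score each sentence and keep the highest-scoring ones. The sketch below assumes that approach; it is not taken from tf_idf.py. The function names, the naive sentence splitter, and treating each sentence as a "document" for the IDF term are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased word tokens; no Turkish-specific casing or stemming.
    return re.findall(r"\w+", sentence.lower())


def tfidf_summary(text, n_sentences=2):
    """Return the n_sentences highest-scoring sentences in original order."""
    sentences = split_sentences(text)
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        # Sum of (term frequency in sentence) * (inverse sentence frequency).
        score = sum(
            (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        )
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```

Sentences made of words that occur throughout the article score low, so the selected sentences tend to be the ones carrying distinctive terms. The selected text can then serve as the target summary for the supervised model.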
Model
- Fine-tuned a BERT model on this task for 3 epochs.
Flask
- Created HTML and CSS files for this task.
- Created the UI by integrating our model into these HTML files.
Deployment
- As future work, Heroku or Streamlit will be used.

How to run?

If you do not have data, you can run the crawler first. In scrapy_crawlers/spiders/, run the command below:

scrapy crawl webtekno --logfile webtekno.log -o webtekno.json -t jsonlines

Once you run this command, you will have two files (webtekno.log, webtekno.json). webtekno.json contains the URLs. If you want to crawl different URLs, you need to adjust webtekno.py.
In scrapy_crawler/spiders/, parse_json.py parses the JSON files and writes a .csv file as output.
To get the news text, scraping.py takes a CSV file of URLs as input and returns a CSV file of URLs and text.
tf_idf.py is then used to label the news text.
Once you have prepared your data for fine-tuning, you can run the fine-tune.ipynb notebook.
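For the JSON-to-CSV step in the sequence above, a minimal sketch is shown here. It is not the contents of parse_json.py: the function name `jsonlines_to_csv` and the `url` field name are assumptions about what the crawler exports.

```python
import csv
import json


def jsonlines_to_csv(json_path, csv_path, field="url"):
    """Copy one field from a JSON Lines file into a single-column CSV.

    Returns the number of data rows written. `field` is an assumed key name.
    """
    rows = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow([field])  # header row
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if field in record:
                writer.writerow([record[field]])
                rows += 1
    return rows
```

For example, `jsonlines_to_csv("webtekno.json", "urls.csv")` would produce a one-column CSV (the output file name here is hypothetical) that the scraping step could read.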

Results

This repository is created by Oguzhan Sahin and Nida Kapmaz

