Turkish Abstractive Text Summarization

Abstract

Text summarization is "the task of producing a concise and fluent summary while preserving key information content and overall meaning". There are many studies on this task, but most of them target English. We aimed to build a Turkish abstractive text summarization pipeline from scratch, from crawler to deployment.

Work Plan

Equal Contribution

Action                  Oguzhan Sahin   Nida Kapmaz
Crawler                 x               x
Scraping                x
Data Preprocessing      x               x
Labelling               x
Encoder-Decoder Model   x
Flask                   x               x
Deployment              x

Pipeline

  • Crawler
    • Web crawling is the first stage of web scraping: the crawler finds the URLs that the scraper code will process.
    • Used the Scrapy library for this task.
    • Built a Scrapy crawler for Webtekno.com and collected about 18k news links (a minimal spider sketch follows this list).
  • Scraping
    • Scraped the news text using the links obtained from the crawler.
    • The Requests and bs4 (BeautifulSoup) libraries were used for this task.
  • Labelling
    • Since text summarization is a supervised task, the news articles needed to be labelled (summarized).
    • Extracted a summary for every single article using the TF-IDF method.
  • Model
    • Fine-tuned a BERT model for this task for 3 epochs.
  • Flask
    • Created HTML and CSS files for this task.
    • Built the UI by wiring our model into these HTML files (a minimal sketch also follows this list).
  • Deployment
    • As future work, Heroku or Streamlit will be used.
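
For reference, here is a minimal sketch of what the Webtekno spider could look like. The real spider lives in scrapy_crawlers/spiders/webtekno.py; the start URL, CSS selector, and URL filter below are assumptions, not the repository's actual code.

    import scrapy

    class WebteknoSpider(scrapy.Spider):
        # Hypothetical sketch; the real spider may differ.
        name = "webtekno"
        allowed_domains = ["webtekno.com"]
        start_urls = ["https://www.webtekno.com/haber"]  # assumed listing page

        def parse(self, response):
            # Collect candidate article links from the listing page
            # (the selector and the "/haber" filter are assumptions).
            for href in response.css("a::attr(href)").getall():
                if "/haber" in href:
                    yield {"url": response.urljoin(href)}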
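And a minimal sketch of the Flask side, assuming a single route that renders templates/index.html; the route, template, and form-field names are illustrative guesses rather than the repository's actual code.

    from flask import Flask, render_template, request

    app = Flask(__name__)

    def summarize(text):
        # Placeholder for the fine-tuned model; see the encoder-decoder
        # sketch under "How to run?" for how it could be loaded and called.
        return text[:200]  # stub so the app runs end to end

    @app.route("/", methods=["GET", "POST"])
    def index():
        summary = None
        if request.method == "POST":
            # "text" is a hypothetical form field rendered by index.html.
            summary = summarize(request.form.get("text", ""))
        return render_template("index.html", summary=summary)

    if __name__ == "__main__":
        app.run(debug=True)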

How to run?

  • If you do not have data, run the crawler first. In scrapy_crawlers/spiders/, run the script below:
scrapy crawl webtekno --logfile webtekno.log -o webtekno.json -t jsonlines
  • Once you run this script, you will have two files (webtekno.log and webtekno.json). webtekno.json contains the collected URLs. If you want to target different URLs, you need to adjust webtekno.py accordingly.

  • In scrapy_crawler/spiders/, parse_json.py parses the JSON files and outputs a .csv file (a sketch of this step follows).
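
A rough sketch of what this parsing step amounts to; the actual parse_json.py, its filenames, and the "url" field name are assumptions here:

    import csv
    import json

    # Read the crawler's jsonlines output and write the URLs to a CSV file.
    with open("webtekno.json", encoding="utf-8") as src, \
         open("webtekno_urls.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["url"])
        for line in src:
            writer.writerow([json.loads(line)["url"]])  # "url" key is assumed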

  • To get the news text, scraping.py takes a CSV of URLs as input and returns a CSV file with URLs and text (sketched below).
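
A minimal sketch of this scraping step with Requests and BeautifulSoup, assuming the CSV layout above; the real scraping.py likely uses a more specific article-body selector and may use different column names:

    import csv
    import requests
    from bs4 import BeautifulSoup

    with open("webtekno_urls.csv", encoding="utf-8") as src, \
         open("webtekno_text.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["url", "text"])
        for row in csv.DictReader(src):
            html = requests.get(row["url"], timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            # Join all paragraph text; targeting every <p> tag is an
            # assumption, the real script probably scopes to the article body.
            text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
            writer.writerow([row["url"], text])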

  • tf_idf.py is then used to label the news text with extractive summaries (the idea is sketched below).
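
The idea behind TF-IDF labelling, in a hedged sketch (the repo's tf_idf.py may tokenize and score differently): score each sentence by its mean TF-IDF weight and keep the top-scoring sentences in document order.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_summary(text, num_sentences=3):
        # Naive sentence split on periods; a Turkish-aware tokenizer would be better.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        if len(sentences) <= num_sentences:
            return ". ".join(sentences) + "."
        # Score each sentence by the mean TF-IDF weight of its terms.
        tfidf = TfidfVectorizer().fit_transform(sentences)
        scores = np.asarray(tfidf.mean(axis=1)).ravel()
        # Keep the top-scoring sentences, restored to document order.
        top = sorted(np.argsort(scores)[-num_sentences:])
        return ". ".join(sentences[i] for i in top) + "."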

  • Once you have prepared your data for fine-tuning, you can run the fine-tune.ipynb notebook (a sketch of a BERT-based encoder-decoder setup follows).
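
The README says a BERT model was fine-tuned for 3 epochs; one common way to turn BERT into an abstractive summarizer is Hugging Face's EncoderDecoderModel (BERT2BERT). The sketch below assumes the dbmdz/bert-base-turkish-cased checkpoint, which is not confirmed by the repository.

    from transformers import AutoTokenizer, EncoderDecoderModel

    # Assumed Turkish BERT checkpoint; fine-tune.ipynb may use a different one.
    ckpt = "dbmdz/bert-base-turkish-cased"
    tokenizer = AutoTokenizer.from_pretrained(ckpt)

    # Warm-start both the encoder and the decoder from the same BERT weights.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(ckpt, ckpt)

    # Generation needs these special-token ids set explicitly.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    model.config.eos_token_id = tokenizer.sep_token_id

    # From here, tokenize (article, summary) pairs and train for 3 epochs,
    # e.g. with transformers' Seq2SeqTrainer.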

Results

This repository was created by Oguzhan Sahin and Nida Kapmaz.
