Locations of data

/fs/zisa0/commoncrawl

  • 2015_27 raw non-English
  • 2016_30 raw non-English
  • 2017_17 experiments with extracting parallel text:
    • /fs/zisa0/commoncrawl/2017_17/db: CommonCrawl Index DB as described here.
    • /fs/zisa0/commoncrawl/2017_17/baseline: Files for running the ModernMT baseline for parallel corpus extraction. It contains sentence-aligned text for lv-en at /fs/zisa0/commoncrawl/2017_17/baseline/lv_en/lv-en.sent. All other language pairs are only partially finished.
    • /fs/zisa0/commoncrawl/2017_17/lsi: Results of running Ulrich Germann's LSI.
  • deduped data from /fs/nas/heithrun0/commoncrawl/deduped/en, resharded with the sharder from here
  • deduped data files for ca, el, et, is, nl, pt, ro, sk, sl, sv

/fs/freyja0/commoncrawl

  • 2015_06 raw non-English
  • 2015_27 langsplit files
  • 2015_30 langsplit files
  • 2017_17 langsplit files

/fs/mimir0/commoncrawl

  • 2015_06 English raw
  • 2015_11, 2015_14, 2015_18, 2015_22, 2015_27, 2015_32, 2015_35, 2015_40, 2015_48, 2016_50, 2017_17 all raw

/fs/nas/eikthyrnir0/tim/cc

  • 2015_11, 2015_14 English raw
  • deduped files for ar, cs, de, es, fr, it, pl, ru

/fs/meili0/tim/commoncrawl

  • 2015_06 sharded English raw data to feed into the deduper, as described here

/fs/nas/heithrun0/commoncrawl/langsplit

  • langsplit files for all crawls from 2013_20 up to 2015_48 and for 2016_50
  • some scripts and files from Christian which seem to be related to the parallel corpus extraction

/fs/vili0/buck/cc/langsplit2/raw

  • non-English raw files for all 2014 crawls

/fs/vili0/buck/cc/langsplit2 and /fs/vili0/buck/cc/langsplit

  • temporary data between the langsplit files and the raw files for the 2014 and 2015 crawls; a potential candidate for deletion

/fs/vili0/www/data.statmt.org/ngrams

  • home directory of the "data.statmt.org/ngrams" website, contains symbolic links to old raw data

/fs/gna0/buck/cc/db

  • contains RocksDB Index data for all crawls from 2012 to 2015_40 + 2016_50; used in the parallel corpus extraction pipeline

/fs/zisa0/tim

  • /fs/zisa0/tim/bin/xz: version 5.2.3 of XZ Utils which supports multithreading.
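
As a minimal sketch of how the multithreaded xz build listed above might be invoked from a script: the helper names `build_xz_command` and `compress` are hypothetical, and the choice of flags is an assumption rather than a documented workflow for this data. `--threads` and `--keep` are standard XZ Utils options; `--threads=0` asks xz to use as many threads as there are CPU cores.

```python
import subprocess

# Path taken from the list above; everything else here is illustrative.
XZ_BIN = "/fs/zisa0/tim/bin/xz"


def build_xz_command(path, threads=0, keep=True, xz_bin=XZ_BIN):
    """Return the argv list for compressing `path` with multithreaded xz.

    threads=0 lets xz pick one thread per CPU core.
    """
    cmd = [xz_bin, f"--threads={threads}"]
    if keep:
        cmd.append("--keep")  # leave the uncompressed input file in place
    cmd.append(str(path))
    return cmd


def compress(path, **kwargs):
    """Run xz on `path`; raises CalledProcessError if xz exits non-zero."""
    subprocess.run(build_xz_command(path, **kwargs), check=True)
```

Calling `compress("shard_000.txt", threads=8)` (the file name is a placeholder) would write `shard_000.txt.xz` next to the input. Building the argument list separately from running it makes the command easy to inspect or log before touching large files.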