- 2015_27 raw non-english
- 2016_30 raw non-english
- 2017_17 experiments with extracting parallel text:
  - /fs/zisa0/commoncrawl/2017_17/db: CommonCrawl Index DB as described here.
  - /fs/zisa0/commoncrawl/2017_17/baseline: Files for running the ModernMT baseline for parallel corpus extraction. It contains sentence-aligned text for lv-en at /fs/zisa0/commoncrawl/2017_17/baseline/lv_en/lv-en.sent. All other language pairs are only partially finished.
  - /fs/zisa0/commoncrawl/2017_17/lsi: Results of running Ulrich Germann's LSI.
- deduped data from /fs/nas/heithrun0/commoncrawl/deduped/en, resharded with sharder from here
- deduped data files for ca, el, et, is, nl, pt, ro, sk, sl, sv
- 2015_06 raw non-english
- 2015_27 langsplit files
- 2015_30 langsplit files
- 2017_17 langsplit files
- 2015_06 english raw
- 2015_11, 2015_14, 2015_18, 2015_22, 2015_27, 2015_32, 2015_35, 2015_40, 2015_48, 2016_50, 2017_17 all raw
- 2015_11, 2015_14 english raw
- deduped files for ar, cs, de, es, fr, it, pl, ru
- 2015_06 sharded English raw data to feed into the deduper as described here
- langsplit files for all crawls from 2013_20 up to 2015_48 and for 2016_50
- some scripts and files from Christian which seem to be related to the parallel corpus extraction
- non-english raw files for all 2014 crawls
- intermediate data produced between the raw files and the langsplit files for the 2014 and 2015 crawls; a potential candidate for deletion
- home directory of the "data.statmt.org/ngrams" website, contains symbolic links to old raw data
- contains RocksDB Index data for all crawls from 2012 to 2015_40 + 2016_50; used in the parallel corpus extraction pipeline
/fs/zisa0/tim/bin/xz: version 5.2.3 of XZ Utils, which supports multithreading.
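
To make use of the multithreading support when compressing large files, the binary can be called with the `--threads` option. The following is a minimal Python sketch, not an established script from this setup: the helper names `xz_command` and `compress` are hypothetical, and it assumes the binary at the path above is executable on the machine where it runs.

```python
import subprocess

# Path taken from the listing above.
XZ_BIN = "/fs/zisa0/tim/bin/xz"


def xz_command(path, threads=0, keep=True, xz_bin=XZ_BIN):
    """Build an xz command line that compresses `path` using several threads.

    threads=0 asks xz to use as many threads as there are CPU cores.
    (Hypothetical helper, for illustration only.)
    """
    cmd = [xz_bin, f"--threads={threads}"]
    if keep:
        cmd.append("--keep")  # keep the uncompressed input file
    cmd.append(path)
    return cmd


def compress(path, threads=0, keep=True):
    """Run xz on `path`; raises CalledProcessError if xz exits with an error."""
    subprocess.run(xz_command(path, threads, keep), check=True)
```

For example, `compress("some_file.raw", threads=8)` (the file name is a placeholder) would write `some_file.raw.xz` next to the input and leave the original in place. Building the command separately from running it makes it easy to inspect or log the exact invocation first.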