Spanish Gigaword text based POCOLM and RNNLM training recipe #3136

Open

wants to merge 58 commits into kaldi-asr:master from GoVivace:feature/Spanish_gigaword_LM

Commits (58)
801ab93
Merge pull request #1 from kaldi-asr/master
saikiranvalluri Oct 30, 2018
04c4a03
Merge remote-tracking branch 'upstream/master'
GoVivace Dec 8, 2018
ea699b0
Merge remote-tracking branch 'upstream/master'
GoVivace Jan 17, 2019
cbc8eeb
Spanish Gigaword LM recipe
Feb 19, 2019
e8aecbb
Some bug fixes
saikiranvalluri Feb 19, 2019
ece34bd
Update rnnlm.sh
saikiranvalluri Feb 19, 2019
0c4fe47
Combining lexicon words with pocolm wordslist for RNNLM training
Feb 19, 2019
92e241b
merge conflict resolved
Feb 19, 2019
1439b0d
Integrated the 2 stage scientific method POCOLM training for Gigaword…
saikiranvalluri Feb 24, 2019
8ad0e01
Update train_pocolm.sh
saikiranvalluri Feb 26, 2019
f856ac2
Update run.sh
saikiranvalluri Feb 27, 2019
684f029
Text cleaning script for splitting Abbreviation words added
saikiranvalluri Feb 28, 2019
185da3a
Update clean_txt_dir.sh
saikiranvalluri Feb 28, 2019
cb393c8
Update clean_txt_dir.sh
saikiranvalluri Feb 28, 2019
18a9cb6
Update train_pocolm.sh
saikiranvalluri Feb 28, 2019
b023638
Update pocolm_cust.sh
saikiranvalluri Feb 28, 2019
46550f0
Cosmetic fixes
saikiranvalluri Feb 28, 2019
ce3c7d7
Update path.sh
saikiranvalluri Feb 28, 2019
deeaaa7
Bug fix in text normalisation script for gigaword corpus
saikiranvalluri Mar 1, 2019
633f21d
small Fix path.sh
saikiranvalluri Mar 1, 2019
8d6b14d
Update clean_abbrevs_text.py
saikiranvalluri Mar 1, 2019
8c9c37b
Added sparrowhawk installation script for text normalisation
saikiranvalluri Mar 1, 2019
c6b05d1
G2P training stage added into Spanish gigaword recipe
saikiranvalluri Mar 2, 2019
8c226cc
G2P seq2seq scripts added in steps/
saikiranvalluri Mar 2, 2019
7b67fc2
RNNLM scripts updated to UTF8 encoding
saikiranvalluri Mar 2, 2019
4767c7c
Update pocolm_cust.sh
saikiranvalluri Mar 8, 2019
2cd5948
Update run.sh
saikiranvalluri Mar 8, 2019
6595b42
Added steps for generating POCOLM ARPA file
saikiranvalluri Mar 18, 2019
0902c9e
Update run.sh
saikiranvalluri Mar 24, 2019
d8a90ec
Merge branch 'master' into feature/Spanish_gigaword_LM
saikiranvalluri Mar 24, 2019
c10b0fe
Apply g2p part added to get extended lexicon
saikiranvalluri Mar 24, 2019
15a34e8
Merge branch 'feature/Spanish_gigaword_LM' of https://github.com/GoVi…
saikiranvalluri Mar 24, 2019
3df45ae
Small fix in run.sh rnnlm_wordlist
saikiranvalluri Mar 24, 2019
7e47695
Added sanity chack for Sparrowhawk normalizer in cleanup script
saikiranvalluri Mar 25, 2019
91a4611
Data preparation fixes
saikiranvalluri Mar 25, 2019
5f45dd1
Cosmetic options for gigaword textclean
saikiranvalluri Mar 26, 2019
e711d30
Some fixes in rnnlm training
saikiranvalluri Apr 1, 2019
8d521c6
Moved s5_gigaword directory to s5
saikiranvalluri Apr 1, 2019
c57ed95
Merge branch 'master' into feature/Spanish_gigaword_LM
saikiranvalluri Apr 2, 2019
f610470
removed s5_gigaword folder
saikiranvalluri Apr 2, 2019
f810119
Small cleanup for scripts format
saikiranvalluri Apr 2, 2019
dc8a56e
Cosmetic fix
saikiranvalluri Apr 5, 2019
ec0edc5
Merge branch 'master' into feature/Spanish_gigaword_LM
saikiranvalluri Apr 12, 2019
8b8222e
Remove virtenv dependency
saikiranvalluri Apr 18, 2019
0e7afa8
Update path.sh
saikiranvalluri Apr 19, 2019
56d2db9
Update install_sparrowhawk.sh
saikiranvalluri Apr 19, 2019
fb6693e
Set lang to ESP
saikiranvalluri Apr 20, 2019
ce0f420
Set pocolm option - --limit-unk-history=true
saikiranvalluri Apr 23, 2019
9487ce1
Removed unused code
saikiranvalluri Apr 23, 2019
25609c5
Fix in checking for empty space lines in lexicon
saikiranvalluri Apr 23, 2019
510db0f
Fix in RNNLM rescoring decode stage
saikiranvalluri Apr 25, 2019
9894f4c
Update run.sh
saikiranvalluri Apr 26, 2019
3bdb541
Update clean_txt_dir.sh
saikiranvalluri May 20, 2019
6636557
Update run.sh
saikiranvalluri Jun 9, 2019
69b1bca
Merge branch 'master' into feature/Spanish_gigaword_LM
saikiranvalluri Jun 9, 2019
36499a7
Update run.sh
saikiranvalluri Jul 7, 2019
8da5c3e
Reverse the order of Abbreviation process after punct syms
saikiranvalluri Jul 13, 2019
510b415
Update run_norm.sh
saikiranvalluri Aug 21, 2019
38 changes: 0 additions & 38 deletions egs/fisher_callhome_spanish/s5/RESULTS

This file was deleted.

13 changes: 6 additions & 7 deletions egs/fisher_callhome_spanish/s5/local/chain/run_tdnn_1g.sh
@@ -27,6 +27,7 @@ nnet3_affix=  # affix for exp dirs, e.g. it was _cleaned in tedlium.
 affix=1g  # affix for TDNN+LSTM directory e.g. "1a" or "1b", in case we change the configuration.
 common_egs_dir=
 reporting_email=
+gigaword_workdir=
 
 # LSTM/chain options
 train_stage=-10
@@ -254,11 +255,6 @@ if [ $stage -le 21 ]; then
 
 fi
 
-rnnlmdir=exp/rnnlm_lstm_tdnn_1b
-if [ $stage -le 22 ]; then
-  local/rnnlm/train_rnnlm.sh --dir $rnnlmdir || exit 1;
-fi
-
 if [ $stage -le 23 ]; then
   frames_per_chunk=$(echo $chunk_width | cut -d, -f1)
   rm $dir/.error 2>/dev/null || true
@@ -277,8 +273,11 @@ if [ $stage -le 23 ]; then
         --online-ivector-dir exp/nnet3/ivectors_${data}_hires \
         $tree_dir/graph_${lmtype} data/${data}_hires ${dir}/decode_${lmtype}_${data} || exit 1;
       done
-      bash local/rnnlm/lmrescore_nbest.sh 1.0 data/lang_test $rnnlmdir data/${data}_hires/ \
-        ${dir}/decode_${lmtype}_${data} $dir/decode_rnnLM_${lmtype}_${data} || exit 1;
+      if [ $gigaword_workdir ]; then
+        lmtype=fsp_train
+        bash rnnlm/lmrescore_nbest.sh 1.0 data/lang_test $gigaword_workdir/rnnlm data/${data}_hires/ \
+          ${dir}/decode_${lmtype}_${data} $dir/decode_gigaword_RNNLM_${lmtype}_${data} || exit 1;
+      fi
     ) || touch $dir/.error &
   done
   wait
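The new rescoring branch only fires when the script is invoked with a Gigaword work directory. A minimal sketch of the intended call, assuming the usual utils/parse_options.sh option handling and a hypothetical directory name:

# Hypothetical invocation; exp/gigaword_lm stands in for whatever directory
# the Gigaword LM recipe produced (it must contain an rnnlm/ subdirectory
# for the rescoring stage to pick up).
local/chain/run_tdnn_1g.sh --gigaword-workdir exp/gigaword_lm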
35 changes: 35 additions & 0 deletions egs/fisher_callhome_spanish/s5/local/clean_abbrevs_text.py
@@ -0,0 +1,35 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# 2018 Saikiran Valluri, GoVivace inc.

import os, sys
import re
import codecs

if len(sys.argv) < 3:
    print("Usage : python clean_abbrevs_text.py <Input text> <output text>")
    print("        Processes the text before text normalisation to convert uppercase words to space-separated letters")
    sys.exit()

inputfile = codecs.open(sys.argv[1], encoding='utf-8')
outputfile = codecs.open(sys.argv[2], encoding='utf-8', mode='w')

for line in inputfile:
    words = line.split()
    textout = ""
    wordcnt = 0
    for word in words:
        # Split all-uppercase abbreviations, but leave the first
        # alphabetic word of the line intact.
        if re.match(r"\b([A-ZÂÁÀÄÊÉÈËÏÍÎÖÓÔÖÚÙÛÑÇ])+[']?s?\b", word):
            if wordcnt > 0:
                word = re.sub(r"'?s", "s", word)
                textout = textout + " ".join(word) + " "
            else:
                textout = textout + word + " "
        else:
            textout = textout + word + " "
            if word.isalpha(): wordcnt = wordcnt + 1
    outputfile.write(textout.strip() + '\n')

inputfile.close()
outputfile.close()
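A quick sketch of the behavior on a made-up line (file names are placeholders): an all-uppercase token that is not the first alphabetic word is split into space-separated letters, while a sentence-initial one is left alone.

# Hypothetical session:
#   $ echo "ella trabaja en NASA" > in.txt
#   $ python3 local/clean_abbrevs_text.py in.txt out.txt
#   $ cat out.txt
#   ella trabaja en N A S A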
57 changes: 57 additions & 0 deletions egs/fisher_callhome_spanish/s5/local/clean_txt_dir.sh
@@ -0,0 +1,57 @@
#!/bin/bash

# Script to clean up Gigaword LM text.
# Removes punctuation and does case normalization via the Sparrowhawk normalizer.

stage=0
nj=500

. ./path.sh
. ./cmd.sh
. ./utils/parse_options.sh

set -euo pipefail

if [ $# -ne 2 ]; then
  echo "Usage: $0 <textdir> <outdir>"
  exit 1;
fi

# `which` prints nothing when the binary is absent, so test its exit status
# rather than the (possibly empty) string it prints.
if ! which normalizer_main >/dev/null 2>&1; then
  echo "Sparrowhawk normalizer is not installed!"
  echo "Go to $KALDI_ROOT/tools, execute install_sparrowhawk.sh and try again!"
  exit 1
fi

txtdir=$1
textdir=$(realpath $txtdir)
outdir=$(realpath $2)

workdir=$outdir/tmp
if [ $stage -le 0 ]; then
  rm -rf $outdir
  mkdir -p $workdir
  mkdir -p $textdir/splits
  mkdir -p $outdir/data
  split -l 1000000 $textdir/in.txt $textdir/splits/out
  numsplits=0
  for x in $textdir/splits/*; do
    numsplits=$((numsplits+1))
    ln -s $x $outdir/data/$numsplits
  done
  echo $numsplits
  cp $SPARROWHAWK_ROOT/documentation/grammars/sentence_boundary_exceptions.txt .
  $train_cmd --max_jobs_run 100 JOB=1:$numsplits $outdir/sparrowhawk/log/JOB.log \
    local/run_norm.sh \
      sparrowhawk_configuration.ascii_proto \
      $SPARROWHAWK_ROOT/language-resources/esp/sparrowhawk/ \
      $outdir/data \
      JOB \
      $outdir/sparrowhawk/
  cat $outdir/sparrowhawk/*.txt | sed "/^$/d" > $outdir/text_normalized

  # Check whether any digits survived normalization.
  awk '{for(i=1;i<=NF;i++) {if (!seen[$i]) {print $i; seen[$i]=1} }}' \
    $outdir/text_normalized > $outdir/unique_words
  grep "[0-9]" $outdir/unique_words | sort -u > $outdir/numbers
fi
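A sketch of the expected invocation, under the assumption that the flattened corpus has been written to <textdir>/in.txt (the script hard-codes that name) and that Sparrowhawk was installed via tools/install_sparrowhawk.sh:

# Directory names are illustrative only.
local/clean_txt_dir.sh data/gigaword_flat data/gigaword_clean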
6 changes: 3 additions & 3 deletions egs/fisher_callhome_spanish/s5/local/ctm.sh
@@ -19,9 +19,9 @@ fi
 steps/get_ctm.sh $data_dir $lang_dir $decode_dir
 
 # Make sure that channel markers match
-#perl -i -pe "s:\s.*_fsp-([AB]): \1:g" data/dev/stm
-#ls exp/tri5a/decode_dev/score_*/dev.ctm | xargs -I {} perl -i -pe 's:fsp\s1\s:fsp A :g' {}
-#ls exp/tri5a/decode_dev/score_*/dev.ctm | xargs -I {} perl -i -pe 's:fsp\s2\s:fsp B :g' {}
+#sed -i "s:\s.*_fsp-([AB]): \1:g" data/dev/stm
+#ls exp/tri5a/decode_dev/score_*/dev.ctm | xargs -I {} sed -i -r 's:fsp\s1\s:fsp A :g' {}
+#ls exp/tri5a/decode_dev/score_*/dev.ctm | xargs -I {} sed -i -r 's:fsp\s2\s:fsp B :g' {}
 
 # Get the environment variables
 . /export/babel/data/software/env.sh
15 changes: 15 additions & 0 deletions egs/fisher_callhome_spanish/s5/local/flatten_gigaword/flatten_all_gigaword.sh
@@ -0,0 +1,15 @@
#!/usr/bin/env bash
set -e

# Path to Gigaword corpus with all data files decompressed.
export GIGAWORDDIR=$1
# The directory to write output to
export OUTPUTDIR=$2
# The number of jobs to run at once
export NUMJOBS=$3

echo "Flattening Gigaword with ${NUMJOBS} processes..."
mkdir -p $OUTPUTDIR
find ${GIGAWORDDIR}/data/*/* -type f -print -exec local/flatten_gigaword/run_flat.sh {} ${OUTPUTDIR} \;
echo "Combining the flattened files into one..."
cat ${OUTPUTDIR}/*.flat > ${OUTPUTDIR}/flattened_gigaword.txt
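A hypothetical invocation (paths are illustrative): the script takes the decompressed Gigaword root, an output directory, and a job count. Note that NUMJOBS is exported and echoed, but the find -exec loop processes files one at a time, so the count is informational here.

# /path/to/gigaword_es is a placeholder for the decompressed corpus root.
local/flatten_gigaword/flatten_all_gigaword.sh /path/to/gigaword_es data/gigaword_flat 8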
61 changes: 61 additions & 0 deletions egs/fisher_callhome_spanish/s5/local/flatten_gigaword/flatten_one_gigaword.py
@@ -0,0 +1,61 @@
# -*- coding: utf-8 -*-

import logging
import os
import re
import spacy
import gzip

from argparse import ArgumentParser
from bs4 import BeautifulSoup

en_nlp = spacy.load("es")


def flatten_one_gigaword_file(file_path):
    f = gzip.open(file_path)
    html = f.read()
    # Parse the text with BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")

    # Iterate over all <p> items and get the text for each.
    all_paragraphs = []
    for paragraph in soup("p"):
        # Turn inter-paragraph newlines into spaces
        paragraph = paragraph.get_text()
        paragraph = re.sub(r"\n+", "\n", paragraph)
        paragraph = paragraph.replace("\n", " ")
        # Tokenize the paragraph into words
        tokens = en_nlp.tokenizer(paragraph)
        words = [str(token) for token in tokens if not
                 str(token).isspace()]
        if len(words) < 3:
            continue
        all_paragraphs.append(words)
    # Return a list of strings, where each string is a
    # space-tokenized paragraph.
    return [" ".join(paragraph) for paragraph in all_paragraphs]


if __name__ == "__main__":
    log_fmt = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    logging.basicConfig(level=logging.INFO, format=log_fmt)
    logger = logging.getLogger(__name__)

    parser = ArgumentParser(description=("Flatten a gigaword data file for "
                                         "use in language modeling."))
    parser.add_argument("--gigaword-path", required=True,
                        metavar="<gigaword_path>", type=str,
                        help=("Path to Gigaword directory, with "
                              "all .gz files unzipped."))
    parser.add_argument("--output-dir", required=True, metavar="<output_dir>",
                        type=str, help=("Directory to write final flattened "
                                        "Gigaword file."))

    A = parser.parse_args()
    all_paragraphs = flatten_one_gigaword_file(A.gigaword_path)
    output_path = os.path.join(A.output_dir,
                               os.path.basename(A.gigaword_path) + ".flat")
    with open(output_path, "w") as output_file:
        for paragraph in all_paragraphs:
            output_file.write("{}\n".format(paragraph))
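Two environment assumptions worth flagging: the variable is named en_nlp (apparently carried over from an English-corpus original) but actually loads the Spanish pipeline, and spacy.load("es") fails unless the Spanish model has been downloaded first. Under spaCy 2.x that one-time setup would be:

# Run once in the Python environment used by run_flat.sh.
python -m spacy download es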
17 changes: 17 additions & 0 deletions egs/fisher_callhome_spanish/s5/local/flatten_gigaword/run_flat.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env bash
set -e

. ./path_venv.sh

# Path to Gigaword corpus with all data files decompressed.
GIGAWORDPATH=$1
# The directory to write output to
OUTPUTDIR=$2
file=$(basename ${GIGAWORDPATH})
if [ ! -e ${OUTPUTDIR}/${file}.flat ]; then
  echo "flattening to ${OUTPUTDIR}/${file}.flat"
  python local/flatten_gigaword/flatten_one_gigaword.py --gigaword-path ${GIGAWORDPATH} --output-dir ${OUTPUTDIR}
else
  echo "skipping ${file}.flat"
fi

1 change: 1 addition & 0 deletions egs/fisher_callhome_spanish/s5/local/fsp_data_prep.sh
@@ -133,6 +133,7 @@ if [ $stage -le 2 ]; then
   sed 's:</b::g' | \
   sed 's:<foreign langengullís>::g' | \
   sed 's:foreign>::g' | \
+  sed 's:\[noise\]:[noise] :g' | \
   sed 's:>::g' | \
   #How do you handle numbers?
   grep -v '()' | \
5 changes: 3 additions & 2 deletions egs/fisher_callhome_spanish/s5/local/fsp_prepare_dict.sh
@@ -105,8 +105,9 @@ if [ $stage -le 4 ]; then
   cp "$tmpdir/lexicon.1" "$tmpdir/lexicon.2"
 
   # Add prons for laughter, noise, oov
-  w=$(grep -v sil $dir/silence_phones.txt | tr '\n' '|')
-  perl -i -ne "print unless /\[(${w%?})\]/" $tmpdir/lexicon.2
+  for w in `grep -v sil $dir/silence_phones.txt`; do
+    sed -i "/\[$w\]/d" $tmpdir/lexicon.2
+  done
 
   for w in `grep -v sil $dir/silence_phones.txt`; do
     echo "[$w] $w"
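For readers skimming the diff: the new loop is behaviorally equivalent to the deleted perl one-liner. Assuming silence_phones.txt holds sil, laughter, noise and oov (typical for this recipe, but an assumption here), it expands to roughly:

# Drop any existing lexicon entries for the non-sil silence "words";
# the loop that follows re-adds one canonical pronunciation for each.
sed -i "/\[laughter\]/d" $tmpdir/lexicon.2
sed -i "/\[noise\]/d" $tmpdir/lexicon.2
sed -i "/\[oov\]/d" $tmpdir/lexicon.2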
39 changes: 39 additions & 0 deletions egs/fisher_callhome_spanish/s5/local/get_data_weights.pl
@@ -0,0 +1,39 @@
#!/usr/bin/env perl

# Nagendra Kumar Goel

# This takes two arguments:
# 1) Pocolm training output folder
# 2) rnnlm weights file name (for output)

use POSIX;
use List::Util qw[min max];

if (@ARGV != 2) {
  die "Usage: get_data_weights.pl <pocolm-folder> <output-file>\n";
}

$pdir = shift @ARGV;
$out = shift @ARGV;

open(P, "<$pdir/metaparameters") || die "Could not open $pdir/metaparameters";
open(N, "<$pdir/names") || die "Could not open $pdir/names";
open(O, ">$out") || die "Could not open $out for writing";

my %scores = ();

while (<N>) {
  @n = split(/\s/, $_);
  $name = $n[1];
  $w = <P>;
  @w = split(/\s/, $w);
  $weight = $w[1];
  $scores{$name} = $weight;
}

$min = min(values %scores);

for (keys %scores) {
  $weightout = POSIX::ceil($scores{$_} / $min);
  print O "$_\t1\t$weightout\n";
}
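A worked sketch with made-up numbers: names maps integer ids to corpus names, and the script pairs each name with the corresponding line of metaparameters, so the two files must be line-aligned. Each weight is divided by the smallest weight and rounded up, giving integer multiplicities for RNNLM training.

# Hypothetical inputs (field names and values are illustrative):
#   $pdir/names:           1 fisher
#                          2 gigaword
#   $pdir/metaparameters:  count_scale_1 0.8
#                          count_scale_2 0.2
# Resulting output file:
#   fisher    1   4    # ceil(0.8 / 0.2)
#   gigaword  1   1    # ceil(0.2 / 0.2)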
34 changes: 34 additions & 0 deletions egs/fisher_callhome_spanish/s5/local/get_rnnlm_wordlist.py
@@ -0,0 +1,34 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# 2018 Saikiran Valluri, GoVivace inc.

import os, sys

if len(sys.argv) < 5:
    print("Usage: python get_rnnlm_wordlist.py <ASR lexicon words> <POCOLM wordslist> <RNNLM wordslist output> <OOV wordlist>")
    sys.exit()

lexicon_words = open(sys.argv[1], 'r', encoding="utf-8")
pocolm_words = open(sys.argv[2], 'r', encoding="utf-8")
rnnlm_wordsout = open(sys.argv[3], 'w', encoding="utf-8")
oov_wordlist = open(sys.argv[4], 'w', encoding="utf-8")

line_count = 0
lexicon = []

for line in lexicon_words:
    lexicon.append(line.split()[0])
    rnnlm_wordsout.write(line.split()[0] + " " + str(line_count) + '\n')
    line_count = line_count + 1

for line in pocolm_words:
    if not line.split()[0] in lexicon:
        oov_wordlist.write(line.split()[0] + '\n')
        rnnlm_wordsout.write(line.split()[0] + " " + str(line_count) + '\n')
        line_count = line_count + 1

lexicon_words.close()
pocolm_words.close()
rnnlm_wordsout.close()
oov_wordlist.close()
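A sketch of the intended call (paths are illustrative): the ASR lexicon words come first so their indices stay stable, and every pocolm word missing from the lexicon is appended to the RNNLM wordlist and also logged as an OOV, presumably for the later G2P stage. Note that lexicon is a plain list, so the membership test is linear; for a large pocolm vocabulary a set would be noticeably faster.

python3 local/get_rnnlm_wordlist.py data/lang/words.txt \
    gigaword_work/pocolm/wordlist gigaword_work/rnnlm_wordlist \
    gigaword_work/oov_pocolm_words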