Commit 1d079fa

dophist authored and danpovey committed
[egs] Aishell2 recipe: turn off jieba's new word discovery in word segmentation (#2740)
1 parent 396c779 commit 1d079fa

2 files changed: 4 additions and 3 deletions

egs/aishell2/s5/local/prepare_data.sh

Lines changed: 3 additions & 2 deletions
@@ -45,8 +45,9 @@ utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tm
 python -c "import jieba" 2>/dev/null || \
 (echo "jieba is not found. Use tools/extra/install_jieba.sh to install it." && exit 1;)
 utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/trans.txt
-awk '{print $1}' $dict_dir/lexicon.txt | sort | uniq | awk 'BEGIN{idx=0}{print $1,idx++}'> $tmp/vocab.txt
-python local/word_segmentation.py $tmp/vocab.txt $tmp/trans.txt > $tmp/text
+# jieba's vocab format requires word count(frequency), set to 99
+awk '{print $1}' $dict_dir/lexicon.txt | sort | uniq | awk '{print $1,99}'> $tmp/word_seg_vocab.txt
+python local/word_segmentation.py $tmp/word_seg_vocab.txt $tmp/trans.txt > $tmp/text
 
 # utt2spk & spk2utt
 awk -F'\t' '{print $2}' $tmp/wav.scp > $tmp/wav.list
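For readers less familiar with awk, the vocabulary step in this file can be sketched in Python. This is only an illustration, not part of the commit: the function name `jieba_vocab_lines` and its `count` parameter are hypothetical, and the sketch assumes each lexicon line starts with the word followed by whitespace-separated pronunciation fields. Unlike the awk pipeline it skips blank lines, and Python's code-point ordering may differ from the locale-dependent order of `sort`.

```python
def jieba_vocab_lines(lexicon_lines, count=99):
    """Build 'word count' lines for a jieba dictionary file.

    Hypothetical helper mirroring: awk '{print $1}' | sort | uniq | awk '{print $1,99}'
    """
    words = set()
    for line in lexicon_lines:
        fields = line.split()
        if fields:  # skip blank lines (an assumption; awk would pass them through)
            words.add(fields[0])
    # One entry per unique word, each with the same placeholder frequency.
    return ["{} {}".format(word, count) for word in sorted(words)]


if __name__ == "__main__":
    # Made-up lexicon entries, with one word listed under two pronunciations.
    sample = ["你好 n i3 h ao3", "你好 n i2 h ao3", "世界 sh i4 j ie4"]
    for entry in jieba_vocab_lines(sample):
        print(entry)
```

Because every word receives the same count, the frequency column carries no preference between words; it is there only because jieba's dictionary format expects it.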

egs/aishell2/s5/local/word_segmentation.py

Lines changed: 1 addition & 1 deletion
@@ -19,6 +19,6 @@
 jieba.set_dictionary(vocab_file)
 for line in open(trans_file):
   key,trans = line.strip().split('\t',1)
-  words = jieba.cut(trans)
+  words = jieba.cut(trans, HMM=False) # turn off new word discovery (HMM-based)
   new_line = key + '\t' + " ".join(words)
   print(new_line)
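The per-line logic of this script can be sketched as a small function, which may make the change easier to try in isolation. The names `segment_line` and `dictionary_only_cutter` are hypothetical and not part of the commit. The sketch assumes `jieba` is installed when `dictionary_only_cutter` is called, and it relies only on `jieba.set_dictionary` and the `HMM` keyword of `jieba.cut`, both of which appear in the diff above.

```python
def segment_line(line, cut):
    """Turn 'key<TAB>transcript' into 'key<TAB>word1 word2 ...'.

    `cut` is any callable that maps a string to an iterable of words.
    """
    key, trans = line.strip().split('\t', 1)
    return key + '\t' + ' '.join(cut(trans))


def dictionary_only_cutter(vocab_file):
    """Return a cutter backed by jieba with HMM-based new word discovery off."""
    import jieba  # third-party; imported lazily so segment_line has no dependency
    jieba.set_dictionary(vocab_file)
    # HMM=False restricts segmentation to words found in the supplied dictionary.
    return lambda text: jieba.cut(text, HMM=False)


if __name__ == "__main__":
    # A whitespace splitter stands in for jieba so the sketch runs anywhere.
    print(segment_line("utt1\tab cd\n", lambda text: text.split()))
```

With jieba available, `segment_line(line, dictionary_only_cutter(vocab_path))` would follow the same path as the patched script, where `vocab_path` points at a file in the `word count` format.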
