Sphinx Configuration - Improving searchability of automatically transcribed lectures through dy

http://source.cet.uct.ac.za/svn/people/smarquard/datasets/sphinx4-hub4-oyc/sphinx-custom.xml <?xml version="1.0" encoding="UTF-8"?>

<config>

<property name="logLevel" value="INFO"/>

<property name="absoluteBeamWidth" value="30000"/>

</component>

<component name="decoder" type="edu.cmu.sphinx.decoder.Decoder"> <property name="searchManager" value="wordPruningSearchManager"/> </component>

<component name="wordPruningSearchManager"

type="edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager"> <property name="scorer" value="threadedScorer"/>

<component name="activeListManager" type="edu.cmu.sphinx.decoder.search.SimpleActiveListManager"> <propertylist name="activeListFactories"> <item>standardActiveListFactory</item> <item>wordActiveListFactory</item> <item>wordActiveListFactory</item> <item>standardActiveListFactory</item> <item>standardActiveListFactory</item> <item>standardActiveListFactory</item> </propertylist> </component> <component name="standardActiveListFactory" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory"> <property name="logMath" value="logMath"/>

<component name="wordActiveListFactory"

type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory"> <property name="logMath" value="logMath"/>

<component name="lexTreeLinguist"

type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist"> <property name="wantUnigramSmear" value="true"/> <property name="wordInsertionProbability"

value="${wordInsertionProbability}"/>

<component name="dictionary"

type="edu.cmu.sphinx.linguist.dictionary.FastDictionary"> <property name="dictionaryPath"

value="file:models/cmudict/cmudict.0.7a_SPHINX_40"/>

<component name="ngramModel"

type="edu.cmu.sphinx.linguist.language.ngram.large.LargeNGramModel"> <property name="location" value="file:models/hub4-

lm/language_model.arpaformat.DMP"/>

<property name="logMath" value="logMath"/> <property name="dictionary" value="dictionary"/> <property name="wordInsertionProbability"

value="${wordInsertionProbability}"/>

<component name="hub4"

type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel"> <property name="loader" value="sphinx3Loader"/>

<component name="sphinx3Loader"

type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader"> <property name="logMath" value="logMath"/>

</component>

<component name="unitManager"

<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd"> <propertylist name="pipeline"> <item>audioFileDataSource</item> <item>dataBlocker </item> <item>speechClassifier </item> <item>speechMarker </item> <item>nonSpeechDataFilter </item> <item>premphasizer</item> <item>windower</item> <item>fft</item> <item>melFilterBank</item> <item>dct</item> <item>batchCMN</item> <item>featureExtraction</item> </propertylist> </component> <component name="audioFileDataSource" type="edu.cmu.sphinx.frontend.util.AudioFileDataSource"/>

</component>

<component name="dct"

type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>

<component name="batchCMN" type="edu.cmu.sphinx.frontend.feature.BatchCMN"/> <component name="featureExtraction"

type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>    <component name="logMath" type="edu.cmu.sphinx.util.LogMath"> <property name="logBase" value="1.0001"/>

Glossary

AI Artificial Intelligence

AM Acoustic Model

API Application Programming Interface ARPA Advanced Research Projects Agency Arpabet ARPA ASR Alphabet

ASCII American Standard Code for Information Interchange (character set)

ASR Automatic Speech Recognition

CHIL Computers in the Human Interaction Loop, research project CMU Carnegie Mellon University

CMUdict CMU Pronouncing Dictionary CNN Cable News Network (broadcaster)

CSAIL Computer Science and Artificial Intelligence Laboratory at MIT CSPAN Cable-Satellite Public Affairs Network (broadcaster)

ETH Zürich Eidgenössische Technische Hochschule Zürich, science and technology university

EU European Union

FP6 Framework Project 6, a European Union funding programme FST Finite State Transducer

g2p Grapheme-to-Phoneme

HMM Hidden Markov Model

HTML Hypertext Markup Language

HUB4 Language corpus from the 1996 DARPA/NIST Continuous Speech Recognition Broadcast News Hub-4 Benchmark Test, and derived acoustic and language models

IPA International Phonetic Alphabet IR Information Retrieval

KHz Kilohertz, measure of frequency and signal bandwidth LDA Latent Dirichlet Allocation

LL Liberated Learning Project and Consortium

LM Language Model

LSA Latent Semantic Analysis LSI Latent Semantic Indexing

LVCSR Large Vocabulary Continuous (or Connected) Speech Recognition MDE Minimum Discrimant Estimation

MIT Massachusetts Institute of Technology

MP3 Moving Picture Experts Group (MPEG) Audio Layer 3, audio format NIST National Institute of Standards and Technology

NLP Natural Language Processing NPR National Public Radio (broadcaster) OOV Out-of-vocabulary words

OYC Open Yale Courses

PCM Pulse-code modulation, digital audio format PLSA Probabilistic Latent Semantic Analysis

REPLAY Lecture capture system developed at ETH Zürich RT Runtime, execution time

RWCR Ranked Word Correct Rate SaaS Software-as-a-service

SI Speaker-independent

SONIC ASR system developed at Colorado University Sphinx4 ASR system developed at Carnegie Mellon University SUMMIT ASR system developed at CSAIL, MIT

TED Translanguage English Database Corpus

TF-IDF Term Frequency - Inverse Document Frequency

TIMIT Transcribed American English corpus developed by Texas

Instruments (TI) and Massachusetts Institute of Technology (MIT) US / USA United States of America

UTF8 Universal Character Set (UCS) Transformation Format – 8 bit, a Unicode character set

WAV Waveform Audio File Format

WCR Word Correct Rate

WER Word Error Rate

WFST Weighted Finite State Transducer

WSJ Wall Street Journal, newspaper and derived language corpus XML Extensible Markup Language

In document Improving searchability of automatically transcribed lectures through dynamic language modelling (Page 99-104)