Faculty of Electrical Engineering

Czech Technical University in Prague

Speech Processing Laboratory

Department of Circuit Theory, Technická 2, 166 27 Prague 6
Tel: +420 224 352 049, Fax: +420 233 339 805

Who we are

doc. Ing. Petr Pollák, CSc.
Head of the group - topics of research: speech recognition with a focus on robust techniques for recognition in noisy environments, spontaneous speech recognition, special feature extraction techniques, collection of speech databases for training recognition systems, speech enhancement for communication applications, voice activity detection, phonetic segmentation, and phonetic and linguistic knowledge in speech recognition systems.

Ing. Petr Mizera
PhD student - topics of research: speech recognition, artificial neural networks and their applications in speech recognition, phonetic segmentation, articulatory features

Ing. Michal Borský
PhD student - topics of research: speech recognition with the focus on acoustic modelling robustness (discriminative training, adaptation techniques), processing of distorted or compressed (MP3) speech

Master's students working on their diploma theses and participating in the internal CTU grant: Zdeněk Patč, Aleš Brich, Jiří Fiala, Jiří Valíček.

Research description

In general, we deal with the analysis and processing of speech signals, with a focus on speech recognition systems and speech enhancement for communication purposes. The problems we are currently working on can be summarized in the following points:

  • large vocabulary continuous speech recognition with the focus on the processing of distorted speech or spontaneous utterances
  • speech feature extraction with the focus on robustness, and the use of speech production knowledge in recognition systems (articulatory features)
  • optimization of HMM-based acoustic modelling (discriminative and adaptation techniques, combination of ANN and HMM)
  • language modelling for spontaneous and informal speech (lexica with reduced pronunciation, class-based language models)
  • automatic phonetic segmentation
  • the collection and the preparation of speech data
  • voice activity detection

What is it good for

The tasks mentioned above find applications in systems for the automatic transcription of speech into text. Typical examples are dictation systems on PCs, on-line subtitling of video programs, on-line or off-line transcription of audio recordings with possible indexing for archiving, voice-controlled information systems over the telephone line (the simplest voice control replaces tone dialing where it is unavailable; there are also systems that communicate in natural dialogue), and voice control of various consumer devices (frequently in the car environment). Phonetic segmentation algorithms support basic phonetic research and the specialized analysis of pathological speech. Speech enhancement is typically used in communication under adverse background conditions, where noise suppression significantly increases the intelligibility of the speech perceived by the listener at the far end. Noise suppression is also used during feature extraction in robust speech recognition systems (recognition of speech from moving cars, public places, rooms with strong reverberation, speech picked up by a distant microphone, etc.). A voice activity detector is a crucial part of both speech recognition and speech enhancement systems (for start- and end-point detection in command recognition, for estimating background noise characteristics, etc.). The collection and further processing of speech and text corpora is necessary for training recognition systems based on stochastic models or artificial intelligence (neural networks).


Continuous and spontaneous speech recognition

The main purpose of our activities is the optimization of continuous speech recognition with the focus on spontaneous or informal speech. The most important topics are:

  • firstly, the construction of continuous and spontaneous speech recognizers for the Czech language
  • within various evaluations we also work with systems built for other languages (English, Slovak, German, French)
  • language modelling, which gives the decoder information about possible inter-word relations; we work with standard n-gram models (usually bigrams and trigrams), which we try to optimize for more precise recognition of informal speech
  • decoding of continuous speech using available tools based on weighted finite-state transducers (WFST), mainly the KALDI toolkit
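
To illustrate what an n-gram language model provides to a decoder, the following is a minimal sketch of a bigram model with add-one (Laplace) smoothing. It is a toy illustration only and not the laboratory's implementation: the function names, the `<s>` and `</s>` boundary markers, and the choice of smoothing are assumptions made for this example, and real systems typically use more refined smoothing and much larger corpora.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences.

    Hypothetical helper: each sentence is a list of words; "<s>" and "</s>"
    are boundary markers assumed for this sketch.
    """
    histories, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        histories.update(tokens[:-1])                 # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent word pairs
    # Words that can be predicted: all seen words plus the end marker.
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return histories, bigrams, vocab


def bigram_prob(prev, word, histories, bigrams, vocab):
    """P(word | prev) with add-one smoothing (an assumption of this sketch)."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))
```

For example, after training on the two toy sentences `["good", "morning"]` and `["good", "evening"]`, calling `bigram_prob("good", "morning", ...)` gives the smoothed probability of "morning" following "good", and the probabilities of all vocabulary words after a given history sum to one.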

Parameterization of speech signal

Parameterization of the speech signal means computing suitable features, which are then used for classification in later stages of a recognition system. In this field we mainly study:

  • the properties of the basic standard techniques (MFCC or PLP cepstral coefficients) under various conditions, with minor modifications for these purposes
  • techniques working with longer temporal context in a feature vector (TRAP, RASTA, MULTI-RASTA)
  • within our recent activities we also study the properties of articulatory features
  • in the special techniques mentioned above we work with neural networks, which are nowadays used more and more frequently (currently mainly so-called deep neural networks with a higher number of hidden layers)
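
The idea of giving a classifier a longer temporal context can be sketched with simple frame splicing, where each feature vector is concatenated with its neighbours. This is a generic technique shown only for illustration; it is not TRAP, RASTA, or MULTI-RASTA themselves, and the function name, the `context` parameter, and the edge-padding rule are assumptions of this sketch.

```python
def splice_frames(frames, context=2):
    """Concatenate each frame with `context` neighbours on both sides.

    frames: list of per-frame feature vectors (lists of floats), e.g. cepstra.
    Edges are padded by repeating the first or last frame (an assumption).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        vec = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index at the edges
            vec.extend(frames[idx])
        spliced.append(vec)
    return spliced
```

With `context=2` and 13-dimensional input vectors, each output vector has 5 x 13 = 65 components, which is the kind of wider input typically fed to a neural network.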

Robust speech recognition

Reliable recognition of possibly distorted speech is an important prerequisite for using such systems under common (frequently adverse) conditions. Our activities in this field focus on:

  • the study of the influence and possible elimination of additive distortion in the feature extraction phase
  • an important result is the CtuCopy tool, in which various methods of feature extraction (including noise suppression) are implemented and which is compatible with widely used tools such as HTK or KALDI
  • the elimination of disturbance at the level of acoustic modelling, mainly based on a suitably chosen training or adaptation technique
  • we also study the properties of special recognizer architectures such as ANN/HMM, DNN/HMM or TANDEM
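
As an illustration of suppressing additive noise in the feature extraction phase, the following is a minimal sketch of basic magnitude spectral subtraction. It is a textbook technique used here only as an example and does not describe the algorithms implemented in CtuCopy; the function name, the assumption that the first few frames contain only noise, and the `floor` parameter are hypothetical choices for this sketch.

```python
def spectral_subtraction(mag_frames, noise_frames=5, floor=0.01):
    """Subtract an average noise magnitude spectrum from every frame.

    mag_frames: list of magnitude spectra (lists of non-negative floats).
    The noise spectrum is estimated as the mean of the first `noise_frames`
    frames, which are assumed to contain no speech (assumption of this sketch).
    """
    if not mag_frames:
        return []
    lead = mag_frames[:noise_frames]
    n_bins = len(mag_frames[0])
    noise = [sum(f[k] for f in lead) / len(lead) for k in range(n_bins)]
    cleaned = []
    for frame in mag_frames:
        # Keep a small fraction of the original magnitude as a spectral floor
        # so that the result never becomes negative.
        cleaned.append([max(m - n, floor * m) for m, n in zip(frame, noise)])
    return cleaned
```

The cleaned magnitudes would then be passed to the usual filter-bank and cepstral computation. The spectral floor is a common way to limit the artifacts that plain subtraction produces.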

Implementation and software simulations of speech recognition

A considerable part of our activities must be devoted to implementation issues. The most important activities dealing with these problems are summarized in the following points:

  • the use of freely available tools which are continuously under development and which can be used for evaluation, testing, and final versions of speech recognizers (currently mainly KALDI tools)
  • the computing power required for the heavy computations during training and decoding is available in our parallel cluster of Linux computers, on which the majority of our experiments are run
  • we also study the implementation of simple voice-controlled applications on PCs or smartphones

Phonetic segmentation of an utterance

Within this field, we cooperate with experts in phonetics, linguistics, and psycholinguistics who use our systems in their research tasks. We also consult with them on possible optimizations of speech recognition based on speech production knowledge. Our activities focus on:

  • the segmentation of large corpora for the training of neural networks
  • research on pronunciation variability in the analysis of informal speech
  • the creation of a segmentation tool working in an interactive environment within the Praat tool, see figure


Voice activity detection

Within this research area we study the algorithms based on:

  • energy, cepstral, or coherence characteristics
  • stochastic modelling of suitable features (GMM)
  • artificial neural networks
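
The simplest of these approaches, an energy-based detector, can be sketched as follows. This is a minimal illustration under stated assumptions, not the laboratory's detector: the function names, the 160-sample frame length, the 6 dB margin, the smoothing constant `alpha`, and the assumption that the recording starts with non-speech are all hypothetical choices for this example.

```python
import math


def frame_log_energy(samples, frame_len=160):
    """Log energy in dB of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-12))  # avoid log(0)
    return energies


def energy_vad(energies, margin_db=6.0, alpha=0.95):
    """Mark a frame as speech when it exceeds the tracked noise level.

    The noise level starts at the first frame (assumed to be non-speech)
    and is updated by exponential smoothing during non-speech frames only.
    """
    if not energies:
        return []
    noise = energies[0]
    decisions = []
    for e in energies:
        is_speech = e > noise + margin_db
        if not is_speech:
            noise = alpha * noise + (1.0 - alpha) * e
        decisions.append(is_speech)
    return decisions
```

In practice the raw frame decisions are usually post-processed, for example by removing very short speech or pause segments, before they are used for end-point detection or noise estimation.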

Collection of speech and text corpora

Speech and text corpora are necessary for building speech recognizers because they serve as sources of information about the variability of acoustic speech features (speech databases) or about the frequency of occurrence of particular word contexts (text corpora). Within this research field we participated in the collection of several corpora created within European grants or bilateral commercial or research projects. The created databases are currently used in our research, namely:

  • telephone ČÍSLOVKY (NUMERALS) and Czech SpeechDat, both with approx. 1000 speakers (available via http://www.elra.info)
  • Czech version of the SPEECON database, 650 adult speakers and 50 children, from various environments (see http://www.elra.info)
  • 1000 from car environment for TEMIC SDS
  • the corpus with evoked Lombard effect
  • the corpus of lectures from the field of speech and biological signal processing
  • Czech and Slovak version of LC-Star lexica (see http://www.elra.info)
  • the corpus of spontaneous and informal communications, Nijmegen Corpus of Casual Czech (see http://www.mirjamernestus.nl/Ernestus/NCCCz/index.php)

Financial support and contracts

Our research was supported in the past by grants from GAČR (Grant Agency of the Czech Republic, 1996-2011), the Academy of Sciences of the CR (2004-2007), COST (1994-2005), the Research plan of MSM (2005-2011), FRVŠ (2010, 2011), and an internal grant of CTU (2012-2013). We participated in the European projects SpeechDat-E (1999-2000), SPEECON (2002-2003), and LC-StarII (2006-2007). Within bilateral projects we cooperated with industrial companies such as Siemens AG, Muenchen, Germany (1999, 2006), Škoda Mladá Boleslav (2002-2003), and TEMIC-Harman/Becker, Ulm, Germany (2000-2004), and with Radboud University of Nijmegen, Netherlands (2008-2009).

Currently, our research is supported by the internal CTU grant SGS14/191/OHK3/3T/13 (2014-2016); another project is currently under review in the GAČR grant competition.

On a commercial basis, we currently cooperate with the company ZOOM International.

International collaboration

We cooperate with experts from various labs, currently mainly from the following institutions:

Informally, we cooperate also with Czech university speech processing labs, mainly with:

Selected publications

The results of our research have been published in international journals with an impact factor and at several international conferences and workshops. The most important works from the last several years are:

  • Mizera, P. - Pollák, P.: Robust Neural Network-Based Estimation of Articulatory Features For Czech. Neural Network World. 2014, vol. 24, no. 5, p. 463-478. ISSN 1210-0552.
  • Procházka, V. - Pollák, P. - Žďánský, J. - Nouza, J.: Performance of Czech Speech Recognition with Language Models Created from Public Resources. Radioengineering. 2011, vol. 20, no. 4, p. 1002-1008. ISSN 1210-2512.
  • Rajnoha, J. - Pollák, P.: ASR systems in Noisy Environment: Analysis and Solutions for Increasing Noise Robustness. Radioengineering. 2011, vol. 20, no. 1, p. 74-84. ISSN 1210-2512.
  • Vondrášek, M. - Pollák, P.: Methods for Speech SNR Estimation: Evaluation Tool and Analysis of VAD Dependency. Radioengineering. 2005, vol. 14, no. 1, p. 6-11. ISSN 1210-2512.
  • Ernestus, M. - Kockova-Amortova, L. - Pollák, P.: The Nijmegen Corpus of Casual Czech. In Proceedings of the 9th Language Resources and Evaluation Conference. Paris: ELRA - European Language Resources Association, 2014, vol. 1.
  • Kolman, A. - Pollák, P.: Speech reduction in Czech. In LabPhon 14. The 14th Conference on Laboratory Phonology. Tokyo: National Institute for Japanese Linguistics in Tokyo, 2014.
  • Mizera, P. - Pollák, P. - Kolman, A. - Ernestus, M.: Impact of Irregular Pronunciation on Phonetic Segmentation of Nijmegen Corpus of Casual Czech. In Text, Speech, and Dialogue. 17th International Conference, TSD 2014. Heidelberg: Springer, 2014, vol. 1, p. 499-507.
  • Pollák, P. - Borský, M.: Small and Large Vocabulary Speech Recognition of MP3 Data under Real-Word Conditions: Experimental Study. Communications in Computer and Information Science. 2012, vol. 314, p. 409-419.
  • Borský, M. - Pollák, P.: The optimization of PLP feature extraction for LVCSR recognition of MP3 data. In 19th International Conference on Applied Electronics 2014. Pilsen: University of West Bohemia, 2014, p. 55-58.
  • Pollák, P. - Běhunek, M.: Accuracy of MP3 Speech Recognition Under Real-World Conditions. Experimental Study. In Proceedings of SIGMAP 2011 - International Conference on Signal Processing and Multimedia Applications. [CD-ROM]. Sevilla: University of Seville, 2011, vol. 1, p. 5-10.
  • Pollák, P. - Rajnoha, J.: Multi-Channel Database of Spontaneous Czech with Synchronization of Channels Recorded by Independent Devices. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), La Valleta, Malta, 2010.
  • Volín, J. - Pollák, P.: The Dynamic Dimension of the Global Speech-Rhythm Attributes. In Proceedings of Interspeech 2009 [CD-ROM]. Brighton, UK, 2009, p. 1543-1546.
  • Pollák, P. - Rajnoha, J.: Long Recording Segmentation Based on Simple Power Voice Activity Detection with Adaptive Threshold and Post-Processing. In SPECOM 2009 Proceedings. St. Petersburg, Russia, 2009, p. 55-60.
  • Pollák, P. - Volín, J. - Skarnitzl, R.: Phone Segmentation Tool with Integrated Pronunciation Lexicon and Czech Phonetically Labelled Reference Database. In 6th International Conference on Language Resources and Evaluation. Marrakech, Morocco, 2008, vol. 1, p. 1-5.
Group members participated as authors of the monograph "Technology in Voice Communication" (in Czech)
  • Uhlíř, J. - Sovka, P. - Pollák, P. - Hanžl, V. - Čmejla, R.: Technologie hlasových komunikací. 1st ed. Praha: Nakladatelství ČVUT, 2007. 276 p. ISBN 978-80-01-03888-8.

The most important publications are available on-line at the page of Speech Processing Laboratory, in the section Publications

Responsible person: doc. Ing. Milan Polívka, Ph.D.