By defining a mapping between HMM-based synthesis models and ASR-style models, this paper introduces an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for supplementary acoustic models. Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. As a demonstration in the SPLICE algorithm, we generate pseudo-clean features to replace the ideal clean features from one of the stereo channels by using HMM-based speech synthesis.
The technique is based on an HMM-based text-to-speech (TTS) system and the maximum likelihood linear regression (MLLR) adaptation algorithm. Speaker adaptation, which transforms a given set of HMMs to a target speaker or condition, is a successful technique for both automatic speech recognition (ASR) and HMM-based text-to-speech (TTS) synthesis. Hidden Markov models for artificial voice production and. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM statistical. A comparison of supervised and unsupervised cross-lingual speaker adaptation approaches for HMM-based speech synthesis. Hui Liang (1, 2), John Dines (1), Lakshmi Saheer (1, 2); (1) Idiap Research Institute, Martigny, Switzerland; (2) Ecole Polytechnique Fe. Unsupervised speaker adaptation of DNN-HMM by selecting similar speakers for lecture transcription. Masato Mimura and Tatsuya Kawahara, Kyoto University, Academic Center for Computing and Media Studies, Sakyo-ku, Kyoto 606-8501, Japan. Abstract: unsupervised speaker adaptation of deep neural networks (DNN) is investigated for lecture transcription. Since speech has temporal structure and can be encoded as a sequence of spectral vectors spanning the audio frequency range, the hidden Markov model (HMM) provides a natural framework for. Utilizing the at least one of the speech synthesis parameters for the selected subnode for adaptation can include. Unsupervised adaptation for HMM-based speech synthesis, 2003. Flexible speech synthesis based on hidden Markov models, Keiichi Tokuda, Nagoya Institute of Technology, APSIPA ASC 20, Kaohsiung, November 1, 20.
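The MLLR adaptation mentioned above is usually described as an affine transform applied to the Gaussian mean vectors of the HMM states. The following is a minimal sketch of that one step only, assuming a single global transform; the names `apply_mllr_mean_transform`, `adapt_model_means`, `A` and `b` are hypothetical, and estimating the transform from adaptation data is not shown.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def apply_mllr_mean_transform(mean: Vector, A: Matrix, b: Vector) -> Vector:
    """Return the adapted mean A * mean + b (illustrative, pure Python)."""
    if len(A) != len(b) or any(len(row) != len(mean) for row in A):
        raise ValueError("A must be (len(b) x len(mean))")
    # One output dimension per row of A, plus the matching bias term.
    return [sum(a * m for a, m in zip(row, mean)) + bias
            for row, bias in zip(A, b)]


def adapt_model_means(means: List[Vector], A: Matrix, b: Vector) -> List[Vector]:
    """Apply the same (assumed global) transform to every state mean."""
    return [apply_mllr_mean_transform(m, A, b) for m in means]
```

In practice, adaptation toolkits tie transforms to regression classes rather than using one global transform; the sketch keeps a single class for brevity.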
Generating speech from a model has many potential advantages over concatenating waveforms. It will include a brief introduction to speech synthesis, including just enough coverage of the text-processing part of the problem to set the scene. Yamagishi, Junichi (ISCA, 2008-09): it is now possible to synthesise speech using HMMs with a comparable quality to unit-selection techniques. It is created by the HTS working group as a patch to the HTK 18. The use of adaptation to create new voices for speech synthesis makes HMM-based speech synthesis very attractive. Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. Cabral, Trinity College Dublin, Ireland. The ADAPT Centre is funded under the SFI Research Centres Programme (grant rc2106) and is co-funded under the European Regional Development Fund. The adaptation technique automatically controls the number of phone mismatches. A study of speaker adaptation for DNN-based speech synthesis. Analysis of speaker clustering strategies for HMM-based speech synthesis. Rasmus Dall, Christophe Veaux, Junichi Yamagishi, Simon King, The Centre for Speech Technology Research, The University of Edinburgh, U. US6076057A: unsupervised HMM adaptation based on speech. In HMM-based speech synthesis, speaker adaptation techniques can be used to adapt the source model using speech data from target.
Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping, by Keiichiro Oura, Junichi Yamagishi, Mirjam Wester, Simon King and Keiichi Tokuda. An unsupervised, discriminative, sentence-level HMM adaptation based on speech-silence classification is presented. Frequency warping for speaker adaptation in HMM-based speech. Speaker adaptation is one of the most exciting ones. In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken. For unsupervised adaptation of HMM-based speech synthesis. As a statistical parametric approach, the HMM-based framework provides a great deal of.
Index terms: HMM-based speech synthesis, unsupervised. Unsupervised adaptation for HMM-based speech synthesis. Flexible speech synthesis based on hidden Markov models. This paper presents an automatic speech recognition (ASR)-based unsupervised adaptation method for hidden Markov model (HMM) speech synthesis and its quality evaluation.
Speech synthesis based on hidden Markov models (HMM). The HMM/DNN-based speech synthesis system (HTS) has been developed by the HTS working group and others (see Who we are and Acknowledgments). In the EMIME project we have studied unsupervised cross-lingual speaker adaptation. Mar 31, 2020: awesome speech recognition speech synthesis papers. Context adaptive training with factorized decision trees for. This paper presents a technique for synthesizing emotional speech based on an emotion-independent model, which is called the average emotion model. Some aspects of ASR transcription based unsupervised.
Speech database, excitation parameter extraction, spectral. Gales, 1998 [111] and maximum a posteriori (MAP) adaptation (Gauvain, 1994) [112]. Techniques in rapid unsupervised speaker adaptation based on. The application of hidden Markov models in speech recognition. This paper describes an HMM-based speech synthesis system (HTS), in which the speech waveform is generated from the HMMs themselves, and applies it to English speech synthesis using the general speech synthesis architecture of Festival. Deep neural networks (DNNs) have recently been introduced in speech synthesis. China. Speaker adaptation in speech synthesis transforms a source utterance to a target ut.
Data selection and adaptation for naturalness in HMM-based. Adaptation of pitch and spectrum for HMM-based speech. HMM-based speech synthesis, Erica Cooper, CS4706, Spring 2011. Concatenative synthesis. HMM synthesis: a parametric model; can train on mixed data from many speakers; the model takes up a very small amount of space; speaker adaptation. HMMs: some hidden process has generated some visible observation. The discriminative training procedure using a GPD or any other discriminative training algorithm, employed in conjunction with the HMM.
In recent years, the hidden Markov model (HMM) has been successfully applied to acoustic modeling for speech synthesis, and HMM-based parametric speech synthesis has become a mainstream speech synthesis. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM statistical framework for both speech recognition and synthesis. Two-pass decision tree construction for unsupervised. In the HMM-based TTS system, speech synthesis units are modeled by multi-space probability distribution (MSD) HMMs, which can model spectrum and pitch simultaneously in a unified framework. HMM-based speech synthesis mini-tutorial: HMMs are used to generate sequences of speech in a parameterised form; from the parameterised form, we can generate a waveform; the parameterised form contains suf.
In this paper, an investigation of the importance of input features and training data for speaker-dependent (SD) DNN-based speech synthesis is presented. However, it still requires high-quality audio data with a high signal-to-noise ratio and precise labeling. Synthesizer with the HMM-based speech synthesis toolkit (HTS): HTS is a toolkit [17] for building statistical speech synthesizers. Unsupervised speaker adaptation for DNN-based TTS synthesis. In this paper, we present a novel approach to relax the constraint of stereo data, which is needed in a series of algorithms for noise-robust speech recognition. The most popular speaker adaptation approaches in speech synthesis are based on maximum likelihood linear transforms (MLLT) M. Speech synthesis is the artificial production of human speech. Unsupervised adaptation for HMM-based speech synthesis, CORE. I have chosen hidden Markov model based text-to-speech synthesis as my research topic because of its novelty and countless possibilities.
Analysis of unsupervised cross-lingual speaker adaptation for. Unsupervised adaptation for HMM-based speech synthesis, 2008. In this paper, we introduce a method capable of unsupervised adaptation, using only speech from the target speaker without any labelling. Unsupervised intralingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction. M. Gibson, W. Byrne. IEEE Transactions on Audio, Speech, and Language Processing 19 (4), 895-904, 2010.
Automatic speech recognition has been investigated for several decades, and speech recognition models have moved from HMM-GMM to deep neural networks today. Thus, a core goal of EMIME is the development of unsupervised cross-lingual speaker adaptation for HMM-based TTS. This paper describes the integration of these developments into a single architecture which achieves unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. Supervised adaptation: the use of adaptation to create new voices for speech synthesis makes HMM-based speech synthesis very attractive.
Junichi Yamagishi, October 2006. Adaptation for HMM-based speech synthesis system using MLLR. Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao Kobayashi; Tokyo Institute of Technology, Yokohama, 226-8502 Japan. Oct 17, 2012: the task of speech synthesis is to convert normal language text into speech. The purpose of this toolkit is to provide a research and development environment for the progress of speech synthesis using statistical models. Thus, an unsupervised cross-lingual speaker adaptation system can be developed. Oct 14, 2016: a comparison of supervised and unsupervised cross-lingual speaker adaptation approaches for HMM-based speech synthesis. Improving rapid unsupervised speaker adaptation based on HMM sufficient statistics in noisy environments using multi-template models. HMM-based pseudo-clean speech synthesis for the SPLICE algorithm. We have employed an HMM statistical framework for both speech recognition and synthesis, which provides transformation mechanisms to adapt the synthesized voice in TTS (text-to-speech) using the recognized voice in ASR (automatic speech recognition). Unsupervised intralingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction: abstract.
CiteSeerX: unsupervised adaptation for HMM-based speech synthesis. CiteSeerX document details: Isaac Councill, Lee Giles, Pradeep Teregowda. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). Unsupervised clustering for expressive speech synthesis. Speaker adaptation for HMM-based speech synthesis system using MLLR. Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao Kobayashi; Tokyo Institute of Technology, Yokohama, 226-8502 Japan; Nagoya Institute of Technology, Nagoya, 466-8555 Japan. Abstract. Multimodal speech synthesis architecture for unsupervised speaker adaptation, Hieu-Thi Luong and Junichi Yamagishi. This paper demonstrates how unsupervised cross-lingual adaptation of HMM-based speech synthesis models may be performed without explicit knowledge of the adaptation data language. Byrne; Cambridge University Engineering Department, Helsinki University of Technology. Introduction, two-pass decision tree construction, evaluation. Similarly to other data-driven speech synthesis approaches, HTS has a compact language. Hybrid systems basically use HMM alignments to bootstrap themselves into producing recognition, and still use much of the surrounding machinery that HMM-based recognizers used to use. Tokuda, analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping. Context adaptive training with factorized decision trees for HMM-based speech synthesis. Kai Yu, Heiga Zen, Francois Mairesse, and Steve Young; Cambridge University Engineering Department, Trumpington Street, Cambridge, CB2 1PZ, UK. A new journal paper, journal papers, Junichi Yamagishi.
Voice conversion for unit-selection concatenation speech synthesis. [3] Yamagishi, Junichi, Takao Kobayashi, Yuji Nakano, Katsumi Ogata, and Juri Isogai. In recent years, the hidden Markov model (HMM) has been successfully applied to acoustic modeling for speech synthesis, and HMM-based parametric speech synthesis has become a mainstream speech synthesis method. The HMM-based speech synthesis system (HTS) version 2. Speech synthesis based on hidden Markov models, CORE. Unsupervised speaker adaptation of DNN-HMM by selecting. Analysis of unsupervised and noise-robust speaker-adaptive HMM-based speech synthesis systems toward a uni. Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction, M. A text-to-speech (TTS) system converts normal language text into speech. IEICE special issue on statistical modeling for speech processing, E89-D 3. Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis, by John Dines, Hui Liang, Lakshmi Saheer, Matthew Gibson, William Byrne, Keiichiro Oura, Keiichi Tokuda, Junichi Yamagishi, Simon King, Mirjam Wester, Teemu Hirsimaki, Reima Karhila and Mikko Kurimo. Most research into speaker adaptation for HMM-based speech synthesis (or text-to-speech, TTS) has focussed upon the supervised scenario, where transcribed adaptation data is available. Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping, article in Speech Communication 54(6). Speech synthesis based on hidden Markov models and deep.
Generating speech from a model has many potential advantages. Unsupervised adaptation for HMM-based speech synthesis. This is achieved by defining a mapping between HMM-based synthesis models and ASR-style models, via a two-pass decision tree construction process. Finally, listener evaluations reveal that the proposed unsupervised adaptation methods deliver performance approaching that of supervised adaptation. Adapting full-context models: for each full-context-dependent model, we can obtain the corresponding triphone model by ignoring the prosodic contextual factors and dropping some phonetic contextual factors. Consequently, this paper investigates cross-lingual speaker adaptation based on uni. Also, HMMs are generative models, so they are much more useful in the case of speech synthesis; the jury is still out on using deep networks for the synthesis. Frequency warping for speaker adaptation in HMM-based speech synthesis, Weixun Gao (1) and Qiying Cao (1, 2); (1) School of Information Science and Technology, (2) College of Computer Science and Technology, Donghua University, Shanghai, 200051, P. In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis.
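The mapping from a full-context model name to a triphone name described above can be illustrated with a small label-rewriting function. This is only a sketch under stated assumptions: it assumes a quinphone prefix of the form `ll^l-c+r=rr` followed by prosodic fields, in the style commonly seen in HTS label files, and the function name `full_context_to_triphone` is hypothetical. The source does not specify the label layout.

```python
import re

# Assumed layout: left-left ^ left - centre + right = right-right, then
# prosodic context fields introduced by '@' or '/'.
_QUINPHONE = re.compile(
    r"^(?P<ll>[^\^]+)\^(?P<l>[^-]+)-(?P<c>[^+]+)\+(?P<r>[^=]+)=(?P<rr>[^@/]+)"
)


def full_context_to_triphone(label: str) -> str:
    """Drop prosodic factors and the outer phones, keeping l-c+r."""
    match = _QUINPHONE.match(label)
    if match is None:
        raise ValueError(f"unrecognised full-context label: {label!r}")
    # Prosodic fields after the quinphone are ignored; the two outermost
    # phonetic contexts (ll and rr) are dropped.
    return f"{match['l']}-{match['c']}+{match['r']}"
```

For example, under the assumed layout, `full_context_to_triphone("xx^sil-h+e=l@1_2/A:0_0_0")` yields `sil-h+e`. Many full-context labels map to the same triphone, which is what lets an ASR-style triphone transcription index the synthesis models.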
Hidden Markov model (HMM) based speech synthesis for Urdu. US8438029B1: confidence tying for unsupervised synthetic. The patch code is released under a free software license. Speech synthesis based on hidden Markov models and deep learning, Marvin Coto-Jiménez. Furthermore, it was a challenge to pioneer HMM TTS research in Hungary. The core of all speech recognition systems consists of a set of statistical models representing the various sounds of the language to be recognised. The training part of HTS has been implemented as a modified version of HTK and released as a form of patch code to HTK. Listening tests show very promising results, demonstrating that adapted. We proposed a decision tree marginalization technique in [4] for uni. No other constraints need to be placed on the ASR-HMM.
Currently, various organizations use it to conduct their own research projects, and we believe that it has contributed signi. When the ASR-HMM uses Gaussian mixtures, we can use an approximated KLD, Goldberger et al. HMM-based emotional speech synthesis using average emotion. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware products. In the current thesis booklet I summarize the novel outcomes of my research, grouped into the three research objectives. Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. We demonstrate an end-to-end speech-to-speech translation system built for four languages: American English, Mandarin, Japanese, and Finnish. For speech synthesis, a model trained on multiple speakers' data is called an average voice model [6]. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis, which uses an average voice model plus model adaptation, is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly. Such supervised methods require labelled adaptation data for the target speaker.
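The approximated KLD for Gaussian mixtures mentioned above has no closed form, so an approximation is needed. The sketch below shows one commonly cited option, a matching-based approximation that pairs each component of the first mixture with its closest component in the second. It is an assumption that this is the variant meant here, since the source does not say which approximation is used; diagonal covariances and all function names are likewise illustrative.

```python
import math
from typing import Sequence, Tuple

# A diagonal Gaussian as (means, variances); this representation is assumed.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def kl_diag_gauss(p: Gaussian, q: Gaussian) -> float:
    """Closed-form KL(p || q) for diagonal-covariance Gaussians."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(
        math.log(b / a) + (a + (m - n) ** 2) / b - 1.0
        for m, a, n, b in zip(mp, vp, mq, vq)
    )


def kl_gmm_matching(f_weights: Sequence[float], f_comps: Sequence[Gaussian],
                    g_weights: Sequence[float], g_comps: Sequence[Gaussian]) -> float:
    """Matching-based approximation of KL(f || g) between two mixtures."""
    total = 0.0
    for pi_a, f_a in zip(f_weights, f_comps):
        # Pair component a with the component of g that minimises the
        # weight-corrected divergence, then accumulate its contribution.
        best = min(
            kl_diag_gauss(f_a, g_b) + math.log(pi_a / w_b)
            for w_b, g_b in zip(g_weights, g_comps)
        )
        total += pi_a * best
    return total
```

With single-component mixtures the approximation reduces to the exact Gaussian KLD, which gives a quick sanity check.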