Varying speaking styles with neural text-to-speech alexa blogs gas finder app


When people speak, they use different speaking styles depending on context. A TV newscaster, for example, will use a very different style when conveying the day’s headlines than a parent will when reading a bedtime story. electricity in indian villages Amazon scientists have shown that our latest text-to-speech (TTS) system, which uses a generative neural network, can learn to employ a newscaster style from just a few hours of training data. This advance paves the way for Alexa and other services to adopt different speaking styles in different contexts, improving customer experiences.

With the increased flexibility provided by NTTS, we can easily vary the speaking style of synthesized speech. e85 gas stations in iowa For example, by augmenting a large, existing data set of style-neutral recordings with only a few hours of newscaster-style recordings, we have created a news-domain voice. That would have been impossible with previous techniques based on concatenative synthesis.

The first component of the system is a sequence-to-sequence model, meaning that it doesn’t compute a given output solely from the corresponding input but also considers its position in the sequence of outputs. static electricity jokes The spectrograms that the system outputs are mel-spectrograms, meaning that their frequency bands are chosen to emphasize acoustic features that the human brain uses when processing speech.

When trained on the large data sets used to build general-purpose concatenative-synthesis systems, this sequence-to-sequence approach will yield high-quality, neutral-sounding voices. By design, those data sets lack the distinctive prosodic features required to represent particular speech styles. Although high in quality, the speech generated through this approach displays a limited variety of expression (pitch, breaks, rhythm).

We find that we can leverage a large corpus of style-neutral data in the training of a style-specific speech synthesizer by making a simple modification to the sequence-to-sequence model. electricity games We train the model not only with phoneme sequences and the corresponding sequences of mel-spectrograms but with a “style encoding” that identifies the speaking style employed in the training example.

With this approach, we can train a high-quality multiple-style sequence-to-sequence model by combining the larger amount of neutral-style speech data with just a few hours of supplementary data in the desired style. This is possible even when using an extremely simple style encoding, such as a “one-hot” vector — a string of zeroes with a single one at the location corresponding to the selected style.

The key technical advantage of this multiple-style sequence-to-sequence model is its ability to separately model aspects of speech that are independent of speaking style and aspects of speech that are particular to a single speaking style. gas and supply When presented with a speaking-style code during operation, the network predicts the prosodic pattern suitable for that style and applies it to a separately generated, style-agnostic representation. gas 10 ethanol The high quality achieved with relatively little additional training data allows for rapid expansion of speaking styles.

The output of our model passes to a neural vocoder, a neural network trained to convert mel-spectrograms into speech waveforms. In order to be of general practical use, the vocoder must be capable of simulating speech by any speaker in any language in any speaking style. gas stoichiometry practice sheet Typically, neural vocoder systems use some form of speaker encoding (either a one-hot encoding or some other vector representation, or “embedding”). Our neural vocoder takes mel-spectrograms from any speaker, regardless of whether he or she was seen during training time, and generates high-quality speech waveforms without the use of a speaker encoding.

To determine whether the NTTS newscaster style makes a difference to listeners’ experiences, we conducted a large-scale perceptual test. We asked listeners to rate speech samples on a scale from 0 to 100, according to how suitable their speaking styles were for news reading. We included an audio recording of a real newscaster as a reference.

The results are presented in the figure below. Listeners rated neutral NTTS more highly than concatenative synthesis, reducing the score discrepancy between human and synthetic speech by 46%. They preferred NTTS newscaster style to both other systems, however, shrinking the discrepancy by a further 35%. The preference for the neutral-style NTTS reflects the widely reported increase in general speech synthesis quality due to neural generative methods. electricity worksheets ks1 The further improvement for the NTTS newscaster voice reflects our system’s ability to capture a style relevant to the text.