Technology
Tacotron 2
Google's neural network architecture for direct speech synthesis from text using recurrent sequence-to-sequence models and WaveNet vocoders.
Tacotron 2 streamlines the text-to-speech pipeline by mapping character sequences directly to mel-scale spectrograms. The architecture has two stages: a recurrent sequence-to-sequence model with an attention mechanism, which learns the alignment between input characters and output frames, and a modified WaveNet vocoder, which converts the predicted spectrograms into 24 kHz audio. Because it needs no hand-engineered features such as phoneme alignments or linguistic prosody models, the system can be trained directly on text and audio pairs. It achieves a Mean Opinion Score (MOS) of 4.53, close to the 4.58 measured for professionally recorded human speech. It remains a widely used reference architecture for high-fidelity, natural-sounding synthetic voices.
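To make the data flow between the two stages concrete, the following is a minimal sketch of the shapes involved, not an implementation of the model itself. The symbol table and the helper names (`encode_text`, `hop_samples`, `mel_shape`) are hypothetical. The 12.5 ms frame hop and 80 mel channels are assumed values that do not appear in the description above; only the 24 kHz sample rate comes from it.

```python
import string

# Hypothetical symbol table: lowercase letters, space, and basic punctuation.
SYMBOLS = list(string.ascii_lowercase + " .,!?'")
SYMBOL_TO_ID = {ch: i for i, ch in enumerate(SYMBOLS)}

SAMPLE_RATE_HZ = 24_000  # output rate stated in the description
HOP_MS = 12.5            # assumed hop between spectrogram frames
N_MEL_CHANNELS = 80      # assumed number of mel filterbank channels


def encode_text(text):
    """Map characters to integer IDs, dropping characters outside the table."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


def hop_samples(sample_rate_hz=SAMPLE_RATE_HZ, hop_ms=HOP_MS):
    """Number of audio samples the vocoder must produce per spectrogram frame."""
    return int(round(sample_rate_hz * hop_ms / 1000))


def mel_shape(duration_s):
    """Shape (frames, channels) of the spectrogram for audio of a given length."""
    n_frames = int(duration_s * SAMPLE_RATE_HZ) // hop_samples()
    return (n_frames, N_MEL_CHANNELS)


if __name__ == "__main__":
    ids = encode_text("Hi!")
    print("character IDs:", ids)
    print("samples per frame:", hop_samples())
    print("mel shape for 2 s of audio:", mel_shape(2.0))
```

Under these assumptions, the sequence-to-sequence stage turns a short list of character IDs into a much longer sequence of spectrogram frames, and the vocoder then expands each frame into several hundred audio samples. That length mismatch is why the first stage relies on attention rather than a fixed character-to-frame mapping.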