
eSpeakNG

Original author(s): Jonathan Duddington
Written in: C
Developer(s): Reece Dunn
Initial release: February 2006
Stable release: 1.49.0 / September 10, 2016
Repository: github.com/espeak-ng/espeak-ng/

eSpeakNG is a compact open source software speech synthesizer for Linux, Windows and other platforms. It uses a formant synthesis method, providing many languages in a small size. Much of the programming for eSpeakNG's language support is done using rule files with feedback from native speakers.

Because of its small size and support for many languages, it is included as the default speech synthesizer in the NVDA open source screen reader for Windows, as well as on Android, Ubuntu and other Linux distributions. Its predecessor eSpeak was used by Google Translate for 27 languages in 2010; 17 of these were subsequently replaced by commercial voices.

The quality of the language voices varies greatly. In eSpeakNG's predecessor eSpeak, the initial version of each language was based on information found on Wikipedia. Some languages have had more work or feedback from native speakers than others. Most of the people who have helped to improve the various languages are blind users of text-to-speech.

History

In 1995, Jonathan Duddington released the Speak speech synthesizer for RISC OS computers, supporting British English. On 17 February 2006, Speak 1.05 was released under the GPLv2 license, initially for Linux, with a Windows SAPI 5 version added in January 2007. Development on Speak continued until version 1.14, when it was renamed eSpeak.

Development of eSpeak continued from version 1.16 (there was no 1.15 release) with the addition of an eSpeakEdit program for editing and building the eSpeak voice data. These were only available as separate source and binary downloads up to eSpeak 1.24. Version 1.24.02 was the first version of eSpeak to be version controlled, using Subversion, with separate source and binary downloads made available on SourceForge. From version 1.27, eSpeak was updated to use the GPLv3 license. The last official eSpeak release was 1.48.04 for Windows and Linux, 1.47.06 for RISC OS and 1.45.04 for Mac OS X. The last development release was 1.48.15, on 16 April 2015.

On 25 June 2010, Reece Dunn started a fork of eSpeak on GitHub based on the 1.43.46 release. This started as an effort to make eSpeak easier to build on Linux and other POSIX platforms. On 4 October 2015 (six months after the 1.48.15 release of eSpeak), this fork began to diverge more significantly from the original eSpeak.

On 8 December 2015, there were discussions on the eSpeak mailing list about the lack of activity from Jonathan Duddington in the eight months since the last eSpeak development release. These evolved into discussions about continuing development of eSpeak in Jonathan's absence, and resulted in the creation of the espeak-ng (Next Generation) fork, using the GitHub version of eSpeak as the basis for future development.

The espeak-ng fork was started on 11 December 2015. Its first release, 1.49.0, followed on 10 September 2016, containing significant code cleanup, bug fixes, and language updates.

Features

eSpeakNG can be used as a command-line program, or as a shared library.
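As a shared library, the engine is driven through the eSpeak C API. The following is a minimal sketch, assuming the espeak-ng development package installs the eSpeak-compatible speak_lib.h header and that the program is linked against the espeak-ng library:

#include <string.h>
#include <espeak-ng/speak_lib.h>

int main(void)
{
    const char *text = "Hello from the library interface.";

    /* Initialise for direct audio playback; 0 selects the default buffer
       length and NULL the default voice data path. */
    if (espeak_Initialize(AUDIO_OUTPUT_PLAYBACK, 0, NULL, 0) < 0)
        return 1;

    espeak_SetVoiceByName("en");            /* select the English voice */
    espeak_Synth(text, strlen(text) + 1, 0, POS_CHARACTER, 0,
                 espeakCHARS_AUTO, NULL, NULL);
    espeak_Synchronize();                   /* block until speech finishes */
    espeak_Terminate();
    return 0;
}

Compiled with, for example, cc hello.c -lespeak-ng.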

It supports Speech Synthesis Markup Language (SSML).
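For example, with the -m option eSpeakNG interprets the input as markup rather than plain text, so SSML elements such as <prosody> and <break> take effect (a sketch; the supported subset of SSML varies between versions):

espeak-ng -m "<speak>Normal speed, <prosody rate='slow'>then slower,</prosody> <break time='500ms'/> after a pause.</speak>"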

Language voices are identified by the language's ISO 639-1 code. They can be modified by "voice variants". These are text files which can change characteristics such as pitch range, add effects such as echo, whisper and croaky voice, or make systematic adjustments to formant frequencies to change the sound of the voice. For example, "af" is the Afrikaans voice. "af+f2" is the Afrikaans voice modified with the "f2" voice variant which changes the formants and the pitch range to give a female sound.
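On the command line, the variant is appended to the voice name with "+" and selected with the -v option (the sample sentence here is arbitrary):

espeak-ng -v af "Goeie môre, wêreld"
espeak-ng -v af+f2 "Goeie môre, wêreld"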

eSpeakNG uses an ASCII representation of phoneme names which is loosely based on the Kirshenbaum system.

Phonetic representations can be included within text input by including them within double square-brackets. For example:

espeak-ng -v en "Hello [[w3:ld]]"

will say "Hello world" in English.

Synthesis method

eSpeakNG can be used as a text-to-speech translator in different ways, depending on which steps of the text-to-speech pipeline the user wants to use.

Step 1: text-to-phoneme translation

Many languages (notably English) do not have straightforward one-to-one rules between writing and pronunciation; therefore, the first step in text-to-speech generation has to be text-to-phoneme translation.

  1. The input text is translated into pronunciation phonemes (e.g. the input text xerox is translated into zi@r0ks for pronunciation); this step can be checked with the command shown after the list.
  2. The pronunciation phonemes are synthesized into sound (e.g. zi@r0ks is voiced in a monotone way).
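
With the -q option (quiet, no sound output) and -x (write the phoneme mnemonics to stdout), eSpeakNG performs only this first step and prints its phoneme translation instead of speaking it, for example:

espeak-ng -q -x "xerox"

The exact output depends on the version and voice data, but for English it is a phoneme string along the lines of z'i@r0ks.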

To add intonation to speech, prosody data are necessary (e.g. the stress of a syllable, falling or rising pitch of the fundamental frequency, pauses, etc.), along with other information that makes it possible to synthesize more natural, non-monotonous speech. In eSpeakNG's format, a stressed syllable is marked using an apostrophe: z'i@r0ks, which produces more natural speech with intonation.

For comparison, two samples with and without prosody data:

  1. [[DIs Iz m0noUntoUn spi:tS]] is spoken in a monotone way
  2. [[DIs Iz 'Int@n,eItI2d sp'i:tS]] is spoken with intonation
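
Both samples can be reproduced by passing the phoneme strings from the list above directly to the command-line program:

espeak-ng "[[DIs Iz m0noUntoUn spi:tS]]"
espeak-ng "[[DIs Iz 'Int@n,eItI2d sp'i:tS]]"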

If eSpeakNG is used only to generate prosody data, that data can be used as input for MBROLA diphone voices.
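
For example, the --pho option writes MBROLA PHO data to standard output instead of producing sound (a sketch, assuming the mb-en1 MBROLA voice definition is installed):

espeak-ng -v mb-en1 -q --pho "Hello world"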

Step 2: sound synthesis from prosody data

eSpeakNG provides two different types of formant synthesis, using its own eSpeakNG synthesizer and a Klatt synthesizer.

The eSpeakNG synthesizer creates voiced speech sounds such as vowels and sonorant consonants by adding together sine waves to make the formant peaks. Unvoiced consonants such as /s/ are made by playing recorded sounds. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded unvoiced sound.
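
The additive idea can be illustrated with a toy sketch that is in no way eSpeakNG's actual code: it sums sine waves at the harmonics of an assumed 120 Hz pitch, weights each harmonic by its distance from three illustrative formant frequencies, and writes raw 16-bit samples to stdout.

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double rate = 22050.0, f0 = 120.0;   /* sample rate and pitch (Hz) */
    const double formants[3] = { 700.0, 1200.0, 2600.0 }; /* rough /a/ values */
    const double two_pi = 6.283185307179586;

    for (int n = 0; n < (int)rate; n++) {      /* one second of audio */
        double t = n / rate, sample = 0.0;
        for (int h = 1; h * f0 < rate / 2; h++) {  /* each harmonic of f0 */
            double freq = h * f0, amp = 0.0;
            for (int k = 0; k < 3; k++) {      /* boost harmonics near formants */
                double d = (freq - formants[k]) / 150.0;
                amp += exp(-d * d);
            }
            sample += amp * sin(two_pi * freq * t);
        }
        short s = (short)(1000.0 * sample);    /* scale to 16-bit PCM */
        fwrite(&s, sizeof s, 1, stdout);
    }
    return 0;
}

Piped into a raw-audio player (e.g. aplay -f S16_LE -r 22050 on Linux), this produces a buzzy, vowel-like tone; eSpeakNG's real synthesizer additionally varies the formants and pitch over time.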

The Klatt synthesizer mostly uses the same formant data as the eSpeakNG synthesizer. It produces voiced sounds by starting with a waveform which is rich in harmonics (simulating the vibration of the vocal cords) and then applying digital filters in order to produce speech sounds.
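
The core of that filtering step is the classic two-pole resonator from Klatt's 1980 formant synthesizer design, one per formant. A sketch (the frequency and bandwidth values a caller would pass are illustrative, not eSpeakNG's data):

#include <math.h>

typedef struct { double a, b, c, y1, y2; } resonator;

/* Derive the difference-equation coefficients from a formant's centre
   frequency f and bandwidth bw (both Hz) at the given sample rate. */
static void resonator_init(resonator *r, double f, double bw, double rate)
{
    const double pi = 3.141592653589793, t = 1.0 / rate;
    r->c = -exp(-2.0 * pi * bw * t);
    r->b = 2.0 * exp(-pi * bw * t) * cos(2.0 * pi * f * t);
    r->a = 1.0 - r->b - r->c;
    r->y1 = r->y2 = 0.0;
}

/* Filter one sample of a harmonic-rich source (e.g. an impulse train
   standing in for the glottal waveform):
   y[n] = A*x[n] + B*y[n-1] + C*y[n-2]. */
static double resonator_step(resonator *r, double x)
{
    double y = r->a * x + r->b * r->y1 + r->c * r->y2;
    r->y2 = r->y1;
    r->y1 = y;
    return y;
}

Cascading one such resonator per formant over the source waveform shapes its spectrum into the formant peaks of the desired speech sound.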

For the MBROLA voices, eSpeakNG converts the text to phonemes and associated pitch contours. It passes these to the MBROLA program using the PHO file format, capturing the audio that MBROLA outputs. That audio is then handled by eSpeakNG.
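
A PHO file is plain text: one phoneme per line, followed by its duration in milliseconds and optional pairs of (position within the phoneme in percent, pitch in Hz) describing the pitch contour. A hand-written sketch of such input (the values are illustrative):

_   50
h   60   10 110
@   80   50 115
l   70
oU  200  20 120  80 100
_   50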

Languages

eSpeakNG provides text-to-speech synthesis for a large number of languages.
