Speech translation

Updated on Apr 25, 2026

Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language. This differs from phrase translation, in which the system translates only a fixed, finite set of phrases that have been manually entered into it. Speech translation technology enables speakers of different languages to communicate, which makes it valuable for science, cross-cultural exchange, and global business.

How it works

A speech translation system typically integrates three software technologies: automatic speech recognition (ASR), machine translation (MT), and speech synthesis (text-to-speech, TTS).

The speaker of language A speaks into a microphone, and the speech recognition module recognizes the utterance by comparing the input with a phonological model built from a large corpus of speech data from multiple speakers. The input is then converted into a string of words using the dictionary and grammar of language A, which are based on a massive corpus of text in language A.

The machine translation module then translates this string. Early systems replaced every word with a corresponding word in language B. Current systems do not translate word for word; instead, they take the entire context of the input into account to generate an appropriate translation. The translated utterance is sent to the speech synthesis module, which estimates the pronunciation and intonation matching the string of words based on a corpus of speech data in language B. Waveforms matching the text are selected from this database, and the speech synthesis module concatenates and outputs them.
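The three stages described above can be sketched as a simple chain. This is a minimal illustration, not a real system: the class `SpeechTranslator` and the three `toy_*` functions are hypothetical stand-ins, audio is represented as plain bytes, and the "translation" is a one-entry lookup table used only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; a real system would wrap actual engines.
Recognizer = Callable[[bytes], str]    # audio in language A -> text in language A
Translator = Callable[[str], str]      # text in language A  -> text in language B
Synthesizer = Callable[[str], bytes]   # text in language B  -> audio in language B


@dataclass
class SpeechTranslator:
    """Chains ASR, MT, and TTS in the order described in the text."""
    recognize: Recognizer
    translate: Translator
    synthesize: Synthesizer

    def run(self, audio: bytes) -> bytes:
        text_a = self.recognize(audio)      # speech recognition
        text_b = self.translate(text_a)     # machine translation
        return self.synthesize(text_b)      # speech synthesis


# Toy stand-ins (assumptions for illustration only).
def toy_recognize(audio: bytes) -> str:
    # Pretend the audio bytes already contain the spoken words.
    return audio.decode("utf-8")


def toy_translate(text: str) -> str:
    # A fixed lookup table; unknown input is passed through unchanged.
    table = {"hello": "hola"}
    return table.get(text.lower(), text)


def toy_synthesize(text: str) -> bytes:
    # Pretend that encoding the text produces a waveform.
    return text.encode("utf-8")


pipeline = SpeechTranslator(toy_recognize, toy_translate, toy_synthesize)
output_audio = pipeline.run(b"Hello")
```

Keeping each stage behind a plain callable means any one module can be swapped out without touching the other two, which mirrors the modular description above.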

History

In 1983, NEC Corporation demonstrated speech translation as a concept exhibit at the ITU Telecom World (Telecom '83).

The first individual generally credited with developing and deploying a commercialized speech translation system capable of translating continuous free speech is Robert Palmquist, who released an English-Spanish large-vocabulary system in 1997. This effort was funded in part by the Office of Naval Research. To further develop and deploy speech translation systems, he formed SpeechGear in 2001; the company holds broad patents covering speech translation systems.

In 1999, the C-Star-2 consortium demonstrated speech-to-speech translation of five languages: English, Japanese, Italian, Korean, and German.

In 2015, Blabber Messenger, a speech translator for 23 languages, was developed.

Features

In addition to the problems of text translation, speech-to-speech translation has to deal with problems specific to speech: the incoherence of spoken language, its looser grammatical constraints, unclear word boundaries, the need to correct speech recognition errors, and multiple optional inputs. Speech-to-speech translation also has advantages over text translation, including the less complex structure and smaller vocabulary of spoken language.

Research and development

Research and development has gradually progressed from relatively simple to more advanced translation. International evaluation workshops were established to support the development of speech translation technology. They allow research institutes to cooperate and compete with each other at the same time. The workshops are run as a kind of contest: the organizers provide a common dataset, and the participating research institutes build systems that are then evaluated on it. This format promotes efficient research.
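The contest format can be illustrated with a small scoring sketch. The text does not say which metric the workshops use; word error rate (WER), a common measure for recognition output, is assumed here purely for illustration. The function names, system names, and sentences below are all made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def rank_systems(references, submissions):
    """Score every submission on the shared references; lowest mean WER first."""
    scores = {}
    for name, outputs in submissions.items():
        rates = [word_error_rate(r, h) for r, h in zip(references, outputs)]
        scores[name] = sum(rates) / len(rates)
    return sorted(scores.items(), key=lambda item: item[1])


# Common dataset provided by the (hypothetical) organizers.
references = ["where is the station", "i would like a coffee"]

# Outputs submitted by two (hypothetical) participating systems.
submissions = {
    "system_a": ["where is the station", "i would like coffee"],
    "system_b": ["where is station", "i like a coffee"],
}

ranking = rank_systems(references, submissions)
```

Because every system is scored against the same references, the resulting ranking is directly comparable across institutes, which is the point of providing a common dataset.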

The International Workshop on Spoken Language Translation (IWSLT), organized by C-STAR, an international consortium for research on speech translation, has been held since 2004. “Every year, the number of participating institutes increases, and it has become a key event for speech translation research.”

Standards

When many countries begin to research and develop speech translation, it will be necessary to standardize interfaces and data formats to ensure that the systems are mutually compatible. International joint research is being fostered by speech translation consortiums (e.g. the C-STAR international consortium for joint research of speech translation and A-STAR for the Asia-Pacific region). They were founded as “international joint-research organization[s] to design formats of bilingual corpora that are essential to advance the research and development of this technology (...) and to standardize interfaces and data formats to connect speech translation module internationally”.

Applications

Today, speech translation systems are used throughout the world, for example in medical facilities, schools, police departments, hotels, retail stores, and factories. These systems are applicable anywhere spoken language is used to communicate. A popular application is Jibbigo, which works offline.

Challenges and future prospects

Speech translation technology is currently available in products that instantly translate free-form, multilingual conversations in continuous speech. Challenges in accomplishing this include speaker-dependent variations in speaking style and pronunciation, which have to be dealt with in order to provide high-quality translation for all users. Moreover, speech recognition systems must cope with external factors such as acoustic noise or speech by other speakers in real-world use.

Because the user does not understand the target language when speech translation is used, a method "must be provided for the user to check whether the translation is correct, by such means as translating it again back into the user's language". To achieve the goal of erasing the language barrier worldwide, multiple languages have to be supported. This requires speech corpora, bilingual corpora, and text corpora for each of the estimated 6,000 languages spoken today.
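The back-translation check mentioned in the quotation can be sketched as a round trip. This is only an illustration under stated assumptions: `round_trip`, `RoundTrip`, and the two toy translators are hypothetical names, and the lookup tables stand in for real machine translation engines.

```python
from typing import Callable, NamedTuple

Translator = Callable[[str], str]


class RoundTrip(NamedTuple):
    source: str            # what the user said
    translation: str       # what the listener will hear
    back_translation: str  # shown to the user for confirmation


def round_trip(text: str, forward: Translator, backward: Translator) -> RoundTrip:
    """Translate into the target language, then back into the user's language."""
    translation = forward(text)
    return RoundTrip(text, translation, backward(translation))


# Toy lookup tables (assumptions for illustration only).
EN_TO_ES = {"good night": "buenas noches"}
ES_TO_EN = {target: source for source, target in EN_TO_ES.items()}


def toy_en_to_es(text: str) -> str:
    return EN_TO_ES.get(text.lower(), text)


def toy_es_to_en(text: str) -> str:
    return ES_TO_EN.get(text.lower(), text)


result = round_trip("Good night", toy_en_to_es, toy_es_to_en)
```

The user would be shown `result.back_translation` before the translated speech is played. A matching back-translation is a useful signal but not proof of correctness, since errors in the two directions can cancel out.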

As the collection of corpora is extremely expensive, collecting data from the Web would be an alternative to conventional methods. “Secondary use of news or other media published in multiple languages would be an effective way to improve performance of speech translation.” However, “current copyright law does not take secondary uses such as these types of corpora into account” and thus “it will be necessary to revise it so that it is more flexible.”

References

Speech translation Wikipedia

(Text) CC BY-SA
