Localisation, eCommerce, Travel, Enterprise and Government
Gregory Binger, Dion Wiggins, Bob Hayward
Singapore, Thailand, The Netherlands
Gregory Binger, Dion Wiggins, Philipp Koehn, Andrew Rufener
Language Studio Machine Translation and Language Processing Platform
Omniscien Technologies (formerly Asia Online) is a privately owned company delivering machine translation and language processing software and services. The company is backed by individual investors and institutional venture capital. It is headquartered in Singapore, with R&D operations in Bangkok, Thailand, and European operations based in The Hague, The Netherlands. The firm was founded in 2007 by Prof. Dr. Philipp Koehn, a leading researcher in statistical machine translation; Gregory Binger, a technologist and IT/IP lawyer; and former Gartner senior analysts Bob Hayward and Dion Wiggins.
The firm delivers professional machine translation solutions for the localisation industry as well as for government, eCommerce and large enterprise customers, based on statistical machine translation (SMT) technology and the emerging neural machine translation (NMT) technology. Omniscien Technologies supports more than 540 language pairs in 12 industry domains.
The firm's statistical and neural translation software employs recent advances in automated translation as well as extensive data manufacturing technologies. Until the early 1990s, almost all production-level machine translation technology relied on collections of linguistic rules to analyze the source sentence and then map its syntactic and semantic structure into the target language. The firm's current approach instead uses statistical techniques that originated in cryptography, together with neural techniques, applying machine learning algorithms that automatically acquire translation models from existing parallel collections of human translations. This is the same general approach used by Google Translate and by systems built with Moses, Koehn's own open source tool for SMT.
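To make the idea of acquiring a statistical model from parallel text more concrete, the following is a minimal sketch of IBM Model 1, a textbook word-translation model trained with expectation-maximisation. It is an illustration only and is not claimed to be Omniscien Technologies' or Moses' implementation; the function name train_ibm_model1, the toy sentence pairs and the iteration count are assumptions made for the example.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word).

    `pairs` is a list of (source_tokens, target_tokens) tuples.
    Hypothetical helper for illustration; no NULL word, no smoothing.
    """
    target_vocab = {word for _, tgt in pairs for word in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform distribution

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word

        # E-step: share each target word among the source words
        # of the same sentence pair, in proportion to current t.
        for src, tgt in pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac

        # M-step: re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]

    return t


if __name__ == "__main__":
    # Toy English-German corpus (assumed data, already tokenised).
    corpus = [
        ("the house".split(), "das Haus".split()),
        ("the book".split(), "das Buch".split()),
        ("a book".split(), "ein Buch".split()),
    ]
    model = train_ibm_model1(corpus)
    for tw in ("das", "Haus", "Buch"):
        print(f"P({tw} | the) = {model[(tw, 'the')]:.3f}")
```

Even on three sentence pairs, the model learns that "das" is the most likely translation of "the" purely from co-occurrence, with no linguistic rules. Production SMT systems build phrase tables, language models and reordering models on top of alignments of this kind.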
Google, Microsoft, SDL Language Weaver and others have also created SMT and, more recently, NMT systems, some of them publicly accessible. Omniscien Technologies' approach differs in the following ways:
Clean data: The traditional approach leveraged content found on the web in corporate sites, news articles and similar sources where the same content was available in multiple languages, which yields low-quality data. The company has focused machine and human resources on ensuring that its data is as clean and accurate as possible. The data is sourced from high-quality translations provided by book publishers and translation companies, aligned at the segment level (usually sentences) and converted into a consistent format so that it can be processed by the learning software. This step includes extracting segments from files and documents that are not in TMX format. The extracted segments are then aligned and processed by machine, with humans validating the accuracy. The data is converted to a base UTF-8 encoding for training the SMT system, small subsets are extracted to guide training, and finally the data is reviewed, cleaned and analyzed.
Multiple domains: The system can be trained for many domains by extending a base set of information with multiple additional learning sources, including tuning for a specific writing style.
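The data preparation steps described under "Clean data" can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the company's actual pipeline: the function names (read_tmx_pairs, clean_pairs, hold_out, write_utf8), the character-length ratio threshold and the tuning-set size are all hypothetical, and the sketch assumes the input is already segment-aligned TMX.

```python
import random
import unicodedata
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Extract (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segs[lang] = "".join(seg.itertext())
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


def clean_pairs(pairs, max_ratio=3.0):
    """Normalise text and drop empty, duplicate or badly mismatched pairs."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        # Unicode NFC normalisation plus whitespace collapsing.
        src = " ".join(unicodedata.normalize("NFC", src).split())
        tgt = " ".join(unicodedata.normalize("NFC", tgt).split())
        if not src or not tgt:
            continue
        # Crude misalignment filter; the threshold is an assumption and
        # would need tuning per language pair.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio or (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


def hold_out(pairs, tune_size, seed=0):
    """Split into (training, tuning) sets; the small tuning set guides training."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[tune_size:], shuffled[:tune_size]


def write_utf8(pairs, src_path, tgt_path):
    """Write line-aligned source and target files in UTF-8."""
    with open(src_path, "w", encoding="utf-8") as fs, \
         open(tgt_path, "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            fs.write(src + "\n")
            ft.write(tgt + "\n")
```

A typical call sequence would be read_tmx_pairs, then clean_pairs, then hold_out, then write_utf8 for each resulting set. The human validation step described above is deliberately left out, since it cannot be automated in a sketch like this.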
The firm currently has more than 540 language pairs available in a baseline form and is progressively deploying 12 domains across each language pair. In addition, Omniscien Technologies offers in excess of 100 Industry Engines that can be used "off the shelf". Currently supported languages are the Asian languages: Arabic, Burmese, Chinese, Hindi, Indonesian, Japanese, Korean, Malay, Tagalog, Thai and Vietnamese; and the European languages: Bulgarian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovene, Spanish, Swedish and Ukrainian. The additional Asian languages Bengali, Gujarati, Punjabi, Tamil and Urdu are currently under development.