Story by DON SAMBANDARAKSA
One year ago, Asia Online invited Professor Philipp Koehn from the University of Edinburgh's School of Informatics to help perfect Statistical Machine Translation (SMT) for Thai and many other Asian languages. In an exclusive interview, Koehn explained how SMT first arose as an IBM project in the late 1980s, translating between French and English.
Instead of the usual rules and grammatical structure, SMT uses statistics: it pairs up sentences in two languages, collections known as parallel corpora, and learns from them how sentences are translated. Essentially, the system can learn Thai by being fed, for example, copies of Harry Potter in English and the translated Thai versions to analyse.
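How a system can learn translations from nothing but paired sentences can be illustrated with a toy version of IBM Model 1, the simplest of the word-alignment models to come out of IBM's early research. The sketch below is an illustration only and not Asia Online's or Professor Koehn's code: the function names, the three invented English-German sentence pairs and the iteration count are all assumptions, and the NULL word of the full model is left out.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate word translation probabilities t(f | e) with EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict mapping (target_word, source_word) to a probability.
    """
    src_vocab = {e for src, _ in pairs for e in src}
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start with no knowledge: every target word is equally likely.
    t = {(f, e): 1.0 / len(tgt_vocab) for e in src_vocab for f in tgt_vocab}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # Re-estimate the probabilities from the expected counts.
        for (f, e) in t:
            t[(f, e)] = count[(f, e)] / total[e] if total[e] else 0.0
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


# Invented toy corpus (an assumption for illustration only).
toy_pairs = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
table = train_ibm_model1(toy_pairs)
```

On this toy data, best_translation(table, "book") settles on "Buch" and best_translation(table, "house") on "Haus", although no dictionary or grammar rule was ever supplied. Real systems work on phrases rather than single words and on millions of sentence pairs.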
The system works much better than conventional rule-based systems as few languages have words that map directly to another. "Take the phrase 'interest rate.' Interest has a lot of different meanings, rate is also an amorphous word that has many meanings, but interest rate together has a very definite translation. Local context helps a lot," Koehn explained. Another example is refuse (the verb, meaning turn down) and refuse (the noun, meaning trash). These give surprisingly few problems.
The challenges lie with different sentence structures. Japanese and German, for instance, put the verb at the end of the sentence. German also has complex morphology, with words merging into huge monster words. A bigger problem is with languages that leave out information altogether, such as tense, subjects or objects, which then has to be gathered from the surrounding sentences.
Professor Philipp Koehn of the University of Edinburgh shows off the EuroMatrix, a statistical machine translation (SMT) project used to translate to and from each of the European Union's languages. Today he is helping perfect Thai SMT with Asia Online. (DON SAMBANDARAKSA)
Away from his work at Asia Online and the university, Koehn is also working on real-time voice translation for Chinese and Arabic for DARPA, the US Defence Advanced Research Projects Agency. The system is already deployed in Iraq and has by far the most mature SMT engine available, trained on parallel corpora of over 200 million sentence pairs. The system starts to work well with 20 million sentence pairs and gives a good result with 40 million, according to Koehn.
Another project is in the European Union. The EU's 25 official languages mean over 600 language pairs that every document needs to be translated to and from. One benefit is the high quality of existing legal documents, which can be used to train the SMT engine.
For Thai, one of the unique challenges has been to create a word segmentation pre-processor, as Thai does not have breaks between words or full stops. Today, Asia Online has hired fresh graduates from Chulalongkorn University's computational linguistics programme and is working with researchers from Chulalongkorn, Thammasat, Kasetsart and Nectec (the National Electronics and Computer Technology Centre).
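To give a sense of what a word segmentation pre-processor has to do, here is a minimal sketch of greedy longest-match segmentation, a common dictionary-based baseline. It is not a description of Asia Online's pre-processor, whose design was not disclosed; the function name segment, the three-word lexicon and the sample sentence are assumptions made for illustration.

```python
def segment(text, lexicon):
    """Split unspaced text into words by greedy longest match.

    At each position the longest lexicon entry that fits is taken;
    if nothing matches, a single character is emitted as a fallback.
    """
    max_len = max([1] + [len(word) for word in lexicon])
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical mini-lexicon: "I", "eat", "rice".
thai_words = ["ฉัน", "กิน", "ข้าว"]
sentence = "".join(thai_words)  # written without spaces, as Thai normally is
tokens = segment(sentence, set(thai_words))
```

Greedy matching breaks down when a string can be split in more than one valid way or contains words missing from the lexicon, which is part of why the task calls for trained computational linguists rather than a simple lookup.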
Asia Online has also developed a post-processor that rates the quality of the Thai output, with its automatic changes fed back into the SMT learning engine.
The algorithms are relatively advanced. Learning Arabic from parallel corpora of 200 million sentence pairs takes around a week on a modern Linux PC with 4GB of memory and a lot of hard disk space.
Asia Online founder Dion Wiggins explained that while Professor Koehn has been focusing on the usability and the translation quality as part of his pure research work, it will be up to Asia Online to architect the algorithms in a way that can scale to thousands of transactions a second. This will allow users of the Asia Online portal to view the Internet in any language they wish.
A lot of work can be taken care of in the pre- and post-processing. For instance, Chinese numbering refers to 52,000 as "five point two ten thousand", which would need to be translated into "fifty-two thousand" for both English and Thai. Other engineers are working on a name and place recognition pre-processor that will tag words that need to be translated phonetically.
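The number example can be sketched as a small normalisation step. This is only an illustration under stated assumptions: it handles a decimal figure followed by a single Chinese unit character, 万 (ten thousand) or 亿 (one hundred million), and the function name expand_myriad is invented here. A production pre-processor would need to cover far more cases.

```python
from decimal import Decimal

# Chinese counts large numbers in units of ten thousand, not one thousand.
UNITS = {"万": 10**4, "亿": 10**8}


def expand_myriad(text):
    """Convert a figure such as '5.2万' into a plain integer."""
    unit = text[-1]
    if unit not in UNITS:
        raise ValueError(f"no known unit in {text!r}")
    # Decimal avoids floating-point rounding on values like 5.2.
    return int(Decimal(text[:-1]) * UNITS[unit])


value = expand_myriad("5.2万")
formatted = f"{value:,}"  # digit grouping as used in English and Thai text
```

Once the figure is an ordinary integer, the translation engine no longer has to learn the ten-thousand grouping from examples, and the target-language side can spell it out or format it as needed.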
Professor Koehn and the Asia Online staff declined to show the quality of Thai translation just yet, though they promised it would be better than anything else available when it is formally launched.
Time will tell.
One key improvement to the raw algorithms will be the development of specialised domains. For instance, the language used in car manuals is quite different from that of legal documents and of chatrooms. Wiggins said that the system will feature thousands of domains, which will be one of the unique points of the Asia Online SMT engine.
For languages with insufficient texts, like Khmer, SMT algorithms can triangulate with two or more different languages, for instance merging Japanese-Khmer with English-Khmer parallel corpora. This has successfully been used to train SMT systems for Gaelic, Welsh, Catalan and other "low resource" languages.
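The article does not spell out how the triangulation is done. One common formulation, shown here purely as an assumed sketch, goes through a pivot language: the probability of a target word given a source word is the sum, over every pivot word, of the source-to-pivot probability times the pivot-to-target probability. The function name, the placeholder tokens and the probabilities below are all invented for illustration.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine two translation tables through a shared pivot language.

    Each table maps a word to a dict of {translation: probability}.
    The result gives p(target | source) summed over all pivot words.
    """
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_src_pivot in pivots.items():
            for tgt, p_pivot_tgt in pivot_to_tgt.get(pivot, {}).items():
                table[src][tgt] += p_src_pivot * p_pivot_tgt
    return {src: dict(tgts) for src, tgts in table.items()}


# Placeholder tokens stand in for real words; English acts as the pivot.
src_to_pivot = {"src_a": {"water": 0.75, "river": 0.25}}
pivot_to_tgt = {
    "water": {"tgt_x": 1.0},
    "river": {"tgt_x": 0.5, "tgt_y": 0.5},
}
combined = triangulate(src_to_pivot, pivot_to_tgt)
```

Here src_a ends up translating to tgt_x with probability 0.875 and to tgt_y with 0.125, even though no direct source-target data was used.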
The Bible and the Universal Declaration of Human Rights have been very useful as they have been translated into every major language. For Asia, Wiggins is eyeing the Buddhist Tripitaka to train the engine.
Asked if this would mean that Asia Online will be able to translate into ancient languages such as Pali, still used widely in Buddhist rituals, and Sanskrit, Wiggins laughed and said, "Let's get Thai working first."