Why you should and should not train your own model for machine translation
- Melanie Yang
- May 22, 2022
- 3 min read
Machine Translation: An alternative approach?
When we think about how to perform the actual translations for a localization project, the choices are usually either manual translation by hiring someone, or machine translation. Manual translation takes longer and costs more, while machine translation is less accurate and often too generic for a specific domain.
An alternative approach we could attempt is to use machine translation, but with our own model instead of broader public ones. We could sample our own training data and test data, then perform rounds of training until we like the results. It can be a good middle ground: flexible and catered to specific needs.
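To make that first step concrete, here is a minimal sketch in Python of splitting a parallel corpus into training, development, and test sets. The file name, tab-separated layout, and split sizes are all hypothetical rather than what my team actually used.

```python
# A rough, hypothetical sketch: gta_en_zh.tsv, the split sizes, and the
# tab-separated "english<TAB>chinese" layout are all assumptions.
import random

def split_corpus(tsv_path, test_size=500, dev_size=500, seed=42):
    """Read tab-separated sentence pairs and split them into train/dev/test."""
    with open(tsv_path, encoding="utf-8") as f:
        pairs = [line.rstrip("\n").split("\t") for line in f if "\t" in line]

    random.Random(seed).shuffle(pairs)          # fixed seed so the split is reproducible
    test = pairs[:test_size]
    dev = pairs[test_size:test_size + dev_size]
    train = pairs[test_size + dev_size:]        # everything else goes to training
    return train, dev, test

if __name__ == "__main__":
    train, dev, test = split_corpus("gta_en_zh.tsv")
    print(f"train={len(train)}  dev={len(dev)}  test={len(test)}")
```

Holding back a fixed test set like this is what later makes score comparisons between training rounds meaningful.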
What I did with my group was train and test a machine translation model for GTA 5, using its user-generated content and official news. We sampled our own data from older generations of the game, existing translated materials, and specific wikis, then trained and improved the model until we were satisfied. It delivered a decent automatic translator, and I would love to share some findings from the project.
What is good about training translation models?
It can be done with a customized scope, closer to what we actually want to translate. We were able to limit the sample and test data to GTA 5, which helped eliminate some chaos and confusion. If a word can mean many different things or be used in multiple ways, limiting the scope really helps raise accuracy, as that word might only have one meaning in the particular game or field.
It can be tweaked in any way we see fit. The test data can be selected carefully to make sure the most important content is translated well, even including less tangible features like the desired tone. We can identify what potentially causes wrong results and what direction the model should go in, then adjust the training data accordingly. My team was able to select items related to “video game” in the Wikipedia database, and to filter all our data to exclude traditional Chinese, since the target is simplified Chinese. Applying these appropriate clean-ups can raise the accuracy of translations, and it is easy to check the effectiveness of a filter with the BLEU score.
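As a rough illustration of these two ideas, here is a minimal sketch in Python. It assumes the third-party packages opencc (to detect traditional characters) and sacrebleu (for the BLEU score); our project ran on SYSTRAN's platform, so these libraries are only stand-ins for illustration.

```python
# A rough sketch only: `opencc` and `sacrebleu` are stand-ins I chose
# for illustration, not the tooling our project actually used.
from opencc import OpenCC
import sacrebleu

# 1) Drop pairs whose Chinese side is not pure simplified Chinese:
#    if converting Traditional -> Simplified changes the text, it
#    contained traditional characters.
t2s = OpenCC("t2s")  # config name can differ between OpenCC packages

def keep_simplified_only(pairs):
    """pairs: list of (english, chinese) tuples."""
    return [(en, zh) for en, zh in pairs if t2s.convert(zh) == zh]

# 2) Score a model's output against a held-out test set, so the effect
#    of a filter can be compared before and after retraining.
def bleu(hypotheses, references):
    """Both arguments are lists of sentence strings of equal length."""
    return sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh").score

# Toy example:
pairs = [("Hello", "你好"), ("Map", "地圖")]   # 地圖 uses a traditional character
print(keep_simplified_only(pairs))             # keeps only ("Hello", "你好")
print(bleu(["按下按钮开始游戏"], ["按下按钮开始游戏"]))   # an exact match scores 100
```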
It can be done quickly, provided that the data volume and accuracy requirements are not high. For our project, we performed only 6 rounds of training before getting to an acceptable level, and it did not cost too much time. It is important to avoid “fake success” by excluding the test data from the training data; otherwise the score could be skewed.
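Here is a minimal sketch of that leakage check, assuming the data is kept as lists of (source, target) pairs; the helper name and the example sentences are made up for illustration.

```python
# A rough sketch of the leakage check between training and test data.
def remove_test_overlap(train_pairs, test_pairs):
    """Drop any training pair whose source sentence also occurs in the test set."""
    test_sources = {src.strip() for src, _ in test_pairs}
    clean = [(src, tgt) for src, tgt in train_pairs if src.strip() not in test_sources]
    print(f"Removed {len(train_pairs) - len(clean)} overlapping pair(s) from the training data.")
    return clean

# Example:
train = [("Press X to jump.", "按X跳跃。"), ("New heist available.", "新的抢劫任务上线了。")]
test = [("Press X to jump.", "按X跳跃。")]
train = remove_test_overlap(train, test)   # only the non-overlapping pair remains
```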
It is cheap and reusable. The training barely costs any money compared to using third-party software or an actual translator. Even though sampling data is a little hard and takes more time than hiring translators, the resulting model can be reused indefinitely. It is likely, especially within a game community, that more content requiring translation comes out day by day. The trained model can be a fix-all solution, instead of having to repeat the process again and again.
What are the problems with this approach?
It is still not very accurate. By nature it is still automated translation, so the results will have errors and will require a lot of postediting. Given the time needed to fix everything, it is not optimal for fields that require a very high standard of accuracy, and hiring a human translator could be a better option. Surprisingly, SYSTRAN, the machine learning platform I used to train the GTA models, performs quite well in the legal and medical areas: those fields have a lot of specific terminology but a low overall content size, so the output can be handled with careful postediting.
It is not scalable. My team had only 10,000 lines of training data at its peak. The more content needed, the longer a training cycle takes to run, and the harder it is to choose training data. If the amount of content needed is too high, it is unrealistic to do this ourselves. Machine translation services that are already developed will be a better option, especially in situations where the scope cannot be restricted and a lot of common words will be used.
For more information about the GTA-V project, check out another blog post: