Going multi-lingual

It’s a hard problem, and not many people in the web development world are tackling it. Here’s some research into presenting content in different languages via HTML.

To frame the problem, here’s a good breakdown of the three technical issues:

There are three considerations for presenting HTML in non-English languages. First, that the document is delivered in the desired natural language (such as English, French, etc.) and dialect (US, British, etc.). Second, that the document is presented in the correct character set. This is a requirement for most Eastern languages (Russian, Japanese, etc.). Third, that the document is presented in the correct directionality. This is a consideration for languages such as Hebrew, Arabic, Japanese that are customarily written right-to-left or top-to-bottom.

Desired natural language

This can be broken down into three parts itself:

  1. Which language to deliver, and choosing it automatically with content negotiation
  2. The data model in a CMS to store parallel content and track changes to it (versions going out of sync etc.)
  3. The interface text/strings need to be translated, and time/currency etc. need to be localised to the target ‘culture’ (a combination of language and region, e.g. Swiss-German as opposed to German-German or Dutch-German)

Content negotiation

Content negotiation uses the features of a web server (e.g. Apache) to serve a document based on natural language. The browser sends an Accept-Language request header (visible to server-side scripts as HTTP_ACCEPT_LANGUAGE) and the server uses a type-map file to find the correct file.
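
To make the negotiation step concrete, here is a rough Python sketch of what the server does with the header. The function names and the rule of falling back from a regional tag such as "de-CH" to plain "de" are my own assumptions for illustration, not Apache’s actual algorithm.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (tag, quality) pairs, best first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a tag without an explicit q value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unparseable weight: treat as not acceptable
        prefs.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal weights keep the order the browser sent
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header, available, default):
    """Pick the best available language for a request, else the default."""
    offered = {lang.lower(): lang for lang in available}
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in offered:             # exact match, e.g. "de-ch"
            return offered[tag]
        primary = tag.split("-")[0]    # assumed rule: "de-ch" falls back to "de"
        if primary in offered:
            return offered[primary]
    return default
```

For a browser sending "de-CH,de;q=0.8,en;q=0.5" to a site that has English, German and French versions, negotiate() picks the German one.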

Parallel Content

System Requirements

Here are two good lists of system requirements, a basic one and a more advanced one:

So, what is a basic level ?
- translate every content in other languages
- set a default / fallback language
- choose how to handle pages not in the language you selected : do you display them in the fallback language, or do you make them invisible ?

It look pretty much like Jose’s i18n module : the same site for every language.

More advanced levels could be handled by contrib modules :
- specific contents for different languages (no translation required)
- different settings for subsites
- translation for custom fields, blocks, everything
- different versions of attached files or links (ex: PDF files in the right language if available)

Database design

The database design is the tricky part here. The design goals would be:

  1. Simple record retrieval (SQL) given a table name, record id and language id.
  2. Scalable to many tables of unknown column data types with various character sets.
  3. Maintain a single record for values that don’t need translation, e.g. ‘title’ and ‘description’ might need translation but ‘quantity’ and ‘name’ do not. There is also the case of currency, where the amount is location-specific and the currency unit is a separate piece of data, so the pair (a vector) is a combination of content and localisation metadata.
  4. Version tracking: changes in the source language need to be reflected in the secondary language, so that a translation can be flagged as out of date. Knowing by how much it is out of sync, and which part needs to be retranslated, would also greatly assist translation efforts if the data set is large.
  5. Fallback/default language path, so that if German-Swiss isn’t available German-German might be, and failing that perhaps French, then English, or alternatively an error message with a reference to the available languages. (See: A case for better handling of multi-lingual objects in Drupal for a good breakdown of system requirements.)
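
Goals 4 and 5 can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the Translation class, its source_version field and the function names are hypothetical, and a real CMS would keep this information in the database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Translation:
    culture: str          # e.g. "de-CH"
    text: str
    source_version: int   # version of the source text this was translated from


def is_stale(translation, current_source_version):
    """Goal 4: a translation is out of date once the source has moved on."""
    return translation.source_version < current_source_version


def resolve(translations, chain) -> Optional[Translation]:
    """Goal 5: return the first translation found along the fallback chain."""
    for culture in chain:
        if culture in translations:
            return translations[culture]
    return None  # caller decides: error message, list of alternatives, ...
```

With a chain of ["de-CH", "de-DE", "fr", "en"], a record that only has German-German and English text resolves to the German-German one, and the difference between the two version numbers gives a crude measure of how far out of sync it is.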

Simple database model

The most obvious approach that leaps to mind is to make a ‘translation’ table for every table that has strings that need translation. This table would have a reference to the original table’s ID (primary key) and only contain those columns that need to be translated, plus a language/culture reference.

This works but might suffer when the number of tables in the schema is large, potentially doubling the number of tables in the system. As can be seen from an example that starts with a bad, unnormalized schema, this isn’t exactly straightforward, and you get the feeling of “Why have two tables when you really only want one?”.
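
As a sketch of the two-table approach, here it is with SQLite from Python. The product tables and their columns are made up for the example, and keeping the default-language title in the base table is my own assumption about where the fallback text lives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    id          INTEGER PRIMARY KEY,
    quantity    INTEGER NOT NULL,   -- language-neutral, stored once
    title       TEXT NOT NULL       -- text in the default language
);
CREATE TABLE product_translation (
    product_id  INTEGER NOT NULL REFERENCES product(id),
    culture     TEXT NOT NULL,      -- e.g. 'de-CH'
    title       TEXT NOT NULL,      -- only the translatable columns
    PRIMARY KEY (product_id, culture)
);
INSERT INTO product VALUES (1, 12, 'Bicycle');
INSERT INTO product_translation VALUES (1, 'de-DE', 'Fahrrad');
""")


def fetch_product(product_id, culture):
    """Return (quantity, title), using the default title if no translation exists."""
    return conn.execute(
        """
        SELECT p.quantity, COALESCE(t.title, p.title)
        FROM product AS p
        LEFT JOIN product_translation AS t
               ON t.product_id = p.id AND t.culture = ?
        WHERE p.id = ?
        """,
        (culture, product_id),
    ).fetchone()
```

The price of this design is the join on every read, plus one extra table per translatable table.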

A single table solution

You might instead have one table for all strings which might look like:

id: primary key
table: name of the table the value is for
column: name of the column the value is for
record_id: foreign key reference to the record in that table the value is for
value: the actual data we want (what data type is this column?)

This is a ‘lookup’ table that doesn’t exactly follow the normal rules of database design, but it is very flexible: you only need one table to hold the translations for any table, so it scales better. The problem is the data type of the ‘value’ column. It could simply be the longest string type the DB offers, perhaps with an added ‘encoding’ column so you know what character encoding it is storing. That is an inefficient use of DB space unless the DB can optimise for it. Alternatively you could have one of these tables per data type, as a performance trade-off. You would also need some metadata about each table you use this on, so a routine can determine which columns it should provide translations for. Lookup would be simply:


-- "table" and "column" are reserved words in SQL, so they are quoted
SELECT "column", value
FROM translations
WHERE "table" = 'myTable'
AND record_id = x

No joins, and quite simple to get one’s head around: KISS SQL.
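
Here is a runnable sketch of the single-table idea using SQLite from Python. The column list above has nowhere to record which language a value is in, so the language column below is my own addition, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE translations (
    id         INTEGER PRIMARY KEY,
    "table"    TEXT NOT NULL,     -- quoted: TABLE is a reserved word
    "column"   TEXT NOT NULL,     -- quoted: COLUMN is a reserved word
    record_id  INTEGER NOT NULL,
    language   TEXT NOT NULL,     -- assumption: not in the original column list
    value      TEXT NOT NULL
);
INSERT INTO translations ("table", "column", record_id, language, value) VALUES
    ('myTable', 'title',       7, 'de', 'Fahrrad'),
    ('myTable', 'description', 7, 'de', 'Ein rotes Fahrrad'),
    ('myTable', 'title',       7, 'fr', 'Vélo');
""")


def translated_columns(table, record_id, language):
    """Return {column name: translated value} for one record in one language."""
    rows = conn.execute(
        'SELECT "column", value FROM translations '
        'WHERE "table" = ? AND record_id = ? AND language = ?',
        (table, record_id, language),
    )
    return dict(rows.fetchall())
```

The result is a plain dictionary that can be laid over the untranslated record, so columns without a translation keep their original value.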

Internationalisation (i18n)

This concerns translation of the User Interface (UI) and the formatting of things like time and currency. As far as I can tell it is a separate issue from content translation, although the two are obviously related.
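
A minimal Python sketch of what this means in practice: one lookup for UI strings and one set of formatting rules per culture. The CULTURES and MESSAGES tables are hand-written and hypothetical. A real system would take this data from its platform’s locale database rather than hard-coding it.

```python
from datetime import date

# Hypothetical per-culture formatting rules, for illustration only.
CULTURES = {
    "en-US": {"date": "{m:02d}/{d:02d}/{y}", "decimal": ".", "group": ",",
              "money": "{symbol}{amount}", "symbol": "$"},
    "de-CH": {"date": "{d:02d}.{m:02d}.{y}", "decimal": ".", "group": "'",
              "money": "{symbol} {amount}", "symbol": "CHF"},
    "de-DE": {"date": "{d:02d}.{m:02d}.{y}", "decimal": ",", "group": ".",
              "money": "{amount} {symbol}", "symbol": "€"},
}

# Hypothetical UI string catalogue, keyed by culture and then by source string.
MESSAGES = {"de-DE": {"Price": "Preis"}, "de-CH": {"Price": "Preis"}}


def translate(msgid, culture):
    """Return the UI string for a culture, or the source string if missing."""
    return MESSAGES.get(culture, {}).get(msgid, msgid)


def format_date(d, culture):
    """Format a date using the culture's day/month/year order."""
    return CULTURES[culture]["date"].format(d=d.day, m=d.month, y=d.year)


def format_money(amount, culture):
    """Format an amount with the culture's separators and currency symbol."""
    c = CULTURES[culture]
    # Format with "," and "." first, then swap in the culture's separators.
    text = f"{amount:,.2f}"
    text = (text.replace(",", "\x00")
                .replace(".", c["decimal"])
                .replace("\x00", c["group"]))
    return c["money"].format(symbol=c["symbol"], amount=text)
```

Note that the two German cultures share their UI strings but format money differently, which is why the key here is the culture and not just the language.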

Character set

This is the tricky part, as the issues are not straightforward to understand. One must also understand things like how MS Word deals with character sets from other languages (as most content is copied and pasted from Word into a web interface).

The history of character sets is a long one, as old as computing itself. Basically it comes down to this: ASCII is the lowest common denominator, present in practically all character sets and fonts, and Unicode is the long-term solution. But Unicode is not going to save your bacon unless you understand that, while it can represent all languages, for display it is only as good as the font used to show it. What this means is that it’s best to store your data in UTF-8 or UTF-16 (the latter is more compact for Asian languages), but to display it you may have to translate back to one of the old ‘Windows ANSI’ character sets that the User Agent understands and has a font for.
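
A small Python sketch of these points. The helper names are my own, and treating pasted bytes as Windows-1252 is an assumption about a Western copy of Word, not something you can rely on in general.

```python
def can_encode(text, encoding):
    """True if every character of text exists in the given character set."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def from_word_bytes(raw):
    """Decode bytes pasted from a Windows application, assuming cp1252."""
    return raw.decode("cp1252")


japanese = "日本語"
word_paste = b"\x93quoted\x94"   # curly quotes as Windows-1252 bytes

print(can_encode(japanese, "utf-8"))     # Unicode covers it
print(can_encode(japanese, "cp1252"))    # a Western 'ANSI' code page does not
print(from_word_bytes(word_paste))       # the curly quotes survive as Unicode
# Three Japanese characters: 9 bytes in UTF-8, 6 bytes in UTF-16
print(len(japanese.encode("utf-8")), len(japanese.encode("utf-16-le")))
```

The same can_encode() check is a cheap way to decide whether a stored Unicode string can be sent to a User Agent in an older character set at all.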

Correct directionality

This affects the design layout as well as its HTML representation. There will also be issues with how different User Agents handle the various possible directionalities.
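
On the HTML side the direction is declared with the dir attribute, alongside lang. A tiny Python sketch of choosing it from the language tag follows. The set of right-to-left languages is deliberately short and incomplete, and the function names are made up.

```python
# Incomplete, illustrative set of primary language subtags written right-to-left.
RTL_LANGUAGES = {"ar", "he", "fa", "ur"}


def text_direction(language_tag):
    """Return 'rtl' or 'ltr' for a language tag such as 'he-IL'."""
    primary = language_tag.lower().split("-")[0]
    return "rtl" if primary in RTL_LANGUAGES else "ltr"


def html_open_tag(language_tag):
    """Build the opening <html> tag with matching lang and dir attributes."""
    direction = text_direction(language_tag)
    return f'<html lang="{language_tag}" dir="{direction}">'
```

Vertical (top-to-bottom) text is a separate matter that dir does not cover.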

5 Responses to “Going multi-lingual”


  1. Justin

    Another factor to consider…

    Something I’ve found recently, while working on a multilingual site in Japan (Japanese, English, Portuguese and Spanish), is that we as developers need to consider the standards set by other sites created primarily in that language.

    For example, most, if not all, sites in Japan use simple left-to-right (ltr) writing for the language. And you typically find magazines and signs written in this system as well. Even though they traditionally learnt right-to-left and vertical sentencing, they can still read ltr Japanese, and are accustomed to doing so on sites these days.

    If you really want to cater for multiple visitors (in my case, I’m writing a site for an international centre) rather than just provide text translations, you need to understand the standard layout of sites using that language, and the format they typically use.

    Sure you should still push the envelope, but all the while, the amount of websense an average user from each culture has, needs to be considered. This is especially so for countries that have developed their own unique culture of web… or who are still left behind in the mid-90s.

  2. CpILL

    Yes,

    I agree that there are cultural expectancies. The ones you’ve mentioned seem to be about ‘Web Design’, which I hadn’t considered at all here :)

    These ‘standards’ you talk about for Japanese seem to come from bad web design that evolved because of defaults set by the English dominance of the web. Thus left-to-right becomes a norm, which is sad.

    Do you suppose that if you did do it the traditional way that it wouldn’t be understood?

  3. Justin

    Understood maybe - but how would you lay it out?

    The medium - and the audience - haven’t been tried and tested with the text format - esp. not with vertical sentencing.

    The web is truly a global medium, and I kinda like the way things are common regardless of culture. I love multiculturalism, but I also like being able to communicate across language barriers.

    ..But hey, maybe that’s just me - too much programming, database normalisation and all that crap.

  4. CpILL

    I think it’s an aesthetic design challenge rather than a data design challenge. The directionality of layout can be completely controlled in CSS; the trick is designing a layout that will work in Western AND Eastern writing styles, or detecting the language of the user agent and then switching layout. If you’re using good CSS-based layout you shouldn’t have to change the HTML structure, just a simple reference change.

    The hardest part will be for the graphic designer, and I find few understand the challenge or manage to meet it. This is because design colleges don’t teach “Web Design” beyond Dreamweaver. HTML and CSS are usually too much for their non-technical minds and so most are left to their own devices.

    But I digress, most of the web is the way it is because Americans set the standard for the technology (I HATE spelling colour as ‘color’ in CSS for example, I won’t start on a linguistics rant…grrr). I think once designers in Japan really take a good look at the web some will start to do things more to their tradition, including the design.

    I don’t think that Kanji will ever be able to communicate across language barriers no matter which way the type is set. I also think that the audience will figure it out if the type is laid out top to bottom, right-to-left, as we would notice immediately if it happened to our language.

    I think it would make the sites look new and fresh and give the designers a chance to play with new ideas, or at least the good ones :P
