It's a hard problem and not many people are tackling it in the web development world. Here's some research into presenting content in different languages via HTML.
To frame the problem, here is a good breakdown of the three technical issues:
There are three considerations for presenting HTML in non-English languages. First, that the document is delivered in the desired natural language (such as English, French, etc.) and dialect (US, British, etc.). Second, that the document is presented in the correct character set. This is a requirement for most Eastern languages (Russian, Japanese, etc.). Third, that the document is presented in the correct directionality. This is a consideration for languages such as Hebrew, Arabic, Japanese that are customarily written right-to-left or top-to-bottom.
Desired natural language
This can itself be broken down into three parts:
- Which language to deliver, and choosing it automatically with content negotiation
- The data model in a CMS to store parallel content and track changes to it (versions going out of sync etc.)
- The interface text/strings need to be translated, and time, currency etc. need to be localised to the target ‘culture’ (a combination of language and region, e.g. Swiss-German as opposed to German-German or Dutch-German)
Content negotiation uses the features of a web server, e.g. Apache, to serve a document based on natural language. The browser sends an Accept-Language request header (exposed to server-side scripts as HTTP_ACCEPT_LANGUAGE) and the server uses a type-map file to find the correct file.
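To make the negotiation step concrete, here is a minimal Python sketch of how a server-side script might pick a language from the header itself. The function names and the simple "exact tag, then primary language" matching rule are my own assumptions for illustration; this is not Apache's actual algorithm.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (tag, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        q = 1.0  # a missing q value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def negotiate(header, available, default="en"):
    """Pick the best available language (hypothetical helper)."""
    available = [a.lower() for a in available]
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # "de-ch" -> "de"
        if primary in available:
            return primary
    return default
```

For example, `negotiate("de-CH,de;q=0.9,en;q=0.5", ["en", "de", "fr"])` falls back from Swiss German to plain German, because no `de-ch` variant is available.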
Here are two lists of good system requirements, a basic list and a more advanced list:
So, what is a basic level?
– translate all content into other languages
– set a default / fallback language
– choose how to handle pages not in the language you selected: do you display them in the fallback language, or do you make them invisible?
It looks pretty much like Jose’s i18n module: the same site for every language.
More advanced levels could be handled by contributed modules:
– specific contents for different languages (no translation required)
– different settings for sub-sites
– translation for custom fields, blocks, everything
– different versions of attached files or links (ex: PDF files in the right language if available)
The database design is the tricky part here. The design goals would be:
- Simple record retrieval (SQL) given a table name, record id and language id.
- Scalable to many tables of unknown column data types in various character sets.
- Maintain a single record for values that don’t need translation for a given record, i.e. ‘title’ and ‘description’ might need translation but ‘quantity’ and ‘name’ do not. There is also the case of currency, where the value is location specific and the currency it is expressed in is a separate piece of data (a vector): a combination of content and localisation metadata.
- Version tracking of changes in the source language that need to be reflected in the secondary language, so that a given translation can be flagged as out of date. Knowing by how much it is out of sync, and which part needs to be re-translated, would also greatly assist translation efforts if the data set is large.
- A fallback/default language path, so that if German-Swiss isn’t available, German-German might be, and failing that perhaps French, then English; or alternatively an error message with a reference to the alternative languages? (See: A case for better handling of multi-lingual objects in Drupal, for a good breakdown of system requirements.)
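The fallback path in the last goal can be sketched in a few lines of Python. The culture codes, the `fallback_paths` mapping and the function name are hypothetical, chosen only to mirror the German-Swiss example; a real system would load them from configuration.

```python
def resolve_translation(translations, culture, fallback_paths):
    """Walk the fallback path for a culture.

    translations: dict mapping culture code -> translated text
    Returns (culture_used, text), or (None, available_cultures) so the
    caller can show an error pointing at the alternatives.
    """
    chain = [culture] + fallback_paths.get(culture, [])
    for candidate in chain:
        if candidate in translations:
            return candidate, translations[candidate]
    return None, sorted(translations)


# Hypothetical fallback configuration for Swiss German
FALLBACK_PATHS = {"de-CH": ["de-DE", "fr", "en"]}
```

With only `de-DE` and `en` stored, a request for `de-CH` gets the `de-DE` text; if nothing on the path exists, the caller receives the list of languages that are available.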
Simple database model
The most obvious approach that leaps to mind is to make a ‘translation’ table for every table that has strings that need translation. This table would have a reference to the original table’s ID (primary key) and contain only those columns that need to be translated, plus a language/culture reference.
This works but might suffer when the number of tables in the schema is large, potentially doubling the number of tables in the system. As can be seen from an example (which starts with a bad, un-normalized schema), this isn’t exactly straightforward, and you get the feeling of “Why have two tables where you really only want one?”.
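As a rough illustration of the per-table approach, here is a sketch using Python's built-in sqlite3 module. The `product` table, its columns and the sample rows are invented for the example. The `version` / `source_version` pair is my own assumption about how the version-tracking goal listed earlier might be met; it is not something the approach requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        id       INTEGER PRIMARY KEY,
        quantity INTEGER,                 -- never translated
        title    TEXT,                    -- source-language text
        version  INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE product_translation (
        product_id     INTEGER REFERENCES product(id),
        culture        TEXT,
        title          TEXT,              -- only the translatable columns
        source_version INTEGER,           -- product.version it was made from
        PRIMARY KEY (product_id, culture)
    );
    INSERT INTO product VALUES (1, 5, 'Red bicycle', 2);
    INSERT INTO product_translation VALUES (1, 'de-DE', 'Rotes Fahrrad', 1);
""")


def fetch_product(conn, product_id, culture):
    """Return the record in the given culture, falling back to the source text."""
    row = conn.execute(
        """
        SELECT p.quantity,
               COALESCE(t.title, p.title),
               (t.product_id IS NOT NULL) AND (t.source_version < p.version)
        FROM product AS p
        LEFT JOIN product_translation AS t
               ON t.product_id = p.id AND t.culture = ?
        WHERE p.id = ?
        """,
        (culture, product_id),
    ).fetchone()
    if row is None:
        return None
    quantity, title, stale = row
    return {"quantity": quantity, "title": title, "stale": bool(stale)}
```

The join is what makes this approach feel heavier than it should: every translated read touches two tables, and every translatable table needs its own twin.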
A single table solution
You might instead have one table for all strings which might look like:
|Column|Purpose|
|table|name of the table the value is for|
|column|name of the column the value is for|
|record_id|foreign key reference to the record in that table the value is for|
|value|the actual data we want (what data type is this column?)|
This is a ‘lookup’ table that doesn’t exactly follow the normal rules of database design, but it is very flexible: you only need one table to hold the translations for any other table, so it scales better. (To hold more than one translation per value it would also need a language/culture reference, as in the per-table approach.) You do have the problem of the data type for the ‘value’ column. It could simply be the longest string type the DB offers, perhaps with an added ‘encoding’ column so you know what character encoding it is storing. This is an inefficient use of DB space unless the DB can optimise for it. Alternatively you could have one of these tables for each data type, as a performance trade-off. You would also need some metadata about each table you use this on, so that a routine could determine which columns it should provide translations for. Lookup would be simply:
SELECT "column", value FROM translations WHERE "table" = 'myTable' AND record_id = x
(table and column are reserved words in SQL, so they are quoted here.) No joins, and quite simple to get one’s head around: KISS SQL.
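Here is a small runnable sketch of that lookup, again with Python's sqlite3 module. The `lang` column and the sample rows are my additions, on the assumption that the table has to tell different languages apart; the other column names follow the listing above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE translations (
        "table"   TEXT NOT NULL,      -- quoted: reserved word
        "column"  TEXT NOT NULL,      -- quoted: reserved word
        record_id INTEGER NOT NULL,
        lang      TEXT NOT NULL,      -- assumption: not in the original listing
        value     TEXT,               -- longest string type, as discussed
        PRIMARY KEY ("table", "column", record_id, lang)
    )
""")
conn.executemany(
    "INSERT INTO translations VALUES (?, ?, ?, ?, ?)",
    [
        ("myTable", "title", 1, "de", "Rotes Fahrrad"),
        ("myTable", "description", 1, "de", "Ein rotes Fahrrad"),
        ("myTable", "title", 1, "fr", "Vélo rouge"),
    ],
)


def translated_columns(conn, table, record_id, lang):
    """Return {column: value} for one record in one language."""
    rows = conn.execute(
        'SELECT "column", value FROM translations '
        'WHERE "table" = ? AND record_id = ? AND lang = ?',
        (table, record_id, lang),
    )
    return dict(rows.fetchall())
```

The caller overlays the returned dictionary on the source record, so columns without a translation simply keep their original value.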
Localisation concerns the translation of the User Interface (UI) and the formatting of things like time and currency. It is a separate issue from content translation as far as I can tell, although the two are obviously related.
The tricky part is that understanding the issues is not straightforward: one must understand things like how MS Word deals with character sets from other languages (as most content is copy/pasted from Word into a web interface).
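A toy Python sketch of the two halves, UI strings and culture-specific formatting, might look like this. The catalog contents and format values are illustrative assumptions, not authoritative locale data; a real project would normally take them from a localisation library or locale database.

```python
from datetime import date

# Hypothetical UI string catalog, keyed by language
CATALOG = {
    "en": {"greeting": "Welcome"},
    "de": {"greeting": "Willkommen"},
}

# Illustrative formatting rules, keyed by culture (language + region)
FORMATS = {
    "en-US": {"decimal": ".", "group": ",", "date": "%m/%d/%Y"},
    "de-DE": {"decimal": ",", "group": ".", "date": "%d.%m.%Y"},
    "de-CH": {"decimal": ".", "group": "'", "date": "%d.%m.%Y"},
}


def translate(key, culture):
    """Look up a UI string by language, falling back to English, then the key."""
    language = culture.split("-")[0]
    return CATALOG.get(language, CATALOG["en"]).get(key, key)


def format_number(value, culture):
    """Format a number with the culture's separators, to two decimals."""
    fmt = FORMATS[culture]
    text = f"{value:,.2f}"  # always "," groups and "." decimal at this point
    # Swap via a placeholder so the two replacements don't clobber each other
    return (text.replace(",", "\0")
                .replace(".", fmt["decimal"])
                .replace("\0", fmt["group"]))


def format_date(day, culture):
    """Format a date using the culture's pattern."""
    return day.strftime(FORMATS[culture]["date"])
```

Note that the string lookup depends only on the language, while the formatting depends on the full culture: `de-DE` and `de-CH` share the same greeting but group digits differently.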
The history of character sets is a long one, as old as computing itself. Basically it comes down to this: ASCII is the lowest common denominator, as it is present in practically all character sets and fonts. Unicode is the long-term solution, but it is not going to save your bacon unless you understand that, while it can represent all languages, for display it is only as good as the font used to show it. What this means is that it is best to store your data in UTF-8 (or UTF-16, which can be more compact for Asian languages), but to display it in a User Agent that does not handle Unicode you will have to translate back to one of the old ‘Windows ANSI’ character sets that the User Agent understands and has a font for.
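The following Python sketch shows the store-in-Unicode, transcode-on-output idea using only the standard codecs. The helper names are hypothetical, and cp1252 / cp1251 are used as examples of ‘Windows ANSI’ code pages (Western European and Cyrillic respectively).

```python
def can_encode(text, codec):
    """True if every character of text exists in the given legacy codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def to_legacy(text, codec):
    """Transcode for a legacy User Agent.

    Characters the code page lacks are emitted as HTML numeric character
    references, so the bytes stay valid in that code page.
    """
    return text.encode(codec, errors="xmlcharrefreplace")


def from_word_paste(raw_bytes):
    """Decode bytes pasted from a cp1252 source (an assumption about the input)."""
    return raw_bytes.decode("cp1252")


# Stored once as UTF-8, the text round-trips without loss
stored = "Grüße, Привет, こんにちは".encode("utf-8")
original = stored.decode("utf-8")
```

Here `can_encode("Привет", "cp1252")` is false while `can_encode("Привет", "cp1251")` is true, which is exactly the kind of mismatch that bites when the stored text and the output code page disagree. The curly quotes Word inserts are bytes 0x93 and 0x94 in cp1252, which `from_word_paste` maps to the proper Unicode quotation marks.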
Directionality affects the design layout as well as its HTML representation. There will also be issues with how different User Agents handle the various possible directionalities.
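On the HTML side, horizontal direction is expressed with the `dir` attribute. Here is a minimal Python sketch that guesses the base direction from the first strongly directional character and emits the attribute. The "first strong character" heuristic and the function names are my own simplification, not the full Unicode bidirectional algorithm. Vertical (top-to-bottom) text is a stylesheet concern (the CSS `writing-mode` property) and is not covered here.

```python
import unicodedata
from html import escape


def base_direction(text):
    """Guess 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # Hebrew-type or Arabic-type letter
            return "rtl"
        if bidi == "L":           # left-to-right letter
            return "ltr"
    return "ltr"                  # digits/punctuation only: default to ltr


def wrap_paragraph(text, lang):
    """Emit a paragraph carrying both lang and dir attributes."""
    direction = base_direction(text)
    return f'<p lang="{escape(lang)}" dir="{direction}">{escape(text)}</p>'
```

For instance, `wrap_paragraph("שלום", "he")` produces a paragraph marked `dir="rtl"`, while an English string gets `dir="ltr"`.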