Internationalization & localization data sets

I started gathering data for a language-related project, did a lot of research into ISO standards, and went looking for a complete list of languages plus an ISO code to identify each one. This expanded into cross-referencing by country (since languages have different dialects in different countries), and then I thought it would be easy to round out the set with locale settings and perhaps even country-to-IP-address mappings. As a bonus, I got all the sub-regions for every country (according to the UN), and most of their geographic coordinates, from the UNeTradeS!

Data sources:

  • Ethnologue’s Three-letter codes for identifying languages is the very last word in language data. It comes as three sets of data in three easy-to-use tab-delimited TXT files: Languages, with 7,357 distinct language codes; Countries, with 226 countries of the world in ISO 3166; and Language Index, a mega listing of 39,418 distinct native names used for the 7,299 languages, for a total of 52,584 records (since many of the names are used in more than one country and some are used with more than one language or dialect). It also has a brief history of the ISO codes and an explanation of how it all fits together. A must-have for any serious international web application. SIL International’s ISO 639-3 data is also available for download; it is the master reference but slightly less digestible.
  • Unicode CLDR Project: Common Locale Data Repository
    Everything you need for localization: currency, time/date formats, etc. The data is all in XML, spread across thousands of small files. That is OK, as you probably only need to read one once you know a user’s location.
  • United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT) has world region and sub-region data, country/currency data and international units of measurement. The data is presented in some rather hard-to-digest formats (MS Access for one!).
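
To give an idea of how the tab-delimited language and country files can be cross-referenced, here is a minimal Python sketch. The column names (LangID, CountryID, Name) and the inline sample rows are assumptions made for illustration, so check the header row of the files you actually download. The function names are hypothetical too.

```python
import csv
import io
from collections import defaultdict

# Tiny inline samples standing in for the real downloads.
# The column names (LangID, CountryID, Name) are assumptions.
LANGUAGES_TSV = (
    "LangID\tCountryID\tName\n"
    "eng\tGB\tEnglish\n"
    "fra\tFR\tFrench\n"
    "cym\tGB\tWelsh\n"
)
COUNTRIES_TSV = (
    "CountryID\tName\n"
    "GB\tUnited Kingdom\n"
    "FR\tFrance\n"
)


def read_tab_delimited(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def languages_by_country(language_rows, country_rows):
    """Group (code, name) pairs for languages under their country's name."""
    country_names = {row["CountryID"]: row["Name"] for row in country_rows}
    grouped = defaultdict(list)
    for row in language_rows:
        # Fall back to the raw country code if it has no entry in the country table.
        country = country_names.get(row["CountryID"], row["CountryID"])
        grouped[country].append((row["LangID"], row["Name"]))
    return dict(grouped)


result = languages_by_country(
    read_tab_delimited(LANGUAGES_TSV),
    read_tab_delimited(COUNTRIES_TSV),
)
for country, languages in sorted(result.items()):
    print(country, languages)
```

With real files you would pass the contents of each TXT file instead of the inline strings. The same grouping approach should work for the Language Index file, as long as it also carries a country code column.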
