amadeus.maclab.org

Resources for web developers and other stuff™
15:34 11/07/2005

Khmer Gettext

Aims

Background

In Summer 2004 I was sent to Cambodia by the Missions Etrangères de Paris (Paris Foreign Missions). Part of my role was to develop the website for the Catholic Church in Cambodia and to make sure it kept running after I left.

One of the problems I faced was how to display the website in the native Cambodian language, Khmer. Recent efforts had been made by Maurice Bauhahn et al. with respect to Khmer UNICODE, which is the only "proper" (i.e. cross-compatible) way of displaying the Khmer language. By the time I arrived in Cambodia, Khmer UNICODE was working pretty well under Windows 2000/XP. Since then the Khmer Software Initiative has received government backing to produce a Khmer language open-source operating system, relying on Khmer UNICODE.

For a brief explanation on Khmer UNICODE try this page.

During my time spent working on this I found a number of problems, and figured out ways to get it all working in the "cleanest" possible way. This page is to serve as an aid for anyone attempting the same thing.

Khmer in brief

If you want your Khmer page to be accessible by all using Windows you need to embed the Khmer font in your pages with some CSS like:

@font-face {
    font-family: "Khmer OS";
    font-style: normal;
    font-weight: normal;
    src: url(/_style/KHMEROS0.eot);
}

div.khmer {
    font-family: "Khmer OS";
}

And matching HTML:

<div class="khmer">ផ្តិតទំព័រនេះ</div>

The embedded .eot file can be created using Microsoft's WEFT (Web Embedding Fonts Tool) package.

You can skip the embedding if you expect (and tell!) your users to set up Khmer UNICODE correctly on their computers.

Why Gettext?

So what happens if you want to display your site in, say, 3 languages?

One approach (the long one) is to have different pages for the different languages. So:

index-en.php:
<a href="/contact.php">Contact Us</a>

index-fr.php:
<a href="/contact.php">Contactez-nous</a>

index-km.php:
<a href="/contact.php">ទំនាក់ទំនងមក​យើងខ្ញុំ</a>

I'm sure you'll agree this is a big headache to maintain. This is where gettext comes in. The following single page is the equivalent of the three above:

index.php:
<a href="/contact.php">
	<?php echo gettext("Contact Us"); ?>
</a>

The gettext() function checks which language is currently selected and returns the string "Contact Us" in that language. You keep a "catalog" of the English strings and their translations for each language, in locale directories. For example:

/locale/fr_FR/LC_MESSAGES/messages.po
/locale/es_ES/LC_MESSAGES/messages.po

French and Spanish

The messages.po files all start out the same: a list of all the strings you want to "translate on the fly". Each file is then translated into its respective language (you can generate these files automatically by scanning your source code for gettext() strings; more on this later). Once you have translated these files you need to compile them to messages.mo.

So you only write one set of pages, in English, and you then work on your language catalogs. This makes it much easier to update your one set of pages, and to add as many languages as you want.

You can set the language easily so the above script returns the "Contact Us" part in whichever language you want:

putenv("LANG=km" ); 
setlocale(LC_ALL, "km" );
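
The two calls above only choose the locale. As far as I understand PHP's gettext extension, the script also has to say where the catalogs live and which one to use, via bindtextdomain() and textdomain(). A minimal sketch, assuming the catalog is called "messages" (to match messages.mo) and that the locale directory path is passed in; the helper names catalogPath and selectLanguage are made up for illustration:

```php
<?php
// Hypothetical helper: the path where gettext is expected to look for a
// compiled catalog, given the layout shown above.
function catalogPath(string $localeDir, string $lang, string $domain = 'messages'): string
{
    return rtrim($localeDir, '/') . "/$lang/LC_MESSAGES/$domain.mo";
}

// Hypothetical helper: switch the current request to $lang.
// Requires PHP's gettext extension to be loaded.
function selectLanguage(string $lang, string $localeDir, string $domain = 'messages'): void
{
    putenv("LANG=$lang");
    setlocale(LC_ALL, $lang);
    bindtextdomain($domain, $localeDir); // catalogs live under $localeDir
    textdomain($domain);                 // default catalog for gettext()
}

// Usage (assumed layout): selectLanguage('km', __DIR__ . '/locale');
```

If a translation does not show up, checking that the file returned by catalogPath() actually exists is a cheap first test.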

One way to choose the language is to pass it in the QUERY_STRING of each page:

index.php?lang=km
index.php?lang=fr
index.php?lang=en
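
Since the lang value comes straight from the visitor, it is worth checking it against a list of known languages before handing it to putenv() or setlocale(). A small sketch, assuming the three languages used on this page and English as the fallback; the function name languageFromQuery is hypothetical:

```php
<?php
// Hypothetical helper: pick a supported language from ?lang=..., or fall
// back to a default when the parameter is missing or not recognised.
function languageFromQuery(array $query, array $supported = ['en', 'fr', 'km'], string $default = 'en'): string
{
    $lang = $query['lang'] ?? $default;
    return (is_string($lang) && in_array($lang, $supported, true)) ? $lang : $default;
}

// Usage: $lang = languageFromQuery($_GET);
```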

A neater way is to use directory based urls, for example:

/km/index.php
/en/index.php
/fr/index.php

Here en, km and fr are just PHP scripts which set the locale to what it should be and then load the same index.php.
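
How the language prefix reaches PHP depends on your server setup, which I don't cover here. As a rough sketch, assuming the request path is available to the script (for example from $_SERVER["REQUEST_URI"]), the language can be read from the first path segment; languageFromPath is a hypothetical name:

```php
<?php
// Hypothetical helper: read the language from the first segment of a
// path such as "/km/index.php", falling back to a default otherwise.
function languageFromPath(string $path, array $supported = ['en', 'fr', 'km'], string $default = 'en'): string
{
    $segments = explode('/', trim($path, '/'));
    $first = $segments[0]; // explode() always returns at least one element
    return in_array($first, $supported, true) ? $first : $default;
}

// Usage: $lang = languageFromPath(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH));
```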

That is briefly how gettext works. If you don't understand something or want to learn more please try a search engine before contacting me or posting in the forums.

In order to extract all your translation strings from your source I recommend you install Cygwin under Windows with the "gettext-devel" package. This will give you the gettext command suite for manipulating catalogs (start with "man xgettext").

Method

So how do you put this all together for Khmer?

  1. Generate pages, catalogs from English
  2. Translate catalog files in Khmer UTF-8 (.po)
  3. Convert catalog files to HTML-ENTITIES
  4. Compile binary catalog files (.mo)
  5. Modify HTML to display Khmer correctly

If you want a guide on how to install PHP with Gettext on Windows, read my article about it.

Implementation

Catalog files in UTF-8

Once you have extracted all the message strings to translate into Khmer, you should have a .po file ready to edit.

The poedit software runs well with Khmer UNICODE and Windows. Once you have UNICODE correctly set up on your computer (see the Khmer Software Initiative link above) you only need to modify:

File -> Preferences -> Editor -> Fonts

Check "Use custom font for translations list"
Check "Use custom font for text fields" 

 selecting a Khmer OS variant and size which suits you

File -> Preferences -> Editor -> Behaviour

Uncheck "Automatically compile .mo file on save"

poedit does not seem to correctly compile .mo files with native Khmer UTF-8. This is why we need to:

Convert catalog files to HTML-ENTITIES

It's good practice in HTML to convert non-ASCII characters to HTML-ENTITIES so you ensure they are displayed properly in the user's browser. It is necessary to do this with Khmer UNICODE and gettext as well.

Once you have translated your English -> Khmer catalog you will need to convert all the UTF-8 data to HTML-ENTITIES. One way I did this was using PHP:

<?php

$input = file_get_contents("khmer.po");

$to = "HTML-ENTITIES";
$from = "UTF-8";

$output = mb_convert_encoding ( $input, $to, $from );

$fp = fopen ( "output.po" , "w" );
fwrite ( $fp, $output );
fclose ( $fp );

?>

NOTE: You will need the php_mbstring module installed for the above to work.
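
On recent PHP versions (8.2 and later) the "HTML-ENTITIES" encoding name used above is deprecated in mbstring. A possible alternative, sketched here under the assumption that decimal numeric entities are acceptable in your catalog, is mb_encode_numericentity(); the function name toNumericEntities is my own:

```php
<?php
// Sketch: turn every non-ASCII character of a UTF-8 string into a decimal
// numeric entity (U+1780 becomes &#6016;) and leave ASCII untouched, so
// the quotes and keywords of the .po syntax survive unchanged.
function toNumericEntities(string $utf8): string
{
    // Conversion map: code points 0x80..0x10FFFF, no offset, full mask.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

// Usage: replace the mb_convert_encoding() call in the script above with
// $output = toNumericEntities($input);
```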

You will then have an output.po file. Open this in poedit and make sure no strings have been lost.

Using Cygwin (see my article about it) you can run a quick grep on both files:

Home@paws ~
$ grep msgstr khmer.po | wc -l
309

Home@paws ~
$ grep msgstr output.po | wc -l
309

So at least you haven't lost any strings; but it's worth manually glancing over the file just to make sure.
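
If Cygwin is not at hand, roughly the same check can be done from PHP. The sketch below counts the lines containing "msgstr" (like the grep above) and adds a second check, which is my own suggestion rather than part of the original method: that no non-ASCII bytes are left in the converted file. Both function names are hypothetical:

```php
<?php
// Count lines containing "msgstr", like `grep msgstr file | wc -l`.
function countMsgstrLines(string $poText): int
{
    $count = 0;
    foreach (preg_split('/\R/', $poText) as $line) {
        if (strpos($line, 'msgstr') !== false) {
            $count++;
        }
    }
    return $count;
}

// True if the text contains only 7-bit ASCII bytes, i.e. every Khmer
// character has been replaced by an entity.
function isPureAscii(string $text): bool
{
    return preg_match('/[^\x00-\x7F]/', $text) === 0;
}

// Usage:
// $in  = file_get_contents('khmer.po');
// $out = file_get_contents('output.po');
// var_dump(countMsgstrLines($in) === countMsgstrLines($out), isPureAscii($out));
```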

Compile binary catalog

Now you have your output.po you can compile it for use with gettext:

Home@paws ~
$ msgfmt output.po

You will then (hopefully) have a .mo file (msgfmt names it messages.mo by default), which needs to be placed in your locale directory under the correct language.

NOTE: on some Linux systems, because there is currently no "official" locale for Khmer, you have to make use of an existing one, such as Malay (ms_MY), for gettext to work. On NetBSD it doesn't matter whether the locale is installed system-wide or not.

Modify HTML

Now that you have your catalogs correctly compiled and set up in the correct locale directories, you will need to modify the gettext() script a little to include <font> tags, unless you get around this with CSS (i.e. a global font declaration).

I used the following approach, where the current language setting is stored in a SESSION variable and a default language can be specified in case none is set yet.

echo text("Contact Us");

function text ( $text, $notag='' )
{

  $return = NULL;

  if ( getLanguage() == "km" && empty ($notag) )
    $return .= "<font class='Khmer'>";

  $return .= gettext( $text );

  if ( getLanguage() == "km" && empty ($notag) )
    $return .= "</font>";

  return $return;

}

// Return the language stored in the session, or $default if none is set yet.
function getLanguage ( $default = 'en' )
{

  return isset($_SESSION["lang"]) ? $_SESSION["lang"] : $default;

}
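
The <font> element is obsolete in current HTML, so on a newer site a <span> is probably the better wrapper. A small sketch of that variation, not the code I originally used; the class name "khmer" and the function name wrapKhmer are assumptions, so use whatever class your stylesheet defines:

```php
<?php
// Hypothetical variant of text(): wrap an already-translated string in a
// <span> when the page language is Khmer and tagging is wanted.
function wrapKhmer(string $translated, string $lang, bool $tag = true): string
{
    if ($lang !== 'km' || !$tag) {
        return $translated;
    }
    return '<span class="khmer" lang="km">' . $translated . '</span>';
}

// Usage: echo wrapKhmer(gettext("Contact Us"), getLanguage());
```

Keeping the wrapping separate from the gettext() call also makes it easy to test without any catalogs installed.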

Summary

That should be enough information and ideas to help you write multilanguage websites with Khmer and gettext.

If you have any questions/comments/suggestions please contact me or post in the discussion forums.

Copyright © 2002-2017 Amadeus Stevenson, Photo credit: Klaus Post