Zen and the Art of National Language Support

Just a quick update on what I have been working on with the AZIndex plugin. I decided it was finally time to do something that I have long been putting off: adding national language support to the plugin. That doesn’t mean I am translating all the English text into other languages (sorry!), but I am looking at fixing the problems with sorting indexes that contain non-English characters, and with displaying non-English characters in the alphabetical links and headings.

But, wow, little did I realize the complexity that is PHP national language support. Native Unicode support is only planned for PHP 6.0, so I have to rely on the older PHP APIs, and only those which are likely to be installed on a WordPress server (i.e. not many!). Not only that, but I discovered that WordPress uses UTF-8 (a multi-byte encoding in which a character can be anywhere from one to four bytes long), which is fine, but PHP’s collation (sorting) function on Windows doesn’t work with UTF-8. So, on Windows systems, you have to convert every index item into the local codepage before the index can be sorted. Yuck!
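To make the multi-byte point concrete, here is a minimal sketch of how many bytes different characters occupy in UTF-8. It is written in Python purely for illustration (the plugin itself is PHP), and the helper name `utf8_length` is hypothetical.

```python
# Illustrative only: Python stands in for the plugin's PHP here.

def utf8_length(ch: str) -> int:
    """Number of bytes the single character `ch` occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# A plain ASCII letter, an accented Latin letter, a Cyrillic letter,
# the euro sign, and an emoji.
samples = ["E", "\u00ca", "\u0416", "\u20ac", "\U0001F600"]

for ch in samples:
    # Print the character, its byte count, and the raw bytes in hex.
    print(ch, utf8_length(ch), ch.encode("utf-8").hex())
```

Any function that assumes one byte per character will cut such characters in half, which is why the byte-oriented string functions are a problem for an index like this.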

Finally, there is the thorny problem of character equivalence. If you have index items beginning with an accented character (e.g. “Êtes” or “Übersicht”) then where do they go in the index? In some languages, like French, the accented characters belong in the same group as the corresponding non-accented character (e.g. “Êtes” goes between “Elle” and “Eve” under “E”), but in other languages some of the accented characters are grouped separately.

None of this works in AZIndex at the moment. Even if entries like “Elle”, “Êtes”, and “Eve” are sorted in the correct order (which they will be when I make the right changes), the items will be put under three separate headings (“E”, “Ê”, and “E” again) when they all need to be under “E”.

The only way I have found to do this reliably on all platforms is to hardcode the mappings of the accented characters to the base characters.  I’ve done this by reverse engineering the UTF-8 collation tables for MySQL (please tell me if you know of a better way!), so that I can fold the accented entries into the correct alphabetical grouping.
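As a rough illustration of the folding idea, here is a small sketch in Python (not the plugin's PHP code). The `FOLD` table, `heading_for`, and `sort_key` are all hypothetical names, and the table is a tiny hand-written sample rather than anything taken from the MySQL collation charts.

```python
import unicodedata

# Hypothetical, hand-written fold table; a real table would be far larger.
FOLD = {"\u00ca": "E", "\u00c9": "E", "\u00c8": "E", "\u00dc": "U", "\u00c0": "A"}

def normalise(item: str) -> str:
    """Use precomposed (NFC) characters so the table lookups match."""
    return unicodedata.normalize("NFC", item).upper()

def heading_for(item: str) -> str:
    """Return the index heading (the base letter) for an index item."""
    first = normalise(item)[0]
    return FOLD.get(first, first)

def sort_key(item: str) -> str:
    """Fold every character so accented letters sort with their base letter."""
    return "".join(FOLD.get(ch, ch) for ch in normalise(item))

items = ["Eve", "\u00dcbersicht", "\u00cates", "Elle"]
ordered = sorted(items, key=sort_key)

# Group the sorted items under their folded headings.
groups = {}
for item in ordered:
    groups.setdefault(heading_for(item), []).append(item)

print(ordered)
print(groups)
```

With this sample table, “Êtes” lands between “Elle” and “Eve” under a single “E” heading, whereas a plain code-point sort would push it after “Eve”.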

In addition, since some non-English characters can appear in different places in the index depending on which language is being used, I have decided to add a new option to the index settings to allow you to pick which language rules to use when folding the accented index entries into the index. Hopefully, if the results in the default language are not to your liking, you will find a language setting that does work.
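A language option like this could be sketched as a choice between fold tables. The example below is Python for illustration only; the table names, the `RULES` dictionary, and the `language` parameter are assumptions and do not reflect the plugin's actual settings. It relies on the fact that Swedish treats Å, Ä and Ö as separate letters, while a general European table folds them into A and O.

```python
import unicodedata

# Hypothetical rule tables, for illustration only.
GENERAL = {"\u00c4": "A", "\u00c5": "A", "\u00d6": "O", "\u00c9": "E"}
# Swedish keeps Å, Ä and Ö as letters in their own right, so they are not folded.
SWEDISH = {"\u00c9": "E"}

RULES = {"general": GENERAL, "swedish": SWEDISH}

def heading_for(item: str, language: str = "general") -> str:
    """Return the index heading for `item` under the chosen language rules."""
    first = unicodedata.normalize("NFC", item).upper()[0]
    return RULES[language].get(first, first)

for word in ["\u00c4rta", "\u00d6l"]:
    print(word, heading_for(word, "general"), heading_for(word, "swedish"))
```

The same item therefore ends up under a different heading depending on the rules selected, which is the behaviour the new option is meant to expose.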

Anyway, I will likely be putting out a new version of AZIndex with all this stuff in it sometime soon.  Hopefully those of you who are using AZIndex with non-English web sites will see an improvement.  In the meantime, please let me know if you have any suggestions about anything here or that I may have missed.

18 thoughts on “Zen and the Art of National Language Support”

  1. K

    hello Mike
    Thanks for the fast release! Just installed the new version. I’ve enabled:

    # Turn on additional support for national languages
    # Set collation table to use for grouping index items
    # Set locale to be used while sorting index

    It seems the plugin shows ???? instead of the “Index subheadings” (which are extracted from a custom field) on the first AZIndex load, but after an F5 refresh it seems to work fine. This happens every time I create a new post 🙁

    Another issue is that the “Customized alphabetical links” are always displayed as ??????, no matter what.
    There is no “Russian” in the “Language table used for grouping index items” dropdown. Could that be causing these problems?

    thanks again,
    K.

    1. English Mike Post author

      Hi K, ugh — that doesn’t sound great, but I have a few questions:

      1. Do you need to enable the locale and collation table options? You should only need to set the locale if your server is not running in a Russian codepage; otherwise, keep those options turned off.

      2. Are you running on Linux or Windows?

      3. If I just enable the main NLS option on my English-only system then I see the Cyrillic alphabet in the headings and the links just fine.

      4. I got the collation tables from here: http://www.collation-charts.org/mysql60/. There is no separate Russian table for UTF8 (which is what WordPress is running in), so (I assume) the default General European table should work for Russian.

      5. I must admit that I haven’t tested putting Russian characters into the Customized alphabetical list — I will take a look.

      Sorry you are having so many problems — it’s hard for me to test these things on an English system, so thanks for reporting back to me!

      What I suggest you try now is the following:

      1. Turn off Customized Alphabetical Links option.
      2. Turn off option to put unused alphabetical characters in the links.
      3. Turn on National Language Support option (but not locale or collation table)

      If you do that, do you still see those problems?

      Thanks,

      Mike

    2. K

      Hi

      thanks for the support

      1. No, my server is not running in a Russian codepage.
      2. I’m on Linux at Dreamhost.
      3. Yes, I can see the alphabet in the headings and links, but all the alphabet links point to #char_208 and that’s why all the posts are on the same page.
      4. I’m not sure too, but assume you are right.
      5. you could try to create several posts with “russian lorem ipsum” from here http://vesna.yandex.ru/astronomy.xml to check the behavior (refresh the page to get new portion of text)

      the biggest problems now:
      – every time I change the plugin settings or create a new post, the index subheadings turn into ???? and I need to refresh the page to fix it
      – all the links in the alphabetical links point to #char_208

      hope that helps
      K.

    3. English Mike Post author

      Okay, thanks for the feedback. I have removed 0.7 from circulation for the time being, so that I can sort out these issues (if I can). I don’t think I’m seeing all the same problems on my system, but I will play some more to see if there is something I am missing.

    4. English Mike Post author

      Thank you, K. I have a few other things to take care of, so I’m not sure how much more I can do today, and the problem with the links is going to take a little time since it’s in complicated code, so check back here tomorrow for an update. Thanks also for the snapshot of your settings page.

  2. K

    ...And I just noticed: all posts are shown on the same page regardless of the first letter, and the auto-created alphabetical links don’t work either.

  3. English Mike Post author

    Good news, K. I believe I have fixed all the problems you mentioned. Thanks for testing out my first effort; your feedback was invaluable in helping me find the outstanding problems, and there were quite a few (ugh!).

    You can download the new test version using this link and install it on your test system. But, because I had to change the database table, you will have to do one of two things before you see all the fixes:

    1. Uninstall the AZIndex plugin using the link at the bottom of the main AZIndex settings page (you will lose all your indexes), or,

    2. If you don’t want to lose all your settings, you will have to force the plugin to recreate the database table. You can do this by running phpMyAdmin (or whatever database application your server uses), finding the table called wp_az_indexes in your blog’s WordPress database, and renaming the nlslocale field to something else (e.g. to “dummy”). Now all you need to do is deactivate and reactivate AZIndex and the database table will support UTF-8 (multi-byte) strings!

    (Sorry for the complicated process, normal users upgrading directly from an earlier version will not need to do this as it will be done automatically).

    You should not see any more ??????? and you can try putting Cyrillic characters in the custom alphabetical links again if you want. As long as they are added in the same order as they are sorted in the index, the links should work correctly (I tested this, so it should work for you too!). The alpha links pointing to other pages should also work.

    Please give it a try and let me know if there are still any problems.

    EM

    1. K

      Hi there

      Yes, the subheadings now work fine, without ???? characters, and the alphabetical links are now different, but I can’t filter the posts by first letter. I have posts starting with Russian A and T, and the alphabetical links are now #char_d0a2 and #char_d090, but all my posts are still on the same page and the alphabetical links are broken… I hope you understand what I’m trying to explain 🙂

      K.

    2. English Mike Post author

      Hey, K. Are you sure you have the “Use multiple pages” option selected? (You don’t in the screen shot you linked). You also have to enter a number in the “Number of items per page” field. I have several test indexes, and all of them are working with multiple pages, so I’m not seeing that problem.

      Also, what character(s) are you using as a filter. I just tested single quotes (‘) and that seemed to work okay. And when you say the character links are broken, what exactly do you mean? (That when you click them, nothing happens?).

    3. English Mike Post author

      Hmm, it looks like I was not using the multi-byte version of the curly quotes in the filter, so that could have been the only problem you were seeing (and it messed up the alphalinks too!). I will post a new version for you shortly.

    4. English Mike Post author

      Okay, K — new test plugin for you here: AZIndex

      (You should just be able to install this version of the plugin on top of the old one — no extra steps this time. :-))

      The problem was that there were still a couple of places in the code where I was using non-multibyte string functions (in this case chr() and ltrim()). Hopefully I have found them all this time (!) and everything should work for you.
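To show why byte-oriented functions caused the symptoms reported above, here is a small Python sketch (not the plugin's PHP). The two helper functions are hypothetical, and the anchor format is only inferred from the #char_208 and #char_d0a2 links quoted in this thread.

```python
def anchor_first_byte(item: str) -> str:
    """Byte-oriented approach: looks only at the first UTF-8 byte."""
    return "char_%d" % item.encode("utf-8")[0]

def anchor_first_char(item: str) -> str:
    """Character-aware approach: uses every byte of the first character."""
    return "char_" + item[0].encode("utf-8").hex()

# Two Russian words, starting with Cyrillic A (U+0410) and T (U+0422).
words = ["\u0410\u0442\u043b\u0430\u0441", "\u0422\u0435\u0441\u0442"]

for word in words:
    print(word, anchor_first_byte(word), anchor_first_char(word))
```

Both letters start with the same lead byte (0xD0, which is 208 in decimal), so a byte-based anchor collapses every Cyrillic heading into one link, while the character-aware version keeps them distinct.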

      Please let me know what you find.

      EM

    5. English Mike Post author

      Phew! That’s a relief 🙂

      Thanks for your help too — I was going to have to fix the problems eventually so it really helps to have someone, like you, to test the code for me.

      EM
