Locale sensitive String sorting in Java


So, the day after I get made a Microsoft MVP I do two posts about Java - go figure.  Anyway, today I had one of those moments where you think you understand something, then realize you didn't, and that a lot of the code you've written over the past 10 years probably doesn't work as well as you thought...  All this with the humble String.compareTo method.

Take the following strings:-

  • charlotte
  • Chloé
  • Raoul
  • Real
  • Réal
  • Rico

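In .NET, if you want to perform a standard case-insensitive, dictionary-based comparison between two strings, you can use the String.Compare method with its ignoreCase argument set to true.  That does a culture-based, case-insensitive comparison (the plain two-argument overload is also culture-based, but case-sensitive).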

In Java, if you were to rely on the Comparable interface, which for strings means the standard String.compareTo method, to sort a list, you would end up with:-

  • Chloé
  • Raoul
  • Real
  • Rico
  • Réal
  • charlotte

That is because compareTo compares the Unicode values of the characters (strictly speaking, the UTF-16 char values) and sorts on those.  For those of us who tend to live in the ASCII range that works OK, apart from all the lowercase letters coming after the uppercase ones.  However, if your language uses any of the many other characters, it doesn't work so well.  If you had a language where M comes before A in the alphabet you are totally screwed.
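To make that concrete, here is a minimal sketch that sorts the six names using nothing but String's natural ordering. The class and method names (NaturalOrderDemo, sortNaturally) are made up for illustration, and the é is written as the escape \u00e9 so the example does not depend on the source file's encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class, not from the original post.
class NaturalOrderDemo {

    // Returns a sorted copy using String's natural ordering, i.e. String.compareTo.
    static List<String> sortNaturally(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "charlotte", "Chlo\u00e9", "Raoul", "Real", "R\u00e9al", "Rico");

        // Sorted purely on char values, so every uppercase letter beats every lowercase one.
        System.out.println(sortNaturally(names));

        // The char values that drive the result: 'C' = 67, 'c' = 99, 'i' = 105, 'é' = 233.
        System.out.println((int) 'C' + " " + (int) 'c' + " " + (int) 'i' + " " + (int) '\u00e9');
    }
}
```

Because 'é' (233) is larger than 'i' (105), "Réal" lands after "Rico", and because 'c' (99) is larger than any uppercase ASCII letter, "charlotte" ends up last.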

This is where you should be using the java.text.Collator class in Java.  The Collator class does locale-sensitive string comparisons, i.e. it lets you do a dictionary-based sort of a set of strings.
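As a rough sketch of what that looks like, the snippet below sorts the same names with a Collator. The class and method names (CollatorSortDemo, sortForLocale) are hypothetical, and Locale.ENGLISH is just an assumption for the demo; in real code you would normally pass the user's locale.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not from the original post.
class CollatorSortDemo {

    // Returns a sorted copy ordered by the collation rules of the given locale.
    static List<String> sortForLocale(List<String> names, Locale locale) {
        // Collator implements Comparator<Object>, so it can be handed straight to sort.
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted, collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "Chlo\u00e9", "Raoul", "Real", "Rico", "R\u00e9al", "charlotte");

        // Locale.ENGLISH is an assumption here; substitute the locale of your users.
        System.out.println(sortForLocale(names, Locale.ENGLISH));
    }
}
```

With an English collator this should give back the dictionary order from the list at the top of the post: charlotte, Chloé, Raoul, Real, Réal, Rico. Case and accents only act as tie-breakers once the base letters are equal, which is why "Real" still sorts just ahead of "Réal".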

Dope.  One of those classes I should have been using for a while...  I thought I was just being dumb, but then a couple of other people I mentioned this to were not aware of the issue so I thought it worth a blog post.

1 Comment

We struggle with this every day... We do most of our sorting in the database (Order by...), so it's sorted by collation. Which is fine for some customers, but not fine for others. So we should probably move the sorting into code and sort it according to their locale.

It sort of annoys me that most, if not all, example code comes without Globalization code. And almost no "Best Practices" use Globalization code. Perhaps I should start a blog on this? :)

- Jarle Nygård -
System Developer, Synergi.com

This blog is licensed under a Creative Commons License.