Java Sort: How to Match Oracle for Japanese?

Learn how to replicate Oracle’s sorting of Japanese characters in Java using proper encoding like EUC-JP and collation standards.
Java vs Oracle Japanese text sorting difference with kana characters displayed in code and SQL output
  • 🎌 Oracle can sort Japanese text phonetically through its linguistic sort settings.
  • 🧪 Java sorts by UTF-16 code unit values by default unless you apply a Collator that understands the rules of the language.
  • ⚙️ ICU4J lets you configure Japanese collation in detail so that it approximates Oracle's behavior.
  • 🚫 Without normalization, the byte order of kana differs between UTF-8 and EUC-JP.
  • 🐞 Misconfigured Docker or locale settings can break Japanese sorting across systems.

Java and Oracle each take their own approach to sorting text, and the difference becomes a real problem with non-Latin languages like Japanese. When you move Japanese data between an Oracle database and a Java application, the sort order may suddenly change. This can lead to inconsistent displays, data synchronization bugs, and wrong results. This article looks at why sorting behaves differently across systems, how Japanese character sets make this harder, and how to set up Java to sort like Oracle when you work with Japanese text.


Why Oracle and Java Sort Japanese Characters Differently

Sorting Japanese text is not just a matter of comparing strings. Japanese uses several writing systems and has complex phonetic rules, so producing the order people expect takes more than a simple byte comparison. Oracle and Java deal with this problem in different ways.

Oracle uses a clear set of rules. These rules come from two main settings:

  • NLS_SORT: This sets the language-based sorting order.
  • NLS_COMP: This sets how comparisons act when using NLS_SORT.

With NLS_SORT = JAPANESE_M, Oracle sorts Japanese kana phonetically, which is closer to the order of a Japanese dictionary. It also handles the different character types (Hiragana, Katakana, full-width, and half-width) by normalizing them internally during comparison and sorting.

Java, by contrast, compares in Unicode order unless you configure it otherwise. The standard method String.compareTo() compares the numeric values of the UTF-16 code units directly, so the resulting order follows the layout of the Unicode code charts. For example, the Hiragana "か" (ka, U+304B) has a different value than the Katakana "カ" (ka, U+30AB). Java orders them by those values and not by meaning or pronunciation, so the whole Hiragana block sorts before the Katakana block.
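The default behavior is easy to reproduce with the standard library alone. The following is a minimal sketch; the class name CodeUnitOrderDemo and the word list are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CodeUnitOrderDemo {
    // Sorts with String.compareTo(), i.e. by UTF-16 code unit value.
    static List<String> naturalSort(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // か is U+304B and カ is U+30AB, so Hiragana always comes first.
        System.out.println("か".compareTo("カ") < 0); // true

        // A dictionary would keep あ/ア together and か/カ together.
        System.out.println(naturalSort(List.of("カ", "か", "ア", "あ"))); // [あ, か, ア, カ]
    }
}
```

The result groups the words by script rather than by sound, which is the mismatch the rest of this article works around.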

To deal with these differences, Java applications must use sorting logic that knows about language rules. They can do this with specific tools like Collator or other libraries that have custom sorting rules.


Japanese Character Sets and Sort Orders

Japanese is among the harder languages to sort because of how its characters are encoded and the number of scripts in use. Knowing the types of characters involved is key to getting the sorting right:

Major Japanese Scripts

  • Hiragana (ひらがな): Used for native grammar and words (e.g., あいうえお).
  • Katakana (カタカナ): Usually used for foreign loanwords and scientific terms.
  • Kanji (漢字): Logographic characters of Chinese origin that carry meaning.
  • Latin characters: ASCII characters are often mixed into Japanese text (e.g., names, email addresses).
  • Half-width vs full-width forms: The width variant changes the byte order but usually should not change dictionary order.
  • Voiced and semi-voiced sounds: Small marks (dakuten and handakuten) change the sound of a base character, as in は, ば, and ぱ.

Sorting Types

  1. Lexicographic, Unicode-based
    • This is fast and simple.
    • It sorts only by Unicode values (UTF-16 code units in Java).
    • Used by default in Java (String.compareTo()).
  2. Phonetic or Dictionary Order
    • This sorts based on how words sound, like a Japanese dictionary.
    • It is usually preferred for anything shown to users.
    • It needs sorting rules that understand the language.
  3. Byte Order
    • This is the fastest way, using direct byte comparisons.
    • Often used in older databases or for things like file names.
    • It ignores linguistic meaning and pronunciation, and the result depends on the encoding.

The right type depends on what your application needs. For web and user-facing apps, phonetic order is usually best, both for user experience and so that different systems sort the way people expect.


How Oracle Sorts Japanese Characters

Oracle offers fine-grained control over sorting through NLS_SORT and NLS_COMP. This is why it can sort Japanese text the way users expect to see it.

Common NLS_SORT Settings

  • BINARY: This sorts by byte values.
  • JAPANESE_M: A linguistic sort for Japanese, with kana ordered by pronunciation.
  • BINARY_AI: This sorts by byte values, without caring about accents and letter case.

Importance of NLS_COMP

ORDER BY follows NLS_SORT on its own. Setting NLS_COMP to LINGUISTIC makes comparisons in WHERE clauses and other SQL operations use NLS_SORT as well, so sorting and comparison stay consistent.
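As a sketch of how a Java application might apply these settings over JDBC, the statements can be kept as plain strings and executed on the connection. The class name OracleNlsSql and the table and column names are hypothetical. ALTER SESSION and the NLSSORT function are standard Oracle SQL, but the sort name should be checked against the database version in use.

```java
class OracleNlsSql {
    // Session-wide settings: linguistic ORDER BY plus linguistic comparisons.
    static final String SET_SORT = "ALTER SESSION SET NLS_SORT = JAPANESE_M";
    static final String SET_COMP = "ALTER SESSION SET NLS_COMP = LINGUISTIC";

    // Per-query alternative that leaves the session settings untouched.
    // The identifiers are concatenated, so pass only trusted names.
    static String orderByJapanese(String table, String column) {
        return "SELECT " + column + " FROM " + table
                + " ORDER BY NLSSORT(" + column + ", 'NLS_SORT = JAPANESE_M')";
    }

    public static void main(String[] args) {
        System.out.println(SET_SORT);
        System.out.println(SET_COMP);
        System.out.println(orderByJapanese("customers", "name_kana"));
    }
}
```

The per-query form is convenient when only a few queries need Japanese ordering and the rest of the session should keep its defaults.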

Order Hierarchy

Oracle defines a clear order when sorting Japanese characters under NLS_SORT=JAPANESE_M:

  1. Hiragana
  2. Katakana
  3. Kanji
  4. Latin (ASCII)

This keeps characters with the same pronunciation but different appearance (like か and カ) in a predictable order. It also handles voiced/unvoiced variations and half-width/full-width forms.

Oracle documentation says that "NLS_SORT = JAPANESE_M" sorts kana based on pronunciation and not by Unicode values (Oracle, 2023).

This approach lets Oracle databases produce a sensible, consistent order across many languages and complex datasets. The goal is to make Java do the same.


Java’s Default Sorting Behavior and Locale Influence

By default, Java compares UTF-16 code units with String.compareTo(). This is not good enough for languages like Japanese, where pronunciation and script normalization are very important.

The recommended way to sort using language rules is Java’s Collator class from java.text:

Collator collator = Collator.getInstance(Locale.JAPAN);

But the result really depends on how you set up the Collator.

Strength Levels

  • PRIMARY: Ignores letter case, marks such as kana voicing marks, and width. It suits most everyday Japanese sorting needs.
  • SECONDARY: Also distinguishes marks, such as kana voicing marks (は vs ば).
  • TERTIARY: Also distinguishes letter case and width.
  • IDENTICAL: The strictest level. It even distinguishes strings that differ only in their code points.
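How a particular JVM applies these levels to kana is best checked directly, because the built-in java.text rules and ICU4J do not necessarily agree. The sketch below prints the comparison result at each strength and deliberately makes no claim about the output; the class name StrengthDemo and the sample pairs are illustrative.

```java
import java.text.Collator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class StrengthDemo {
    // Returns -1, 0 or 1 for each strength level, in order of strictness.
    static Map<String, Integer> compareAtEachStrength(String a, String b) {
        int[] strengths = {Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY, Collator.IDENTICAL};
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY", "IDENTICAL"};
        Map<String, Integer> results = new LinkedHashMap<>();
        for (int i = 0; i < strengths.length; i++) {
            Collator collator = Collator.getInstance(Locale.JAPAN);
            collator.setStrength(strengths[i]);
            results.put(names[i], Integer.signum(collator.compare(a, b)));
        }
        return results;
    }

    public static void main(String[] args) {
        // Voicing mark, script, and width differences respectively.
        String[][] pairs = {{"は", "ば"}, {"か", "カ"}, {"ｶ", "カ"}};
        for (String[] pair : pairs) {
            System.out.println(pair[0] + " vs " + pair[1] + ": "
                    + compareAtEachStrength(pair[0], pair[1]));
        }
    }
}
```

A result of 0 at a given level means the collator treats the two strings as equal there, which is the behavior to compare against Oracle.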

Locale Differences: Locale.JAPAN vs Locale.JAPANESE

  • Locale.JAPANESE: The language only (ja).
  • Locale.JAPAN: The language plus the country (ja-JP).

For collation the two usually resolve to the same rules, but JVM implementations may differ. So, testing your sorting behavior across JVM versions and vendors is very important.
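The difference between the two constants is visible in their language tags. This small sketch uses only java.util.Locale; the class name LocaleConstantsDemo is illustrative.

```java
import java.util.Locale;

class LocaleConstantsDemo {
    // Shows the BCP 47 tag plus the language and country parts of a locale.
    static String describe(Locale locale) {
        return locale.toLanguageTag()
                + " language=" + locale.getLanguage()
                + " country=" + locale.getCountry();
    }

    public static void main(String[] args) {
        System.out.println(describe(Locale.JAPANESE)); // ja language=ja country=
        System.out.println(describe(Locale.JAPAN));    // ja-JP language=ja country=JP
    }
}
```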

In short, using a locale-aware collator is only the first step. More configuration is needed to match Oracle's multiple sorting levels.


Customizing Java Collation Using EUC-JP

Older Oracle deployments often store Japanese data in EUC-JP. The encoding determines which bytes are stored, so it changes the result of any sort that compares raw bytes.

To make Java sort like Oracle:

  1. Decode input with the encoding it was stored in (for example EUC-JP), so the Java strings hold the intended characters before any comparison.
  2. Normalize script variants, especially full-width vs half-width forms.
  3. Use ICU4J or RuleBasedCollator to define a phonetic order.
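The normalization step can be sketched with java.text.Normalizer. NFKC folds half-width Katakana to full-width and full-width Latin to ASCII. Folding Katakana to Hiragana afterwards is an extra assumption made here, useful only if both scripts should compare as equal. The class name KanaNormalizer is illustrative.

```java
import java.text.Normalizer;

class KanaNormalizer {
    static String normalize(String input) {
        // NFKC: half-width kana become full-width, full-width Latin becomes ASCII,
        // and a base kana followed by a voicing mark is composed into one character.
        String nfkc = Normalizer.normalize(input, Normalizer.Form.NFKC);

        StringBuilder out = new StringBuilder(nfkc.length());
        for (int i = 0; i < nfkc.length(); i++) {
            char c = nfkc.charAt(i);
            // Katakana U+30A1..U+30F6 sit exactly 0x60 above their Hiragana counterparts.
            if (c >= '\u30A1' && c <= '\u30F6') {
                c = (char) (c - 0x60);
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("ｶﾞ"));  // が
        System.out.println(normalize("Ａカ")); // Aか
    }
}
```

Sorting the normalized form while displaying the original text keeps the user-visible data unchanged.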

Rule-Based Custom Collation

With java.text.RuleBasedCollator, you can clearly set the sort sequence:

RuleBasedCollator japaneseCollator = new RuleBasedCollator(
    "< あ < い < う < え < お < か < が < き < ぎ < く..."
);

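This approach defines the sort order explicitly by pronunciation instead of by Unicode or byte values. With a complete rule set that has been verified against Oracle output, it can match Oracle closely. Note that the constructor throws java.text.ParseException, and that the "..." above stands for the remaining kana.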
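A small but complete rule set shows the idea end to end. This is a sketch covering only five kana pairs, not a full Japanese tailoring: "<" marks a primary difference and "," a tertiary one, so each Katakana character sorts right after its Hiragana twin. The class name KanaRuleCollator and the rule string are illustrative.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

class KanaRuleCollator {
    // '<' = primary difference, ',' = tertiary difference (java.text rule syntax).
    static final String RULES = "< あ , ア < い , イ < う , ウ < か , カ < さ , サ";

    static RuleBasedCollator create() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    static List<String> sort(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(create());
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of("サ", "か", "ア", "あ", "カ"))); // [あ, ア, か, カ, サ]
        // The script difference is only tertiary, so the second character decides.
        System.out.println(create().compare("カあ", "かさ") < 0);       // true
    }
}
```

Characters that are missing from the rules fall back to the collator's default handling, so a production rule set has to list every character that can occur in the data.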

Encoding Consideration

Although UTF-8 is the most common encoding on the web today, older Oracle systems often store Japanese text in EUC-JP or Shift_JIS. Java must account for this when migrating data or integrating systems.

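A round trip through EUC-JP does not normalize text: every character the charset can represent comes back unchanged, and anything it cannot represent is replaced by a substitute character. What the conversion does give you is the byte sequence that a binary sort on an EUC-JP database compares:

String input = "あいうえお";
byte[] eucjpBytes = input.getBytes(Charset.forName("EUC-JP"));             // bytes as an EUC-JP database stores them
String roundTripped = new String(eucjpBytes, Charset.forName("EUC-JP"));   // equals input when every character is representable

Use the round trip to check that text survives the encoding used on the Oracle side. For width and script normalization, use java.text.Normalizer or ICU4J instead.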
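The effect of the encoding on byte-order sorts can be seen by comparing the same two characters under UTF-8 and EUC-JP. This sketch assumes the JVM ships the EUC-JP charset, which standard JDK builds do, and guards the call in case it is absent. The class name ByteOrderDemo is illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteOrderDemo {
    // Compares two strings by their encoded bytes, treating bytes as unsigned.
    static int compareBytes(String a, String b, Charset charset) {
        return Integer.signum(Arrays.compareUnsigned(a.getBytes(charset), b.getBytes(charset)));
    }

    public static void main(String[] args) {
        String hiragana = "か";   // U+304B
        String halfWidth = "ｶ";   // U+FF76, half-width Katakana

        // UTF-8: E3 81 8B vs EF BD B6, so か sorts first.
        System.out.println(compareBytes(hiragana, halfWidth, StandardCharsets.UTF_8)); // -1

        if (Charset.isSupported("EUC-JP")) {
            // EUC-JP: A4 AB vs 8E B6, so the half-width form sorts first.
            System.out.println(compareBytes(hiragana, halfWidth, Charset.forName("EUC-JP"))); // 1
        }
    }
}
```

The same pair of characters changes places depending on the encoding, which is why a binary sort cannot be carried over between systems without normalizing first.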


ICU4J: Advanced Sorting Compatibility

ICU4J (International Components for Unicode for Java) is the most complete library for Unicode support in Java. It was originally developed at IBM and is now maintained under the Unicode Consortium.

Why ICU4J Is Better Than Collator

  • It offers locale-aware sorting with fine-grained control.
  • It supports converting text between scripts (e.g., Katakana ➝ Hiragana).
  • It lets you write custom tailoring rules down to individual characters and comparison levels.
  • It is actively maintained and implements the Unicode Collation Algorithm (UCA).

ICU4J says it supports more than 200 languages and specific sorting orders that follow Unicode rules (Unicode Consortium, 2022).

Example: ICU4J Setup

Collator collator = Collator.getInstance(ULocale.JAPAN);
collator.setStrength(Collator.PRIMARY);

With just a few lines, kana that differ only in script or width compare as equal at primary strength, which brings Java's results much closer to Oracle's. Note that Collator and ULocale here come from com.ibm.icu.text and com.ibm.icu.util, not from java.text.

More advanced use lets you define custom sorting rules on top of the UCA. This matters for Japanese applications where small details count (e.g., education apps, dictionary tools).


Docker and Postgres: Dev Setup with Correct Encoding

Mismatches often come not from code but from development environments (like Docker containers) that are set up incorrectly.

Dockerfile Locale Setup

ENV LANG=ja_JP.EUC-JP

PostgreSQL init.sql

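-- template0 is required when the encoding or locale differs from template1
CREATE DATABASE japandb
  WITH TEMPLATE = template0
  ENCODING = 'EUC_JP'
  LC_COLLATE = 'ja_JP.eucJP'
  LC_CTYPE = 'ja_JP.eucJP';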

Make sure:

  • The container image has the Japanese locale installed (many base images ship without it).
  • The Java app and the database use the same encoding.
  • Sorting at the DB level uses the correct locale (LC_COLLATE).

Without this matching setup, even correct Java code can give wrong results.
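A quick startup check inside the container shows what the JVM actually sees. This is a minimal sketch using only standard APIs; the class name EnvironmentCheck and the key names in the map are illustrative.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

class EnvironmentCheck {
    // Collects the locale and encoding settings that influence sorting.
    static Map<String, String> describe() {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("LANG", Objects.toString(System.getenv("LANG"), "(unset)"));
        info.put("defaultLocale", Locale.getDefault().toLanguageTag());
        info.put("defaultCharset", Charset.defaultCharset().name());
        info.put("eucJpSupported", String.valueOf(Charset.isSupported("EUC-JP")));
        return info;
    }

    public static void main(String[] args) {
        describe().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Logging these values at startup in every environment makes a locale mismatch between a developer machine, the container, and production easy to spot.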


Sample Project: Java Sorting Oracle-Style

You can use Maven and ICU4J to build a working sample that reproduces Oracle's sort order.

pom.xml

<dependency>
  <groupId>com.ibm.icu</groupId>
  <artifactId>icu4j</artifactId>
  <version>73.1</version>
</dependency>

Main.java

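import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(ULocale.JAPAN);  // ICU4J collator, not java.text
        collator.setStrength(Collator.PRIMARY);                   // ignore script and width differences
        List<String> words = Arrays.asList("か", "あ", "ア", "さ");
        words.sort(collator);
        System.out.println(words);  // Compare this with Oracle's ORDER BY output
    }
}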

Testing Tips

  • Use lists of words that you know are in order.
  • Compare against ORDER BY queries in Oracle with NLS_SORT=JAPANESE_M.
  • Check that half-width and full-width are treated the same.
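One way to automate the comparison is to paste the rows returned by Oracle into a list and ask a comparator where it first disagrees. The list below is a placeholder, not real Oracle output, and the class name SortOrderCheck is illustrative.

```java
import java.util.Comparator;
import java.util.List;

class SortOrderCheck {
    // Returns the index of the first adjacent pair that the comparator
    // would put in the opposite order, or -1 if the whole list agrees.
    static int firstOutOfOrder(List<String> expectedOrder, Comparator<? super String> comparator) {
        for (int i = 0; i + 1 < expectedOrder.size(); i++) {
            if (comparator.compare(expectedOrder.get(i), expectedOrder.get(i + 1)) > 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Placeholder for rows copied from an Oracle ORDER BY query.
        List<String> fromOracle = List.of("あ", "ア", "か", "さ");

        // The default String order disagrees at index 1 (ア before か).
        System.out.println(firstOutOfOrder(fromOracle, Comparator.naturalOrder())); // 1
    }
}
```

Passing the collator under test instead of Comparator.naturalOrder() turns this into a regression check for the Java side.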

Performance Impacts of Custom Sorting

Custom sorting trades speed for correctness.

Overheads

  • The cost of normalizing strings
  • The time to parse rules in RuleBasedCollator or ICU4J
  • Converting from Unicode to EUC-JP and back

Optimization Tips

  • Cache sorted results (or collation keys) for word sets that rarely change.
  • Let Oracle do the sorting when you can (SELECT … ORDER BY col).
  • Precompute normalized forms in a batch step instead of at sort time.

In high-throughput code paths, measure performance before relying heavily on linguistic sorting in memory.
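For large in-memory sorts, java.text.CollationKey is one standard way to reduce the cost: the rules are applied once per string, and the sort then compares precomputed keys. The sketch below uses the JDK's built-in collator; ICU4J offers a comparable key mechanism. The class name CollationKeySort is illustrative.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollationKeySort {
    static List<String> sortWithKeys(List<String> words, Collator collator) {
        // Apply the collation rules once per string.
        List<CollationKey> keys = new ArrayList<>(words.size());
        for (String word : words) {
            keys.add(collator.getCollationKey(word));
        }
        // CollationKey is Comparable, so the sort only compares key bytes.
        Collections.sort(keys);

        List<String> sorted = new ArrayList<>(keys.size());
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.JAPAN);
        System.out.println(sortWithKeys(List.of("さ", "か", "あ", "ア"), collator));
    }
}
```

Keys are only comparable when they come from the same collator with the same settings, so cached keys need to be rebuilt whenever the rules or the strength change.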


Best Practices for Consistent Sorting

To keep sorting consistent across Oracle and Java:

  • ✅ Standardize the encoding across all inputs (EUC-JP or UTF-8).
  • ✅ Use ICU4J or RuleBasedCollator, not String.compareTo().
  • ✅ Match locale settings between JVM, DB, and OS.
  • ✅ Use test data from actual Japanese text collections.
  • ✅ Document all sorting and normalization rules.

Common Pitfalls When Sorting Japanese in Java

Avoid these common errors:

  • ❌ Assuming that older EUC-JP data is UTF-8.
  • ❌ Not normalizing half-width/full-width characters.
  • ❌ Confusing the Locale constants (Locale.JAPANESE vs Locale.JAPAN).
  • ❌ Relying on default string comparisons.
  • ❌ Not using ICU4J when you need more control.

Thorough testing and consistent configuration across systems are the only ways to get sorting that closely matches Oracle's.



