The main idea of this presentation is not to discuss how internationalization is done in Java but to highlight some of the problems one comes across while developing an internationalized application in Java. Most of the observations listed below are based on personal experience from internationalizing a Java-based web application. Attendees are expected to have a basic idea of the features Java provides to support internationalization. A good resource for the background reading is O'Reilly's Java Internationalization.

Background

To internationalize any application, one can divide the problem into two categories: internationalizing the static text displayed to users of different languages, and internationalizing the dynamic content of the application. The former is mostly handled using resource bundle properties files, while the latter is where most of the challenge lies, as the solutions differ based on the complexity of the application.

Further, when one talks about internationalization, the level of complexity depends on the languages to be supported. Many applications that support multiple languages restrict themselves to English and European languages. This simplifies the issue because, even though several languages are supported, the number of characters remains small. The complexity multiplies when supporting languages like Chinese, Japanese and Korean, where the number of characters is very large.

In a technical sense, the above difference comes down to the number of bytes required to represent the characters of a language. Most European and some Asian languages can be represented with one or two bytes per character, so common encodings like ASCII and ISO-8859-1 are sufficient (ISO-8859-1, or a close variant of it, is the default encoding on most Windows systems). But when the number of characters grows, these encodings are no longer sufficient and one needs to move into the Unicode domain and represent the information in UTF-8, UTF-16 or UTF-32 (UTF-32 is rarely used). Further details about encodings are beyond the scope of this presentation but can be found in the O'Reilly book. The generally preferred encoding is UTF-8, due to its backward compatibility with ASCII. Although Java internally represents all Strings in UTF-16, the developer still needs to take care of encoding whenever data crosses an interface boundary, for example from browser to application to database or file system.
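
As a small illustration of where encoding enters the picture, here is a minimal sketch (the string literal is just an example) of moving between Java's internal UTF-16 Strings and external bytes with an explicit charset, rather than relying on the platform default:

    import java.io.UnsupportedEncodingException;

    public class EncodingDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String text = "\u4F60\u597D"; // two Chinese characters

            // Encode explicitly when data leaves the JVM...
            byte[] utf8Bytes = text.getBytes("UTF-8");

            // ...and decode with the same charset when it comes back.
            String roundTrip = new String(utf8Bytes, "UTF-8");
            System.out.println(text.equals(roundTrip)); // true

            // new String(utf8Bytes) without a charset would use the platform
            // default encoding and could silently corrupt the data.
        }
    }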

Challenges

Using one encoding or multiple encodings

A general problem one comes across is the choice of encoding to use for the supported languages. One can choose a universal encoding like UTF-8 for all languages; since it is based on the Unicode standard, it is a good fit for any language. But many applications are already in production and have been using ISO encodings extensively. In such cases one can think of using multiple encodings for different languages. This approach may minimize the legacy changes but is a lot harder to maintain, as one needs to ensure that the encoding used for one language's data is applied consistently throughout the application. A major issue with the multiple-encodings approach is data exchange between users of different languages, especially if the data is stored on the file system rather than in a database. Therefore, the generally advised approach is to use one universal encoding based on the Unicode standard, which makes the effort extensible and easy to maintain.

Using Properties Class

This is one peculiar aspect of Java. When one deals with files in Java, one has the freedom to specify the encoding using the Reader stream hierarchy. But when it comes to properties files loaded through the java.util.Properties class, the file is expected to contain only characters from the ISO-8859-1 encoding. There is a way to specify Unicode characters in a properties file, but one has to spell out escaped Unicode values (\uXXXX) and cannot write the actual characters, such as Chinese, directly. This may not be a big issue, but it makes developers' lives difficult, since for resource bundles and the like one needs to maintain escaped Unicode values. A workaround is to read the resource file through a file stream with an explicit encoding and parse it as a properties file.
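
A minimal sketch of this workaround is shown below, assuming a UTF-8 encoded resource file (the file handling is illustrative). Note that Properties.load(Reader) is available from Java 6 onward; on earlier JDKs one would parse the key=value pairs by hand or convert the file with the JDK's native2ascii tool:

    import java.io.*;
    import java.util.Properties;

    public class Utf8Properties {
        public static Properties load(File file) throws IOException {
            Properties props = new Properties();
            Reader reader = new InputStreamReader(new FileInputStream(file), "UTF-8");
            try {
                // load(Reader) avoids the ISO-8859-1 assumption of load(InputStream)
                props.load(reader);
            } finally {
                reader.close();
            }
            return props;
        }
    }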

Handling File Streams

Here the biggest challenge faced by developers is migrating a legacy application to use the Reader/Writer hierarchy instead of the InputStream/OutputStream hierarchy. Most applications that were not developed for multiple languages use input and output streams directly to read or write file data, but when one needs to specify an encoding one must use the Reader and Writer classes. One basic principle should always be kept in mind: a file must be read and written in the same encoding to protect the data from corruption. This raises a further problem, because many of the files required by the system may be written by third-party applications whose behavior cannot be controlled. To solve this issue, at least partly, the best approach is to develop a small framework within your application to handle file-system-related activities and then ensure that all developers use only these classes to deal with the file system.
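
A minimal sketch of such a framework class (the class and method names are hypothetical) that funnels every read and write through one configured encoding:

    import java.io.*;

    public class EncodedFileUtil {
        // One application-wide encoding keeps reads and writes symmetric.
        private static final String ENCODING = "UTF-8";

        public static BufferedReader openReader(File file) throws IOException {
            return new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), ENCODING));
        }

        public static Writer openWriter(File file) throws IOException {
            return new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(file), ENCODING));
        }
    }

If all file access goes through a class like this, the read-and-write-in-the-same-encoding rule is enforced in one place instead of being re-implemented by every developer.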

XML files and XML Parser issues

Java provides a very robust API to handle XML files, along with the ability to parse them using SAX and DOM parsers. Since XML files can handle Unicode data, the issues in this case lie mostly on the parser side. The complexity increases when incoming XML or SOAP needs to be validated against a given schema and the parser therefore needs to validate the content properly. The following were the major issues we came across in our application for XML-related activities:

  • Internationalization of XML tags and attributes: Since it is not possible to tell the parser the character-comparison rules of the various languages, it is best that tag and attribute names be restricted to the English alphabet. The values of tags and attributes can contain any character allowed by the chosen encoding.
  • XML header encoding specification: The encoding declaration at the top of the XML document is used by the parser to determine the format in which the file holds its data. If nothing is specified, UTF-8 is assumed. This adds the restriction that the declared encoding must match the format in which the file is actually saved, otherwise the parser will throw an error.
  • Ideally UTF-8 is the universally chosen encoding, but there are a few peculiar issues to keep in mind. If one saves a UTF-8 XML file in Notepad, a few invisible characters (the UTF-8 byte order mark, bytes EF BB BF) are added at the start of the file. If one then reads this file as a UTF-8 stream in Java and submits it to the XML parser, there is a fair chance of getting the error "Content is not allowed in prolog". The reason is that some Java parsers fail to handle the invisible characters added by Notepad. Xerces and a few other parsers have overcome this by accounting for such characters, but issues still turn up in real applications. The best practice is to use a proper XML editor, or tools such as a hex editor, to remove any such invisible characters before submitting the XML; a sketch of stripping this marker programmatically follows this list.
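
As referenced above, here is a minimal sketch (the class and method names are hypothetical) of removing a UTF-8 byte order mark from a stream before handing it to the parser:

    import java.io.*;

    public class BomStripper {
        // The UTF-8 byte order mark Notepad prepends: 0xEF 0xBB 0xBF.
        public static InputStream stripUtf8Bom(InputStream in) throws IOException {
            PushbackInputStream pb = new PushbackInputStream(in, 3);
            byte[] head = new byte[3];
            int read = pb.read(head, 0, 3);
            boolean bom = read == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            if (!bom && read > 0) {
                pb.unread(head, 0, read); // not a BOM: push the bytes back
            }
            return pb;
        }
    }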

Handling String and Text

The best way to handle string comparison and sorting in an internationalization context is to use the java.text.Collator class. This gives developers locale-based comparisons driven by the comparison rules of different languages. The String class compares the UTF-16 code values of characters, which is not acceptable for most languages other than English. For example, in Czech "ch" is a single letter of the alphabet and sorts between "h" and "i"; to sort Czech data correctly, the Collator class needs to be used. But the problem may not end there. Collator is very powerful and gives the developer a lot of options, and various parameters of the collator object must be set to obtain the desired behavior. One important parameter is the collator strength. By default the strength is set to TERTIARY, but to achieve the behavior in the example above the strength would need to be changed to SECONDARY. For more discussion of the meaning of these parameters, refer to the O'Reilly book.
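
A minimal sketch of the Czech example (the word list is illustrative):

    import java.text.Collator;
    import java.util.*;

    public class CzechSortDemo {
        public static void main(String[] args) {
            List<String> words = new ArrayList<String>(
                    Arrays.asList("chleba", "cibule", "hrad", "ister"));

            Collator czech = Collator.getInstance(new Locale("cs", "CZ"));
            czech.setStrength(Collator.SECONDARY);

            Collections.sort(words, czech);
            // The Czech collator sorts "ch" after "h", giving
            // [cibule, hrad, chleba, ister], unlike plain String ordering.
            System.out.println(words);
        }
    }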

The above issue needs to be solved properly when dealing with real-world applications. Sorting and comparison requirements vary from language to language; for example, a Czech user might want results that require SECONDARY strength, whereas a Chinese user may want other options. To resolve such dilemmas, the best approach is to provide a proper API in the application through which developers can easily specify such parameters and configure the collator object as desired. Such functionality should also be exposed to the application administrator so that the behavior can be changed at runtime based on the requirements.

Handling Currencies in Java

There are so many currencies in the world that it is generally best to leave their handling to Java itself. The DecimalFormat and NumberFormat classes in the Java library are well equipped to deal with different currencies, taking the burden of parsing and formatting currencies away from the developer. There are, however, a few subtle issues the developer might need to worry about.

A few currencies, such as the Polish Zloty or the Russian Ruble, use a separator that looks like a space but whose actual character value in Java is 160 (the non-breaking space, \u00A0), not 32. This creates an issue when the data comes from a web browser such as Firefox: the browser sends the character as 32, not 160, so if one tries to parse the value in Java as, say, a Zloty price, an invalid character is encountered. We are not aware of this issue having been fixed; as of JDK 5.0 it remains open.
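
A minimal sketch of working around this on the server side, assuming the Polish locale (the normalization step is the workaround, not an official fix):

    import java.text.NumberFormat;
    import java.text.ParseException;
    import java.util.Locale;

    public class ZlotyParseDemo {
        public static void main(String[] args) throws ParseException {
            NumberFormat plFormat = NumberFormat.getNumberInstance(new Locale("pl", "PL"));

            // The value as a browser typically submits it: a regular space (32).
            String fromBrowser = "1 234,56";

            // Normalize ordinary spaces to the non-breaking space (160)
            // that the Polish locale data expects before parsing.
            String normalized = fromBrowser.replace(' ', '\u00A0');

            System.out.println(plFormat.parse(normalized)); // 1234.56
        }
    }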

Handling Date formats in Java

DateFormat and the other related classes provided by Java are very powerful for handling the various date-related issues of an application. Java generally handles different date formats, time zones and so on very well; the application developer's role is to use all of this to their advantage. Ideally, dates should always be stored in the DB or any other persistence layer in one common format, so that DB queries can be run uniformly across all data, while dates displayed to users are rendered in each user's chosen format. The goal again should be to provide a generic API in the application to handle these operations, so that not every developer needs to worry about parsing and remembering different date formats.
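
A minimal sketch of such an API (the class name, method names and storage pattern are hypothetical choices):

    import java.text.DateFormat;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;

    public class DateService {
        // One fixed pattern for persistence; locale-specific formats for display.
        // Note: SimpleDateFormat is not thread-safe, hence created per call.
        public static String toStorage(Date date) {
            return new SimpleDateFormat("yyyy-MM-dd").format(date);
        }

        public static Date fromStorage(String stored) throws ParseException {
            return new SimpleDateFormat("yyyy-MM-dd").parse(stored);
        }

        public static String toDisplay(Date date, Locale userLocale) {
            return DateFormat.getDateInstance(DateFormat.MEDIUM, userLocale).format(date);
        }
    }

For example, toDisplay(d, Locale.FRANCE) and toDisplay(d, Locale.US) render the same stored date as something like "15 mars 2006" and "Mar 15, 2006" respectively.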

Handling Data from Browser (For Servlet class and classes handling request and response objects)

There are not many issues in this case, but a few things should be kept in mind while retrieving data from the request.

1.) It is always better to set the encoding of the request and response objects to the chosen encoding of the application, so that the underlying API reads the data in the correct format (see the sketch after this list).

2.) While handling multipart forms, especially files in the request, the request should be parsed using the correct encoding, one that can handle the incoming data.
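
A minimal servlet sketch covering point 1 (the parameter name is hypothetical):

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.*;

    public class I18nServlet extends HttpServlet {
        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // Must be called before any parameter is read from the request.
            request.setCharacterEncoding("UTF-8");

            // Tells the container (and the browser) how the response is encoded.
            response.setContentType("text/html; charset=UTF-8");

            String name = request.getParameter("name"); // now decoded as UTF-8
            response.getWriter().println("Hello, " + name);
        }
    }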

Handling String Length issues (more with regard to the Database)

This issue does not really live in the Java world but arises when interacting with the DB layer of an internationalized application. The basic issue is that when one defines VARCHAR columns in the DB, a length needs to be specified, and normally this length means the byte length, not the character length. For most English-based applications this doesn't matter, as the number of bytes equals the number of characters. But with multibyte languages one needs to think in terms of character length rather than byte length, since the byte length cannot be predicted. In Oracle the DB can be configured for this using the NLS_LENGTH_SEMANTICS parameter. Setting it to character semantics means the DB will allocate the maximum number of bytes to a VARCHAR column needed to accommodate characters of any language: if one defines VARCHAR2(500), the DB will internally allocate 2000 bytes to the column but will still allow only 500 characters of any language. This setting can be changed at the session or system level. Since there are space and performance considerations in altering system settings, it is often preferred to change the setting only when required, on a per-session basis; when this is done per session, the application developer's role becomes important.

From the Java perspective, the DB drivers are oblivious to any such changes, so the developer need not worry about them at that level. But a few things need to be taken care of in the code. One necessary step, if the system-level DB setting is not altered, is to set the NLS_LENGTH_SEMANTICS parameter for every DB session: before any insert/update query, a SQL command needs to be issued to alter the session's length semantics to character.
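
A minimal sketch of issuing that command over JDBC (Oracle-specific; the class name is hypothetical):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SessionSetup {
        // Switch the session to character-length semantics so that
        // VARCHAR2(n) columns count characters rather than bytes.
        public static void useCharSemantics(Connection conn) throws SQLException {
            Statement stmt = conn.createStatement();
            try {
                stmt.execute("ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR");
            } finally {
                stmt.close();
            }
        }
    }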

Another issue, which may or may not matter for everyone, is accessing the character length of a VARCHAR column instead of its byte length. By default, the metadata of a result set carries the byte length of a field, not its character length. Depending on the chosen DB, the developer may therefore need to consult special data-dictionary tables (which differ between DBs) to access this property. Ideally, the DB layer interface should have per-database implementation classes that take care of all such idiosyncrasies of DB interaction.
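
For Oracle, a minimal sketch of such an implementation class might look up the data dictionary, where ALL_TAB_COLUMNS exposes CHAR_LENGTH alongside the byte-oriented DATA_LENGTH (the class name and error handling are illustrative only):

    import java.sql.*;

    public class OracleColumnLengthDao {
        public static int charLength(Connection conn, String table, String column)
                throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT CHAR_LENGTH FROM ALL_TAB_COLUMNS "
                    + "WHERE TABLE_NAME = ? AND COLUMN_NAME = ?");
            try {
                ps.setString(1, table.toUpperCase());
                ps.setString(2, column.toUpperCase());
                ResultSet rs = ps.executeQuery();
                return rs.next() ? rs.getInt(1) : -1; // -1: column not found
            } finally {
                ps.close();
            }
        }
    }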