How to Use
1. Using CharsetToolkit
    File file = new File("windows-1252.txt");
    Charset guessedCharset = CharsetToolkit.guessEncoding(file, 4096);
    System.err.println("Charset found: " + guessedCharset.displayName());

    FileInputStream fis = new FileInputStream(file);
    InputStreamReader isr = new InputStreamReader(fis, guessedCharset);
    BufferedReader br = new BufferedReader(isr);

    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
2. Using SmartEncodingInputStream
    FileInputStream fis = new FileInputStream("us-ascii.txt");
    SmartEncodingInputStream smartIS = new SmartEncodingInputStream(fis);
    System.err.println("The charset of this input stream is: " + smartIS.getEncoding().displayName());

    Reader reader = smartIS.getReader();
    BufferedReader bufReader = new BufferedReader(reader);

    String line;
    while ((line = bufReader.readLine()) != null) {
        System.out.println(line);
    }
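For readers curious how a wrapper stream can inspect the head of its input and still hand the complete content to a Reader, here is a minimal sketch using only the JDK. It is not the SmartEncodingInputStream implementation: the class PeekAndRewindSketch, its methods peek and openReader, and the pluggable guesser function are all hypothetical names introduced for illustration, and the guesser used in main is a placeholder that always answers UTF-8.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Function;

// Hypothetical sketch, not part of the GuessEncoding package.
public class PeekAndRewindSketch {

    /** Reads up to `limit` bytes from the head of the stream, then rewinds it. */
    public static byte[] peek(BufferedInputStream in, int limit) throws IOException {
        in.mark(limit);                       // remember the current position
        byte[] head = new byte[limit];
        int total = 0;
        int n;
        while (total < limit && (n = in.read(head, total, limit - total)) != -1) {
            total += n;
        }
        in.reset();                           // go back to the marked position
        return Arrays.copyOf(head, total);    // trim if the stream was shorter
    }

    /** Peeks at the head, asks the guesser for a charset, and returns a Reader over the whole stream. */
    public static Reader openReader(InputStream raw, int limit,
                                    Function<byte[], Charset> guesser) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        byte[] head = peek(in, limit);
        Charset charset = guesser.apply(head);
        return new InputStreamReader(in, charset);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "premi\u00e8re ligne\nseconde ligne\n".getBytes(StandardCharsets.UTF_8);

        // Placeholder guesser (assumption): a real one would examine the bytes.
        Function<byte[], Charset> guesser = head -> StandardCharsets.UTF_8;

        try (BufferedReader br = new BufferedReader(
                openReader(new ByteArrayInputStream(data), 4096, guesser))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

The mark/reset pair is what lets the guess happen without consuming any input: the Reader returned by openReader still starts at the first byte.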
GuessEncoding
New Home for the Project
Please note that the project moved to Codehaus a while ago, and you can find the latest version here:
http://docs.codehaus.org/display/GUESSENC
Origins
At work, I develop with IntelliJ IDEA, from JetBrains. Though I've tried Eclipse and JBuilder in the past, I came to love this IDE. It's certainly the best IDE around. It's a real pleasure to develop with it.
During the summer of 2002, I came across an issue regarding file encodings. At work, one of our concerns is localisation and internationalisation. We develop applications that are i18n/l10n aware. We used to have our Java source files encoded in ISO Latin-1, and our XML files encoded in UTF-8 (especially because there was some language-specific content inside). At that time, IDEA was able to read a file with a specified encoding, but it could not detect which encoding had been used to write that file. And as shit happens sometimes ;-) I totally messed up a very important XML file... I then realised that it was because IDEA was not able to guess the encoding. Charset issues are critical when dealing with l10n/i18n, which is why I filed some feature requests with IDEA's developers. I wrote two simple classes to show them that it was very easy to guess a charset, and I granted them the right to include (and modify) my source code inside IDEA. That's what they did, and since then, all IDEA fans can open their files without worrying about messing them up... (who hasn't seen weird boxes or question marks in a garbled file?)
Content
The package com.glaforge.i18n.io consists of two classes: CharsetToolkit and SmartEncodingInputStream. The first is a utility class that guesses the charset used in the byte buffer given as a parameter. The second is a specialised input stream that wraps another input stream, reads a certain amount of the file to guess the charset, and then opens the file with that encoding.
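To give an idea of what guessing a charset from a byte buffer can look like, here is a minimal sketch that relies only on the JDK. It is an illustration under stated assumptions, not the algorithm CharsetToolkit actually uses: the class CharsetGuessSketch and its guess method are hypothetical, and the heuristic (byte order marks first, then plain ASCII, then a strict UTF-8 decode, then a caller-supplied fallback) is just one common approach.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of the GuessEncoding package.
public class CharsetGuessSketch {

    /** Guesses the charset of `buf`, returning `fallback` when nothing matches. */
    public static Charset guess(byte[] buf, Charset fallback) {
        // 1. A byte order mark, when present, settles the question.
        if (startsWith(buf, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(buf, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(buf, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // 2. No byte above 0x7F: the buffer is plain ASCII.
        boolean hasHighBit = false;
        for (byte b : buf) {
            if (b < 0) { hasHighBit = true; break; }
        }
        if (!hasHighBit) return StandardCharsets.US_ASCII;

        // 3. Strict UTF-8 decode: any malformed sequence raises an exception.
        //    Limitation: a buffer cut in the middle of a multi-byte sequence
        //    is also rejected and ends up with the fallback.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buf));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    // True if buf begins with the given unsigned byte values.
    private static boolean startsWith(byte[] buf, int... prefix) {
        if (buf.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((buf[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(utf8, fallback).displayName());
        System.out.println(guess(latin1, fallback).displayName());
    }
}
```

Single-byte encodings such as ISO Latin-1 and windows-1252 cannot be told apart this way, which is why the sketch takes the fallback as a parameter instead of trying to guess it.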
Source code and API
- source files in HTML: http://glaforge.free.fr/projects/guessencoding/html/com/glaforge/i18n/io
- source code: http://glaforge.free.fr/projects/guessencoding/src
- sample files: http://glaforge.free.fr/projects/guessencoding/samples
- Javadoc API: http://glaforge.free.fr/projects/guessencoding/api
Usage
For a more detailed explanation of how to use this package, please read the Javadoc.