While you can normally insert arbitrary Unicode characters into many X11 applications using Ctrl-Shift-u followed by four hex digits, this doesn’t work in kwrite or kate. Instead, you have to press F7 to switch to the command line and type in
char <unicode>
For example, to get the degree symbol (Unicode: U+00B0) you’d type in 'char 176' (176 being 0xB0 converted to decimal).
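If you don’t want to do the hex-to-decimal conversion in your head, a few lines of Java will do it. This is only a small sketch; the class and method names are made up for illustration.

```java
import java.util.Locale;

// Converts a Unicode code point given in hex notation into the decimal
// number that the "char" command expects.
class CodePointToDecimal {

    // Accepts "U+00B0", "0xB0" or plain "B0" and returns the decimal value.
    static int toDecimal(String hex) {
        String digits = hex.trim().toUpperCase(Locale.ROOT);
        if (digits.startsWith("U+") || digits.startsWith("0X")) {
            digits = digits.substring(2);
        }
        return Integer.parseInt(digits, 16);
    }

    public static void main(String[] args) {
        // Degree symbol, U+00B0
        System.out.println("char " + toDecimal("U+00B0")); // prints "char 176"
    }
}
```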
Assuming we have a fully parsed org.w3c.dom.Document:
Document doc;
//parse doc etc...
Just calling LSSerializer’s writeToString method without specifying any encoding will result in a (rather impractical) UTF-16 encoded XML document by default:
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
LSSerializer lsSerializer = implLS.createLSSerializer();
lsSerializer.getDomConfig().setParameter("format-pretty-print", true);
String result = lsSerializer.writeToString(doc);
will output
<?xml version="1.0" encoding="UTF-16"?>
...
Unfortunately, specifying an encoding isn’t trivial. Here are two solutions that don’t require any third-party libraries: the first uses LSSerializer together with an LSOutput, the second uses a Transformer.
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
LSSerializer lsSerializer = implLS.createLSSerializer();
lsSerializer.getDomConfig().setParameter("format-pretty-print", true);
LSOutput lsOutput = implLS.createLSOutput();
lsOutput.setEncoding("UTF-8");
Writer stringWriter = new StringWriter();
lsOutput.setCharacterStream(stringWriter);
lsSerializer.write(doc, lsOutput);
String result = stringWriter.toString();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
DOMSource source = new DOMSource(doc);
Writer stringWriter = new StringWriter();
StreamResult streamResult = new StreamResult(stringWriter);
transformer.transform(source, streamResult);
String result = stringWriter.toString();
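To try this end to end, here is a self-contained sketch of the LSSerializer variant. Two things in it are my own assumptions and not part of the snippets above: the document is parsed from a hard-coded string, and the output goes to a byte stream (LSOutput.setByteStream) instead of a StringWriter, so that the bytes really are UTF-8 and not just labelled as such. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

class Utf8Serialization {

    // Stand-in for "parse doc etc...": builds a DOM document from a string.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Serializes the document into UTF-8 encoded bytes.
    static byte[] toUtf8Bytes(Document doc) {
        DOMImplementationLS implLS =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = implLS.createLSSerializer();
        LSOutput output = implLS.createLSOutput();
        output.setEncoding("UTF-8");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        output.setByteStream(bytes);
        serializer.write(doc, output);
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Document doc = parse("<temperature unit=\"\u00B0C\">21</temperature>");
        byte[] bytes = toUtf8Bytes(doc);
        // Decode with the same charset that was requested for serialization.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

With the JDK’s built-in DOM implementation, the printed declaration should name UTF-8 instead of UTF-16; other implementations may behave differently.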
Deep down in the Java-API:
http://java.sun.com/javase/6/docs/api/java/io/FileWriter.html
Convenience class for writing character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable. To specify these values yourself, construct an OutputStreamWriter on a FileOutputStream.
So, if you want to write your XML document to a file, for the love of god, don’t use FileWriter like this:
BufferedWriter bufout = new BufferedWriter(new FileWriter(OUTFILE));
bufout.write(out);
bufout.close();
or you might end up with an XML file that has a UTF-16 header (encoding="UTF-16") but is actually written in the platform’s default encoding, whatever that happens to be (plain ASCII?! Not sure…).
Instead, use
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(OUTFILE), "UTF-16");
out.write(s);
out.close();
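A slightly more defensive sketch of the same idea, assuming Java 7 or later: try-with-resources closes the writer even if the write fails, and a Charset constant avoids typos in the encoding name. The class name, the method names and the temp file are made up for the example.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitEncodingWriter {

    // Writes the text to the file using exactly the given charset.
    static void write(Path file, String text, Charset charset) {
        try (Writer writer = new OutputStreamWriter(Files.newOutputStream(file), charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes to a temp file and reads it back with the same charset.
    static String roundTrip(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-example", ".xml");
            try {
                write(file, text, charset);
                return new String(Files.readAllBytes(file), charset);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><t>\u00B0</t>";
        // The header and the actual bytes now agree on UTF-16.
        System.out.println(roundTrip(xml, StandardCharsets.UTF_16).equals(xml));
    }
}
```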
Resources:
http://www.malcolmhardie.com/weblogs/angus/2004/10/23/java-filewriter-xml-and-utf-8/
Just as you can convert entire files from one charset to another, you can convert the filenames. For example:
convmv -f iso-8859-15 -t utf-8 -r .
would recursively convert the names of all files in the current directory from the iso-8859-15 charset into utf-8. Well, not exactly: to actually rename the files you need the --notest flag. Otherwise convmv will perform a dry run without any changes.
How do you convert a file from the iso-8859-1 charset into utf-8? Simple:
iconv --from-code=ISO-8859-1 --to-code=UTF-8 oldfile > newfile
Of course, your values for --from-code and --to-code may vary. For a list of available encodings, use iconv --list.
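If iconv isn’t at hand, the same conversion can be sketched in Java: decode the bytes with the source charset, then encode the resulting string with the target charset. This is a rough equivalent, not a replacement. It reads the whole file into memory and, unlike iconv, silently substitutes a replacement character for bytes that are invalid in the source charset. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CharsetConverter {

    // Decodes the bytes with the source charset and re-encodes them.
    static byte[] convert(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    // File-to-file variant, roughly "iconv -f from -t to oldFile > newFile".
    static void convertFile(Path oldFile, Charset from, Path newFile, Charset to) {
        try {
            Files.write(newFile, convert(Files.readAllBytes(oldFile), from, to));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java CharsetConverter <oldfile> <newfile>");
            return;
        }
        convertFile(Paths.get(args[0]), StandardCharsets.ISO_8859_1,
                Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```

For example, the degree symbol is the single byte 0xB0 in ISO-8859-1 and becomes the two bytes 0xC2 0xB0 in UTF-8.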