SAX handling special characters

I'm trying to parse an XML file with Java and SAX on an Android device. I downloaded the file from the internet, and while parsing it I get an ExpatException: not well-formed (invalid token) at the character "é". Is there a way to handle such characters without having to replace every special character in the XML file?

Edit: Here is the part of my code that writes the file to my SD card.

File SDCardRoot = Environment.getExternalStorageDirectory();
File f = new File(SDCardRoot, "edt.xml");
f.createNewFile();
FileOutputStream fileOutput = new FileOutputStream(f);
InputStream inputStream = urlConnection.getInputStream();

// Copy the response to the file byte for byte, without any decoding.
byte[] buffer = new byte[1024];
int bufferLength = 0;
while ((bufferLength = inputStream.read(buffer)) > 0) {
    fileOutput.write(buffer, 0, bufferLength);
}

fileOutput.close();

Here is part of my XML:

<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-stylesheet type="text/xsl" href="ttss.xsl"?>

<timetable>
<option combined="0" totalweeks="0" showemptydays="0" dayclass="reverse">
<link href="g56065.xml" class="xml">Imprimer</link>
<link href="g56065.pdf" class="pdf">Version PDF</link>
<weeks>Semaines</weeks>
<dates>Dates</dates>
<week>Semaine</week>
<date>Date</date>
<all>Toutes les semaines</all>
<notes>Remarques</notes>
<id>ID</id>
<tag>Champs Libre</tag>
<footer>Publié le 10/09/2011 22:14:28</footer>
... </timetable>

Here is the parsing code:

public class ParserSemaines extends DefaultHandler {
    private final String SEMAINE = "span";
    private final String DESCRIPTION = "description";
    private ArrayList<Semaine> semaines;
    private boolean inSemaine;
    private Semaine currentSemaine;
    private StringBuffer buffer;
    @Override
    public void processingInstruction(String target, String data) throws SAXException {
        super.processingInstruction(target, data);
    }
    public ParserSemaines() {
        super();
    }

    @Override
    public void startDocument() throws SAXException {
        super.startDocument();
        semaines = new ArrayList<Semaine>();
    }

    @Override
    public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
        // Start collecting text for the element that just opened.
        buffer = new StringBuffer();
        if (localName.equalsIgnoreCase(SEMAINE)) {
            this.currentSemaine = new Semaine();
            this.currentSemaine.setDate(attributes.getValue("date"));
            this.inSemaine = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String name) throws SAXException {
        // The element text is only complete once its closing tag is reached.
        if (localName.equalsIgnoreCase(DESCRIPTION) && inSemaine) {
            this.currentSemaine.setDescription(buffer.toString());
        }
        if (localName.equalsIgnoreCase(SEMAINE)) {
            this.semaines.add(currentSemaine);
            this.inSemaine = false;
        }
    }

    public void characters(char[] ch,int start, int length) throws SAXException{
        String lecture = new String(ch,start,length);
        if(buffer != null) buffer.append(lecture);
    }

    public ArrayList<Semaine> getData(){
        return semaines;
    }
}

Here is the code I use to call the parser:

SAXParserFactory fabrique = SAXParserFactory.newInstance();
SAXParser parseur = null;
ArrayList<Semaine> semaines = null;
try {
    parseur = fabrique.newSAXParser();
    DefaultHandler handler = new ParserSemaines();
    File f = new File(Environment.getExternalStorageDirectory(), "edt.xml");
    parseur.parse(f, handler);
    semaines = ((ParserSemaines) handler).getData();
} catch (Exception e) {
    // Parsing failures, including the exception described above, end up here.
    e.printStackTrace();
}

Let me know if any other parts of the code are needed.

After checking, it appears that the XML file on the SD card shows "é" as "�". That is probably the problem, but I have no idea why it happens. I also tried parsing directly from the URI, but it doesn't change anything: I always get the same exception.


After checking, it appears that the XML file on the SD card shows "é" as "�".

This does indicate an encoding problem.

The code that you posted appears to be a correct byte-by-byte copy from the URL to the file, so the file should contain exactly what you're getting from the URL. That means the response from the server may not actually be in ISO-8859-1.

My next step would be to use a tool such as Fiddler to examine the entire response, paying particular attention to:

  • The Content-Type header. If it specifies a different character set, you'll have to pass that information to the parser and/or convert the content manually.
  • The actual bytes returned. For all you know, both the Content-Type and the XML prologue could be lying. If the file is truly ISO-8859-1, then the accented e should have a byte value of 0xE9. If the content is actually UTF-8, it should be the two-byte sequence 0xC3 0xA9. You're showing a three-byte sequence, which doesn't make sense. But it's best to check the source.
  • Also, verify that you're not converting the file to a string before passing it to the SAX parser.
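
To make the byte values above concrete, here is a small plain-Java sketch (the class name `AccentBytes` and the helper `hex` are made up for illustration). It prints how "é" is encoded in ISO-8859-1 and in UTF-8, and shows what happens when a lone ISO-8859-1 byte is decoded as UTF-8: Java substitutes the replacement character U+FFFD, which itself takes three bytes in UTF-8. That is one possible explanation for a three-byte sequence showing up, though nothing in this thread confirms it is what happened here.

```java
import java.nio.charset.StandardCharsets;

public class AccentBytes {
    /** Formats bytes as space-separated uppercase hex, e.g. "C3 A9". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String e = "\u00E9"; // the character é

        // One byte in ISO-8859-1, two bytes in UTF-8.
        System.out.println("ISO-8859-1: " + hex(e.getBytes(StandardCharsets.ISO_8859_1))); // E9
        System.out.println("UTF-8:      " + hex(e.getBytes(StandardCharsets.UTF_8)));      // C3 A9

        // A lone 0xE9 is not valid UTF-8; decoding it yields the replacement character.
        String wrong = new String(new byte[] {(byte) 0xE9}, StandardCharsets.UTF_8);
        System.out.println("0xE9 read as UTF-8: U+" + String.format("%04X", wrong.codePointAt(0))); // U+FFFD

        // The replacement character occupies three bytes when written back as UTF-8.
        System.out.println("U+FFFD in UTF-8: " + hex("\uFFFD".getBytes(StandardCharsets.UTF_8))); // EF BF BD
    }
}
```

Running a hex dump of the downloaded file and comparing it against these values should tell you which of the cases you are in.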


For reference: I wrote a minimal program that connects to the OP's URL and passes the connection's input stream directly to a minimal SAX parser. It appeared to run without error. I also used a DOM parser, and verified that at least the root element had been parsed correctly.

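    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    // The class name is arbitrary.
    public class ParseFromUrl {
        public static void main(String[] argv) throws Exception {
            URL url = new URL("http://www.disvu.u-bordeaux1.fr/et/edt_etudiants2/Master/Semestre1/g56065.xml");
            // Pass the raw byte stream so the parser can apply the encoding declared in the XML prologue.
            try (InputStream in = url.openConnection().getInputStream()) {
                SAXParserFactory spf = SAXParserFactory.newInstance();
                SAXParser parser = spf.newSAXParser();
                parser.parse(in, new DefaultHandler());
            }
            System.out.println("parse successful");
        }
    }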
    

I finally found the solution. Instead of using SAXParser, I use

    android.util.Xml.parse(InputStream,Xml.Encoding.ISO_8859_1, DefaultHandler);
    

Thanks, everyone, for all the help you provided.


This might be an encoding problem. Try setting the encoding to ISO-8859-1.

In your XML, try:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    

Or, in your code, use:

    inputSource.setEncoding("ISO-8859-1");
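
As a rough illustration of the `setEncoding` suggestion, the sketch below uses only the standard JDK SAX API (not Android's Expat-based parser, so behavior on a device may differ). The class name `ExplicitEncodingDemo`, the helper `parseText`, and the sample document are invented for this example. It wraps the raw bytes in an `InputSource`, tells the parser which encoding to assume, and collects the text it reads.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ExplicitEncodingDemo {
    /** Parses raw XML bytes with an explicitly stated encoding and returns all character data. */
    static String parseText(byte[] xmlBytes, String encoding) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        // Keep the input as bytes; the parser does the decoding using the hint below.
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        source.setEncoding(encoding);

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: a footer like the one in the question, stored as
        // ISO-8859-1 bytes and carrying no XML declaration of its own.
        byte[] latin1 = "<footer>Publi\u00E9 le 10/09/2011</footer>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(latin1, "ISO-8859-1"));
    }
}
```

The point of the design is that the bytes are never converted to a `String` before parsing, so the only decoding step is the one the parser performs with the encoding you supply.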
    