SAX character buffer size

2018-06-12 04:35:23

I'm trying to use SAX to parse very large XML files, hundreds of megabytes each. The problem is that the parser reads in exactly 2048 characters at a time and breaks off there, so I get a lot of tag values split into two parts in the callback "public void characters(...)". For example, the first part, "2013", is in the character array at position 2044 with length 4, and the second part, "-09-30", is at position 0 with length 6. It should be the date value "2013-09-30" if it were received in one piece. How can I avoid this splitting? Can anyone help me?

    public void characters(char[] ch, int start, int length) throws SAXException {
        if (Main.errorProceso == 0) {
            for (int i = 0; i < strlista.size(); i++) {
                if (strlista.get(i).equals(sEtiqueta_actual)) {
                    if (sEtiqueta_actual.equals("Root.Header.Body.")) {
                        String FileNm = String.valueOf(ch, start, length);
                        if (!FileNm.substring(0, 2).equalsIgnoreCase("XX")) {
                            // Log message: "The identifier is not XX"
                            logger.info("El identificador no es XX");
                            Main.errorProceso = 1;
                            i = strlista.size() + 1; // force the loop to end
                            sEtiqueta_actual = "";
                        }
                        else {
                            sCod_Fichero = FileNm.substring(0, 2) + XXteFormat.format(XXte);
                        }
                    }
                    else if (sEtiqueta_actual.equals("Root.Header.Date.")) {
                        // Keep only the date part of a date-time value
                        String aux = String.valueOf(ch, start, length).split("T")[0];
                        try {
                            sFec = newFormat.format(oldFormat.parse(aux));
                        } catch (ParseException e) {
                            logger.error(e.getLocalizedMessage());
                            Main.errorProceso = 1;
                        }
                    }
                    else if (sEtiqueta_actual.equals("Root.Header2.Body2.")) {
                        sNum_Total = String.valueOf(ch, start, length);
                    }
                    else if (sEtiqueta_actual.equals("Root.Header3.Body3.Spcf.Inst.")) {
                        sImp = String.valueOf(ch, start, length);
                    }
                    .
                    .
                    .
                    else if (sEtiqueta_actual.equals("Root.Header3.Body3.Spcf.Req.")) {
                        try {
                            sFec2 = newFormat.format(oldFormat.parse(String.valueOf(ch, start, length)));
                        } catch (ParseException e) {
                            logger.error(e.getLocalizedMessage());
                            Main.errorProceso = 1;
                        }
                    }
                }
            }
        }
    }

This is just the way SAX parsers work. Even if you could increase the buffer size (and I don't know how to do that), it wouldn't solve the problem; it would only reduce the number of times you get values broken into pieces.

The SAX parser is free to split character data wherever it needs to (see the documentation for ContentHandler.characters). It may do this for efficiency, to limit memory use, for simplicity of implementation, or for whatever other reason the library developer came up with.

So if you want to get your strings in one piece, you'll need to assemble them yourself. A simple solution, assuming that you never need to accumulate string values from elements that contain sub-elements:

Add a StringBuffer accumulator to your implementation class, as well as an isAccumulating flag.

In startElement, if the element is of interest, set the isAccumulating flag.

In characters, if the isAccumulating flag is set, append the characters to the accumulator.

In endElement, if the isAccumulating flag is set, do whatever you need to do with the accumulated string, and then clear the flag and empty the buffer.
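
The steps above can be sketched roughly as follows. This is a minimal illustration, not a drop-in replacement for the handler in the question: the class name AccumulatingHandler, the element name "Date", the values list and the parse helper are all made up for the example, and StringBuilder is used in place of StringBuffer because the handler runs on a single thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AccumulatingHandler extends DefaultHandler {
    // Hypothetical element of interest; replace with your own tag name(s).
    private static final String TARGET = "Date";

    private final StringBuilder accumulator = new StringBuilder();
    private boolean isAccumulating = false;
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (TARGET.equals(qName)) {
            isAccumulating = true;
            accumulator.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element; only append here.
        if (isAccumulating) {
            accumulator.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isAccumulating && TARGET.equals(qName)) {
            // The complete value is available only at this point.
            values.add(accumulator.toString());
            isAccumulating = false;
            accumulator.setLength(0);
        }
    }

    List<String> getValues() {
        return values;
    }

    // Convenience helper for trying the handler on an in-memory document.
    static List<String> parse(String xml) {
        try {
            AccumulatingHandler handler = new AccumulatingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getValues();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

The point is that date parsing, substring checks and so on move from characters to endElement, so it no longer matters whether the parser delivers "2013-09-30" in one call or as "2013" followed by "-09-30".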

If you might need to collect values from elements that contain sub-elements, you could change isAccumulating from a flag to an integer depth counter. startElement increments the counter if it is greater than 0, or sets it to 1 if the element needs to have its value collected. characters appends the characters if the counter is greater than 0. endElement decrements the counter if it is greater than 0, and if the result is 0, handles and then clears the accumulator.
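
A rough sketch of the depth-counter variant, under the same caveats: the class name DepthCountingHandler, the element name "Body", the values list and the parse helper are assumptions made for the example only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DepthCountingHandler extends DefaultHandler {
    // Hypothetical element whose full text (including nested text) is wanted.
    private static final String TARGET = "Body";

    private final StringBuilder accumulator = new StringBuilder();
    private int depth = 0; // 0 means "not collecting"
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            depth++;           // a sub-element of the element being collected
        } else if (TARGET.equals(qName)) {
            depth = 1;         // start collecting
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            accumulator.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0) {
            depth--;
            if (depth == 0) {
                // Back at the end of the outer element: the value is complete.
                values.add(accumulator.toString());
                accumulator.setLength(0);
            }
        }
    }

    List<String> getValues() {
        return values;
    }

    // Convenience helper for trying the handler on an in-memory document.
    static List<String> parse(String xml) {
        try {
            DepthCountingHandler handler = new DepthCountingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getValues();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

Note that this collects only the text; the tags of the sub-elements themselves are dropped.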

Use String.trim() and check that String.length() > 0 before proceeding further into the characters() method, so that whitespace-only chunks are skipped.

Also use a stack to keep track of which tag the character data belongs to. You can then append each chunk to the value for that tag.
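
One way the stack idea might look, as a sketch only. The class name PathTrackingHandler, the valuesByPath map and the parse helper are invented for the example. The dotted path with a trailing dot mirrors the "Root.Header.Date." strings from the question, and the trim is applied to the complete value in endElement rather than to each chunk, which is a deliberate variation on the advice above so that whitespace at a chunk boundary is not lost.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PathTrackingHandler extends DefaultHandler {
    // Names of the currently open elements, root first.
    private final Deque<String> names = new ArrayDeque<>();
    // One text buffer per open element, in the same order as names.
    private final Deque<StringBuilder> texts = new ArrayDeque<>();
    // Completed values keyed by dotted path, e.g. "Root.Header.Date."
    private final Map<String, String> valuesByPath = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        names.addLast(qName);
        texts.addLast(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append to the buffer of the innermost open element.
        if (!texts.isEmpty()) {
            texts.peekLast().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String path = String.join(".", names) + ".";
        String text = texts.removeLast().toString().trim();
        names.removeLast();
        // Skip elements that held only whitespace (indentation, newlines).
        if (text.length() > 0) {
            valuesByPath.put(path, text); // a repeated path overwrites the earlier value
        }
    }

    Map<String, String> getValuesByPath() {
        return valuesByPath;
    }

    // Convenience helper for trying the handler on an in-memory document.
    static Map<String, String> parse(String xml) {
        try {
            PathTrackingHandler handler = new PathTrackingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getValuesByPath();
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

For files of several hundred megabytes you would probably process each value in endElement instead of storing everything in a map; the map is here only to keep the example short.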
