Java Unicode translation

I came across the following code:

public class LinePrinter {
    public static void main(String args[]) {
      //Note: \u000A is unicode for Line Feed
      char c=0x000A;
      System.out.println(c);
    }
}

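This doesn't compile because of the Unicode escape replacement done by the compiler: the escape inside the comment is turned into a real line feed, which ends the comment early.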

The question is: why doesn't the comment marker (//) take precedence over the Unicode replacement done by the compiler? I thought the compiler would discard comments first, before doing any other translation of the code.

EDIT:

Not sure if the above is clear enough.

I know what happens with the above and why it errors out. My expectation is that the compiler should ignore all the commented lines before doing any translation with the code. Obviously that's not the case here. I am expecting a rationale for this behaviour.


The specification states that a Java compiler must convert Unicode escapes to their corresponding characters before doing anything else. This allows things like non-ASCII characters in identifiers to be protected (via native2ascii) when the code is stored or sent over a channel that is not 8-bit clean.
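To make the "before anything else" point concrete, here is a minimal sketch in which a Unicode escape appears inside an identifier. The class, field, and method names are hypothetical and not taken from the question. Because the escape is translated before the source is split into tokens, both spellings should name the same field.

```java
public class EscapeEverywhere {

    // The first letter of the field name is written as a Unicode escape
    // (code point 0x0061, the letter 'a'), so this declares a field named abc.
    static int \u0061bc = 42;

    static int readField() {
        // Same identifier, written without the escape.
        return abc;
    }

    public static void main(String[] args) {
        System.out.println(readField()); // prints 42
    }
}
```

The source stays pure ASCII, which is the use case the specification has in mind.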

This rule applies globally; in particular, you can even escape comment markers using Unicode escapes. For example, the following two snippets are identical:

// Deal with opening and closing comment characters /*, etc.
myRisquéParser.handle("/*", "*/");

\u002F\u002F Deal with opening and closing comment characters /*, etc.
myRisqu\u00E9Parser.handle("/*", "*/");

If the compiler were to remove comments before handling Unicode escapes, then in the second snippet it would strip everything from the /* before ", etc." up to and including the */ inside the second string literal, leaving

\u002F\u002F Deal with opening and closing comment characters ");

which would then be unescaped into a single-line comment and removed at the next stage of parsing. The result would be no compiler error or warning, just a whole line of code silently dropped.
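The two orderings can be simulated with a small sketch. Everything in it is an illustrative assumption, not how javac is implemented: translateEscapes and stripComments are hypothetical helpers, the escape handling ignores the finer points of JLS §3.3, the comment stripper does not handle escaped quotes inside strings, and the identifier is simplified to myParser to keep the source ASCII-only.

```java
public class TranslationOrderDemo {

    // Escaped variant of the snippet, held as data. The doubled backslashes
    // in the literal produce a single backslash in the string at run time.
    static final String SOURCE =
        "\\u002F\\u002F Deal with opening and closing comment characters /*, etc.\n"
        + "myParser.handle(\"/*\", \"*/\");\n";

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // Replace each backslash, 'u', four hex digits sequence with the character it names.
    static String translateEscapes(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            if (src.startsWith("\\u", i) && i + 6 <= src.length()
                    && isHex(src.charAt(i + 2)) && isHex(src.charAt(i + 3))
                    && isHex(src.charAt(i + 4)) && isHex(src.charAt(i + 5))) {
                out.append((char) Integer.parseInt(src.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(src.charAt(i++));
            }
        }
        return out.toString();
    }

    // Naive comment remover: drops line and block comments, copies string literals verbatim.
    static String stripComments(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            if (src.startsWith("//", i)) {
                int end = src.indexOf('\n', i);          // keep the newline itself
                i = (end < 0) ? src.length() : end;
            } else if (src.startsWith("/*", i)) {
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? src.length() : end + 2;
            } else if (src.charAt(i) == '"') {
                int end = src.indexOf('"', i + 1);       // unterminated literal: copy the rest
                end = (end < 0) ? src.length() : end + 1;
                out.append(src, i, end);
                i = end;
            } else {
                out.append(src.charAt(i++));
            }
        }
        return out.toString();
    }

    // Order required by the specification: escapes first, then comments.
    static String translateThenStrip(String src) {
        return stripComments(translateEscapes(src));
    }

    // Hypothetical "comments first" order discussed in the answer.
    static String stripThenTranslate(String src) {
        return translateEscapes(stripComments(src));
    }

    public static void main(String[] args) {
        System.out.println("escapes first : " + translateThenStrip(SOURCE));
        System.out.println("comments first: " + stripThenTranslate(SOURCE));
    }
}
```

With escapes translated first, the call to handle survives and only the comment disappears. With comments stripped first, what remains unescapes to a lone line comment, so the call is lost without any diagnostic.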


This is covered in Java Puzzlers, #14. An extract of the explanation:

The key to understanding this puzzle is that Java provides no special treatment for Unicode escapes within string literals. The compiler translates Unicode escapes into the characters they represent before it parses the program into tokens, such as string literals [JLS 3.2].

The relevant paragraph in JLS v7 is section 3.3:

A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) of the indicated hexadecimal value, and passing all other characters unchanged.

The introduction to section 3 of the JLS gives a hint as to why this is the case:

Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters.
