What can cause the Java compiler to fail while parsing a comment?

2018-06-06 15:34:30

The following code is a valid Java program.

public class Foo
{
    public static void \u006d\u0061\u0069\u006e(String[] args)
    {
        System.out.println("hello, world");
    }
}

The main identifier is written using Unicode escape sequences. It compiles and runs fine.

$ javac Foo.java && java Foo
hello, world

Although the following details may not be necessary for this question, I am sharing them in case someone is curious. I am using the Java compiler from OpenJDK on Debian 8.0, but what I ask here should apply to any Java compiler.

$ javac -version
javac 1.7.0_79
$ readlink -f $(which javac)
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac

The following program fails to compile because the escape sequence used to write the m of main is invalid.

public class Foo
{
    public static void \u6d\u0061\u0069\u006e(String[] args)
    {
        System.out.println("hello, world");
    }
}

The compiler complains about an illegal Unicode escape.

$ javac Foo.java && java Foo
Foo.java:3: error: illegal unicode escape
    public static void \u6d\u0061\u0069\u006e(String[] args)
                           ^
Foo.java:3: error: invalid method declaration; return type required
    public static void \u6d\u0061\u0069\u006e(String[] args)
                            ^
2 errors

What surprised me is that the following program is also invalid, even though the illegal Unicode escape sequence appears to be inside a comment.

public class Foo
{
    // This comment contains \u6d.
    public static void main(String[] args)
    {
        System.out.println("hello, world");
    }
}

Here is the error.

$ javac Foo.java && java Foo
Foo.java:3: error: illegal unicode escape
    // This comment contains \u6d.
                                 ^
1 error

The compiler complains about the illegal unicode escape sequence although it appears to be in a comment.

The reason behind this behaviour becomes clear when we see how an end-of-line comment is defined in JLS §3.7.

EndOfLineComment:
  / / {InputCharacter}

JLS §3.4 defines InputCharacter as follows.

InputCharacter:
  UnicodeInputCharacter but not CR or LF

Finally, JLS §3.3 defines UnicodeInputCharacter as follows.

UnicodeInputCharacter:
  UnicodeEscape
  RawInputCharacter

UnicodeEscape:
  \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

UnicodeMarker:
  u {u}

HexDigit:
  (one of)
  0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

RawInputCharacter:
  any Unicode character
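
As a rough illustration of the UnicodeEscape production above, its shape can be written as a regular expression. This is only a sketch: the class UnicodeEscapeShape and its method are names made up for this example, and it deliberately ignores the additional JLS rule that a backslash preceded by an odd number of backslashes does not begin an escape.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any JDK API.
class UnicodeEscapeShape {
    // UnicodeEscape: a backslash, one or more 'u', then exactly four hex digits.
    // The backslash is doubled twice (once for Java, once for the regex) so that
    // this source file itself contains no malformed escape.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u+[0-9a-fA-F]{4}");

    /** Returns true if the whole text has the shape of a single UnicodeEscape. */
    static boolean isUnicodeEscape(String text) {
        return UNICODE_ESCAPE.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeEscape("\\u006d"));   // true: four hex digits
        System.out.println(isUnicodeEscape("\\uuu006d")); // true: several 'u' are allowed
        System.out.println(isUnicodeEscape("\\u6d"));     // false: only two hex digits
    }
}
```

The last case is the sequence from the failing comment: it starts like an escape but cannot be completed, which is what the compiler reports.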

Therefore, the lexical analyzer must recognize Unicode escape sequences before it can recognize comments. If it finds an illegal Unicode escape sequence, lexical analysis fails with an error, so the compiler never gets as far as recognizing the comment that contains it.

I used to think that everything from the start of a comment (say, //) to its end is ignored. The example above shows otherwise: the lexical analyzer still has to recognize Unicode escape sequences between the start and the end of a comment, and an illegal one makes lexical analysis fail.

What else can cause the compiler to fail while parsing a comment?

Short:

Nothing (nothing else).

Long:

Logically, the \u escape sequences are handled before lexical processing (scanning/tokenizing) takes place. According to https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.2:

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).

3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3).
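
To make the ordering concrete, here is a minimal sketch of what the first translation step does, written as a standalone pass over the raw text. It is not javac's actual implementation; the class UnicodeEscapePass, its methods, and the error message are assumptions made for this illustration. The point is that this pass knows nothing about comments, so a malformed escape is rejected wherever it occurs.

```java
// Hypothetical sketch of lexical translation step 1; not javac's actual code.
class UnicodeEscapePass {
    /** Translates every eligible Unicode escape in raw, or throws if one is malformed. */
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int backslashRun = 0; // contiguous raw backslashes immediately before index i
        while (i < raw.length()) {
            char c = raw.charAt(i);
            // A backslash starts an escape only if an even number of backslashes
            // precede it and a 'u' follows.
            boolean startsEscape = c == '\\'
                    && backslashRun % 2 == 0
                    && i + 1 < raw.length()
                    && raw.charAt(i + 1) == 'u';
            if (!startsEscape) {
                backslashRun = (c == '\\') ? backslashRun + 1 : 0;
                out.append(c);
                i++;
                continue;
            }
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++; // UnicodeMarker: one or more 'u'
            }
            if (j + 4 > raw.length()) {
                throw new IllegalArgumentException("illegal unicode escape at index " + i);
            }
            int codeUnit = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexValue(raw.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("illegal unicode escape at index " + i);
                }
                codeUnit = codeUnit * 16 + digit;
            }
            out.append((char) codeUnit); // the escape denotes one UTF-16 code unit
            backslashRun = 0;
            i = j + 4;
        }
        return out.toString();
    }

    /** Value of an ASCII hex digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(translate("void \\u006d\\u0061\\u0069\\u006e()")); // prints: void main()
        try {
            translate("// This comment contains \\u6d.");
        } catch (IllegalArgumentException e) {
            // Step 1 fails before any later step could recognize the comment.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Because the pass runs over the whole raw stream first, the later steps that discard comments only ever see text in which every escape has already been translated.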

So technically, \u6d in your example is NOT part of the comment. Whether it belongs to the comment is decided only after the escape has been translated to the character it stands for, and that translation is exactly where compilation fails.

As proof, the following class should compile:

public class Test {
    // is comment, the rest, not\u000a public static void main( String[] args) {
        System.out.println("See!");
    }
}
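
One way to check both claims without creating files by hand is to feed the two sources to the compiler through the standard javax.tools API. The following is a sketch under some assumptions: it needs a JDK (a plain JRE provides no system compiler), and the class CommentCompileCheck and its compiles method are names invented for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical harness for illustration; requires a JDK at run time.
class CommentCompileCheck {
    /** Compiles one in-memory source file and reports whether the compiler accepted it. */
    static boolean compiles(String className, String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler available (JRE only?)");
        }
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///" + className + ".java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        Path outDir;
        try {
            outDir = Files.createTempDirectory("comment-check"); // keep class files out of the way
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collect diagnostics instead of letting them print to stderr.
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        return compiler.getTask(null, null, diagnostics,
                Arrays.asList("-d", outDir.toString()), null,
                Collections.singletonList(file)).call();
    }

    public static void main(String[] args) {
        String nl = "\n";
        // The class from the answer: the newline escape ends the comment,
        // so the method header after it is ordinary code.
        String test = "public class Test {" + nl
                + "    // is comment, the rest, not\\u000a public static void main( String[] args) {" + nl
                + "        System.out.println(\"See!\");" + nl
                + "    }" + nl
                + "}" + nl;
        // The class from the question: a malformed escape in what looks like a comment.
        String foo = "public class Foo {" + nl
                + "    // This comment contains \\u6d." + nl
                + "    public static void main(String[] args) { }" + nl
                + "}" + nl;
        System.out.println("Test compiles: " + compiles("Test", test)); // expected: true
        System.out.println("Foo compiles: " + compiles("Foo", foo));    // expected: false
    }
}
```

The escapes are written with a doubled backslash inside the string literals so that the harness itself compiles; the compiler under test still receives a single backslash followed by u.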
