Why does Java allow control characters in its identifiers?

2018-06-06 15:30:22

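The Mystery

While exploring precisely which characters are permitted in Java identifiers, I stumbled upon something so curious that it seems almost certain to be a bug.

I had expected Java identifiers to follow the usual rule: they start with a character that has the Unicode property ID_Start, followed by characters that have the property ID_Continue, with an exception granted for leading underscores and dollar signs. That turned out not to be the case. What I found is at extreme variance with that, and with any other notion of a normal identifier that I have ever heard of.

Short Demo

Consider the following demonstration, which proves that the ASCII ESC character (octal 033) is permitted in Java identifiers: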

$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: \033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[

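It is even worse than that, though. Almost infinitely worse, in fact. Even NULs are permitted! So are thousands of other code points that are not identifier characters at all. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.

Long Demo

Here is a test program that shows all of the unexpected code points that Java, quite outrageously, allows as part of a legal identifier name.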
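The same result can be cross-checked without writing a source file to disk. The sketch below is not part of the original question: it assumes that the standard-library method javax.lang.model.SourceVersion.isIdentifier agrees with javac's lexer, and the class and method names (EscapeIdentifierCheck, isLegalName) are made up for illustration.

```java
import javax.lang.model.SourceVersion;

public class EscapeIdentifierCheck {
    // Hypothetical helper: asks the JDK's language-model API whether a string
    // is a syntactically valid identifier (keywords also count as valid here).
    static boolean isLegalName(String name) {
        return SourceVersion.isIdentifier(name);
    }

    public static void main(String[] args) {
        String withEsc = "var_" + (char) 0x1B;   // ASCII ESC, octal 033
        String withNul = "var_" + (char) 0x00;   // NUL
        String withSpace = "var_ ";              // a plain space, for contrast

        System.out.println("ESC suffix:   " + isLegalName(withEsc));
        System.out.println("NUL suffix:   " + isLegalName(withNul));
        System.out.println("space suffix: " + isLegalName(withSpace));
    }
}
```

On a standard JDK the first two checks should report true and the last false, which matches what the javac run above shows for ESC.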

#!/usr/bin/env perl
# 
# test-java-idchars - find which bogus code points Java allows in its identifiers
# 
#   usage: test-java-idchars [low high]
#   e.g.:  test-java-idchars 0 255
#
# Without arguments, tests Unicode code points
# from 0 .. 0x1000.  You may go further with a
# higher explicit argument.
#
# Produces a report at the end.
#
# You can ^C it prematurely to end the program then
# and get a report of its progress up to that point.
#
# Tom Christiansen
# tchrist@perl.com
# Sat Jan 29 10:41:09 MST 2011

use strict;
use warnings;

use encoding "Latin1";
use open IO => ":utf8";

use charnames ();

$| = 1;

my @legal;

my ($start, $stop) = (0, 0x1000);

if (@ARGV != 0) {
    if (@ARGV == 1) {
        for (($stop) = @ARGV) { 
            $_ = oct if /^0/;   # support 0OCTAL, 0xHEX, 0bBINARY
        }
    }
    elsif (@ARGV == 2) {
        for (($start, $stop) = @ARGV) { 
            $_ = oct if /^0/;
        }
    } 
    else {
        die "usage: $0 [ [start] stop ]\n";
    } 
} 

for my $cp ( $start .. $stop ) {
    my $char = chr($cp);

    next if $char =~ /[\s\w]/;   # skip whitespace and ordinary word characters

    my $type = "?";
    for ($char) {
        $type = "Letter"      if /\pL/;
        $type = "Mark"        if /\pM/;
        $type = "Number"      if /\pN/;
        $type = "Punctuation" if /\pP/;
        $type = "Symbol"      if /\pS/;
        $type = "Separator"   if /\pZ/;
        $type = "Control"     if /\pC/;
    } 
    my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
    next if $name eq "<missing>" && $cp > 0xFF;
    my $msg = sprintf("U+%04X %s", $cp, $name);
    print "testing p{$type} $msg...";
    open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;

print TESTPROGRAM <<"End_of_Java_Program";

public class testchar { 
    public static void main(String argv[]) { 
        String var_$char = "variable name ends in $msg";
        System.out.println(var_$char); 
    }
}

End_of_Java_Program

    close(TESTPROGRAM) || die $!;

    system q{
        ( javac -encoding UTF-8 testchar.java 
            && 
          java -Dfile.encoding=UTF-8 testchar | grep variable 
        ) >/dev/null 2>&1
    };

    push @legal, sprintf("U+%04X", $cp) if $? == 0;

    if ($? && $? < 128) {
        print "<interrupted>\n";
        exit;  # from a ^C
    } 

    printf "is %s in Java identifiers.\n",  
        ($? == 0) ? uc "legal" : "forbidden";

} 

END { 
    print "Legal but evil code points: @legal\n";
}

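Here is an example run of that program over just the first 33 code points, reporting on those that are neither whitespace nor identifier characters: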

$ perl test-java-idchars 0 0x20
testing p{Control} U+0000 NULL...is LEGAL in Java identifiers.
testing p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
testing p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
testing p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
testing p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
testing p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
testing p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
testing p{Control} U+0007 BELL...is LEGAL in Java identifiers.
testing p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
testing p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
testing p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
testing p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
testing p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
testing p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
testing p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
testing p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
testing p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
testing p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
testing p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
testing p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
testing p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
testing p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
testing p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
testing p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
testing p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
testing p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
testing p{Control} U+001E INFORMATION SEPARATOR TWO...is forbidden in Java identifiers.
testing p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.
Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B

Here is another demo:

$ perl test-java-idchars 0x600 0x700 | grep -i legal
testing p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
testing p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
testing p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
testing p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
testing p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD

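The Question

Can anyone please explain this seemingly insane behavior? There are many, many other inexplicably allowed code points all over the place, starting right at U+0000, which is perhaps the strangest of all. If you run the program on the first 0x1000 code points, you do see certain patterns emerge, such as allowing any and all code points with the property Currency_Symbol. But too much else is wholly inexplicable, at least to me.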
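The Currency_Symbol pattern, at least, is easy to confirm from inside Java. This is only a rough sketch, not something from the original post: the class name CurrencyIdentifiers, the helper isCurrency, and the handful of sample code points are chosen here purely for illustration.

```java
public class CurrencyIdentifiers {
    // True when the code point's general category is Sc (Currency_Symbol).
    static boolean isCurrency(int cp) {
        return Character.getType(cp) == Character.CURRENCY_SYMBOL;
    }

    public static void main(String[] args) {
        // Sample currency signs: dollar, cent, pound, yen, euro.
        int[] samples = { '$', 0x00A2, 0x00A3, 0x00A5, 0x20AC };
        for (int cp : samples) {
            System.out.printf("U+%04X currency=%b start=%b part=%b%n",
                    cp,
                    isCurrency(cp),
                    Character.isJavaIdentifierStart(cp),
                    Character.isJavaIdentifierPart(cp));
        }
    }
}
```

Each sample should come back as a currency symbol that is accepted both at the start of an identifier and inside one, so a name such as a single euro sign compiles.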

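Section 3.8 of the Java Language Specification defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter accepts, among other things, any character for which Character.isIdentifierIgnorable() returns true, and that covers the non-whitespace control characters (including the entire C1 range; see that method's documentation for the full list).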
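To see this from the library side, here is a minimal sketch that asks the Character class directly instead of compiling a file per code point. The class and method names (IgnorableControls, legalButOdd) are invented for this example, and the filter only roughly mirrors the Perl script's: it uses Java's own notion of whitespace, so U+000B and U+001C through U+001F are skipped here rather than reported as forbidden.

```java
import java.util.ArrayList;
import java.util.List;

public class IgnorableControls {
    // Collects the code points in [low, high] that are neither whitespace nor
    // ordinary word characters, yet pass Character.isJavaIdentifierPart.
    static List<String> legalButOdd(int low, int high) {
        List<String> legal = new ArrayList<>();
        for (int cp = low; cp <= high; cp++) {
            // Skip whitespace, letters, digits and underscore.
            if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)
                    || Character.isLetterOrDigit(cp) || cp == '_') {
                continue;
            }
            if (Character.isJavaIdentifierPart(cp)) {
                legal.add(String.format("U+%04X", cp));
            }
        }
        return legal;
    }

    public static void main(String[] args) {
        // Same range as the first Perl run: U+0000 through U+0020.
        System.out.println(legalButOdd(0x00, 0x20));
        // ESC gets in because it is identifier-ignorable.
        System.out.println(Character.isIdentifierIgnorable(0x1B));
        // U+0600 has general category Cf (format), which is ignorable as well.
        System.out.println(Character.getType(0x0600) == Character.FORMAT);
        System.out.println(Character.isIdentifierIgnorable(0x0600));
    }
}
```

For the range 0 to 0x20 the list should be U+0000 through U+0008 followed by U+000E through U+001B, the same set the Perl run reported as legal. The format-character check also accounts for the Arabic code points in the second run.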

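Another question might be: why shouldn't Java allow control characters in its identifiers?

A good principle when designing a language or any other system is not to forbid anything without a good reason. You never know how something might be used, and the fewer rules that implementors and users have to contend with, the better.

It is true that you certainly shouldn't take advantage of this by actually embedding escapes in your variable names, and you won't be seeing any popular libraries that expose classes with null characters in their names.

Certainly this could be abused, but it is not the language designer's job to protect programmers from themselves in this way, any more than it is to force proper indentation or well-chosen variable names.

I don't see what the big deal is. How does it affect you, anyway?

If a developer wants to obfuscate his code, he can do that with ASCII alone.

If a developer wants his code to be understood, he will use the lingua franca of the industry: English. His identifiers will not only be ASCII-only, they will also be built from common English words. Otherwise nobody is going to use or read his code, and he can use whatever crazy characters he likes.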
