Lexer Rules

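A lexer grammar is composed of lexer rules, optionally broken into multiple modes. Lexical modes allow us to split a single lexer grammar into multiple sublexers. The lexer can only return tokens matched by rules from the current mode.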

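Lexer rules specify token definitions and more or less follow the syntax of parser rules, except that lexer rules cannot have arguments, return values, or local variables. Lexer rule names must begin with an uppercase letter, which distinguishes them from parser rule names: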

/** Optional document comment */
TokenName : alternative1 | ... | alternativeN ;

You can also define rules that are not tokens themselves but aid in the recognition of tokens. These fragment rules do not produce tokens visible to the parser:

fragment
HelperTokenRule : alternative1 | ... | alternativeN ;

For example, DIGIT is a very common fragment rule:

INT : DIGIT+ ; // references the DIGIT helper rule
fragment DIGIT : [0-9] ; // not a token by itself

Lexical Modes

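Modes allow you to group lexical rules by context, such as inside and outside of XML tags. It is like having multiple sublexers, one for each context. The lexer can only return tokens matched by entering a rule in the current mode. Lexers start out in the so-called default mode. All rules are considered to be in the default mode unless you specify a mode command. Modes are not allowed within combined grammars, only in lexer grammars. (See grammar XMLLexer from Tokenizing XML.)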

rules in default mode
...
mode MODE1;
rules in MODE1
...
mode MODEN;
rules in MODEN
...

Lexer Rule Elements

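Lexer rules allow two constructs that are unavailable to parser rules: the .. range operator and the character set notation enclosed in square brackets, [characters]. Don't confuse character sets with arguments to parser rules. In a lexer, [characters] only ever means a character set. Here is a summary of all lexer rule elements: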

Syntax	Description
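T	Match token T at the current input position. Tokens always begin with a capital letter.
'literal'	Match that character or sequence of characters. For example, 'while' or '='.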
[char set]	Match one of the characters specified in the character set. Interpret `x-y` as the set of characters between range `x` and `y`, inclusively. The following escaped characters are interpreted as single special characters: `\n`, `\r`, `\b`, `\t`, `\f`, `\uXXXX`, and `\u{XXXXXX}`. To get `]` or `\` you must escape them with `\`. To get `-` you must escape it with `\` too, except when `-` is the first or last character in the set.

You can also include all characters matching Unicode properties (general category, boolean, or enumerated including scripts and blocks) with `\p{PropertyName}` or `\p{EnumProperty=Value}`. (You can invert the test with `\P{PropertyName}` or `\P{EnumProperty=Value}`.) For a list of valid Unicode property names, see Unicode Standard Annex #44. (ANTLR also supports short and long Unicode general category names and values like `\p{Lu}`, `\p{Z}`, `\p{Symbol}`, `\p{Blk=Latin_1_Sup}`, and `\p{Block=Latin_1_Supplement}`.) As a shortcut for `\p{Block=Latin_1_Supplement}`, you can refer to blocks using Unicode block names prefixed with `In` and with spaces changed to `_`. For example: `\p{InLatin_1_Supplement}`, `\p{InYijing_Hexagram_Symbols}`, and `\p{InAncient_Greek_Numbers}`.

A few extra properties are supported:
`\p{Extended_Pictographic}` (see UTS #35)
`\p{EmojiPresentation=EmojiDefault}` (code points which have colorful emoji-style presentation by default but which can also be displayed text-style)
`\p{EmojiPresentation=TextDefault}` (code points which have black-and-white text-style presentation by default but which can also be displayed emoji-style)
`\p{EmojiPresentation=Text}` (code points which have only black-and-white text-style and lack a colorful emoji-style presentation)
Property names are case-insensitive, and `_` and `-` are treated identically.

Here are a few examples:

WS : [ \n\u000D] -> skip ; // same as [ \n\r]
UNICODE_WS : [\p{White_Space}] -> skip; // match all Unicode whitespace
ID : [a-zA-Z] [a-zA-Z0-9]* ; // match usual identifier spec
UNICODE_ID : [\p{Alpha}\p{General_Category=Other_Letter}] [\p{Alnum}\p{General_Category=Other_Letter}]* ; // match full Unicode alphabetic ids
EMOJI : [\u{1F4A9}\u{1F926}] ; // note Unicode code points > U+FFFF
DASHBRACK : [\-\]]+ ; // match - or ] one or more times
DASH : [---] ; // match a single -, i.e., "any character" between - and - (note first and last - not escaped)
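'x'..'y'	Match any single character between range x and y, inclusively. For example, 'a'..'z'. 'a'..'z' is identical to [a-z].
T	Invoke lexer rule T; recursion is allowed in general, but not left recursion. T can be a regular token or fragment rule.

ID : LETTER (LETTER|'0'..'9')* ;
fragment LETTER : [a-zA-Z\u0080-\u00FF_] ;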
.	The dot is a single-character wildcard that matches any single character. Example:

ESC : '\\' . ; // match any escaped \x character
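{«action»}	Lexer actions can appear anywhere as of 4.2, not just at the end of the outermost alternative. The lexer executes the actions at the appropriate input position, according to the placement of the action within the rule. To execute a single action for a rule that has multiple alternatives, you can enclose the alternatives in parentheses and put the action afterwards:

END : ('endif'|'end') {System.out.println("found an end");} ;

The action conforms to the syntax of the target language. ANTLR copies the action's contents into the generated code verbatim; there is no translation of expressions like $x.y as there is in parser actions. Only actions within the outermost token rule are executed. In other words, if STRING calls ESC_CHAR and ESC_CHAR has an action, that action is not executed when the lexer starts matching in STRING.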
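{«p»}?	Evaluate semantic predicate «p». If «p» evaluates to false at runtime, the surrounding rule becomes "invisible" (nonviable). Expression «p» conforms to the target language syntax. While semantic predicates can appear anywhere within a lexer rule, it is most efficient to have them at the end of the rule. The one caveat is that semantic predicates must precede lexer actions. See Predicates in Lexer Rules.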
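~x	Match any single character not in the set described by x. Set x can be a single character literal, a range, or a subrule set like ~('x'|'y'|'z') or ~[xyz]. Here is a rule that uses ~[\r\n]* to match any characters other than carriage return and newline:

COMMENT : '#' ~[\r\n]* '\r'? '\n' -> skip ;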

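Just as with parser rules, lexer rules allow subrules in parentheses and the EBNF operators ?, *, and +. The COMMENT rule illustrates the * and ? operators. A common use of + is [0-9]+ to match integers. Lexer subrules can also use the nongreedy ? suffix on those EBNF operators.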
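The effect of the nongreedy ? suffix can be illustrated by analogy with Python's re module, which happens to use the same suffix. This is only an analogy (ANTLR's lexer is not a regular expression engine, and its matching rules differ in detail), and the sample input string is made up for the illustration:

```python
import re

text = "/* a */ x /* b */"

# Greedy: .* runs to the LAST "*/", swallowing everything in between.
greedy = re.match(r"/\*.*\*/", text, re.DOTALL).group()

# Nongreedy: .*? stops at the FIRST "*/", which is the intent of
# a lexer rule written as '/*' .*? '*/'.
nongreedy = re.match(r"/\*.*?\*/", text, re.DOTALL).group()

print(greedy)     # /* a */ x /* b */
print(nongreedy)  # /* a */
```

The nongreedy form is what you normally want for delimited constructs such as comments and strings, where the first closing delimiter should end the token.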

Recursive Lexer Rules

Unlike most lexical grammar tools, ANTLR lexer rules can be recursive. This comes in handy when you want to match nested tokens such as nested action blocks: {...{...}...}.

lexer grammar Recur;

ACTION : '{' ( ACTION | ~[{}] )* '}' ;

WS : [ \r\t\n]+ -> skip ;
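To make the recursion concrete, here is a minimal hand-written sketch of what the ACTION rule above recognizes. The function name match_action and its return convention are assumptions made for this illustration; this is not the code ANTLR generates.

```python
def match_action(text: str, pos: int = 0) -> int:
    """Return the index just past a balanced {...} block starting at pos,
    or -1 if no such block starts there."""
    if pos >= len(text) or text[pos] != "{":
        return -1
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == "}":
            return i + 1               # closing brace ends this ACTION
        if ch == "{":
            i = match_action(text, i)  # nested ACTION: recurse
            if i == -1:
                return -1
        else:
            i += 1                     # ~[{}]: any other character
    return -1                          # input ended before the closing brace


print(match_action("{a{b}c}"))  # 7
print(match_action("{a{b}c"))   # -1
```

A purely regular (non-recursive) token definition cannot track arbitrary nesting depth, which is why the recursive call is needed.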

Redundant String Literals

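Be careful not to specify the same string literal on the right-hand side of multiple lexer rules. Such literals are ambiguous and could match multiple token types, so ANTLR makes the literal unavailable to the parser. The same is true for rules across modes. For example, the following lexer grammar defines two tokens with the same character sequence: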

lexer grammar L;
AND : '&' ;
mode STR;
MASK : '&' ;

A parser grammar cannot reference the literal '&', but it can reference the token names:

parser grammar P;
options { tokenVocab=L; }
a : '&' // results in a tool error: no such token
    AND // no problem
    MASK // no problem
  ;

Here is a build and test sequence:

$ antlr4 L.g4 # yields L.tokens file needed by tokenVocab option in P.g4
$ antlr4 P.g4
error(126): P.g4:3:4: cannot create implicit token for string literal '&' in non-combined grammar

Lexer Rule Actions

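An ANTLR lexer creates a Token object after matching a lexical rule. Each request for a token starts in Lexer.nextToken, which calls emit once it has identified a token. emit collects information from the current state of the lexer to build the token. It accesses the fields _type, _text, _channel, _tokenStartCharIndex, _tokenStartLine, and _tokenStartCharPositionInLine. You can set the state of these with the various setter methods, such as setType. For example, the following rule turns enum into an identifier if enumIsKeyword is false.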

ENUM : 'enum' {if (!enumIsKeyword) setType(Identifier);} ;

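ANTLR does no special $x attribute translations in lexer actions (unlike v3).

There can be at most a single action for a lexical rule, regardless of how many alternatives there are in that rule.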

Lexer Commands

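To avoid tying a grammar to a particular target language, ANTLR supports lexer commands. Unlike arbitrary embedded actions, these commands follow a specific syntax and are limited to a few common commands. Lexer commands appear at the end of the outermost alternative of a lexer rule definition. Like arbitrary actions, there can only be one per token rule. A lexer command consists of the -> operator followed by one or more command names that can optionally take parameters: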

TokenName : «alternative» -> command-name
TokenName : «alternative» -> command-name («identifier or integer»)

An alternative can have more than one command, separated by commas. Here are the valid command names:

skip
more
popMode
mode( x )
pushMode( x )
type( x )
channel( x )

See the book source code for usage; some examples are shown here:

skip

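A 'skip' command tells the lexer to get another token and throw out the current text.

ID : [a-zA-Z]+ ; // match identifiers
INT : [0-9]+ ; // match integers
NEWLINE:'\r'? '\n' ; // return newlines to parser (is end-statement signal)
WS : [ \t]+ -> skip ; // toss out whitespace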
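As a rough sketch of what skip means for the token stream, the following hand-written tokenizer mirrors the four rules above using Python's re module. TOKEN_SPEC, SKIPPED, and tokenize are hypothetical names for this illustration only. Note one simplification: this sketch tries the rules in order, whereas an ANTLR lexer prefers the longest match, which makes no difference for these four non-overlapping rules.

```python
import re

# Hypothetical token table mirroring the four rules above.
TOKEN_SPEC = [
    ("ID",      r"[a-zA-Z]+"),
    ("INT",     r"[0-9]+"),
    ("NEWLINE", r"\r?\n"),
    ("WS",      r"[ \t]+"),
]
SKIPPED = {"WS"}  # rules that carry the skip command
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        if m.lastgroup in SKIPPED:
            continue  # skip: drop the matched text and look for the next token
        tokens.append((m.lastgroup, m.group()))
    return tokens


print(tokenize("x 42\n"))  # [('ID', 'x'), ('INT', '42'), ('NEWLINE', '\n')]
```

The whitespace between x and 42 is consumed but never reaches the consumer of the token list, while the newline does.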

mode(), pushMode(), popMode, and more

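The mode commands alter the mode stack and hence the mode of the lexer. The 'more' command forces the lexer to get another token but without throwing out the current text. The token type will be that of the "final" rule matched (i.e., the one without a more or skip command).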

// Default "mode": Everything OUTSIDE of a tag
COMMENT : '<!--' .*? '-->' ;
CDATA   : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
 ...
XMLDeclOpen : '<?xml' S -> pushMode(INSIDE) ;
SPECIAL_OPEN: '<?' Name -> more, pushMode(PROC_INSTR) ;
// ----------------- Everything INSIDE of a tag ---------------------
mode INSIDE;
CLOSE        : '>' -> popMode ;
SPECIAL_CLOSE: '?>' -> popMode ; // close <?xml...?>
SLASH_CLOSE  : '/>' -> popMode ;

Also check out:

lexer grammar Strings;
LQUOTE : '"' -> more, mode(STR) ;
WS : [ \r\t\n]+ -> skip ;
mode STR;
STRING : '"' -> mode(DEFAULT_MODE) ; // token we want parser to see
TEXT : . -> more ; // collect more text for string

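Popping the bottom layer of a mode stack will result in an exception. Switching modes with mode changes the current stack top. More than one more is the same as just one, and its position does not matter.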
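The stack behavior of the three mode commands can be modeled with a few lines of Python. This is a simplified sketch under stated assumptions: the class ModeStack, its method names, and the use of IndexError are made up for the illustration and are not the runtime's API.

```python
class ModeStack:
    DEFAULT_MODE = "DEFAULT_MODE"

    def __init__(self):
        self.current = self.DEFAULT_MODE
        self._stack = []  # modes saved by push_mode, most recent last

    def mode(self, m):
        # mode(x): replace the current mode; nothing is saved
        self.current = m

    def push_mode(self, m):
        # pushMode(x): remember the current mode, then switch to x
        self._stack.append(self.current)
        self.current = m

    def pop_mode(self):
        # popMode: return to the most recently saved mode
        if not self._stack:
            raise IndexError("popMode with nothing left to pop")
        self.current = self._stack.pop()
        return self.current


modes = ModeStack()
modes.push_mode("INSIDE")      # e.g. after matching '<'
modes.mode("PROC_INSTR")       # replaces INSIDE, saved modes unchanged
print(modes.pop_mode())        # DEFAULT_MODE
```

A second pop_mode() call at this point would raise, which corresponds to popping the bottom layer of the mode stack.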

type()

lexer grammar SetType;
tokens { STRING }
DOUBLE : '"' .*? '"'   -> type(STRING) ;
SINGLE : '\'' .*? '\'' -> type(STRING) ;
WS     : [ \r\t\n]+    -> skip ;

For multiple 'type()' commands, only the rightmost has an effect.

channel()

BLOCK_COMMENT
    : '/*' .*? '*/' -> channel(HIDDEN)
    ;
LINE_COMMENT
    : '//' ~[\r\n]* -> channel(HIDDEN)
    ;
... 
// ----------
// Whitespace
//
// Characters and character constructs that are of no import
// to the parser and are used to make the grammar easier to read
// for humans.
//
WS : [ \t\r\n\f]+ -> channel(HIDDEN) ;

As of 4.5, you can also define channel names like enumerations with the following construct above the lexer rules:

channels { WSCHANNEL, MYHIDDEN }
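As a rough sketch of what a channel is for: each token carries a channel, and a consumer such as the parser reads only one channel at a time, so tokens sent elsewhere are kept but not seen. The Tok type, the string channel names, and visible_to_parser are all hypothetical; the ANTLR runtime represents channels as integers and performs this filtering in its token stream.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    type: str
    text: str
    channel: str = "DEFAULT"


def visible_to_parser(tokens, channel="DEFAULT"):
    # Keep only the tokens on the channel the consumer is tuned to.
    return [t for t in tokens if t.channel == channel]


stream = [
    Tok("ID", "x"),
    Tok("WS", " ", channel="WSCHANNEL"),
    Tok("LINE_COMMENT", "// note", channel="HIDDEN"),
    Tok("INT", "42"),
]

print([t.text for t in visible_to_parser(stream)])  # ['x', '42']
```

Unlike skip, sending a token to another channel keeps it available, for example to a tool that needs the comments or the original whitespace.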
