语法树与抽象语法树(parse tree & abstract syntax tree)

是在parsing阶段,derivation的图像化表示,parser tree focus on grammar的actual implemment,包括像white spaces, braces, keywords, parenthesis 等一些细节。 “parse tree” 也叫 “concrete syntax tree” ,它represents the syntactic structure of a string(or token stream)according to some context-free grammar.

The parse tree records a sequence of rules the parser applies to recognize the input.
parse tree记录解析器应用于识别输入的一系列规则
The root of the parse tree is labeled with the grammar start symbol.
parse tree的根是grammar的start non-terminal symbol。
Each interior node represents a non-terminal, that is it represents a grammar rule application, like expr, term, or factor in our case.
每个内部节点代表一个非终端,即它代表语法规则应用程序,如我们的情况下的expr,term或factor。
Each leaf node represents a token.
每个叶节点代表一个token。



AST is a tree representation of the abstract syntactic structure of source codewritten in a programming language.它 focus on source code的 各个 components 之间的abstract relationships。 it doesn’t need to contain all the syntactical elements
he AST captures the essence of the input while being smaller. AST在较小的情况下捕获了输入的本质



3 What’s the difference
1.every grammars for the same language will give different parse trees but should result to the same AST



  1. AST 不focus on the ways they are generated by a grammar.。 AST focuses on programming constructs。也就是operators:computational operators( ±*/ [a, b]),for operator(for [ expr, expr, expr, stmnt ])



ASTs uses operators/operations as root and interior nodes and it uses operands as their children.
AST使用运算符/操作作为根节点和内部节点,并使用操作数作为子节点。
ASTs do not use interior nodes to represent a grammar rule, unlike the parse tree does.
与解析树不同,AST不使用内部节点来表示语法规则
ASTs don’t represent every detail from the real syntax (that’s why they’re called abstract) - no rule nodes and no parentheses, for example.
AST并不代表真实语法中的每个细节(这就是为什么它们被称为抽象) - 例如,省略了节点和括号等细节。
ASTs are dense compared to a parse tree for the same language construct
与同一语言构造的解析树相比,AST是密集的。



https://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees



CSTs - Concrete Syntax Trees (a.k.a. Parse Trees) and ASTs - Abstract Syntax Trees (a.k.a. Syntax Trees) are two terms that come up a lot in compiler & parser nomenclature. However, it is not always understood what is the difference between the two, because they are closely related.



In this article I attempt to explain the difference, at least for myself.



Concrete Syntax Trees
CSTs are a representation of grammars in a tree-like form. Here’s a definition from the Dragon book [1]:



A parse tree pictorially shows how the start symbol of a grammar derives a string in the language.
That is, the CST is a one-to-one mapping from the grammar to a tree-form. An example will make this easier to understand.



Consider parsing the C statement return a + 2;. Following the ANSI C grammar defined in K&R2 and appearing in its YACC representation in various places on the web, the parse tree of this statement looks thus:



And this is not even the complete tree. The path between conditional_expression and additive_expression goes through the sequence: logical_or_expression, logical_and_expression, inclusive_or_expression, exclusive_or_expression, and_expression, equality_expression, relational_expression and shift_expression. I kid you not! There are two more nodes between unary_expression and constant as well.



Indeed, this is a very formal representation that strictly shows how the parser understands the statement return a + 2;, but it is not a very useful representation to work with, is it?



This is where ASTs come in.



Abstract Syntax Trees
ASTs are simplified syntactic representations of the source code, and they’re most often expressed by the data structures of the language used for implementation [2]. Quoting from the Dragon book again:



Abstract syntax trees, or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees.
ASTs don’t show the whole syntactic clutter, but represent the parsed string in a structured way, discarding all information that may be important for parsing the string, but isn’t needed for analyzing it.



Here’s an AST for return a + 2;, as generated by pycparser:



Looks like something much more manageable, doesn’t it? Note that an AST conveys information in the internal nodes as well, and not only at the leaves. So, instead of going through a long list of expression nodes just to represent the type of the expression, a simple op attribute in a BinOp node does the job. Also, the AST doesn’t hold syntactic nodes like ‘;’, because these are useless for analysis.



ASTs are usually the last product of the front-end of a compiler. They represent the structure of the code in a compact and useful way that facilitates convenient analysis and further processing.



From a CST to an AST
Parsers usually either construct ASTs directly in their actions, or first construct CSTs and then convert them to ASTs. It’s interesting to note that CSTs are trivial to build once you have a grammar written [3], and some tools will build it automatically. After all, a CST is just another representation of the grammar. But beware of tools that promise to automatically build ASTs - this can work in a general sense only for trivial grammars, and what you’ll get is most likely a CST rather than an AST.



For some languages, it is far from trivial to construct an AST from a CST. A good example is the syntax of type declarations in C, which is known to be tricky.



Consider the declaration int* arp[5]; - an array of 5 pointers to int. Here’s its parse tree, according to the ANSI C grammar:



Clutter aside, there are two other problems here that make analysis difficult:



Note that the pointer node is actually above the array, so working through the CST directly, it can be thought that we’ve declared a pointer to an array! This is the “reverse logic of declarations” problem of C.
Where is the array name? It’s buried deep down in the tree, which isn’t very convenient.
Because of these problems, it would be very difficult to work directly with a CST for C, without converting it to AST. pycparser converts to AST on the fly (without generating an explicit CST), and here’s the AST that represents int* arp[5];:



Both problems are solved. The pointer is now clearly below the array, and the variable name was raised to the top-most node. This AST can be just read out loud to produce the correct interpretation of the declaration: “arp is an array of 5 pointers to int”.



Conclusion
The front-end of a compiler can be seen as a process that transforms the input from its most concrete form (statements in the source language) to its most abstract form - stored in internal data structures and ready for analysis & translation.



CSTs and ASTs are two steps in this process. CSTs are more concrete, as they represent the input in a tree-like form, in the way it was parsed by the parser. This form only conveys syntactic information, and thus is simple to create from a grammar, and difficult to analyze.



ASTs are more abstract. They drop all the syntactic clutter and focus on the structure of the input, representing it in a hierarchical data structure useful for analysis & translation.



ASTs are either built on-the-fly in the parser’s actions, or constructed from a CST at a later stage. In either case, ASTs are not always simple to construct from the grammar / CST, and for some languages and constructs require non-trivial processing.



[1] “Compilers: Principles, Techniques, and Tools” by Aho, Sethi and Ullman
[2] For dynamical languages like Perl, Python, Ruby, Lisp etc. it’s as simple as just shoving all sub-productions into a single array and returning it. The result will be a recursively nested array of arrays that represents the tree. This representation is easy to build, but not so easy to traverse meaningfully, so it’s rarely used.
[3] Which makes their implementations flexible. Unlike the CST, which is tied quite strictly to the formal grammar of the language, ASTs are “freehand” data structures that can be built in different ways, augmenting as much information as required by the next stages of the compiler/translator.


Category golang