Interpreter example¶
This example discusses some of the more advanced features of tree-gen, using an interpreter for a very simple language as an example. This is, however, not a completely integrated or even functional interpreter; we just go over some of the concepts and things you may run into when you would make such an interpreter.
External nodes & multiple trees¶
When trees get large and complicated, you may want to split a tree up
into multiple files. This is supported within tree-gen, as long as there
are no circular references to nodes between the files. In this example,
we’ve split up the expression-like nodes into a tree separate from the
statement-like nodes. The former, value.tree
, has nothing special
going on in this regard, while the latter, program.tree
, effectively
includes the former. The files are listed at the bottom of this page
for your convenience.
tree-gen itself is unaware of any links between files. For
program.tree
, it simply generates code assuming that the types
referred to through the external
edges exist. For this to be true,
you have to tell tree-gen to include the external tree’s generated
header file at the top of its generated header file using the include
directive. Note also that, as tree-gen is unaware of the external tree,
you’ll have to use the full C++ TitleCase type name, rather than the
snake_case name for normal edges.
Integration with Flex/Bison (or other parsers)¶
When making an interpreter (or parser of any kind) with tree-gen, you’ll
probably want to use a tokenizer/parser generator such as Flex/Bison.
With Bison at least, you’ll quickly run into a fundamental issue when
trying to store tree-gen nodes in the generated yyval
union:
union in C++ can only consist of types with trivial constructors, and
tree-gen nodes are not among those. The author is unfamiliar with
other frameworks like it, but this will probably apply for any C-style
parser generator.
The solution is to use raw pointers to the generated node types in this enumeration. For instance, you may end up with the following (which would be generated by Bison):
union node_type {
program::Program *prog;
program::Assignment *asgn;
value::Literal *lit;
value::Reference *ref;
// ...
};
To create a new node, instead of using tree::base::make
,
you use the new
keyword directly.
node_type n1, n2;
n1.lit = new value::Literal(2);
n2.ref = new value::Reference("x");
The generated constructors won’t help you much for non-leaf nodes,
because they expect edge wrappers (One, Maybe, etc.) around the
parameters, so you’ll typically want to use the constructor without
arguments. Instead, to assign child nodes, you can use
Maybe::set_raw()
, One::set_raw()
, Any::add_raw()
,
or Many::add_raw()
. These functions take ownership of a new-allocated
raw pointer to a node - exactly what we need here. Note that because they
take ownership, you don’t have to (and shouldn’t!) delete manually.
node_type n3;
n3.asgn = new program::Assignment();
n3.asgn->rhs.set_raw(n1.lit);
n3.asgn->lhs.set_raw(n2.ref);
node_type n4;
n4.prog = new program::Program();
n4.prog->statements.add_raw(n3.asgn);
Bison will ultimately pass you the final parsed node, which you then have
to put into a One
edge yourself:
auto tree = tree::base::One<program::Program>();
tree.set_raw(n4.prog);
tree->dump();
↓
Program(
variables: []
statements: [
Assignment(
lhs: <
Reference(
name: x
target --> !MISSING
)
>
rhs: <
Literal(
value: 2
)
>
)
]
)
Note that when you’re using Bison, the enum and its types is largely hidden from you. In the actions the above would look more like this:
// Action for a literal:
{ $$ = new value::Literal(std::stoi($1)); }
// Action for a reference:
{ $$ = new value::Reference($1); /* and deallocate $1, assuming it's a char* */ }
// Action for a assignment:
{ $$ = new value::Assignment(); $$->lhs.set_raw($1); $$->rhs.set_raw($3); }
and so on.
Syntax error recovery¶
Some parser generators (like Bison) allow you to specify recovery rules
in case of a syntax error, so the parser doesn’t just die immediately.
This is necessary to emit more than a single error message at a time.
To help you store recovery information in the tree for post-mortem
analysis, tree-gen has a special error
directive that you can place
inside a node class. For example, if you make a recovery rule that
assumes any semicolon encountered after a syntax error produces a broken
statement, you can make a node like this:
# An erroneous statement.
erroneous_statement {
error;
}
Such nodes behave exactly like any other node, with one exception: they always indicate that they’re not well-formed.
auto err_stmt = tree::base::make<program::ErroneousStatement>();
ASSERT_RAISES(tree::base::NotWellFormed, err_stmt.check_well_formed());
↓
tree::base::NotWellFormed exception message: ErroneousStatement error node in tree
Line number information (and annotations in general)¶
When you’re doing any kind of serious parsing, you’ll want to store as much line number and other contextual information as possible, to make your error messages as clear as possible. You could of course add this information to every node in the tree specification file, perhaps using inheritance to prevent excessive repetition, but this gets annoying fast, especially when it comes to tree-gen’s well-formedness rules.
Instead, you should be using annotations for this. Annotations are
objects added to nodes without you ever having to declare that you’re
going to add them. If that sounds like black magic to you in the context
of C++, well, that’s because it is: internally annotations are stored
as a map from std::type_index
to a C++11 backport of std::any
…
it’s complicated. But you don’t have to worry about it. What matters is
that each and every node generated by tree-gen (or anything else that
inherits from tree::annotatable::Annotatable
) can have zero or one
annotation of every C++ type attached to it.
The use of annotations is not limited to metadata (although that is its primary purpose); they also allow bits of code to temporarily attach their own data to tree nodes, without the tree definitions needing to be updated to reflect this. This is especially useful when you’re using tree-gen in a library, and you want to let users operate on your trees. After all, they wouldn’t be able to modify the tree definition file without forking and recompiling your library.
tree-gen has only one special case for annotations, intended for adding
line number information to debug dumps. The location
directive is
used for this:
// Source location annotation type. The generated dumper will see if this
// annotation exists for a node, and if so, write it to the debug dump using
// the << stream operator.
location primitives::SourceLocation
Without going into too much detail about the Annotatable interface (just look at the API docs for that), here’s an example of how it would work.
n1.lit->set_annotation(primitives::SourceLocation("test", 1, 5));
n2.ref->set_annotation(primitives::SourceLocation("test", 1, 1));
n3.asgn->set_annotation(primitives::SourceLocation("test", 1, 1));
n4.prog->set_annotation(primitives::SourceLocation("test", 1, 1));
tree->dump();
↓
Program( # test:1:1
variables: []
statements: [
Assignment( # test:1:1
lhs: <
Reference( # test:1:1
name: x
target --> !MISSING
)
>
rhs: <
Literal( # test:1:5
value: 2
)
>
)
]
)
Note that we can still use these pointers despite ownership having been passed to the node objects because they refer to the same piece of memory. In practice, though, you would do this during the parser actions, just after constructing them.
When serializing and deserializing, annotations are ignored by default;
they can be of any type whatsoever, and C++ can’t dynamically introspect
which types can be (de)serialized and which can’t be, after all. So even
though the example SourceLocation object extends
tree::annotatable::Serializable
, this doesn’t work automagically.
auto node = tree::base::make<value::Literal>(2);
node->set_annotation(primitives::SourceLocation("test", 1, 1));
std::string cbor = tree::base::serialize(node);
auto node2 = tree::base::deserialize<value::Literal>(cbor);
ASSERT(!node2->has_annotation<primitives::SourceLocation>());
// However, you *can* register annotations types for serialization and
// deserialization if you want to through the
// ``tree::annotatable::serdes_registry`` singleton. After that, it will
// work.
tree::annotatable::serdes_registry.add<primitives::SourceLocation>("loc");
cbor = tree::base::serialize(node);
node2 = tree::base::deserialize<value::Literal>(cbor);
ASSERT(node2->has_annotation<primitives::SourceLocation>());
Two add()
methods are provided. The one used here assumes that the
type has an appropriate serialize()
method and an appropriate
constructor for deserialization, the other allows you to specify them
manually using ``std::function``s. The name is optional but strongly
recommended; if not used, a C++-compiler-specific identifier will be used
for the type.
Similar to links, annotations aren’t copied as you might expect by
copy()
and clone()
. Specifically, annotations are stored as
std::shared_ptr``s to unknown C++ objects, and therefore copying a
node only copies the references to the annotations. To fully clone
annotations, you'll either have to serialize and deserialize the node
they belong to (after registering with ``serdes_registry
of course), or
clone them manually.
Visitor pattern¶
In the directory example, we avoided the difficulty of dealing with edges
to incomplete types, such as a One<value::Rvalue>
, aside from a
single as_file()
call halfway through. Spamming such typecasts to,
for instance, evaluate an rvalue expression during constant propagation
or interpreting, is not very scalable. It can also lead to headache down
the line if you ever have to add more subclasses, as it’s easy to forget
or not bother to check for unknown types when initially writing that code
that way. tree-gen also generates the necessary classes for the visitor
pattern for you to avoid this.
In the visitor pattern, you define a class that implements the
appropriate visitor interface, and (in this case) the nodes will call
methods on this class depending on their type when you pass the visitor
object to them through their visit()
method. Because the visitor
interface classes are also generated by tree-gen and have sane default
behavior, your code will “by default” be future-proof, in the sense that
you’ll get an error if something unexpected happens, rather than it
failing silently and maybe crashing down the line.
Two visitor base classes are provided, Visitor<T=void>
and
RecursiveVisitor
. In either case, you have to override
visit_node()
, which is called when a node is encountered for which
no specialization is available; this should just throw an appropriate
exception. You can then optionally override any node in the node class
hierarchy to provide the actual behavior.
The two classes differ in the default behavior when an unspecialized
node is encountered: Visitor
always defers up the class hierarchy,
while RecursiveVisitor
provides a default implementation for node
types with edges, where it simply recursively visits the child nodes
depth-first.
The T type in Visitor<T>
refers to the return type for the visit
methods. It’s fixed to void
for RecursiveVisitor
, because if a
node has multiple children, there is nothing sensible to return.
Let’s look at what an expression evaluator looks like for this.
class RvalueEvaluator : public value::Visitor<int> {
public:
int visit_node(value::Node &node) override {
throw std::runtime_error("unknown node type");
}
int visit_literal(value::Literal &node) override {
return node.value;
}
int visit_negate(value::Negate &node) override {
return -node.oper->visit(*this);
}
int visit_add(value::Add &node) override {
return node.lhs->visit(*this) + node.rhs->visit(*this);
}
int visit_sub(value::Sub &node) override {
return node.lhs->visit(*this) - node.rhs->visit(*this);
}
int visit_mul(value::Mul &node) override {
return node.lhs->visit(*this) * node.rhs->visit(*this);
}
int visit_div(value::Div &node) override {
return node.lhs->visit(*this) / node.rhs->visit(*this);
}
int visit_reference(value::Reference &node) override {
return node.target->value;
}
};
// 2 + (3 * 4) = 14
auto expr = tree::base::make<value::Add>(
tree::base::make<value::Literal>(2),
tree::base::make<value::Mul>(
tree::base::make<value::Literal>(3),
tree::base::make<value::Literal>(4)
)
);
RvalueEvaluator eval{};
ASSERT(expr->visit(eval) == 14);
If we wouldn’t have implemented visit_node()
, the
RvalueEvaluator eval{};
line would break, as RvalueEvaluator would
be an abstract class because of it. And while the above may look pretty
complete, it is in fact not:
ASSERT_RAISES(std::runtime_error, value::ErroneousValue().visit(eval));
↓
std::runtime_error exception message: unknown node type
The visitor pattern is more powerful than recursively calling functions
for other reasons as well, because they can be specialized through
inheritance, state can be maintained in the class as the tree is
traversed (the dump()
method is actually implemented this way
internally), it can be written inline within a function despite the
recursive nature, and so on. But it’s a lot more verbose than the
as_*()
alternative for simple cases. The choice between visitor
pattern and casts is therefore usually a matter of personal preference.
Complete file listings¶
value.tree¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | // Attach \file docstrings to the generated files for Doxygen. # Implementation for value tree classes. source # Header for value tree classes. header // Include tree base classes. include "tree-base.hpp" tree_namespace tree::base // Include primitive types. include "primitives.hpp" import primitives // Initialization function to use to construct default values for the tree base // classes and primitives. initialize_function primitives::initialize serdes_functions primitives::serialize primitives::deserialize // Source location annotation type. The generated dumper will see if this // annotation exists for a node, and if so, write it to the debug dump using // the << stream operator. location primitives::SourceLocation // Set the namespace for the generated classes and attach a docstring. # Namespace for value tree classes. namespace value # Variable instance. variable { # The name of the variable. name: primitives::Str; # The current value of the variable while interpreting. value: primitives::Int; } # Toplevel expression node. rvalue { # A literal integer. literal { # The value of the literal. value: primitives::Int; } # Unary operators. unop { # The operand. oper: One<rvalue>; # Negation. negate {} } # Binary operators. binop { # Left-hand side of the expression. lhs: One<rvalue>; # Right-hand side of the expression. rhs: One<rvalue>; # Addition. add {} # Subtraction. sub {} # Multiplication. mul {} # Division. div {} } # Toplevel node for assignable expressions. lvalue { # A variable reference. reference { # The name used to refer to the variable. name: primitives::Str; # The variable being referenced. // Note that we use a link here to allow multiple places in the // tree to refer to the same variable instance in our toy // interpreter, so we don't have to do a hashmap lookup each time // and a well-formed tree by definition cannot refer to variables // that don't exist. target: Link<variable>; } # An erroneous expression. erroneous_value { error; } } } |
program.tree¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 | // Attach \file docstrings to the generated files for Doxygen. # Implementation for program tree classes. source # Header for program tree classes. header // Include tree base classes. include "tree-base.hpp" tree_namespace tree::base // Include primitive types. include "primitives.hpp" import primitives // Include the generated value tree header. include "value.hpp" import value // Initialization function to use to construct default values for the tree base // classes and primitives. initialize_function primitives::initialize serdes_functions primitives::serialize primitives::deserialize // Source location annotation type. The generated dumper will see if this // annotation exists for a node, and if so, write it to the debug dump using // the << stream operator. location primitives::SourceLocation // Set the namespace for the generated classes and attach a docstring. # Namespace for program tree classes. namespace program # A program is a collection of statements. program { # Variable declarations for the program. variables: external Any<value::Variable>; # The statements in the program. statements: Any<statement>; } # A statement. statement { # Conditional statement. if_else { # Branch condition. cond: external One<value::Rvalue>; # If block. if_block: Any<statement>; # Else block. else_block: Any<statement>; } # For loop, from start to (non-inclusive) end. for_loop { # Variable to assign. var: external One<value::Lvalue>; # Start value. start: external One<value::Rvalue>; # Stop value. stop: external One<value::Rvalue>; # The repeated code. block: Any<statement>; } # Assignment statement. assignment { # Variable to assign. lhs: external One<value::Lvalue>; # Expression to assign it to. rhs: external One<value::Rvalue>; } # Print statement. print { # Value to print. expr: external One<value::Rvalue>; } # An erroneous statement. erroneous_statement { error; } } |
primitives.hpp¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 | /** \file * Defines primitives used in the generated directory tree structure. */ #pragma once #include <string> #include "tree-base.hpp" /** * Namespace with primitives used in the generated directory tree structure. */ namespace primitives { /** * Integer primitive. */ using Int = int; /** * Strings, used to represent filenames and file contents. */ using Str = std::string; /** * Initialization function. This must be specialized for any types used as * primitives in a tree that are actual C primitives (int, char, bool, etc), * as these are not initialized by the T() construct. */ template <class T> T initialize() { return T(); }; /** * Declare the default initializer for integers. It's declared inline to avoid * having to make a cpp file just for this. */ template <> inline Int initialize<Int>() { return 0; } /** * Serialization function. This must be specialized for any types used as * primitives in a tree. The default implementation doesn't do anything. */ template <typename T> void serialize(const T &obj, tree::cbor::MapWriter &map) { } /** * Serialization function for Int. */ template <> inline void serialize<Int>(const Int &obj, tree::cbor::MapWriter &map) { map.append_int("val", obj); } /** * Serialization function for Str. */ template <> inline void serialize<Str>(const Str &obj, tree::cbor::MapWriter &map) { map.append_string("val", obj); } /** * Deserialization function. This must be specialized for any types used as * primitives in a tree. The default implementation doesn't do anything. */ template <typename T> T deserialize(const tree::cbor::MapReader &map) { return initialize<T>(); } /** * Deserialization function for Int. */ template <> inline Int deserialize<Int>(const tree::cbor::MapReader &map) { return map.at("val").as_int(); } /** * Deserialization function for Str. */ template <> inline Str deserialize<Str>(const tree::cbor::MapReader &map) { return map.at("val").as_string(); } /** * Source location annotation object, containing source file line numbers etc. */ class SourceLocation: tree::annotatable::Serializable { public: std::string filename; int line; int column; SourceLocation( const std::string &filename, uint32_t line = 0, uint32_t column = 0 ) : filename(filename), line(line), column(column) { } /** * Serialization logic for SourceLocation. */ inline void serialize(tree::cbor::MapWriter &map) const override { map.append_string("filename", filename); map.append_int("line", line); map.append_int("column", column); } /** * Deserialization logic for SourceLocation. */ inline SourceLocation(const tree::cbor::MapReader &map) { filename = map.at("filename").as_string(); line = map.at("line").as_int(); column = map.at("column").as_int(); } }; /** * Stream << overload for source location objects. */ inline std::ostream& operator<<(std::ostream& os, const primitives::SourceLocation& object) { os << object.filename << ":" << object.line << ":" << object.column; return os; } } // namespace primitives |
CMakeLists.txt¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | cmake_minimum_required(VERSION 3.12 FATAL_ERROR) project(interpreter-example CXX) # Add the tree-gen repository root directory. Normally, your project would be # located outside of tree-gen's root, which means you can omit the second # argument. add_subdirectory(../.. tree-gen) # Generates the files for the value tree. generate_tree_py( "${CMAKE_CURRENT_SOURCE_DIR}/value.tree" "${CMAKE_CURRENT_BINARY_DIR}/value.hpp" "${CMAKE_CURRENT_BINARY_DIR}/value.cpp" "${CMAKE_CURRENT_BINARY_DIR}/value.py" ) # Generates the files for the program tree. generate_tree_py( "${CMAKE_CURRENT_SOURCE_DIR}/program.tree" "${CMAKE_CURRENT_BINARY_DIR}/program.hpp" "${CMAKE_CURRENT_BINARY_DIR}/program.cpp" "${CMAKE_CURRENT_BINARY_DIR}/program.py" ) add_executable( interpreter-example "${CMAKE_CURRENT_BINARY_DIR}/value.cpp" "${CMAKE_CURRENT_BINARY_DIR}/program.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/main.cpp" ) target_include_directories( interpreter-example # This directory for primitives.hpp: PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}" # Binary directory for directory.hpp: PRIVATE "${CMAKE_CURRENT_BINARY_DIR}" ) target_link_libraries(interpreter-example tree-lib) # The following lines only serve to register the example as a test, so you can # run it using `make test` or `ctest` as well. You don't need them as such in # your own project. enable_testing() add_test(interpreter-example interpreter-example) |
main.cpp¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 | #include <iostream> #include <cstdio> #include <stdexcept> #include "../utils.hpp" // Include the generated files. #include "value.hpp" #include "program.hpp" // Note: the // comment contents of main(), together with the MARKER lines and // the output of the program, are used to automatically turn this into a // restructured-text page for ReadTheDocs. int main() { // ******************* // Interpreter example // ******************* // // This example discusses some of the more advanced features of tree-gen, // using an interpreter for a very simple language as an example. This is, // however, not a completely integrated or even functional interpreter; we // just go over some of the concepts and things you may run into when you // would make such an interpreter. // // External nodes & multiple trees // =============================== // // When trees get large and complicated, you may want to split a tree up // into multiple files. This is supported within tree-gen, as long as there // are no circular references to nodes between the files. In this example, // we've split up the expression-like nodes into a tree separate from the // statement-like nodes. The former, ``value.tree``, has nothing special // going on in this regard, while the latter, ``program.tree``, effectively // includes the former. The files are listed at the bottom of this page // for your convenience. // // tree-gen itself is unaware of any links between files. For // ``program.tree``, it simply generates code assuming that the types // referred to through the ``external`` edges exist. For this to be true, // you have to tell tree-gen to include the external tree's generated // header file at the top of its generated header file using the ``include`` // directive. Note also that, as tree-gen is unaware of the external tree, // you'll have to use the full C++ TitleCase type name, rather than the // snake_case name for normal edges. // // Integration with Flex/Bison (or other parsers) // ============================================== // // When making an interpreter (or parser of any kind) with tree-gen, you'll // probably want to use a tokenizer/parser generator such as Flex/Bison. // With Bison at least, you'll quickly run into a fundamental issue when // trying to store tree-gen nodes in the generated ``yyval`` union: // union in C++ can only consist of types with trivial constructors, and // tree-gen nodes are not among those. The author is unfamiliar with // other frameworks like it, but this will probably apply for any C-style // parser generator. // // The solution is to use raw pointers to the generated node types in this // enumeration. For instance, you may end up with the following (which would // be generated by Bison): union node_type { program::Program *prog; program::Assignment *asgn; value::Literal *lit; value::Reference *ref; // ... }; MARKER // To create a new node, instead of using ``tree::base::make``, // you use the ``new`` keyword directly. node_type n1, n2; n1.lit = new value::Literal(2); n2.ref = new value::Reference("x"); MARKER // The generated constructors won't help you much for non-leaf nodes, // because they expect edge wrappers (One, Maybe, etc.) around the // parameters, so you'll typically want to use the constructor without // arguments. Instead, to assign child nodes, you can use // ``Maybe::set_raw()``, ``One::set_raw()``, ``Any::add_raw()``, // or ``Many::add_raw()``. These functions take ownership of a new-allocated // raw pointer to a node - exactly what we need here. Note that because they // take ownership, you don't have to (and shouldn't!) delete manually. node_type n3; n3.asgn = new program::Assignment(); n3.asgn->rhs.set_raw(n1.lit); n3.asgn->lhs.set_raw(n2.ref); node_type n4; n4.prog = new program::Program(); n4.prog->statements.add_raw(n3.asgn); MARKER // Bison will ultimately pass you the final parsed node, which you then have // to put into a ``One`` edge yourself: auto tree = tree::base::One<program::Program>(); tree.set_raw(n4.prog); tree->dump(); MARKER // Note that when you're using Bison, the enum and its types is largely // hidden from you. In the actions the above would look more like this: // // .. code-block:: none // // // Action for a literal: // { $$ = new value::Literal(std::stoi($1)); } // // // Action for a reference: // { $$ = new value::Reference($1); /* and deallocate $1, assuming it's a char* */ } // // // Action for a assignment: // { $$ = new value::Assignment(); $$->lhs.set_raw($1); $$->rhs.set_raw($3); } // // and so on. // // Syntax error recovery // --------------------- // // Some parser generators (like Bison) allow you to specify recovery rules // in case of a syntax error, so the parser doesn't just die immediately. // This is necessary to emit more than a single error message at a time. // To help you store recovery information in the tree for post-mortem // analysis, tree-gen has a special ``error`` directive that you can place // inside a node class. For example, if you make a recovery rule that // assumes any semicolon encountered after a syntax error produces a broken // statement, you can make a node like this: // // .. code-block:: none // // # An erroneous statement. // erroneous_statement { // error; // } // // Such nodes behave exactly like any other node, with one exception: they // always indicate that they're not well-formed. auto err_stmt = tree::base::make<program::ErroneousStatement>(); ASSERT_RAISES(tree::base::NotWellFormed, err_stmt.check_well_formed()); MARKER; // Line number information (and annotations in general) // ---------------------------------------------------- // // When you're doing any kind of serious parsing, you'll want to store as // much line number and other contextual information as possible, to make // your error messages as clear as possible. You could of course add this // information to every node in the tree specification file, perhaps using // inheritance to prevent excessive repetition, but this gets annoying fast, // especially when it comes to tree-gen's well-formedness rules. // // Instead, you should be using annotations for this. Annotations are // objects added to nodes without you ever having to declare that you're // going to add them. If that sounds like black magic to you in the context // of C++, well, that's because it is: internally annotations are stored // as a map from ``std::type_index`` to a C++11 backport of ``std::any``... // it's complicated. But you don't have to worry about it. What matters is // that each and every node generated by tree-gen (or anything else that // inherits from ``tree::annotatable::Annotatable``) can have zero or one // annotation of every C++ type attached to it. // // The use of annotations is not limited to metadata (although that is its // primary purpose); they also allow bits of code to temporarily attach // their own data to tree nodes, without the tree definitions needing to be // updated to reflect this. This is especially useful when you're using // tree-gen in a library, and you want to let users operate on your trees. // After all, they wouldn't be able to modify the tree definition file // without forking and recompiling your library. // // tree-gen has only one special case for annotations, intended for adding // line number information to debug dumps. The ``location`` directive is // used for this: // // .. code-block:: none // // // Source location annotation type. The generated dumper will see if this // // annotation exists for a node, and if so, write it to the debug dump using // // the << stream operator. // location primitives::SourceLocation // // Without going into too much detail about the Annotatable interface (just // look at the API docs for that), here's an example of how it would work. n1.lit->set_annotation(primitives::SourceLocation("test", 1, 5)); n2.ref->set_annotation(primitives::SourceLocation("test", 1, 1)); n3.asgn->set_annotation(primitives::SourceLocation("test", 1, 1)); n4.prog->set_annotation(primitives::SourceLocation("test", 1, 1)); tree->dump(); MARKER // Note that we can still use these pointers despite ownership having been // passed to the node objects because they refer to the same piece of // memory. In practice, though, you would do this during the parser actions, // just after constructing them. // // When serializing and deserializing, annotations are ignored by default; // they can be of any type whatsoever, and C++ can't dynamically introspect // which types can be (de)serialized and which can't be, after all. So even // though the example SourceLocation object extends // ``tree::annotatable::Serializable``, this doesn't work automagically. auto node = tree::base::make<value::Literal>(2); node->set_annotation(primitives::SourceLocation("test", 1, 1)); std::string cbor = tree::base::serialize(node); auto node2 = tree::base::deserialize<value::Literal>(cbor); ASSERT(!node2->has_annotation<primitives::SourceLocation>()); // However, you *can* register annotations types for serialization and // deserialization if you want to through the // ``tree::annotatable::serdes_registry`` singleton. After that, it will // work. tree::annotatable::serdes_registry.add<primitives::SourceLocation>("loc"); cbor = tree::base::serialize(node); node2 = tree::base::deserialize<value::Literal>(cbor); ASSERT(node2->has_annotation<primitives::SourceLocation>()); MARKER // Two ``add()`` methods are provided. The one used here assumes that the // type has an appropriate ``serialize()`` method and an appropriate // constructor for deserialization, the other allows you to specify them // manually using ``std::function``s. The name is optional but strongly // recommended; if not used, a C++-compiler-specific identifier will be used // for the type. // // Similar to links, annotations aren't copied as you might expect by // ``copy()`` and ``clone()``. Specifically, annotations are stored as // ``std::shared_ptr``s to unknown C++ objects, and therefore copying a // node only copies the references to the annotations. To fully clone // annotations, you'll either have to serialize and deserialize the node // they belong to (after registering with ``serdes_registry`` of course), or // clone them manually. // Visitor pattern // =============== // // In the directory example, we avoided the difficulty of dealing with edges // to incomplete types, such as a ``One<value::Rvalue>``, aside from a // single ``as_file()`` call halfway through. Spamming such typecasts to, // for instance, evaluate an rvalue expression during constant propagation // or interpreting, is not very scalable. It can also lead to headache down // the line if you ever have to add more subclasses, as it's easy to forget // or not bother to check for unknown types when initially writing that code // that way. tree-gen also generates the necessary classes for the visitor // pattern for you to avoid this. // // In the visitor pattern, you define a class that implements the // appropriate visitor interface, and (in this case) the nodes will call // methods on this class depending on their type when you pass the visitor // object to them through their ``visit()`` method. Because the visitor // interface classes are also generated by tree-gen and have sane default // behavior, your code will "by default" be future-proof, in the sense that // you'll get an error if something unexpected happens, rather than it // failing silently and maybe crashing down the line. // // Two visitor base classes are provided, ``Visitor<T=void>`` and // ``RecursiveVisitor``. In either case, you have to override // ``visit_node()``, which is called when a node is encountered for which // no specialization is available; this should just throw an appropriate // exception. You can then optionally override any node in the node class // hierarchy to provide the actual behavior. // // The two classes differ in the default behavior when an unspecialized // node is encountered: ``Visitor`` always defers up the class hierarchy, // while ``RecursiveVisitor`` provides a default implementation for node // types with edges, where it simply recursively visits the child nodes // depth-first. // // The T type in ``Visitor<T>`` refers to the return type for the visit // methods. It's fixed to ``void`` for ``RecursiveVisitor``, because if a // node has multiple children, there is nothing sensible to return. // // Let's look at what an expression evaluator looks like for this. class RvalueEvaluator : public value::Visitor<int> { public: int visit_node(value::Node &node) override { throw std::runtime_error("unknown node type"); } int visit_literal(value::Literal &node) override { return node.value; } int visit_negate(value::Negate &node) override { return -node.oper->visit(*this); } int visit_add(value::Add &node) override { return node.lhs->visit(*this) + node.rhs->visit(*this); } int visit_sub(value::Sub &node) override { return node.lhs->visit(*this) - node.rhs->visit(*this); } int visit_mul(value::Mul &node) override { return node.lhs->visit(*this) * node.rhs->visit(*this); } int visit_div(value::Div &node) override { return node.lhs->visit(*this) / node.rhs->visit(*this); } int visit_reference(value::Reference &node) override { return node.target->value; } }; // 2 + (3 * 4) = 14 auto expr = tree::base::make<value::Add>( tree::base::make<value::Literal>(2), tree::base::make<value::Mul>( tree::base::make<value::Literal>(3), tree::base::make<value::Literal>(4) ) ); RvalueEvaluator eval{}; ASSERT(expr->visit(eval) == 14); MARKER // If we wouldn't have implemented ``visit_node()``, the // ``RvalueEvaluator eval{};`` line would break, as RvalueEvaluator would // be an abstract class because of it. And while the above may look pretty // complete, it is in fact not: ASSERT_RAISES(std::runtime_error, value::ErroneousValue().visit(eval)); MARKER // The visitor pattern is more powerful than recursively calling functions // for other reasons as well, because they can be specialized through // inheritance, state can be maintained in the class as the tree is // traversed (the ``dump()`` method is actually implemented this way // internally), it can be written inline within a function despite the // recursive nature, and so on. But it's a lot more verbose than the // ``as_*()`` alternative for simple cases. The choice between visitor // pattern and casts is therefore usually a matter of personal preference. MARKER return 0; } |