Welcome to tree-gen’s documentation!¶
tree-gen is a program that generates C++ source code for tree structures based on a simple and concise input language. The API documentation is generated by Doxygen.
Overview¶
tree-gen is a code generator that outputs all the repetitive stuff needed for defining a proper tree structure in C++.
Trees are described by *.tree files. Such a file consists of a number of directives at the top and one or more tree node descriptions. Each tree node gets its own C++ class, with a bunch of methods defined on it for traversal, the visitor pattern, safe typecasting, cloning, and so on. It would be extremely tedious and error prone to define all this stuff manually, hence the generator.
It is important to realize and perhaps not immediately obvious that the recursive structure in a *.tree file represents C++ inheritance rather than the actual tree structure. You may for instance encounter something like this:
expression {
addition {
...
}
subtraction {
...
}
}
That doesn’t mean an expression consists of an addition and a subtraction, but rather that an addition is a type of expression, and subtraction is another.
Base classes and primitives¶
tree-gen trees consist of three kinds of objects: nodes, edges, and primitives.
The nodes are classes generated by tree-gen based on some base class. This is typically tree::base::Base. The class inheritance tree of the node types is used for “union” types, i.e. to allow some node in the tree to be one of a number of types. For example, some node may want to have an expression child node, which in practice would be for instance an addition node or a subtraction node. At any level of this class hierarchy, nodes can be given attributes to store data. These attributes are either edges or primitives.
Edges define the relation between some parent node and some number of child nodes. Six edge classes are defined, resulting in different relations between the nodes.
tree::base::Maybe<T>
: the parent will own zero or one child node of (super)type T.tree::base::One<T>
: the parent will logically own exactly one child node of (super)type T.tree::base::Any<T>
: the parent will own zero or more child nodes of (super)type T.tree::base::Many<T>
: the parent will logically own one or more child node of (super)type T.tree::base::OptLink<T>
: the parent links to zero or one child node of (super)type T elsewhere in the tree.tree::base::Link<T>
: the parent logically links to exactly one child node of (super)type T elsewhere in the tree.
The “logically” term used above alludes to the fact that One/Many/Link don’t necessarily need to have at least one child at all times. They may for instance be empty during construction. But at least one child node is required for the tree to be considered to be “complete”. Furthermore, completeness with respect to some root node requires that any Link or non-null OptLink refers to a node that is actually reachable from that root node.
Maybe and One are based on std::shared_ptr
The namespace for the Base and edge classes can be set using the
tree_namespace
directive in case you want to override them or add
functionality. Note that you can use “using One = tree::base::One;” etc. if
you only want to override part of the classes. You can also opt to use the
tree-all.[ch]pp.inc
files, which define the same structures that exist
in the tree
namespace, but allows you to override some of the standard
library types they are based on, for example to use a different base
exception type. If the tree_namespace directive is not specified, the
namespace that the node classes are generated into will be used by default.
Finally, primitives are used to store the actual data, forming the leaves of the tree. They can generally be any externally-defined C++ type, although a few template functions need to be defined and specialized for the types you want to use. These are:
template <typename T>
T initialize() { ... }
template <typename T>
void serialize(const T &obj, tree::cbor::MapWriter &map) { ... }
template <typename T>
T deserialize(const tree::cbor::MapReader &map) { ... }
The initialize function is used to construct a primitive of type T in such a
way that it has a defined value. This is in contrast to the behavior of C’s
own primitive types (int, bool, etc.), which are specified to be undefined
until assigned. This is avoided using this initialize function to prevent
confusion caused by undefined behavior. Typically, initialize() can be
implemented using just return T{};
for the general case to defer to the
constructor for complex types, and then be specialized for the primitive
types that you want to use.
The serialize and deserialize functions are optional. When given, logic is generated to serialize and deserialize entire trees to and from a CBOR representation.
In addition to the type-safe primitives declared in the tree description file, nodes can also be annotated using arbitrary types. Specifically, every node can be annotated with zero or one annotation of every C++ type. So, any struct or class you make can be attached to any node in the tree, regardless of the tree specification. This is useful, for instance, to carry metadata, such as line number or debugging information. When tree-gen is used to generate some user-facing tree structure in a library, users of the library can also use this to add their own data to a tree as they operate on it, to prevent them from having to convert to their own data structure first.
Note that the above requires RTTI to be enabled and supported by the compiler, but there’s really no good reason to disable that nowadays.
Defining node types¶
Nodes have the following form in the tree file:
# [documentation for node class]
[snake_case_node_name] {
# [documentation for primitive]
[snake_case_member_name]: [C++ namespace path];
# [documentation for edge/child node]
[snake_case_member_name]: [Maybe|One|Any|Many|OptLink|Link]<[snake_case_node_name]>;
# [documentation for external edge/child node]
[snake_case_member_name]: external [Maybe|One|Any|Many|OptLink|Link]<[C++ namespace path]>;
[zero or more specializations of this node; recursive structure]
[optional reorder directive]
}
Nodes can have zero or more primitives, zero or more edges/child nodes, zero or more external edges/child nodes, and zero or more specializations.
The difference between normal edges and external edges is that the type for a normal edge must resolve to another node in the same tree file, while an external edge may use any C++ type name for the node class. That allows you to refer to nodes defined in other tree files as well. Note however that the generator expects external nodes to actually be node classes, such that it can properly generate the debug dump indentation logic, among other things.
Note that while node type names are specified in snake_case in the tree file, the name will be converted to TitleCase for the generated class names to follow decent C++ naming conventions. This is just because it’s easier to convert snake case to title case than vice versa, and the generator uses both forms (title case for the class names, snake case for methods and members).
Note also the comments explicitly added to the example. Comments using a #
are interpreted as docstrings by the generator; they are copied into the
source code using javadoc-style comment blocks. This also means you can’t
use #
-“comments” everywhere, since they’re actually a grammatical
construct. Therefore, tree-gen also supports //
comments. These are
completely ignored by the parser.
The order in which nodes are defined doesn’t matter. This allows you to make recursive tree structures, without having to worry about forward declarations. The generator handles those. The order in which the attributes of a node are defined does matter, though, as it’s used for the order in which the values can be passed to the node constructor; they are ordered subclass to superclass, top to bottom.
The order of the primitives and child notes (a.k.a. fields) is used in a few places, primarily the constructor and for debug dump output generation. The order is as specified within each class hierarchy level, but the fields of the more-specialized classes come before the fields of their ancestors.
When additions are made to a tree, it may be required to maintain
compatibility with older tree versions by maintaining field order. In order
to support this, the above field order can be overridden using a reorder
directive. This directive has the following syntax:
# [optional documentation (ignored)]
reorder(a, b, c);
where a, b, c
is the desired order of the fields. Any fields not explicitly
specified in the list will automatically appear at the end, using the default
order.
Directives¶
The following directives exist. They should be placed at the top of the tree file, before the first node. Their order doesn’t matter, unless otherwise specified. Note that most directives are required.
source
: used to specify documentation for the generated source file; any docstring above the directive is copied into the file as file-level doxygen documentation.header
: used to specify documentation for the generated header file; any docstring above the directive is copied into the file as file-level doxygen documentation.python
: used to specify documentation for the generated Python file (if enabled via the command line); any docstring above the directive is copied into the Python file’s docstring.tree_namespace <namespace::path>
: the namespace that the Base and edge classes live in. You should set this totree::base
unless you intend to override or use different base classes. This is unused in Python.initialize_function <namespace::path::initialize>
: the name (including namespace path leading up to it) of theT()
function used for getting the default value of any of the used primitive classes. Unused in Python; here, the default constructor is used regardless of type (as it always exists and is meaningful in Python).serdes_functions <namespace::path::serialize> <namespace::path::deserialize>
: optionally, the names (including namespace paths leading up to them) of the functions used to respectively serialize and deserialize primitive classes. If not specified, serialization functionality is disabled in C++. Python always has serdes support, but support is augmented when a serdes function pair is specified. The Python generator interprets the namespace path as a module path. Refer to the serialization section for more info.location <namespace_path::SourceLocation>
: optionally, the name of the source location annotation class (including namespace path leading up to it). The debug dump will look for an annotation of this type on each node when doing a debug dump, and if it exists, uses it to add source information to the node, by streaming out a#
followed by the stream overload for the class. The base class needs to be capable of annotations if you use this. This is not supported in Python.include "<path>"
: adds an#include
statement to the top of the generated C++ header file.src_include "<path>"
: likeinclude
, but adds to the top of the generated C++ source file only.import ...
,from ... import ...
: adds Python import statements to the top of the generated module.namespace <name>
: used to specify the namespace for the generated tree classes. As in C++, you need multiple of these to specify the full path. The docstring for the first annotation of this type that has a docstring in front is used to document the innermost namespace javadoc-style for Doxygen documentation. This is not used in Python.
Generated APIs¶
The following methods are generated for each node class:
- a constructor with all members of the node in its signature as optional values, defaulting to the values returned by the initializer function.
bool is_complete() const
: returns whether the node its called on and any subtree rooted in that node is fully defined (no empty One, Many or Link, all array entries in Any/Many nonempty, and all Links and defined OptLinks reachable from the root node).%NodeType type() const
: returns the type of this node, using the also-generated %NodeType enumeration.One<Node> copy() const
: returns a shallow copy of this node.One<Node> clone() const
: returns a deep copy of this node.- A value-based equality operator. Note that this ignores equality of any annotations.
void visit(Visitor &visitor)
(C++ only): implements the visitor pattern using the also-generated abstract Visitor/RecursiveVisitor classes. See below.void dump(std::ostream &out=std::cout, int indent=0)
: does a debug dump of the node to the given stream with the given indentation level.SomeNodeType *as_some_node_type()
(C++ only): does the equivalent of adynamic_cast
to the given node type, returningthis
if the type is correct ornullptr
if not.
Note that the prototypes for the Python equivalents differ as appropriate.
An implicit node class simply named Node
is always generated, serving as
the base class for all other nodes. In C++, it is what derives from the
Base
class defined in the namespace specified using tree_namespace
. In
Python, the Node
class always derives directly from object
.
Tree traversal¶
Tree traversal is accomplished by starting at the root and working your way down. The attributes specified in the tree file appear as public members of the node classes. Maybe, One, OptLink, and Link can be dereferenced simply using the * or -> operators; Any and Many use the [] indexation operator. Note that these are all null- and range-checked, unlike the stdlib equivalents.
Tree traversal is complicated by the fact that you often don’t know exactly what the type of a node is. For example, an expression node may actually be a subtraction or addition. For this, tree-gen supports a few different patterns in C++:
Visitor pattern. You define a class inheriting from the generated
Visitor
orRecursiveVisitor
classes. These abstract classes provide or require you to implement functions for each node type. The appropriate one then gets called when you pass an instance of your visitor class to thevisit()
method on a node. You must always overridevisit_node()
, usually to throw a suitable exception in case an unexpected node is encountered. The default implementation for all the other node types differs between the two classes. ForVisitor
, the default implementation falls back to the next more generically typed function (visit_subtraction()
falls back tovisit_expression()
,visit_expression()
falls back tovisit_node()
). ForRecursiveVisitor
, the default implementation for non-leaf node types recursively callsvisit()
for all child nodes, thus recursively traversing the tree. For leaf nodes, both visitor classes have the same behavior.Using the `as_()` methods.* Given for instance an expression node, you might do
if (auto addition = expression.as_addition()) { ... } else if (auto subtraction = expression.as_subtraction()) { ... } else { ... }
Because a
nullptr
evaluates to false in C++, the blocks will only be executed if the cast succeeds. Be careful copypasting though; you can accidentally use theaddition
variable in thesubtraction
block if you’d want, but it’s obviously null in that case. C++ scoping is weird.Using a switch statement. You might do
switch (expression.type()) { case NodeType::Addition: ... case NodeType::Subtraction: ... }
This doesn’t handle the cast for you, but in cases where you only need to switch based on the type and don’t need access to members of the nodes this is more descriptive than the if/else form.
Just choose the method that makes the most sense within context. Python is
so much more dynamic in general that you’re better off using its duck typing
functionality or using isinstance()
directly, so no special features are
generated for this.
Note that tree-gen trees do not contain allow traversal back toward the root of a tree. Supporting this would greatly complicate the internals and the user-facing API, because then you wouldn’t be able to just move nodes around. It is therefore key that you design your trees and interfaces such that this information is not needed. If this is somehow impossible, you’ll have to manage the links back up the tree manually using (Opt)Link edges.
Serialization and deserialization¶
Optionally, logic to serialize and deserialize trees can be generated in addition to the APIs above. To enable this, you have to tell tree-gen how to serialize and deserialize the primitive types that you use in the tree through two templated functions that must be specialized for all primitive types (similar and in addition to the initialization function):
template <typename T>
void serialize(const T &obj, tree::cbor::MapWriter &map) { ... }
template <typename T>
T deserialize(const tree::cbor::MapReader &map) { ... }
The former must serialize obj
by calling the various append_*()
functions
on the given tree::cbor::MapWriter
object. The latter must perform the
reverse operation. The namespace(s) and names of these functions must be
provided to tree-gen using the optional serdes_functions
directive.
Specifying this directive enables the serialization and deserialization
logic.
Once enabled, the entry point for serializing a tree is tree::base::serialize() or tree::base::serialize_file(), and deserializing is tree::base::deserialize() or tree::base::deserialize_file(). The internal serialization and deserialization functions in the edge classes and generated node classes are public and could also be used, but these require more boilerplate due to link handling.
If you attempt to use these methods for a tree without
serialization/deserialization support enabled, you’ll receive a template
error. Template errors are notoriously difficult to understand, so you have
been warned. Don’t use them unless you enable support through the
serdes_functions
directive. Attempting to serialize a tree that is not
well-formed will lead to a tree::base::NotWellFormed exception.
Serialization format¶
The serialization format makes use of the RFC7049 CBOR data representation, serving as a compromise between the readability of JSON and speed/simplicity of the serialization and deserialization logic. Specifically, CBOR can be losslessly converted to and from JSON using third-party tools if need be for debugging (at least for as far as it’s used here), but is itself a simple binary format.
Each edge and each primitive receives its own map in the tree. Note that Any and Many are internally represented as a vector of One edges; therefore, Any/Many results in two nested edge objects. The data for nodes and their annotations is stored along with the One/Maybe edges. The keys for the maps are the following:
empty Maybe:
{
"@T": "?",
"@t": null
}
filled Maybe:
{
"@T": "?",
"@i": <sequence number>,
"@t": "<TitleCase node type name>",
"<snake_case_attribute_name>": { <attribute data> },
...
"{<annotation type>}": { <annotation data> },
...
}
One:
{
"@T": "1",
"@i": <sequence number>,
"@t": "<TitleCase node type name>",
"<snake_case_attribute_name>": { <attribute data> },
...
"{<annotation type>}": { <annotation data> },
...
}
Any:
{
"@T": "*",
"@d": [
<`One` object for each item>
]
}
Many:
{
"@T": "+",
"@d": [
<`One` object for each item>
]
}
empty OptLink:
{
"@T": "@",
"@l": null
}
filled OptLink:
{
"@T": "@",
"@l": <sequence number of linked node>
}
Link:
{
"@T": "$",
"@l": <sequence number of linked node>
}
The @T
entry stores which type of edge an object is. This is not strictly
needed when deserializing as this information should be known by context;
instead, it is used as a validity check. The @i
sequence numbers are unique
integers for all One/Maybe edges in the tree, which are used to recover the
(Opt)Link pointers through their @l
key. The @t
key is used to recover
subtype information, as the tree only knows the abstract supertype of a node
in general. These are, however, not necessarily globally unique; they only
need to distinguish between subclasses, and therefore just the TitleCase name
of the node type without C++ namespace is sufficient.
Keys that start with a {
and close with a }
are used for annotations.
The string enclosed within the {}
in the key is used to store the
annotation type. The identifier used can either be generated automatically by
C++ using typeid(T).name()
, or can be specified manually to have more
control. Any annotation type that doesn’t have a serialization and
deserialization function registered for it within
tree::annotatable::serdes_registry is silently ignored in either direction.
All remaining keys map to the node attributes using their snake_case name.
The corresponding value is (recursively) one of the structures above for
edges, or a map containing user-specified key/value pairs for primitives,
serialized and deserialized using the functions specified by the
serdes_functions
directive.
The names of the keys are chosen such that, when ordered by ASCII value, the order is as specified, and such that there can never be name conflicts. The commonly used keys and values are short to minimize serialization and deserialization overhead.
Python support¶
In addition to C++, tree-gen can also generate pure-Python objects to represent the tree. As in C++, each node type becomes a class, and class hierarchy is used similarly. The edge classes don’t exist, however; instead, their functionality is baked into the generated code to prevent an unnecessary user-facing level of indirection during tree traversal that cannot be hidden in Python. Despite this, the generated structures provide a similar level of type safety as the C++ structures do; all writes to fields of node classes are type-checked.
Where applicable, the generated Python file assumes that the Python module tree corresponds with the namespace hierarchy in the C++ world. This applies to the following things:
- primitive type declarations;
- external node types;
- (de)serialization functions.
The initialization function isn’t used in Python; rather, the default constructor of the primitive types is used.
The connection between the C++ and Python world is handled through CBOR serialization and deserialization rather than providing Python wrappers around the C++ objects. While this approach isn’t as performant as the alternative, it allows a much simpler and more Pythonic interface to be exposed to the user. Serdes is therefore perhaps the most important feature of the Python generator. Note however that it is entirely up to the program or library using tree-gen to provide the requisite interface between the C++ and Python worlds (for example through swig); tree-gen just ensures that the Python and C++ tree implementations always remain in sync.
The serialization and deserialization functions naturally have a different signature in Python than they do in C++. More specifically, the functions must look like this:
def serialize(typ, val):
# typ is either a type (for primitives) or a str (for annotations). In
# the former case, isinstance(val, typ) may be assumed. Return value must
# be an object consisting of only the following types:
# - `dict`s with `str` keys;
# - `list`s;
# - `int`s between -2^63 and 2^63-1 inclusive;
# - `float`s;
# - `True`, `False`, or `None`;
# - `str`s;
# - `bytes` objects.
# The toplevel object must be a dict.
pass # TODO
def deserialize(typ, val):
# typ is either a type (for primitives) or a str (for annotations). val
# is a structure representing the CBOR serialization using the same
# Python primitives that can be returned by serialize(). The return value
# must be an instance of the specified `typ`e for primitives, but can be
# anything for annotations; for unknown annotation types, just returning
# val as-is is the recommended fallback behavior.
pass # TODO
Primitive serialization and deserialization can also be done using class methods:
class SomePrimitive:
def serialize_cbor(self):
# Called instead of serialize(SomePrimitive, self) if it exists.
pass # TODO
@staticmethod
def deserialize_cbor(cbor):
# Called instead of deserialize(SomePrimitive, cbor) if it exists.
pass # TODO
If no serdes_function
directive is specified, serdes logic is still
generated on the Python end; in this case, only the class method variant is
used, and annotations must be directly serializable to CBOR for them to be
serialized (they will be silently ignored if they’re not).
Directory tree example¶
This example illustrates the tree system with a Windows-like directory
tree structure. The System
root node consists of one or more drives,
each of which has a drive letter and a root directory with named files
and subdirectories. To make things a little more interesting, a
symlink-like “mount” is added as a file type, which links to another
directory.
Using this tree, this example should teach you the basics of using
tree-gen trees. The tutorial runs you through the C++ code of
main.cpp
and Python code of main.py
, but be sure to check out
directory.tree
, primitives.hpp
, primitives.hpp
, and (if you
want to reproduce it in your own project) CMakeLists.txt
copied to
the bottom of this page as well. You will also find the complete
main.cpp
and main.py
there.
Let’s start by making the root node using tree::base::make()
. This
works somewhat like std::make_shared()
, but instead of returning a
shared_ptr
it returns a One
edge. This might come off as a bit
odd, considering trees in graph theory start with a node rather than an
edge. This is just a choice, though; a side effect of how the internal
``shared_ptr``s work and how Link/OptLink edges work (if you’d store the
tree as the root structure directly, the root node wouldn’t always be
stored in the same place on the heap, breaking link nodes).
auto system = tree::base::make<directory::System>();
At all times, you can use the dump()
method on a node to see what it
looks like for debugging purposes. It looks like this:
system->dump();
↓
System(
drives: !MISSING
)
While this system node exists as a tree in memory and tree-gen seems happy, our system tree is at this time not “well-formed”. A tree node (or an edge to one) is only considered well-formed if all of the following are true:
- All
One
edges in the tree connect to exactly one node.- All
Many
edges in the tree connect to at least one node.- All
Link
and non-emptyOptLink
nodes refer to a node reachable from the root node.- Each node in the tree is only used by a non-link node once.
Currently, the second requirement is not met.
ASSERT(!system.is_well_formed());
You can get slightly more information using check_well_formed()
;
instead or returning a boolean, it will throw a NotWellFormed
exception with an error message.
ASSERT_RAISES(tree::base::NotWellFormed, system.check_well_formed());
↓
tree::base::NotWellFormed exception message: 'Many' edge of type N9directory5DriveE is empty
Note that the name of the type is “mangled”. The exact output will vary from compiler to compiler, or even from project to project. But hopefully it’ll be readable enough to make sense of in general.
To fix that, let’s add a default drive node to the tree. This should get
drive letter A
, because primitives::initialize() is specialized to
return that for ``Letter``s (see primitives.hpp).
system->drives.add(tree::base::make<directory::Drive>());
Drive
has a One
edge that is no empty, though, so the tree still
isn’t well-formed. Let’s add one of those as well.
system->drives[0]->root_dir = tree::base::make<directory::Directory>();
Now we have a well-formed tree. Let’s have a look:
system.check_well_formed();
system->dump();
↓
System(
drives: [
Drive(
letter: A
root_dir: <
Directory(
entries: []
name:
)
>
)
]
)
We can just change the drive letter by assignment, as you would expect.
system->drives[0]->letter = 'C';
Before we add files and directories, let’s make a shorthand variable for the root directory. Because root_dir is an edge to another node rather than the node itself, and thus acts like a pointer or reference to it, we can just copy it into a variable and modify the variable to update the tree. Note that you’d normally just write “auto” for the type for brevity; the type is shown here to make explicit what it turns into.
directory::One<directory::Directory> root = system->drives[0]->root_dir;
Let’s make a “Program Files” subdirectory and add it to the root.
auto programs = tree::base::make<directory::Directory>(
tree::base::Any<directory::Entry>{},
"Program Files");
root->entries.add(programs);
system.check_well_formed();
That’s quite verbose. But in most cases it can be written way shorter. Here’s the same with the less versatile but also less verbose emplace() method (which avoids the tree::make() call, but doesn’t allow an insertion index to be specified) and with a “using namespace” for the tree. emplace() can also be chained, allowing multiple files and directories to be added at once in this case.
{
using namespace directory;
root->entries.emplace<Directory>(Any<Entry>{}, "Windows")
.emplace<Directory>(Any<Entry>{}, "Users")
.emplace<File>("lots of hibernation data", "hiberfil.sys")
.emplace<File>("lots of page file data", "pagefile.sys")
.emplace<File>("lots of swap data", "swapfile.sys");
}
system.check_well_formed();
In order to look for a file in a directory, you’ll have to make your own function to iterate over them. After all, tree-gen doesn’t know that the name field is a key; it has no concept of a key-value store. This is simple enough to make, but to prevent this example from getting out of hand we’ll just use indices for now.
Let’s try to read the “lots of data” string from pagefile.sys.
ASSERT(root->entries[4]->name == "pagefile.sys");
// ASSERT(root->entries[4]->contents == "lots of page file data");
// '-> No member named 'contents' in 'directory::Entry'
Hm, that didn’t work so well, because C++ doesn’t know that entry 4 happens to be a file rather than a directory or a mount. We have to tell it to cast to a file first (which throws a std::bad_cast if it’s not actually a file). The easiest way to do that is like this:
ASSERT(root->entries[4]->as_file()->contents == "lots of page file data");
No verbose casts required; tree-gen will make member functions for all the possible subtypes.
While it’s possible to put the same node in a tree twice (without using a link), this is not allowed. This isn’t checked until a well-formedness check is performed, however (and in fact can’t be without having access to the root node).
root->entries.add(root->entries[0]);
ASSERT_RAISES(tree::base::NotWellFormed, system.check_well_formed());
↓
tree::base::NotWellFormed exception message: Duplicate node of type N9directory5EntryEat address 0x56462575c440 found in tree
Note that we can index nodes Python-style with negative integers for add() and remove(), so remove(-1) will remove the broken node we just added. Note that the -1 is not actually necessary, though, as it is the default.
root->entries.remove(-1);
system.check_well_formed();
We can, of course, add copies of nodes. That’s what copy (shallow) and clone (deep) are for. Usually you’ll want a deep copy, but in this case shallow is fine, because a File has no child nodes. Note that, even for a deep copy, links are not modified; it is intended to copy a subtree to another part of the same tree. To make a complete copy of a tree that maintains link semantics (i.e. does not link back to the original tree) the best way would be to go through serialize/deserialize.
root->entries.add(root->entries[0]->copy());
system.check_well_formed();
Note that the generated classes don’t care that there are now two directories named Program Files in the root. As far as they’re concerned, they’re two different directories with the same name. Let’s remove it again, though.
root->entries.remove();
Something we haven’t looked at yet are links. Links are edges in the tree that, well, turn it into something that isn’t strictly a tree anymore. While One/Maybe/Any/Many require that nodes are unique, Link/OptLink require that they are not unique, and are present elsewhere in the tree. Let’s make a new directory, and mount it in the Users directory.
auto user_dir = tree::base::make<directory::Directory>(
tree::base::Any<directory::Entry>{},
"");
root->entries.emplace<directory::Mount>(user_dir, "SomeUser");
Note that user_dir is not yet part of the tree. emplace works simply because it doesn’t check whether the directory is in the tree yet. But the tree is no longer well-formed now.
ASSERT_RAISES(tree::base::NotWellFormed, system.check_well_formed());
↓
tree::base::NotWellFormed exception message: Link to node of type N9directory9DirectoryE at address 0x56462575ccc0 not found in tree
To make it work again, we can add it as a root directory to a second drive.
system->drives.emplace<directory::Drive>('D', user_dir);
system.check_well_formed();
A good way to confuse a filesystem is to make loops. tree-gen is okay with this, though.
system->drives[1]->root_dir->entries.emplace<directory::Mount>(root, "evil link to C:");
system.check_well_formed();
The only place where it matters is in the dump function, which only goes one level deep. After that, it’ll just print an ellipsis.
system->dump();
↓
System(
drives: [
Drive(
letter: C
root_dir: <
Directory(
entries: [
Directory(
entries: []
name: Program Files
)
Directory(
entries: []
name: Windows
)
Directory(
entries: []
name: Users
)
File(
contents: lots of hibernation data
name: hiberfil.sys
)
File(
contents: lots of page file data
name: pagefile.sys
)
File(
contents: lots of swap data
name: swapfile.sys
)
Mount(
target --> <
Directory(
entries: [
Mount(
target --> <
...
>
name: evil link to C:
)
]
name:
)
>
name: SomeUser
)
]
name:
)
>
)
Drive(
letter: D
root_dir: <
Directory(
entries: [
Mount(
target --> <
Directory(
entries: [
Directory(
entries: []
name: Program Files
)
Directory(
entries: []
name: Windows
)
Directory(
entries: []
name: Users
)
File(
contents: lots of hibernation data
name: hiberfil.sys
)
File(
contents: lots of page file data
name: pagefile.sys
)
File(
contents: lots of swap data
name: swapfile.sys
)
Mount(
target --> <
...
>
name: SomeUser
)
]
name:
)
>
name: evil link to C:
)
]
name:
)
>
)
]
)
Now that we have a nice tree, let’s try the serialization and deserialization functionality. Serializing is easy:
std::string cbor = tree::base::serialize(system);
Let’s write it to a file; we’ll load this in Python later.
{
std::ofstream cbor_output;
cbor_output.open("tree.cbor", std::ios::out | std::ios::trunc | std::ios::binary);
cbor_output << cbor;
}
Let’s also have a look at a hexdump of that.
int count = 0;
for (char c : cbor) {
if (count == 16) {
std::printf("\n");
count = 0;
} else if (count > 0 && count % 4 == 0) {
std::printf(" ");
}
std::printf("%02X ", (uint8_t)c);
count++;
}
std::printf("\n");
↓
BF 62 40 54 61 3F 62 40 69 00 62 40 74 66 53 79
73 74 65 6D 66 64 72 69 76 65 73 BF 62 40 54 61
2B 62 40 64 9F BF 62 40 54 61 31 62 40 69 01 62
40 74 65 44 72 69 76 65 66 6C 65 74 74 65 72 BF
63 76 61 6C 18 43 FF 68 72 6F 6F 74 5F 64 69 72
BF 62 40 54 61 31 62 40 69 02 62 40 74 69 44 69
72 65 63 74 6F 72 79 67 65 6E 74 72 69 65 73 BF
62 40 54 61 2A 62 40 64 9F BF 62 40 54 61 31 62
40 69 03 62 40 74 69 44 69 72 65 63 74 6F 72 79
67 65 6E 74 72 69 65 73 BF 62 40 54 61 2A 62 40
64 9F FF FF 64 6E 61 6D 65 BF 63 76 61 6C 6D 50
72 6F 67 72 61 6D 20 46 69 6C 65 73 FF FF BF 62
40 54 61 31 62 40 69 04 62 40 74 69 44 69 72 65
63 74 6F 72 79 67 65 6E 74 72 69 65 73 BF 62 40
54 61 2A 62 40 64 9F FF FF 64 6E 61 6D 65 BF 63
76 61 6C 67 57 69 6E 64 6F 77 73 FF FF BF 62 40
54 61 31 62 40 69 05 62 40 74 69 44 69 72 65 63
74 6F 72 79 67 65 6E 74 72 69 65 73 BF 62 40 54
61 2A 62 40 64 9F FF FF 64 6E 61 6D 65 BF 63 76
61 6C 65 55 73 65 72 73 FF FF BF 62 40 54 61 31
62 40 69 06 62 40 74 64 46 69 6C 65 68 63 6F 6E
74 65 6E 74 73 BF 63 76 61 6C 78 18 6C 6F 74 73
20 6F 66 20 68 69 62 65 72 6E 61 74 69 6F 6E 20
64 61 74 61 FF 64 6E 61 6D 65 BF 63 76 61 6C 6C
68 69 62 65 72 66 69 6C 2E 73 79 73 FF FF BF 62
40 54 61 31 62 40 69 07 62 40 74 64 46 69 6C 65
68 63 6F 6E 74 65 6E 74 73 BF 63 76 61 6C 76 6C
6F 74 73 20 6F 66 20 70 61 67 65 20 66 69 6C 65
20 64 61 74 61 FF 64 6E 61 6D 65 BF 63 76 61 6C
6C 70 61 67 65 66 69 6C 65 2E 73 79 73 FF FF BF
62 40 54 61 31 62 40 69 08 62 40 74 64 46 69 6C
65 68 63 6F 6E 74 65 6E 74 73 BF 63 76 61 6C 71
6C 6F 74 73 20 6F 66 20 73 77 61 70 20 64 61 74
61 FF 64 6E 61 6D 65 BF 63 76 61 6C 6C 73 77 61
70 66 69 6C 65 2E 73 79 73 FF FF BF 62 40 54 61
31 62 40 69 09 62 40 74 65 4D 6F 75 6E 74 66 74
61 72 67 65 74 BF 62 40 54 61 24 62 40 6C 0B FF
64 6E 61 6D 65 BF 63 76 61 6C 68 53 6F 6D 65 55
73 65 72 FF FF FF FF 64 6E 61 6D 65 BF 63 76 61
6C 60 FF FF FF BF 62 40 54 61 31 62 40 69 0A 62
40 74 65 44 72 69 76 65 66 6C 65 74 74 65 72 BF
63 76 61 6C 18 44 FF 68 72 6F 6F 74 5F 64 69 72
BF 62 40 54 61 31 62 40 69 0B 62 40 74 69 44 69
72 65 63 74 6F 72 79 67 65 6E 74 72 69 65 73 BF
62 40 54 61 2A 62 40 64 9F BF 62 40 54 61 31 62
40 69 0C 62 40 74 65 4D 6F 75 6E 74 66 74 61 72
67 65 74 BF 62 40 54 61 24 62 40 6C 02 FF 64 6E
61 6D 65 BF 63 76 61 6C 6F 65 76 69 6C 20 6C 69
6E 6B 20 74 6F 20 43 3A FF FF FF FF 64 6E 61 6D
65 BF 63 76 61 6C 60 FF FF FF FF FF FF
You can pull that through http://cbor.me and https://jsonformatter.org to inspect the output, if you like.
We can deserialize it into a complete copy of the tree.
auto system2 = tree::base::deserialize<directory::System>(cbor);
system2->dump();
↓
System(
drives: [
Drive(
letter: C
root_dir: <
Directory(
entries: [
Directory(
entries: []
name: Program Files
)
Directory(
entries: []
name: Windows
)
Directory(
entries: []
name: Users
)
File(
contents: lots of hibernation data
name: hiberfil.sys
)
File(
contents: lots of page file data
name: pagefile.sys
)
File(
contents: lots of swap data
name: swapfile.sys
)
Mount(
target --> <
Directory(
entries: [
Mount(
target --> <
...
>
name: evil link to C:
)
]
name:
)
>
name: SomeUser
)
]
name:
)
>
)
Drive(
letter: D
root_dir: <
Directory(
entries: [
Mount(
target --> <
Directory(
entries: [
Directory(
entries: []
name: Program Files
)
Directory(
entries: []
name: Windows
)
Directory(
entries: []
name: Users
)
File(
contents: lots of hibernation data
name: hiberfil.sys
)
File(
contents: lots of page file data
name: pagefile.sys
)
File(
contents: lots of swap data
name: swapfile.sys
)
Mount(
target --> <
...
>
name: SomeUser
)
]
name:
)
>
name: evil link to C:
)
]
name:
)
>
)
]
)
Note that equality for two link edges is satisfied only if they point to the exact same node. That’s not the case for the links in our two entirely separate trees, so the two trees register as unequal.
ASSERT(!system2.equals(system));
To be sure no data was lost, we’ll have to check the CBOR and debug dumps instead.
ASSERT(tree::base::serialize(system) == tree::base::serialize(system2));
std::ostringstream ss1{};
system->dump(ss1);
std::ostringstream ss2{};
system2->dump(ss2);
ASSERT(ss1.str() == ss2.str());
Python usage¶
Let us now turn our attention to tree-gen’s Python support. As you may remember from the introduction, tree-gen does not provide a direct interface between Python and C++ using a tool like SWIG; rather, it provides conversion between a CBOR tree representation and the in-memory representations of both languages. It then becomes trivial for users of the tree-gen structure to provide an API to users that converts from one to the other, since only a single string of bytes has to be transferred.
Recall that we wrote the CBOR representation of the tree we constructed in the C++ example to a file. Let’s try loading this file with Python.
from directory import *
with open(os.path.join(TEST_DIR, 'tree.cbor'), 'rb') as f:
tree = System.deserialize(f.read())
tree.check_well_formed()
print(tree)
↓
System(
drives: [
Drive(
letter: C
root_dir: <
Directory(
entries: [
Directory(
entries: -
name: Program Files
)
Directory(
entries: -
name: Windows
)
Directory(
entries: -
name: Users
)
File(
contents: lots of hibernation data
name: hiberfil.sys
)
File(
contents: lots of page file data
name: pagefile.sys
)
File(
contents: lots of swap data
name: swapfile.sys
)
Mount(
target --> <
Directory(
entries: [
Mount(
target --> <
...
>
name: evil link to C:
)
]
name:
)
>
name: SomeUser
)
]
name:
)
>
)
Drive(
letter: D
root_dir: <
Directory(
entries: [
Mount(
target --> <
Directory(
entries: [
Directory(
entries: -
name: Program Files
)
Directory(
entries: -
name: Windows
)
Directory(
entries: -
name: Users
)
File(
contents: lots of hibernation data
name: hiberfil.sys
)
File(
contents: lots of page file data
name: pagefile.sys
)
File(
contents: lots of swap data
name: swapfile.sys
)
Mount(
target --> <
...
>
name: SomeUser
)
]
name:
)
>
name: evil link to C:
)
]
name:
)
>
)
]
)
And there we go; the same structure in pure Python.
As you might expect from a pure-Python structure, field access is much less verbose than it is in C++.
print(tree.drives[0].letter)
print(tree.drives[0].root_dir.entries[0].name)
↓
C
Program Files
Perhaps contrary to Python intuition however, the structure provides type-safety. You can’t assign values to fields that don’t exist…
tree.color_of_a_banana = 'yellow'
↓
Traceback (most recent call last):
File "main.py", line 48, in <module>
tree.color_of_a_banana = 'yellow'
AttributeError: 'System' object has no attribute 'color_of_a_banana'
… or assign values of incorrect types to ones that do:
tree.drives[0] = 'not-a-drive'
↓
Traceback (most recent call last):
File "main.py", line 55, in <module>
tree.drives[0] = 'not-a-drive'
File "/home/docs/checkouts/readthedocs.org/user_builds/tree-gen/checkouts/stable/cbuild/examples/directory/directory.py", line 490, in __setitem__
.format(val, idx, self._T))
TypeError: object 'not-a-drive' is not an instance of 0
Note that the setters will however try to “typecast” objects they don’t know about (to be more specific, anything that isn’t an instance of Node). So if we try to assign a string directly to tree.drives, you get something you might not have expected:
tree.drives = 'not-a-set-of-drives'
↓
Traceback (most recent call last):
File "main.py", line 65, in <module>
tree.drives = 'not-a-set-of-drives'
File "/home/docs/checkouts/readthedocs.org/user_builds/tree-gen/checkouts/stable/cbuild/examples/directory/directory.py", line 1576, in drives
val = MultiDrive(val)
File "/home/docs/checkouts/readthedocs.org/user_builds/tree-gen/checkouts/stable/cbuild/examples/directory/directory.py", line 472, in __init__
.format(val, idx, self._T))
TypeError: object 'n' at index 0 is not an instance of <class 'directory.Drive'>
The setter for the drives property is trying to cast the string to a MultiDrive, which interprets its string input as any other iterable, and ultimately tries to cast the first letter to a Drive. There’s not much that can be done about this; it’s just how duck typing works.
Manipulation of trees in Python works just as you might expect it to in Python:
tree.drives.append(Drive('E'))
tree.drives[-1].root_dir = Directory([
File('test', 'test-contents'),
File('test2', 'test2-contents'),
])
del tree.drives[1]
assert not tree.is_well_formed()
tree.drives[0].root_dir.entries[-1].target = tree.drives[1].root_dir
tree.check_well_formed()
print(tree)
↓
System(
drives: [
Drive(
letter: C
root_dir: <
Directory(
entries: [
Directory(
entries: -
name: Program Files
)
Directory(
entries: -
name: Windows
)
Directory(
entries: -
name: Users
)
File(
contents: lots of hibernation data
name: hiberfil.sys
)
File(
contents: lots of page file data
name: pagefile.sys
)
File(
contents: lots of swap data
name: swapfile.sys
)
Mount(
target --> <
Directory(
entries: [
File(
contents: test
name: test-contents
)
File(
contents: test2
name: test2-contents
)
]
name:
)
>
name: SomeUser
)
]
name:
)
>
)
Drive(
letter: E
root_dir: <
Directory(
entries: [
File(
contents: test
name: test-contents
)
File(
contents: test2
name: test2-contents
)
]
name:
)
>
)
]
)
Once the Python code is done manipulating the structure, it can be serialized again.
cbor = tree.serialize()
assert cbor == System.deserialize(cbor).serialize()
count = 0
for c in cbor:
if count == 16:
print()
count = 0
elif count > 0 and count % 4 == 0:
print(' ', end='')
print('%02X ' % c, end='')
count += 1
print()
↓
A3 62 40 69 00 62 40 74 66 53 79 73 74 65 6D 66
64 72 69 76 65 73 A2 62 40 54 61 2B 62 40 64 82
A5 62 40 54 61 31 62 40 69 01 62 40 74 65 44 72
69 76 65 66 6C 65 74 74 65 72 A1 63 76 61 6C 18
43 68 72 6F 6F 74 5F 64 69 72 A5 62 40 54 61 31
62 40 69 02 62 40 74 69 44 69 72 65 63 74 6F 72
79 67 65 6E 74 72 69 65 73 A2 62 40 54 61 2A 62
40 64 87 A5 62 40 54 61 31 62 40 69 03 62 40 74
69 44 69 72 65 63 74 6F 72 79 67 65 6E 74 72 69
65 73 A2 62 40 54 61 2A 62 40 64 80 64 6E 61 6D
65 A1 63 76 61 6C 6D 50 72 6F 67 72 61 6D 20 46
69 6C 65 73 A5 62 40 54 61 31 62 40 69 04 62 40
74 69 44 69 72 65 63 74 6F 72 79 67 65 6E 74 72
69 65 73 A2 62 40 54 61 2A 62 40 64 80 64 6E 61
6D 65 A1 63 76 61 6C 67 57 69 6E 64 6F 77 73 A5
62 40 54 61 31 62 40 69 05 62 40 74 69 44 69 72
65 63 74 6F 72 79 67 65 6E 74 72 69 65 73 A2 62
40 54 61 2A 62 40 64 80 64 6E 61 6D 65 A1 63 76
61 6C 65 55 73 65 72 73 A5 62 40 54 61 31 62 40
69 06 62 40 74 64 46 69 6C 65 68 63 6F 6E 74 65
6E 74 73 A1 63 76 61 6C 78 18 6C 6F 74 73 20 6F
66 20 68 69 62 65 72 6E 61 74 69 6F 6E 20 64 61
74 61 64 6E 61 6D 65 A1 63 76 61 6C 6C 68 69 62
65 72 66 69 6C 2E 73 79 73 A5 62 40 54 61 31 62
40 69 07 62 40 74 64 46 69 6C 65 68 63 6F 6E 74
65 6E 74 73 A1 63 76 61 6C 76 6C 6F 74 73 20 6F
66 20 70 61 67 65 20 66 69 6C 65 20 64 61 74 61
64 6E 61 6D 65 A1 63 76 61 6C 6C 70 61 67 65 66
69 6C 65 2E 73 79 73 A5 62 40 54 61 31 62 40 69
08 62 40 74 64 46 69 6C 65 68 63 6F 6E 74 65 6E
74 73 A1 63 76 61 6C 71 6C 6F 74 73 20 6F 66 20
73 77 61 70 20 64 61 74 61 64 6E 61 6D 65 A1 63
76 61 6C 6C 73 77 61 70 66 69 6C 65 2E 73 79 73
A5 62 40 54 61 31 62 40 69 09 62 40 74 65 4D 6F
75 6E 74 64 6E 61 6D 65 A1 63 76 61 6C 68 53 6F
6D 65 55 73 65 72 66 74 61 72 67 65 74 A2 62 40
54 61 24 62 40 6C 0B 64 6E 61 6D 65 A1 63 76 61
6C 60 A5 62 40 54 61 31 62 40 69 0A 62 40 74 65
44 72 69 76 65 66 6C 65 74 74 65 72 A1 63 76 61
6C 18 45 68 72 6F 6F 74 5F 64 69 72 A5 62 40 54
61 31 62 40 69 0B 62 40 74 69 44 69 72 65 63 74
6F 72 79 67 65 6E 74 72 69 65 73 A2 62 40 54 61
2A 62 40 64 82 A5 62 40 54 61 31 62 40 69 0C 62
40 74 64 46 69 6C 65 68 63 6F 6E 74 65 6E 74 73
A1 63 76 61 6C 64 74 65 73 74 64 6E 61 6D 65 A1
63 76 61 6C 6D 74 65 73 74 2D 63 6F 6E 74 65 6E
74 73 A5 62 40 54 61 31 62 40 69 0D 62 40 74 64
46 69 6C 65 68 63 6F 6E 74 65 6E 74 73 A1 63 76
61 6C 65 74 65 73 74 32 64 6E 61 6D 65 A1 63 76
61 6C 6E 74 65 73 74 32 2D 63 6F 6E 74 65 6E 74
73 64 6E 61 6D 65 A1 63 76 61 6C 60
Complete file listings¶
directory.tree¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 | // Attach \file docstrings to the generated files for Doxygen. # Implementation for classes representing a Windows directory tree. source # Header for classes representing a Windows directory tree. header // Include tree base classes. include "tree-base.hpp" tree_namespace tree::base // Include primitive types. include "primitives.hpp" import primitives // Initialization function to use to construct default values for the tree base // classes and primitives. initialize_function primitives::initialize serdes_functions primitives::serialize primitives::deserialize // Set the namespace for the generated classes and attach a docstring. # Namespace for classes representing a Windows directory tree. namespace directory # Root node, containing the drives and associated directory trees. system { # The drives available on the system. There must be at least one. drives: Many<drive>; } # Represents a drive. drive { # The drive letter used to identify it. letter: primitives::Letter; # Root directory. root_dir: One<directory>; } # Represents a directory entry. entry { # Name of the directory entry. name: primitives::String; # Represents a regular file. file { # The file contents. contents: primitives::String; } # Represents a (sub)directory. directory { # The directory contents. Note that directories can be empty. entries: Any<entry>; } # Represents a link to another directory. mount { # The directory linked to. target: Link<directory>; } } |
primitives.hpp¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 | /** \file * Defines primitives used in the generated directory tree structure. */ #pragma once #include <string> /** * Namespace with primitives used in the generated directory tree structure. */ namespace primitives { /** * Letter primitive, used to represent drive letters. */ using Letter = char; /** * Strings, used to represent filenames and file contents. */ using String = std::string; /** * Initialization function. This must be specialized for any types used as * primitives in a tree that are actual C primitives (int, char, bool, etc), * as these are not initialized by the T() construct. */ template <class T> T initialize() { return T(); }; /** * Declare the default initializer for drive letters. It's declared inline * to avoid having to make a cpp file just for this. */ template <> inline Letter initialize<Letter>() { return 'A'; } /** * Serialization function. This must be specialized for any types used as * primitives in a tree. The default implementation doesn't do anything. */ template <typename T> void serialize(const T &obj, tree::cbor::MapWriter &map) { } /** * Serialization function for Letter. */ template <> inline void serialize<Letter>(const Letter &obj, tree::cbor::MapWriter &map) { map.append_int("val", obj); } /** * Serialization function for String. */ template <> inline void serialize<String>(const String &obj, tree::cbor::MapWriter &map) { map.append_string("val", obj); } /** * Deserialization function. This must be specialized for any types used as * primitives in a tree. The default implementation doesn't do anything. */ template <typename T> T deserialize(const tree::cbor::MapReader &map) { return initialize<T>(); } /** * Deserialization function for Letter. */ template <> inline Letter deserialize<Letter>(const tree::cbor::MapReader &map) { return map.at("val").as_int(); } /** * Deserialization function for String. */ template <> inline String deserialize<String>(const tree::cbor::MapReader &map) { return map.at("val").as_string(); } } // namespace primitives |
primitives.py¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | class Letter(str): """Letter primitive, used to represent drive letters.""" def __new__(cls, s='A'): return str.__new__(cls, s) class String(str): """Strings, used to represent filenames and file contents.""" pass def serialize(typ, val): """Serialization function.""" # Serialization formats of primitives. if typ is Letter: return {'val': ord(val)} if typ is String: return {'val': val} # Serialization formats of annotations. if isinstance(typ, str): return val # Some unknown type. raise TypeError('no known serialization of type %r, value %r' % (typ, val)) def deserialize(typ, val): """Deserialization function.""" # Serialization formats of primitives. if typ is Letter: return Letter(chr(val['val'])) if typ is String: return String(val['val']) # Serialization formats of annotations. if isinstance(typ, str): return val # Some unknown type. raise TypeError('no known deserialization for type %r, value %r' % (typ, val)) |
CMakeLists.txt¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 | cmake_minimum_required(VERSION 3.12 FATAL_ERROR) project(directory-example CXX) # Add the tree-gen repository root directory. Normally, your project would be # located outside of tree-gen's root, which means you can omit the second # argument. add_subdirectory(../.. tree-gen) # Generates the files for the directory tree. generate_tree_py( "${CMAKE_CURRENT_SOURCE_DIR}/directory.tree" "${CMAKE_CURRENT_BINARY_DIR}/directory.hpp" "${CMAKE_CURRENT_BINARY_DIR}/directory.cpp" "${CMAKE_CURRENT_BINARY_DIR}/directory.py" ) add_executable( directory-example "${CMAKE_CURRENT_BINARY_DIR}/directory.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/main.cpp" ) target_include_directories( directory-example # This directory for primitives.hpp: PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}" # Binary directory for directory.hpp: PRIVATE "${CMAKE_CURRENT_BINARY_DIR}" ) target_link_libraries(directory-example tree-lib) # The following lines only serve to register the example as a test, so you can # run it using `make test` or `ctest` as well. You don't need them as such in # your own project. enable_testing() add_test( NAME directory-example COMMAND directory-example WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR} ) # Only add the Python test if CMake is new enough for us to not have to bother # with FindPythonInterp. if(NOT ${CMAKE_VERSION} VERSION_LESS "3.12") find_package(Python3 COMPONENTS Interpreter) if(${Python3_FOUND}) add_test( NAME directory-example-py COMMAND ${Python3_EXECUTABLE} main.py ${CMAKE_CURRENT_BINARY_DIR} WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR} ) endif() endif() |
main.cpp¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 | #include <iostream> #include <cstdio> #include <stdexcept> #include "../utils.hpp" // Include the generated file. #include "directory.hpp" // Note: the // comment contents of main(), together with the MARKER lines and // the output of the program, are used to automatically turn this into a // restructured-text page for ReadTheDocs. int main() { // ********************** // Directory tree example // ********************** // // This example illustrates the tree system with a Windows-like directory // tree structure. The ``System`` root node consists of one or more drives, // each of which has a drive letter and a root directory with named files // and subdirectories. To make things a little more interesting, a // symlink-like "mount" is added as a file type, which links to another // directory. // // Using this tree, this example should teach you the basics of using // tree-gen trees. The tutorial runs you through the C++ code of // ``main.cpp`` and Python code of ``main.py``, but be sure to check out // ``directory.tree``, ``primitives.hpp``, ``primitives.hpp``, and (if you // want to reproduce it in your own project) ``CMakeLists.txt`` copied to // the bottom of this page as well. You will also find the complete // ``main.cpp`` and ``main.py`` there. // // Let's start by making the root node using ``tree::base::make()``. This // works somewhat like ``std::make_shared()``, but instead of returning a // ``shared_ptr`` it returns a ``One`` edge. This might come off as a bit // odd, considering trees in graph theory start with a node rather than an // edge. This is just a choice, though; a side effect of how the internal // ``shared_ptr``s work and how Link/OptLink edges work (if you'd store the // tree as the root structure directly, the root node wouldn't always be // stored in the same place on the heap, breaking link nodes). auto system = tree::base::make<directory::System>(); MARKER // At all times, you can use the ``dump()`` method on a node to see what it // looks like for debugging purposes. It looks like this: system->dump(); MARKER // While this system node exists as a tree in memory and tree-gen seems // happy, our system tree is at this time not "well-formed". A tree node (or // an edge to one) is only considered well-formed if all of the following // are true: // // - All ``One`` edges in the tree connect to exactly one node. // - All ``Many`` edges in the tree connect to at least one node. // - All ``Link`` and non-empty ``OptLink`` nodes refer to a node reachable // from the root node. // - Each node in the tree is only used by a non-link node once. // // Currently, the second requirement is not met. ASSERT(!system.is_well_formed()); MARKER // You can get slightly more information using ``check_well_formed()``; // instead or returning a boolean, it will throw a ``NotWellFormed`` // exception with an error message. ASSERT_RAISES(tree::base::NotWellFormed, system.check_well_formed()); MARKER // Note that the name of the type is // `"mangled" <https://en.wikipedia.org/wiki/Name_mangling#C++>`_. The exact // output will vary from compiler to compiler, or even from project to // project. But hopefully it'll be readable enough to make sense of in // general. // To fix that, let's add a default drive node to the tree. This should get // drive letter ``A``, because primitives::initialize() is specialized to // return that for ``Letter``s (see primitives.hpp). system->drives.add(tree::base::make<directory::Drive>()); MARKER // ``Drive`` has a ``One`` edge that is no empty, though, so the tree still // isn't well-formed. Let's add one of those as well. system->drives[0]->root_dir = tree::base::make<directory::Directory>(); MARKER // Now we have a well-formed tree. Let's have a look: system.check_well_formed(); system->dump(); MARKER // We can just change the drive letter by assignment, as you would expect. system->drives[0]->letter = 'C'; MARKER // Before we add files and directories, let's make a shorthand variable for // the root directory. Because root_dir is an edge to another node rather // than the node itself, and thus acts like a pointer or reference to it, // we can just copy it into a variable and modify the variable to update the // tree. Note that you'd normally just write "auto" for the type for // brevity; the type is shown here to make explicit what it turns into. directory::One<directory::Directory> root = system->drives[0]->root_dir; MARKER // Let's make a "Program Files" subdirectory and add it to the root. auto programs = tree::base::make<directory::Directory>( tree::base::Any<directory::Entry>{}, "Program Files"); root->entries.add(programs); system.check_well_formed(); MARKER // That's quite verbose. But in most cases it can be written way shorter. // Here's the same with the less versatile but also less verbose emplace() // method (which avoids the tree::make() call, but doesn't allow an // insertion index to be specified) and with a "using namespace" for the // tree. emplace() can also be chained, allowing multiple files and // directories to be added at once in this case. { using namespace directory; root->entries.emplace<Directory>(Any<Entry>{}, "Windows") .emplace<Directory>(Any<Entry>{}, "Users") .emplace<File>("lots of hibernation data", "hiberfil.sys") .emplace<File>("lots of page file data", "pagefile.sys") .emplace<File>("lots of swap data", "swapfile.sys"); } system.check_well_formed(); MARKER // In order to look for a file in a directory, you'll have to make your own // function to iterate over them. After all, tree-gen doesn't know that the // name field is a key; it has no concept of a key-value store. This is // simple enough to make, but to prevent this example from getting out of // hand we'll just use indices for now. // Let's try to read the "lots of data" string from pagefile.sys. ASSERT(root->entries[4]->name == "pagefile.sys"); // ASSERT(root->entries[4]->contents == "lots of page file data"); // '-> No member named 'contents' in 'directory::Entry' MARKER // Hm, that didn't work so well, because C++ doesn't know that entry 4 // happens to be a file rather than a directory or a mount. We have to tell // it to cast to a file first (which throws a std::bad_cast if it's not // actually a file). The easiest way to do that is like this: ASSERT(root->entries[4]->as_file()->contents == "lots of page file data"); MARKER // No verbose casts required; tree-gen will make member functions for all // the possible subtypes. // While it's possible to put the same node in a tree twice (without using // a link), this is not allowed. This isn't checked until a well-formedness // check is performed, however (and in fact can't be without having access // to the root node). root->entries.add(root->entries[0]); ASSERT_RAISES(tree::base::NotWellFormed, system.check_well_formed()); MARKER // Note that we can index nodes Python-style with negative integers for // add() and remove(), so remove(-1) will remove the broken node we just // added. Note that the -1 is not actually necessary, though, as it is the // default. root->entries.remove(-1); system.check_well_formed(); MARKER // We *can*, of course, add copies of nodes. That's what copy (shallow) and // clone (deep) are for. Usually you'll want a deep copy, but in this case // shallow is fine, because a File has no child nodes. Note that, even for // a deep copy, links are not modified; it is intended to copy a subtree to // another part of the same tree. To make a complete copy of a tree that // maintains link semantics (i.e. does not link back to the original tree) // the best way would be to go through serialize/deserialize. root->entries.add(root->entries[0]->copy()); system.check_well_formed(); MARKER // Note that the generated classes don't care that there are now two // directories named Program Files in the root. As far as they're concerned, // they're two different directories with the same name. Let's remove it // again, though. root->entries.remove(); MARKER // Something we haven't looked at yet are links. Links are edges in the tree // that, well, turn it into something that isn't strictly a tree anymore. // While One/Maybe/Any/Many require that nodes are unique, Link/OptLink // require that they are *not* unique, and are present elsewhere in the // tree. Let's make a new directory, and mount it in the Users directory. auto user_dir = tree::base::make<directory::Directory>( tree::base::Any<directory::Entry>{}, ""); root->entries.emplace<directory::Mount>(user_dir, "SomeUser"); MARKER // Note that user_dir is not yet part of the tree. emplace works simply // because it doesn't check whether the directory is in the tree yet. But // the tree is no longer well-formed now. ASSERT_RAISES(tree::base::NotWellFormed, system.check_well_formed()); MARKER // To make it work again, we can add it as a root directory to a second // drive. system->drives.emplace<directory::Drive>('D', user_dir); system.check_well_formed(); MARKER // A good way to confuse a filesystem is to make loops. tree-gen is okay // with this, though. system->drives[1]->root_dir->entries.emplace<directory::Mount>(root, "evil link to C:"); system.check_well_formed(); MARKER // The only place where it matters is in the dump function, which only goes // one level deep. After that, it'll just print an ellipsis. system->dump(); MARKER // Now that we have a nice tree, let's try the serialization and // deserialization functionality. Serializing is easy: std::string cbor = tree::base::serialize(system); MARKER // Let's write it to a file; we'll load this in Python later. { std::ofstream cbor_output; cbor_output.open("tree.cbor", std::ios::out | std::ios::trunc | std::ios::binary); cbor_output << cbor; } MARKER // Let's also have a look at a hexdump of that. int count = 0; for (char c : cbor) { if (count == 16) { std::printf("\n"); count = 0; } else if (count > 0 && count % 4 == 0) { std::printf(" "); } std::printf("%02X ", (uint8_t)c); count++; } std::printf("\n"); MARKER // You can pull that through http://cbor.me and https://jsonformatter.org // to inspect the output, if you like. // We can deserialize it into a complete copy of the tree. auto system2 = tree::base::deserialize<directory::System>(cbor); system2->dump(); MARKER // Note that equality for two link edges is satisfied only if they point to // the exact same node. That's not the case for the links in our two // entirely separate trees, so the two trees register as unequal. ASSERT(!system2.equals(system)); MARKER // To be sure no data was lost, we'll have to check the CBOR and debug dumps // instead. ASSERT(tree::base::serialize(system) == tree::base::serialize(system2)); std::ostringstream ss1{}; system->dump(ss1); std::ostringstream ss2{}; system2->dump(ss2); ASSERT(ss1.str() == ss2.str()); MARKER return 0; } |
main.py¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 | import sys, os, traceback TEST_DIR = os.path.realpath(sys.argv[1]) sys.path.append(TEST_DIR) def marker(): print('###MARKER###') # Note: the # | comment contents of this file, together with the marker() lines # and the output of the program, are used to automatically turn this into a # restructured-text page for ReadTheDocs. The output is appended to the C++ # output. # | Python usage # | ============ # | Let us now turn our attention to tree-gen's Python support. As you may # | remember from the introduction, tree-gen does not provide a direct interface # | between Python and C++ using a tool like SWIG; rather, it provides # | conversion between a CBOR tree representation and the in-memory # | representations of both languages. It then becomes trivial for users of the # | tree-gen structure to provide an API to users that converts from one to the # | other, since only a single string of bytes has to be transferred. # | Recall that we wrote the CBOR representation of the tree we constructed in # | the C++ example to a file. Let's try loading this file with Python. from directory import * with open(os.path.join(TEST_DIR, 'tree.cbor'), 'rb') as f: tree = System.deserialize(f.read()) tree.check_well_formed() print(tree) marker() # | And there we go; the same structure in pure Python. # | As you might expect from a pure-Python structure, field access is much less # | verbose than it is in C++. print(tree.drives[0].letter) print(tree.drives[0].root_dir.entries[0].name) marker() # | Perhaps contrary to Python intuition however, the structure provides # | type-safety. You can't assign values to fields that don't exist... try: tree.color_of_a_banana = 'yellow' except AttributeError: traceback.print_exc(file=sys.stdout) marker() # | ... or assign values of incorrect types to ones that do: try: tree.drives[0] = 'not-a-drive' except TypeError: traceback.print_exc(file=sys.stdout) marker() # | Note that the setters will however try to "typecast" objects they don't know # | about (to be more specific, anything that isn't an instance of Node). So if # | we try to assign a string directly to tree.drives, you get something you # | might not have expected: try: tree.drives = 'not-a-set-of-drives' except TypeError: traceback.print_exc(file=sys.stdout) marker() # | The setter for the drives property is trying to cast the string to a # | MultiDrive, which interprets its string input as any other iterable, and # | ultimately tries to cast the first letter to a Drive. There's not much that # | can be done about this; it's just how duck typing works. # | Manipulation of trees in Python works just as you might expect it to in # | Python: tree.drives.append(Drive('E')) tree.drives[-1].root_dir = Directory([ File('test', 'test-contents'), File('test2', 'test2-contents'), ]) del tree.drives[1] assert not tree.is_well_formed() tree.drives[0].root_dir.entries[-1].target = tree.drives[1].root_dir tree.check_well_formed() print(tree) marker() # | Once the Python code is done manipulating the structure, it can be # | serialized again. cbor = tree.serialize() assert cbor == System.deserialize(cbor).serialize() count = 0 for c in cbor: if count == 16: print() count = 0 elif count > 0 and count % 4 == 0: print(' ', end='') print('%02X ' % c, end='') count += 1 print() marker() |
Interpreter example¶
This example discusses some of the more advanced features of tree-gen, using an interpreter for a very simple language as an example. This is, however, not a completely integrated or even functional interpreter; we just go over some of the concepts and things you may run into when you would make such an interpreter.
External nodes & multiple trees¶
When trees get large and complicated, you may want to split a tree up
into multiple files. This is supported within tree-gen, as long as there
are no circular references to nodes between the files. In this example,
we’ve split up the expression-like nodes into a tree separate from the
statement-like nodes. The former, value.tree
, has nothing special
going on in this regard, while the latter, program.tree
, effectively
includes the former. The files are listed at the bottom of this page
for your convenience.
tree-gen itself is unaware of any links between files. For
program.tree
, it simply generates code assuming that the types
referred to through the external
edges exist. For this to be true,
you have to tell tree-gen to include the external tree’s generated
header file at the top of its generated header file using the include
directive. Note also that, as tree-gen is unaware of the external tree,
you’ll have to use the full C++ TitleCase type name, rather than the
snake_case name for normal edges.
Integration with Flex/Bison (or other parsers)¶
When making an interpreter (or parser of any kind) with tree-gen, you’ll
probably want to use a tokenizer/parser generator such as Flex/Bison.
With Bison at least, you’ll quickly run into a fundamental issue when
trying to store tree-gen nodes in the generated yyval
union:
union in C++ can only consist of types with trivial constructors, and
tree-gen nodes are not among those. The author is unfamiliar with
other frameworks like it, but this will probably apply for any C-style
parser generator.
The solution is to use raw pointers to the generated node types in this enumeration. For instance, you may end up with the following (which would be generated by Bison):
union node_type {
program::Program *prog;
program::Assignment *asgn;
value::Literal *lit;
value::Reference *ref;
// ...
};
To create a new node, instead of using tree::base::make
,
you use the new
keyword directly.
node_type n1, n2;
n1.lit = new value::Literal(2);
n2.ref = new value::Reference("x");
The generated constructors won’t help you much for non-leaf nodes,
because they expect edge wrappers (One, Maybe, etc.) around the
parameters, so you’ll typically want to use the constructor without
arguments. Instead, to assign child nodes, you can use
Maybe::set_raw()
, One::set_raw()
, Any::add_raw()
,
or Many::add_raw()
. These functions take ownership of a new-allocated
raw pointer to a node - exactly what we need here. Note that because they
take ownership, you don’t have to (and shouldn’t!) delete manually.
node_type n3;
n3.asgn = new program::Assignment();
n3.asgn->rhs.set_raw(n1.lit);
n3.asgn->lhs.set_raw(n2.ref);
node_type n4;
n4.prog = new program::Program();
n4.prog->statements.add_raw(n3.asgn);
Bison will ultimately pass you the final parsed node, which you then have
to put into a One
edge yourself:
auto tree = tree::base::One<program::Program>();
tree.set_raw(n4.prog);
tree->dump();
↓
Program(
variables: []
statements: [
Assignment(
lhs: <
Reference(
name: x
target --> !MISSING
)
>
rhs: <
Literal(
value: 2
)
>
)
]
)
Note that when you’re using Bison, the enum and its types is largely hidden from you. In the actions the above would look more like this:
// Action for a literal:
{ $$ = new value::Literal(std::stoi($1)); }
// Action for a reference:
{ $$ = new value::Reference($1); /* and deallocate $1, assuming it's a char* */ }
// Action for a assignment:
{ $$ = new value::Assignment(); $$->lhs.set_raw($1); $$->rhs.set_raw($3); }
and so on.
Syntax error recovery¶
Some parser generators (like Bison) allow you to specify recovery rules
in case of a syntax error, so the parser doesn’t just die immediately.
This is necessary to emit more than a single error message at a time.
To help you store recovery information in the tree for post-mortem
analysis, tree-gen has a special error
directive that you can place
inside a node class. For example, if you make a recovery rule that
assumes any semicolon encountered after a syntax error produces a broken
statement, you can make a node like this:
# An erroneous statement.
erroneous_statement {
error;
}
Such nodes behave exactly like any other node, with one exception: they always indicate that they’re not well-formed.
auto err_stmt = tree::base::make<program::ErroneousStatement>();
ASSERT_RAISES(tree::base::NotWellFormed, err_stmt.check_well_formed());
↓
tree::base::NotWellFormed exception message: ErroneousStatement error node in tree
Line number information (and annotations in general)¶
When you’re doing any kind of serious parsing, you’ll want to store as much line number and other contextual information as possible, to make your error messages as clear as possible. You could of course add this information to every node in the tree specification file, perhaps using inheritance to prevent excessive repetition, but this gets annoying fast, especially when it comes to tree-gen’s well-formedness rules.
Instead, you should be using annotations for this. Annotations are
objects added to nodes without you ever having to declare that you’re
going to add them. If that sounds like black magic to you in the context
of C++, well, that’s because it is: internally annotations are stored
as a map from std::type_index
to a C++11 backport of std::any
…
it’s complicated. But you don’t have to worry about it. What matters is
that each and every node generated by tree-gen (or anything else that
inherits from tree::annotatable::Annotatable
) can have zero or one
annotation of every C++ type attached to it.
The use of annotations is not limited to metadata (although that is its primary purpose); they also allow bits of code to temporarily attach their own data to tree nodes, without the tree definitions needing to be updated to reflect this. This is especially useful when you’re using tree-gen in a library, and you want to let users operate on your trees. After all, they wouldn’t be able to modify the tree definition file without forking and recompiling your library.
tree-gen has only one special case for annotations, intended for adding
line number information to debug dumps. The location
directive is
used for this:
// Source location annotation type. The generated dumper will see if this
// annotation exists for a node, and if so, write it to the debug dump using
// the << stream operator.
location primitives::SourceLocation
Without going into too much detail about the Annotatable interface (just look at the API docs for that), here’s an example of how it would work.
n1.lit->set_annotation(primitives::SourceLocation("test", 1, 5));
n2.ref->set_annotation(primitives::SourceLocation("test", 1, 1));
n3.asgn->set_annotation(primitives::SourceLocation("test", 1, 1));
n4.prog->set_annotation(primitives::SourceLocation("test", 1, 1));
tree->dump();
↓
Program( # test:1:1
variables: []
statements: [
Assignment( # test:1:1
lhs: <
Reference( # test:1:1
name: x
target --> !MISSING
)
>
rhs: <
Literal( # test:1:5
value: 2
)
>
)
]
)
Note that we can still use these pointers despite ownership having been passed to the node objects because they refer to the same piece of memory. In practice, though, you would do this during the parser actions, just after constructing them.
When serializing and deserializing, annotations are ignored by default;
they can be of any type whatsoever, and C++ can’t dynamically introspect
which types can be (de)serialized and which can’t be, after all. So even
though the example SourceLocation object extends
tree::annotatable::Serializable
, this doesn’t work automagically.
auto node = tree::base::make<value::Literal>(2);
node->set_annotation(primitives::SourceLocation("test", 1, 1));
std::string cbor = tree::base::serialize(node);
auto node2 = tree::base::deserialize<value::Literal>(cbor);
ASSERT(!node2->has_annotation<primitives::SourceLocation>());
// However, you *can* register annotations types for serialization and
// deserialization if you want to through the
// ``tree::annotatable::serdes_registry`` singleton. After that, it will
// work.
tree::annotatable::serdes_registry.add<primitives::SourceLocation>("loc");
cbor = tree::base::serialize(node);
node2 = tree::base::deserialize<value::Literal>(cbor);
ASSERT(node2->has_annotation<primitives::SourceLocation>());
Two add()
methods are provided. The one used here assumes that the
type has an appropriate serialize()
method and an appropriate
constructor for deserialization, the other allows you to specify them
manually using ``std::function``s. The name is optional but strongly
recommended; if not used, a C++-compiler-specific identifier will be used
for the type.
Similar to links, annotations aren’t copied as you might expect by
copy()
and clone()
. Specifically, annotations are stored as
std::shared_ptr``s to unknown C++ objects, and therefore copying a
node only copies the references to the annotations. To fully clone
annotations, you'll either have to serialize and deserialize the node
they belong to (after registering with ``serdes_registry
of course), or
clone them manually.
Visitor pattern¶
In the directory example, we avoided the difficulty of dealing with edges
to incomplete types, such as a One<value::Rvalue>
, aside from a
single as_file()
call halfway through. Spamming such typecasts to,
for instance, evaluate an rvalue expression during constant propagation
or interpreting, is not very scalable. It can also lead to headache down
the line if you ever have to add more subclasses, as it’s easy to forget
or not bother to check for unknown types when initially writing that code
that way. tree-gen also generates the necessary classes for the visitor
pattern for you to avoid this.
In the visitor pattern, you define a class that implements the
appropriate visitor interface, and (in this case) the nodes will call
methods on this class depending on their type when you pass the visitor
object to them through their visit()
method. Because the visitor
interface classes are also generated by tree-gen and have sane default
behavior, your code will “by default” be future-proof, in the sense that
you’ll get an error if something unexpected happens, rather than it
failing silently and maybe crashing down the line.
Two visitor base classes are provided, Visitor<T=void>
and
RecursiveVisitor
. In either case, you have to override
visit_node()
, which is called when a node is encountered for which
no specialization is available; this should just throw an appropriate
exception. You can then optionally override any node in the node class
hierarchy to provide the actual behavior.
The two classes differ in the default behavior when an unspecialized
node is encountered: Visitor
always defers up the class hierarchy,
while RecursiveVisitor
provides a default implementation for node
types with edges, where it simply recursively visits the child nodes
depth-first.
The T type in Visitor<T>
refers to the return type for the visit
methods. It’s fixed to void
for RecursiveVisitor
, because if a
node has multiple children, there is nothing sensible to return.
Let’s look at what an expression evaluator looks like for this.
class RvalueEvaluator : public value::Visitor<int> {
public:
int visit_node(value::Node &node) override {
throw std::runtime_error("unknown node type");
}
int visit_literal(value::Literal &node) override {
return node.value;
}
int visit_negate(value::Negate &node) override {
return -node.oper->visit(*this);
}
int visit_add(value::Add &node) override {
return node.lhs->visit(*this) + node.rhs->visit(*this);
}
int visit_sub(value::Sub &node) override {
return node.lhs->visit(*this) - node.rhs->visit(*this);
}
int visit_mul(value::Mul &node) override {
return node.lhs->visit(*this) * node.rhs->visit(*this);
}
int visit_div(value::Div &node) override {
return node.lhs->visit(*this) / node.rhs->visit(*this);
}
int visit_reference(value::Reference &node) override {
return node.target->value;
}
};
// 2 + (3 * 4) = 14
auto expr = tree::base::make<value::Add>(
tree::base::make<value::Literal>(2),
tree::base::make<value::Mul>(
tree::base::make<value::Literal>(3),
tree::base::make<value::Literal>(4)
)
);
RvalueEvaluator eval{};
ASSERT(expr->visit(eval) == 14);
If we wouldn’t have implemented visit_node()
, the
RvalueEvaluator eval{};
line would break, as RvalueEvaluator would
be an abstract class because of it. And while the above may look pretty
complete, it is in fact not:
ASSERT_RAISES(std::runtime_error, value::ErroneousValue().visit(eval));
↓
std::runtime_error exception message: unknown node type
The visitor pattern is more powerful than recursively calling functions
for other reasons as well, because they can be specialized through
inheritance, state can be maintained in the class as the tree is
traversed (the dump()
method is actually implemented this way
internally), it can be written inline within a function despite the
recursive nature, and so on. But it’s a lot more verbose than the
as_*()
alternative for simple cases. The choice between visitor
pattern and casts is therefore usually a matter of personal preference.
Complete file listings¶
value.tree¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | // Attach \file docstrings to the generated files for Doxygen. # Implementation for value tree classes. source # Header for value tree classes. header // Include tree base classes. include "tree-base.hpp" tree_namespace tree::base // Include primitive types. include "primitives.hpp" import primitives // Initialization function to use to construct default values for the tree base // classes and primitives. initialize_function primitives::initialize serdes_functions primitives::serialize primitives::deserialize // Source location annotation type. The generated dumper will see if this // annotation exists for a node, and if so, write it to the debug dump using // the << stream operator. location primitives::SourceLocation // Set the namespace for the generated classes and attach a docstring. # Namespace for value tree classes. namespace value # Variable instance. variable { # The name of the variable. name: primitives::Str; # The current value of the variable while interpreting. value: primitives::Int; } # Toplevel expression node. rvalue { # A literal integer. literal { # The value of the literal. value: primitives::Int; } # Unary operators. unop { # The operand. oper: One<rvalue>; # Negation. negate {} } # Binary operators. binop { # Left-hand side of the expression. lhs: One<rvalue>; # Right-hand side of the expression. rhs: One<rvalue>; # Addition. add {} # Subtraction. sub {} # Multiplication. mul {} # Division. div {} } # Toplevel node for assignable expressions. lvalue { # A variable reference. reference { # The name used to refer to the variable. name: primitives::Str; # The variable being referenced. // Note that we use a link here to allow multiple places in the // tree to refer to the same variable instance in our toy // interpreter, so we don't have to do a hashmap lookup each time // and a well-formed tree by definition cannot refer to variables // that don't exist. target: Link<variable>; } # An erroneous expression. erroneous_value { error; } } } |
program.tree¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 | // Attach \file docstrings to the generated files for Doxygen. # Implementation for program tree classes. source # Header for program tree classes. header // Include tree base classes. include "tree-base.hpp" tree_namespace tree::base // Include primitive types. include "primitives.hpp" import primitives // Include the generated value tree header. include "value.hpp" import value // Initialization function to use to construct default values for the tree base // classes and primitives. initialize_function primitives::initialize serdes_functions primitives::serialize primitives::deserialize // Source location annotation type. The generated dumper will see if this // annotation exists for a node, and if so, write it to the debug dump using // the << stream operator. location primitives::SourceLocation // Set the namespace for the generated classes and attach a docstring. # Namespace for program tree classes. namespace program # A program is a collection of statements. program { # Variable declarations for the program. variables: external Any<value::Variable>; # The statements in the program. statements: Any<statement>; } # A statement. statement { # Conditional statement. if_else { # Branch condition. cond: external One<value::Rvalue>; # If block. if_block: Any<statement>; # Else block. else_block: Any<statement>; } # For loop, from start to (non-inclusive) end. for_loop { # Variable to assign. var: external One<value::Lvalue>; # Start value. start: external One<value::Rvalue>; # Stop value. stop: external One<value::Rvalue>; # The repeated code. block: Any<statement>; } # Assignment statement. assignment { # Variable to assign. lhs: external One<value::Lvalue>; # Expression to assign it to. rhs: external One<value::Rvalue>; } # Print statement. print { # Value to print. expr: external One<value::Rvalue>; } # An erroneous statement. erroneous_statement { error; } } |
primitives.hpp¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 | /** \file * Defines primitives used in the generated directory tree structure. */ #pragma once #include <string> #include "tree-base.hpp" /** * Namespace with primitives used in the generated directory tree structure. */ namespace primitives { /** * Integer primitive. */ using Int = int; /** * Strings, used to represent filenames and file contents. */ using Str = std::string; /** * Initialization function. This must be specialized for any types used as * primitives in a tree that are actual C primitives (int, char, bool, etc), * as these are not initialized by the T() construct. */ template <class T> T initialize() { return T(); }; /** * Declare the default initializer for integers. It's declared inline to avoid * having to make a cpp file just for this. */ template <> inline Int initialize<Int>() { return 0; } /** * Serialization function. This must be specialized for any types used as * primitives in a tree. The default implementation doesn't do anything. */ template <typename T> void serialize(const T &obj, tree::cbor::MapWriter &map) { } /** * Serialization function for Int. */ template <> inline void serialize<Int>(const Int &obj, tree::cbor::MapWriter &map) { map.append_int("val", obj); } /** * Serialization function for Str. */ template <> inline void serialize<Str>(const Str &obj, tree::cbor::MapWriter &map) { map.append_string("val", obj); } /** * Deserialization function. This must be specialized for any types used as * primitives in a tree. The default implementation doesn't do anything. */ template <typename T> T deserialize(const tree::cbor::MapReader &map) { return initialize<T>(); } /** * Deserialization function for Int. */ template <> inline Int deserialize<Int>(const tree::cbor::MapReader &map) { return map.at("val").as_int(); } /** * Deserialization function for Str. */ template <> inline Str deserialize<Str>(const tree::cbor::MapReader &map) { return map.at("val").as_string(); } /** * Source location annotation object, containing source file line numbers etc. */ class SourceLocation: tree::annotatable::Serializable { public: std::string filename; int line; int column; SourceLocation( const std::string &filename, uint32_t line = 0, uint32_t column = 0 ) : filename(filename), line(line), column(column) { } /** * Serialization logic for SourceLocation. */ inline void serialize(tree::cbor::MapWriter &map) const override { map.append_string("filename", filename); map.append_int("line", line); map.append_int("column", column); } /** * Deserialization logic for SourceLocation. */ inline SourceLocation(const tree::cbor::MapReader &map) { filename = map.at("filename").as_string(); line = map.at("line").as_int(); column = map.at("column").as_int(); } }; /** * Stream << overload for source location objects. */ inline std::ostream& operator<<(std::ostream& os, const primitives::SourceLocation& object) { os << object.filename << ":" << object.line << ":" << object.column; return os; } } // namespace primitives |
CMakeLists.txt¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 | cmake_minimum_required(VERSION 3.12 FATAL_ERROR) project(interpreter-example CXX) # Add the tree-gen repository root directory. Normally, your project would be # located outside of tree-gen's root, which means you can omit the second # argument. add_subdirectory(../.. tree-gen) # Generates the files for the value tree. generate_tree_py( "${CMAKE_CURRENT_SOURCE_DIR}/value.tree" "${CMAKE_CURRENT_BINARY_DIR}/value.hpp" "${CMAKE_CURRENT_BINARY_DIR}/value.cpp" "${CMAKE_CURRENT_BINARY_DIR}/value.py" ) # Generates the files for the program tree. generate_tree_py( "${CMAKE_CURRENT_SOURCE_DIR}/program.tree" "${CMAKE_CURRENT_BINARY_DIR}/program.hpp" "${CMAKE_CURRENT_BINARY_DIR}/program.cpp" "${CMAKE_CURRENT_BINARY_DIR}/program.py" ) add_executable( interpreter-example "${CMAKE_CURRENT_BINARY_DIR}/value.cpp" "${CMAKE_CURRENT_BINARY_DIR}/program.cpp" "${CMAKE_CURRENT_SOURCE_DIR}/main.cpp" ) target_include_directories( interpreter-example # This directory for primitives.hpp: PRIVATE "${CMAKE_CURRENT_SOURCE_DIR}" # Binary directory for directory.hpp: PRIVATE "${CMAKE_CURRENT_BINARY_DIR}" ) target_link_libraries(interpreter-example tree-lib) # The following lines only serve to register the example as a test, so you can # run it using `make test` or `ctest` as well. You don't need them as such in # your own project. enable_testing() add_test(interpreter-example interpreter-example) |
main.cpp¶
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 | #include <iostream> #include <cstdio> #include <stdexcept> #include "../utils.hpp" // Include the generated files. #include "value.hpp" #include "program.hpp" // Note: the // comment contents of main(), together with the MARKER lines and // the output of the program, are used to automatically turn this into a // restructured-text page for ReadTheDocs. int main() { // ******************* // Interpreter example // ******************* // // This example discusses some of the more advanced features of tree-gen, // using an interpreter for a very simple language as an example. This is, // however, not a completely integrated or even functional interpreter; we // just go over some of the concepts and things you may run into when you // would make such an interpreter. // // External nodes & multiple trees // =============================== // // When trees get large and complicated, you may want to split a tree up // into multiple files. This is supported within tree-gen, as long as there // are no circular references to nodes between the files. In this example, // we've split up the expression-like nodes into a tree separate from the // statement-like nodes. The former, ``value.tree``, has nothing special // going on in this regard, while the latter, ``program.tree``, effectively // includes the former. The files are listed at the bottom of this page // for your convenience. // // tree-gen itself is unaware of any links between files. For // ``program.tree``, it simply generates code assuming that the types // referred to through the ``external`` edges exist. For this to be true, // you have to tell tree-gen to include the external tree's generated // header file at the top of its generated header file using the ``include`` // directive. Note also that, as tree-gen is unaware of the external tree, // you'll have to use the full C++ TitleCase type name, rather than the // snake_case name for normal edges. // // Integration with Flex/Bison (or other parsers) // ============================================== // // When making an interpreter (or parser of any kind) with tree-gen, you'll // probably want to use a tokenizer/parser generator such as Flex/Bison. // With Bison at least, you'll quickly run into a fundamental issue when // trying to store tree-gen nodes in the generated ``yyval`` union: // union in C++ can only consist of types with trivial constructors, and // tree-gen nodes are not among those. The author is unfamiliar with // other frameworks like it, but this will probably apply for any C-style // parser generator. // // The solution is to use raw pointers to the generated node types in this // enumeration. For instance, you may end up with the following (which would // be generated by Bison): union node_type { program::Program *prog; program::Assignment *asgn; value::Literal *lit; value::Reference *ref; // ... }; MARKER // To create a new node, instead of using ``tree::base::make``, // you use the ``new`` keyword directly. node_type n1, n2; n1.lit = new value::Literal(2); n2.ref = new value::Reference("x"); MARKER // The generated constructors won't help you much for non-leaf nodes, // because they expect edge wrappers (One, Maybe, etc.) around the // parameters, so you'll typically want to use the constructor without // arguments. Instead, to assign child nodes, you can use // ``Maybe::set_raw()``, ``One::set_raw()``, ``Any::add_raw()``, // or ``Many::add_raw()``. These functions take ownership of a new-allocated // raw pointer to a node - exactly what we need here. Note that because they // take ownership, you don't have to (and shouldn't!) delete manually. node_type n3; n3.asgn = new program::Assignment(); n3.asgn->rhs.set_raw(n1.lit); n3.asgn->lhs.set_raw(n2.ref); node_type n4; n4.prog = new program::Program(); n4.prog->statements.add_raw(n3.asgn); MARKER // Bison will ultimately pass you the final parsed node, which you then have // to put into a ``One`` edge yourself: auto tree = tree::base::One<program::Program>(); tree.set_raw(n4.prog); tree->dump(); MARKER // Note that when you're using Bison, the enum and its types is largely // hidden from you. In the actions the above would look more like this: // // .. code-block:: none // // // Action for a literal: // { $$ = new value::Literal(std::stoi($1)); } // // // Action for a reference: // { $$ = new value::Reference($1); /* and deallocate $1, assuming it's a char* */ } // // // Action for a assignment: // { $$ = new value::Assignment(); $$->lhs.set_raw($1); $$->rhs.set_raw($3); } // // and so on. // // Syntax error recovery // --------------------- // // Some parser generators (like Bison) allow you to specify recovery rules // in case of a syntax error, so the parser doesn't just die immediately. // This is necessary to emit more than a single error message at a time. // To help you store recovery information in the tree for post-mortem // analysis, tree-gen has a special ``error`` directive that you can place // inside a node class. For example, if you make a recovery rule that // assumes any semicolon encountered after a syntax error produces a broken // statement, you can make a node like this: // // .. code-block:: none // // # An erroneous statement. // erroneous_statement { // error; // } // // Such nodes behave exactly like any other node, with one exception: they // always indicate that they're not well-formed. auto err_stmt = tree::base::make<program::ErroneousStatement>(); ASSERT_RAISES(tree::base::NotWellFormed, err_stmt.check_well_formed()); MARKER; // Line number information (and annotations in general) // ---------------------------------------------------- // // When you're doing any kind of serious parsing, you'll want to store as // much line number and other contextual information as possible, to make // your error messages as clear as possible. You could of course add this // information to every node in the tree specification file, perhaps using // inheritance to prevent excessive repetition, but this gets annoying fast, // especially when it comes to tree-gen's well-formedness rules. // // Instead, you should be using annotations for this. Annotations are // objects added to nodes without you ever having to declare that you're // going to add them. If that sounds like black magic to you in the context // of C++, well, that's because it is: internally annotations are stored // as a map from ``std::type_index`` to a C++11 backport of ``std::any``... // it's complicated. But you don't have to worry about it. What matters is // that each and every node generated by tree-gen (or anything else that // inherits from ``tree::annotatable::Annotatable``) can have zero or one // annotation of every C++ type attached to it. // // The use of annotations is not limited to metadata (although that is its // primary purpose); they also allow bits of code to temporarily attach // their own data to tree nodes, without the tree definitions needing to be // updated to reflect this. This is especially useful when you're using // tree-gen in a library, and you want to let users operate on your trees. // After all, they wouldn't be able to modify the tree definition file // without forking and recompiling your library. // // tree-gen has only one special case for annotations, intended for adding // line number information to debug dumps. The ``location`` directive is // used for this: // // .. code-block:: none // // // Source location annotation type. The generated dumper will see if this // // annotation exists for a node, and if so, write it to the debug dump using // // the << stream operator. // location primitives::SourceLocation // // Without going into too much detail about the Annotatable interface (just // look at the API docs for that), here's an example of how it would work. n1.lit->set_annotation(primitives::SourceLocation("test", 1, 5)); n2.ref->set_annotation(primitives::SourceLocation("test", 1, 1)); n3.asgn->set_annotation(primitives::SourceLocation("test", 1, 1)); n4.prog->set_annotation(primitives::SourceLocation("test", 1, 1)); tree->dump(); MARKER // Note that we can still use these pointers despite ownership having been // passed to the node objects because they refer to the same piece of // memory. In practice, though, you would do this during the parser actions, // just after constructing them. // // When serializing and deserializing, annotations are ignored by default; // they can be of any type whatsoever, and C++ can't dynamically introspect // which types can be (de)serialized and which can't be, after all. So even // though the example SourceLocation object extends // ``tree::annotatable::Serializable``, this doesn't work automagically. auto node = tree::base::make<value::Literal>(2); node->set_annotation(primitives::SourceLocation("test", 1, 1)); std::string cbor = tree::base::serialize(node); auto node2 = tree::base::deserialize<value::Literal>(cbor); ASSERT(!node2->has_annotation<primitives::SourceLocation>()); // However, you *can* register annotations types for serialization and // deserialization if you want to through the // ``tree::annotatable::serdes_registry`` singleton. After that, it will // work. tree::annotatable::serdes_registry.add<primitives::SourceLocation>("loc"); cbor = tree::base::serialize(node); node2 = tree::base::deserialize<value::Literal>(cbor); ASSERT(node2->has_annotation<primitives::SourceLocation>()); MARKER // Two ``add()`` methods are provided. The one used here assumes that the // type has an appropriate ``serialize()`` method and an appropriate // constructor for deserialization, the other allows you to specify them // manually using ``std::function``s. The name is optional but strongly // recommended; if not used, a C++-compiler-specific identifier will be used // for the type. // // Similar to links, annotations aren't copied as you might expect by // ``copy()`` and ``clone()``. Specifically, annotations are stored as // ``std::shared_ptr``s to unknown C++ objects, and therefore copying a // node only copies the references to the annotations. To fully clone // annotations, you'll either have to serialize and deserialize the node // they belong to (after registering with ``serdes_registry`` of course), or // clone them manually. // Visitor pattern // =============== // // In the directory example, we avoided the difficulty of dealing with edges // to incomplete types, such as a ``One<value::Rvalue>``, aside from a // single ``as_file()`` call halfway through. Spamming such typecasts to, // for instance, evaluate an rvalue expression during constant propagation // or interpreting, is not very scalable. It can also lead to headache down // the line if you ever have to add more subclasses, as it's easy to forget // or not bother to check for unknown types when initially writing that code // that way. tree-gen also generates the necessary classes for the visitor // pattern for you to avoid this. // // In the visitor pattern, you define a class that implements the // appropriate visitor interface, and (in this case) the nodes will call // methods on this class depending on their type when you pass the visitor // object to them through their ``visit()`` method. Because the visitor // interface classes are also generated by tree-gen and have sane default // behavior, your code will "by default" be future-proof, in the sense that // you'll get an error if something unexpected happens, rather than it // failing silently and maybe crashing down the line. // // Two visitor base classes are provided, ``Visitor<T=void>`` and // ``RecursiveVisitor``. In either case, you have to override // ``visit_node()``, which is called when a node is encountered for which // no specialization is available; this should just throw an appropriate // exception. You can then optionally override any node in the node class // hierarchy to provide the actual behavior. // // The two classes differ in the default behavior when an unspecialized // node is encountered: ``Visitor`` always defers up the class hierarchy, // while ``RecursiveVisitor`` provides a default implementation for node // types with edges, where it simply recursively visits the child nodes // depth-first. // // The T type in ``Visitor<T>`` refers to the return type for the visit // methods. It's fixed to ``void`` for ``RecursiveVisitor``, because if a // node has multiple children, there is nothing sensible to return. // // Let's look at what an expression evaluator looks like for this. class RvalueEvaluator : public value::Visitor<int> { public: int visit_node(value::Node &node) override { throw std::runtime_error("unknown node type"); } int visit_literal(value::Literal &node) override { return node.value; } int visit_negate(value::Negate &node) override { return -node.oper->visit(*this); } int visit_add(value::Add &node) override { return node.lhs->visit(*this) + node.rhs->visit(*this); } int visit_sub(value::Sub &node) override { return node.lhs->visit(*this) - node.rhs->visit(*this); } int visit_mul(value::Mul &node) override { return node.lhs->visit(*this) * node.rhs->visit(*this); } int visit_div(value::Div &node) override { return node.lhs->visit(*this) / node.rhs->visit(*this); } int visit_reference(value::Reference &node) override { return node.target->value; } }; // 2 + (3 * 4) = 14 auto expr = tree::base::make<value::Add>( tree::base::make<value::Literal>(2), tree::base::make<value::Mul>( tree::base::make<value::Literal>(3), tree::base::make<value::Literal>(4) ) ); RvalueEvaluator eval{}; ASSERT(expr->visit(eval) == 14); MARKER // If we wouldn't have implemented ``visit_node()``, the // ``RvalueEvaluator eval{};`` line would break, as RvalueEvaluator would // be an abstract class because of it. And while the above may look pretty // complete, it is in fact not: ASSERT_RAISES(std::runtime_error, value::ErroneousValue().visit(eval)); MARKER // The visitor pattern is more powerful than recursively calling functions // for other reasons as well, because they can be specialized through // inheritance, state can be maintained in the class as the tree is // traversed (the ``dump()`` method is actually implemented this way // internally), it can be written inline within a function despite the // recursive nature, and so on. But it's a lot more verbose than the // ``as_*()`` alternative for simple cases. The choice between visitor // pattern and casts is therefore usually a matter of personal preference. MARKER return 0; } |