Overview

tree-gen is a code generator that outputs all the repetitive stuff needed for defining a proper tree structure in C++.

Trees are described by *.tree files. Such a file consists of a number of directives at the top and one or more tree node descriptions. Each tree node gets its own C++ class, with a bunch of methods defined on it for traversal, the visitor pattern, safe typecasting, cloning, and so on. It would be extremely tedious and error prone to define all this stuff manually, hence the generator.

It is important to realize and perhaps not immediately obvious that the recursive structure in a *.tree file represents C++ inheritance rather than the actual tree structure. You may for instance encounter something like this:

expression {
    addition {
        ...
    }
    subtraction {
        ...
    }
}

That doesn’t mean an expression consists of an addition and a subtraction, but rather that an addition is a type of expression, and subtraction is another.

Base classes and primitives

tree-gen trees consist of three kinds of objects: nodes, edges, and primitives.

The nodes are classes generated by tree-gen based on some base class. This is typically tree::base::Base. The class inheritance tree of the node types is used for “union” types, i.e. to allow some node in the tree to be one of a number of types. For example, some node may want to have an expression child node, which in practice would be for instance an addition node or a subtraction node. At any level of this class hierarchy, nodes can be given attributes to store data. These attributes are either edges or primitives.

Edges define the relation between some parent node and some number of child nodes. Six edge classes are defined, resulting in different relations between the nodes.

  • tree::base::Maybe<T>: the parent will own zero or one child node of (super)type T.
  • tree::base::One<T>: the parent will logically own exactly one child node of (super)type T.
  • tree::base::Any<T>: the parent will own zero or more child nodes of (super)type T.
  • tree::base::Many<T>: the parent will logically own one or more child node of (super)type T.
  • tree::base::OptLink<T>: the parent links to zero or one child node of (super)type T elsewhere in the tree.
  • tree::base::Link<T>: the parent logically links to exactly one child node of (super)type T elsewhere in the tree.

The “logically” term used above alludes to the fact that One/Many/Link don’t necessarily need to have at least one child at all times. They may for instance be empty during construction. But at least one child node is required for the tree to be considered to be “complete”. Furthermore, completeness with respect to some root node requires that any Link or non-null OptLink refers to a node that is actually reachable from that root node.

Maybe and One are based on std::shared_ptr, Any and Many are based on std::vector<One>, and OptLink and Link are based on std::weak_ptr, but these types are abstracted away from the user almost entirely. Notably, all exposed dereference operations are null- and range-checked, throwing exceptions if there’s a problem, rather than causing a segmentation fault down the line.

The namespace for the Base and edge classes can be set using the tree_namespace directive in case you want to override them or add functionality. Note that you can use “using One = tree::base::One;” etc. if you only want to override part of the classes. You can also opt to use the tree-all.[ch]pp.inc files, which define the same structures that exist in the tree namespace, but allows you to override some of the standard library types they are based on, for example to use a different base exception type. If the tree_namespace directive is not specified, the namespace that the node classes are generated into will be used by default.

Finally, primitives are used to store the actual data, forming the leaves of the tree. They can generally be any externally-defined C++ type, although a few template functions need to be defined and specialized for the types you want to use. These are:

template <typename T>
T initialize() { ... }

template <typename T>
void serialize(const T &obj, tree::cbor::MapWriter &map) { ... }

template <typename T>
T deserialize(const tree::cbor::MapReader &map) { ... }

The initialize function is used to construct a primitive of type T in such a way that it has a defined value. This is in contrast to the behavior of C’s own primitive types (int, bool, etc.), which are specified to be undefined until assigned. This is avoided using this initialize function to prevent confusion caused by undefined behavior. Typically, initialize() can be implemented using just return T{}; for the general case to defer to the constructor for complex types, and then be specialized for the primitive types that you want to use.

The serialize and deserialize functions are optional. When given, logic is generated to serialize and deserialize entire trees to and from a CBOR representation.

In addition to the type-safe primitives declared in the tree description file, nodes can also be annotated using arbitrary types. Specifically, every node can be annotated with zero or one annotation of every C++ type. So, any struct or class you make can be attached to any node in the tree, regardless of the tree specification. This is useful, for instance, to carry metadata, such as line number or debugging information. When tree-gen is used to generate some user-facing tree structure in a library, users of the library can also use this to add their own data to a tree as they operate on it, to prevent them from having to convert to their own data structure first.

Note that the above requires RTTI to be enabled and supported by the compiler, but there’s really no good reason to disable that nowadays.

Defining node types

Nodes have the following form in the tree file:

# [documentation for node class]
[snake_case_node_name] {

    # [documentation for primitive]
    [snake_case_member_name]: [C++ namespace path];

    # [documentation for edge/child node]
    [snake_case_member_name]: [Maybe|One|Any|Many|OptLink|Link]<[snake_case_node_name]>;

    # [documentation for external edge/child node]
    [snake_case_member_name]: external [Maybe|One|Any|Many|OptLink|Link]<[C++ namespace path]>;

    [zero or more specializations of this node; recursive structure]

    [optional reorder directive]
}

Nodes can have zero or more primitives, zero or more edges/child nodes, zero or more external edges/child nodes, and zero or more specializations.

The difference between normal edges and external edges is that the type for a normal edge must resolve to another node in the same tree file, while an external edge may use any C++ type name for the node class. That allows you to refer to nodes defined in other tree files as well. Note however that the generator expects external nodes to actually be node classes, such that it can properly generate the debug dump indentation logic, among other things.

Note that while node type names are specified in snake_case in the tree file, the name will be converted to TitleCase for the generated class names to follow decent C++ naming conventions. This is just because it’s easier to convert snake case to title case than vice versa, and the generator uses both forms (title case for the class names, snake case for methods and members).

Note also the comments explicitly added to the example. Comments using a # are interpreted as docstrings by the generator; they are copied into the source code using javadoc-style comment blocks. This also means you can’t use #-“comments” everywhere, since they’re actually a grammatical construct. Therefore, tree-gen also supports // comments. These are completely ignored by the parser.

The order in which nodes are defined doesn’t matter. This allows you to make recursive tree structures, without having to worry about forward declarations. The generator handles those. The order in which the attributes of a node are defined does matter, though, as it’s used for the order in which the values can be passed to the node constructor; they are ordered subclass to superclass, top to bottom.

The order of the primitives and child notes (a.k.a. fields) is used in a few places, primarily the constructor and for debug dump output generation. The order is as specified within each class hierarchy level, but the fields of the more-specialized classes come before the fields of their ancestors.

When additions are made to a tree, it may be required to maintain compatibility with older tree versions by maintaining field order. In order to support this, the above field order can be overridden using a reorder directive. This directive has the following syntax:

# [optional documentation (ignored)]
reorder(a, b, c);

where a, b, c is the desired order of the fields. Any fields not explicitly specified in the list will automatically appear at the end, using the default order.

Directives

The following directives exist. They should be placed at the top of the tree file, before the first node. Their order doesn’t matter, unless otherwise specified. Note that most directives are required.

  • source: used to specify documentation for the generated source file; any docstring above the directive is copied into the file as file-level doxygen documentation.
  • header: used to specify documentation for the generated header file; any docstring above the directive is copied into the file as file-level doxygen documentation.
  • python: used to specify documentation for the generated Python file (if enabled via the command line); any docstring above the directive is copied into the Python file’s docstring.
  • tree_namespace <namespace::path>: the namespace that the Base and edge classes live in. You should set this to tree::base unless you intend to override or use different base classes. This is unused in Python.
  • initialize_function <namespace::path::initialize>: the name (including namespace path leading up to it) of the T() function used for getting the default value of any of the used primitive classes. Unused in Python; here, the default constructor is used regardless of type (as it always exists and is meaningful in Python).
  • serdes_functions <namespace::path::serialize> <namespace::path::deserialize>: optionally, the names (including namespace paths leading up to them) of the functions used to respectively serialize and deserialize primitive classes. If not specified, serialization functionality is disabled in C++. Python always has serdes support, but support is augmented when a serdes function pair is specified. The Python generator interprets the namespace path as a module path. Refer to the serialization section for more info.
  • location <namespace_path::SourceLocation>: optionally, the name of the source location annotation class (including namespace path leading up to it). The debug dump will look for an annotation of this type on each node when doing a debug dump, and if it exists, uses it to add source information to the node, by streaming out a # followed by the stream overload for the class. The base class needs to be capable of annotations if you use this. This is not supported in Python.
  • include "<path>": adds an #include statement to the top of the generated C++ header file.
  • src_include "<path>": like include, but adds to the top of the generated C++ source file only.
  • import ..., from ... import ...: adds Python import statements to the top of the generated module.
  • namespace <name>: used to specify the namespace for the generated tree classes. As in C++, you need multiple of these to specify the full path. The docstring for the first annotation of this type that has a docstring in front is used to document the innermost namespace javadoc-style for Doxygen documentation. This is not used in Python.

Generated APIs

The following methods are generated for each node class:

  • a constructor with all members of the node in its signature as optional values, defaulting to the values returned by the initializer function.
  • bool is_complete() const: returns whether the node its called on and any subtree rooted in that node is fully defined (no empty One, Many or Link, all array entries in Any/Many nonempty, and all Links and defined OptLinks reachable from the root node).
  • %NodeType type() const: returns the type of this node, using the also-generated %NodeType enumeration.
  • One<Node> copy() const: returns a shallow copy of this node.
  • One<Node> clone() const: returns a deep copy of this node.
  • A value-based equality operator. Note that this ignores equality of any annotations.
  • void visit(Visitor &visitor) (C++ only): implements the visitor pattern using the also-generated abstract Visitor/RecursiveVisitor classes. See below.
  • void dump(std::ostream &out=std::cout, int indent=0): does a debug dump of the node to the given stream with the given indentation level.
  • SomeNodeType *as_some_node_type() (C++ only): does the equivalent of a dynamic_cast to the given node type, returning this if the type is correct or nullptr if not.

Note that the prototypes for the Python equivalents differ as appropriate.

An implicit node class simply named Node is always generated, serving as the base class for all other nodes. In C++, it is what derives from the Base class defined in the namespace specified using tree_namespace. In Python, the Node class always derives directly from object.

Tree traversal

Tree traversal is accomplished by starting at the root and working your way down. The attributes specified in the tree file appear as public members of the node classes. Maybe, One, OptLink, and Link can be dereferenced simply using the * or -> operators; Any and Many use the [] indexation operator. Note that these are all null- and range-checked, unlike the stdlib equivalents.

Tree traversal is complicated by the fact that you often don’t know exactly what the type of a node is. For example, an expression node may actually be a subtraction or addition. For this, tree-gen supports a few different patterns in C++:

  • Visitor pattern. You define a class inheriting from the generated Visitor or RecursiveVisitor classes. These abstract classes provide or require you to implement functions for each node type. The appropriate one then gets called when you pass an instance of your visitor class to the visit() method on a node. You must always override visit_node(), usually to throw a suitable exception in case an unexpected node is encountered. The default implementation for all the other node types differs between the two classes. For Visitor, the default implementation falls back to the next more generically typed function (visit_subtraction() falls back to visit_expression(), visit_expression() falls back to visit_node()). For RecursiveVisitor, the default implementation for non-leaf node types recursively calls visit() for all child nodes, thus recursively traversing the tree. For leaf nodes, both visitor classes have the same behavior.

  • Using the `as_()` methods.* Given for instance an expression node, you might do

    if (auto addition = expression.as_addition()) {
      ...
    } else if (auto subtraction = expression.as_subtraction()) {
      ...
    } else {
      ...
    }
    

    Because a nullptr evaluates to false in C++, the blocks will only be executed if the cast succeeds. Be careful copypasting though; you can accidentally use the addition variable in the subtraction block if you’d want, but it’s obviously null in that case. C++ scoping is weird.

  • Using a switch statement. You might do

    switch (expression.type()) {
      case NodeType::Addition:
        ...
      case NodeType::Subtraction:
        ...
    }
    

    This doesn’t handle the cast for you, but in cases where you only need to switch based on the type and don’t need access to members of the nodes this is more descriptive than the if/else form.

Just choose the method that makes the most sense within context. Python is so much more dynamic in general that you’re better off using its duck typing functionality or using isinstance() directly, so no special features are generated for this.

Note that tree-gen trees do not contain allow traversal back toward the root of a tree. Supporting this would greatly complicate the internals and the user-facing API, because then you wouldn’t be able to just move nodes around. It is therefore key that you design your trees and interfaces such that this information is not needed. If this is somehow impossible, you’ll have to manage the links back up the tree manually using (Opt)Link edges.

Serialization and deserialization

Optionally, logic to serialize and deserialize trees can be generated in addition to the APIs above. To enable this, you have to tell tree-gen how to serialize and deserialize the primitive types that you use in the tree through two templated functions that must be specialized for all primitive types (similar and in addition to the initialization function):

template <typename T>
void serialize(const T &obj, tree::cbor::MapWriter &map) { ... }

template <typename T>
T deserialize(const tree::cbor::MapReader &map) { ... }

The former must serialize obj by calling the various append_*() functions on the given tree::cbor::MapWriter object. The latter must perform the reverse operation. The namespace(s) and names of these functions must be provided to tree-gen using the optional serdes_functions directive. Specifying this directive enables the serialization and deserialization logic.

Once enabled, the entry point for serializing a tree is tree::base::serialize() or tree::base::serialize_file(), and deserializing is tree::base::deserialize() or tree::base::deserialize_file(). The internal serialization and deserialization functions in the edge classes and generated node classes are public and could also be used, but these require more boilerplate due to link handling.

If you attempt to use these methods for a tree without serialization/deserialization support enabled, you’ll receive a template error. Template errors are notoriously difficult to understand, so you have been warned. Don’t use them unless you enable support through the serdes_functions directive. Attempting to serialize a tree that is not well-formed will lead to a tree::base::NotWellFormed exception.

Serialization format

The serialization format makes use of the RFC7049 CBOR data representation, serving as a compromise between the readability of JSON and speed/simplicity of the serialization and deserialization logic. Specifically, CBOR can be losslessly converted to and from JSON using third-party tools if need be for debugging (at least for as far as it’s used here), but is itself a simple binary format.

Each edge and each primitive receives its own map in the tree. Note that Any and Many are internally represented as a vector of One edges; therefore, Any/Many results in two nested edge objects. The data for nodes and their annotations is stored along with the One/Maybe edges. The keys for the maps are the following:

empty Maybe:

    {
        "@T": "?",
        "@t": null
    }

filled Maybe:

    {
        "@T": "?",
        "@i": <sequence number>,
        "@t": "<TitleCase node type name>",
        "<snake_case_attribute_name>": { <attribute data> },
        ...
        "{<annotation type>}": { <annotation data> },
        ...
    }

One:

    {
        "@T": "1",
        "@i": <sequence number>,
        "@t": "<TitleCase node type name>",
        "<snake_case_attribute_name>": { <attribute data> },
        ...
        "{<annotation type>}": { <annotation data> },
        ...
    }

Any:

    {
        "@T": "*",
        "@d": [
            <`One` object for each item>
        ]
    }

Many:

    {
        "@T": "+",
        "@d": [
            <`One` object for each item>
        ]
    }

empty OptLink:

    {
        "@T": "@",
        "@l": null
    }

filled OptLink:

    {
        "@T": "@",
        "@l": <sequence number of linked node>
    }

Link:

    {
        "@T": "$",
        "@l": <sequence number of linked node>
    }

The @T entry stores which type of edge an object is. This is not strictly needed when deserializing as this information should be known by context; instead, it is used as a validity check. The @i sequence numbers are unique integers for all One/Maybe edges in the tree, which are used to recover the (Opt)Link pointers through their @l key. The @t key is used to recover subtype information, as the tree only knows the abstract supertype of a node in general. These are, however, not necessarily globally unique; they only need to distinguish between subclasses, and therefore just the TitleCase name of the node type without C++ namespace is sufficient.

Keys that start with a { and close with a } are used for annotations. The string enclosed within the {} in the key is used to store the annotation type. The identifier used can either be generated automatically by C++ using typeid(T).name(), or can be specified manually to have more control. Any annotation type that doesn’t have a serialization and deserialization function registered for it within tree::annotatable::serdes_registry is silently ignored in either direction.

All remaining keys map to the node attributes using their snake_case name. The corresponding value is (recursively) one of the structures above for edges, or a map containing user-specified key/value pairs for primitives, serialized and deserialized using the functions specified by the serdes_functions directive.

The names of the keys are chosen such that, when ordered by ASCII value, the order is as specified, and such that there can never be name conflicts. The commonly used keys and values are short to minimize serialization and deserialization overhead.

Python support

In addition to C++, tree-gen can also generate pure-Python objects to represent the tree. As in C++, each node type becomes a class, and class hierarchy is used similarly. The edge classes don’t exist, however; instead, their functionality is baked into the generated code to prevent an unnecessary user-facing level of indirection during tree traversal that cannot be hidden in Python. Despite this, the generated structures provide a similar level of type safety as the C++ structures do; all writes to fields of node classes are type-checked.

Where applicable, the generated Python file assumes that the Python module tree corresponds with the namespace hierarchy in the C++ world. This applies to the following things:

  • primitive type declarations;
  • external node types;
  • (de)serialization functions.

The initialization function isn’t used in Python; rather, the default constructor of the primitive types is used.

The connection between the C++ and Python world is handled through CBOR serialization and deserialization rather than providing Python wrappers around the C++ objects. While this approach isn’t as performant as the alternative, it allows a much simpler and more Pythonic interface to be exposed to the user. Serdes is therefore perhaps the most important feature of the Python generator. Note however that it is entirely up to the program or library using tree-gen to provide the requisite interface between the C++ and Python worlds (for example through swig); tree-gen just ensures that the Python and C++ tree implementations always remain in sync.

The serialization and deserialization functions naturally have a different signature in Python than they do in C++. More specifically, the functions must look like this:

def serialize(typ, val):
    # typ is either a type (for primitives) or a str (for annotations). In
    # the former case, isinstance(val, typ) may be assumed. Return value must
    # be an object consisting of only the following types:
    #  - `dict`s with `str` keys;
    #  - `list`s;
    #  - `int`s between -2^63 and 2^63-1 inclusive;
    #  - `float`s;
    #  - `True`, `False`, or `None`;
    #  - `str`s;
    #  - `bytes` objects.
    # The toplevel object must be a dict.
    pass # TODO

def deserialize(typ, val):
    # typ is either a type (for primitives) or a str (for annotations). val
    # is a structure representing the CBOR serialization using the same
    # Python primitives that can be returned by serialize(). The return value
    # must be an instance of the specified `typ`e for primitives, but can be
    # anything for annotations; for unknown annotation types, just returning
    # val as-is is the recommended fallback behavior.
    pass # TODO

Primitive serialization and deserialization can also be done using class methods:

class SomePrimitive:
    def serialize_cbor(self):
        # Called instead of serialize(SomePrimitive, self) if it exists.
        pass # TODO

    @staticmethod
    def deserialize_cbor(cbor):
        # Called instead of deserialize(SomePrimitive, cbor) if it exists.
        pass # TODO

If no serdes_function directive is specified, serdes logic is still generated on the Python end; in this case, only the class method variant is used, and annotations must be directly serializable to CBOR for them to be serialized (they will be silently ignored if they’re not).