Jan 7, 2026

Choosing an Efficient Representation for Hierarchical Data Structures

Hierarchies appear in many domains, from DOM trees to file systems and scene graphs. Two common C++ patterns to model such structures are a vector of child pointers and an intrusive first‑child/next‑sibling linkage. Each has distinct performance and maintenance trade‑offs that inform the best choice for a given workload.

Hierarchical data is ubiquitous in software systems – think HTML DOM trees, operating‑system file directories, or the node graph of a 3D renderer. A convenient way to model these relationships is to give every node a reference to its children. While the idea is simple, the implementation strategy can dramatically impact memory usage, cache behaviour, and code complexity.

## Classic Array‑of‑Pointers Approach

The most straightforward representation is an array or vector that stores pointers (or smart pointers) to each child node:

```
struct node {
    std::vector<node*> children;
};
```

In production code you might own the children via `std::unique_ptr` or store them inline if the tree is very small, but the pattern itself remains the same.

### Advantages

* **O(1) indexed access** – retrieving the *n*th child is as simple as `children[n]`.
* **Cache locality for sequential traversal** – the vector's pointer array is contiguous, so iterating over it benefits from spatial locality.
* **Easy to extend** – inserting or removing a child is a standard container operation.

### Drawbacks

* **Dynamic allocation per child** – every child added means an allocation for the node itself, plus occasional reallocations of the vector's storage, so the number of allocations is hard to control.
* **Allocation overhead** – the vector's growth policy must decide how much space to reserve. A generous policy may allocate far more than necessary, while a conservative one triggers frequent re‑allocations.
* **Pointer indirection during traversal** – even though the vector itself is contiguous, accessing a child still requires a pointer dereference, and each node lives in its own allocation.
* **Extra indirection for sibling navigation** – iterating over the children still means chasing the child pointers, which can lead to cache misses.

#### Mitigations

If the maximum number of children is known and small, a `std::array` can eliminate the dynamic allocation. For variable‑size but usually tiny child lists, a small‑buffer‑optimized vector (e.g. `boost::container::small_vector`) keeps the first few children on the stack.

## Allocation‑Free Intrusive Representation

A classic alternative replaces the vector with a pair of raw pointers that describe a **first child / next sibling** relationship:

```
struct node {
    node* first_child;
    node* next_sibling;
};
```

Here the tree becomes a set of **intrusive** linked lists: each node points to its first descendant and to its next neighbour in the parent's child list. No per‑child allocation is required beyond the node itself.

### Advantages

* **Zero additional allocations** – the only memory required is the node object, making the system leaner and easier to pool or reuse.
* **Cache‑friendly traversal** – the links live inside the node, so visiting a child touches no separate child array; if nodes are pooled, a whole traversal can stay within a single block of memory.
* **Intrusive benefits** – because the linkage lives inside the node, you can embed the structure in classes that already allocate themselves, avoiding double allocation overhead.

### Drawbacks

* **Linear index lookup** – accessing the *k*th child now requires *k* pointer hops (see the sketch below).
* **More complex child manipulation** – inserting or deleting a node involves adjusting two pointers, which is error‑prone without careful helper functions.
* **No direct sibling count** – you must walk the list to determine the number of children.
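Before comparing the two layouts, here is a minimal sketch of the kind of helper functions the intrusive layout tends to need. It is only illustrative – the names `count_children`, `child_at`, and `prepend_child` are not from any particular library – but it shows concretely that counting and indexed lookup are linear walks, while insertion at the front touches just two pointers.

```
#include <cstddef>

struct node {
    node* first_child  = nullptr;   // head of this node's child list
    node* next_sibling = nullptr;   // next entry in the parent's child list
};

// No stored count: walking the sibling list is the only way to know
// how many children there are.
std::size_t count_children(const node* parent) {
    std::size_t n = 0;
    for (const node* c = parent->first_child; c != nullptr; c = c->next_sibling)
        ++n;
    return n;
}

// k pointer hops per lookup, unlike children[k] on a vector.
// Returns nullptr if the node has fewer than k + 1 children.
node* child_at(node* parent, std::size_t k) {
    node* c = parent->first_child;
    while (c != nullptr && k-- > 0)
        c = c->next_sibling;
    return c;
}

// O(1) insertion at the front of the child list: exactly two pointer writes.
void prepend_child(node* parent, node* child) {
    child->next_sibling = parent->first_child;
    parent->first_child = child;
}
```

In real code these helpers would sit next to whatever payload the node carries; the point is that the bookkeeping is simple, but every query beyond "first child" or "next sibling" is a walk.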
## Choosing the Right Representation

| Criterion | Vector‑of‑Pointers | First‑Child/Next‑Sibling |
|-----------|--------------------|--------------------------|
| Indexed access | **O(1)** | Linear |
| Memory overhead | High (per‑child allocation) | Low (intrusive) |
| Cache locality | Good (contiguous pointer vector) | Very good (links embedded in the node) |
| Simplicity of manipulation | High | Medium |
| Use case | Frequent index lookups | Heavy traversal, memory‑tight environments, intrusive containers |

A 3‑D engine that constantly walks through an entity hierarchy tends to favour the intrusive layout, because the bulk of CPU cycles are spent stepping through many nodes rather than on random index access. Conversely, a DOM library where scripts frequently query children by index or need fast sibling jumps may lean toward a vector‑based model.

### Practical Tips

1. **Profile your workload** – measure which operation dominates (traversal vs. random access).
2. **Consider allocator behaviour** – if you can supply a custom allocator that batches node allocations, the vector approach becomes less problematic.
3. **Use helper abstractions** – write functions like `children(node*)` or `child_at(node*, size_t)` to encapsulate the pointer chasing and keep the code readable.
4. **Beware of cache effects** – even a vector can suffer if the node objects themselves are scattered; combining the two strategies (e.g. storing nodes in a single heap block and maintaining a vector of indices) may yield the best of both worlds. A sketch of this hybrid follows the conclusion.

## Conclusion

Representing a hierarchy effectively boils down to a trade‑off between **allocation simplicity** and **data locality** on one side and **indexed access speed** on the other. The classic vector of pointers offers predictable random access at the cost of multiple allocations and pointer dereferences. The first‑child/next‑sibling style eliminates the extra memory management entirely and promotes linear traversal, at the expense of index lookups. By aligning the choice with the dominant access pattern of your application, you can build a system that is both efficient and maintainable.
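### Appendix: Sketch of the Hybrid from Tip 4

One possible reading of tip 4 – nodes packed into a single contiguous block, children addressed by index rather than by pointer – might look like the following. The `tree`, `add_node`, and `add_child` names are hypothetical, and the per‑node index vector still allocates; the point is only that contiguous node storage and O(1) indexed child access are not mutually exclusive.

```
#include <cstddef>
#include <cstdint>
#include <vector>

struct tree {
    struct node {
        std::vector<std::uint32_t> children;  // indices into `nodes`, not pointers
        // ... payload ...
    };

    std::vector<node> nodes;  // all nodes live in one contiguous block

    // Creates a node in the pool and returns its index.
    std::uint32_t add_node() {
        nodes.emplace_back();
        return static_cast<std::uint32_t>(nodes.size() - 1);
    }

    // Records `child` as a child of `parent`; indices stay valid even if
    // the pool reallocates, which raw pointers would not.
    void add_child(std::uint32_t parent, std::uint32_t child) {
        nodes[parent].children.push_back(child);
    }

    // O(1) indexed access, as with the vector-of-pointers layout.
    node& child_at(std::uint32_t parent, std::size_t k) {
        return nodes[nodes[parent].children[k]];
    }
};
```

Whether this pays off depends on the workload – as tip 1 says, profile before committing.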