Quetzal-CoaTL
The Coalescence Template Library
Loading...
Searching...
No Matches
Parsing a Newick string

Introduction


Background

Newick tree format is a way of representing graph-theoretical trees with edge lengths using parentheses and commas.

Yet not very efficient, Newick is a simple and popular format for representing trees, and provides a rather useful abstraction in coalescence theory for representing the shared history of populations (species trees, population trees) or gene genealogies that coalesce into these trees.

Quetzal provides some utilities to parse, generate and manipulate such format.


Grammar

Whitespace here refer to any of the following: spaces, tabs, carriage returns, and linefeeds.

  • Whitespace within number is prohibited.
  • Whitespace outside of node names is ignored.
  • Grammar characters (semicolon, parentheses, comma, and colon) are prohibited
  • Comments are enclosed in square brackets.

Objectives

In this tutorial section you will learn how to:

  • Parse a Newick string to a Quetzal k-ary tree class,
  • Customize the default Quetzal tree properties to describe your data and problem correctly,
  • Parse a Newick string to your own legacy tree class

To a Quetzal k-ary tree


Default properties

In this example we will use the default (and simplistic) properties of the graph synthetized by the parsing. By properties we mean the type the data structures used to store arbitrary information along the vertices and edges of a graph.

For general graphs it could be anything, but since we are parsing Newick formats, we can reasonably expect ad minima:

  • a std::string to describe a vertex (the name of the node)
  • a double to describe an edge (the distance to a parent node).

Quetzal defaults to these minima when it tries to build a tree from a Newick string.

Remarks
In the next sections we will see that you can actually parse into more complex data structures - as long as vertices (resp. edges) remain constructible from a simple std::string (resp. double)!

Input

1#include "quetzal/quetzal.hpp"
2#include <cassert>
3
4// Define a namespace alias to shorten notation
6
7int main()
8{
9 std::string s1 = "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;";
10 std::string s2 = "(,,(,));";
11
12 // For default graph properties, leave <> empty
13 auto [tree1, root1] = newick::to_k_ary_tree<>(s1);
14 auto [tree2, root2] = newick::to_k_ary_tree<>(s2);
15
16 std::cout << "These graphs are " << (tree1.is_isomorphic(tree2) ? "toootally" : "not") << " isomorphic!"
17 << std::endl;
18
19 return 0;
20}
Generic components for parsing or generating Newick strings.
Definition io.hpp:28

Output

13
22
33
42
5These graphs are toootally isomorphic!

Custom properties

In the previous example we used default types to represent the properties of a vertex and an edge. These defaults are useful, but not always sufficient. For example, you may want to visit your tree graph and trigger specific events when specific nodes or edges are visited.

In this case, you would need to define your own small structures to represent the graph properties you want:

  • the vertex should be constructible from a std::string
  • the edge should be constructible from a double
  • the event would be to print to std::cout with a pretty display
Remarks
You can think of pretty much any type of events here:
  • convert the branch length to coalescence units, generations, etc
  • update a mutational state data member,
  • construct a slightly modified copy of the tree while visiting its clone
  • etc ...

Input

1#include "quetzal/quetzal.hpp"
2#include <iostream>
3
4// Your custom little vertex class
5struct vertex_info
6{
7 std::string name;
8};
9
10// Your custom little edge class
11struct edge_info
12{
13 double length;
14};
15
16// Make your custom vertex streamable for the visitor
17static inline auto &operator<<(std::ostream &os, vertex_info const &v)
18{
19 return os << quoted(v.name);
20}
21
22// we want to trigger a custom event when a vertex is discovered by the DFS
23template <class Graph> struct MyVisitor
24{
25 Graph &_tree;
26 MyVisitor(Graph &tree) : _tree(tree)
27 {
28 }
29 void operator()(auto stage, auto vertex)
30 {
31 switch (stage)
32 {
33 case boost::visit::pre:
34 // vertex_info g[u] is streamable!
35 std::cout << "Discover vertex " << vertex << " (" << _tree[vertex] << ")\n";
36 break;
37 case boost::visit::in:
38 break;
39 case boost::visit::post:
40 break;
41 default:
42 throw std::invalid_argument("received invalid DFS order");
43 }
44 }
45};
46
47// Define a namespace alias to shorten notation
49
50int main()
51{
52 std::string s = "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;";
53
54 // Inject your custom littles classes here: the parser will generate these
55 // structures and embed them into the graph.
56 auto [tree, root] = newick::to_k_ary_tree<vertex_info, edge_info>(s);
57
58 // depth-first traversal of the vertices in the directed graph
59 MyVisitor visitor(tree);
60 tree.depth_first_search(root, visitor);
61
62 return 0;
63}
Definition newick_parser_2.cpp:24
Definition geography_dispersal_kernel_4.cpp:13
Definition coalescence_binary_tree_2.cpp:5

Output

13
22
3Discover vertex 0 ("F")
4Discover vertex 1 ("A")
5Discover vertex 2 ("B")
6Discover vertex 3 ("E")
7Discover vertex 4 ("C")
8Discover vertex 5 ("D")

Interfacing legacy code

If you have been coding for a while, you may already depend heavily on your own class to describe a tree graph, and not feel very excited about refactoring your whole project to switch to Quetzal trees. Fair enough.

This section shows how to use Quetzal parser to populate your own legacy class and then forget about Quetzal.

  1. You will need to parse Newick strings into an AST (Abstract Syntax Tree): this is a temporary data structure that represents the syntactic structure of the tree.
  2. This AST can then be iterated over to convert it into the data structure of your choice.
Note
  1. Quetzal does not know anything about your own classes, so you will need a bit of extra effort to write a recursion on the AST data
  2. But at least you don't have to worry about the parsing logic anymore!
  3. If the task becomes tedious, don't hesitate to ask for help !
  4. On most compilers but Apple Clang, you could simplify the example code by writing: subtree.left_child = std::make_unique<MyNode>(ast.name, ast.distance_to_parent); in the recursion function.

Input

1#include "quetzal/quetzal.hpp"
2#include <iostream> // std::cout
3#include <memory> // std::unique_ptr
4
5// Assume this binary tree class pervades your code
6struct MyNode
7{
8 std::string id; // the vertex name
9 double length; // the edge length
10 std::unique_ptr<MyNode> left_child; // never use bare pointers in interfaces
11 std::unique_ptr<MyNode> right_child;
12};
13
14// Define a namespace alias to shorten notation
16
17int main()
18{
19 std::string s = "(A:0.1,(C:0.3,D:0.4)B:0.5)E;";
20
21 // We define a shorter type alias
22 using ast_type = newick::ast::node;
23
24 // We initialize the root of Abstract Syntax Tree (AST)
25 ast_type ast;
26
27 // We parse the input string into the AST
28 auto ok = quetzal::parse(begin(s), end(s), newick::parser::tree, ast);
29
30 // We check by printing the AST if the string was successfully parsed
31 std::cout << quoted(s) << "\t " << (ok ? "Parsed" : "Failed");
32 if (ok)
33 std::cout << "\n\nAbstract Tree Syntax:\n\n" << ast << std::endl;
34
35 // Let's use a recursive lambda until C++23 makes this much simpler
36 static const auto populate = [](auto &tree_root, const auto &ast_root) {
37 static auto recurse = [](auto &subtree, const auto &ast, auto &self) mutable -> void {
38 // the idea is simple: traverse the AST ...
39 if (ast.children.size() == 2)
40 {
41 // ... by instantiating the children and propagating to the next subtree
42 subtree.left_child = std::unique_ptr<MyNode>(new MyNode{ast.name, ast.distance_to_parent});
43 self(*subtree.left_child, ast.children.front(), self);
44
45 subtree.right_child = std::unique_ptr<MyNode>(new MyNode{ast.name, ast.distance_to_parent});
46 self(*subtree.right_child, ast.children.back(), self);
47 }
48 };
49 recurse(tree_root, ast_root, recurse);
50 };
51
52 // Initialize your root using the AST data members
53 MyNode my_root{.id = ast.name, .length = ast.distance_to_parent};
54
55 // And then call the recursion to populate the children
56 populate(my_root, ast);
57
58 std::cout << (my_root.left_child == nullptr ? "Failed" : "Populated") << std::endl;
59 std::cout << (my_root.right_child == nullptr ? "Failed" : "Populated") << std::endl;
60
61 return 0;
62}
Definition newick_parser_3.cpp:7

Output

1"(A:0.1,(C:0.3,D:0.4)B:0.5)E;" Parsed
2
3Abstract Tree Syntax:
4
5"E"
6├──0.1──"A"
7└──0.5──"B"
8 ├──0.3──"C"
9 └──0.4──"D"
10
11Populated
12Populated