Getting started with treedata#

treedata.TreeData is a lightweight wrapper around anndata.AnnData which adds two additional attributes, obst and vart, to store trees for observations and variables. This tutorial will walk you through the basic usage of TreeData, including how to create, subset, and manipulate TreeData objects. If you want to learn more about how TreeData used to analyze single-cell lineage tracing data check out the Pycea getting started tutorial.

import networkx as nx
import numpy as np
import pandas as pd
import scanpy as sc
import treedata as td
import matplotlib.pyplot as plt
import random
def plot_tree(tree, color_attr=None):
    """Helper function for plotting trees."""
    plt.figure(figsize=(6, 3))
    node_colors = "lightgrey" if color_attr is None else [tree.nodes[node].get(color_attr) for node in tree.nodes()]
    pos = nx.drawing.nx_agraph.graphviz_layout(tree, prog="dot")
    nx.draw(tree, pos, with_labels=False, node_size=100, node_color=node_colors)
    leaf_nodes = [node for node in tree.nodes() if tree.out_degree(node) == 0]
    for node, (x, y) in pos.items():
        if node in leaf_nodes:
            plt.text(x, y + 10, s=node, rotation=90, fontsize=6, ha="center", va="top")
        else:
            plt.text(x, y, s=node, fontsize=6, ha="center", va="center")
    plt.show()

Initializing TreeData#

A TreeData object is initialized just like an AnnData object. Let’s create a TreeData object with 32 cells and 100 genes.

counts = pd.DataFrame(
    np.random.normal(size=(32, 100)),
    index=[f"Cell_{i:d}" for i in range(32)],
    columns=[f"Gene_{i:d}" for i in range(100)],
)
tdata = td.TreeData(counts)
tdata
TreeData object with n_obs × n_vars = 32 × 100

Now, suppose we have an nx.DiGraph tree representing relationships between cells, perhaps from a lineage tracing experiment

tree = nx.balanced_tree(r=2, h=5, create_using=nx.DiGraph)
leaves = [i for i in tree.nodes if tree.out_degree(i) == 0]
tree = nx.relabel_nodes(tree, {j: f"Cell_{i}" for i, j in enumerate(leaves)})
nx.set_node_attributes(tree, "lightblue", "color")
plot_tree(tree, "color")
../_images/20144a421b423503da453cec89b3abd6a10ec7d769bc7c456b1308be4fd1dc0a.png

We can store this tree in the obst attribute of the TreeData object

tdata.obst["lineage"] = tree
tdata
TreeData object with n_obs × n_vars = 32 × 100
    obs: 'tree'
    obst: 'lineage'

The obs dataframe now contains a column with tree membership

tdata.obs.head()
tree
Cell_0 lineage
Cell_1 lineage
Cell_2 lineage
Cell_3 lineage
Cell_4 lineage

A TreeData object can also be initialized with only the tree

td.TreeData(obst={"lineage": tree})
TreeData object with n_obs × n_vars = 32 × 0
    obs: 'tree'
    obst: 'lineage'

Or with only the tree and associated observation data

obs = pd.DataFrame({"cluster": np.random.randint(0, 5, size=32)}, index=[f"Cell_{i:d}" for i in range(32)])
td.TreeData(obst={"lineage": tree}, obs=obs)
TreeData object with n_obs × n_vars = 32 × 0
    obs: 'cluster', 'tree'
    obst: 'lineage'

Subsetting TreeData#

Similar to AnnData, we can subset TreeData along the obs and var axes, which provides a view of the TreeData object.

tdata_subset = tdata[["Cell_0", "Cell_1", "Cell_2", "Cell_10"], ["Gene_5", "Gene_19"]]
tdata_subset
View of TreeData object with n_obs × n_vars = 4 × 2
    obs: 'tree'
    obst: 'lineage'

Just like obs and obsm annotations, the trees stored in obst are subset along with the data. Specifically, a subtree is created with the selected leaves and their ancestors.

plot_tree(tdata_subset.obst["lineage"], "color")
../_images/1ef08fdf38ff4c8629896bb4c2153daeacb6e8079406fc460c0a52e0645193e1.png

Since the subtree is a view of the original tree, changes to node attributes in the subtree will be reflected in the original tree.

nx.set_node_attributes(tdata_subset.obst["lineage"], "lightgreen", "color")
plot_tree(tdata.obst["lineage"], "color")
../_images/8db50cdd05e98076476181fd5264028a8ea2771b09a66f87b255ebf904ef538e.png

Views and copies#

TreeData views such as tdata_subset store a reference to the original data

tdata_subset = tdata[:20, :]
tdata_subset
View of TreeData object with n_obs × n_vars = 20 × 100
    obs: 'tree'
    obst: 'lineage'

If we want a TreeData object with its own copy of the data, we can use the copy method.

tdata_subset.copy()
TreeData object with n_obs × n_vars = 20 × 100
    obs: 'tree'
    obst: 'lineage'

If we add new data to tdata_subset, it can no longer be a reference to tdata, so it auto-copies generating a data-storing object.

tdata_subset.obs["foo"] = "bar"
tdata_subset
TreeData object with n_obs × n_vars = 20 × 100
    obs: 'tree', 'foo'
    obst: 'lineage'

The toplogy of trees stored in obst and vart is frozen so you cannot add/remove nodes or edges. However, you can change node attributes.

try:
    tdata.obst["lineage"].remove_node("Cell_0")
except nx.NetworkXError as e:
    print(e)
tdata.obst["lineage"].nodes["Cell_1"]["color"] = "plum"
Frozen graph can't be modified

If you need to change the topology of the tree, you can create a copy, modify the tree, and then assign it back to the obst attribute.

tree = tdata.obst["lineage"].copy()
tree.remove_node("Cell_0")
tdata.obst["lineage"] = tree
plot_tree(tdata.obst["lineage"], "color")
../_images/de1b4e5f8b0844e063674b7301cefb0663bc476b40276b8ebd63af091f078459.png

Multiple trees#

TreeData can store multiple observation and variable trees in the obst and vart attributes. As an example, we’ll create two variable trees which could represent a hierarchical clustering of genes.

tree = nx.balanced_tree(r=2, h=5, create_using=nx.DiGraph)
genes_1 = random.sample(list(tdata.var_names), 32)
genes_2 = random.sample(list(tdata.var_names), 32)
tree_1 = nx.relabel_nodes(tree.copy(), {i + 31: j for i, j in enumerate(genes_1)})
tree_2 = nx.relabel_nodes(tree.copy(), {i + 31: j for i, j in enumerate(genes_2)})
plot_tree(tree_1)
../_images/bb7b0dd6987356728b7ce49eab236ae163986c553bcd0af62e48bd44c4329036.png

We’ll store these trees in vart as "clustering_1" and "clustering_2"

tdata.vart["clustering_1"] = tree_1
tdata.vart["clustering_2"] = tree_2

The tree column in the var dataframe now contains a list of tree memberships with some genes belonging to both trees.

tdata.var["tree"].value_counts()
tree
clustering_2                 22
clustering_1                 22
clustering_1,clustering_2    10
Name: count, dtype: int64

To dissallow overlapping trees, you can set the allow_overlap parameter to False.

del tdata.vart["clustering_2"]
tdata.allow_overlap = False
try:
    tdata.vart["clustering_2"] = tree_2
except ValueError as e:
    print(e)
Leaf names overlap with leaf names of other trees. Set `allow_overlap=True` to allow this.

Alignment#

By default, obs_names and var_names are aligned to the leaves of trees stored in obst and vart. However, TreeData supports multiple alignment types which can be declared using the alignment parameter. See the Alignment tutorial for more details.

For example, we can create a TreeData object with alignment='nodes' where all nodes in the tree are observed instead of just the leaves.

tree = nx.balanced_tree(r=2, h=5, create_using=nx.DiGraph)
tree = nx.relabel_nodes(tree, {i: str(i) for i in tree.nodes()})
nodes_tdata = td.TreeData(obs=pd.DataFrame(index=tree.nodes()), obst={"lineage": tree}, alignment="nodes")
nodes_tdata
TreeData object with n_obs × n_vars = 63 × 0
    obs: 'tree'
    obst: 'lineage'

Scverse integration#

Since the TreeDataobject has the same interface as the AnnData object, it can be used anywhere AnnData is used, including with scverse packages like scanpy and squidpy.

sc.tl.pca(tdata, svd_solver="arpack")
sc.pl.pca(tdata, size=100, na_color="tab:blue")
../_images/f78a7cbf6e0148228ef7d62771481a5c9f13bfe327cbade54290660a9b27e2f0.png

And if you ever need to convert a TreeData object to an AnnData object, you can use the to_adata method.

adata = tdata.to_adata()
adata
AnnData object with n_obs × n_vars = 32 × 100
    obs: 'tree'
    var: 'tree'
    uns: 'pca'
    obsm: 'X_pca'
    varm: 'PCs'