Turning Dendrograms Into Useful Tree Objects
Dendron Trees (Dendrograms). They’re a great way to visualize the application of hierarchical cluster analysis. I tried to convert a dendron graph to a standard python tree structure, but discovered the method is rather cumbersome. Without the dendron chart parsed into a tree structure it is difficult to use the original dendron or linkage structures to do anything useful with the produced linkage (for example, reccomending neighbouring leaves, or determining the weight of a branch of the dendron tree). I put this article together to help anyone trying to tackle a similar problem in the future.
Hypothetical example: using Ward’s method to hierarchically cluster a group of somethings that are in a distance matrix. Let’s imagine for the sake of this excercise, that we already have a distance matrix, but it’s in triangular form. we’re going to read into our python program with the following:
import pandas as pd from matplotlib import pyplot as plt from scipy.cluster.hierarchy import dendrogram, linkage, ward from scipy.spatial.distance import squareform import numpy as np import treelib import csv distance_matrix = pd.read_csv('/my_dm.csv') dm_condensed = squareform(distance_matrix) labels_for_distance_matrix = ["label1", "label2", "label3"]
Next we’ll apply Ward’s hierarchical custering method, using the Ward method from scipy.cluster.hierarchy. (The source code for this matches the Ward.D. clustering method from R packages if you are familiar with R, and hierarchical clustering in R). The Ward method creates a scipy.cluster.hierarchy “hierarchy/linkage” object. A hierarchy/linkage object is a compressed mapping of leafs and nodes that are nicely reflected in a dendrogam chart - the linkage structure is interesting, but I will leave its exploration as an excercise for the reader.
We’ll also create a dendrogram plot, using the matplotlib library and the linkage object we produced. We’ll save the plot into the
# z is a linkage object z = ward(dm_condensed) plt.figure(figsize=(25,10)) plt.title("Hierarchical Clustering of Widgets (Ward's Method)") plt.xlabel('idx') plt.ylabel('distance') dendrogram( z, leaf_rotation=90., leaf_font_size=8., labels=labels_for_distance_matrix, ) plt.savefig("dendrogram.pdf", bbox_inches='tight') plt.clf()
Once we’ve done the above, we should get a dendrogram that looks something like this:
Now for the fun part, we need to turn this into something that we can actually use in an application! I’ll use the treelib python library to do this.
# checks linkage validity if hierarchy.is_valid_linkage(z) == False: raise Exception("linkage is not valid") from scipy.cluster import hierarchy rootNode, nodelist = hierarchy.to_tree(z, rd=True)
At this point, we now have a hierarchy package tree object (not quite a treelib tree object yet). Unfortunately the hierarchy tree is not very useful and gives us limited methods to traverse the nodes and the leaves of the tree. We’ll convert the tree to a treelib tree by first writing the nodes of the tree to a csv file.
# Print the formatted nodes to a csv outfile = open('temp-vector-dump.csv', mode='w') w = csv.writer(outfile) i = len(nodelist)-1 while i >= 0: node = nodelist[i] if node.get_id() >= len(label_from_dm): left = node.get_left() right = node.get_right() try: leftID = left.get_id() except: leftID = "leaf" try: rightID = right.get_id() except: rightID = "leaf" if leftID < (len(label_from_dm) - 1): leftID = label_from_dm[leftID] if rightID < (len(label_from_dm) - 1): rightID = label_from_dm[rightID] node_id_left = node.get_id() node_id_right = node.get_id() if isinstance(leftID,str): left_data = consumption_vector_dict[leftID] node_id_left = str(node_id_left) + "," + left_data if isinstance(rightID,str): right_data = consumption_vector_dict[rightID] node_id_right = str(node_id_right) + "," + right_data plist1 = ",".join(map(str, [leftID, leftID, node_id_left])) plist2 = ",".join(map(str, [rightID, rightID, node_id_right])) w.writerows([plist1.split(',')]) w.writerows([plist2.split(',')]) i = i - 1 outfile.close() print("outfile created: temp_vector_dump.csv")
Add a little break so we can inspect the outfile and add any new leaves or nodes you would like to add. Then we continue to parse the csv file into a python treelib object.
print('.') print('.') print('.') print('.') print('.') print('.') print('.') print('.') input("Press Enter to continue...") widgetTree = treelib.Tree() try: tree_infile = open('temp-vector-dump.csv', mode='r') except: print("temp-vector-dump.csv not found. This file is required to generate the tree and weight vectors.") reader = csv.reader(tree_infile) first_row = True for rows in reader: if first_row: gtree.create_node(rows, rows) first_row = False if len(rows) == 4: gtree.create_node(rows, rows, rows, data = null) else: gtree.create_node(rows, rows, rows) tree_infile.close() # display the tree in the terminal widgetTree.show() # should produce something like this: Node1 ├── Node2 │ └── Node3 │ ├── WidgetA │ └── WidgetB └── Node4 ├── Node5 │ └── WidgetC └── Widget D
At this point we can traverse the nodes of the tree just like any other treelib tree object. The tree object has quite a few methods available, with the details of those methods in the documentation: Treelib Tree Docs.
Let me know if this was useful for you! I’ll try and turn this into a more extensible package, if there’s enough appetite for it.