Data Science for Biology. Hierarchical clustering in R using Adenovirus codon usage data.

8 min readOct 17, 2022

This exemplar tutorial is based on adenovirus data and R programming in area of hierarchical clustering. You may check the github repo materials here : https://github.com/DarkoMedin/Adenovirus_bioinformatics_hc.git

and the dataset link and information here https://archive.ics.uci.edu/ml/datasets/Codon+usage retrieved from the UCI Machine Learning repository and the additional references [1,2,3] may be found at the References section at the bottom of the tutorial.

First, some background on the adenoviruses. These viruses (at least the known ones), are mainly specific for Humans, although there are other animals which may be the hosts for these viruses. These are very common viruses and account for around 5% of all respiratory infections and cause some of problems such as common cold or pneumonia.

They actually belong to a family of viruses (keep in mind that adenovirus is not a species but rather a family of viruses) Adenoviridae and they are DNA viruses, meaning their genetic material is DNA and they use it as a main reproduction material. What is also interesting about adenoviruses is that they are frequently used as vectors for gene therapy research and as such very important in molecular biology research.

Here is the code to load the libraries and the dataset, codon_usage.csv in R or RStudio.

#Load the required libraries
#Load the required libraries
library(factoextra)
library(clv)
library(ggplot2)
library(pheatmap)#Import data
codon_usage <- read.csv("path/codon_usage.csv")
View(codon_usage)#Make the data object adenodat which will be used for data wrangling
adenodat=codon_usage

First important aspect is to examine the dataframe and the variables contained. As it can be seen first variable is “Kingdom” and there is a string vrl which signifies the virus kingdom. There are other variables like “DNAtype”, “SpeciesID”, “Ncodons” and “SpeciesName” along with the associated codon variables. From the data wrangling perspective “vrl” or virus kingdom and “SpeciesName” specifying the exact name of the virus species will be the focus. Human adenovirus data can be extracted directly, but for the exemplar project, i will first show how to extract the virus kingdom and the the Human adenovirus data from it.

Using grep() the new dataframe called advir will be created from adenodat (we can use the grep() to match a specific pattern). Next step is to use a t to extract only Human adenovirus data. Logically in the match pattern strings “Human” and “adenovirus” will be used. Easiest way to do this and functionally the best way is to create the object called matchpattern and store these two strings inside that object. R base wunction grepl() will enable identifying these strings in advir$SpeciesName vector and extracting the data for Human adenovirus

#Create advir dataframe containing only virus columns
advir <- adenodat[grep(c("vrl"), adenodat$Kingdom), ]#use a matchpattern object containing "Human" and "adenovirus" strings to extract 
#Human adenovirus data
matchpattern <- c("Human", "adenovirus")#Use another grepl to select rows containing "adenovirus" 
advir<- advir[grepl("Human adenovirus", advir$SpeciesName),]#Make Rownames => Species
rownames(advir)<-advir$SpeciesNameView(advir)

Once the adenovirus dataframe is finalized it can be checked using View(advir) for characteristics. The next part of the code is important because it can provide some information about optimal number of cluster. The fviz_nbclust() function can be used for that

#Using the silhouette score to evaluate k, keep in mind this is #statistically optimal
#but biologically theoretical part must also be evaluated
fviz_nbclust(advir[,6:69], hcut, method = "silhouette", k.max = 25) + labs(subtitle = "Find optimal number of clusters")

As it can be seen the silhouette score can be used to evaluate the optimal number clusters but not definitely, still domain based biological background needs to be accounted for.

Evaluation of the silhouette score plot :

Eventough the silhouette width tells the optimal k-2, that would be too low definitely if theoretical background is taken into account and there is another interesting situation where an increase in silhouette score can be seen between k=4 and k-6, after that there is a significant drop. My intuition combined with theoretical knowledge about adenoviruses tells me k=6 is ideal here. To have a full evaluation of this the setting can be varied k=4, and k=7 and compare the with probable optimal k=6.

First the k=4 setting…

#Creating the distance matrix
dmatrix=dist(advir[,6:69], method = "euclidean")
  
#The complete linkage pattern will be used to perform agglomerative #clustering while creating the ahc object using hclust() r base #function
ahc=hclust(dmatrix, method = "complete")#This will  create the dendrogram for k=4
fviz_dend(ahc, cex = 0.7, k = 4,
          k_colors = c("blue", "green3", "red", "aquamarine"),
          rect = TRUE, lower_rect = -0.1, rect_lty = 4

The distance matrix is computed using dist() and “euclidean distance after which hclust() provides the clustering objects using “complete” linkage method.

The fvizdend() is used to create the dendrogram and evaluate the clusters visually.

It can be seen from the taxonomy of the dataframe that there are multiple taxa of Human Adenoviruses like Human adenovirus A, B, C, E and the Human adenoviruses marked with type with tens of different types, so the taxonomy itself tells us that there are probably more than 4 subgroups, which can also be seen on the dendrogram, the red cluster is in this setting is probably at least 2 clusters in a more optimal setting and this situation is possible for both green and teal colored clusters according to the separation somewhere around 0.05 height.

#To perform the inter/intra-cluster distance analysis the hierarchy should be cut into k parts
#According to the second elbow the dendrogram will be cut into 4 parts
cutdend<-cutree(ahc, k=4)#To visualizae the heatmap and both codon based and species based dendrograms 
#make sure data is numeric and its log transformed for better clustering and visualization
data<- as.data.frame(sapply(advir[,6:69],as.numeric))
data<-log(data+1)#rownames should correspond to the species names
rownames(data)<-advir$SpeciesName#cll.scatt.data() is used to evaluate and output inter/intra-cluster distances
#and cls.attrib() to evaluate cluster size
#ggplot will be used to plot the intercluster distance medians and IQRshid=cls.scatt.data(data,cutdend, dist = "euclidean")
dfinter=as.data.frame(hid$intercls.complete)
intradim<-hid$intracls.complete
att<- cls.attrib(data,cutdend)
clsize<-att$cluster.size
dfinter=as.data.frame(hid$intercls.complete)
ggplot(stack(dfinter), aes(x = ind,y = values)) +
  geom_boxplot()

The intercluster distances are quite similar with median arond 0.09 and very similar Q1 and Q3 values (similar inter-quartile range). However the median values are quite high and this is another indicator combined with the silhouette score from previous analysis and the dendrogram that there are probably more clusters.

Here is the plot produced using the pheatmap() function.

#Using the pheatmap package a publication ready visualization is created
pheatmap(data, cutree_rows = 4)

All this can be observerd on the pheatmap and high codon usage for AAC codon is very specific for one cluster starting for Human adenovirus type 10 and ending with either Human adenovirus type 29 or Human adenovirus type 30, so this needs to be evaluated further. Also, it can be seen that higher usage levels of codons ACA, AAU and AAA usage, all three codons containing at least 2 adenine nucleotides are specific for cluster 1, starting with Human adenovirus type 19 and ending with Human adenovirus type 7a

Now the second setting with increased number of clusters to 7 (6) as suggested by the visual observation of the previous dendrogram and optimal silhouette score 6 +1 to see if that setting is better. Here is the dendrogram with k=7 setting.

#This will  create the alliterative hierarchical clustering dendrogram
#but this time using k as 7, much better biologically speaking
fviz_dend(ahc, cex = 0.7, k = 7,
          k_colors = c("blue", "green3", "red", "aquamarine","orange","purple","gray"),
          rect = TRUE, lower_rect = -0.1, rect_lty = 4 )

#To perform the inter/intra-cluster distance analysis the hierarchy should be cut into k parts
cutdend2<-cutree(ahc, k=7)#Repeat the same log transformation process for second dataset meant for k=7
data2<- as.data.frame(sapply(advir[,6:69],as.numeric))
data2<-log(data+1)#rownames again assigned as in advir
rownames(data2)<-advir$SpeciesName#Using this code, hierarchical clustering intracluster dimensions,
#intercluster distances will be calculated and plotted 
hid2=cls.scatt.data(data2,cutdend2, dist = "euclidean")
dfinter2=as.data.frame(hid2$intercls.complete)
intradim2<-hid2$intracls.complete
att2<- cls.attrib(data2,cutdend2)
clsize2<-att2$cluster.size
dfinter2=as.data.frame(hid2$intercls.complete)
ggplot(stack(dfinter2), aes(x = ind,y = values)) +
  geom_boxplot()

With k=7, it can be seen that all clusters except c1 and c4 have complete inter-cluster distance bellow 0.09 which is not much better compared to setting of k=4, however the most stable cluster c1 still has high intercluster distance, so the next step would be to try k=6 as suggested but the silouette score, but first the pheatmap for k=7 is to be examined

#Using the pheatmap() a publication ready visualization is created for k=7 structure
pheatmap(data, cutree_rows = 7)

Here is the pheatmap for k=7 structure :

Now the final k=6 setting. The means to interpret should be separation height, silhouette score and of course the intercluster distance as before.

The following code is used to plot the dendrogram with k=6 setting…

#Visualize the dendrogram with k=6
fviz_dend(ahc, cex = 0.7, k = 6,
          k_colors = c("blue", "green3", "red", "aquamarine","orange","purple"),
          rect = TRUE, lower_rect = -0.1, rect_lty = 4 )

#Cut the dendrogram for the hierarchy  k =6 parts
cutdend3<-cutree(ahc, k=6)#log transformation process for second dataset meant for k=6
data3<- as.data.frame(sapply(advir[,6:69],as.numeric))
data3<-log(data+1)#rownames again assigned as in advir k=6
rownames(data)<-advir$SpeciesName#Last setting for k=6 to check the silhouette differences between k=7 and k-6
#Same steps as in previous two setting except for k=6
hid3=cls.scatt.data(data3,cutdend3, dist = "euclidean")
dfinter3=as.data.frame(hid3$intercls.complete)
intradim3<-hid3$intracls.complete
att3<- cls.attrib(data3,cutdend3)
clsize3<-att3$cluster.size
dfinter3=as.data.frame(hid3$intercls.complete)
ggplot(stack(dfinter3), aes(x = ind,y = values)) +
  geom_boxplot()

Now the evaluation of the intercluster distance boxplot…

It can be seen not that all the clusters complete intercluster distances droped bellow 0.09 which is not better compared to both k= and k=7 settings.

The pheatmap with k=6 clustering structure :

#Pheatmap() for a publication ready visualization is created for k=6 structure
pheatmap(data, cutree_rows = 6)

It can be visualized and evaluated from a Data Science perspective is that cluster 1 containing the following virus species : Human adenovirus type 19, Human adenovirus type 19, Human adenovirus type 15, Human adenovirus type 37, Human adenovirus type 8E, Human adenovirus type 8, Human adenovirus type 6, Human adenovirus type 7d, Human adenovirus type 7a was most credible cluster, being differentiated in all 3 clustering settings. Further k=6 structure was most optimal, due to increasing silhouette score from k=4 to k=7 and lowering intercluster distances compared to k=4 and lower complete intercluster distance for cluster 1 compared to k=7 structure.

References:

1.Khomtchouk BB: ‘Codon usage bias levels predict taxonomic identity and genetic composition’. bioRxiv, 2020, doi: 10.1101/2020.10.26.356295.

2.Nakamura Y, Gojobori T, Ikemura T: ‘Codon usage tabulated from international DNA sequence databases: status for the year 2000’. Nucleic Acids Research, 2000, 28:292.

3.http://www.kazusa.or.jp/codon/

Data Science for Biology. Hierarchical clustering in R using Adenovirus codon usage data.

Written by Darko Medin

No responses yet