4. Large-scale profile analysis#

Churros provides a classheat function for clustering and visualizing large-scale epigenomic profiles. This function takes regions of interest (e.g., specific protein binding sites) as input 1 and a folder of epigenomic signal files (either binary or continuous) as input 2.

  • In the binary mode, classheat outputs a binary matrix (output 1) representing the overlap of epigenomic markers at given genomic regions. The binary matrix is then formatted and sorted by the user-defined column (i.e., the filename of the selected marker) to generate the processed matrix (output 2) and plot the sorted heatmap (output 3). Subsequently, classheat utilizes PCA followed by k-means clustering (or other clustering methods) to produce the clustered matrix (output 4) and the clustered heatmap (output 5).

  • In the continuous mode, classheat calculates the averaged read density of each epigenomic marker at given genomic regions (output 1). After logarithmic transformation, z-score normalization (optional method is 0-to-1 scaling), and sorting, classheat generates the remaining outputs in the same manner as in binary mode.

The main usages are:

churros_classheat mode region directory \
  [-k kcluster] [-s sortname] [-l samplelabel] [-n normalize type] [-m cluster method]

The required parameters:

  • mode: either binary or continuous.

  • region: a BED format file for regions of interest (input 1). Only the first 3 columns are used.

  • directory: a directory containing the epigenomic signal files. The signal files can be either binary (e.g., peak files in BED format) or continuous (e.g., read coverage in bigwig format).

The optional parameters:

  • -k kcluster: number of clusters for clustered matrix and clustered heatmap. The default value is 3.

  • -s sortname: the filename of the selected marker in the directory above. This is used to for the processed matrix and sorted heatmap.

  • -l samplelabel: A .tsv table used to assign groups for each marker in the directory above. For example, it could look like this.

H3K27ac_ENCSR000EWR_rep1_peaks.narrowPeak

H3K27ac

GATA3_ENCSR000EWV_rep1_peaks.narrowPeak

TFs

H3K9me3_ENCSR000EWQ_rep3.mpbl.100.bw

Histone

Rad21_ENCSR000BTQ_rep1_peaks.narrowPeak

TFs

  • -n normalize type: Normalization methods for continuous data, could be zscore or scale0to1. Default: zscore.

  • -m clustering method: minikmeans, kmeans, spectral, meanshift, dbscan, affinity

4.1. Example usage of binary mode#

churros_classheat -l samplelabel.tsv binary Rad21_ENCSR000BTQ_rep1_peaks.narrowPeak ./peakdir/

This command takes as input a file representing regions of interest (Rad21_ENCSR000BTQ_rep1_peaks.narrowPeak) and a directory (./peakdir/) containing multiple epigenomic signals. We also assigned labels to the files in the ./peakdir/ directory. Five output files are generated:

Output1_raw_matrix.tsv
Output2_sorted_matrix.tsv
Output3_sorted_heatmap.png
Output4_kmeans_matrix.tsv
Output5_kmeans_heatmap.png
Alternate

Fig. 4.1 Output5_kmeans_heatmap.png#

4.2. Example usage of continuous mode#

churros_classheat -l samplelabel.tsv -s GATA3_ENCSR000EWV_rep1.bw -k 3 -n zscore continuous Rad21_ENCSR000BTQ_rep1_peaks.narrowPeak ./bwdir/