10. Commands in Churros #

10.1. Reference Data Preparation #

10.1.1. download_genomedata.sh #

download_genomedata.sh downloads the genome and gene annotation files of the genome build specified. Churros assumes the reference data is downloaded by this command.

download_genomedata.sh <build> <outputdir>
   build (Ensembl|UCSC, you can specify either):
      human (GRCh38|hg38, GRCh37|hg19, T2T)
      mouse (GRCm39|mm39, GRCm38|mm10)
      rat (mRatBN7.2|rn7)
      fly (BDGP6|dm6)
      zebrafish (GRCz11|danRer11)
      chicken (GRCg6a|galGal6)
      African clawed frog (Xenopus_tropicalis|xenLae2)
      C. elegans (WBcel235|ce11)
      S. cerevisiae (R64-1-1|sacCer3)
      S. pombe (SPombe)
      Hydra vulgaris AEP (HVAEP)
   Example:
      download_genomedata.sh hg38 Referencedata_hg38

10.2. Correspondence table for the genome build #

While download_genomedata.sh uses the Ensembl genome build, Churros uses UCSC genome build in the command. Here we summarize the correspondence.

Species	Ensembl	UCSC
human	GRCh38	hg38
human	GRCh37	hg19
mouse	GRCm39	mm39
mouse	GRCm38	mm10
rat	mRatBN7.2	rn7
fly	BDGP6	dm6
zebrafish	GRCz11	danRer11
chicken	GRCg6a	galGal6
C.elegans	WBcel235	ce11
S.serevisiae	R64-1-1	sacCer3
African clawed frog	Xenopus_tropicalis	xenLae2

For example, to use human genome build GRCh38/hg38, specify GRCh38 or hg38 for download_genomedata.sh and hg38 for churros.

download_genomedata.sh hg38 UCSC-hg38/ 2>&1 | tee log/UCSC-hg38
churros -p 24 --mpbl samplelist.txt samplepairlist.txt hg38 UCSC-hg38
# or
download_genomedata.sh GRCh38 Ensembl-GRCh38/ 2>&1 | tee log/Ensembl-GRCh38
churros -p 24 --mpbl samplelist.txt samplepairlist.txt hg38 Ensembl-GRCh38/

Other species:

Species	ID
human	T2T
S.pombe	SPombe
A.thaliana	TAIR10
Oryzias latipes	Medaka
Hydra vulgaris AEP	HVAEP

10.2.1. build-index.sh #

build-index.sh builds index files of the tools specified. <odir> should be the same with <outputdir> directory provided in download_genomedata.sh. If -a option is specified, all scaffolds are also indexed in addition to chromosomes. The <odir> is used in the Churros commands below.

build-index.sh [-p ncore] -a <program> <odir>
    program: bowtie, bowtie-cs, bowtie2, bwa, chromap, bismark
    Example:
        build-index.sh bowtie2 Referencedata_hg38

10.3. Commands internally used in churros #

10.3.1. churros #

churros command internally implements churros_mapping, churros_callpeak, churros_visualize, churros_compare and churros_genPvalwig.

churros also check the quality of FASTQ files using fastqc and fastp in addition to the quality check of map files by churros_mapping. The result is summarized in the stats file in text format (churros.QCstats.tsv) and HTML format by MULTIQC (multiqc_report.html) in the output directory.

usage: churros [-h] [--cram] [-f] [-b BINSIZE] [-k K] [--nompbl] [--nofilter] [--noqc] [--fastqtrimming] [-q QVAL] [--macsdir MACSDIR]
               [--mapparam MAPPARAM] [--parse2wigparam PARSE2WIGPARAM] [-p THREADS] [--threads_comparative THREADS_COMPARATIVE]
               [--outputpvalue] [--comparative] [-D OUTPUTDIR] [--preset PRESET] [-v]
               samplelist samplepairlist build Ddir

positional arguments:
   samplelist            sample list
   samplepairlist        ChIP/Input pair list
   build                 genome build (e.g., hg38)
   Ddir                  directory of reference data

options:
   -h, --help            show this help message and exit
   --cram                output as CRAM format (default: BAM)
   -f, --force           overwrite if the output directory already exists
   -b BINSIZE, --binsize BINSIZE
                           binsize of parse2wig+ (default: 100)
   -k K                  read length for mappability file ([28|36|50], default:50)
   --nompbl              do not consider genome mappability in drompa+
   --nofilter            do not filter PCR duplicate
   --noqc                omit FASTQC and fastp
   --fastqtrimming       Apply adapter trimming with fastp before mapping (omitted if '--noqc' is specified)
   -q QVAL, --qval QVAL  threshould of MACS2 (default: 0.05)
   --macsdir MACSDIR     output direcoty of macs2 (default: 'macs2')
   --mapparam MAPPARAM   additional parameter for bowtie|bowtie2 (shouled be quated)
   --parse2wigparam PARSE2WIGPARAM
                           additional parameter for parse2wig+ (shouled be quated)
   -p THREADS, --threads THREADS
                           number of CPUs (default: 12)
   --threads_comparative THREADS_COMPARATIVE
                           number of CPUs for --comparative option (default: 8)
   --outputpvalue        output ChIP/Input -log(p) distribution as a begraph format
   --comparative         compare bigWigs and peaks among samples by churros_compare
   -D OUTPUTDIR, --outputdir OUTPUTDIR
                           output directory (default: 'Churros_result')
   --preset PRESET       Preset parameters for mapping reads ([scer|T2T])
   -v, --version         print version information and quit

Key points:
- We recommend considering genome mappability as long as mappability files are available.
  download_genomedata.sh generates mappability files for the read lengths 28, 36, and 50. Specify the read length closest to your data by -k option.
  
  If the mappability file is unavailable, consider generating it by yourself (see Appendix).
- The appropriate parameter setting depends on the species to be investigated. churros has --preset option to tune the parameter set for each species.
  In version 0.4.0, scer (for S. cerevisiae) and T2T (for T2T-CHM13) are available. When applying chuross to S. cerevisiae, try --preset scer option.

10.3.2. churros (spike-in mode)#

From version v1.0.0, churros has an option to apply spike-in normalization. The command is as follows:

build=hg38
build_spikein=mm39
Ddir_ref=Referencedata_${build}
Ddir_spikein=Referencedata_${build_spikein}
ncore=48

churros -p $ncore --spikein samplelist.txt samplepairlist.txt \
      $build ${Ddir_ref} --build_spikein ${build_spikein} --Ddir_spikein ${Ddir_spikein}

The required options are --spikein, build_spikein and --Ddir_spikein. This command uses hg38 for the reference genome and mm39 for the spike-in genome.

Churros will then create the bigWig/Spikein/, pdf_spikein/, and spikein_scalingfactor directories, which contain the results of the spike-in analysis.

10.3.3. churros_mapping #

churros_mapping maps FASTQ reads to the genome specified by Bowtie2 in default. The mapped reads are then quality-checked and converted to BigWig files.

churros_mapping has 5 commands: exec, map, postprocess, stats and header.

The main command is exec that maps reads and generates bigWig files (identical to both map and postprocess command execution).
map executes mapping.
postprocess generates bigWig files from the map files generated by map commands.
stats command outputs the quality values in one line (used in churros.QCstats.tsv).
Because stats command does not show the header of columns, use header command to show the header.

churros_mapping [options] <command> <samplelist> <build> <Ddir>
   <command>:
      exec: map & postprocess
      map: mapping reads
      postprocess: QC and generate wig files by ssp and parse2wig;
      stats: show mapping/QC stats;
      header: print header line of the stats
   <samplelist>: Samplelist file
   <build>: Genome build (e.g., hg38)
   <Ddir>: Directory of bowtie|bowtie2 index
   Options:
      -c: Output as CRAM format (defalt: BAM)
      -b: Binsize of parse2wig+ (defalt: 100)
      -z: Peak file for FRiP calculation (BED format, default: default MACS2 without control)
      -k [28|36|50]: Read length for mappability file (default:50)
      -n: Do not consider genome mappability
      -N: Do not filter PCR duplication
      -C: For SOLiD data (csfastq, defalt: fastq)
      -f: Output format of parse2wig+ (default: 3)
               0: compressed wig (.wig.gz)
               1: uncompressed wig (.wig)
               2: bedGraph (.bedGraph)
               3: bigWig (.bw)
      -P "param": Additional parameter for bowtie|bowtie2 (shouled be quated)
      -Q "param": Additional parameter for parse2wig+ (shouled be quated)
      -p: Number of CPUs (default: 12)
      -D: directory for execution (defalt: "Churros_result")
   Example:
      For single-end: churros_mapping exec chip.fastq.gz chip hg38 Referencedata_hg38
      For paired-end: churros_mapping exec "-1 chip_1.fastq.gz -2 chip_2.fastq.gz" chip hg38 Referencedata_hg38

Key points:
- There are two directories in bigWig directory, RawCount and TotalReadNormalized. The former is a raw count of nonredundant mapped reads, while the latter stores the read number after total read normalization to 20 M.
- Churros uses TotalReadNormalized in the downstream analysis, while MACS2 (peak calling) uses the former.

10.3.4. churros_mapping_spikein #

churros_mapping_spikein is similar to churros_mapping and applies spike-in normalization.

churros_mapping_spikein has 3 commands: exec, stats and header.

By default, churros_mapping_spikein uses the Calibrating ChIP-seq normalization proposed by Hu et al., NAR 2015, which requires the input sample obtained from WCE. If this is not available, use the --spikein_simple option, which applies a simpler normalization using only the ChIP samples proposed by Orlando et al., Cell Rep, 2014.

The command below uses hg38 for the reference genome and mm39 for the spike-in genome.

build=hg38
build_spikein=mm39
Ddir_ref=Referencedata_${build}
Ddir_spikein=Referencedata_${build_spikein}
ncore=48

churros_mapping_spikein exec samplelist.txt samplepairlist.txt ${build} ${build_spikein} \
      ${Ddir_ref} ${Ddir_spikein} -p ${ncore}

usage: churros_mapping_spikein [-h] [--spikein_simple] [--spikein_constant SPIKEIN_CONSTANT] [--cram] [-p THREADS] [-D OUTPUTDIR]
                            [--bowtieparam BOWTIEPARAM] [-b BINSIZE] [--peak PEAK] [--param_parse2wig PARAM_PARSE2WIG]
                            [--output_format OUTPUT_FORMAT] [--nompbl] [--nofilter] [-k KMER]
                            command samplelist samplepairlist build build_spikein Ddir_ref Ddir_spikein

positional arguments:
command               [exec|stats|header]
                           exec: mapping and postprocess
                           stats: show mapping/QC stats
                           header: print header line of the stats
samplelist            Sample list
samplepairlist        ChIP/Input pair list
build                 genome build (e.g., hg38)
build_spikein         genome build (e.g., mm39)
Ddir_ref              Directory of genome index for reference
Ddir_spikein          Directory of genome index for spike-in

options:
-h, --help            show this help message and exit
--spikein_simple      Spikein: Use ChIP samples only
--spikein_constant SPIKEIN_CONSTANT
                        Scaling Constant for the number of reads after normalization (default: 100)
--cram                Output as CRAM format (defalt: BAM)
-p THREADS, --threads THREADS
                        Number of CPUs (default: 12)
-D OUTPUTDIR, --outputdir OUTPUTDIR
                        Output directory (default: 'Churros_result')
--bowtieparam BOWTIEPARAM
                        Additional parameter for bowtie|bowtie2 (shouled be quated)
-b BINSIZE, --binsize BINSIZE
                        Binsize of parse2wig+ (default: 100)
--peak PEAK           Peak file for FRiP calculation (BED format, default: MACS2 without control)
--param_parse2wig PARAM_PARSE2WIG
                        Additional parameter for parse2wig+ (shouled be quated)
--output_format OUTPUT_FORMAT
                        Output format of parse2wig+ (default: 3)
                           0: compressed wig (.wig.gz)
                           1: uncompressed wig (.wig)
                           2: bedGraph (.bedGraph)
                           3: bigWig (.bw)
--nompbl              Do not consider genome mappability
--nofilter            Use data where PCR duplication is not filtered
-k KMER, --kmer KMER  Read length for mappability file ([28|36|50], default:50)

10.3.5. churros_callpeak #

churros_callpeak executes MACS2 to call peaks for all samples specified in samplepairlist. The results are output in macs directory by default. churros_callpeak also compares the obtained peaks among samples and outputs the heatmap in comparison and simpson_peak_results directories.

churros_callpeak [Options] <samplepairlist> <build>
   <samplepairlist>: text file of ChIP/Input sample pairs
   <build>: genome build (e.g., hg38)
   Options:
      -D : directory for execution (defalt: "Churros_result")
      -d : directory for peaks (defalt: "macs")
      -q : threshould of MACS2 (defalt: 0.05)
      -b : bam direcoty (defalt: "bam")
      -F : overwrite MACS2 resilts if exist (defalt: skip)
      -p : number of CPUs (defalt: 4)
      -s : postfix of the mapfile ($prefix$post.sort.bam, default: "")

10.3.6. churros_visualize #

churros_visualize executes DROMPA+ to make pdf files that visualize read/enrichment/p-value distributions. The results are output in pdf directory by default.

usage: churros_visualize [-h] [-b BINSIZE] [-l LINESIZE] [--nompbl] [--nofilter] [-d D] [--postfix POSTFIX] [--pvalue] [--bowtie1]
                        [-P DROMPAPARAM] [-G] [--enrich] [--logratio] [--preset PRESET] [-D OUTPUTDIR]
                        samplepairlist prefix build Ddir

positional arguments:
samplepairlist        ChIP/Input pair list
prefix                Output prefix (directory will be omitted)
build                 Genome build (e.g., hg38)
Ddir                  Directory of reference data

options:
-h, --help            show this help message and exit
-b BINSIZE, --binsize BINSIZE
                        Binsize of parse2wig+ (default: 100)
-l LINESIZE, --linesize LINESIZE
                        Line size for each page (kbp, defalt: 1000)
--nompbl              Do not consider genome mappability
--nofilter            Use data where PCR duplication is not filtered
-d D                  Directory of bigWig files (default: 'TotalReadNormalized/')
--postfix POSTFIX     Parameter string of parse2wig+ files to be used (default: '.mpbl')
--pvalue              Show p-value distribution instead of read distribution
--bowtie1             Specified bowtie1
-P DROMPAPARAM, --drompaparam DROMPAPARAM
                        Additional parameters for DROMPA+ (shouled be quated)
-G                    Genome-wide view (100kbp)
--enrich              PC_ENRICH: show ChIP/Input ratio (preferred for yeast)
--logratio            (for PC_ENRICH) Show log-scaled ChIP/Input ratio
--preset PRESET       Preset parameters for mapping reads ([scer|T2T])
-D OUTPUTDIR, --outputdir OUTPUTDIR
                        Output directory (default: 'Churros_result')

Note

If you supply -n option in churros_mapping (do not consider genome mappability), supply --nompbl option in churros_visualize to use the generated mappability-normalized bigWig files.

Key points:
- The default setting (100-bp bin and 1-Mbp page width) is adjusted to typical transcription factor analysis for human/mouse.
- For the broad mark analysis (e.g., H3K27me3 and H3K9me3, which are distributed more than 100 kbp), macro-scale visualization is useful. For example, -b 5000 -l 8000 option generates 5-kbp bin, 8-Mbp page width. The scale of the y-axis can be changed by -P option, for example, -P "--scale_tag 100".
- By -G option, churros_visualize visualizes ChIP/Input enrichment in genome-wide view (whole chromosome on one page).
- It is also possible to visualize -log10(p) of ChIP/Input enrichment instead of read distribution, by supplying --pvalue option.
- churros_visualize can highlight the peak regions called by MACS2 by supplying the macs/samplepairlist.txt generated by churros_callpeak for samplepairlist (see Tutorial).

10.3.7. churros_compare #

churros_compare estimates the correlation among samples described in <samplepairlist> and draw heatmaps and scatter plots using three types of comparative analysis:

Spearman correlation of read distribution by applying bigWig files (100-bp and 100-kbp bins) to deepTools plotCorrelation.
- This score evaluates the similarity of the whole genome including non-peak regions. Therefore the results may reflect the genome-wide features (e.g., GC bias and copy number variations) rather than peak overlap.
- The results are stored in bigwigCorrelation/.
Jaccard index of base-pair level overlap of peaks by BEDtools jaccard.
- This score is good for broad peaks such as some histone modifications (H3K27me3 and H3K36me3).
- The results are stored in Peak_BPlevel_overlap/.
Simpson index of peak-number level comparison.
- This score is good for the comparison of sharp peaks such as transcription factors.
- The results are stored in Peak_Number_overlap/. PairwiseComparison/ contains the results of all pairs (overlapped peak list and Venn diagram) and the Peaks contains top-ranked peaks of samples.

churros_compare [Options] <samplelist> <samplepairlist> <build>
   <samplelist>: text file of samples
   <build>: genome build (e.g., hg38)
   Options:
      -o: output directory (defalt: "comparison")
      -d: peak direcoty (defalt: "macs")
      -n: do not consider genome mappability
      -D: directory for execution (defalt: "Churros_result")
      -p : number of CPUs (default: 8)
      -y <str>: param string of parse2wig+ files to be used (default: ".mpbl")

Note

If all samples are sharp peaks (e.g., transcription factors), the Simpson index may be reasonable. If the samples contain broad peaks (e.g., histone modification such as H3K27me3), the Jaccard index may provide more reasonable results because multiple sharp peaks can be overlapped with one broad peak.

Note

If the number of samples is large (50~) and/or the number of peaks of each sample is large (100k~), the comparison will require a long time. In such a case, consider supplying a large number for -p, though that will require a large memory size.

10.3.8. churros_genPvalwig #

As churros_visualize can visualize -log10(p) of ChIP/Input enrichment distribution, churros_genPvalwig can be used the p-value distribution in bedGraph.

The good usage of churros_genPvalwig is specifying ChIP files in two conditions (e.g., before and after stimulation) in samplepairlist and analyzing the p-value distribution to investigate significantly increased/descreased regions.

churros_genPvalwig [Options] <samplepairlist> <odir> <build> <gt>
   <samplepairlist>: text file of ChIP/Input sample pairs
   <odir>: output directory
   <build>: genome build (e.g., hg38)
   <gt>: genome_table file
   Options:
      -b <int>: binsize (defalt: 100)
      -d <str>: directory of bigWig files (default: TotalReadNormalized/)
      -e <int>: Output value
         0: ChIP/Input enrichment
         1: -log10(P) (ChIP internal)
         2 (default): -log10(P) (ChIP/Input enrichment)
      -n: do not filter PCR duplicate
      -m: do not consider genome mappability
      -y <str>: postfix of .bw files to be used (default: "-raw-GR")
      -D <str>: directory for execution (defalt: "Churros_result")
      -x: Output as bigWig (defalt: bedGraph)
   Example:
      churros_genPvalwig samplelist.txt chip-seq hg38 genometable.hg38.txt

10.3.9. churros_classheat #

Churros provides a classheat function for clustering and visualizing large-scale epigenomic profiles. This function takes regions of interest (e.g., specific protein binding sites) as input 1 and a folder of epigenomic signal files (either binary or continuous) as input 2.

In the binary mode, classheat outputs a binary matrix (output 1) representing the overlap of epigenomic markers at given genomic regions. The binary matrix is then formatted and sorted by the user-defined column (i.e., the filename of the selected marker) to generate the processed matrix (output 2) and plot the sorted heatmap (output 3). Subsequently, classheat utilizes PCA followed by k-means clustering (or other clustering methods) to produce the clustered matrix (output 4) and the clustered heatmap (output 5).

In the continuous mode, classheat calculates the averaged read density of each epigenomic marker at given genomic regions (output 1). After logarithmic transformation, z-score normalization (optional method is 0-to-1 scaling), and sorting, classheat generates the remaining outputs in the same manner as in binary mode.

churros_classheat mode region directory [-k kcluster] [-s sortname] [-l samplelabel] [-n normalize type] [-m cluster method]

Example usage of binary mode:

churros_classheat -l samplelabel.tsv binary Rad21_ENCSR000BTQ_rep1_peaks.narrowPeak ./peakdir/

This command takes as input a file representing regions of interest (Rad21_ENCSR000BTQ_rep1_peaks.narrowPeak) and a directory (./peakdir/) containing multiple epigenomic signals. We also assigned labels to the files in the ./peakdir/ directory. Five output files are generated:

Output1_raw_matrix.tsv
Output2_sorted_matrix.tsv
Output3_sorted_heatmap.png
Output4_kmeans_matrix.tsv
Output5_kmeans_heatmap.png

Example usage of continuous mode:

churros_classheat -l samplelabel.tsv -s GATA3_ENCSR000EWV_rep1.bw -k 3 -n zscore continuous Rad21_ENCSR000BTQ_rep1_peaks.narrowPeak ./bwdir/

10.4. Commands internally used in churros_mapping #

10.4.1. bowtie.sh #

bowtie.sh is a script to use Bowtie. Because bowtie2 does not allow SOLiD colorspace data, use this script for it.

bowtie.sh [Options] <fastq> <prefix> <Ddir>
   <fastq>: fastq file
   <prefix>: output prefix
   <Ddir>: directory of bowtie index
   Options:
      -t STR: for SOLiD data ([fastq|csfata|csfastq], defalt: fastq)
      -c: output as CRAM format (defalt: BAM)
      -p INT: number of CPUs (default: 12)
      -P "STR": parameter of bowtie (shouled be quated, default: "-n2 -m1")
      -D: output dir (defalt: ./)
   Example:
      For single-end: bowtie.sh -P "-n2 -m1" chip.fastq.gz chip Referencedata_hg38
      For paired-end: bowtie.sh "\-1 chip_1.fastq.gz \-2 chip_2.fastq.gz" chip Referencedata_hg38
      For SOLiD data: bowtie.sh -t csfastq -P "-n2 -m1" chip.csfastq.gz chip Referencedata_hg38

10.4.2. bowtie2.sh #

bowtie2.sh is a script to use Bowtie2.

bowtie2.sh [Options] <fastq> <prefix> <Ddir>
   <fastq>: fastq file
   <prefix>: output prefix
   <Ddir>: directory of bowtie2 index
   Options:
      -c: output as CRAM format (defalt: BAM)
      -p: number of CPUs (default: 12)
      -P "bowtie2 param": parameter of bowtie2 (shouled be quated)
      -D: output dir (defalt: ./)
   Example:
      For single-end: bowtie2.sh -p "--very-sensitive" chip.fastq.gz chip Referencedata_hg38
      For paired-end: bowtie2.sh "\-1 chip_1.fastq.gz \-2 chip_2.fastq.gz" chip Referencedata_hg38

10.4.3. parse2wig+.sh #

parse2wig+.sh executes parse2wig+ to generate wig|bedGraph|bigWig files from map files with the read normalization. When -m option is supplied, parse2wig+.sh also normalizes the read based on the genome mappability (the read length can be specified using -k option).

parse2wig+.sh [options] <mapfile> <prefix> <build> <Ddir>
   <mapfile>: mapfile (SAM|BAM|CRAM|TAGALIGN format)
   <prefix>: output prefix
   <build>: genome build (e.g., hg38)
   <Ddir>: directory of bowtie2 index
   Options:
      -a: also outout raw read distribution
      -b: binsize of parse2wig+ (defalt: 100)
      -z: peak file for FRiP calculation (BED format)
      -l: predefined fragment length (default: estimated by trand-shift profile)
      -m: consider genome mappability
      -k: read length for mappability calculation ([28|36|50], default: 50)
      -p: for paired-end file
      -t: number of CPUs (default: 4)
      -n: do not filter PCR duplication
      -o: output directory (default: parse2wigdir+)
      -s: stats directory (default: log/parse2wig+)
      -f: output format of parse2wig+ (default: 3)
               0: compressed wig (.wig.gz)
               1: uncompressed wig (.wig)
               2: bedGraph (.bedGraph)
               3: bigWig (.bw)
      -D outputdir: output dir (defalt: ./)
      -F: overwrite files if exist (defalt: skip)
      -P: other options (should be quoted, see the help of parse2wig+ for the detail)
   Example:
      For single-end: parse2wig+.sh chip.sort.bam chip hg38 Referencedata_hg38
      For paired-end: parse2wig+.sh -p chip.sort.bam chip hg38 Referencedata_hg38

10.5. Commands internally used in churros_callpeak #

10.5.1. macs.sh #

macs.sh is a script to use MACS2.

macs.sh [Options] <IP bam> <Input bam> <prefix> <build> <mode>
   <IP bam>: BAM for for ChIP (treat) sample
   <Input bam>: BAM for for Input (control) sample: specify "none" if unavailable
   <prefix>: prefix of output file
   <build>: genome build (e.g., hg38)
   <mode>: peak mode ([sharp|broad|sharp-nomodel|broad-nomodel])
   Options:
      -f <int>: predefined fragment length (defalt: estimated in MACS2)
      -d <str>: output directory (defalt: "macs")
      -B: save extended fragment pileup, and local lambda tracks (two files) at every bp into a bedGraph file
      -F: overwrite files if exist (defalt: skip)

10.6. Commands internally used in churros_compare #

10.6.1. simpson_peak.sh #

simpson_peak.sh takes multiple peak lists (BED format) and output the correlation heatmap (.pdf) and scores (Simpson index). The one-by-one comparison results (overlapped peak list and Venn diagram) are also generated.

Note

If the number of peaks largely varies among samples, the results may become unfair. In such a case, use -n option to extract the same number of top-ranked peaks from the samples.

simpson_peak.sh [Options] <peakfile> <peakfile> ...
   <peakfile>: peak file (bed format)
   Options:
      -n <int>: extract top-<int> peaks for comparison (default: all peaks)
      -d <str>: output directory (default: "simpson_peak_results/")
      -v: Draw Venn diagrams for all pairs
      -p <int>: number of CPUs (default: 8)

10.7. Tools for Advanced Usage #

10.7.1. FRiR: Repeat Analysis #

Similar to the FRiP (fraction of reads in peaks) score of Landt et al. (2012), which calculates the fraction of mapped reads that fall within ChIP-seq peak regions, Churros calculates the FRiR (fraction of reads in repeats) score as the fraction of mapped reads that fall within repeat regions annotated by RemeatMasker.

Usage: FRiR [option] -r <repeatfile> -i <inputfile> -o <output> --gt <genome_table>

Example:
   FRiR -r Referencedata_hg38/RepeatMasker.txt.gz -o FRiRresult --gt Referencedata_hg38/genometable.txt -i Churros_result/hg38/bam/Sample.sort.bam --repeattype class

<repeatfile> is the RepeatMasker file downloaded with download_genomefa.sh. FRiR can allow a gzipped repeat file. The –repeattype option specifies the type of repeat classification of the output. The default is “class” (e.g., SINE, LINE, LTR, DNA, and others). The output is a text file with the FRiR score for each repeat type.

Note

Selecting --repeattype name results in a long computation time due to an extremely large number of classes.

10.7.2. rose: Super-enhancer analysis #

rose executes ROSE to identify super-enhancer sites from a BED file.

Input bam file is optional.

rose [Options] <IPbam> <Inputbam> <bed> <build>
   <IPbam>: BAM file for ChIP sample
   <Inputbam>: BAM file for Input sample (specify "none" when input is absent)
   <bed>: enhancer regions (BED format)
   <build>: genome build (hg18|hg19|hg38|mm8|mm9|mm10)
   Options:
      -d : maximum distance between two regions that will be stitched together (default: 12500)
      -e : exclude regions contained within +/- this distance from TSS in order to account for promoter biases (default: 0, recommended if used: 2500)

10.7.3. chromHMM.sh #

You can use chromHMM using chromHMM.sh <command>, e.g., chromHMM.sh LearnModel. See the ChromHMM website for the detail.

10.7.4. chromImpute.sh #

You can use chromImpute using chromImpute.sh <command>, e.g., chromImpute.sh Convert. See the chromImpute website for the detail.

10.8. Tools for DNA methylation analysis #

10.8.1. Bismark.sh: Bisulfite sequencing analysis #

Bismark.sh executes Bismark to handle Bisulfite sequencing data.

Bismark.sh command executes all steps of Bismark as follows:

bismark (mapping)

deduplicate_bismark

bismark_methylation_extractor

bismark2report

bismark2summary

Bismark.sh [Options] <index> <fastq>
   <index>: Bismark index directory
   <fastq>: Input fastq file
   Options:
      -d <str>: output directory (defalt: "Bismarkdir")
      -m <mode>: Bismark mode ([directional|non_directional|pbat|rrbs], default: directional)
      -p : number of CPUs (default: 4)

The results are output in Bismarkdir/. If you want to specify the name of output directory, use -d option.

10.9. Utility scripts in Churros #

10.9.1. gen_samplelist.sh: create samplelist.txt #

gen_samplelist.sh is a script that takes the FASTQ files directory as input and creates a draft samplelist.txt file. The first column is the sample name, and the second column is the path of the FASTQ file(s). Please modify the sample name as needed.

$ ls fastq
SRR227447.fastq.gz  SRR227552.fastq.gz  SRR227563.fastq.gz
SRR227575.fastq.gz  SRR227598.fastq.gz  SRR227639.fastq.gz
SRR227448.fastq.gz  SRR227553.fastq.gz  SRR227564.fastq.gz
SRR227576.fastq.gz  SRR227599.fastq.gz  SRR227640.fastq.gz

$ gen_samplelist.sh fastq > samplelist.txt
$ cat samplelist.txt
SRR227447      fastq/SRR227447.fastq.gz
SRR227448      fastq/SRR227448.fastq.gz
SRR227552      fastq/SRR227552.fastq.gz
SRR227553      fastq/SRR227553.fastq.gz
SRR227563      fastq/SRR227563.fastq.gz
SRR227564      fastq/SRR227564.fastq.gz
SRR227575      fastq/SRR227575.fastq.gz
SRR227576      fastq/SRR227576.fastq.gz
SRR227598      fastq/SRR227598.fastq.gz
SRR227599      fastq/SRR227599.fastq.gz
SRR227639      fastq/SRR227639.fastq.gz
SRR227640      fastq/SRR227640.fastq.gz

If you have paired-end FASTQ files, Supply -p option.

$ gen_samplelist.sh -p fastq > samplelist.txt

By default, gen_samplelist.sh assumes that the postfix of paired fastq files is “_1.fastq.gz” and “_2.fastq.gz”. If it is “_R1.fastq.gz” and “_R2.fastq.gz”, specify -r option.

$ gen_samplelist.sh -p -r fastq > samplelist.txt

10.9.2. generate_samplelist_from_SRA #

generate_samplelist_from_SRA is a script that takes SRA metadata files as input and creates a draft samplelist.txt file. This script is useful when you download the FASTQ files as well as the metadata files named SraExperimentPackage.xml and SraRunTable.txt from SRA.

generate_samplelist_from_SRA SraExperimentPackage.xml SraRunTable.txt samplelist.txt

10.9.3. gen_samplepairlist.sh: create samplepairlist.txt #

gen_samplepairlist.sh is a script that takes samplelist.txt as input and creats a draft samplepairlist.txt.

$ cat samplelist.txt
HepG2_H2A.Z     fastq/SRR227639.fastq.gz,fastq/SRR227640.fastq.gz
HepG2_H3K4me3   fastq/SRR227563.fastq.gz,fastq/SRR227564.fastq.gz
HepG2_H3K27ac   fastq/SRR227575.fastq.gz,fastq/SRR227576.fastq.gz
HepG2_H3K27me3  fastq/SRR227598.fastq.gz,fastq/SRR227599.fastq.gz
HepG2_H3K36me3  fastq/SRR227447.fastq.gz,fastq/SRR227448.fastq.gz
HepG2_Control   fastq/SRR227552.fastq.gz,fastq/SRR227553.fastq.gz

$ gen_samplepairlist.sh samplelist.txt > samplepairlist.txt
$ cat samplepairlist.txt
HepG2_H2A.Z,,HepG2_H2A.Z,sharp
HepG2_H3K4me3,,HepG2_H3K4me3,sharp
HepG2_H3K27ac,,HepG2_H3K27ac,sharp
HepG2_H3K27me3,,HepG2_H3K27me3,sharp
HepG2_H3K36me3,,HepG2_H3K36me3,sharp
HepG2_Control,,HepG2_Control,sharp

This script fills the first and third columns of samplepairlist.txt with the sample names in samplelist.txt. The second column (for the control samples) is left blank, so please fill it as needed. For example, if you want to use “HepG2_Control” as the control sample for “HepG2_H3K27ac”, fill the second column of the first row with “HepG2_Control”.

The fourth column is for the peak mode, which is “sharp” or “broad”. The default value is “sharp”, so please change it to “broad” if the target is broad peaks (e.g., H3K27me3 and H3K36me3).

# Example of filling the control sample name
$ cat samplepairlist.modified.txt
HepG2_H2A.Z,HepG2_Control,HepG2_H2A.Z,sharp
HepG2_H3K4me3,HepG2_Control,HepG2_H3K4me3,sharp
HepG2_H3K27ac,HepG2_Control,HepG2_H3K27ac,sharp
HepG2_H3K27me3,HepG2_Control,HepG2_H3K27me3,broad
HepG2_H3K36me3,HepG2_Control,HepG2_H3K36me3,broad

Specify -n option when you do not have matching input samples (outputs “none” in the second column).
Specify -b option to specify the peak mode “broad” for all samples (default: “sharp”).

10.9.4. checkQC.py: check the quality of the input ChIP-seq samples #

checkQC.py takes churros.QCstats.tsv and samplepairlist.txt and prints warnings if the samples do not meet the quality criteria.

checkQC.py churros.QCstats.tsv samplepairlist.txt
Example:
   checkQC.py Churros_result/hg38/churros.QCstats.tsv samplepairlist.txt

Unique mapping rate > 60%: If this rate is low, the reads in FASTQ files may be derived from repetitive regions, contamination with adapter sequences, or low-quality reads. Check the FASTQC result.
Nonredundant reads > 10,000,000: This number indicates the read depth. If the number is low, the number of detected peaks will be small, and the total read normalization for sample comparison will produce noisy results.
Read complexity > 0.8: This value reflects the amount of nonredundant reads in the sample. The low value indicates that the sample is overamplified by the PCR from a small amount of initial DNA, resulting in many false positive peaks.
Genome coverage > 0.6: The fraction of the reference genome covered by at least one mapped read. The low value indicates that the whole genome is not well sequenced and observed. Possible reasons are insufficient read depth and insufficient DNA fragmentation.
- The exception is RNA polII, which often causes low genome coverage due to its extremely high signal-to-noise ratio.
GC content < 60%: The GC content of the mapped reads. This value is typically ranges from 40% to 60%. The higher value indicates that the reads are derived from the GC-rich regions (i.e., open chromatin), possibly due to the bias of sonication and/or PCR amplification.
- However, it is noted that the appropriate value depends on the species and the target of the analysis. For example, RNA polII and H3K4me3 are enriched in the GC-rich regions and show high GC levels, but this is normal.
SSP-NSC > 3.0 (sharp), and > 1.5 (broad): SSP-NSC indicates the signal-to-noise ratio of the sample. The low value indicates that the sample is not enriched in the target regions, resulting in small number of peaks.
Background complexity > 0.8: This value reflects the uniformity of mapped reads in the background regions. The low value indicates that the whole genome is not well fragmented, resulting in many false positive peaks.
- This value decreases when the sample has high copy number regions in the genome, such as MCF-7 cells. In such cases, a value > 0.6 is considered acceptable.

See Nakato et al., Brief Bioinform. 2017 and Nakato et al., Bioinformatics 2018 for the detailed criteria.

10.9.5. Create mappability files #

If you want to create the mappability files yourself, you can use calculate_mappability_mosaics.sh.

calculate_mappability_mosaics.sh [Options] <Ddir>
   <Ddir>: directory of the genome
   Options:
      -f <int>: fragment length (default: 150)
      -b <array <int>>: binsizes (default: "10000 25000 50000 500000 1000000")
      -r <array <int>>: read length (default: "36 50")
      -p <int>: number of CPUs (default: 12)
   Example:
      calculate_mappability_mosaics.sh Ensembl-GRCh38

For example, if you want to make the mappability files for genome build hg38 with the read length 75 and 100, type:

calculate_mappability_mosaics.sh -r "75 100" UCSC-hg38

Then the data is created in UCSC-hg38/mappability_Mosaics_75mer and UCSC-hg38/mappability_Mosaics_100mer.

Note

This command takes long time for computation. Set large number for -p (e.g., 64).

Commands in Churros

Contents