2. Quick start (ChIP-seq, human)#

This page describes the tutorial on how to get the results from FASTQ files by Churros. There are only two pieces of data you need to prepare: FASTQ files and sample metadata.

While this tutorial uses human data, the sample scripts for human, mouse and S. cerevisiae are also available at Chrros GitHub site.

Note

This tutorial assumes using the Churros singularity image (churros.sif). Please add singularity exec churros.sif before each command below.
Example: singularity exec churros.sif download_genomedata.sh

2.1. Get data#

Here we use five histone modification data of HepG2 cells from ENCODE Project.

mkdir -p fastq
for id in SRR227447 SRR227448 SRR227552 SRR227553 SRR227563 SRR227564 SRR227575 SRR227576 SRR227598 SRR227599 SRR227639 SRR227640
do
    fastq-dump --gzip $id -O fastq
done
Then download and generate the reference dataset including genome, gene annotation and index files. Chuross contains scripts for that: download_genomedata.sh and build-index.sh.
Here we specify hg38 for genome build. See Appendix for the detail of genome build.
mkdir -p log
build=hg38      # genome build
Ddir=Referencedata_$build   # output directory
ncore=12    # number of CPUs

# download the genome
download_genomedata.sh $build $Ddir 2>&1 | tee log/$Ddir

# make Bowtie2 index
build-index.sh -p $ncore bowtie2 $Ddir

2.2. Prepare metadata files#

Churros takes two input files, samplelist.txt and samplepairlist.txt that describes the detail of samples to be analyzed.

2.2.1. samplelist.txt#

samplelist.txt is a tab-delimited file (TSV) that describes the sample labels and the path to the corresponding fastq files. Multiple fastq files can be specified by separateing with ,.

HepG2_H2A.Z     fastq/SRR227639.fastq.gz,fastq/SRR227640.fastq.gz
HepG2_H3K4me3   fastq/SRR227563.fastq.gz,fastq/SRR227564.fastq.gz
HepG2_H3K27ac   fastq/SRR227575.fastq.gz,fastq/SRR227576.fastq.gz
HepG2_H3K27me3  fastq/SRR227598.fastq.gz,fastq/SRR227599.fastq.gz
HepG2_H3K36me3  fastq/SRR227447.fastq.gz,fastq/SRR227448.fastq.gz
HepG2_Control   fastq/SRR227552.fastq.gz,fastq/SRR227553.fastq.gz

2.2.2. samplepairlist.txt#

samplepairlist.txt is a comma-delimited file (CSV) that describes the ChIP/Input pairs as follows:

  • ChIP-sample label

  • Input-sample label

  • prefix

  • peak mode

ChIP and input sample labels should be identical to those in samplelist.txt. Input samples can be omitted if unavailable. prefix is used for the output files. peak mode is either [sharp|broad|sharp-nomodel|broad-nomodel]. This parameter is used for peak calling by MACS2.

HepG2_H2A.Z,HepG2_Control,HepG2_H2A.Z,sharp
HepG2_H3K4me3,HepG2_Control,HepG2_H3K4me3,sharp
HepG2_H3K27ac,HepG2_Control,HepG2_H3K27ac,sharp
HepG2_H3K27me3,HepG2_Control,HepG2_H3K27me3,broad
HepG2_H3K36me3,HepG2_Control,HepG2_H3K36me3,broad

2.3. Running Churros#

churros command executes all steps from mapping reads to visualization.

churros -p 12 samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

-p 12 specifies the number of CPUs. hg38 is the UCSC genome build and Referencedata_hg38 is the directory generated by download_genomedata.sh and build-index.sh.

The results are output in Churros_result/hg38/.

  • Output
    • fastp/: Quality check results of FASTQ reads from fastp

    • fastqc/: Quality check results of FASTQ reads from fastqc

    • bam/ … map files (sorted BAM format by default) and BAM index files

    • bigWig/ … bigWig files (100 bp, 5 kbp and 100 kbp bins by default) with raw count (RawCount) and total read normalization (TotalReadNormalized)

    • multiqc_report.html, multiqc_data/ … The quality check summary generated by MultiQC

    • pdf/ … The pdf files and corresponding peak lists

    • sspout/ … output of SSP (strand-shift profile) for the detailed quality check

    • macs/ … peak files called by MACS2. The log files are stored in *log. samplepairlist.txt in macs/ directory includes the filename of the peak files that is used in the churros_visualize command.

    • stats/ … stats files generated by parse2wig+ and SSP

    • churros.QCstats.tsv … The stats summary for all samples

    • churros.samplepairlist.withflen.txt … The sample pair list with addition of fragment length estimated by SSP

    • log/ … log files

    • tmp/ … temp files

For the quality check, you can use churros.QCstats.tsv and multiqc_report.html.

You can visually check the read distribution and the peaks obtained with the pdf files.

You can start the deep analysis using the generated BAM and bigWig files.

Note

The fastp and fastqc directories are created under the Churros_result/ directory because they are independent of the genome build.

2.3.1. Change the name of the output directory#

If you want to specify the name of the output directory, use -D option.

churros -p 12 -D outputdir samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

2.3.2. Implement sample comparison#

By supplying --comparative option, churros executes churros_compare to implement all-by-all sample comparisons and make correlation heatmaps (see Step-by-step analysis (ChIP-seq, human) for detail).

churros -p 12 --comparative samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

2.3.3. Generate the ChIP/Input p-value#

--outputpvalue option outputs the bedGraph file for -log10(p-value) of ChIP/Input enrichment.

churros -p 12 --outputpvalue samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

2.3.4. Omit consideration of genome mappability#

Churros consider genome mappability in default. The mappability affects the quality check results and the read-distribution normalization in DROMPA+ but does not affect peak calling by MACS2. If you want not to consider it, supply --nompbl option.

churros -p 12 --nompbl samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

2.3.5. Use Adapter-Trimmed Reads for Mapping#

Churros maps raw reads by default to save on time and storage. However, if the mapping ratio is quite low, it is recommended to opt for mapping with adapter-trimmed reads by using the --fastqtrimming option.

When this option is selected, Churros will run fastp to remove the adapter sequences from the reads. The resulting trimmed reads are then stored in the fastp/ directory. Additionally, a modified version of samplelist.txt, named samplelist.trimed.txt, is saved in the Churros_result/ directory. This is used for the mapping process with churros_mapping.

churros -p 12 --fastqtrimming samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

2.4. Quality check of the input samples#

Quality check (QC) is an important step in verifying the reliability of the results obtained. From verion 0.11.0. Churros provides a script checkQC.py to check the quality of all input samples. The warning result is written to Churros_result/hg38/QCcheck.log. If there is no warning, the file is empty.

cat Churros_result/hg38/QCcheck.log

Even if the Churros_result is generated by the previous versions, you can use checkQC.py as follows.

build=hg38
checkQC.py Churros_result/$build/churros.QCstats.tsv samplepairlist.txt

See the checkQC.py: check the quality of the input ChIP-seq samples page for the detailed criteria.