Quick start (ChIP-seq, human)
===================================

This page describes a tutorial on how to obtain results from FASTQ files using **Churros**.
There are only two pieces of data that need to be prepared: FASTQ files and sample metadata.

While this tutorial uses human data, the sample scripts for human, mouse and `S. cerevisiae` are also available at `Churros Tutorial on GitHub <https://github.com/rnakato/Churros/tree/main/tutorial>`_.

.. contents::
   :depth: 3

Churros using Docker or Apptainer
---------------------------------------------

Since Churros is provided as a Docker image, you need to use ``docker`` or ``apptainer`` commands. We recommended to use Apptainer.

The command below is a general example of how to run Churros using Docker or Apptainer.
It will mount the ``/work`` directory of the host machine to the ``/work`` directory of the container.

.. code-block:: bash

    # For docker
    $ docker run --rm -it -v /work:/work rnakato/churros <command>
    ## Example
    $ docker run --rm -it -v /work:/work rnakato/churros \
        churros -p $ncore --mpbl samplelist.txt samplepairlist.txt $build $Ddir

    # For Apptainer
    $ apptainer exec --bind /work churros.sif <command>
    ## Example
    $ apptainer exec --bind /work churros.sif \
        churros -p $ncore --mpbl samplelist.txt samplepairlist.txt $build $Ddir

See also the sample scripts in the `Churros Tutorial <https://github.com/rnakato/Churros/tree/main/tutorial>`_ on GitHub.

Get data
------------------------

Here we use five histone modification data of HepG2 cells from `ENCODE Project <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE29611>`_.

.. code-block:: bash

    mkdir -p fastq
    for id in SRR227447 SRR227448 SRR227552 SRR227553 SRR227563 SRR227564 SRR227575 SRR227576 SRR227598 SRR227599 SRR227639 SRR227640
    do
        fastq-dump --gzip $id -O fastq
    done

| Then download and generate the reference dataset including genome, gene annotation and index files. **Churros** contains scripts for that: ``download_genomedata.sh`` and ``build-index.sh``.
| Here we specify ``hg38`` for genome build. See `Reference Data Preparation <https://churros.readthedocs.io/en/latest/content/Commands.html#reference-data-preparation>`_ for the detail of genome build.

.. code-block:: bash

    mkdir -p log
    build=hg38      # genome build
    Ddir=Referencedata_$build   # output directory
    ncore=12    # number of CPUs

    # download the genome
    download_genomedata.sh -s $build $Ddir 2>&1 | tee log/$Ddir

    # make Bowtie2 index
    build-index.sh -p $ncore bowtie2 $Ddir

Prepare metadata files
-------------------------------------

Churros takes two input files, ``samplelist.txt`` and ``samplepairlist.txt`` that describes the detail of samples to be analyzed.

samplelist.txt
++++++++++++++++++++++++++

``samplelist.txt`` is a **tab-delimited** file (TSV) that describes the sample labels and the path to the corresponding fastq files.
Multiple fastq files can be specified by separateing with ``,``.

.. code-block:: bash

    HepG2_H2A.Z     fastq/SRR227639.fastq.gz,fastq/SRR227640.fastq.gz
    HepG2_H3K4me3   fastq/SRR227563.fastq.gz,fastq/SRR227564.fastq.gz
    HepG2_H3K27ac   fastq/SRR227575.fastq.gz,fastq/SRR227576.fastq.gz
    HepG2_H3K27me3  fastq/SRR227598.fastq.gz,fastq/SRR227599.fastq.gz
    HepG2_H3K36me3  fastq/SRR227447.fastq.gz,fastq/SRR227448.fastq.gz
    HepG2_Control   fastq/SRR227552.fastq.gz,fastq/SRR227553.fastq.gz


.. note::

   Starting with ``version 1.3.0``, **Churros** can allow BAM files as input in samplelist.txt instead of FASTQ files. 
   See `Prepare sample list <https://churros.readthedocs.io/en/latest/content/StepbyStep.html#prepare-sample-list>`_


samplepairlist.txt
++++++++++++++++++++++++++

``samplepairlist.txt`` is a **comma-delimited** file (CSV) that describes the ChIP/Input pairs as follows:

- ChIP-sample label
- Input-sample label
- prefix
- peak mode

ChIP and input sample labels should be identical to those in ``samplelist.txt``.
Input samples can be omitted if unavailable.
``prefix`` is used for the output files.
``peak mode`` is either ``[sharp|broad|sharp-nomodel|broad-nomodel]``. This parameter is used for peak calling by `MACS3 <https://macs3-project.github.io/MACS/>`_.

.. code-block:: bash

    HepG2_H2A.Z,HepG2_Control,HepG2_H2A.Z,sharp
    HepG2_H3K4me3,HepG2_Control,HepG2_H3K4me3,sharp
    HepG2_H3K27ac,HepG2_Control,HepG2_H3K27ac,sharp
    HepG2_H3K27me3,HepG2_Control,HepG2_H3K27me3,broad
    HepG2_H3K36me3,HepG2_Control,HepG2_H3K36me3,broad

Running Churros
------------------------------------------------

``churros`` command executes all steps from mapping reads to visualization.

.. code-block:: bash

    churros -p 12 samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

``-p 12`` specifies the number of CPUs. ``hg38`` is the UCSC genome build and ``Referencedata_hg38`` is the directory generated by ``download_genomedata.sh`` and ``build-index.sh``.

The results are output in ``Churros_result/hg38/``. 

- Output
    - fastp/: Quality check results of FASTQ reads from fastp
    - fastqc/: Quality check results of FASTQ reads from fastqc
    - bam/    ... map files (sorted BAM format by default) and BAM index files
    - bigWig/ ... bigWig files (100 bp, 5 kbp and 100 kbp bins by default) with raw count (``RawCount``) and total read normalization (``TotalReadNormalized``)
    - multiqc_report.html, multiqc_data/ ... The quality check summary generated by `MultiQC <https://multiqc.info/>`_
    - pdf/ ... The pdf files and corresponding peak lists
    - sspout/ ... output of `SSP <https://github.com/rnakato/SSP>`_ (strand-shift profile) for the detailed quality check
    - macs/ ... peak files called by MACS3. The log files are stored in \*log. ``samplepairlist.txt`` in ``macs/`` directory includes the filename of the peak files that is used in the ``churros_visualize`` command.
    - stats/ ... stats files generated by parse2wig+ and SSP
    - churros.QCstats.tsv ... The stats summary for all samples
    - churros.samplepairlist.withflen.txt ... The sample pair list with addition of fragment length estimated by SSP
    - log/ ... log files
    - tmp/ ... temp files

For the quality check, you can use ``churros.QCstats.tsv`` and ``multiqc_report.html``. 

You can visually check the read distribution and the peaks obtained with the pdf files.

You can start the deep analysis using the generated BAM and bigWig files.

.. note::

  The ``fastp`` and ``fastqc`` directories are created under the ``Churros_result/`` directory because they are independent of the genome build.

Change the name of the output directory
++++++++++++++++++++++++++++++++++++++++++++++

If you want to specify the name of the output directory, use ``-D`` option.

.. code-block:: bash

    churros -p 12 -D outputdir samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

Implement sample comparison
++++++++++++++++++++++++++++++++++++++++++

By supplying ``--comparative`` option, ``churros`` executes ``churros_compare`` to implement all-by-all sample comparisons and make correlation heatmaps (see :doc:`StepbyStep` for detail).

.. code-block:: bash

    churros -p 12 --comparative samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

Generate the ChIP/Input p-value
+++++++++++++++++++++++++++++++++++++++++++++++

``--outputpvalue`` option outputs the bedGraph file for -log10(p-value) of ChIP/Input enrichment with binomial test.

.. code-block:: bash

    churros -p 12 --outputpvalue samplelist.txt samplepairlist.txt hg38 Referencedata_hg38


Omit consideration of genome mappability
+++++++++++++++++++++++++++++++++++++++++++++++++

Churros consider genome mappability in default. 
The mappability affects the quality check results and the read-distribution normalization in DROMPAplus but does not affect peak calling by MACS3. 
If you want not to consider it, supply ``--nompbl`` option.

.. code-block:: bash

    churros -p 12 --nompbl samplelist.txt samplepairlist.txt hg38 Referencedata_hg38


Use Adapter-Trimmed Reads for Mapping
+++++++++++++++++++++++++++++++++++++++++++++++++

Churros maps raw reads by default to save on time and storage.
However, if the mapping ratio is quite low, in case the read length is long, or the expected fragment length is short (e.g, `CUT&Tag data <https://www.protocols.io/view/cut-amp-tag-data-processing-and-analysis-tutorial-bjk2kkye.html>`), it is recommended to opt for mapping with adapter-trimmed reads by using the ``--fastqtrimming`` option.

When this option is selected, Churros will run ``fastp`` to remove the adapter sequences from the reads. The resulting trimmed reads are then stored in the ``fastp/`` directory.
Additionally, a modified version of ``samplelist.txt``, named ``samplelist.trimed.txt``, is saved in the ``Churros_result/`` directory. This is used for the mapping process with ``churros_mapping``.

.. code-block:: bash

    churros -p 12 --fastqtrimming samplelist.txt samplepairlist.txt hg38 Referencedata_hg38


Quality check of the input samples
------------------------------------------

Quality check (QC) is an important step in verifying the reliability of the results obtained.
**Churros** provides a script ``checkQC.py`` to check the quality of all input samples.
The warning result is written to ``Churros_result/hg38/QCcheck.log``. If there is no warning, the file is empty.

.. code-block:: bash

    cat Churros_result/hg38/QCcheck.log

Even if the ``Churros_result`` is generated by the previous versions, you can use ``checkQC.py`` as follows.

.. code-block:: bash

    checkQC.py Churros_result/hg38/churros.QCstats.tsv samplepairlist.txt

See the `checkQC.py: check the quality of the input ChIP-seq samples <https://churros.readthedocs.io/en/latest/content/Commands.html#checkqc-py-check-the-quality-of-the-input-chip-seq-samples>`_ page for the detailed criteria.