Quick start (ChIP-seq, human)
===================================

This page describes a tutorial on how to obtain results from FASTQ files using **Churros**.
There are only two pieces of data that need to be prepared: FASTQ files and sample metadata.

While this tutorial uses human data, the sample scripts for human, mouse and `S. cerevisiae` are also available at `Churros Tutorial on GitHub <https://github.com/rnakato/Churros/tree/main/tutorial>`_.

.. contents::
   :depth: 3

Churros using Docker or Apptainer
---------------------------------------------

Since Churros is provided as a Docker image, you need to use ``docker`` or ``apptainer`` commands. We recommended to use Apptainer.

The command below is a general example of how to run Churros using Docker or Apptainer.
It will mount the ``/work`` directory of the host machine to the ``/work`` directory of the container.

.. code-block:: bash

    # For docker
    $ docker run --rm -it -v /work:/work rnakato/churros <command>
    ## Example
    $ docker run --rm -it -v /work:/work rnakato/churros \
        churros -p $ncore --mpbl samplelist.txt samplepairlist.txt $build $Ddir

    # For Apptainer
    $ apptainer exec --bind /work churros.sif <command>
    ## Example
    $ apptainer exec --bind /work churros.sif \
        churros -p $ncore --mpbl samplelist.txt samplepairlist.txt $build $Ddir

See also the sample scripts in the `Churros Tutorial <https://github.com/rnakato/Churros/tree/main/tutorial>`_ on GitHub.

Get data
------------------------

Here we use five histone modification data of HepG2 cells from `ENCODE Project <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE29611>`_.

.. code-block:: bash

    mkdir -p fastq
    for id in SRR227447 SRR227448 SRR227552 SRR227553 SRR227563 SRR227564 SRR227575 SRR227576 SRR227598 SRR227599 SRR227639 SRR227640
    do
        fastq-dump --gzip $id -O fastq
    done

| Then download and generate the reference dataset including genome, gene annotation and index files. **Churros** contains scripts for that: ``download_genomedata.sh`` and ``build-index.sh``.
| Here we specify ``hg38`` for genome build. See `Reference Data Preparation <https://churros.readthedocs.io/en/latest/content/Commands.html#reference-data-preparation>`_ for the detail of genome build.

.. code-block:: bash

    mkdir -p log
    build=hg38      # genome build
    Ddir=Referencedata_$build   # output directory
    ncore=12    # number of CPUs

    # download the genome
    download_genomedata.sh $build $Ddir 2>&1 | tee log/$Ddir

    # make Bowtie2 index
    build-index.sh -p $ncore bowtie2 $Ddir

Prepare metadata files
-------------------------------------

Churros takes two input files, ``samplelist.txt`` and ``samplepairlist.txt`` that describes the detail of samples to be analyzed.

samplelist.txt
++++++++++++++++++++++++++

``samplelist.txt`` is a **tab-delimited** file (TSV) that describes the sample labels and the path to the corresponding fastq files.
Multiple fastq files can be specified by separateing with ``,``.

.. code-block:: bash

    HepG2_H2A.Z     fastq/SRR227639.fastq.gz,fastq/SRR227640.fastq.gz
    HepG2_H3K4me3   fastq/SRR227563.fastq.gz,fastq/SRR227564.fastq.gz
    HepG2_H3K27ac   fastq/SRR227575.fastq.gz,fastq/SRR227576.fastq.gz
    HepG2_H3K27me3  fastq/SRR227598.fastq.gz,fastq/SRR227599.fastq.gz
    HepG2_H3K36me3  fastq/SRR227447.fastq.gz,fastq/SRR227448.fastq.gz
    HepG2_Control   fastq/SRR227552.fastq.gz,fastq/SRR227553.fastq.gz


.. note::

   Starting with ``version 1.3.0``, **Churros** can allow BAM files as input in samplelist.txt instead of FASTQ files. 
   See `Prepare sample list <https://churros.readthedocs.io/en/latest/content/StepbyStep.html#prepare-sample-list>`_


samplepairlist.txt
++++++++++++++++++++++++++

``samplepairlist.txt`` is a **comma-delimited** file (CSV) that describes the ChIP/Input pairs as follows:

- ChIP-sample label
- Input-sample label
- prefix
- peak mode

ChIP and input sample labels should be identical to those in ``samplelist.txt``.
Input samples can be omitted if unavailable.
``prefix`` is used for the output files.
``peak mode`` is either ``[sharp|broad|sharp-nomodel|broad-nomodel]``. This parameter is used for peak calling by `MACS2 <https://github.com/macs3-project/MACS>`_.

.. code-block:: bash

    HepG2_H2A.Z,HepG2_Control,HepG2_H2A.Z,sharp
    HepG2_H3K4me3,HepG2_Control,HepG2_H3K4me3,sharp
    HepG2_H3K27ac,HepG2_Control,HepG2_H3K27ac,sharp
    HepG2_H3K27me3,HepG2_Control,HepG2_H3K27me3,broad
    HepG2_H3K36me3,HepG2_Control,HepG2_H3K36me3,broad

Running Churros
------------------------------------------------

``churros`` command executes all steps from mapping reads to visualization.

.. code-block:: bash

    churros -p 12 samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

``-p 12`` specifies the number of CPUs. ``hg38`` is the UCSC genome build and ``Referencedata_hg38`` is the directory generated by ``download_genomedata.sh`` and ``build-index.sh``.

The results are output in ``Churros_result/hg38/``. 


- Output
    - fastp/: Quality check results of FASTQ reads from fastp
    - fastqc/: Quality check results of FASTQ reads from fastqc
    - bam/    ... map files (sorted BAM format by default) and BAM index files
    - bigWig/ ... bigWig files (100 bp, 5 kbp and 100 kbp bins by default) with raw count (``RawCount``) and total read normalization (``TotalReadNormalized``)
    - multiqc_report.html, multiqc_data/ ... The quality check summary generated by `MultiQC <https://multiqc.info/>`_
    - pdf/ ... The pdf files and corresponding peak lists
    - sspout/ ... output of `SSP <https://github.com/rnakato/SSP>`_ (strand-shift profile) for the detailed quality check
    - macs/ ... peak files called by MACS2. The log files are stored in \*log. ``samplepairlist.txt`` in ``macs/`` directory includes the filename of the peak files that is used in the ``churros_visualize`` command.
    - stats/ ... stats files generated by parse2wig+ and SSP
    - churros.QCstats.tsv ... The stats summary for all samples
    - churros.samplepairlist.withflen.txt ... The sample pair list with addition of fragment length estimated by SSP
    - log/ ... log files
    - tmp/ ... temp files

For the quality check, you can use ``churros.QCstats.tsv`` and ``multiqc_report.html``. 

You can visually check the read distribution and the peaks obtained with the pdf files.

You can start the deep analysis using the generated BAM and bigWig files.

.. note::

  The ``fastp`` and ``fastqc`` directories are created under the ``Churros_result/`` directory because they are independent of the genome build.

Change the name of the output directory
++++++++++++++++++++++++++++++++++++++++++++++

If you want to specify the name of the output directory, use ``-D`` option.

.. code-block:: bash

    churros -p 12 -D outputdir samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

Implement sample comparison
++++++++++++++++++++++++++++++++++++++++++

By supplying ``--comparative`` option, ``churros`` executes ``churros_compare`` to implement all-by-all sample comparisons and make correlation heatmaps (see :doc:`StepbyStep` for detail).

.. code-block:: bash

    churros -p 12 --comparative samplelist.txt samplepairlist.txt hg38 Referencedata_hg38

Generate the ChIP/Input p-value
+++++++++++++++++++++++++++++++++++++++++++++++

``--outputpvalue`` option outputs the bedGraph file for -log10(p-value) of ChIP/Input enrichment.

.. code-block:: bash

    churros -p 12 --outputpvalue samplelist.txt samplepairlist.txt hg38 Referencedata_hg38


Omit consideration of genome mappability
+++++++++++++++++++++++++++++++++++++++++++++++++

Churros consider genome mappability in default. 
The mappability affects the quality check results and the read-distribution normalization in DROMPA+ but does not affect peak calling by MACS2. 
If you want not to consider it, supply ``--nompbl`` option.

.. code-block:: bash

    churros -p 12 --nompbl samplelist.txt samplepairlist.txt hg38 Referencedata_hg38


Use Adapter-Trimmed Reads for Mapping
+++++++++++++++++++++++++++++++++++++++++++++++++

Churros maps raw reads by default to save on time and storage.
However, if the mapping ratio is quite low, it is recommended to opt for mapping with adapter-trimmed reads by using the ``--fastqtrimming`` option.

When this option is selected, Churros will run ``fastp`` to remove the adapter sequences from the reads. The resulting trimmed reads are then stored in the ``fastp/`` directory.
Additionally, a modified version of ``samplelist.txt``, named ``samplelist.trimed.txt``, is saved in the ``Churros_result/`` directory. This is used for the mapping process with ``churros_mapping``.

.. code-block:: bash

    churros -p 12 --fastqtrimming samplelist.txt samplepairlist.txt hg38 Referencedata_hg38


Quality check of the input samples
------------------------------------------

Quality check (QC) is an important step in verifying the reliability of the results obtained.
From verion ``0.11.0``. **Churros** provides a script ``checkQC.py`` to check the quality of all input samples.
The warning result is written to ``Churros_result/hg38/QCcheck.log``. If there is no warning, the file is empty.

.. code-block:: bash

    cat Churros_result/hg38/QCcheck.log

Even if the ``Churros_result`` is generated by the previous versions, you can use ``checkQC.py`` as follows.

.. code-block:: bash

    build=hg38
    checkQC.py Churros_result/$build/churros.QCstats.tsv samplepairlist.txt

See the `checkQC.py: check the quality of the input ChIP-seq samples <https://churros.readthedocs.io/en/latest/content/Commands.html#checkqc-py-check-the-quality-of-the-input-chip-seq-samples>`_ page for the detailed criteria.