1. Prepare FASTQ files

Usually, the FASTQ files are delivered to you from the sequencing core. In this tutorial, we will use the H3K79me2 ChIP-seq data published by Orlando et al (Cell Reports, 2014). In this study, Drosophila S2 cells were added to human Jurkat cells at the ratio of 1:2.

Sample (ChIP)	SRR_accession	Sample (WCE)	SRR_accession
Jurkat_K79_0%_R1	SRR1536557	Jurkat_WCE_0%_R1	SRR1584489
Jurkat_K79_25%_R1	SRR1536558	Jurkat_WCE_25%_R1	SRR1584490
Jurkat_K79_50%_R1	SRR1536559	Jurkat_WCE_50%_R1	SRR1584491
Jurkat_K79_75%_R1	SRR1536560	Jurkat_WCE_75%_R1	SRR1584492
Jurkat_K79_100%_R1	SRR1536561	Jurkat_WCE_100%_R1	SRR1584493

After download the SRA files to your hard drive, you will need the fastq-dump command from the SRA Tookit to convert the SRA archive files into FASTQ format. We used this command to convert single-end sequencing data:

$ fastq-dump SRA_file1 SRA_file2 SRA_file3

Note

Alternatively, you can download FASTQ files directly from the ENA (Enropean Nucleotide Archive) by searching SRR accession numbers listed in the above table.

2. Build bowtie2 index files

To calculate how many reads in the FASTQ files are drived from the Drosophila S2 cells, we could map all reads to the composite reference genome (i.e., human + Drosophila). In this tutorial, we will use bowtie2, other short reads aligners such as BWA also work fine. To use bowtie2 to map reads to the composite reference genome, we need to create bowtie2 index files first.

Download the human and Drosophila reference genome sequences:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
$ gunzip hg38.fa.gz
$ gunzip dm6.fa.gz

Rename chromosome IDs of the Drosophila genome:

$ sed 's/chr/dm6_chr/g' dm6.fa  > dm6_renameID.fa

Concatenate the human and Drosophila genome files:

$ cat hg38.fa  dm6_renameID.fa > hg38_dm6.fa

Run bowtie2-build to build index files. “hg38_dm6” is the prefix of index files:

$ bowtie2-build  hg38_dm6.fa  hg38_dm6

Clean up:

$ rm hg38.fa dm6.fa dm6_renameID.fa

Note

We have Bowtie2 index files files for hg38 + dm6, hg19 + dm6, mm9 + dm6, mm10 + dm6, mm39 + dm6 available for download.

3. Map reads to the composite reference genome

Use bowtie2 to map all reads to the composite reference genome (i.e., hg38_dm6.fa). Sort and index the BAM files using Samtools. Below is an example to process one sample:

bowtie2 --threads 4 -x ./database/GRCh38_dm6  -U Jurkat_K79_00p_Rep1_SRR1536557.fastq.gz  | samtools view -Sbh - > Jurkat_K79_00p_Rep1_SRR1536557.bam
samtools sort -@ 4 Jurkat_K79_00p_Rep1_SRR1536557.bam > Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam
samtools index -@ 4 Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam

Note

We have pre-aligned BAM files available from Test dataset for download.

4. Calculate scaling factors

To remove experimental bias and accurately quantify ChIP signal changes, the amount of exogenous Drosophila chromatin in different samples needs to be normalized to the same level. After sequencing, this is equivalent to normalize Drosophila reads to the same amount.

We provide split_bam.py to split the BAM file into 4 sub-BAM files.

_both.bam : contains reads mapped to both human and Drosophila genomes

_exogenous.bam : contains reads only map to the Drosophila genomes

_human.bam : contains reads only map to the human genomes

_neither.bam : contains qcfailed, low quality, unmapped, dupilcate reads etc.

A report file describing the number of alignments in each sub-BAM is also generated.

For example:

$ split_bam.py --threads 8  -i Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam  -o Jurkat_K79_00p_Rep1
Read the BAM file: Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam
Done
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 0 files and 8 in-memory blocks..

$ cat Jurkat_K79_00p_Rep1.report.txt

Sample n_unmapped      n_qcFail        n_duplicate     n_secondary     n_low_mapq      n_both  n_sample        n_exogenous
Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam      719553  0       0       0       5355549 0       20741684        5207395

Mapping statistics and the derived scaling factors (SF) are summerized in the table below (the number of Drosophila reads in each sample was normalized to 1 million).

Sample	Unmapped reads	Low_mapq_reads	Human_reads	Fly_reads	SF
Jurkat_K79_00p_Rep1_SRR1536557	719,553	5,355,549	20,741,684	5,207,395	0.1920346
Jurkat_K79_25p_Rep1_SRR1536558	2,177,315	5,367,333	19,282,608	6,035,961	0.1656737
Jurkat_K79_50p_Rep1_SRR1536559	2,919,492	5,233,017	17,235,944	6,931,602	0.14426679
Jurkat_K79_75p_Rep1_SRR1536560	1,627,494	4,300,454	11,731,507	8,291,870	0.12060006
Jurkat_K79_100p_Rep1_SRR1536561	4,526,721	5,109,981	7,893,377	14,372,323	0.06957817
Jurkat_WCE_00p_Rep1.SRR1584489	2,344,210	3,914,206	11,126,768	844,395	1.18427987
Jurkat_WCE_25p_Rep1.SRR1584490	9,982,131	1,056,402	2,771,236	164,052	6.09562822
Jurkat_WCE_50p_Rep1.SRR1584491	4,398,297	1,750,494	4,431,034	267,211	3.74236091
Jurkat_WCE_75p_Rep1.SRR1584492	7,317,599	1,620,854	4,029,096	179,641	5.56665795
Jurkat_WCE_100p_Rep1.SRR1584493	7,609,117	2,336,349	5,197,374	344,665	2.901368

Note

You can also use K-mer based, alignment-free methods to calculate scaling factor. For example, you can use xenome to separate human and fly reads from your FASTQ files.

5. Run Spiker

Spiker is specifically designed to analyze ChIP-seq data with spike-in control. The input files can be single-end or paired-end FASTQ files or pre-aligned BAM files. It supports both narrow peak (such as H3K27ac) and broad peak analysis (such as H3K79me2 in this tutorial).

In below, we used “Jurkat_K79_00p_Rep1_SRR1536557_human.sorted.bam” (ChIP) and “Jurkat_WCE_00p_Rep1.SRR1584489_human.sorted.bam” (control) as inputs to Spiker:

$ spiker.py --broad --spikeIn --csf 1.18 --tsf 0.19 --bw -t Jurkat_K79_00p_Rep1_SRR1536557_human.sorted.bam -c Jurkat_WCE_00p_Rep1.SRR1584489_human.sorted.bam  -o Jurkat_K79_00p_Rep1_SRR1536557

2021-02-18 12:56:20 [INFO]  Running ChIP-seq workflow ...
2021-02-18 12:56:20 [INFO]  Input BAM files. Skip Step_1 (reads mapping).
2021-02-18 12:56:20 [INFO]  Step_2: Extract information from BAM files ...
2021-02-18 12:56:20 [INFO]  Step_2.1: Check sequencing layout from BAM file ...
2021-02-18 12:56:22 [INFO]     The layout of Jurkat_K79_00p_Rep1_SRR1536557_human.sorted.bam is SE
2021-02-18 12:56:23 [INFO]     The layout of Jurkat_WCE_00p_Rep1.SRR1584489_human.sorted.bam is SE
2021-02-18 12:56:23 [INFO]  Step_2.2: Extract information from ChIP BAM files ...
2021-02-18 12:56:25 [INFO]     The total mapped reads in ChIP sample is: 20741684
2021-02-18 12:56:25 [INFO]  Step_2.3: Extract information from control BAM files ...
2021-02-18 12:56:26 [INFO]     The total mapped reads in control sample is: 11126768
2021-02-18 12:56:26 [INFO]  Step_3: Deduplication ...
2021-02-18 12:56:26 [INFO]  Step_3.1: Remove duplidate reads from ChIP BAM file ...
...

Output files

*.gappedPeak: peak file (in gappedPeak) called by Spiker.
*.control.pileup.max.bdg: bedGraph file of control pileup
*.control.pileup.max.bigWig: bigWig file of control pileup
*.treat.pileup.bdg: bedGraph file of ChIP sample pileup (raw)
*.treat.pileup.SpikeIn_scaled.bdg: bedGraph file of ChIP sample pileup (Scaled to spike-in control)
*.treat.pileup.SpikeIn_scaled.bigWig: bigWig file of ChIP sample pileup (Scaled to spike-in control)