1. Prepare FASTQ files

Usually, the FASTQ files are delivered to you from the sequencing core. In this tutorial, we will use the H3K79me2 ChIP-seq data published by Orlando et al (Cell Reports, 2014). In this study, Drosophila S2 cells were added to human Jurkat cells at the ratio of 1:2.

Sample (ChIP)

SRR_accession

Sample (WCE)

SRR_accession

Jurkat_K79_0%_R1

SRR1536557

Jurkat_WCE_0%_R1

SRR1584489

Jurkat_K79_25%_R1

SRR1536558

Jurkat_WCE_25%_R1

SRR1584490

Jurkat_K79_50%_R1

SRR1536559

Jurkat_WCE_50%_R1

SRR1584491

Jurkat_K79_75%_R1

SRR1536560

Jurkat_WCE_75%_R1

SRR1584492

Jurkat_K79_100%_R1

SRR1536561

Jurkat_WCE_100%_R1

SRR1584493

After download the SRA files to your hard drive, you will need the fastq-dump command from the SRA Tookit to convert the SRA archive files into FASTQ format. We used this command to convert single-end sequencing data:

$ fastq-dump SRA_file1 SRA_file2 SRA_file3

Note

Alternatively, you can download FASTQ files directly from the ENA (Enropean Nucleotide Archive) by searching SRR accession numbers listed in the above table.

2. Build bowtie2 index files

To calculate how many reads in the FASTQ files are drived from the Drosophila S2 cells, we could map all reads to the composite reference genome (i.e., human + Drosophila). In this tutorial, we will use bowtie2, other short reads aligners such as BWA also work fine. To use bowtie2 to map reads to the composite reference genome, we need to create bowtie2 index files first.

Download the human and Drosophila reference genome sequences:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz
$ gunzip hg38.fa.gz
$ gunzip dm6.fa.gz

Rename chromosome IDs of the Drosophila genome:

$ sed 's/chr/dm6_chr/g' dm6.fa  > dm6_renameID.fa

Concatenate the human and Drosophila genome files:

$ cat hg38.fa  dm6_renameID.fa > hg38_dm6.fa

Run bowtie2-build to build index files. “hg38_dm6” is the prefix of index files:

$ bowtie2-build  hg38_dm6.fa  hg38_dm6

Clean up:

$ rm hg38.fa dm6.fa dm6_renameID.fa

Note

We have Bowtie2 index files files for hg38 + dm6, hg19 + dm6, mm9 + dm6, mm10 + dm6, mm39 + dm6 available for download.

3. Map reads to the composite reference genome

Use bowtie2 to map all reads to the composite reference genome (i.e., hg38_dm6.fa). Sort and index the BAM files using Samtools. Below is an example to process one sample:

bowtie2 --threads 4 -x ./database/GRCh38_dm6  -U Jurkat_K79_00p_Rep1_SRR1536557.fastq.gz  | samtools view -Sbh - > Jurkat_K79_00p_Rep1_SRR1536557.bam
samtools sort -@ 4 Jurkat_K79_00p_Rep1_SRR1536557.bam > Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam
samtools index -@ 4 Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam

Note

We have pre-aligned BAM files available from Test dataset for download.

4. Calculate scaling factors

To remove experimental bias and accurately quantify ChIP signal changes, the amount of exogenous Drosophila chromatin in different samples needs to be normalized to the same level. After sequencing, this is equivalent to normalize Drosophila reads to the same amount.

We provide split_bam.py to split the BAM file into 4 sub-BAM files.

  • _both.bam : contains reads mapped to both human and Drosophila genomes

  • _exogenous.bam : contains reads only map to the Drosophila genomes

  • _human.bam : contains reads only map to the human genomes

  • _neither.bam : contains qcfailed, low quality, unmapped, dupilcate reads etc.

A report file describing the number of alignments in each sub-BAM is also generated.

For example:

$ split_bam.py --threads 8  -i Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam  -o Jurkat_K79_00p_Rep1
Read the BAM file: Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam
Done
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 0 files and 8 in-memory blocks..

$ cat Jurkat_K79_00p_Rep1.report.txt

Sample n_unmapped      n_qcFail        n_duplicate     n_secondary     n_low_mapq      n_both  n_sample        n_exogenous
Jurkat_K79_00p_Rep1_SRR1536557.sorted.bam      719553  0       0       0       5355549 0       20741684        5207395

Mapping statistics and the derived scaling factors (SF) are summerized in the table below (the number of Drosophila reads in each sample was normalized to 1 million).

Sample

Unmapped reads

Low_mapq_reads

Human_reads

Fly_reads

SF

Jurkat_K79_00p_Rep1_SRR1536557

719,553

5,355,549

20,741,684

5,207,395

0.1920346

Jurkat_K79_25p_Rep1_SRR1536558

2,177,315

5,367,333

19,282,608

6,035,961

0.1656737

Jurkat_K79_50p_Rep1_SRR1536559

2,919,492

5,233,017

17,235,944

6,931,602

0.14426679

Jurkat_K79_75p_Rep1_SRR1536560

1,627,494

4,300,454

11,731,507

8,291,870

0.12060006

Jurkat_K79_100p_Rep1_SRR1536561

4,526,721

5,109,981

7,893,377

14,372,323

0.06957817

Jurkat_WCE_00p_Rep1.SRR1584489

2,344,210

3,914,206

11,126,768

844,395

1.18427987

Jurkat_WCE_25p_Rep1.SRR1584490

9,982,131

1,056,402

2,771,236

164,052

6.09562822

Jurkat_WCE_50p_Rep1.SRR1584491

4,398,297

1,750,494

4,431,034

267,211

3.74236091

Jurkat_WCE_75p_Rep1.SRR1584492

7,317,599

1,620,854

4,029,096

179,641

5.56665795

Jurkat_WCE_100p_Rep1.SRR1584493

7,609,117

2,336,349

5,197,374

344,665

2.901368

Note

You can also use K-mer based, alignment-free methods to calculate scaling factor. For example, you can use xenome to separate human and fly reads from your FASTQ files.

5. Run Spiker

Spiker is specifically designed to analyze ChIP-seq data with spike-in control. The input files can be single-end or paired-end FASTQ files or pre-aligned BAM files. It supports both narrow peak (such as H3K27ac) and broad peak analysis (such as H3K79me2 in this tutorial).

In below, we used “Jurkat_K79_00p_Rep1_SRR1536557_human.sorted.bam” (ChIP) and “Jurkat_WCE_00p_Rep1.SRR1584489_human.sorted.bam” (control) as inputs to Spiker:

$ spiker.py --broad --spikeIn --csf 1.18 --tsf 0.19 --bw -t Jurkat_K79_00p_Rep1_SRR1536557_human.sorted.bam -c Jurkat_WCE_00p_Rep1.SRR1584489_human.sorted.bam  -o Jurkat_K79_00p_Rep1_SRR1536557

2021-02-18 12:56:20 [INFO]  Running ChIP-seq workflow ...
2021-02-18 12:56:20 [INFO]  Input BAM files. Skip Step_1 (reads mapping).
2021-02-18 12:56:20 [INFO]  Step_2: Extract information from BAM files ...
2021-02-18 12:56:20 [INFO]  Step_2.1: Check sequencing layout from BAM file ...
2021-02-18 12:56:22 [INFO]     The layout of Jurkat_K79_00p_Rep1_SRR1536557_human.sorted.bam is SE
2021-02-18 12:56:23 [INFO]     The layout of Jurkat_WCE_00p_Rep1.SRR1584489_human.sorted.bam is SE
2021-02-18 12:56:23 [INFO]  Step_2.2: Extract information from ChIP BAM files ...
2021-02-18 12:56:25 [INFO]     The total mapped reads in ChIP sample is: 20741684
2021-02-18 12:56:25 [INFO]  Step_2.3: Extract information from control BAM files ...
2021-02-18 12:56:26 [INFO]     The total mapped reads in control sample is: 11126768
2021-02-18 12:56:26 [INFO]  Step_3: Deduplication ...
2021-02-18 12:56:26 [INFO]  Step_3.1: Remove duplidate reads from ChIP BAM file ...
...

Output files

*.gappedPeak

peak file (in gappedPeak) called by Spiker.

*.control.pileup.max.bdg

bedGraph file of control pileup

*.control.pileup.max.bigWig

bigWig file of control pileup

*.treat.pileup.bdg

bedGraph file of ChIP sample pileup (raw)

*.treat.pileup.SpikeIn_scaled.bdg

bedGraph file of ChIP sample pileup (Scaled to spike-in control)

*.treat.pileup.SpikeIn_scaled.bigWig

bigWig file of ChIP sample pileup (Scaled to spike-in control)