OpenContami User's Guide
What is OpenContami (OCT)?
Quality assurance is becoming an increasingly important issue in various research settings. NGS (next-generation sequencing)-based contaminant detection offers a promising diagnostic for assessing the presence of contaminants. Because biological resources are frequently contaminated by multiple microorganisms, researchers must pay careful attention to intra- and interspecies sequence similarities. OpenContami investigates the origin of the sequenced reads in user-uploaded BAM files and reports the most probable microbial contaminants in those files, which may have been introduced by laboratory reagents, sequencer carryover, cross-contamination, etc. The name "OpenContami" is an abbreviation of "Open Cell Microbial Contaminants by High-throughput Sequencing".
How does OpenContami detect microbe-originated reads?
OpenContami explores unique hits to microbial genomes, as a primary key, and exploits the weighted contributions of multiple hits.
The analytic pipeline of OpenContami performs greedy alignments to subtract exogenous reads from the input BAM file. This pipeline thoroughly discards host-related reads, then independently maps the screened reads to individual microbial species genomes, which enables us to define the mapping status of each read (i.e. unique or multi-mapped reads).
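The mapping status described above can be illustrated with a small sketch. It is not OpenContami's own code: the function name `classify_reads` and the input layout (one `(read_id, genus)` pair per successful alignment, collected from the independent per-genus mapping runs) are assumptions made for this example.

```python
from collections import defaultdict


def classify_reads(hits):
    """Label each read by the number of distinct genera it mapped to.

    hits: iterable of (read_id, genus) pairs, one pair per alignment
    (hypothetical layout chosen for this sketch).
    Returns a dict mapping read_id to 'uniq-genus-hit' or 'multi-genera-hit'.
    """
    genera_per_read = defaultdict(set)
    for read_id, genus in hits:
        # Repeated alignments to the same genus collapse into one entry.
        genera_per_read[read_id].add(genus)

    status = {}
    for read_id, genera in genera_per_read.items():
        if len(genera) == 1:
            status[read_id] = "uniq-genus-hit"
        else:
            status[read_id] = "multi-genera-hit"
    return status


# Example with made-up read IDs: r2 maps to two genera, the others to one.
example_hits = [
    ("r1", "Mycoplasma"),
    ("r2", "Mycoplasma"),
    ("r2", "Ureaplasma"),
    ("r3", "Bacillus"),
]
example_status = classify_reads(example_hits)
```

Reads that never appear in `hits` (no alignment to any genus) are simply absent from the result.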
To quantify microbe abundance at the genus level, the pipeline first tests the statistical significance of the uniq-genus-hits observed in the screened reads by preparing an ensemble of uniq-genus-hits with random read sets. If the observed number of unique microbe hits is significantly greater than the mean of its random ensemble, the pipeline reports the microbe as a potential contaminant (p-value from the Z-score distribution). Microbes detected with no unique hits are not of interest but are still incorporated into the quantification at the sample level. Then, to account for intra- and interspecies sequence homology, the pipeline scores multiple hits by estimating their importance in the screened reads: a read uniquely mapped to a single genus is counted as 1.0, whereas a read mapped to multiple genera is penalized by an exponential function shaped by the overall read-mapping results of the screened reads.
Step I (Quality control)
Trimmomatic assesses the quality of the input NGS reads by removing adapters and trimming reads.
Step II (Mapping to host-reference genome)
HISAT2, coupled with Bowtie2 with the option “-k 1”, aligns the quality-controlled reads to a host reference genome.
Step III (Removing host-relevant reads)
To remove any potential host reads, Bowtie2 with the option “--sensitive” and then BLASTn with the options “-evalue 0.001 -perc_identity 80 -max_target_seqs 1” sequentially align the unmapped reads again to alternative host genomic and transcriptomic sequences.
Step IV (Masking low-complexity sequences)
The host-unmapped reads that still remain are candidate contaminant-origin reads. To reduce false discovery, TANTAN masks the low-complexity sequences in the host-unmapped reads.
Step V (Mapping to a microbe genome)
Bowtie2, with the option “--sensitive”, aligns the masked sequences to one set of bacterial, viral, or fungal genomes of species belonging to the same genus. This step is repeated independently for each of the 2,289 genera.
Step VI (Categorizing read-mapping status)
A mapped read is categorized as either a ‘uniq-genus-hit’ (i.e. uniquely mapped to a specific genus) or a ‘multi-genera-hit’ (i.e. repeatedly mapped to multiple genera). Statistics are gathered from the mapping results, including the total number of microbe-mapped reads (i.e. the sum of ‘uniq-genus-hit’ and ‘multi-genera-hit’ reads) and the total number of host-mapped reads.
Step VII (Defining the shape of the scoring function)
The total number of microbe-mapped reads and the number of genera of each ‘multi-genera-hit’ read define an exponential function for weighting the ‘multi-genera-hit’ reads; a read uniquely mapped to a genus is counted as 1.0, whereas a read mapped to multiple genera is penalized by the exponential function.
Step VIII (Testing statistical significance of unique hits)
To test whether the ‘uniq-genus-hit’ reads mapped to specific microbes could have occurred by chance, the pipeline first randomly samples n reads (where n is the total number of microbe-mapped reads) from the microbial genomes other than the observed ones. Next, the pipeline aligns the random reads to the observed microbial genomes and counts the uniquely mapped reads. This procedure is repeated ten times to prepare an ensemble of random numbers of unique reads for each observed genus. The numbers for a genus are converted into z-scores, and the null hypothesis that no difference exists between the observation and the mean of its ensemble is tested, resulting in a p-value.
Step IX (Calculating RPMHs)
For sample-level quantification, a normalized RPMH (reads per million host-mapped reads) value is calculated as $\mathrm{RPMH} = \frac{n}{m} \times 10^{6}$, where $n$ and $m$ are the total number of microbe-mapped reads and the total number of host-mapped reads in a given input dataset, respectively. For genus-level quantification, the RPMH of a genus $G$ is calculated by $\mathrm{RPMH}(G) = \frac{1}{m} \sum_{k=1}^{n'} S_k$, where $n'$ is the total number of reads uniquely or repeatedly mapped to $G$ and $S_k$ is the score assigned to read $k$ by the scoring function of Step VII.
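The significance test of Step VIII can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the function name `unique_hit_p_value`, the use of the sample standard deviation, the one-sided (upper-tail) test, and the handling of a zero-variance ensemble are all choices made for this example.

```python
import math
import statistics


def unique_hit_p_value(observed, random_counts):
    """Upper-tail p-value of an observed unique-hit count for one genus.

    observed: number of uniq-genus-hit reads seen in the sample.
    random_counts: unique-hit counts from the random read sets
    (ten values in the procedure described above).
    """
    mean = statistics.mean(random_counts)
    sd = statistics.stdev(random_counts)  # sample standard deviation (assumption)
    if sd == 0:
        # Degenerate ensemble: every random run gave the same count.
        # Assumed convention: significant only if the observation exceeds it.
        return 0.0 if observed > mean else 1.0
    z = (observed - mean) / sd
    # P(Z >= z) for a standard normal variable.
    return 0.5 * math.erfc(z / math.sqrt(2))


# Made-up numbers: 50 observed unique hits against ten random runs.
p = unique_hit_p_value(50, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

A one-sided test is used here because the guide reports a microbe only when the observation is significantly greater than the ensemble mean.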
How to use OpenContami?
- To count the host-mapped reads: %>samtools view -cF 4 input.bam
- To reduce the file size (keep only the unmapped reads): %>samtools view -bf 4 input.bam > out.bam
- Select the tab "2.Experiment" of Annotation Edit and set "Reference Genome" to "Others"
- Select the tab "4.Tools" of Annotation Edit and set "OpenContami (OCT) Configuration" fields
- Note: "3. How many R1 reads were mapped to the host genome" is used as "(1) Mapped Reads to refGenome" (i.e. total number of host-mapped reads). RPMHs are calculated by using this value.
- For an already finished job, RPMHs can be calculated again with a new value of "3. How many R1 reads were mapped to the host genome".
- Click the button "Re-calculate RPMH" at the tab "4.Tools"
- This option recalculates RPMHs quickly (without running the mappings again)
How to read the result web page
- Numbers at sample level
- A) Raw Reads: reads in FASTQ
- B) Number of Raw Reads after QC Filtering: reads retained after QC filtering
- C) Mapped Reads to refGenome (Paired-End): host-mapped reads
- D) Unmapped Reads to refGenome (Paired-End): host-unmapped reads
- For each of Read1 and Read2 (if paired-end sequencing)
- 1. Mapped Reads to refGenome: reads mapped to the host reference genome
- 2. Unmapped Reads to refGenome: reads unmapped to the host reference genome
- 3. Final mapped Reads to Host genome: reads mapped to the host reference genome and host alternative genome sequences
- 4. Final Unmapped Reads to Host genome: host-unmapped reads used for a contaminant profile
- 5. Mapped Reads to Microbial Genomes: total number of reads mapped to microbes listed in OpenContami DB
- 6. Microbe RPMH: reads per million host-mapped reads = (5) / (3) x 10^6
- 7. Microbe PPM: parts per million = (5) / (B) x 10^6
- 8. Chimeric Reads (Host-Microbe): total number of read pairs in which R1 was mapped to the host genome and R2 was mapped to microbes, or vice versa
- 9. Chimeric RPMH: reads per million host-mapped reads = (8) / (3) x 10^6
- 10. Chimeric PPM: parts per million = (8) / (B) x 10^6
- Numbers at genus level
- I. RPMH: reads per million host-mapped reads
- II. PPM: parts per million
- III. Multi_Reads: total number of reads mapped to multiple microbe genera
- IV. Uniq_Reads: total number of reads uniquely mapped to a single microbe genus
- V. wTotal: sum of weighted score = weighted (III) + (IV)
- VI. kingdom: bacteria, fungi, or viruses
- VII. genus: taxonomy
- VIII. P_value: significance of (IV)
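The derived sample-level values (items 6, 7, 9, and 10 above) follow directly from the listed counts. The sketch below shows the arithmetic only. The function names and argument names are hypothetical, and the input numbers are invented for illustration.

```python
def per_million(count, denominator):
    """Scale a read count to 'per million' of the given denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be a positive read count")
    return count * 1e6 / denominator


def sample_metrics(qc_reads, final_host_mapped, microbe_mapped, chimeric):
    """Compute items 6, 7, 9 and 10 of the sample-level report.

    qc_reads:          (B) reads retained after QC filtering
    final_host_mapped: (3) final mapped reads to the host genome
    microbe_mapped:    (5) reads mapped to microbial genomes
    chimeric:          (8) host-microbe chimeric reads
    """
    return {
        "Microbe RPMH": per_million(microbe_mapped, final_host_mapped),
        "Microbe PPM": per_million(microbe_mapped, qc_reads),
        "Chimeric RPMH": per_million(chimeric, final_host_mapped),
        "Chimeric PPM": per_million(chimeric, qc_reads),
    }


# Invented counts: 10 M reads after QC, 8 M host-mapped,
# 4,000 microbe-mapped, 80 chimeric.
metrics = sample_metrics(
    qc_reads=10_000_000,
    final_host_mapped=8_000_000,
    microbe_mapped=4_000,
    chimeric=80,
)
```

With these invented counts, the microbe RPMH is 500 and the microbe PPM is 400. RPMH is the larger of the two because its denominator counts only host-mapped reads.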