Open Cell Microbial Contaminants by High-throughput Sequencing


1. What is OpenContami?

Quality assurance is becoming an increasingly important issue in many research settings. NGS (next-generation sequencing)-based contaminant detection offers a promising diagnostic for assessing the presence of contaminants. Because biological resources are frequently contaminated by multiple microorganisms, researchers need to pay careful attention to intra- and interspecies sequence similarities. OpenContami investigates the origin of sequenced reads in user-uploaded BAM files and reports the highly probable microbial contaminants in those files, which may have been introduced by laboratory reagents, sequencer carryover, cross-contamination, etc.


2. How does OpenContami detect microbe-originated reads?

The analytic pipeline of OpenContami performs greedy alignments to subtract exogenous reads from the input BAM file. The pipeline thoroughly discards host-related reads and then independently maps the screened reads to the genome of each microbial species, which makes it possible to define the mapping status of each read. A read is categorized as either a "uniq-species-hit" (or "uniq-genus-hit"), meaning that it is uniquely mapped to a specific species (or genus), or a "multi-species-hit" (or "multi-genera-hit"), meaning that it is repeatedly mapped to multiple species (or genera).
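The read categories described above can be illustrated with a minimal sketch. It assumes that each screened read comes with a list of (genus, species) pairs, one per microbial genome it aligned to. The function name `classify_read` and this input layout are hypothetical and are not part of OpenContami.

```python
from typing import Iterable, Tuple


def classify_read(hits: Iterable[Tuple[str, str]]) -> Tuple[str, str]:
    """Return (species-level label, genus-level label) for one read.

    `hits` holds one (genus, species) pair per microbial genome the read
    aligned to (hypothetical input layout, for illustration only).
    """
    hits = list(hits)
    if not hits:
        raise ValueError("the read has no microbial hits")

    species = set(hits)                       # distinct (genus, species) pairs
    genera = {genus for genus, _ in hits}     # distinct genera

    species_label = "uniq-species-hit" if len(species) == 1 else "multi-species-hit"
    genus_label = "uniq-genus-hit" if len(genera) == 1 else "multi-genera-hit"
    return species_label, genus_label


# A read that hits two species of the same genus is ambiguous at the species
# level but still unique at the genus level.
example = classify_read([("Mycoplasma", "hyorhinis"), ("Mycoplasma", "orale")])
print(example)
```

A read can therefore be a multi-species-hit and a uniq-genus-hit at the same time, which is why the two levels are labeled separately.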


To quantify microbe abundance at the genus level, the pipeline first tests the statistical significance of the uniq-genus-hits observed in the screened reads by preparing an ensemble of uniq-genus-hits with random read sets. If the observed number of unique microbe hits is significantly greater than the mean of its random ensemble, the pipeline reports the microbe as a potential contaminant (p-value from the z-score distribution). Microbes that were detected with no unique hits are not of interest but are incorporated into the quantification at the sample level. Then, the pipeline scores multiple hits by estimating their importance in the screened reads, to account for intra- and interspecies sequence homology: a read uniquely mapped to a single genus is counted as 1.0, whereas a read mapped to multiple genera is penalized by an exponential function shaped by the overall read-mapping results of the screened reads. In summary, OpenContami explores unique hits as a primary key and exploits the weighted contributions of multiple hits.


2-1. Quantification of microbe-mapped reads

For the sample-level quantification, a normalized score RPMH (reads per million host-mapped reads) is calculated as $\mathrm{RPMH} = \frac{n}{m} \times 10^{6}$, where n and m are the total number of microbe-mapped reads and the total number of host-mapped reads, respectively. For the genus-level quantification, a score $S_i$ for the read i that was mapped to $T_i$ genera (or a genus) is given by $S_i = \exp\left(-\frac{n\,(T_i - 1)}{\sum_{j=1}^{n} T_j}\right)$. Thus, a read uniquely mapped to a genus ($T_i = 1$) is counted as 1.0, whereas a read mapped to multiple genera is penalized by the exponential function.
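As a worked illustration of the per-read score, the following minimal sketch evaluates S_i for a small set of reads. It assumes that the only input is the number of genera T_i each microbe-mapped read was mapped to. The function name `read_scores` is hypothetical.

```python
import math
from typing import List, Sequence


def read_scores(genera_per_read: Sequence[int]) -> List[float]:
    """Compute S_i = exp(-n * (T_i - 1) / sum_j T_j) for every read.

    `genera_per_read[i]` is T_i, the number of genera read i was mapped to.
    """
    if any(t < 1 for t in genera_per_read):
        raise ValueError("every microbe-mapped read must hit at least one genus")

    n = len(genera_per_read)        # total number of microbe-mapped reads
    total = sum(genera_per_read)    # sum of T_j over all reads
    return [math.exp(-n * (t - 1) / total) for t in genera_per_read]


# Three reads: two unique hits and one read shared by two genera.
# n = 3 and sum(T_j) = 4, so the shared read scores exp(-3 * 1 / 4).
scores = read_scores([1, 1, 2])
print(scores)
```

Because the exponent depends on n and on the sum of all T_j, the penalty for a multi-genus read is shaped by the mapping results of the whole sample, not only by that read.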


For instance, the genus-level RPMH of a genus G is calculated by $\mathrm{RPMH}(G) = \frac{\sum_{k=1}^{n'} S_k}{m}$, where $n'$ is the total number of reads uniquely or repeatedly mapped to G.
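The genus-level aggregation can be sketched as follows, assuming each microbe-mapped read is represented by the set of genera it was mapped to. The function name `genus_rpmh` and the `scale` argument are hypothetical. The genus-level formula is printed without the 10^6 factor used in the sample-level RPMH, so `scale` defaults to 1.0 to match the formula as written; passing `scale=1e6` gives a per-million value.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable


def genus_rpmh(read_genera: Iterable[Iterable[str]],
               host_mapped: int,
               scale: float = 1.0) -> Dict[str, float]:
    """Sum the per-read scores S_k for each genus and divide by m.

    `read_genera` holds, for every microbe-mapped read, the genera it hit.
    `host_mapped` is m, the number of host-mapped reads.
    `scale` is an illustrative knob (assumption), not an OpenContami option.
    """
    reads = [set(genera) for genera in read_genera]
    if any(not genera for genera in reads):
        raise ValueError("every microbe-mapped read must hit at least one genus")
    if host_mapped <= 0:
        raise ValueError("host_mapped must be positive")

    n = len(reads)                             # microbe-mapped reads
    total = sum(len(genera) for genera in reads)

    weighted = defaultdict(float)              # genus -> sum of S_k
    for genera in reads:
        score = math.exp(-n * (len(genera) - 1) / total)
        for genus in genera:                   # a multi-genus read counts for each genus it hit
            weighted[genus] += score

    return {genus: scale * w / host_mapped for genus, w in weighted.items()}


result = genus_rpmh([{"A"}, {"A"}, {"A", "B"}], host_mapped=1000)
print(result)
```

In this toy input, genus A collects two unique hits plus one penalized shared read, while genus B collects only the penalized shared read.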


2-2. Preparation of random microbial reads for statistical tests

We randomly select 10 species belonging to distinct genera and prepare 1,000 50-bp DNA fragments from each of the selected species genomes (1,000 reads × 10 species), and calculate the false discovery rate (FDR) for each species; that is, TN / (TN + TP), where TP (true positive) is the number of reads mapped to their origin, and TN (true negative) is the number of reads mapped to others.
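The fragment preparation and the FDR ratio can be sketched as below. This is a minimal illustration that assumes a genome is available as a plain string and that fragment start positions are drawn uniformly; the function names, the fixed seed, and the toy genome are hypothetical, and the alignment step itself is not shown.

```python
import random
from typing import List


def sample_fragments(genome: str, count: int = 1000, length: int = 50,
                     seed: int = 0) -> List[str]:
    """Draw `count` fragments of `length` bp from uniformly random positions."""
    if len(genome) < length:
        raise ValueError("genome is shorter than the fragment length")
    rng = random.Random(seed)
    last_start = len(genome) - length
    return [genome[s:s + length]
            for s in (rng.randint(0, last_start) for _ in range(count))]


def false_discovery_rate(mapped_to_origin: int, mapped_to_others: int) -> float:
    """Ratio from the text: TN / (TN + TP), with TP = reads mapped to their
    origin and TN = reads mapped to other species."""
    total = mapped_to_origin + mapped_to_others
    if total == 0:
        raise ValueError("no mapped reads")
    return mapped_to_others / total


# Toy genome made of random bases, only to exercise the sampler.
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(5000))
fragments = sample_fragments(toy_genome)
print(len(fragments), len(fragments[0]))
print(false_discovery_rate(mapped_to_origin=950, mapped_to_others=50))
```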


After cataloging candidate microbes that have one or more uniq-genus-hits, we test the chance occurrence of these observed uniquely mapped reads. We first prepare n random reads, where n is the total number of microbe-mapped reads in a given input dataset. Then, we align the random reads to the microbe catalog genomes and count the uniquely mapped reads. This procedure is repeated 10 times to prepare an ensemble of random numbers of unique reads for each observed genus. The ensemble numbers for a genus are converted into z-scores, and a null hypothesis assuming that no difference exists between the observation and the mean of its ensemble is tested at a p-value threshold of 0.001.
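The significance test described above can be sketched with the Python standard library. The sketch makes several assumptions that are not stated in the text: it uses the sample standard deviation of the 10 random counts, a one-sided upper-tail p-value from the standard normal distribution, and a simple rule for the case where all random counts are identical. The function names and the example counts are hypothetical.

```python
import math
import statistics
from typing import Sequence


def unique_hit_p_value(observed: int, random_counts: Sequence[int]) -> float:
    """One-sided p-value that `observed` exceeds the random ensemble mean."""
    mean = statistics.mean(random_counts)
    sd = statistics.stdev(random_counts)   # sample SD; needs >= 2 values
    if sd == 0:
        # Degenerate ensemble (assumption): significant only if above the mean.
        return 0.0 if observed > mean else 1.0
    z = (observed - mean) / sd
    # Upper tail of the standard normal distribution: P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2))


def is_potential_contaminant(observed: int, random_counts: Sequence[int],
                             threshold: float = 0.001) -> bool:
    """Apply the p-value threshold of 0.001 given in the text."""
    return unique_hit_p_value(observed, random_counts) < threshold


# Unique-hit counts from 10 rounds of random reads for one genus (made up).
ensemble = [3, 5, 4, 6, 2, 5, 4, 3, 5, 4]
print(is_potential_contaminant(120, ensemble))
print(is_potential_contaminant(4, ensemble))
```

With only 10 repetitions the ensemble standard deviation is a rough estimate, so the normal approximation here should be read as an illustration of the procedure rather than as the exact statistic used by the pipeline.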



3. How to read the OpenContami profile web page

3-1. Numbers at the sample level

A)       Raw Reads:   reads in FASTQ

B)       Number of Raw Reads after QC Filtering: reads retained after QC filtering

C)        Mapped Reads to refGenome (Paired-End): host-mapped reads

D)       Unmapped Reads to refGenome (Paired-End): host-unmapped reads


For each of Read1 and Read2 (if paired-end sequencing)

1.        Mapped Reads to refGenome: reads mapped to the host reference genome

2.        Unmapped Reads to refGenome: reads unmapped to the host reference genome.

3.        Final mapped Reads to Host genome: reads mapped to the host reference genome and host alternative genome sequences.

4.        Final Unmapped Reads to Host genome: host-unmapped reads used for a contaminant profile

5.        Mapped Reads to Microbial Genomes: total number of reads mapped to microbes listed in OpenContami DB.

6.        Microbe RPMH: reads per million host-mapped reads = (5) / (3) × 10^6

7.        Microbe PPM: parts per million = (5) / (B) × 10^6

8.        Chimeric Reads (Host-Microbe): total number of read pairs in which R1 was mapped to the host genome and R2 was mapped to microbes, or vice versa.

9.        Chimeric RPMH: reads per million host-mapped reads = (8) / (3) × 10^6

10.    Chimeric PPM: parts per million = (8) / (B) × 10^6
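The four derived values in the list above follow directly from the counts (B), (3), (5), and (8). The following minimal sketch shows the arithmetic, reading the garbled "~10^6" as a multiplication by 10^6. The class, field, and function names are hypothetical, and the counts in the example are made up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SampleCounts:
    qc_reads: int        # (B) reads retained after QC filtering
    host_mapped: int     # (3) final mapped reads to the host genome
    microbe_mapped: int  # (5) reads mapped to microbial genomes
    chimeric: int        # (8) host-microbe chimeric reads


def per_million(count: int, denominator: int) -> float:
    """Scale a count to 'per million' of the given denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return count * 1e6 / denominator


def sample_metrics(c: SampleCounts) -> Dict[str, float]:
    return {
        "Microbe RPMH": per_million(c.microbe_mapped, c.host_mapped),  # (5) / (3)
        "Microbe PPM": per_million(c.microbe_mapped, c.qc_reads),      # (5) / (B)
        "Chimeric RPMH": per_million(c.chimeric, c.host_mapped),       # (8) / (3)
        "Chimeric PPM": per_million(c.chimeric, c.qc_reads),           # (8) / (B)
    }


counts = SampleCounts(qc_reads=10_000_000, host_mapped=8_000_000,
                      microbe_mapped=400, chimeric=8)
for name, value in sample_metrics(counts).items():
    print(name, value)
```

RPMH and PPM differ only in the denominator: RPMH normalizes by host-mapped reads, whereas PPM normalizes by all reads that passed QC.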


3-2. Numbers at the genus level

I.          RPMH: reads per million host-mapped reads

II.        PPM: parts per million

III.      Multi_Reads: total number of reads mapped to multiple microbe genera

IV.      Uniq_Reads: total number of reads uniquely mapped to a single microbe genus

V.        wTotal: sum of weighted scores = weighted (III) + (IV)

VI.      kingdom: bacteria, fungi, or viruses

VII.    genus: taxonomy

VIII.  P_value: significance of (IV)




4. How to use OpenContami?

4-1. Account registration via OpenLooper >> Register


4-2. Upload BAM file via OpenLooper >> Dashboard >> Upload

To reduce the BAM file size, run "samtools view -bf 4 input.bam > out.bam", which keeps only the reads flagged as unmapped.


4-3. Annotate the BAM file via OpenLooper >> Dashboard >> Annotation

Select the tab "2. Experiment" of Annotation Edit -> "Reference Genome" -> "Others"


4-4. Set the configuration for running the OCT pipeline via the tab "4. Tools" of Annotation Edit

"4. Tools" -> "OpenContami (OCT) Configuration"



The value entered for "3. How many R1 reads were mapped to the host genome" is used as "(1) Mapped Reads to refGenome". RPMHs are calculated by using this value.


4-5. Run the pipeline via the clickable button at the tab "4. Tools" of Annotation Edit



Latest update: May 17, 2019