CovidMutations: a feasible framework for mutation analysis and assays validation of COVID-19

R version: R version 4.4.2 (2024-10-31)
Package version: 0.1.3

Introduction

The novel respiratory disease COVID-19 has caused worldwide pandemic and large efforts are currently being undertaken in characterizing the virus (SARS-CoV-2) causing it in both molecular and epidemiological aspects(Dong, Du, and Gardner 2020). The genomic variability of SARS-CoV-2 may largely correlate with the virus specific etiological effects(Kim et al. 2020). In the present study, by detailing the characteristics of SARS-CoV-2 complete genomes currently available on the Global Initiative on Sharing Avian Influenza Data (GISAID) consortium(Elbe and Buckland-Merrett 2017), We analyze and annotate all SARS-CoV-2 mutations compared with the reference Wuhan genome NC_045512.2. Our analysis shows the mutational pattern of different type of mutations and may reveal their potential effects on proteins. Reverse transcription polymerase chain reaction (RT-PCR) was the first method developed for COVID-19 detection and is the current gold standard as it offers both high accuracy and throughput(Corman et al. 2020). it is increasingly critical to evaluate the performance of RT-PCR tests using both experimental tests and in-silico analysis. Here we present a feasible method to evaluate the detection efficiency of different real-time reverse-transcription polymerase chain reaction (RT-PCR) assays available as example data(Sanchez-Padilla et al. 2015). By tracking mutational trends in viral sequences and evaluating the effectiveness of assay (mutation ratio in detecting regions), we provide options for further design of highly sensitive, clinically useful assays. Our analysis may provide new strategies of antiviral therapy based on the molecular characteristics of this novel virus(Mercatelli and Giorgi 2020).

Downloading and preparing the data

The fasta file of SARS-CoV-2 genome is made available through the GISAID. Users have to log in to access the data. In the Browse session, the hCoV-19 can be downloaded following the instructions below: Then click the Download button to download all available fasta files. The fasta file is processed by seqkit software(Shen et al. 2016) and Nucmer script(Delcher et al. 2002) under linux to produce the nucmer.snps file, the procedure is:

#  download the SARS-Cov-2 genomics sequence(*.fasta) from Gisaid website( comploe, high coverage only, low coverage exclusion, Host=human, Virus name=hCoV-19)
# download NC_045512.2.fa from NCBI as reference for alignment

#specify your fasta file as input
input=gisaid_hcov-19_2020_06_14_04.fasta


#Command line
## clear the data
seqkit grep  -s -p - $input -v > Gisaid_clear.fasta  # remove the data with '-'

## Run nucmer to obtain variant file

ref=NC_045512.2.fa # The reference SARS-CoV-2 Wuhan Genome

## Covert the DOS/window file format to UNIX format for both ref and input files

sed 's/^M$//' Gisaid_clear.fasta > Gisaid_clear_format.fasta
sed 's/^M$//' NC_045512.2.fa > ref.fa


## remove fasta sequence with duplicated ID 
awk '/^>/{f=!d[$1];d[$1]=1}f' Gisaid_clear_format.fasta > Gisaid_RMD.fasta
## calculate total sample(important for analysis)
grep -c '^>' Gisaid_RMD.fasta
## downsampling fasta seq:
seqkit sample --proportion 0.15 Gisaid_RMD.fasta > Gisaid_RMD_15.fasta


### Snap-calling
nucmer --forward -p nucmer ref.fa Gisaid_RMD.fasta
show-coords -r -c -l nucmer.delta > nucmer.coords
show-snps nucmer.delta -T -l > nucmer.snps

#Command line(single step)
# seqkit grep  -s -p - $input -v |sed 's/^M$//' |awk '/^>/{f=!d[$1];d[$1]=1}f' >Gisaid_RMD.fasta

followed by read into R:

nucmer<- read.delim('nucmer.snps',as.is=TRUE,skip=4,header=FALSE)

Than the downstream analysis can be performed by manipulating the nucmer object.

In summary, the following steps need to be performed to prepare the data:

  1. Create an account for GISAID login
  2. Filter and download the fasta file
  3. Preprocess the fasta file with linux command line
  4. Read the preprocessed data into R

Annotate the mutational events

Merge neighboring events of SNP, insertion and deletion

The first step for handling the nucmer object is to preprocess it by defining each column name and merging different types of mutations (single nucleotide polymorphism (SNP), insertion and deletion etc.), then the undated nucmer object could be annotated.

To best display the procedure, we provide example data:

library(CovidMutations)
#The example data:
data("nucmer")

Another way to implement the procedure is to read in the raw data (nucmer.snps) provided as extdata:

nucmer<- read.delim(unzip(system.file(package="CovidMutations", "extdata", "nucmer10.zip")), as.is = T, skip = 4, header = F)

Users can also download their coronavirus data to build the nucmer object as mentioned previously, once the nucmer.snps file is generated, the input nucmer object can be made by following steps:

options(stringsAsFactors = FALSE)
#read in R:
#nucmer<-read.delim("nucmer.snps",as.is=TRUE,skip=4,header=FALSE)
colnames(nucmer)<-c("rpos","rvar","qvar","qpos","","","","","rlength","qlength","","","rname","qname")
rownames(nucmer)<-paste0("var",1:nrow(nucmer))

After building nucmer object, it should be filtered to exclude unwanted symbols:


# Fix IUPAC codes, both the example data and data created by users should perform the steps below:
nucmer<-nucmer[!nucmer$qvar%in%c("B","D","H","K","M","N","R","S","V","W","Y"),]
nucmer<- mergeEvents(nucmer = nucmer)## This will update the nucmer object
head(nucmer)
#>      rpos rvar qvar qpos             rlength qlength        rname
#> var1    6    A    G    6  6  6 254 0   29903   29863 1 1 Untitled
#> var2   10    T    G    6  6  6 254 0   29903   29863 1 1 Untitled
#> var3   10    T    G    6  3  6 254 0   29903   29864 1 1 Untitled
#> var4   12   AT   CA    9  1  9 254 0   29903   29862 1 1 Untitled
#> var5   13    T    C   13 13 13 254 0   29903   29901 1 1 Untitled
#> var6   13    T    C   13 13 13 254 0   29903   29903 1 1 Untitled
#>                                                                qname
#> var1             hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16
#> var2 hCoV-19/India/NCDC7554_CSIR-IGIB/2020|EPI_ISL_482641|2020-05-27
#> var3               hCoV-19/Taiwan/144/2020|EPI_ISL_421641|2020-03-19
#> var4         hCoV-19/India/nimh-14723/2020|EPI_ISL_486388|2020-05-10
#> var5       hCoV-19/England/OXON-B1B67/2020|EPI_ISL_479076|2020-05-03
#> var6          hCoV-19/Germany/Br-ZK-1/2020|EPI_ISL_491115|2020-04-27

Provide effects of each SNP, insertion and deletion

The updated nucmer object is used as input for annotating the mutations, the refseq data and gff3 data are downloaded from ncbi database(Wu et al. 2020). The indelSNP() function returns the result of annotation as .csv file, or .rda file (saveRda = TRUE).

data("refseq")
data("gff3")
annot <- setNames(gff3[,10],gff3[,9])  #annot: subset the gene and its product from gff3 file
outdir <- tempdir()
covid_annot<- indelSNP(nucmer = nucmer,
         saveRda = FALSE,
         refseq = refseq,
         gff3 = gff3,
         annot = annot,
         outdir = outdir)

Mutation statistics and profile

Basic descriptions for the mutational events.

Plot the mutation statistics after annotating the nucmer object by indelSNP function. The example data covid_annot is downsampled annotation data generated by indelSNP function.

The updated numcer object (covid_annot) includes annotations for each mutation:

#data("covid_annot") We provide example covid_annot results produced by the `indelSNP` function
covid_annot <- as.data.frame(covid_annot)
head(covid_annot)
#>                                                sample refpos refvar qvar  qpos
#> 1 hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16      6      A    G     6
#> 2 hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16    241      C    T   241
#> 3 hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16   1059      C    T  1059
#> 4 hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16   3037      C    T  3037
#> 5 hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16   9891      C    T  9891
#> 6 hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16  11083      G    T 11083
#>   qlength protein variant   varclass
#> 1   29863   5'UTR       6 extragenic
#> 2   29863   5'UTR     241 extragenic
#> 3   29863    NSP2    T85I        SNP
#> 4   29863    NSP3   F106F SNP_silent
#> 5   29863    NSP4   A446V        SNP
#> 6   29863    NSP6    L37F        SNP
#>                                          annotation
#> 1                                              <NA>
#> 2                                              <NA>
#> 3                          Non-Structural protein 2
#> 4 Predicted phosphoesterase, papain-like proteinase
#> 5                             Transmembrane protein
#> 6                             Transmembrane protein

If the outdir is NULL, the plot shows in the R studio panel.


plotMutAnno(results = covid_annot,figureType = "MostMut", outdir = NULL)
Barplot of mutation counts for the downsampled data. The sample ID shown below the x axis.

Barplot of mutation counts for the downsampled data. The sample ID shown below the x axis.

Other types of figures available are shown below:

Average mutation counts for each sample:

plotMutAnno(results = covid_annot,figureType = "MutPerSample", outdir = NULL)
Barplot of average mutation counts for the downsampled data

Barplot of average mutation counts for the downsampled data

Most frequent events per class:

plotMutAnno(results = covid_annot,figureType = "VarClasses", outdir = NULL)
Barplot of most frequent mutational events for the downsampled data

Barplot of most frequent mutational events for the downsampled data

Most frequent mutational type:

plotMutAnno(results = covid_annot,figureType = "VarType", outdir = NULL)
Barplot of most frequent mutational type for the downsampled data

Barplot of most frequent mutational type for the downsampled data

Most frequent events (nucleotide):

plotMutAnno(results = covid_annot,figureType = "NucleoEvents", outdir = NULL)
Barplot of most frequent events (nucleotide) for the downsampled data

Barplot of most frequent events (nucleotide) for the downsampled data

Most frequent events (proteins):

plotMutAnno(results = covid_annot,figureType = "ProEvents", outdir = NULL)
Barplot of most frequent events (proteins) for the downsampled data

Barplot of most frequent events (proteins) for the downsampled data

The most frequent mutational events for each protein

Identifying mutational profile for selected protein is critical for understanding virus evolution and targeted therapy(Taiaroa et al. 2020), also, hypothesizing targetable targets for drug design(Sanchez-Padilla et al. 2015). The plotMutProteins function is for displaying representative mutational events in the coding proteins that relevant to virus infection. The proteinName parameter is specified for SARS-CoV-2 genome. See available choices by ?plotMutProteins. Change the top parameter to choose numbers of observations.

#data("covid_annot")
covid_annot <- as.data.frame(covid_annot)
plotMutProteins(results = covid_annot,proteinName = "NSP2", top = 20, outdir = NULL)
Barplot of most mutated variant for each protein

Barplot of most mutated variant for each protein

Preprocess nucmer object to add group information

To assess the geographical distribution of virus strain and provide global profile of SNPs, nucmer is processed to add additional group information and update the column name, the chinalist data is for replacing cities in China into “China” to make the distribution analysis easier. The function nucmerRMD returns an updated nucmer object (to distinguish it from nucmer, this updated nucmer object is called nucmerr in the following session). .

#data("nucmer")
data("chinalist")
#outdir <- tempdir()
nucmerr<- nucmerRMD(nucmer = nucmer, outdir = NULL, chinalist = chinalist)
head(nucmerr)
#>      rpos rvar qvar qpos
#> var1    6    A    G    6
#> var2   10    T    G    6
#> var3   10    T    G    6
#> var4   12   AT   CA    9
#> var5   13    T    C   13
#> var6   13    T    C   13
#>                                                                   ID
#> var1             hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16
#> var2 hCoV-19/India/NCDC7554_CSIR-IGIB/2020|EPI_ISL_482641|2020-05-27
#> var3               hCoV-19/Taiwan/144/2020|EPI_ISL_421641|2020-03-19
#> var4         hCoV-19/India/nimh-14723/2020|EPI_ISL_486388|2020-05-10
#> var5       hCoV-19/England/OXON-B1B67/2020|EPI_ISL_479076|2020-05-03
#> var6          hCoV-19/Germany/Br-ZK-1/2020|EPI_ISL_491115|2020-04-27
#>              sample       time country M_type   PM_type
#> var1 EPI_ISL_443266 2020-03-16  France   A->G    6:A->G
#> var2 EPI_ISL_482641 2020-05-27   India   T->G   10:T->G
#> var3 EPI_ISL_421641 2020-03-19  Taiwan   T->G   10:T->G
#> var4 EPI_ISL_486388 2020-05-10   India AT->CA 12:AT->CA
#> var5 EPI_ISL_479076 2020-05-03 England   T->C   13:T->C
#> var6 EPI_ISL_491115 2020-04-27 Germany   T->C   13:T->C

For analyzing virus data other than SARS-CoV-2, users can also make their own nucmerr object, following code below:

#make sure that the nucmer object has the right column name: "rpos","rvar","qvar","qpos","ID"
nucmer <- nucmer[,c(1,2,3,4,14)]
colnames(nucmer)<-c("rpos","rvar","qvar","qpos","ID")

Add sample, time, country group columns: please make sure that the nucmer ID comprises necessary group information in format like:

hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16

Extract simplified ID (EPI_ISL_443266), time (2020-03-16) and country (France) from nucmer ID and add mutation types:

nucmer$sample <-vapply(strsplit(as.character(nucmer$ID), "[|]"), function(x) x[2], character(1))
nucmer$time <-vapply(strsplit(as.character(nucmer$ID), "[|]"), function(x) x[3], character(1))
nucmer$country <-vapply(strsplit(as.character(nucmer$ID), "[/]"), function(x) x[2], character(1))

#M_type represents mutation type, PM_type represents positions and mutation type.  
nucmer$M_type <-str_c(nucmer$rvar,nucmer$qvar,sep ="->")
nucmer$PM_type <-str_c(nucmer$rpos,nucmer$M_type,sep =":")

The updated nucmer is nucmerr now, check the format:

nucmerr <- nucmer
head(nucmerr)
#>      rpos rvar qvar qpos
#> var1    6    A    G    6
#> var2   10    T    G    6
#> var3   10    T    G    6
#> var4   12   AT   CA    9
#> var5   13    T    C   13
#> var6   13    T    C   13
#>                                                                   ID
#> var1             hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16
#> var2 hCoV-19/India/NCDC7554_CSIR-IGIB/2020|EPI_ISL_482641|2020-05-27
#> var3               hCoV-19/Taiwan/144/2020|EPI_ISL_421641|2020-03-19
#> var4         hCoV-19/India/nimh-14723/2020|EPI_ISL_486388|2020-05-10
#> var5       hCoV-19/England/OXON-B1B67/2020|EPI_ISL_479076|2020-05-03
#> var6          hCoV-19/Germany/Br-ZK-1/2020|EPI_ISL_491115|2020-04-27
#>              sample       time country M_type   PM_type
#> var1 EPI_ISL_443266 2020-03-16  France   A->G    6:A->G
#> var2 EPI_ISL_482641 2020-05-27   India   T->G   10:T->G
#> var3 EPI_ISL_421641 2020-03-19  Taiwan   T->G   10:T->G
#> var4 EPI_ISL_486388 2020-05-10   India AT->CA 12:AT->CA
#> var5 EPI_ISL_479076 2020-05-03 England   T->C   13:T->C
#> var6 EPI_ISL_491115 2020-04-27 Germany   T->C   13:T->C

Global SNP profiling in virus genome

The mutational profile (global SNPs) is visualized by function globalSNPprofile, so that the typical mutational pattern for samples data in the nucmerr object is presented. The nucmerr example data is downsampled data, preprocessed by nucmerRMD function. Specify the figure_Type parameter to display either heatmap or count plot.

#data("nucmerr") We provide example nucmerr object produced by the `nucmerRMD` function
head(nucmerr)
#>      rpos rvar qvar qpos
#> var1    6    A    G    6
#> var2   10    T    G    6
#> var3   10    T    G    6
#> var4   12   AT   CA    9
#> var5   13    T    C   13
#> var6   13    T    C   13
#>                                                                   ID
#> var1             hCoV-19/France/B5322/2020|EPI_ISL_443266|2020-03-16
#> var2 hCoV-19/India/NCDC7554_CSIR-IGIB/2020|EPI_ISL_482641|2020-05-27
#> var3               hCoV-19/Taiwan/144/2020|EPI_ISL_421641|2020-03-19
#> var4         hCoV-19/India/nimh-14723/2020|EPI_ISL_486388|2020-05-10
#> var5       hCoV-19/England/OXON-B1B67/2020|EPI_ISL_479076|2020-05-03
#> var6          hCoV-19/Germany/Br-ZK-1/2020|EPI_ISL_491115|2020-04-27
#>              sample       time country M_type   PM_type
#> var1 EPI_ISL_443266 2020-03-16  France   A->G    6:A->G
#> var2 EPI_ISL_482641 2020-05-27   India   T->G   10:T->G
#> var3 EPI_ISL_421641 2020-03-19  Taiwan   T->G   10:T->G
#> var4 EPI_ISL_486388 2020-05-10   India AT->CA 12:AT->CA
#> var5 EPI_ISL_479076 2020-05-03 England   T->C   13:T->C
#> var6 EPI_ISL_491115 2020-04-27 Germany   T->C   13:T->C
#outdir <- tempdir()
globalSNPprofile(nucmerr = nucmerr, outdir = NULL, figure_Type = "heatmap", country = "global", top = 5)
Global SNPs pattern across the genome

Global SNPs pattern across the genome

The country parameter is for choosing a country area to plot local mutational profile (“USA”, “india”, “China”, “England”. etc.). If country = global, the output is for mutations across all countries.

globalSNPprofile(nucmerr = nucmerr, outdir = NULL, figure_Type = "heatmap", country = "USA", top = 5)
Local SNPs pattern across the genome (for USA)

Local SNPs pattern across the genome (for USA)

Now we can compare the “USA” pattern with the “England” pattern:

globalSNPprofile(nucmerr = nucmerr, outdir = NULL, figure_Type = "heatmap", country = "England", top = 5)
Local SNPs pattern across the genome (for England)

Local SNPs pattern across the genome (for England)

The country list:

 [1] "Algeria"     "Argentina"   "Australia"   "Austria"     "Bahrein"    
 [6] "Bangladesh"  "bat"         "Belgium"     "Benin"       "Brazil"     
[11] "Brunei"      "Bulgaria"    "Canada"      "Chile"       "China"      
[16] "Colombia"    "Croatia"     "Cyprus"      "Denmark"     "DRC"        
[21] "Ecuador"     "Egypt"       "England"     "env"         "Estonia"    
[26] "Felis"       "Finland"     "France"      "Gambia"      "Georgia"    
[31] "Germany"     "Greece"      "Hungary"     "Iceland"     "india"      
[36] "India"       "Indonesia"   "Iran"        "Ireland"     "Israel"     
[41] "Italy"       "ITALY"       "Jamaica"     "Japan"       "Jordan"     
[46] "Kazakhstan"  "Kenya"       "Korea"       "Latvia"      "Lebanon"    
[51] "Luxembourg"  "Malaysia"    "Mexico"      "mink"        "Mongolia"   
[56] "Morocco"     "Netherlands" "New"         "Nigeria"     "Norway"     
[61] "Oman"        "Pakistan"    "pangolin"    "Poland"      "Portugal"   
[66] "Puerto"      "Romania"     "Russia"      "Scotland"    "Senegal"    
[71] "Serbia"      "Shaoxing"    "Singapore"   "Slovakia"    "South"      
[76] "SouthAfrica" "Spain"       "Sweden"      "Switzerland" "Taiwan"     
[81] "Thailand"    "Tunisia"     "Turkey"      "Uganda"      "United"     
[86] "Uruguay"     "USA"         "Venezuela"   "Vietnam"     "Wales"      
[91] "Yichun"  

Global protein mutational events profiling

To better guide the production of molecule-targeted drugs, further analyzing the global profile of protein mutations is needed. The heatmap below shows some significant pattern of SARS-CoV-2 mutational effects in several regions. The ‘top’ parameter is for specifying the number of observations to display in the final figure:

globalProteinMut(covid_annot = covid_annot, outdir = NULL, figure_Type = "heatmap", top = 10, country = "global")
Global protein mutational pattern across the genome

Global protein mutational pattern across the genome

The count plot is available by adjusting the figure_Type parameter:

globalProteinMut(covid_annot = covid_annot, outdir = NULL, figure_Type = "count", top = 10, country = "global")
Global protein mutational counts across the genome

Global protein mutational counts across the genome

Like globalSNPprofile function, the country parameter is also available:

globalProteinMut(covid_annot = covid_annot, outdir = NULL, figure_Type = "heatmap", top = 10, country = "USA")
Local protein mutational pattern across the genome (for USA)

Local protein mutational pattern across the genome (for USA)

We can compare the patterns between two countries:

globalProteinMut(covid_annot = covid_annot, outdir = NULL, figure_Type = "heatmap", top = 10, country = "England")
Local protein mutational pattern across the genome (for England)

Local protein mutational pattern across the genome (for England)

Plot mutation distribution and mutation statistics

Visualization for the top mutated samples, average mutational counts, top mutated position in the genome, mutational density across the genome and distribution of mutations across countries. Try ?mutStat to see available options for the figure_Type parameter.

#outdir <- tempdir()
mutStat(nucmerr = nucmerr,
        outdir = NULL,
        figure_Type = "TopMuSample",
        type_top = 10,
        country = FALSE,
        mutpos = NULL)
Top 10 mutated samples

Top 10 mutated samples

If the figure type is “TopCountryMut”, mutpos can specify a range of genomic position (e.g. 28831:28931) for density plot.

#outdir <- tempdir()
mutStat(nucmerr = nucmerr,
        outdir = NULL,
        figure_Type = "TopCountryMut",
        type_top = 10,
        country = TRUE,  #involve country distribution, country =TRUE
        mutpos = 28831:28931)
Mutational density across the genome in different countries

Mutational density across the genome in different countries

The “MutDens” figure_Type is for mutational density profiling by position:

#outdir <- tempdir()
mutStat(nucmerr = nucmerr,
        outdir = NULL,
        figure_Type = "MutDens",
        type_top = NULL, 
        country = FALSE,  
        mutpos = NULL)
Global mutational density profiling

Global mutational density profiling

Plot mutation counts for certain genes

Genes responsible for virus infection or transmission may have higher mutation counts. So we plot mutation counts for each gene to identify the gene most susceptible to mutation.

The “gene_position” file is derived from gff3 file. If figurelist = FALSE, mutation counts for each gene is saved as figure file.

data("gene_position")
#outdir <- tempdir()
MutByGene(nucmerr = nucmerr, gff3 = gene_position, figurelist = TRUE, outdir = NULL)
Mutation counts for each gene

Mutation counts for each gene

#if figurelist = TRUE, the recommendation for figure display(in pixel)is: width=1650, height=1300

Assays efficiency

The detection of SARS-CoV-2 is important for the prevention of the outbreak and management of patients. RT-PCR assay is one of the most effective molecular diagnosis strategies to detect virus in clinical laboratory(Kilic, Weissleder, and Lee 2020).

Calculate the mutation ratio using different assays

The assays example data contains some current assays for detecting SARS-CoV-2 genome for different regions, like ORF-1ab, N protein, etc. The “F1”, “F2”, “R1”, “R2” refer to genomic positions of two pairs of primers (forward primer and reverse primer). The “P1”,“P2” refer to genomic positions of probes for detection. As SARS-CoV-2 genome is still evolving over time, we assume that assays for detecting this pathogen should also be changed and optimized given that assays detecting high-mutation-ratio region may lead to more false negatives, which may cause severe results. Users can design their own assay and transform the assays information according to the format shown by the example data to evaluate whether their assays are efficient for detecting potential SARS-CoV-2.

data("assays")
assays
#>              Assay    F1    F2    R1    R2    P1    P2
#> 3  ChinaCDC-ORF1ab 13342 13362 13442 13460 13377 13404
#> 4       ChinaCDC-N 28881 28902 28958 28979 28934 28953
#> 5     Charite-RdRP 15431 15452 15505 15530 15469 15494
#> 6        Charite-E 26269 26294 26360 26381 26332 26357
#> 7        HKU-ORF1b 18778 18797 18888 18909 18849 18872
#> 8            HKU-N 29145 29166 29236 29254 29177 29196
#> 9        USACDC-N1 28287 28306 28335 28358 28309 28332
#> 10       USACDC-N2 29164 29183 29213 29230 29188 29210
#> 11       USACDC-N3 28681 28702 28732 28752 28704 28727
#> 12      Thailand-N 28320 28339 28358 28376 28341 28356

The function AssayMutRatio is to use the well established RT-PCR assays information to detect mutation ratio in different SARS-CoV-2 genomic sites. The output will be series of figures presenting the mutation profile using a specific assay (output as file, the arrows show the directions of primers (red) and probes (blue). F1, R2, P1 for 5’, F2, R1, P2 for 3’) and a figure for comparison between the mutation ratio by each assay (if outdir is NULL, it returns only the figure for comparison). To some extent we believe that the lower the mutation ratio, the higher the reliability of RT-PCR.

Total <- 4000 ## Total Cleared GISAID fasta data, sekitseq
#outdir <- tempdir()
#Output the results
AssayMutRatio(nucmerr = nucmerr,
              assays = assays,
              totalsample = Total,
              plotType = "logtrans",
              outdir = NULL)
Mutation detection rate using different assays

Mutation detection rate using different assays

If outdir is specified, one of the output figure of assays is like:

Bacth assay analysis for last five Nr of primers

Last five nucleotides of primer mutation count/type for any RT-PCR primer. This is also an evaluation aspect for assays efficiency.

The LastfiveNrMutation function returns mutation counts(last five nucleotides for each primer) for each assay as output. If the figurelist = FALSE, it outputs each assay detecting profile as image file separately.

totalsample <- 4000

LastfiveNrMutation(nucmerr = nucmerr,
                    assays = assays,
                    totalsample = totalsample,
                    figurelist = TRUE,
                    outdir = NULL)
Mutation detection counts (last five nucleotides for each primer) using different assays

Mutation detection counts (last five nucleotides for each primer) using different assays

e.g., the last five nucleotides mutation pattern of primers of ChinaCDC-N:

Detection of co-occurring mutations using double-assay

Moreover, researchers may have a question about the performance of double assays in detecting samples with co-occurring mutations (significant mutational pattern), as some mutations are definitely co-occurring more frequently. Further, specificity of a test is enhanced by targeting multiple loci(Kilic, Weissleder, and Lee 2020). Here we designed a function for simultaneously measuring the mutated samples using two different assays. The information of first assay and second assay should be structured like the example assays data, containing F1, F2, R1, R2, P1, P2 sites.

Users can use their personalized assay data to implement the code below.

assay1 <- assays[1,]
assay2 <- assays[2,]
doubleAssay(nucmerr = nucmerr,
            assay1 = assay1,
            assay2 = assay2,
            outdir = NULL)
Mutation detection counts (last five nucleotides for each primer) using double assays. The samples shown on the y-axis for both assays are co-occurring mutated samples. Because the downsampled example data is too small, here we also display the actual results using real data shown below

Mutation detection counts (last five nucleotides for each primer) using double assays. The samples shown on the y-axis for both assays are co-occurring mutated samples. Because the downsampled example data is too small, here we also display the actual results using real data shown below

The co_Mutation_Ratio is calculated by dividing the number of samples with co-occurring mutations by the total mutated samples detected using each assay.

The co-occurring mutated samples are shown as the overlap in the “venn” diagram. The doubleAssay function returns the information of these samples as .csv file by giving the outdir parameter.

We believe that the lower the co_Mutation_Ratio , the more efficient the detection.

Packages used

This workflow depends on R version 3.6.2 (2019-12-12) or higher. The complete list of the packages used for this workflow are shown below:

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] VennDiagram_1.7.3    futile.logger_1.4.3  dplyr_1.1.4         
#>  [4] ggpubr_0.6.0         stringr_1.5.1        seqinr_4.2-36       
#>  [7] cowplot_1.1.3        ggplot2_3.5.1        CovidMutations_0.1.3
#> [10] BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9           utf8_1.2.4           generics_0.1.3      
#>  [4] tidyr_1.3.1          rstatix_0.7.2        futile.options_1.0.1
#>  [7] stringi_1.8.4        digest_0.6.37        magrittr_2.0.3      
#> [10] evaluate_1.0.1       fastmap_1.2.0        jsonlite_1.8.9      
#> [13] backports_1.5.0      Formula_1.2-5        formatR_1.14        
#> [16] BiocManager_1.30.25  purrr_1.0.2          fansi_1.0.6         
#> [19] scales_1.3.0         ade4_1.7-22          jquerylib_0.1.4     
#> [22] abind_1.4-8          cli_3.6.3            rlang_1.1.4         
#> [25] munsell_0.5.1        withr_3.0.2          cachem_1.1.0        
#> [28] yaml_2.3.10          tools_4.4.2          ggsignif_0.6.4      
#> [31] colorspace_2.1-1     lambda.r_1.2.4       broom_1.0.7         
#> [34] buildtools_1.0.0     vctrs_0.6.5          R6_2.5.1            
#> [37] lifecycle_1.0.4      car_3.1-3            MASS_7.3-61         
#> [40] pkgconfig_2.0.3      pillar_1.9.0         bslib_0.8.0         
#> [43] gtable_0.3.6         Rcpp_1.0.13-1        glue_1.8.0          
#> [46] xfun_0.49            tibble_3.2.1         tidyselect_1.2.1    
#> [49] sys_3.4.3            knitr_1.49           farver_2.1.2        
#> [52] htmltools_0.5.8.1    labeling_0.4.3       carData_3.0-5       
#> [55] rmarkdown_2.29       maketools_1.3.1      compiler_4.4.2

References

Corman, V. M., O. Landt, M. Kaiser, R. Molenkamp, A. Meijer, D. K. Chu, T. Bleicker, et al. 2020. “Detection of 2019 Novel Coronavirus (2019-nCoV) by Real-Time RT-PCR.” Euro Surveillance : Bulletin Europeen Sur Les Maladies Transmissibles = European Communicable Disease Bulletin 25 (3). https://doi.org/10.2807/1560-7917.ES.2020.25.3.2000045.
Delcher, Arthur L., Adam Phillippy, Jane Carlton, and Steven L. Salzberg. 2002. “Fast Algorithms for Large-Scale Genome Alignment and Comparison.” Nucleic Acids Research 30 (11): 2478–83. https://doi.org/10.1093/nar/30.11.2478.
Dong, Ensheng, Hongru Du, and Lauren Gardner. 2020. “An Interactive Web-Based Dashboard to Track COVID-19 in Real Time.” The Lancet Infectious Diseases 20 (5): 533–34. https://doi.org/10.1016/S1473-3099(20)30120-1.
Elbe, Stefan, and Gemma Buckland-Merrett. 2017. “Data, Disease and Diplomacy: GISAID’s Innovative Contribution to Global Health.” Global Challenges (Hoboken, NJ) 1 (1): 33–46. https://doi.org/10.1002/gch2.1018.
Kilic, Tugba, Ralph Weissleder, and Hakho Lee. 2020. “Molecular and Immunological Diagnostic Tests of COVID-19: Current Status and Challenges.” iScience 23 (8): 101406. https://doi.org/10.1016/j.isci.2020.101406.
Kim, Dongwan, Joo-Yeon Lee, Jeong-Sun Yang, Jun Won Kim, V. Narry Kim, and Hyeshik Chang. 2020. “The Architecture of SARS-CoV-2 Transcriptome.” Cell 181 (4): 914–921.e10. https://doi.org/10.1016/j.cell.2020.04.011.
Mercatelli, Daniele, and Federico Manuel Giorgi. 2020. Geographic and Genomic Distribution of SARS-CoV-2 Mutations. https://doi.org/10.20944/preprints202004.0529.v1.
Sanchez-Padilla, Elisabeth, Matthias Merker, Patrick Beckert, Frauke Jochims, Themba Dlamini, Patricia Kahn, Maryline Bonnet, and Stefan Niemann. 2015. “Detection of Drug-Resistant Tuberculosis by Xpert MTB/RIF in Swaziland.” The New England Journal of Medicine 372 (12): 1181–82. https://doi.org/10.1056/NEJMc1413930.
Shen, Wei, Shuai Le, Yan Li, and Fuquan Hu. 2016. “SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/q File Manipulation.” PloS One 11 (10): e0163962. https://doi.org/10.1371/journal.pone.0163962.
Taiaroa, George, Daniel Rawlinson, Leo Featherstone, Miranda Pitt, Leon Caly, Julian Druce, Damian Purcell, et al. 2020. Direct RNA Sequencing and Early Evolution of SARS-CoV-2. Vol. 34. https://doi.org/10.1101/2020.03.05.976167.
Web Page. n.d.b. https://www.who.int/emergencies/diseases/novel-coronavirus-2019.
———. n.d.a. https://www.finddx.org/covid-19/sarscov2-eval-molecular/.
Wu, Fan, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, et al. 2020. “A New Coronavirus Associated with Human Respiratory Disease in China.” Nature 579 (7798): 265–69. https://doi.org/10.1038/s41586-020-2008-3.