Chapter 2.1. Improving in vitro diagnostic design with in silico support: Part 1 Sequence-based informatics with simple tools


Eugene Boon Beng Ong, Gee Jun Tye and Yee Siew Choong

Art work
Diagnostic technologies should be designed to function 
in the presence of humidity, extreme temperatures, and/or dust
…One wartime winter when I lay sick
Six Winters
Tomas Tranströmer. 


The traditional approach to the discovery and validation of antigens or biomarkers is lengthy, cumbersome, and laborious. Although this is understandable owing to the technical limitations of the past, the recent explosion of data from the whole genome sequencing of DNA (deoxyribonucleic acid) from humans and other organisms has made available vast amounts of information at the genetic and proteomic levels. As a result, a new approach and a new set of analytical tools are needed to make sense of big data. The field of bioinformatics traces its roots to the late 1970s alongside the beginning of modern DNA sequencing. This was followed by the automation of DNA sequencing in the 1980s, which also meant an increase in throughput and the necessity to manage the deluge of data. The linear coding nature of DNA has allowed the digitization of hereditary information, and although the code has not been completely demystified, we are now capable of reading and understanding whole genome sequences using bioinformatics tools. The bioinformatics tools available today tackle data using sequence- and structure-based approaches. The cost of establishing simple bioinformatics analysis is much lower than that of conducting wet experiments, where proper laboratory setup and safety regulations are required. Modern desktop computers and laptops have enough computing power to conduct simple bioinformatics analyses. This chapter does not deal with the more advanced bioinformatics of assembling or analyzing next-generation sequencing data. Instead, we discuss simple approaches using publicly available tools applied to readily available data for research into diagnostics and related fields, especially for researchers in resource-poor settings.


GenBank is a public sequence database provided by the United States National Center for Biotechnology Information (NCBI) (URLs are listed at the end of the chapter). It is an annotated collection of all publicly available nucleotide sequences and their protein translations [1]. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the NCBI; the three databases exchange data daily. The NCBI offers comprehensive educational resources, including training and tutorials, for users of its website. In areas with limited internet connectivity, full genome sequences for organisms of interest can be downloaded in advance and copied locally for dissemination. This is useful not only for research in resource-poor settings but also for training and educational purposes.

The European Bioinformatics Institute of the European Molecular Biology Laboratory (EMBL-EBI) also hosts freely available and up-to-date molecular databases from life science experiments, and offers resources for online training. Some advanced bioinformatics software is also free for academic use. These software packages can be divided into those used for nucleotide analysis and those used for protein analysis.

Starting from DNA sequencing data, the genome is assembled and annotated to predict coding genes, pseudogenes, promoters and regulatory regions, untranslated regions, repeats, and other features [2]. The DNA sequence itself can reveal much information about a pathogen, and can be used in the development of DNA-based diagnostics. For protein antigen studies, the Universal Protein Resource (UniProt) repository catalogs comprehensive protein information: protein and gene names; function; physicochemical properties; enzyme-specific information such as catalytic activity, cofactors, and catalytic residues; subcellular location; protein–protein interactions; patterns of expression; locations and roles of significant domains and sites; ontology; ion-, substrate-, and cofactor-binding sites; and protein variant forms produced by natural genetic variation, RNA editing, alternative splicing, and proteolytic processing [3]. The UniProt database is partially curated by experts based on publications, and also contains computer predictions for post-translational modifications, transmembrane domains and topology, signal peptides, domain identification, and protein family classification. UniProt has a built-in suite of analytical tools for sequence similarity searches and sequence alignments, and also allows customizable search results to be downloaded for further local analysis [4].

Apart from the general databases described above, other organism-specific databases are available, and are maintained by their respective research communities. For example, several databases are available for Plasmodium spp., the parasite associated with malaria. Some of these organism-specific sites such as PlasmoDB also serve as one-stop centers for information dissemination, meetings, and the latest updates and findings in the respective communities [5].

There are also databases for specific protein families such as those for virulence factors (VFs) from across bacterial species. These databases facilitate research by collectively presenting the VFs of various medically significant bacterial pathogens. For example, the aim of the virulence factor database (VFDB) is to provide a source for scientists to rapidly access current knowledge about VFs from various bacterial pathogens [6, 7]. The VFDB has a user-friendly interface that supports searches by genus or by keyword, and offers a basic local alignment search tool (BLAST) search against all known VF-related genes. A similar database, MvirDB (http://mvirdb.llnl.gov), organizes sequences representing known toxins, VFs, and antibiotic resistance genes [8].

Simple Bioinformatics Tools

Another all-in-one online bioinformatics resource for proteomics is ExPASy, an extensive and integrative portal that accesses many scientific resources, databases, and software tools catering for different areas of life science. It aims to provide seamless access to resources for proteomics, genomics, phylogeny, systems biology, evolution, population genetics, transcriptomics, etc. On the basis of the primary structure or amino acid sequence of proteins, basic information can be determined and predictions made about their properties such as molecular weight, isoelectric point, titration curves, hydrophobicity, stability, and the presence of signal peptides. One such server for sequence-based analysis and prediction of structural features is SCRATCH. The SCRATCH software suite includes predictors for secondary structure, relative solvent accessibility, disordered regions, domains, disulfide bridges, single mutation stability, residue contacts versus the average individual residue contacts, and tertiary structure [9].
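As a rough illustration of how such properties follow directly from the primary structure, the sketch below computes two of them in plain Python: the average molecular weight and the grand average of hydropathy (GRAVY) on the Kyte-Doolittle scale. It is a minimal sketch, not the method used by ExPASy or SCRATCH; the function names are hypothetical, the demo peptide is made up, and only the 20 standard amino acids are handled.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free termini


def gravy(sequence):
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence."""
    sequence = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def molecular_weight(sequence):
    """Return the average molecular weight of a protein in daltons."""
    sequence = sequence.upper()
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


# Made-up peptide, used only to show the calls.
demo_peptide = "MKKLLPILIGLSLSGFSSLSQA"
print(f"GRAVY: {gravy(demo_peptide):.2f}")
print(f"Molecular weight: {molecular_weight(demo_peptide):.1f} Da")
```

A positive GRAVY value suggests an overall hydrophobic sequence and a negative value a hydrophilic one. A non-standard residue code raises a `KeyError`, which is a deliberate choice here so that unexpected input is not silently ignored.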

The identification of antigenic proteins that can elicit significant humoral immune responses is important for immunological studies, and also for diagnostic and vaccine design. In this sense, the availability of annotated genomes allows in silico screening of a pathogen’s entire proteome for biomarker discovery. The computational prediction of antigenic determinants and B-cell epitopes has been an active area of research since the early 1980s [10, 11]. As an example, Magnan et al. developed a high-throughput prediction method for protein antigenicity using data from protein microarrays of several pathogens. Using sequence-based machine learning, the technique was able to predict the degree of humoral immune response to novel proteins [12]. The resulting prediction software, ANTIGENpro, is a sequence-based, alignment-free, and pathogen-independent predictor of protein antigenicity that has been integrated into the SCRATCH suite of predictors.

In another protein microarray study of Salmonella enterica serovar Typhi, the researchers analyzed serodominant and serodiagnostic antigens using in silico predictions for transmembrane domains, signal peptides, subcellular localizations, and isoelectric points [13]. Their analysis revealed that proteins with one transmembrane domain or proteins predicted to have a signal peptide were significantly enriched in the serodominant and serodiagnostic antigen groups of S. Typhi proteins. Although their study cautioned that no single proteomic feature or category of features was sufficient to identify all the signature antigens, it demonstrated the power of bioinformatics analysis in expanding the depth of experimental data.
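To make the idea of enrichment concrete, the sketch below computes a one-sided hypergeometric p-value: the probability of seeing at least the observed number of proteins with a feature (for example, a predicted signal peptide) in an antigen group if the group had been drawn at random from the proteome. This is an assumed, commonly used test chosen for illustration; the source does not state which statistical test the cited study used, and the function name and the counts in the demo are hypothetical.

```python
from math import comb


def enrichment_p_value(total, total_with_feature, group_size, group_with_feature):
    """One-sided hypergeometric test for over-representation.

    total               number of proteins in the proteome
    total_with_feature  proteins in the proteome that carry the feature
    group_size          number of proteins in the antigen group
    group_with_feature  proteins in the group that carry the feature

    Returns P(X >= group_with_feature) under random sampling.
    """
    upper = min(total_with_feature, group_size)
    denominator = comb(total, group_size)
    tail = sum(
        comb(total_with_feature, i)
        * comb(total - total_with_feature, group_size - i)
        for i in range(group_with_feature, upper + 1)
    )
    return tail / denominator


# Hypothetical counts, not taken from any study.
p = enrichment_p_value(total=4000, total_with_feature=400,
                       group_size=50, group_with_feature=15)
print(f"p-value for enrichment: {p:.3g}")
```

A small p-value indicates that the feature occurs in the group more often than chance alone would explain, which is what "significantly enriched" means in this context.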

For advanced bioinformatics analysis there is PATRIC, a Bacterial Bioinformatics Resource Center with an information system that is designed to support the biomedical research community’s work on bacterial infectious diseases via integration of vital pathogen information with rich data and analysis tools [14]. PATRIC collates and focuses available bacterial phylogenomic data from numerous sources specifically for the bacterial research community. This effort saves biologists time and manpower when conducting comparative analyses. The freely available PATRIC platform provides an interface for biologists to discover data and information, and conduct comprehensive comparative genomics and other analyses at a one-stop shop.

Using the databases and tools mentioned above, simple analyses can be carried out on publicly available data. However, although much information can be gleaned from these sequence-based approaches, their practicality can be limited given the non-linear conformational nature of the epitopes in active folded proteins in vivo, and the immune system’s ability to adapt its response to antigens [15, 16]. For these reasons, structural bioinformatics approaches can be applied, and they are discussed in the following chapter.

Case study: Antigenic prediction of putative exported proteins

  • Database: UniProt
  • Bioinformatics tools: ANTIGENpro and SignalP
  • Organism: Salmonella enterica serovar Typhi
  • Other software: MS Excel

In this case study, which uses bioinformatics tools to predict antigenic exported proteins of S. Typhi, we began by searching the UniProt website for all S. Typhi proteins. We used the “Advanced Search” option and selected the “Organism” field to search for the term “Salmonella Typhi”. The results showed all reviewed and unreviewed entries (5,411 results). To limit our search, we selected the “complete proteome set (4,876)” option. The results can be customized to show or hide information such as entry (UniProtID), protein names, gene names, amino acid length, protein families, protein existence (inferred from homology or experimentally validated), and the subcellular location of the protein. The complete proteome set can be downloaded in various formats such as MS Excel, FASTA, GFF, or flat text. An example of the customizable information (columns) that can be retrieved from a UniProt search is shown in Table 1.

Table 1. Selected results from a search of S. Typhi proteins in UniProt (accessed April 2014).
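When the proteome is downloaded in FASTA format instead of MS Excel, the sequences can be read locally with a few lines of code. The sketch below is one possible minimal parser in plain Python. The function names are hypothetical, the demo records are made up, and the accession helper assumes the usual UniProt header layout of `db|ACCESSION|ENTRY_NAME` followed by a description.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def accession(header):
    """Extract the accession from a header such as 'tr|ACC|NAME description'.

    Falls back to the first word when the header is not pipe-delimited.
    """
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else header.split()[0]


# Made-up records standing in for a downloaded proteome file.
demo = io.StringIO(
    ">tr|X00001|X00001_DEMO Hypothetical exported protein\n"
    "MKKLLPILIG\n"
    "LSLSGFSSLSQA\n"
    ">tr|X00002|X00002_DEMO Hypothetical protein\n"
    "MSTNPKPQRK\n"
)
for header, sequence in read_fasta(demo):
    print(accession(header), len(sequence))
```

To read a real download, the `io.StringIO` object can be replaced with an open file handle, for example `open("proteome.fasta")`, where the file name is a placeholder.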

After the data are downloaded in MS Excel format, the S. Typhi proteome can be sorted and filtered in a spreadsheet using the built-in MS Excel features. For example, the proteome can be sorted by amino acid length from the shortest to the longest. To search for putative exported proteins in the S. Typhi proteome, the “Conditional Formatting” tool can be used to highlight entries containing the keywords “exported” or “secreted”. The highlighted entries should then be inspected manually so that only relevant proteins are taken forward for further investigation. Table 2 shows selected putative exported or secreted proteins with sizes in the range 52–57 kDa. To determine whether a protein contains a signal peptide, which is an indicator of secretion, its sequence (obtained from UniProt) can be analyzed using the SignalP server. The SignalP 4.1 server uses a combination of several artificial neural networks to predict the presence and location of signal peptide cleavage sites in amino acid sequences from Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes [17]. Using the BLAST feature in UniProt, the proteins can be checked against the database for homologous proteins. Additional protein information such as molecular weight, gene ontology, and DNA sequence can be obtained from the UniProt entry.

Table 2. Predicted antigenicity of selected putative exported and secreted proteins of S. Typhi.

  a. The presence of a signal peptide in each protein was predicted using the SignalP 4.1 server.
  b. The antigenicity of the proteins was predicted using ANTIGENpro.
  c. The percent identity of the proteins to E. coli proteins was determined using the BLAST feature in UniProt.
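The keyword search and length sort described for the spreadsheet can also be scripted, which is convenient when the analysis has to be repeated for another organism. The sketch below is a minimal example in plain Python that assumes a tab-separated export with the column names "Entry", "Protein names", and "Length"; these names, the function name, and the demo rows are assumptions and should be adjusted to match the downloaded file.

```python
import csv
import io

KEYWORDS = ("exported", "secreted")


def find_candidates(handle, min_length=None, max_length=None):
    """Return (entry, protein name, length) tuples whose protein name
    contains one of KEYWORDS, sorted from shortest to longest."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        name = row["Protein names"]
        length = int(row["Length"])
        if not any(keyword in name.lower() for keyword in KEYWORDS):
            continue
        if min_length is not None and length < min_length:
            continue
        if max_length is not None and length > max_length:
            continue
        hits.append((row["Entry"], name, length))
    return sorted(hits, key=lambda hit: hit[2])


# Made-up rows standing in for a downloaded proteome table.
demo = io.StringIO(
    "Entry\tProtein names\tLength\n"
    "X00001\tPutative exported protein\t512\n"
    "X00002\tDNA polymerase\t928\n"
    "X00003\tPutative secreted peptidase\t468\n"
    "X00004\tPutative exported protein\t120\n"
)
for entry, name, length in find_candidates(demo, min_length=450, max_length=550):
    print(entry, length, name)
```

As with the spreadsheet approach, a keyword match only produces a list of putative candidates, and the entries still need to be inspected before further analysis.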

In this example, all the putative exported and secreted proteins were first identified from the proteome, and those with amino acid lengths in the range 450–550 were selected. The amino acid sequences of proteins matching those criteria were submitted to SignalP, and some were predicted to have no signal peptide. The sequences were also submitted to ANTIGENpro to predict their antigenicity. Finally, using the BLAST feature of UniProt, the proteins were checked against the database to determine whether they were highly conserved in closely related organisms. In the case of S. Typhi, the proteins were checked against the proteome of Escherichia coli, a closely related microbe. In this case study, the putative secreted peptidase Q8Z7Q9 emerged as a good serodiagnostic marker candidate for S. Typhi because it has a high predicted antigenicity of 0.95 and no orthologous proteins exist in E. coli. Thus, this protein may be a potential marker for typhoid fever, and candidates identified in this way can then be verified in a wet lab. A flowchart of the simple bioinformatics analysis is shown in Figure 1.

Figure 1. General workflow of simple bioinformatics analysis.
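The final selection step of the workflow, in which the prediction results are combined to shortlist candidates, can be sketched as follows. This is only an illustration of the logic: the record layout, the function name, the thresholds for antigenicity and percent identity, and all demo values are hypothetical, and the scores would in practice be copied from the SignalP, ANTIGENpro, and BLAST results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    accession: str
    has_signal_peptide: bool        # from a signal peptide predictor
    antigenicity: float             # predicted probability, 0 to 1
    ecoli_identity: Optional[float]  # percent identity of best E. coli hit,
                                     # None when no homolog was found


def shortlist(candidates, min_antigenicity=0.8, max_identity=40.0,
              require_signal_peptide=False):
    """Keep candidates that are predicted to be antigenic and that are
    absent from, or poorly conserved in, the related organism.
    Results are sorted from most to least antigenic."""
    kept = []
    for c in candidates:
        if c.antigenicity < min_antigenicity:
            continue
        if c.ecoli_identity is not None and c.ecoli_identity > max_identity:
            continue
        if require_signal_peptide and not c.has_signal_peptide:
            continue
        kept.append(c)
    return sorted(kept, key=lambda c: c.antigenicity, reverse=True)


# Placeholder records; the accessions and values are invented.
demo = [
    Candidate("DEMO_A", True, 0.95, None),
    Candidate("DEMO_B", True, 0.90, 98.0),   # highly conserved: excluded
    Candidate("DEMO_C", False, 0.85, 20.0),
    Candidate("DEMO_D", True, 0.30, None),   # low antigenicity: excluded
]
for c in shortlist(demo):
    print(c.accession, c.antigenicity, c.ecoli_identity)
```

The signal peptide requirement is left optional because the case study reports that some of the selected proteins were predicted to have no signal peptide.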

Online resources and URLs


  1. Mizrachi, I., GenBank: The Nucleotide Sequence Database, in The NCBI Handbook, McEntyre J and Ostell J, Editors. 2002, National Center for Biotechnology Information (US): Bethesda.
  2. Reeves, G.A., D. Talavera, and J.M. Thornton, Genome and proteome annotation: organization, interpretation and integration. J R Soc Interface, 2009. 6(31): p. 129-47.
  3. Du Toit, A., Bacterial pathogenesis: Modulating host metabolism. Nat Rev Microbiol, 2014. 12(3): p. 154.
  4. Chen, C., et al., A fast Peptide Match service for UniProt Knowledgebase. Bioinformatics, 2013. 29(21): p. 2808-9.
  5. Aurrecoechea, C., et al., PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res, 2009. 37(Database issue): p. D539-43.
  6. Chen, L., et al., VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res, 2005. 33(Database issue): p. D325-8.
  7. Chen, L., et al., VFDB 2012 update: toward the genetic diversity and molecular evolution of bacterial virulence factors. Nucleic Acids Res, 2012. 40(Database issue): p. D641-5.
  8. Zhou, C.E., et al., MvirDB--a microbial database of protein toxins, virulence factors and antibiotic resistance genes for bio-defence applications. Nucleic Acids Res, 2007. 35(Database issue): p. D391-4.
  9. Cheng, J., et al., SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res, 2005. 33(Web Server issue): p. W72-6.
  10. Hofmann, H.J. and D. Hadge, On the theoretical prediction of protein antigenic determinants from amino acid sequences. Biomed Biochim Acta, 1987. 46(11): p. 855-66.
  11. Hopp, T.P. and K.R. Woods, Prediction of protein antigenic determinants from amino acid sequences. Proc Natl Acad Sci U S A, 1981. 78(6): p. 3824-8.
  12. Magnan, C.N., et al., High-throughput prediction of protein antigenicity using protein microarray data. Bioinformatics, 2010. 26(23): p. 2936-43.
  13. Liang, L., et al., Immune profiling with a Salmonella Typhi antigen microarray identifies new diagnostic biomarkers of human typhoid. Sci Rep, 2013. 3: p. 1043.
  14. Wattam, A.R., et al., PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res, 2014. 42(Database issue): p. D581-91.
  15. Ponomarenko, J.V. and P.E. Bourne, Antibody-protein interactions: benchmark datasets and prediction tools evaluation. BMC Struct Biol, 2007. 7: p. 64.
  16. Blythe, M.J. and D.R. Flower, Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci, 2005. 14(1): p. 246-8.
  17. Petersen, T.N., et al., SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods, 2011. 8(10): p. 785-6.