The Food Allergy Research and Resource Program (FARRP) AllergenOnline.org database was updated to version 18A on February 1, 2018. Version 18A contains a comprehensive list of 2093 protein (amino acid) sequence entries that are categorized into 832 taxonomic-protein groups of unique proven or putative allergens (food, airway, venom/salivary and contact). Some of the allergenic wheat gliadins or glutenins may also cause celiac disease (see Celiac Database); however, they are listed on the allergen site if there is evidence of IgE binding.
The annual update process includes collecting new sequences designated as “allerg*” in reference files from the NCBI Protein database (compiled from GenBank, RefSeq and TPA databases as well as protein sequences from SwissProt, PIR, PRF and PDB databases). In a few instances, sequences are taken directly from a peer-reviewed publication because they have not been entered into the NCBI or other available databases. Duplicate and inappropriate sequences are removed by a process described below. The sequences are categorized by taxonomic group (genus/species) and protein sequence identity (close homology). The new draft dataset is compared with sequences contained in the previous version of AllergenOnline.org and integrated into existing groups if appropriate, or classified into new groups. Peer-reviewed publications are identified from PubMed and other resources, then collected and reviewed for evidence of allergenicity of the source organism and the specific protein. Additional information is gathered from the Allergen Nomenclature Committee website of WHO/IUIS (World Health Organization/International Union of Immunological Societies) and occasionally Wikipedia. This information is reviewed for each group of sequences as described below to classify the entries as likely “allergenic” (either absolute proof, including challenge testing, or putative, with specific IgE binding using sera from individuals with allergies to the source organism), or as having “insufficient proof” of allergy due to a lack of convincing evidence of allergenicity. During the review process, an attempt is made to identify new publications demonstrating proof of allergy for groups of potential allergens that were designated as having insufficient proof of allergenicity in previous versions. A consensus decision by the whole peer review panel is normally reached for each group regarding the designation as an “allergen” or having “insufficient proof”. However, in a few instances a majority decision is taken.
Criteria used to reach a decision to include or exclude each sequence or allergen group are described below.
REMOVAL OF "FALSE" ENTRIES
The protein sequence search strategy used for the current update is described in the publication describing construction and curation of the AllergenOnline database (Goodman et al., 2016). Each year the NCBI Protein database is searched using a keyword limit of “allerg*”, with date limits from the last download (for version 16) to the current download (4 May, 2016). This year 1,737 new sequences were considered; after irrelevant entries were removed based on their characteristics, 1,060 new sequences remained. Those were added to existing entries and previously downloaded sequences that did not have published proof of IgE binding or allergy. The sequences are grouped into taxonomic protein groups using FASTA comparisons. PubMed and other sources of scientific literature are searched for references that might provide evidence of IgE binding using appropriately allergic human serum donors. Evidence of biological activity is also collected along with a description of the protein source, sequence and physical characteristics. The information is triaged by Dr. Goodman. If there is published evidence, the entry is referred to two of the six additional reviewers. Information is recorded from the initial entry through the reviewers' comments. Compilation of a list of sequences for review by the entire expert panel includes screening to remove sequences that are included only because they are described as "similar to", or homologous to, an allergen. Many peptides are from a taxonomic organism that is associated with allergy (e.g. Aspergillus sp., Alternaria sp.). To reduce the list to a manageable size without excluding likely true allergens, we exclude sequences from genome model organisms (e.g. Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans).
However, proteins from allergenic species that are also genetic models (rice, mice and corn) are included in the initial list if there is information suggesting allergenicity, but are excluded if no published references related to allergenicity are available. Proteins that are obviously merely associated with an allergic response (e.g. cytokines, chemokines, immunoglobulins and transcription factors) are also excluded. Sequences are then screened and grouped based on sequence identity and taxonomic identity to those already in the AllergenOnline.org database (e.g. allergens, putative allergens or sequences with insufficient evidence to demonstrate allergy, see below) from previous versions. Relevant publications are collected for the review panel, using references from the NCBI sequence entry as well as separate searches of the PubMed database, based on keyword searches of the taxa, common name and sequence authors. The information for each allergen group is triaged to gather more specific information and reviewed by the expert panel in a three-stage process as described below.
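The exclusion rules described in this section can be sketched in code. The following is a minimal illustration only, not the curation software used by FARRP: the `Candidate` record, its field names, the keyword lists and the mapping of rice, mice and corn to Latin species names are all assumptions made for the example, and real curation relies on manual review rather than keyword matching.

```python
from dataclasses import dataclass

# Genome model organisms named in the text as excluded outright.
MODEL_ORGANISMS = {
    "drosophila melanogaster",
    "arabidopsis thaliana",
    "caenorhabditis elegans",
}
# Allergenic species that are also genetic models (rice, mice, corn);
# the Latin names are an assumption added for this sketch.
ALLERGENIC_MODELS = {"oryza sativa", "mus musculus", "zea mays"}
# Proteins merely associated with an allergic response.
IMMUNE_TERMS = ("cytokine", "chemokine", "immunoglobulin", "transcription factor")


@dataclass
class Candidate:
    """Hypothetical record for one downloaded sequence entry."""
    organism: str
    description: str
    has_allergy_reference: bool


def keep_for_review(c: Candidate) -> bool:
    """Return True if the entry should stay on the list sent to reviewers."""
    organism = c.organism.lower()
    description = c.description.lower()
    if organism in MODEL_ORGANISMS:
        return False
    if "similar to" in description:
        return False
    if any(term in description for term in IMMUNE_TERMS):
        return False
    if organism in ALLERGENIC_MODELS and not c.has_allergy_reference:
        return False
    return True
```

Entries that pass this screen would still go through the literature search and peer review described below; the sketch only mirrors the first pass that reduces the list to a manageable size.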
PEER REVIEW PROCESS FOR CATEGORIZING SEQUENCE GROUPS: PROOF OF ALLERGENICITY
Goal: To update and curate the list of sequences included in the AllergenOnline.org (FARRP) database on an annual basis to include only protein sequences that are supported by evidence demonstrating that the protein is a proven allergen or that there is substantial proof of allergy to the source of the protein as well as immunoglobulin E (IgE) binding to the specific protein using sera from individuals with allergies to the source. Nearly identical sequences in the same taxonomic / protein group are included if it is clear there are variants of the protein that might contribute to allergy.
Rationale: The AllergenOnline database is intended for use as a tool for evaluating the safety of proteins included in foods through processing or genetic modification. The Codex Alimentarius Guidelines (2003 and 2009) established a process for evaluating potential allergenicity based on evidence that the protein is likely to cause allergic reactions in consumers. A key component in the evaluation process is comparison of candidate products (proteins) with those of known allergens using a bioinformatics approach such as FASTA or BLASTP local alignment tools to identify proteins that would require further testing by serum IgE binding and/or clinical testing to evaluate safety. It is therefore important to have scientific evidence that the database entries are allergens or probable (putative) allergens in order to maximize the reliability of bioinformatics searches.
Peer Review Process: In 2005, FARRP brought together a panel of seven food allergy experts to define criteria for inclusion in future versions of the database. A protocol was developed for including sequences for consideration, for classifying sequences into groups (allergen, putative allergen, insufficient evidence to classify as an allergen or putative allergen), collecting publications for review, providing information to reviewers and finally for voting to accept sequences as allergens or putative allergens. In general, we have included sequences in the taxonomic allergen group that are at least 67% identical to the protein that is the subject of a peer-reviewed published study supporting IgE binding to the protein, using sera from clinically defined subjects allergic to the source. The identity limit was initially suggested by the IUIS Allergen Nomenclature Subcommittee as a limit for defining isoallergen groups. Information regarding the individual proteins should demonstrate that the protein is actually expressed in the source material that causes allergic reactions.
Criteria for three classes of assignment were agreed upon. An Allergen is a protein that has been demonstrated to specifically bind IgE using sera from individuals with clear allergies to the source of the gene/protein and, further, that causes basophil activation or histamine release, skin test reactivity or challenge test reactivity using subjects allergic to the source. A Putative Allergen is a protein that has met most of the criteria of an allergen, but has a missing component, usually biological activity (basophil activation or in vivo reactivity), a less well defined clinical population or a lack of data demonstrating that the specific protein was used in reported testing. Both Allergens and Putative Allergens are retained in the list of sequence searchable protein entries in AllergenOnline.org. The third category, those with Insufficient Evidence of Allergenicity (Unproven), are not included in the sequence searchable protein list because they were judged to be lacking critical evidence of specific IgE binding, the serum donors were not demonstrated to be allergic to the source and there was no allergic biological activity demonstrated for the protein. The proteins categorized as "Insufficient evidence" are maintained in a list for future annual reviews as new candidate "allergens" are identified from NCBI and the published literature. If new evidence supports reclassification in the opinion of the reviewers, they would be included in future versions of the database. In rare instances after 2007, individual sequence entries that were previously included in the searchable allergen list have been removed after more detailed reviews failed to identify published evidence that the protein is expressed in allergenic material, or found that the original review misinterpreted the data in the available publications.
The amount and quality of published objective data supporting the classification of various proteins as allergens varies remarkably. For many food, airway or contact allergens there is unquestionable objective data of the identity, characterization and purity of the protein and clear evidence that human subjects with relevant allergic histories and symptoms were tested to demonstrate reactions upon challenge, or at least clear evidence of specific IgE binding. However, there are also a number of proteins labeled as allergens in the literature or in the NCBI sequence database (or in UniProt) for which there is not sufficient objective data characterizing the protein used in testing, or data to demonstrate human reactivity or specific IgE binding. Our peer review process is designed to review the collective literature for individual proteins and classify the individual allergen groups based on our stated criteria.
The review process includes triage and an initial evaluation summary by Dr. Goodman at FARRP. Often additional references are identified and added for further review. Then each sequence group is assigned to two other reviewers from the expert panel. The detailed review comments from all three reviewers are compiled and presented to the entire group of seven experts for a final round of reviews. Comments and votes are recorded in the database files as an archive file. Later changes in status and reasons for changes are also included in the archive. A list of relevant references that were included in the review process is included in the public view of each version of the database. At the end of the review period a search is made of the WHO/IUIS Allergen Nomenclature website (www.allergen.org), and new entries that are not in AllergenOnline are reviewed for published evidence and added to the review list. The WHO/IUIS entries are often identified prior to publication. Therefore, the entries are reviewed again each year.
Before release of a new version to the public, the sequences, GI numbers (now accession numbers, since NCBI has stopped issuing GI numbers), taxonomy of the source and reference lists are compiled and checked. The public website shows relevant information for each sequence.
PEER REVIEW PANEL
Baumert, Joe, PhD, FARRP, University of Nebraska, USA
Bohle, Barbara, PhD, Division of Immunopathology, Medical University of Vienna, Austria
Ebisawa, Motohiro, MD, Pediatric Allergy, National Sagamihara Hospital, Japan
Ferreira, Fatima, PhD, University of Salzburg, Austria
Goodman, Rick, PhD, FARRP, University of Nebraska, USA
Kleine-Tebbe, Joerg, MD, Allergy & Asthma Center Westend, Berlin, Germany
Taylor, Steve, PhD, FARRP, University of Nebraska, USA
van Ree, Ronald, PhD, University of Amsterdam, The Netherlands
Sampson, Hugh, MD, Pediatric Allergy, Mount Sinai Medical Center, New York, USA
Vieths, Stefan, PhD, Paul-Ehrlich-Institut, Germany (2005-2012)
Hefle, Sue, PhD, FARRP, University of Nebraska, USA (2005-2006)
ALLERGEN DATABASE SEARCH ROUTINES
This website includes a sequence comparison routine, FASTA (Pearson and Lipman, 1988) which may be used to compare a protein sequence (the query sequence) to entries in the allergen database. This version of the FASTA search interface utilizes the FASTA3 (Pearson, 2000) algorithm. The purpose of the comparison routine is to evaluate whether the query protein sequence is identical to, or homologous with known or putative allergens in the database. Alignments with high identity scores may indicate a potential for allergenic cross-reactions. However, there is not sufficient scientific data to establish a simple scoring boundary (E-score or percent identity), beyond which cross-reactivity is certain, or below which cross-reactivity is not possible. Based on historical data, cross-reactivity is not likely for proteins with less than 50% identity over the entire protein sequence, and is fairly common above 70% identity (Aalberse, 2000). Through experience we find that sequences of two proteins having published evidence of cross-reactivity will align in AllergenOnline.org with a relatively high percent identity (>50% over nearly full-length) and have an E score (statistical expectation score) smaller than 1e-7 (0.0000001). Thus if a query protein matches a sequence in AllergenOnline.org with higher identity and smaller E scores, the protein should be considered as likely to be cross-reactive in the absence of extensive testing (IgE binding and possibly clinical challenges). Proteins sharing lower identity matches by FASTA alignment and having higher E scores are not likely to share IgE binding. Experimental studies would be needed to confirm that proteins sharing identities lower than 50% and having E scores larger than 1e-4 share IgE binding and clinical reactivity. Evaluation of literature regarding the matched allergen would help to identify appropriately allergic study subjects.
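The rules of thumb in this section can be written down as a small triage helper. This is only an illustrative sketch: the function name, the labels and the "indeterminate" category are assumptions made for the example, and, as stated above, no simple scoring boundary guarantees or rules out cross-reactivity. The numeric values are those quoted in the text.

```python
# Rule-of-thumb values quoted in the text; they are not validated cutoffs.
HIGH_IDENTITY = 50.0   # percent identity over nearly the full length
STRONG_EVALUE = 1e-7   # known cross-reactive pairs tend to score below this
WEAK_EVALUE = 1e-4     # above this, shared IgE binding is considered unlikely


def triage_alignment(percent_identity: float, evalue: float) -> str:
    """Return a coarse, hypothetical label for one full-length FASTA alignment."""
    if percent_identity > HIGH_IDENTITY and evalue < STRONG_EVALUE:
        return "likely cross-reactive"
    if percent_identity < HIGH_IDENTITY and evalue > WEAK_EVALUE:
        return "cross-reactivity unlikely"
    # Everything in between needs expert judgment or experimental data.
    return "indeterminate"
```

For example, `triage_alignment(72.0, 1e-30)` falls in the first category, while an alignment with 45% identity and an E score of 1e-6 lands in the indeterminate zone, where literature review and serum testing would be the deciding factors.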
Sliding 80mer FASTA
In addition to the full-length FASTA search, we have added an option to automatically scan each possible 80 amino acid segment (1-80, 2-81, 3-82, etc.) of the entered search protein against the AllergenOnline database, looking for matches of at least 35% identity. The 35% identity for 80 amino acid segments was suggested in a scientific advisory to regulators for evaluating proteins in genetically modified crops (see FAO/WHO 2001, and Codex 2003). This short segment matching routine evaluating segments of 80 amino acids appears to be quite conservative and precautionary, as discussed in Goodman et al. (2005) and Goodman and Hefle (2005). However, the 80 amino acid segment search appears to be far more likely to be informative than a search for shorter identical segments of 6 or 8 contiguous amino acids as originally recommended by Metcalfe et al. (1996) or the FAO/WHO 2001 approach, based on evaluations by Hileman et al. (2002) and Silvanovich et al. (2006). See also the summary report from the bioinformatics workshop on evaluating potential allergenicity (Goodman, 2006). In the past, AllergenOnline.org employed an E()-value (E-score) threshold of 100 as a statistical cutoff limit in the 80-mer search to identify alignments with >35% identity matches that should be evaluated further. However, we have determined that the very large E score allows alignments with multiple gaps and in some cases leads to alignments that do not make sense when compared to full-length alignments. Reexamination of publications by Pearson in 2004 and earlier clearly supports the use of the default E = 10 as a limit for FASTA; in exceptional cases with specialized, small databases or sequences, the limit could be set lower (e.g. E = 0.01). We have therefore modified the search parameters to evaluate only alignments with E scores of 10 or less, beginning with the release of AllergenOnline.org version 15 (12 January, 2015).
It is important to keep in mind that the default E()-value is simply a starting threshold used to allow alignments to be observed and then investigated using 35% identity and 80 amino acid overlap as the criteria. In cases where the sliding 80mer search identifies matches of >35% identity, additional bioinformatics comparisons may be useful to evaluate likely biological significance, or specific serum testing may prove useful if appropriate specifically allergic serum donors can be identified to evaluate the potential cross-reactivity suggested by the match.
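The windowing and filtering steps of the sliding 80mer search can be sketched as follows. This is a minimal illustration and not the code behind the website: the alignment itself would be done by FASTA, which is not reproduced here, and the `Hit` record and function names are hypothetical. The constants (80 residues, 35% identity, E = 10) come from the text.

```python
from typing import Iterator, List, NamedTuple, Tuple

WINDOW = 80          # segment length in amino acids
MIN_IDENTITY = 35.0  # percent identity threshold
MAX_EVALUE = 10.0    # E-value ceiling used since version 15


class Hit(NamedTuple):
    """Hypothetical record for one alignment reported by a search tool."""
    allergen_id: str
    percent_identity: float
    evalue: float


def sliding_windows(sequence: str, size: int = WINDOW) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start, segment) for every full-length window: 1-80, 2-81, ..."""
    for start in range(len(sequence) - size + 1):
        yield start + 1, sequence[start:start + size]


def flag_hits(hits: List[Hit]) -> List[Hit]:
    """Keep alignments within the E-value ceiling and at the identity threshold.

    The text writes both "at least 35%" and ">35%"; ">=" is used here.
    """
    return [
        h for h in hits
        if h.evalue <= MAX_EVALUE and h.percent_identity >= MIN_IDENTITY
    ]
```

A query of 82 residues yields three windows (starting at positions 1, 2 and 3), and a query shorter than 80 residues yields none, so such proteins would need the full-length search instead.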
8mer Identity Match
Although the Codex guideline mentions using short segment (6 or 8 amino acid) sequence matches, it also indicates that searches must be based on scientific proof. Because we have searched for, and been unable to find, examples where an isolated identity match of 6 or 8 amino acids was found between cross-reactive proteins unless there was at least a 35% identity match over 80 amino acids, we previously did not include that search routine on our database (Goodman et al. 2008). However, since some countries still require an eight amino acid identity search, even in the absence of evidence demonstrating a positive predictive value, we now provide that as an option.
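An exact 8mer identity search amounts to looking for any run of eight contiguous amino acids shared by the query and an allergen. The sketch below shows one straightforward way to do this with a set of k-mers; it is an illustration under that assumption, not the routine used on the website, and the function name and example sequences are made up.

```python
from typing import List, Tuple


def shared_kmers(query: str, allergen: str, k: int = 8) -> List[Tuple[int, str]]:
    """Return (1-based query position, k-mer) for each exact k-mer also found in allergen."""
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return [
        (i + 1, query[i:i + k])
        for i in range(len(query) - k + 1)
        if query[i:i + k] in allergen_kmers
    ]


# Made-up sequences sharing the single 8-residue stretch "TAYIAKQR".
matches = shared_kmers("MKTAYIAKQRQISFVK", "GGTAYIAKQRGG")
print(matches)  # [(3, 'TAYIAKQR')]
```

Because a single shared 8mer is reported regardless of the rest of the sequence, such matches should be read alongside the full-length and 80mer results rather than on their own.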
- Goodman RE, Ebisawa M, Ferreira F, Sampson HA, van Ree R, Vieths S, Baumert JL, Bohle B, Lalithambika S, Wise J, Taylor SL. 2016. AllergenOnline: A peer-reviewed, curated allergen database to assess novel food proteins for potential cross-reactivity. Mol. Nutr. Food Res. 60(5):1183-1198.
- Pearson WR, Lipman DJ. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.
- Pearson WR. 2000. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185-219.
- Pearson WR. 2003. Finding protein and nucleotide similarities with FASTA. Current Protocols in Bioinformatics. Section 3.9.1 to 3.9.23.
- Siruguri V, Bharatraj DK, Vankudavath RN, Rao Mendu VV, Gupta V, Goodman RE. Evaluation of Bar, Barnase, and Barstar recombinant proteins expressed in genetically engineered Brassica juncea (Indian mustard) for potential risks of food allergy using bioinformatics and literature searches. Food Chem Toxicol. 2015 Jun 14;83:93-102. doi: 10.1016/j.fct.2015.06.003. [Epub ahead of print] PubMed PMID: 26079618.
- Jin Y, Goodman RE, Tetteh AO, Lu M, Tripathi L. Bioinformatics analysis to assess potential risks of allergenicity and toxicity of HRAP and PFLP proteins in genetically modified bananas resistant to Xanthomonas wilt disease. Food Chem Toxicol. 2017; 109:81-89. http://doi.org/10.1016/j.fct.2017.08.024 (Open Access, PMID: 28830835).
- Jin Y, He X, Ando-Kumi K, Fraser RZ, Lu M, Goodman RE. Evaluating potential risks of food allergy and toxicity of soy leghemoglobin expressed in Pichia pastoris. Mol Nutr Food Res. 2017 Sept. 18; doi: 10.1002/mnfr.201700297 [Epub ahead of print](Open Access, PMID: 28921896).
- Aalberse RC. 2000. Structural biology of allergens. J. Allergy Clin. Immunol. 106:228-238.
- Aalberse RC, Stapel SO. 2001. Structure of food allergens in relation to allergenicity. Pediatr Allergy Immunol 12:10-4.
- Codex Alimentarius Commission. 2003. Alinorm 03/34: Appendix III. Guideline for the conduct of food safety assessment of foods derived from recombinant DNA plants. Annex IV. Annex on the assessment of possible allergenicity, Rome, Italy.
- Doolittle RF. in Methods in Enzymology Vol. 183. Molecular evolution: Computer analysis of protein and nucleic acid sequences, RF Doolittle, Ed. (Academic Press, Inc., San Diego, 1990), chap. 6.
- FAO/WHO 2001. Evaluation of allergenicity of genetically modified foods derived from biotechnology. Rome, Italy.
- Goodman RE, Hefle SL. 2005. Gaining perspective on the allergenicity assessment of genetically modified food crops. Expert Rev. Clin. Immunol. 1(4):561-578.
- Goodman RE, Hefle SL, Taylor SL, van Ree R. 2005. Assessing genetically modified crops to minimize the risk of increased food allergy. Int. Arch. Allergy Immunol. 137(2):153-166.
- Goodman RE. 2006. Practical and predictive bioinformatics methods for the identification of potentially cross-reactive protein matches. Mol Nutr Food Res 50:655-660.
- Goodman RE, Vieths S, Sampson HA, Hill D, Ebisawa M, Taylor SL, van Ree R. 2008. Allergenicity assessment of genetically modified crops - what makes sense? Nat Biotech 26(1):73-81.
- Hileman RE, Silvanovich A, Goodman RE, Rice EA, Holleschak G, Astwood JD, Hefle SL. 2002. Bioinformatic methods of allergenicity assessment using a comprehensive allergen database. Int. Arch. Allergy Immunol. 128:280-291.
- Ladics GS, Bannon GA, Silvanovich A, Cressman RF. 2007. Comparison of conventional FASTA identity searches with the 80 amino acid sliding window FASTA search for the elucidation of potential identities to known allergens. Mol Nutr Food Res 51(8):985-998.
- Metcalfe DD, Astwood JD, Townsend R, Sampson HA, Taylor SL, Fuchs RL. 1996. Assessment of the allergenic potential of foods derived from genetically engineered crop plants. Crit Rev Food Sci Nutr 36 Suppl:S165-86.
- Silvanovich A, Nemeth MA, Song P, Herman R, Tagliani L, Bannon GA. 2006. The value of short amino acid sequence matches for prediction of protein allergenicity. Toxicol. Sci. 90(1):252-258.
- Thomas K, Bannon G, Hefle S, Herouet C, Holsapple M, Ladics G, MacIntosh S, Privalle L. 2005. In silico methods for evaluating human allergenicity of novel proteins. Toxicol Sci 88(2):307-310.
- Brusic V, Petrovsky N, Gendel SM, Millot M, Gigonzac O, Stelman SJ. 2003. Computational tools for the study of allergens. Allergy 58:1083-1092.
- Brusic V, Petrovsky N, Gendel SM, Millot M, Gigonzac O, Stelman SJ. 2003. Allergen databases. Allergy 58:1093-1100.
- Kleter GA, Peijnenburg AACM. 2002. Screening of transgenic proteins expressed in transgenic food crops for the presence of short amino acid sequences identical to potential, IgE-binding linear epitopes of allergens. BMC Structural Biology 2:8.
- Ivanciuc O, Schein CH, Braun W. 2003. SDAP: database and computational tools for allergenic proteins. Nuc. Acids Res. 31:359-362.
- Malandain H. 2004. Basic immunology, allergen prediction and bioinformatics. Allergy 59:1011-1012.
- Martinez Barrio A, Soeria-Atmadja D, Nister A, Gustafsson MG, Hammerling U, Bongcam-Rudloff E. 2007. EVALLER: a web server for in silico assessment of potential protein allergenicity. Nuc. Acids Res. 35(Web Server Issue): W694-W700.
- Saha S, Raghava GPS. 2006. Algpred: prediction of allergenic proteins and mapping of IgE epitopes. Nuc. Acids Res. 34(Web Server Issue): W202-W209.
- Stadler MB, Stadler BM. 2003. Allergenicity prediction by protein sequence. FASEB J. 17:1141-1143.
- Zhang L, Huang Y, Zou Z, He Y, Chen X, Tao A. 2012. SORTALLER: predicting allergens using substantially optimized algorithm on allergen family featured peptides. Bioinformatics. 28(16):2178-2179.