About AllergenOnline

The Food Allergy Research and Resource Program (FARRP) AllergenOnline.org database was updated to version 23 on January 30, 2025. Version 23 contains a comprehensive list of 2334 protein (amino acid) sequence entries that are categorized into 969 taxonomic-protein groups of unique proven or putative allergens (food, airway, venom/salivary and contact) from 448 species. Note: the Gal g 6 NCBI entry is a multimeric egg yolk protein of 1912 amino acids, but only the C-terminal 284 amino acids are the allergen, based on PMID 20509661. Some of the allergenic wheat gliadins or glutenins may also cause celiac disease (see Celiac Database); however, they are listed on the allergen site if there is evidence of IgE binding.

The annual update process includes collecting new sequences designated as "allerg*" in reference files from NCBI protein database (compiled from GenBank, RefSeq and TPA databases as well as protein sequences from SwissProt, PIR, PRF and PDB databases). In a few instances, sequences are taken directly from a peer reviewed publication as they have not been entered into the NCBI or other available databases. However, in 2025 we focused on new entries in the WHO/IUIS Allergen Nomenclature database. Duplicate and inappropriate sequences are removed by a process described below. The sequences are categorized by taxonomic group (genus/species) and protein sequence identity (close homology). The new draft dataset is compared with sequences contained in the previous version of AllergenOnline.org and integrated into existing groups if appropriate, or classified into new groups. Peer-reviewed publications are identified from PubMed and other resources, then collected and reviewed for evidence of allergenicity of the source organism and the specific protein. Additional information is gathered from the Allergen Nomenclature Committee website of WHO/IUIS (World Health Organization/International Union of Immunological Societies) and occasionally Wikipedia. This information was reviewed for each group of sequences as described below to classify the entries as likely "allergenic" based on specific amino acid sequence and IgE binding using sera from multiple appropriate subjects PLUS biological activity of basophil activation or skin prick tests, or "putative allergen" based on IgE binding only. Otherwise the proteins are classified as having "insufficient proof" to be classified as an allergen or putative allergen. A consensus decision by the whole peer review panel is normally reached for each group regarding the designation as an "allergen", "putative allergen" or having "insufficient proof".

Recent Open Access Publications

Defeating Late Blight disease of potato Open Access 480 483 March 2021

2021_ African and Asian agriculture

Genetic Engineering--future food security

Whole proteomes from Genomes vs AllergenOnline 2021

Removal of "False" Entries

A protein sequence search strategy used for the current update is described in the publication describing construction and curation of the AllergenOnline database (Goodman et al., 2016). Nearly every year the NCBI Protein database is searched using a keyword limit of "allerg*", with date limits from the last download (for each version XX) to the current date. However, for version 23 we decided to simply focus on new entries in the WHO/IUIS allergen nomenclature over the past two years, checking for recent publications. Peer reviewed publications are found in PubMed and Web of Science. The sequences are grouped into taxonomic protein groups using FASTA comparisons. In addition, new entries in the WHO/IUIS database that have published references were included in our review.
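As an illustration of the keyword search with date limits, the following minimal sketch builds a query URL for the NCBI E-utilities ESearch service. It is not the curators' actual tooling. The function name, the example dates, the choice of publication date (`pdat`) as the date field and the `retmax` value are assumptions made for illustration only.

```python
from datetime import date
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_allergen_query(last_download: date, today: date, retmax: int = 500) -> str:
    """Return an ESearch URL for protein records matching "allerg*" in a date window.

    The function name and defaults are hypothetical; only the URL is built here,
    no request is sent.
    """
    params = {
        "db": "protein",          # NCBI Protein database
        "term": "allerg*",        # keyword limit described in the text
        "datetype": "pdat",       # assumption: filter on publication date
        "mindate": last_download.strftime("%Y/%m/%d"),
        "maxdate": today.strftime("%Y/%m/%d"),
        "retmax": retmax,         # assumption: arbitrary page size
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical date window, for illustration only.
url = build_allergen_query(date(2024, 1, 1), date(2025, 1, 30))
print(url)
```

The records returned by such a query would still need the manual screening and literature review described on this page.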

The amino acid sequences and associated publications are reviewed by Dr. Goodman, and if there are publications showing data on serum IgE binding or allergy, the committee of reviewers is involved in entering the protein as either a putative allergen or a proven allergen based on scientific data. Otherwise potential new entries are set aside as not having sufficient proof of IgE binding or of biological activity (basophil activation or skin prick tests). Information is recorded from the initial entry; two reviewers from the panel then review Goodman's comments and the identified papers and record their entries. After all reviews are complete, the full panel re-reviews all new entries and records final votes. The full data set is added to existing allergens and putative allergens to make the new version.

Compilation of a list of sequences for review by the entire expert panel includes screening to remove sequences that are included only based on being "similar to" an allergen, or homologous. Many peptides are from a taxonomic organism that is associated with allergy (e.g. Aspergillus sp., Alternaria sp.). In order to reduce the list to a manageable size without excluding likely true allergens, we exclude sequences from genome model organisms (e.g. Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans, etc.). However, proteins from allergenic species that are also genetic models (rice, mice and corn) and that have information suggesting allergenicity are included in the initial list, but are excluded if no published references related to allergenicity are found. Proteins that are obviously merely associated with an allergic response (e.g. cytokines, chemokines, immunoglobulins and transcription factors) are also excluded. Sequences are then screened and grouped based on sequence identity and taxonomic identity to those already in the AllergenOnline.org database (e.g. allergens, putative allergens or sequences with insufficient evidence to demonstrate allergy, see below) from previous versions. Relevant publications are collected for the review panel, using references from the NCBI sequence entry as well as separate searches of the PubMed database, based on keyword searches of the taxa, common name and sequence authors. The information for each allergen group is triaged to gather more specific information and reviewed by the expert panel in a three stage process as described below.

Peer Review Process for Categorizing Sequence Groups: Proof of Allergenicity

Goal: To update and curate the list of sequences included in the AllergenOnline.org (FARRP) database on an annual basis to include only protein sequences that are supported by evidence demonstrating that the protein is a proven allergen or that there is substantial proof of allergy to the source of the protein as well as immunoglobulin E (IgE) binding to the specific protein using sera from individuals with allergies to the source. Nearly identical sequences in the same taxonomic / protein group are included if it is clear there are variants of the protein that might contribute to allergy.

Rationale: The AllergenOnline database is intended for use as a tool for evaluating the safety of proteins included in foods through processing or genetic modification. The Codex Alimentarius Guidelines (2003 and 2009) established a process for evaluating potential allergenicity based on evidence that the protein is likely to cause allergic reactions in consumers. A key component in the evaluation process is comparison of candidate products (proteins) with those of known allergens using a bioinformatics approach such as FASTA or BLASTP local alignment tools to identify proteins that would require further testing by serum IgE binding and/or clinical testing to evaluate safety. It is therefore important to have scientific evidence that the database entries are allergens or probable (putative) allergens in order to maximize the reliability of bioinformatics searches.

Peer Review Process: In 2005, FARRP brought together a panel of seven food allergy experts to define criteria for inclusion in future versions of the database. A protocol was developed for including sequences for consideration, for classifying sequences into groups (allergen, putative allergen, insufficient evidence to classify as an allergen or putative allergen), collecting publications for review, providing information to reviewers and finally for voting to accept sequences as allergens or putative allergens. In general we have included sequences in the taxonomic allergen group that are at least 67% identical to the protein that is the subject of a peer-reviewed published study supporting IgE binding to the protein, using sera from clinically defined subjects allergic to the source. The identity of each includes the abbreviated WHO/IUIS designation suggested by the Allergen Nomenclature Subcommittee. The amino acid sequence and taxonomic identities are critical for defining isoallergen groups. Information regarding the scientific justification of individual proteins is intended to be described in peer reviewed publications.
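The 67% identity rule for grouping can be sketched as follows. This is a simplified illustration, not the database's curation code. It assumes that percent identities to each group's reference protein have already been computed (for example by FASTA), and the `Group` class and `assign_to_group` function are hypothetical names. The text states the rule holds "in general", so reviewers may decide differently in individual cases.

```python
from dataclasses import dataclass, field

GROUP_IDENTITY_MIN = 67.0  # percent identity threshold stated in the text


@dataclass
class Group:
    species: str                 # taxonomic source of the group
    reference_id: str            # protein with published IgE-binding evidence
    members: list = field(default_factory=list)


def assign_to_group(seq_id, species, identity_to_reference, groups):
    """Add seq_id to the first matching group, or start a new group.

    identity_to_reference maps a reference_id to a precomputed percent
    identity (an assumption of this sketch; no alignment is done here).
    """
    for group in groups:
        identity = identity_to_reference.get(group.reference_id, 0.0)
        if group.species == species and identity >= GROUP_IDENTITY_MIN:
            group.members.append(seq_id)
            return group
    new_group = Group(species, seq_id, [seq_id])
    groups.append(new_group)
    return new_group


# Hypothetical identifiers and identity values, for illustration only.
groups = []
first = assign_to_group("seqA", "Arachis hypogaea", {}, groups)
second = assign_to_group("seqB", "Arachis hypogaea", {"seqA": 92.0}, groups)
third = assign_to_group("seqC", "Arachis hypogaea", {"seqA": 40.0}, groups)
print([(g.reference_id, g.members) for g in groups])
```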

Criteria for three classes of allergenicity were agreed to: Allergen is a protein that has been demonstrated to specifically bind IgE using sera from individuals with clear allergies to the source of the gene/protein and further that the protein causes basophil activation or histamine release, skin test reactivity or challenge test reactivity using subjects allergic to the source. Putative allergen is a protein that has met most of the criteria of an allergen, but is missing the biological activity component, whether basophil activation or in vivo reactivity. Sometimes less well defined clinical subject populations are used in the peer reviewed paper or there is a lack of data demonstrating the specific protein was used in reported testing and the protein is classified as a putative allergen. Both Allergens and Putative Allergens are retained in the list of sequence searchable protein entries in AllergenOnline.org. The third category, those with Insufficient Evidence of Allergenicity (Unproven), are not included in the sequence searchable database because they were judged to be lacking critical evidence of specific IgE binding, the serum donors were not demonstrated to be allergic to the source and there was no allergic biological activity demonstrated for the protein. The proteins categorized as "Insufficient evidence" are maintained in a list for future annual reviews and when new publications are found they can be included if new evidence supports reclassification in the opinion of the reviewers. In rare instances after 2007, individual sequence entries that were previously included in the searchable allergen list have been removed after more detailed reviews failed to identify published evidence that the protein is expressed in allergenic material, or found that the original review misinterpreted the data in the available publications.
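The three classes can be summarized as a small decision function. This is a deliberate simplification that assumes each criterion can be reduced to a yes/no flag. In practice the designation is a consensus judgment of the review panel, and the `Evidence` fields and `classify` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    specific_ige_binding: bool       # IgE binding shown for the specific protein
    donors_allergic_to_source: bool  # sera from subjects clearly allergic to the source
    biological_activity: bool        # basophil activation, histamine release,
                                     # skin test or challenge reactivity


def classify(evidence: Evidence) -> str:
    """Map simplified evidence flags to one of the three classes in the text."""
    if (evidence.specific_ige_binding
            and evidence.donors_allergic_to_source
            and evidence.biological_activity):
        return "allergen"
    if evidence.specific_ige_binding:
        # IgE binding without the full set of supporting evidence.
        return "putative allergen"
    return "insufficient evidence"


print(classify(Evidence(True, True, True)))
print(classify(Evidence(True, True, False)))
print(classify(Evidence(False, False, False)))
```

Only entries in the first two classes would be kept in the sequence searchable list.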

The amount and quality of published objective data supporting the classification of various proteins as allergens varies remarkably. For many food, airway or contact allergens there is unquestionable objective data of the identity, characterization and purity of the protein and clear evidence that human subjects with relevant allergic histories and symptoms were tested to demonstrate reactions upon challenge, or at least clear evidence of specific IgE binding. However, there are also a number of proteins labeled as allergens in the literature or in the NCBI sequence database (or in UniProt) for which there is not sufficient objective data characterizing the protein used in testing, or data to demonstrate human reactivity or specific IgE binding. Our peer review process is designed to review the collective literature for individual proteins and classify the individual allergen groups based on our stated criteria.

The review process includes triage and an initial evaluation summary by Dr. Goodman at FARRP. Often additional references are identified and added for further review. Then each sequence group is assigned to two other reviewers from the expert panel. The detailed review comments from all three reviewers are compiled and presented to the entire group of seven experts for a final round of reviews. Comments and votes are recorded in the database files as an archive file. Later changes in status and reasons for changes are also included in the archive. A list of relevant references that were included in the review process is included in the public view of each version of the database. At the end of the review period a search is made of the WHO/IUIS Allergen Nomenclature website (www.allergen.org), and new entries that are not in AllergenOnline are reviewed for published evidence and added to the review list. The WHO/IUIS entries are often identified prior to publication. Therefore their new entries are reviewed again periodically.

Before release of the database the sequences, GI numbers and accession numbers are checked. The NCBI database stopped showing GI numbers publicly, but they can be found in searches. The taxonomy of the source and reference lists are compiled and checked before release of the new version though sometimes taxonomic classifications change. The public website shows relevant information for each sequence.

Peer Review Panel

Baumert, Joe, PhD, FARRP, University of Nebraska, USA
Bohle, Barbara, PhD, Division of Immunopathology, Medical University of Vienna, Austria
Ebisawa, Motohiro, MD, Pediatric Allergy, National Sagamihara Hospital, Japan
Ferreira, Fatima, PhD, University of Salzburg, Austria
Johnson, Phil, PhD, FARRP, University of Nebraska, USA
Goodman, Rick, PhD, FARRP, University of Nebraska, USA
Taylor, Steve, PhD, FARRP, University of Nebraska, USA
van Ree, Ronald, PhD, University of Amsterdam, The Netherlands

Former Members:

Kleine-Tebbe, Joerg, MD, Allergy & Asthma Center Westend, Berlin, Germany (2018-2019)
Sampson, Hugh, MD, Pediatric Allergy, Mount Sinai Medical Center, New York, USA (2005-2015)
Vieths, Stefan, PhD, Paul-Ehrlich-Institut, Germany (2005-2012)
Hefle, Sue, PhD, FARRP, University of Nebraska, USA (2005-2006)

Financial Support

Financial support for this database was provided by grants from corporate subscribers and by FARRP (the Food Allergy Research and Resource Program, Department of Food Science & Technology at the University of Nebraska-Lincoln) and faculty. However, a number of companies ended sponsorship from 2016 to 2021. FARRP is now the primary sponsor.

The majority of the scientific information for the Allergen database and now for the Celiac database is collected and evaluated by Rick Goodman, then verified by the review team. The database is updated approximately annually. The database construction was performed by John Wise, who now consults to help maintain it.

Sponsors (2004 - 2015):

Subscribers (2016 - 2017):

Subscribers 2018 - 2021:

Allergen Database Search Routines

Full-length FASTA

This website includes a sequence comparison routine, FASTA (Pearson and Lipman, 1988), which may be used to compare a protein sequence (the query sequence) to entries in the allergen database. This version of the FASTA search interface utilizes the FASTA3 (Pearson, 2000) algorithm. The purpose of the comparison routine is to evaluate whether the query protein sequence is identical to, or homologous with, known or putative allergens in the database. Alignments with high identity scores may indicate a potential for allergenic cross-reactions and a potential risk of food allergy if included in a new food. However, there is not sufficient scientific data to establish a simple scoring boundary (E-score or percent identity), beyond which cross-reactivity is certain, or below which cross-reactivity is not possible. Based on historical data, cross-reactivity is not likely for proteins with less than 50% identity over the entire protein sequence, and is fairly common above 70% identity (Aalberse, 2000). Through experience we find that sequences of two proteins having published evidence of cross-reactivity will align in AllergenOnline.org with a relatively high percent identity (>50% over nearly full-length) and have an E score (statistical expectation score) much smaller than 1e-7 (0.0000001). Thus if a query protein matches a sequence in AllergenOnline.org with higher identity and a smaller E score, the protein should be considered a possible risk for cross-reactivity, and specific testing (IgE binding and possibly clinical challenges) might be required. Proteins sharing lower identity matches by FASTA alignment and having higher E scores are not likely to share IgE binding. Experimental studies would be needed to confirm that proteins sharing identities lower than 50% and having E scores larger than 1e-4 share IgE binding and clinical reactivity. Evaluation of literature regarding the matched allergen would help to identify appropriately allergic study subjects.
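The rules of thumb above can be written as a rough screening function. This sketch is illustrative only. As the text stresses, there is no established scoring boundary, so the labels are a way to prioritize follow-up and not a prediction of cross-reactivity. The function name and the label strings are assumptions.

```python
def interpret_full_length_hit(percent_identity: float, e_value: float) -> str:
    """Label a full-length FASTA hit using the identity and E score figures in the text."""
    if percent_identity > 70.0 and e_value < 1e-7:
        # Cross-reactivity is described as fairly common above 70% identity.
        return "likely"
    if percent_identity > 50.0 and e_value < 1e-7:
        # Published cross-reactive pairs typically align above 50% identity.
        return "possible"
    if percent_identity < 50.0 and e_value > 1e-4:
        # Low identity with a large E score: shared IgE binding is not expected.
        return "unlikely"
    # Mixed signals: no guidance is given, so leave the hit for expert review.
    return "indeterminate"


print(interpret_full_length_hit(85.0, 1e-40))
print(interpret_full_length_hit(55.0, 1e-12))
print(interpret_full_length_hit(30.0, 0.5))
print(interpret_full_length_hit(45.0, 1e-9))
```

Hits labeled "likely" or "possible" would be candidates for serum IgE testing, and "indeterminate" hits call for a closer look at the alignment itself.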

Sliding 80mer FASTA

In addition to the full-length FASTA search, the CODEX Alimentarius Commission (FAO/WHO 2001, CODEX 2003) suggested that proteins with >35% identity over 80 amino acids might cause cross-reactions. Therefore we added a sliding 80 amino acid (80mer) search in 2009. It automatically scans each entered protein for possible matches using segments (1-80, 2-81, 3-82, etc.) of the entered search protein against the AllergenOnline database, looking for matches of at least 35% identity as a default. The 35% identity for 80 amino acid segments was suggested in a scientific advisory to regulators for evaluating proteins in genetically modified crops (see FAO/WHO 2001, and Codex 2003).
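The segment scheme (1-80, 2-81, 3-82, etc.) can be sketched as a generator. This shows only how the overlapping windows are cut from the query. Each window would then be aligned against the database by FASTA, which is not reproduced here. The function name and the handling of queries shorter than 80 residues (one window covering the whole sequence) are assumptions.

```python
def sliding_windows(sequence: str, size: int = 80):
    """Yield (start, end, segment) with 1-based inclusive coordinates."""
    if not sequence:
        return
    if len(sequence) <= size:
        # Assumption: a short query is searched as a single segment.
        yield 1, len(sequence), sequence
        return
    for offset in range(len(sequence) - size + 1):
        yield offset + 1, offset + size, sequence[offset:offset + size]


# Hypothetical 100-residue query built from the 20 standard amino acids.
query = "ACDEFGHIKLMNPQRSTVWY" * 5
windows = list(sliding_windows(query))
print(len(windows), windows[0][:2], windows[-1][:2])
```

A query of length n with n of at least 80 produces n - 79 windows, so the 100-residue example yields 21 segments, from 1-80 through 21-100.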

However, we discovered that the 35% identity is usually far too conservative, as many false positive matches are identified in searches of whole proteomes for novel foods (Abdelmoteleb et al. 2021). Therefore we have added a search option to choose higher identity matches (45%, 55%, 65%, 75%) in the Sliding 80mer Window search using both version 35 and 36 of FASTA. Shorter proteins down to about 30 amino acids may include two or more IgE binding epitopes that can trigger mast cell or basophil activation. Our search also compensates for alignments shorter than 80 amino acids. These searches are much more likely than short peptide searches of 8 amino acids to uncover real potential allergy risks, and more so than only full-length FASTA or BLASTP, and such matches could represent real potential risks. This is different from other databases.
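To make the selectable thresholds concrete, the short calculation below converts each percent identity option into the minimum number of identical residues over an 80 amino acid overlap. It is plain arithmetic and does not describe the site's code. The function name is hypothetical, and the sketch assumes the threshold is inclusive ("at least").

```python
def min_identical_residues(threshold_percent: int, overlap: int = 80) -> int:
    """Smallest count of identical residues that reaches the threshold.

    Integer arithmetic is used so that the rounding up is exact.
    """
    return (threshold_percent * overlap + 99) // 100


for threshold in (35, 45, 55, 65, 75):
    print(threshold, min_identical_residues(threshold))
# 35 -> 28, 45 -> 36, 55 -> 44, 65 -> 52, 75 -> 60
```

Moving from the 35% default to the 45% option therefore raises the requirement from 28 to 36 identical positions in an 80-residue window.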

This 80 amino acid short segment matching routine is very important for evaluating potential risks of cross-reactivity. It is quite conservative, as many proteins that have been shown to be truly cross-reactive are highly identical for their full length, as presented in Goodman et al. (2005) and Goodman and Hefle (2005). It has become clear that some naturally occurring proteins do share shorter segments, but in all cases they must cross-link IgE on Fc epsilon receptors of mast cells and basophils. Experimental work at Cornell University by David Holowka and Barbara Baird (Holowka D, Sil D, Torigoe C and Baird B, 2007), using artificial epitopes on fixed-size DNA structures, indicated that the smallest effective spacing between IgE epitopes likely corresponds to ~30 amino acids.

Previous bioinformatics methods for identification of possible cross-reactive proteins had included short segment searches of 8 amino acids, although the paper by Metcalfe et al. (1996) did not make it clear that companies like Monsanto were identifying possible matches by BLASTP or FASTA and looking for matches of at least 8 contiguous amino acids for possible risk assessment. When the FAO/WHO 2001 suggested a simple 6 amino acid match as important, theoretical work by Hileman et al. (2002) and Silvanovich et al. (2006) demonstrated the apparent false positive rate using those criteria. See also the summary report from the bioinformatics workshop on evaluating potential allergenicity (Goodman, 2006).

In the past AllergenOnline.org has employed an E()-value (E-score) threshold of 100 as a statistical cutoff limit in the 80-mer search in identifying alignments with >35% identity matches that should be evaluated further. However, we have determined that the very large E score allows alignments with multiple gaps and leads to alignments in some cases that do not make sense when compared to full-length alignments. Reexamination of publications by Pearson in 2004 and earlier publications clearly supports the use of the default E = 10 as a limit for FASTA, or in exceptional cases with specialized, small databases or sequences, the limit could be set lower (e.g. E = 0.01). We have therefore modified the search parameters to evaluate only alignments with E scores of 10 or less, beginning with the release of AllergenOnline.org version 15 (12 January, 2015). It is important to keep in mind that the default E()-value is simply a starting threshold used to allow alignments to be observed and then investigated using 35% identity and 80 amino acid overlap as the criteria.

In cases when the alignment identifies matches of >35% identity in the sliding 80mer search, additional bioinformatics comparisons may be useful, such as comparison of the full protein sequence to all proteins in the NCBI Protein database to understand the potential evolutionary conservation of the sequence and to evaluate likely biological significance. In the end, specific human serum testing with sera from well characterized, clinically allergic subjects would be needed, using well designed experiments and experimental controls. Such tests are difficult, as the primary burden is finding clearly appropriate positive and negative serum donors.
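The two-step logic, in which the E()-value limit decides which alignments are looked at and the identity and overlap criteria then decide which ones merit follow-up, can be sketched as below. The `Hit` record, the function name and the example values are hypothetical. The sketch assumes alignment statistics have already been produced by FASTA, and it treats the 35% criterion as inclusive.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    allergen_id: str         # hypothetical identifier of the matched entry
    percent_identity: float  # identity reported for the alignment
    overlap: int             # alignment length in amino acids
    e_value: float           # FASTA E()-value


def screen_80mer_hits(hits, max_e=10.0, min_identity=35.0, min_overlap=80):
    """Keep hits within the E()-value limit that also meet identity and overlap."""
    # Step 1: the E()-value only controls which alignments are examined at all.
    examined = [hit for hit in hits if hit.e_value <= max_e]
    # Step 2: the identity and overlap criteria select hits for further evaluation.
    return [
        hit for hit in examined
        if hit.percent_identity >= min_identity and hit.overlap >= min_overlap
    ]


# Hypothetical alignment results, for illustration only.
hits = [
    Hit("entry_1", 41.3, 80, 1e-6),   # passes both steps
    Hit("entry_2", 36.0, 80, 55.0),   # E()-value above the limit of 10
    Hit("entry_3", 28.8, 80, 0.02),   # identity below 35%
]
print([hit.allergen_id for hit in screen_80mer_hits(hits)])
```

Raising `max_e` to 100 would readmit the second hit, which is the kind of gapped, less meaningful alignment the lower limit is meant to exclude.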

8mer Identity Match

Although the CODEX mentions using short segment (6 or 8 amino acid) sequence matches, it also indicates that searches must be based on scientific proof. As we have searched for, and been unable to find, examples where an isolated identity match of 6 or 8 amino acids was found between cross-reactive proteins unless there was at least a 35% identity match over 80 amino acids, we previously did not include that search routine on our database (Goodman et al. 2008). However, since some countries still require an eight amino acid identity search, despite the lack of evidence demonstrating a positive predictive value, we now provide that as an option.
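An exact 8 amino acid identity search amounts to looking for shared contiguous substrings of length 8. The sketch below illustrates the idea for one query and one allergen sequence. It is not the site's implementation, and the function name and both example sequences are made up for illustration.

```python
def shared_kmers(query: str, allergen: str, k: int = 8):
    """Return (1-based query position, k-mer) for each exact k-mer shared with allergen."""
    # Index every contiguous k-residue segment of the allergen sequence.
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return [
        (i + 1, query[i:i + k])
        for i in range(len(query) - k + 1)
        if query[i:i + k] in allergen_kmers
    ]


# Hypothetical sequences sharing a 9-residue stretch (AYIAKQRQI).
query = "MKTAYIAKQRQISFVK"
allergen = "GGGAYIAKQRQIPPP"
print(shared_kmers(query, allergen))
# [(4, 'AYIAKQRQ'), (5, 'YIAKQRQI')]
```

A shared stretch of nine residues is reported as two overlapping 8mers, so the number of hits reflects the length of the identical run and not the number of independent matches.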

References

AllergenOnline database:

For bioinformatic analysis:

For protein sequence (structure) and allergenicity:

Additional or alternative bioinformatics tools and databases may also be useful for the evaluation of potential allergens (see also LINKS):