Reference Manual G - FlyBase Documentation
Last Updated: 21 May 2007
G.1. Nontraditional alleles
In addition to 'alleles' in the traditional sense, FlyBase now names and curates further classes of allele so that phenotypic or expression pattern data can be captured for in vitro construct alleles and alleles of reporter (e.g., Ecol\lacZ), effector (e.g., Scer\FLP) or toxin (e.g., Rcom\DT-A) genes. Since these alleles have not historically been named by researchers, and have been named by FlyBase, their presentation in FlyBase requires some explanation:
G.1.1. Alleles of reporter genes
Alleles of reporter genes currently fall into two main classes: those resulting from enhancer trap experiments, and those resulting from promoter (or other regulatory region) analysis, where a fragment is used to drive the expression of a reporter gene. Ecol\lacZ will be used for illustration.
Enhancer trap results:
Promoter analysis results:
G.1.2. Alleles of ectopically expressed Drosophila gene products
Products of genes may be ectopically expressed either because of juxtaposition with different regulatory sequences in the genome (as a result of being inserted into different-than-wild-type locations by chromosome rearrangement or P element transposition) or because in vitro construction has created a different constellation of regulatory sequences than in wild type. By analogy with alleles of Ecol\lacZ for enhancer traps, P-element-borne insertions of genes such as w or ve that have a qualitatively distinct _position-dependent_ mutant phenotype will be curated as new alleles of those genes, e.g., veStg caused by a particular insertion of P{HS-rho}, P{HS-rho}Stg. The 'in vitro construct' ectopic expression alleles currently fall into two main classes, one component and two component systems:
One component systems:
An occasional exception is made for promoter fusions that are widely used to provide essentially wild-type gene function; these alleles have the mini-gene '+m construct' designation (see below) prepended to the promoter designation (for example, a heat shock designation), e.g., w+mW.hs. Authors commonly report a construct in which a gene, e.g., ftz, is expressed under a 'heat shock' or Hsp70 promoter while providing no further details about the nature of the promoter. For these cases the allele symbol hs.PI is employed, e.g., Antphs.PZ for 'Antp heat shock construct of Zeng'. An 'hs' designation should be reserved for cases where the heat inducible promoter fragment, not just the minimal one, is used. Where the allele is both altered in its coding region and expressed from an ectopic promoter, the sequence 'alteration.promoter' is used in the allele designation, e.g., tor13D.hs.sev to denote the coding sequence of tor13D expressed from a heat shock (undefined) promoter with a sev enhancer. An exception to this rule is made for Tags, which appear as the last component of the allele symbol (see below).
Two component systems:
G.1.3. Alleles of ectopically expressed non-Drosophila effector products
A note on ribozymes:
FlyBase has a foreign ribozyme gene, symbol LTSV\RBZ. Alleles of LTSV\RBZ capture the different variants; e.g., a heat inducible ftz-targeted ribozyme will be named LTSV\RBZhs.ftz (syntax 'promoter.target gene').
'+m' minigenes
The minigene allele designation is used in its narrow sense, i.e., where the only difference between the allele and the wild type is the removal of more or less non-essential sequences. Thus the minigene allele designation is reserved for those cases where the gene's own promoter is driving its expression. The minigene allele symbols begin with 'm', for minigene, and are followed by the construct symbol used in the publication. If no construct symbol has been used, the string 'mIa' is used, where 'm' stands for minigene, 'I' for the first author's last name initial and 'a' for the first in the series. If the function of the minigene is stated to be indistinguishable from that of the wild type allele, the 'm' is preceded by a '+'.
Tags
Genes can be modified by the addition of a tag allowing the product to be identified, purified, or targeted to a particular subcellular distribution. Tagged alleles have the syntax 'gene-symbol x.T:y', where x is an identifier and y is the name of the tag (e.g., Hsap\MYC, T:Ivir\HA1, SV40\nls2), for example dap1gm.T:Hsap\Myc. Where a tag is artificial, the species prefix Zzzz is used, e.g. T:Zzzz\His6.
G.1.4. Classical alleles engineered into transgene constructs, including rescue constructs
A class of alleles is named to capture fragments of genomic DNA used in rescue constructs. The symbol for the rescuing allele begins with '+t'. This is followed by the length as stated by the authors, or by the construct symbol if the length is not given. If neither length nor construct symbol is stated, '+tIa' is used, where 't' stands for transgene, 'I' for the first author's last name initial and 'a' for the first in the series.
When rescue is incomplete, the construct is considered as carrying a mutant allele. The allele designator is the construct symbol, 'length of genomic insert.tIa' if no symbol is given, or 'tIa' where neither length nor construct symbol is stated. When a classical allele, e.g., wa, is put into a transgene construct it will get a new designation, e.g., wa.tIa, to reflect its transgenic environment, where 't' stands for transgene, 'I' for the first author's last name initial and 'a' for the first in the series.
FlyBase is, of course, happy to discuss and advise on use of nomenclature of these non-traditional alleles.
G.2. Controlled vocabularies used by FlyBase
For many reasons several of the fields in FlyBase use structured controlled vocabularies (aka ontologies). This makes it much easier (and more robust) to make links within the database, as well as making it much easier to search the database for information. Moreover, several of these controlled vocabularies are shared with other databases, and this provides a degree of integration between them. The controlled vocabularies are only implemented in certain fields in FlyBase. The controlled vocabularies currently used by FlyBase are:
All of these structured controlled vocabularies are in the same format, that used by the Open Biomedical Ontology group. This format is called the OBO format and files using it have the suffix '.obo', e.g. gene_ontology.obo. The OBO format is designed to be used with the freely-downloadable OBO-Edit tool.
Users should be aware that controlled vocabularies undergo continual development; terms and definitions are refined, added, merged, split and obsoleted in an effort to improve the way they represent their various subjects. Both the current 'live' versions of each controlled vocabulary and the static versions taken at the time data for this FlyBase release was frozen are available to download from the Precomputed files download page under the Files menu of the Navigation bar.
The detail of each controlled vocabulary term is displayed in a CV Term Report in FlyBase. Individual CV Term Reports can be reached either by clicking on the controlled vocabulary term where it is displayed in a report page (e.g. the GENE ONTOLOGY: Function, Process, and Cellular component section of the Gene Report), or by using the TermLink tool, which allows users to search directly for controlled vocabulary terms from any of the controlled vocabularies used by FlyBase. Controlled vocabulary terms can also be searched using the QueryBuilder tool, via their links to objects (such as genes) in FlyBase. If you wish to search using a controlled vocabulary term in QueryBuilder, you should select the GO/Anatomy CV DB dataset in the query segment box (see the QUERY BUILDER HELP section at the bottom of the QueryBuilder page for more details).
G.3. Classification of Gene Products using Gene Ontology (GO) terms
FlyBase uses Gene Ontology (GO) controlled vocabulary (CV) terms for cellular component, biological process and molecular function to describe properties of gene products.
Although GO terms are intended to describe the properties of gene products, FlyBase currently assigns GO terms to genes rather than protein or RNA. FlyBase is one of the founding members of the Gene Ontology (GO) Consortium and follows the general guidelines for GO annotation as described in the GO documentation. FlyBase also participates in the GO Reference Genome Annotation Project.
G.3.1. FlyBase GO data
GO data is displayed in the GENE ONTOLOGY: Function, Process, and Cellular component section of individual Gene Reports. In addition, the current release of GO data for all Drosophila melanogaster FlyBase genes can be found in the tab delimited text file gene_association.fb. The following provides a brief description of the columns in the gene_association.fb file.
The latest version of this data is also available for download from the Gene Ontology consortium site. The accompanying README document includes a detailed description of the file format, FlyBase GO annotation policy and sources used for FlyBase GO annotations. Note that the GO data available from FlyBase will not necessarily be identical to that found on the GO website. The GO consortium validates the data FlyBase submits and removes lines of data that are no longer valid, e.g. when a GO term becomes obsolete.
G.3.2. Evidence
Evidence for a GO term consists of an evidence code that describes the type of analysis carried out together with, in some cases, a reference to another database object that supports the evidence (see with/from Supporting Evidence below).
Evidence codes
The Gene Ontology Guide to GO Evidence Codes contains comprehensive descriptions of the evidence codes used in GO annotation. FlyBase uses the following evidence codes when assigning GO data:
G.3.2.1. Use of evidence codes
Consistent with the aims of the GO reference genome project, FlyBase prefers to assign GO terms based on experimental evidence codes (IMP, IGI, IDA, IPI, IEP). Of these five codes, FlyBase uses IEP relatively infrequently since expression patterns generally provide less direct evidence for GO terms than the other four codes. FlyBase does use IEP where an author explicitly states that expression data is the evidence for a term.
In 2008, GO introduced the new evidence code EXP (inferred from experiment) for use by groups that wish to submit data to GO that is known to be based on experiment but which has not been assigned one of the granular experimental evidence codes. FlyBase will not use EXP in routine curation but may incorporate externally supplied GO annotation that includes EXP.
Evidence codes based on computer predictions (ISS, ISO, ISA, ISM, IEA, RCA), author statements (NAS, TAS) and curator inference (IC) will continue to be used in the absence of experimental data for the same or a more specific GO term. However, we aim to remove GO data with these codes when experimental evidence for the term is curated.
The evidence code ND (no biological data available) is used for annotations to the three root GO terms: "molecular_function ; GO:0003674", "biological_process ; GO:0008150" or "cellular_component ; GO:0008372". In FlyBase the use of any of these three GO terms, attributed to reference FBrf0159398 and supported by the ND evidence code, signifies that a curator has examined the available literature and sequence for this gene and that, as of the date of the annotation to the term, there is no information supporting an annotation to any more specific GO term in that ontology.
Additional information about the way FlyBase uses evidence codes can be found in the README document.
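The grouping of evidence codes described in G.3.2.1 can be illustrated with a short sketch. This is not FlyBase's curation software. The function names and the (gene, GO identifier, evidence code) tuple layout are assumptions made for illustration, and the pruning step only handles the "same GO term" case, because the "more specific GO term" case would need the ontology structure.

```python
# Evidence-code groups as listed in this section.
EXPERIMENTAL = {"IMP", "IGI", "IDA", "IPI", "IEP"}
COMPUTATIONAL = {"ISS", "ISO", "ISA", "ISM", "IEA", "RCA"}
AUTHOR_STATEMENT = {"NAS", "TAS"}
CURATOR_INFERENCE = {"IC"}
NO_DATA = {"ND"}


def evidence_class(code):
    """Return a descriptive class name for a GO evidence code."""
    if code in EXPERIMENTAL:
        return "experimental"
    if code == "EXP":
        # Not used in routine FlyBase curation, but may arrive in
        # externally supplied annotation.
        return "experimental (unspecified)"
    if code in COMPUTATIONAL:
        return "computer prediction"
    if code in AUTHOR_STATEMENT:
        return "author statement"
    if code in CURATOR_INFERENCE:
        return "curator inference"
    if code in NO_DATA:
        return "no biological data"
    raise ValueError(f"unknown evidence code: {code}")


def prune_superseded(annotations):
    """Drop non-experimental annotations once the same gene/term pair
    has experimental support.

    `annotations` is a list of (gene, go_id, evidence_code) tuples;
    this layout is hypothetical.
    """
    supported = {(gene, term) for gene, term, code in annotations
                 if code in EXPERIMENTAL}
    return [(gene, term, code) for gene, term, code in annotations
            if code in EXPERIMENTAL or (gene, term) not in supported]
```

With this sketch, an ISS annotation of a gene to a term is dropped as soon as an IDA annotation of the same gene to the same term is present, while annotations to other terms are left alone.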
with/from Supporting Evidence
Some evidence codes (IGI, IPI, ISS, ISO, ISA, ISM, IEA, IC) are used in conjunction 'with' supporting data in the form of a reference to another database object. These objects are identified by their database abbreviation followed by a colon and the unique identifier for the object in that database. A list of current database abbreviations can be found in the GO.xrf_abbs file. See the GO Annotation Guide for more details.
GO terms based on sequence similarity
FlyBase assigns GO terms to gene products based on similarity to other gene products using the evidence codes ISS, ISA, ISO and ISM. Since October 1st 2006, it has been obligatory to include an identifier for the sequence used to make the annotation; FlyBase ISS annotations made before this date do not all include such identifiers but will be updated gradually.
In line with current guidelines for reference genomes, curators now check that the similar sequence can be annotated to the GO term with experimental evidence (IDA, IMP, IGI, IPI, IEP) before making an annotation based on sequence similarity. This policy was adopted to avoid circular similarity-based annotations. Consequently, GO terms are not curated based on multiple sequence alignments if none of the sequences in the alignment have been experimentally verified. Annotations made before October 2006 have not necessarily been checked in this way.
For example, the Drosophila gene bigmax is annotated with the GO term 'regulation of transcription' based on sequence similarity to Max. This annotation is legitimate because Max has been shown to regulate transcription in a direct assay. The combined evidence appears on the gene report in the format:
inferred from sequence or structural similarity with FLYBASE:Max; FB:FBgn0017578
In this case we have given two identifiers (symbol and gene ID) for the same sequence; identifiers for the same sequence are separated by a semi-colon.
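As an illustration of the 'with' notation shown on the gene report, the following sketch splits such a string into objects and identifiers. It follows the semi-colon rule just stated together with the comma rule for separate objects described immediately afterwards. The function name is hypothetical, and the sketch assumes that symbols themselves contain no commas or semi-colons.

```python
def parse_with(field):
    """Split a gene-report 'with' string into objects.

    Each object is a list of (database, identifier) pairs. Commas
    separate different objects; semi-colons separate alternative
    identifiers for the same object.
    """
    objects = []
    for obj in field.split(","):
        identifiers = []
        for ident in obj.split(";"):
            ident = ident.strip()
            if not ident:
                continue
            # Split on the first colon only, so identifiers that
            # contain a colon themselves stay intact.
            database, _, accession = ident.partition(":")
            identifiers.append((database, accession))
        if identifiers:
            objects.append(identifiers)
    return objects


one_object = parse_with("FLYBASE:Max; FB:FBgn0017578")
two_objects = parse_with(
    "FLYBASE:grim; FB:FBgn0015946, FLYBASE:rpr; FB:FBgn0011706")
```

Here `one_object` holds a single object with two identifiers, whereas `two_objects` holds two objects, each with a symbol and a gene ID.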
If more than one sequence is used to make the annotation then the identifiers for the different sequences are separated by a comma. Note that this use of multiple identifiers is different from that for IGI and IPI.
IGI and IPI 'with'
For both IGI and IPI, the 'with' column format has additional significance. All annotations inferred from genetic interaction (IGI) include an identifier for the interacting gene. If the GO term is inferred based on multiple genes interacting simultaneously then all interacting genes are identified using 'with' (separated by commas). However, if the GO term is inferred from multiple pairwise interactions these are treated as separate pieces of experimental evidence and appear separately on the gene report. For example, Bruce is annotated with the GO term 'programmed cell death' based on two different pairwise genetic interaction experiments; the evidence appears on the gene report as:
inferred from genetic interaction with FLYBASE:grim; FB:FBgn0015946
AND
inferred from genetic interaction with FLYBASE:rpr; FB:FBgn0011706
Contrast this with the following, which would imply that all three genes had to interact together to provide evidence for the annotation:
inferred from genetic interaction with FLYBASE:grim; FB:FBgn0015946, FLYBASE:rpr; FB:FBgn0011706
Similar notation is used for IPI, where the interacting gene product is identified using 'with'. Where several gene products interact simultaneously they are recorded in a single annotation (separated by commas after the evidence code). Pairwise physical interactions are recorded independently, using separate evidence codes.
IC 'from'
Inferred by curator (IC) is the evidence code that uses 'from'. Curators use this code for those cases where an annotation is not supported by any direct evidence, but can reasonably be inferred from other GO annotations for which evidence is available. The object identified in the IC evidence is always a GO term identifier.
For example, a protein shown to have transcription factor activity in a direct assay could be annotated with the GO term 'general RNA polymerase II transcription factor' (GO:0016251). In the absence of any evidence for the cellular location of that protein, it would be reasonable for the curator to infer that it is (at least sometimes) located in the nucleus. This would lead to the annotation 'nucleus inferred by curator from GO:0016251'; the annotation is attributed to the reference that contains evidence for transcription factor activity.
G.3.2.2. Use of Qualifiers
Qualifiers are used as flags that modify the interpretation of an annotation. Allowable values are NOT, contributes_to, and colocalizes_with. On the gene report page, qualifiers precede the GO term in the CV column. More information about using qualifiers is available in the GO Annotation Guide.
NOT
NOT may be used with terms from any of the three GO ontologies (cellular component, biological process, molecular function). NOT is used to make an explicit note that the gene product is not associated with the GO term. This is particularly important in cases where associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method). For example, if a protein has sequence similarity to an enzyme such as galactosyltransferase, but has been shown experimentally not to have the galactosyltransferase activity, it can be annotated as NOT galactosyltransferase activity (GO molecular function term: GO:0008378). NOT can also be used when a cited reference explicitly says so (e.g. "our favorite protein is not found in the nucleus"). Prefixing a GO term with the string NOT allows curators to state that a particular gene product is NOT associated with a particular GO term. This usage of NOT was introduced to allow curators to document conflicting claims in the literature.
Note that NOT is used when a GO term might otherwise be expected to apply to a gene product, but an experiment, sequence analysis, etc. proves otherwise; it is not generally used for negative or inconclusive experimental results.
colocalizes_with
colocalizes_with is used only with cellular component terms. Gene products that are transiently or peripherally associated with an organelle or complex are annotated to the relevant cellular component term, using the colocalizes_with qualifier. This qualifier is also used in cases where the resolution of an assay is not accurate enough to say that the gene product is a bona fide component member.
contributes_to
contributes_to is used only with molecular function terms. An individual gene product that is part of a complex is annotated to terms that describe the function of the complex. Many such function annotations include the qualifier contributes_to. Annotating individual gene products according to attributes of a complex is especially useful for molecular function annotations in cases where a complex has an activity, but not all of the individual subunits do. (For example, there may be a known catalytic subunit and one or more additional subunits, or the activity may only be present when the complex is assembled.) Molecular function annotations of complex subunits that are not known to possess the activity of the complex include the qualifier contributes_to. Note that contributes_to is not used to annotate a catalytic subunit. Furthermore, contributes_to may be used for any non-catalytic subunit, whether the subunit is essential for the activity of the complex or not.
G.4. Computed Feature type of genes
The Feature type field of a Gene Report contains a single controlled vocabulary term from the Sequence Ontology (SO), which aims to describe the key type of the gene.
The single term in this field is computed by FlyBase from the full list of SO terms listed in the SEQUENCE ONTOLOGY: Class of gene section of the Gene Report, according to the following rules:
G.5. Computed cytological data
G.5.1. Computed cytological locations of objects which have been mapped to the genome.
Objects which have been precisely mapped to the genome (such as genes with annotations, or insertions of transposable elements with flanking sequence) have an inferred cytological location which is computed by FlyBase based on their sequence location.
The system used is based on estimates that Sorsa published a few years ago of the size in kb of each polytene band. These estimates can be summed to give the length (according to Sorsa) in kb of a region between two very well-mapped entities ('anchors') that are also identified on the genome. The genome sequence gives a different number for that length, so we then apply a scaling factor, i.e. we calculate the cytology of each mapped object in the region between the anchors by interpolation from its sequence coordinates.
The anchors we use are a set of over 1200 P-element insertions that have been localised on the genome by sequencing flanking DNA and on polytene chromosomes by Todd Laverty of the Berkeley Drosophila Genome Project. The scaling works out to be slightly different for each inter-anchor region, but we estimate that even in the middle of a region the error in the computed location should never be more than a band or two.
As the remaining gaps in the genome sequence are filled, some currently unmappable stretches of sequence (especially near centromeres) will be joined up with the main sequence, and this will shift all the coordinates. Smaller changes will occur as a result of other gap-filling in the middle of arms. These will be reflected in updates to the map locations.
FlyBase currently only computes cytological data in this way for objects that have been mapped to the D. melanogaster genome.
Cytology computed in this way is currently displayed on FlyBase in the following places on the relevant Report:
Gene Report
Insertion Report
GBrowse
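The anchor-based scaling described in G.5.1 amounts to linear interpolation between neighbouring anchors. The sketch below shows only that arithmetic and is not FlyBase's software: the function name, the anchor layout and all numbers are invented for illustration. Each anchor is assumed to pair a sequence coordinate with a position on a cumulative kb scale obtained by summing the per-band size estimates. Converting the resulting position back to a band name would need the band-size table itself, which is not reproduced here.

```python
from bisect import bisect_right


def interpolate_map_position(coord, anchors):
    """Estimate the cytological-map position (in kb on the summed
    band-size scale) of a sequence coordinate.

    `anchors` is a list of (sequence_coord, map_position_kb) pairs
    sorted by sequence coordinate; at least two are required.
    """
    if len(anchors) < 2:
        raise ValueError("need at least two anchors")
    coords = [seq for seq, _ in anchors]
    if coord < coords[0] or coord > coords[-1]:
        raise ValueError("coordinate lies outside the anchored region")
    # Index of the anchor to the right of `coord`, clamped so that a
    # coordinate equal to the last anchor still has a right neighbour.
    right = min(bisect_right(coords, coord), len(anchors) - 1)
    (seq0, map0), (seq1, map1) = anchors[right - 1], anchors[right]
    # The scaling factor differs for each inter-anchor region.
    return map0 + (map1 - map0) * (coord - seq0) / (seq1 - seq0)


# Invented anchors: (sequence coordinate in bp, map position in kb).
example_anchors = [(100_000, 50.0), (300_000, 200.0), (400_000, 320.0)]
midpoint = interpolate_map_position(200_000, example_anchors)
```

In this invented example the first inter-anchor region spans 200 kb of sequence but 150 kb on the map scale, so a point halfway along the sequence interval is placed halfway along the map interval.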
G.5.2. Computed cytological location of insertions based on the gene in which they are inserted.
Insertions of transposable elements that do not have flanking sequence may have a computed cytological location which is based on the computed cytological location of the gene into which they have inserted in the genome (displayed in the Affected gene(s) section of the Insertion Report). If the affected gene has a computed cytological location based on its sequence location (as described in G.5.1 above) then this is displayed in the Insertion Report in the following field:
G.5.3. Computed cytological locations of objects based upon data from the literature.
Genes that do not have a computed cytology based on their mapping to the genome (described in G.5.1 above) may instead have a computed cytology based upon data from the literature. Aberrations may also have computed cytological breakpoints based upon data from the literature.
Five categories of information are used to compute the cytological location of genes and aberration breakpoints:
Recombination, complementation and molecular information does not reveal polytene locations directly, but can be combined with orcein and in situ data to derive inferred polytene locations. FlyBase has produced software which produces a synthesis of the primary data, resulting in a computed cytological location that is a best guess of the polytene location of each gene or aberration breakpoint for which any relevant data are known to FlyBase. However, since this type of analysis is non-trivial when conducted on a large dataset, the statements computed in this way should be treated with caution, and users should also consult the five categories of information listed above to see the full extent of the primary data. The computed cytological location is presented as a range of uncertainty, whose ends are either polytene bands (such as 22F1) or lettered subdivisions (such as 22F). Heterochromatic bands (such as h41) are also used. Wherever possible, the computed range of uncertainty of a gene or breakpoint is the range consistent with ALL the data known to FlyBase. Thus, if in one publication a gene has been reported to lie in 35B1-4, and in another publication it is reported to lie in 35B3-6, and there is no other relevant information in FlyBase, the computed location will be 35B3-4. More complex situations arise from complementation and recombination data. For example, if Df(1)xyz is stated to have its proximal breakpoint at 15A1-4, and Df(1)pqr is stated to have its distal breakpoint at 15A3-6, and the Deficiencies are known to overlap (because there is a gene, abc, that they both delete), then both those breakpoints will be computed to lie in 15A3-4 -- as will the gene abc itself. If however two publications report cytological ranges that do NOT overlap, a choice must be made regarding which report to prioritize. This is done case-by-case, going back to the original literature. 
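The simplest case above, in which overlapping reports are combined into the range consistent with all of them, is an interval intersection. The following sketch covers only band ranges that lie within a single lettered subdivision, written as in the examples (e.g. 35B1-4). The function names are hypothetical, and the sketch does not attempt ranges spanning subdivisions, heterochromatic bands, or the case-by-case handling of conflicting reports.

```python
import re

# A band or band range inside one lettered subdivision, e.g. "35B3" or "35B1-4".
_BAND_RANGE = re.compile(r"^(\d+[A-F])(\d+)(?:-(\d+))?$")


def parse_band_range(text):
    """Return (subdivision, first_band, last_band) for strings like '35B1-4'."""
    match = _BAND_RANGE.match(text)
    if match is None:
        raise ValueError(f"unsupported cytological range: {text}")
    subdivision = match.group(1)
    first = int(match.group(2))
    last = int(match.group(3)) if match.group(3) else first
    return subdivision, first, last


def intersect_band_ranges(a, b):
    """Return the range consistent with both reports, or None if they
    do not overlap."""
    sub_a, lo_a, hi_a = parse_band_range(a)
    sub_b, lo_b, hi_b = parse_band_range(b)
    if sub_a != sub_b:
        raise ValueError("this sketch only handles one lettered subdivision")
    lo, hi = max(lo_a, lo_b), min(hi_a, hi_b)
    if lo > hi:
        return None  # conflicting reports; resolved case-by-case
    return f"{sub_a}{lo}" if lo == hi else f"{sub_a}{lo}-{hi}"


gene_location = intersect_band_ranges("35B1-4", "35B3-6")
```

Applied to the two published ranges in the example, `gene_location` is the narrower range 35B3-4; the same function gives 15A3-4 for the two overlapping deficiency breakpoints.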
Certain guidelines are used: for example, genetic data on deficiencies are usually favored over cytological data, since point lesions very near to a deficiency are rare. However, inevitably some decisions are wrong -- especially when there is nothing to favor one report over another. Because of the inherent complexity of these computations, the basis for the computed range is often not obvious at first sight. FlyBase therefore includes one-line descriptions of the primary data from which each end of the range was determined. Some examples: For gene abc:
Computed cytological location: 15A3-4
Left limit from inclusion in Df(1)pqr (FBrf0012345)
Right limit from inclusion in Df(1)xyz (FBrf0054321)
For Df(1)xyz:
Computed breakpoints: 14D;15A3-4
Limits of break 1 from polytene analysis (FBrf0013579)
Left limit of break 2 from inclusion of abc (FBrf0056789)
Right limit of break 2 from polytene analysis (FBrf0098765)
For Df(1)pqr:
Computed breakpoints: 15A3-4;15D
Left limit of break 1 from polytene analysis (FBrf0034567)
Limits of break 2 from polytene analysis (FBrf0097531)
Note that there is no requirement that any two data items derive from the same reference.
Notation
If a computed cytological range is inferred from recombination data (for genes) or complementation (for breakpoints), it is enclosed in square brackets when no range (even a wider one) can be determined by other means (thus square brackets specifically denote the unavailability of any direct data). This is most commonly found for breakpoints of cytologically invisible deficiencies and for genes which were mapped by recombination but never cloned or mapped by complementation.
'One-ended' limits. The commonest example of this is when a deficiency is stated to delete certain genes, thus giving it a minimum extent, but no flanking undeleted genes are specified, so no 'maximum extent' can be computed. In such cases, if there is also no explicit cytology for the deficiency (and if it is also not stated to be cytologically invisible -- see below) the 'half-open' range is denoted by 'less than' and 'greater than' signs, as follows:
For a deficiency that deletes three genes, all localized to 28D-E:
Computed breakpoints: <28E;>28D
Right limit of break 1 from inclusion of abc (FBrf0076543)
Left limit of break 2 from inclusion of abc (FBrf0056789)
Note that there is no 'limit line' for the left limit of break 1 or the right limit of break 2. Note also the superficially odd, but logically sound, mention of 28E for the left break and 28D for the right break.
Proximity rather than order
There are two cases in which locations are computed based on close proximity of a pair of objects, rather than on their chromosomal order. One is when two genes are reported to lie within 20kb of each other on a molecular map. For example, if a gene xyz is stated to lie in 22F1-2 and a second gene, pqr, is stated to lie a few kilobases away from xyz (and there is no other relevant information in FlyBase), the computed location of pqr will be 22F1-2, even if there is no information on the chromosomal order of the two genes.
The other case concerns cytologically invisible deficiencies. If a deficiency is stated to be cytologically invisible, the computation makes the assumption that it is less than a band in extent, so that the ranges of uncertainty of the left and right breakpoint should be identical. For example, if the deficiency in the previous example, which deletes a gene in 28D-E, were said to be cytologically invisible then its computed data would appear as follows:
Computed breakpoints: [28D-E];[28D-E]
Left limit of break 1 from cytological invisibility (FBrf0002468)
Right limit of break 1 from inclusion of abc (FBrf0076543)
Left limit of break 2 from inclusion of abc (FBrf0056789)
Right limit of break 2 from cytological invisibility (FBrf0002468)
Note the use of square brackets as described under "Notation", since this is a case where no explicit cytology is available. A statement that a deficiency is less than 20kb long is, for this purpose, treated as a statement that it is cytologically invisible.
Cytology computed in this way is currently displayed on FlyBase in the following places on the relevant Report:
Gene Report
Note: the one-line description of the primary data from which the range was determined is displayed in the Evidence for location column of the above section. Aberration Report
Note: the one-line description of the primary data from which the range was determined is displayed in the COMMENTS ON CYTOLOGY section.
G.5.4. Tools
Map-based searches using CytoSearch use computed cytological locations, rather than the primary data reported in the literature. For this reason, it is always advisable to search using a slightly broader range than the one of interest, so as to match entities which have been placed by multiple investigators in slightly varying locations.
The Cytolocation Advanced Search option in GBrowse uses computed cytological locations of objects which have been mapped to the genome (as described in G.5.1 above).
G.6. Personal communications to FlyBase
The policy of FlyBase with respect to the incorporation of unpublished data into the database is as follows. Data will only be considered for curation if available to FlyBase in written or electronic form. FlyBase will not capture data from oral presentations at meetings or seminars, from posters or by word of mouth (we will, however, curate published abstracts). If colleagues wish unpublished data to be considered for incorporation into FlyBase then those data must be submitted to FlyBase in writing or by using the contact FlyBase form (electronic submissions are strongly preferred).
Each personal communication will be assigned a FlyBase reference (FBrf) identifier number, and the data will be tied to this citation in the database. These references will appear in the FlyBase bibliographic files, and become citable publications upon entry into the public FlyBase database. Personal communications received in written form (i.e. not electronically) will be archived by FlyBase. For personal communications that have been sent by e-mail, the full text of the communication will be present within the Reference Report. We encourage the citation of these personal communications in the literature in the form:
Gelbart, W.M. (1994).
Personal communication to FlyBase. <http://flybase.org/reports/FBrf0075300.html>
Personal communications are incorporated into the FlyBase bibliography and can be searched using either the QuickSearch or the QueryBuilder tool.
G.7. Gene Model Annotation Guidelines
G.7.1. Criteria for Annotation
Purpose: To determine whether existing gene models are correct and complete and to determine if there is evidence for additional genes or transcripts not already represented by the existing models.
Determine whether a protein-coding gene exists in a region.
Gene prediction algorithms are sufficiently robust that this is rarely an issue for larger genes (200aa or greater), unless the gene consists of many small dispersed exons. To make a judgment in cases of small genes or genes composed of small exons (for which there is no published information), available evidence is examined further. Four types of evidence are considered:
For gene models with only one of these four types of supporting data, models with a predicted CDS greater than 100aa are created or retained. If there are two or more types of supporting data, a gene model is created if the predicted CDS exceeds 50aa. If there is BLASTX homology to a similar small gene in other species, a smaller size limit is accepted.
Is there one gene or several?
Gene splits or merges are a common annotation correction and are based upon cDNA/EST data, BLASTX homologies, or corrections submitted by the community. A comment indicating that a merge or split has occurred, along with an indication of the type of data supporting the change, is placed in the annotation record.
Determine the structure of the transcript(s).
Internal intron-exon structures are based primarily upon EST/cDNA data. If these data are absent, we rely on gene prediction data. In a few cases, approximate gene structures are inferred from BLASTX alignments. In practice, many annotations are based upon a combination of these data types. Examples:
Determine the extent of the coding region.
The Apollo annotation tool sets the translation start site to the 5'-most in-frame ATG. However, in cases supported by the literature, a non-ATG translation start site or a downstream ATG may be used. In some cases, especially for annotations supported only by BLASTX data, it is not possible to identify a likely ATG start codon. In such cases, translation is started at the 5'-most internal in-frame codon and an explanatory comment is added.
How many alternative transcripts exist?
We annotate as many alternative transcripts as are supported by cDNA/EST and community data. We will also annotate an alternative transcript if there is overwhelming gene prediction evidence and/or BLASTX evidence. If non-contiguous EST data support alternative exons in several regions of the gene, it is not always possible to determine which of all possible combinations actually exist in vivo. The number of such alternative transcripts to be created is at the discretion of the annotator, and appropriate comments are added.
Note: Combinations of 5' ESTs and 3' ESTs from different cDNA clones are used to make gene models, and this may have artificially increased the number of alternative transcripts, since not all of these combinations may exist in vivo.
Note: Partial annotations are avoided except in extreme circumstances.
Curator comments.
The Apollo annotation tool allows for the inclusion of comments associated with an annotated gene or a specific transcript of an annotated gene. We make extensive use of this capability, including controlled comments as well as free text comments. The collection of controlled comments was developed during the initial re-annotation stages, and is used as often as possible to facilitate consistency and to provide a means of tracking or querying for various atypical gene structures.
For example, all predicted splices that fail to use the canonical GT/AG donor and acceptor splice site dinucleotides are noted, as are genes that have been reported to make use of non-ATG translation starts, genes that have a dicistronic transcript, and genes known to be or appearing to be mutant in the sequenced strain. Many of the controlled comments address the weaknesses or anomalies in the annotation: an unusual alternative transcript supported by a single EST, incomplete supporting data requiring extension of a gene model to the nearest translation start or stop, or that an ATG translation start codon could not be identified. Genes that are split or merged are commented and the type of evidence supporting the change is indicated. Finally, cDNA clones that failed to accurately reflect the annotation (typically those that are incomplete or appear to include intronic sequences) are designated as problematic and have a comment attached. If such comments exist for a particular annotation, they can be found on the Gene Report, in the GENE MODEL AND FEATURES section, in a field labeled Comments on Gene Model. If comments exist for a particular transcript, they can be found on the annotated transcript report in a section called COMMENTS. This section will only appear on the report if there are comments attached to the transcript.
G.7.2. Evidence used for gene model annotation as of March 2007
Since the publication of the description of the r3.1 reannotation effort (Misra et al., 2002), a number of new and expanded data sets allow much more accurate assessment of gene models in D. melanogaster. These include:
G.8. What does the annotation evidence score mean?
The current implementation of the evidence scoring system is based on assessment of three different classes of evidence used to inform transcript annotations. These are:
Note that, in the future, we plan to refine this scoring metric to include support based on comparative genomics and proteomic analyses, as well as to potentially provide details on the quantity and quality of each type of support. Each transcript gets a score that is based on the sum of the following categories:
The points assigned for each type of evidence allow one to easily and unambiguously determine which types of evidence support a particular transcript annotation, because each possible combination of supporting types receives a unique score. For example, to identify all transcripts with cDNA support one would look for all transcripts with a score greater than or equal to 8. If instead you wanted to identify transcripts with no aligned nucleotide support you would search for transcripts with scores of 0, 2, 4 or 6. And to identify those transcripts with both supporting ESTs and gene prediction support but without a full length cDNA or protein similarity you would search for transcripts with a score equal to 5.
Support means different things for different classes of evidence.
For gene prediction support the ends of the predicted gene model must either match or be within the annotated CDS of a transcript and the internal predicted exon/intron junctions must match the annotated junctions along the entire length of the prediction. The rules are the same for EST and cDNA alignments except that the assessment is based on the entire annotated transcript and not just the coding region. For protein similarity a positive score is simply based on a region of aligned protein sequence overlapping any annotated CDS exon of an annotated transcript on the same strand. This simplistic assessment likely produces a fair number of false positives and we hope to refine this aspect of assessment to provide more meaningful confidence values.
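The way a summed score identifies a unique combination of evidence types can be illustrated with a small sketch. The per-type point values are not listed in this section, so the values below (EST = 1, protein similarity = 2, gene prediction = 4, cDNA = 8) are an assumption inferred from the three worked examples in the scoring description; the names `POINTS`, `decode_score` and `has_support` are hypothetical and are not part of any FlyBase tool.

```python
# Assumed point values, inferred from the worked examples in the text
# (cDNA support => score >= 8; no aligned nucleotide support => 0, 2, 4 or 6;
# EST + gene prediction only => 5). They are not quoted from the manual.
POINTS = {
    "EST": 1,
    "protein_similarity": 2,
    "gene_prediction": 4,
    "cDNA": 8,
}

MAX_SCORE = sum(POINTS.values())  # 15 when every evidence type is present


def decode_score(score):
    """Return the set of evidence types implied by a transcript score."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be between 0 and {MAX_SCORE}")
    # Each evidence type occupies its own binary digit, so a bitwise AND
    # tells us whether that type contributed to the sum.
    return {name for name, pts in POINTS.items() if score & pts}


def has_support(score, kind):
    """True if the score includes the given (hypothetical) evidence name."""
    return bool(score & POINTS[kind])


if __name__ == "__main__":
    for s in (0, 5, 8, 15):
        print(s, sorted(decode_score(s)))
```

Because each assumed value is a distinct power of two, every sum from 0 to 15 corresponds to exactly one combination, which is the property the scoring description relies on.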