Genome Assembly Terminology

Posted on March 13, 2012 by Marco Barnig

Below is a list of commonly used terms and definitions in the field of genomics (source : Genome Reference Consortium).

Assembly : a set of chromosomes, unlocalized and unplaced sequences and alternate loci used to represent an organism’s genome
Chromosome Assembly : a relatively complete pseudo-molecule assembled from smaller sequences that represent a biological chromosome
Diploid Assembly : a genome assembly for which a Chromosome Assembly is available for both sets of an individual’s chromosomes
Haploid Assembly : the collection of Chromosome Assemblies, unlocalized and unplaced sequences and alternate loci that represent an organism’s genome
Primary Assembly : the collection of assembled chromosomes, unlocalized and unplaced sequences that, when combined, should represent a non-redundant haploid genome
Assembly Units : collections of sequences used to define discrete parts of an assembly
Genome Patch : a contig sequence that is released outside of the full assembly release cycle
FIX patch : FIX patches are released to correct an error in the assembly and will be removed when the new full assembly is released
NOVEL patch : NOVEL patches are sequences that were not in the last full assembly release and will be retained with the next full assembly release
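Alternate Locus : a sequence that provides an alternate representation of a locus found in a largely haploid assembly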
Unlocalized Sequence : a sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome
Unplaced Sequence : a sequence found in an assembly that is not associated with any chromosome
PAR (Pseudo-autosomal region) : a region found on the X and Y chromosomes of mammals that allows recombination between the sex chromosomes
AGP File : a file used to describe the instructions for building a contig, scaffold or chromosome sequence
Contig : a contiguous sequence generated by determining the non-redundant path along an ordered set of component sequences
Component : a low-level genomic sequence used to construct the genome; typically a clone sequence, a WGS sequence or a PCR fragment
Join : the sequence overlap between two adjacent components in a contig
Scaffold : an ordered and oriented set of contigs with gaps
Switch Point : the base at which the contig sequence stops being generated from one component sequence and switches to using the next component sequence
TPF (Tiling Path file) : provides the order of the component sequences used to build a contig, scaffold or chromosome
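
To make the AGP File, Component, Contig and Scaffold definitions above more concrete, here is a minimal Python sketch that follows AGP-style instructions to build one sequence from two components and a gap. The column layout follows the commonly used tab-separated AGP convention (object, start, end, part number, type, then either component details or gap details). The scaffold name `scaf1`, the component names `clone_A` and `clone_B`, their bases and the helper `build_object` are all invented for illustration; this is not an official parser.

```python
# Toy component sequences; names and bases are invented for illustration.
COMPONENTS = {
    "clone_A": "ACGTACGTAA",
    "clone_B": "TTGGCCAATT",
}

# AGP-style, tab-separated lines for a hypothetical scaffold "scaf1":
# a component, a gap of 5 bases, then a second component on the minus strand.
AGP_TEXT = "\n".join([
    "scaf1\t1\t8\t1\tW\tclone_A\t1\t8\t+",
    "scaf1\t9\t13\t2\tN\t5\tscaffold\tyes\tpaired-ends",
    "scaf1\t14\t19\t3\tW\tclone_B\t3\t8\t-",
])

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_object(agp_text, components):
    """Concatenate component slices and gaps in the order given by the AGP lines."""
    parts = []
    for line in agp_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if cols[4] in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            parts.append("N" * int(cols[5]))
        else:
            # Component line: id, 1-based inclusive start/end, orientation.
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            piece = components[comp_id][beg - 1:end]
            if orient == "-":
                piece = reverse_complement(piece)
            parts.append(piece)
    return "".join(parts)


scaffold = build_object(AGP_TEXT, COMPONENTS)
print(scaffold)
```

The result is an ordered and oriented set of sequences separated by a gap, which is what the Scaffold definition describes. A real assembly would read the component sequences from FASTA files and validate the coordinates, both of which are left out here.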

The Human Genome Project

Posted on March 12, 2012 by Marco Barnig

The Human Genome Project (HGP) was a 13-year international scientific research project coordinated by the U.S. Department of Energy (DOE) and the National Institutes of Health (NIH). The primary goal was determining the sequence of chemical base pairs which make up DNA, and identifying and mapping the approximately 20,000-25,000 genes of the human genome from both a physical and functional standpoint.

The project began in October 1990; the completed genome sequence was announced in April 2003, two years earlier than planned. The U.S. National Center for Biotechnology Information (NCBI) houses the gene sequences in a database known as GenBank, along with sequences of known and hypothetical genes and proteins.

Specialised computer programs are needed to analyze the data, which are difficult to interpret on their own. Among the organizations creating powerful tools for storing, visualizing and searching genome data are the Genome Bioinformatics Group at the University of California, Santa Cruz (UCSC), the European Bioinformatics Institute (EBI, part of the European Molecular Biology Laboratory, EMBL) and the Wellcome Trust Sanger Institute (WTSI).

In 1999 the EBI and WTSI launched a joint scientific project called Ensembl, whose aim is to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of humans, other vertebrates and model organisms.

Ensembl Genomes release 13 was launched on March 8, 2012, bringing the total number of genomes supported to 341.

The process of identifying the boundaries between genes and other features in a raw DNA sequence is called genome annotation. It consists of two steps:
1. identifying elements on the genome, a process called gene prediction
2. attaching biological information to these elements
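
The two steps above can be sketched in a few lines of Python. This is a deliberately naive illustration, not how real annotation pipelines work: step 1 is reduced to finding forward-strand open reading frames (an ATG followed by an in-frame stop codon), which ignores introns, the reverse strand and all statistical modelling. The functions `find_orfs` and `annotate`, the sequence and the gene name `toyA` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Step 1 (toy gene prediction): forward-strand ORFs from ATG to the first in-frame stop."""
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the start codon until a stop codon is found.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3  # codons before the stop, ATG included
                if n_codons >= min_codons:
                    # Report 1-based, inclusive coordinates, stop codon included.
                    orfs.append({"start": start + 1, "end": pos + 3})
                break
    return orfs


def annotate(features, info_by_start):
    """Step 2: attach biological information to each predicted element."""
    return [
        dict(feature, **info_by_start.get(feature["start"], {"name": "unknown"}))
        for feature in features
    ]


sequence = "CCATGAAATTTGGGTAACC"  # invented example sequence
predicted = find_orfs(sequence)

# Hypothetical lookup table standing in for curated biological knowledge.
known_info = {3: {"name": "toyA", "note": "hypothetical example gene"}}
annotated = annotate(predicted, known_info)
print(annotated)
```

Keeping the two steps as separate functions mirrors the description: the prediction step only produces coordinates, and the biological meaning is attached afterwards from an independent source.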

The value of a genome is only as good as its annotation. To create a gold standard reference annotation, the Human and Vertebrate Analysis and Annotation (HAVANA) team of the WTSI uses tools developed in-house to manually annotate the human, mouse and zebrafish genomes. Based on these data, a central repository for high-quality manual annotation of finished vertebrate genome sequence, called the Vertebrate Genome Annotation (VEGA) database, has been created.

The EBI hosts the Protein and Nucleotide Database Group (PANDA), which provides all its sequence resources, and the HUGO Gene Nomenclature Committee (HGNC), the only worldwide authority that assigns standardised nomenclature to human genes. HGNC has assigned unique gene symbols and names to over 33,000 human loci, of which around 19,000 are protein coding. The HGNC website genenames.org is a curated online repository of approved gene nomenclature and associated resources.

In September 2003, the National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, to carry out a project to identify all functional elements in the human genome sequence. Both UCSC and WTSI are participating in the ENCODE project.

The WTSI set up a sub-project of the ENCODE project, called GENCODE (Encyclopædia of genes and gene variants), to annotate all evidence-based gene features in the entire human genome with high accuracy. The GENCODE gene sets are used by the entire ENCODE consortium and by many other projects as reference gene sets :

1000 Genomes
Genome at Home (Stanford University)

Genome Browsers and BioGPS

Posted on March 12, 2012 by Marco Barnig

A genome browser is a graphical interface for display of information from a biological database for genomic data. Genome browsers enable researchers to visualize and browse entire genomes with annotated data including gene prediction and structure, proteins, expression, regulation, etc.

A detailed list of existing genome browsers is available at Wikipedia. The best-known genome browsers are the following :

Ensembl
UCSC Genome Browser
Map Viewer
Integrated Microbial Genomes (IMG)
Integrative Genomics Viewer (IGV)
GBrowse software system (framework for many additional genome browsers)

GBrowse is part of GMOD, the Generic Model Organism Database project, a collection of open source software tools for creating and managing genome-scale biological databases. Another open source bioinformatics project is Galaxy, a web-based platform for data-intensive biomedical research.

BioGPS is a gene portal built with two guiding principles in mind : customizability and extensibility.
