GeneChip Expression Analysis Technology

DNA oligonucleotides (oligos) are short stretches of DNA. Because the two strands of DNA pair through complementary bases, one can design an oligo whose sequence is complementary to any gene of interest. If these oligos, or probes, are attached to a solid surface in a defined grid (x rows and y columns), the result is a genechip (also called a microarray or DNA chip).
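The idea of designing a complementary probe can be illustrated with a minimal Python sketch. The function names, the helper `design_probe`, and the example sequence are hypothetical and chosen only for illustration; real probe design also considers melting temperature, specificity, and secondary structure, none of which is modeled here.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence.

    The two strands of DNA run in opposite directions, so the strand that
    pairs with `seq` is the complement read in reverse order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def design_probe(target: str, start: int, length: int) -> str:
    """Hypothetical helper: an oligo complementary to target[start:start+length]."""
    region = target[start:start + length]
    if len(region) < length:
        raise ValueError("region extends past the end of the target")
    return reverse_complement(region)


if __name__ == "__main__":
    target = "ATGGCGTACGTTAGC"  # made-up sequence, not a real gene
    print(design_probe(target, 3, 6))  # probe for the region GCGTAC
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check on the pairing table.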

If the genechip is brought into contact with a solution containing sequences of gene products, these products (targets) will bind to their complementary probes. This process is called hybridization. The more copies of a gene product the solution contains, the more of them will hybridize with the corresponding probe on the surface of the microarray.

To identify the hybridized targets, it is necessary to label them. There are several techniques for doing so. Besides labeling with radioactive isotopes, the most common technique is labeling with fluorescent dyes, as is also done in FISH (fluorescence in situ hybridization). The most popular are cyanine dyes, especially Cy3, which fluoresces in the green region, and Cy5, which fluoresces in the red region.

One dye color is sufficient to measure the abundance of particular gene products in particular regions by scanning the microarray. The most common approach, however, is a two-color design in which one of the two samples of gene products is a universal reference sample.
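In a two-color design, the two scanned intensities of each spot are commonly summarized as a log2 ratio of sample to reference. The sketch below is an illustration under stated assumptions: that the sample is labeled with Cy5 and the reference with Cy3 (the text does not fix this assignment), that intensities are already background-corrected, and that all numbers are invented. Real pipelines also normalize between channels, which is omitted here.

```python
import math


def log2_ratio(sample: float, reference: float, floor: float = 1.0) -> float:
    """Return log2(sample / reference).

    Both intensities are raised to at least `floor` so that a zero or
    negative reading does not cause a division by zero or a math error.
    """
    return math.log2(max(sample, floor) / max(reference, floor))


# Invented intensities per spot: (Cy5 sample, Cy3 reference).
spots = {
    "gene_a": (1600.0, 400.0),   # more abundant in the sample
    "gene_b": (250.0, 1000.0),   # less abundant in the sample
    "gene_c": (500.0, 500.0),    # unchanged
}

ratios = {name: log2_ratio(cy5, cy3) for name, (cy5, cy3) in spots.items()}

if __name__ == "__main__":
    for name, value in ratios.items():
        print(f"{name}: log2 ratio = {value:+.2f}")
```

A positive value means the gene product is more abundant in the sample than in the reference, a negative value means it is less abundant, and zero means no difference.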

A gene product is the biochemical material, either functional RNA or protein, resulting from the activity (expression) of a gene. The amount of a gene product depends on how active the gene is. In most expression experiments, messenger RNA (mRNA), usually copied into labeled cDNA or cRNA, is the gene product that is measured, because its abundance reflects how strongly each gene is transcribed. Ribosomal RNA (rRNA), by contrast, is one of only a few gene products present in all cells; it is part of the ribosome, which translates mRNA into chains of amino acids.

After hybridization, the unbound material is washed away and the microarray is scanned. Once the data have been collected, they can be analyzed with bioinformatics tools. The results are usually published and shared with the scientific community in specialized databases.

The Human Genome Project

The Human Genome Project (HGP) was a 13-year international scientific research project coordinated by the U.S. Department of Energy (DOE) and the National Institutes of Health (NIH). Its primary goals were to determine the sequence of chemical base pairs that make up human DNA, and to identify and map the approximately 20,000-25,000 genes of the human genome from both a physical and a functional standpoint.

The project began in October 1990; a complete draft of the genome was announced in April 2003, two years earlier than planned. The U.S. National Center for Biotechnology Information (NCBI) houses the gene sequence in a database known as GenBank, along with sequences of known and hypothetical genes and proteins.

Specialised computer programs are necessary to analyze the data, because the raw data are difficult to interpret without them. Among the organizations creating powerful tools for storing, visualizing and searching genome data are the Genome Bioinformatics Group at the University of California, Santa Cruz (UCSC), the European Bioinformatics Institute (EBI, part of the European Molecular Biology Laboratory, EMBL) and the Wellcome Trust Sanger Institute (WTSI).

In 1999 the EBI and WTSI launched a joint scientific project called Ensembl, whose aim is to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of humans, other vertebrates and model organisms.

Ensembl Genomes release 13 was launched on March 8, 2012, bringing the total genomes supported to 341.

The process of identifying the boundaries between genes and other features in a raw DNA sequence is called genome annotation. It consists of two steps:
1. identifying elements on the genome, a process called gene prediction
2. attaching biological information to these elements
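The two steps can be illustrated with a deliberately naive Python sketch. Scanning for open reading frames (an ATG start codon followed in the same frame by a stop codon) is only a toy stand-in for gene prediction, which in practice relies on far more evidence; the function names, the record fields, the example sequence, and the note text are all hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Step 1 (toy gene prediction): find forward-strand open reading frames.

    Returns (start, end) pairs with 0-based start and exclusive end, where
    seq[start:end] begins with ATG and ends with an in-frame stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def annotate(orfs, notes):
    """Step 2: attach biological information to each predicted element."""
    return [
        {"start": s, "end": e, "strand": "+",
         "note": notes.get((s, e), "unknown function")}
        for s, e in orfs
    ]


if __name__ == "__main__":
    sequence = "CCATGAAATTTTAGGG"  # made-up sequence
    predicted = find_orfs(sequence)
    for record in annotate(predicted, {(2, 14): "hypothetical protein"}):
        print(record)
```

Keeping the two steps in separate functions mirrors the description above: the coordinates come from prediction, while the attached note can be revised later without rerunning it.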

A genome is only as valuable as its annotation. To create a gold-standard reference annotation, the Human and Vertebrate Analysis and Annotation (HAVANA) team of the WTSI uses tools developed in-house to manually annotate the human, mouse and zebrafish genomes. Based on these data, a central repository for high-quality manual annotation of finished vertebrate genome sequence has been created: the Vertebrate Genome Annotation (VEGA) database.

The EBI hosts the Protein and Nucleotide Database Group (PANDA), which provides all its sequence resources, and the HUGO Gene Nomenclature Committee (HGNC), the only worldwide authority that assigns standardised nomenclature to human genes. HGNC has assigned unique gene symbols and names to over 33,000 human loci, of which around 19,000 are protein coding. The HGNC website genenames.org is a curated online repository of approved gene nomenclature and associated resources.

In September 2003, the National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, to carry out a project to identify all functional elements in the human genome sequence. Both UCSC and WTSI are participating in the ENCODE project.

The WTSI set up a sub-project of the ENCODE project, called GENCODE (Encyclopædia of genes and gene variants), to annotate all evidence-based gene features in the entire human genome with high accuracy. The GENCODE gene sets are used by the entire ENCODE consortium and by many other projects as reference gene sets.

DNA sequencing and bioinformatics

DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases—adenine, guanine, cytosine, and thymine—in a molecule of DNA. The chain-termination method developed by Frederick Sanger and coworkers in 1977 at the University of Cambridge in England became the method of choice for DNA sequencing.

Once a DNA sequence has been obtained from an organism, it is stored in silico, that is, in digital format. The term in silico was coined by analogy with the Latin phrases in vivo, in vitro, and in situ, which are commonly used in biology; it means "performed on a computer". DNA sequences are usually stored in sequence databases that can be searched using a variety of methods. One of these methods is BLAST (Basic Local Alignment Search Tool), a registered trademark of the National Library of Medicine.

The application of computer science and information technology to the field of biology and medicine is called Bioinformatics. A renowned Genome Bioinformatics website is developed and maintained by the Genome Bioinformatics Group, a cross-departmental team within the Center for Biomolecular Science and Engineering (CBSE) at the University of California Santa Cruz (UCSC).

A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
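A small Python sketch can make the matrix picture concrete. It uses the Needleman-Wunsch dynamic-programming method for global alignment, chosen here only as a well-known illustration (the text does not name a particular algorithm), and the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Globally align two sequences; return (row_a, row_b, score).

    The two returned rows have equal length, with '-' marking inserted gaps.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = global_align("ACGT", "AGT")
    print(top)
    print(bottom)
    print("score:", total)
```

Printed one above the other, the two rows form the matrix described above: each column holds either a pair of aligned residues or a residue opposite a gap.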

Some biological databases and software tools are listed below:

  • BLAST : Basic Local Alignment Search Tool
  • ENCODE : Encyclopedia of DNA Elements Consortium – an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI)
  • Neanderthal Project : Neandertal Genome Analysis Consortium Tracks
  • Genome Browser : this program zooms and scrolls over chromosomes, showing the work of annotators worldwide
  • Gene Sorter : this program displays a sorted table of genes that are related to one another
  • Blat : this program quickly maps a sequence to the genome
  • Table Browser : this program retrieves data associated with a track in text format
  • VisiGene : this program is a virtual microscope for viewing in situ images
  • Genome Graphs : this program is a tool for displaying genome-wide data sets
  • Mouse Genome Informatics : the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease

Genetics

Genetics is the science of genes, heredity, and variation in living organisms. It is a discipline of biology and can be applied to the study of all living systems, from viruses and bacteria, through plants and domestic animals, to humans. The modern science of genetics, which seeks to understand the process of inheritance, began with the work of Gregor Mendel in the mid-19th century.

The molecular basis for genes is deoxyribonucleic acid (DNA). Genes correspond to regions within DNA, a molecule composed of a chain of four different types of nucleotides:

  • adenine (A)
  • cytosine (C)
  • guanine (G)
  • thymine (T)

Genetic information is encoded in the sequence of these nucleotides. DNA exists as a double-stranded molecule, coiled into the shape of a double helix. Each nucleotide in DNA pairs with its partner nucleotide on the opposite strand: A pairs with T, and C pairs with G. Thus, in the double-stranded form, each strand contains all the necessary information and is redundant with its partner strand.

Genes are arranged linearly along long chains of DNA base-pair sequences. Eukaryotic organisms, which include plants and animals, have their DNA arranged in multiple linear chromosomes. These DNA strands are often extremely long; the largest human chromosome is about 247 million base pairs in length.

The full set of hereditary material in an organism (the combined DNA sequences of all chromosomes) is called the genome.