The Human Genome Project

Posted on March 12, 2012 by Marco Barnig

The Human Genome Project (HGP) was a 13-year international scientific research project coordinated by the U.S. Department of Energy (DOE) and the National Institutes of Health (NIH). The primary goal was determining the sequence of chemical base pairs which make up DNA, and identifying and mapping the approximately 20,000-25,000 genes of the human genome from both a physical and functional standpoint.

The project began in October 1990; a complete draft of the genome was announced in April 2003, two years earlier than planned. The U.S. National Center for Biotechnology Information (NCBI) houses the gene sequence in a database known as GenBank, along with sequences of known and hypothetical genes and proteins.

Specialised computer programs are necessary to analyze the data, because the raw data is difficult to interpret without them. Among the organizations creating powerful tools for storing, visualizing and searching genome data are the Genome Bioinformatics Group at the University of California, Santa Cruz (UCSC), the European Bioinformatics Institute (EBI, part of the European Molecular Biology Laboratory, EMBL) and the Wellcome Trust Sanger Institute (WTSI).

In 1999 the EBI and WTSI launched a joint scientific project called Ensembl, whose aim is to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of humans, other vertebrates and model organisms.

Ensembl Genomes release 13 was launched on March 8, 2012, bringing the total genomes supported to 341.

The process of identifying the boundaries between genes and other features in a raw DNA sequence is called genome annotation. It consists of two steps:
1. identifying elements on the genome, a process called gene prediction
2. attaching biological information to these elements

The value of a genome is only as good as its annotation. To create a gold standard reference annotation, the Human and Vertebrate Analysis and Annotation (HAVANA) team of the WTSI uses tools developed in-house to manually annotate human, mouse and zebrafish genomes. Based on these data a central repository for high quality manual annotation of vertebrate finished genome sequence, called The Vertebrate Genome Annotation (VEGA) database, has been created.

The EBI hosts the Protein and Nucleotide Database Group (PANDA), which provides all its sequence resources, and the HUGO Gene Nomenclature Committee (HGNC), the only worldwide authority that assigns standardised nomenclature to human genes. HGNC has assigned unique gene symbols and names to over 33,000 human loci, of which around 19,000 are protein coding. The HGNC website genenames.org is a curated online repository of approved gene nomenclature and associated resources.

In September 2003, the National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, to carry out a project to identify all functional elements in the human genome sequence. Both UCSC and WTSI are participating in the ENCODE project.

The WTSI set up a sub-project of the ENCODE project, called GENCODE (Encyclopædia of genes and gene variants), to annotate all evidence-based gene features in the entire human genome at high accuracy. The GENCODE gene sets are used by the entire ENCODE consortium and by many other projects as reference gene sets :

1000 Genomes
Genome at Home (Stanford University)

Genome Browsers and BioGPS

Posted on March 12, 2012 by Marco Barnig

A genome browser is a graphical interface for display of information from a biological database for genomic data. Genome browsers enable researchers to visualize and browse entire genomes with annotated data including gene prediction and structure, proteins, expression, regulation, etc.

A detailed list of existing genome browsers is available at Wikipedia. The best-known genome browsers are the following :

Ensembl
UCSC Genome Browser
Map Viewer
Integrated Microbial Genomes (IMG)
Integrative Genomics Viewer (IGV)
GBrowse software system (framework for many additional genome browsers)

GBrowse is part of GMOD, the Generic Model Organism Database project, a collection of open source software tools for creating and managing genome-scale biological databases. Another open source bioinformatics project is Galaxy, a web-based platform for data-intensive biomedical research.

BioGPS is a gene portal built with two guiding principles in mind : customizability and extensibility.

Chromosomes

Posted on March 11, 2012 by Marco Barnig

A chromosome is an organized structure of DNA and protein found in cells. It is a single piece of coiled DNA containing many genes, regulatory elements and other nucleotide sequences.

Chromosomes can be divided into two types—autosomes, and sex chromosomes. Human cells have 23 pairs of large linear nuclear chromosomes (22 pairs of autosomes and one pair of sex chromosomes), giving a total of 46 chromosomes.

The specific location of a gene or DNA sequence on a chromosome is called a locus (plural : loci). A variant of the DNA sequence at a given locus is called an allele. The ordered list of loci known for a particular genome is called a genetic map. Gene mapping is the process of determining the locus for a particular biological trait.

Diploid and polyploid cells whose chromosomes have the same allele of a given gene at some locus are called homozygous with respect to that gene, while those that have different alleles of a given gene at a locus, are called heterozygous with respect to that gene.

The number of genes and base pairs per chromosome varies among the different sources available on the net. The following list shows statistics from the Major Assembly GRCh37, patch 7, released by the Genome Reference Consortium on February 11, 2012.

Chromosome	Sequenced	# Genes	% DNA	# base pairs (millions)
1	May 2006	3,511	8	250
2	April 2005	2,368	8	243
3	April 2006	1,926	6.5	198
4	April 2005	1,444	6	191
5	September 2004	1,633	6	181
6	October 2003	2,057	5.5	171
7	July 2003	1,882	5	159
8	January 2006	1,315	4.5	146
9	May 2004	1,534	4.5	141
10	May 2004	1,391	4.5	136
11	March 2006	2,168	4.5	135
12	March 2006	1,714	4.5	134
13	March 2004	720	3.5	115
14	January 2003	1,532	3.5	107
15	March 2006	1,249	3.5	103
16	December 2004	1,326	3	90
17	April 2006	1,773	2.5	81
18	March 2004	557	2.5	78
19	March 2004	2,066	2	59
20	December 2001	891	2	63
21	May 2000	450	1.5	48
22	December 1999	855	1.5	51
X	March 2005	1,672	5	155
Y	June 2003	429	2	59
Total :		36,463	100	3,094
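
The totals in the last row can be re-derived from the per-chromosome figures. The short Python check below is only a sketch : the values are copied by hand from the table above, reading the gene counts as whole numbers (3,511 genes on chromosome 1, and so on), and the variable names are illustrative.

```python
# Gene counts per chromosome (1-22, X, Y), copied from the table above.
genes = [3511, 2368, 1926, 1444, 1633, 2057, 1882, 1315, 1534, 1391, 2168, 1714,
         720, 1532, 1249, 1326, 1773, 557, 2066, 891, 450, 855, 1672, 429]

# Chromosome sizes in millions of base pairs, in the same order.
mbp = [250, 243, 198, 191, 181, 171, 159, 146, 141, 136, 135, 134,
       115, 107, 103, 90, 81, 78, 59, 63, 48, 51, 155, 59]

total_genes = sum(genes)
total_mbp = sum(mbp)

print(f"{len(genes)} chromosomes, {total_genes} genes, {total_mbp} million base pairs")
```

Both sums agree with the "Total" row of the table.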

The following list gives links to different chromosome repositories :

Cells, tissues and organs

Posted on March 10, 2012 by Marco Barnig

The cell is the basic structural and functional unit of all known living organisms. It is the smallest unit of life that is classified as a living thing, and is often called the building block of life. Humans contain about 10 trillion cells.

Organisms can be classified as unicellular (consisting of a single cell; including most bacteria) or multicellular (including plants and animals).

All cells contain the hereditary information necessary for regulating cell functions and for transmitting information to the next generation of cells.

There are two types of cells: eukaryotic and prokaryotic. There are about 210 distinct human cell types. The process by which cells specialise from progenitor cells into the enormous variety of cell types that make up the body is termed differentiation. Unspecialised cells (called Stem Cells) produce cells with specialised structures. The final mature cells may be white blood cells of the immune system, neurons of the central nervous system with dendritic ‘trees’ connecting to thousands of other nerve cells, or contractile cells of skeletal or smooth muscle.

A gamete is a cell that fuses with another cell (ovum and sperm) during fertilization (conception) in organisms that reproduce sexually. A zygote is the initial cell formed when two gamete cells are joined. In multicellular organisms, it is the earliest developmental stage of the embryo.

Early embryos consist of Stem Cells that can produce any type of cell. These cells are described as Totipotent. Stem Cells are also found in a few places in adults, but these can only differentiate into a limited number of types of cell and are called Multipotent.

Cells that work together to perform a particular function are organised into Tissues which are grouped into four main categories :

Epithelial Tissue – Linings and layers
Connective Tissue – Holding structures together
Muscle Tissue – Actuation of movement
Nervous Tissue – Communication via electrical signal

Tissues that work together to perform a larger function are organised into Organs (Leaves, heart, kidneys, …). Organs may be further organised into Organ Systems, that carry out an overall function (Circulatory System, Nervous System, Reproductive System, …).

How do cells become specialized even though DNA is identical in every cell ? The short answer is that not every gene that is encoded by the DNA is expressed in every cell. All cells express a certain set of genes, often called “housekeeping” genes. These genes encode proteins that are essential for every type of cell. Each specialized type of cell also expresses a tissue-specific set of genes, which are unique to that particular tissue or organ. The cells become specialized for a particular function during the development of the organism.

DNA sequencing and bioinformatics

Posted on March 9, 2012 by Marco Barnig

DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases—adenine, guanine, cytosine, and thymine—in a molecule of DNA. The chain-termination method developed by Frederick Sanger and coworkers in 1977 at the University of Cambridge in England became the method of choice for DNA sequencing.

Once a DNA sequence has been obtained from an organism, it is stored in silico in digital format. In silico is used as an analogy to the Latin phrases in vivo, in vitro, and in situ, which are commonly used in biology. It means performed on a computer. Usually the DNA sequences are stored in sequence databases that can be searched using a variety of methods. One of these methods is BLAST (Basic Local Alignment Search Tool), a registered trademark of the National Library of Medicine.

The application of computer science and information technology to the field of biology and medicine is called Bioinformatics. A renowned Genome Bioinformatics website is developed and maintained by the Genome Bioinformatics Group, a cross-departmental team within the Center for Biomolecular Science and Engineering (CBSE) at the University of California Santa Cruz (UCSC).

A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
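
To make the matrix picture concrete, here is a minimal Python sketch of a global alignment by dynamic programming (the approach usually attributed to Needleman and Wunsch). It is not the algorithm used by BLAST or the other tools mentioned in this post; the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration only.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (row_a, row_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # residues share a column
                score[i - 1][j] + gap,             # gap inserted in b
                score[i][j - 1] + gap,             # gap inserted in a
            )

    # Trace back from the bottom-right cell to recover the two aligned rows.
    row_a, row_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[len(a)][len(b)]


row_a, row_b, total = align("ACGT", "AGT")
print(row_a)
print(row_b)
print("score:", total)
```

The two printed rows have the same length, and "-" marks the gaps inserted so that matching characters line up in the same column.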

A list of links to some biological databases and software tools is shown below :

BLAST : Basic Local Alignment Search Tool
ENCODE : Encyclopedia of DNA Elements Consortium – an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI)
Neanderthal Project : Neandertal Genome Analysis Consortium Tracks
Genome Browser : this program zooms and scrolls over chromosomes, showing the work of annotators worldwide
Gene Sorter : this program displays a sorted table of genes that are related to one another
Blat : this program quickly maps a sequence to the genome
Table Browser : this program retrieves data associated with a track in text format
Visi Gene : this program is a virtual microscope for viewing in situ images
Genome Graphs : this program is a tool for displaying genome-wide data sets
Mouse Genome Informatics : the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease

Genetics

Posted on March 8, 2012 by Marco Barnig

Genetics is the science of genes, heredity, and variation in living organisms. It’s a discipline of biology and can be applied to the study of all living systems, from viruses and bacteria, through plants and domestic animals to humans. The modern science of genetics, which seeks to understand the process of inheritance, began with the work of Gregor Mendel in the mid-19th century.

The molecular basis for genes is deoxyribonucleic acid (DNA). Genes correspond to regions within DNA, a molecule composed of a chain of four different types of nucleotides :

adenine (A)
cytosine (C)
guanine (G)
and thymine (T)

Genetic information exists in the sequence of these nucleotides. DNA exists as a double-stranded molecule, coiled into the shape of a double-helix. Each nucleotide in DNA pairs with its partner nucleotide on the opposite strand: A pairs with T, and C pairs with G. Thus, in its two-stranded form, each strand contains all necessary information, redundant with its partner strand.
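
Because of this pairing rule, one strand can be computed from the other. The Python sketch below illustrates the idea; the function names are illustrative. Since the two strands run in opposite directions, the partner strand is conventionally written as the reversed complement.

```python
# Pairing rule from the text: A pairs with T, and C pairs with G.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """Return the partner base for each position of the strand."""
    return strand.upper().translate(PAIRING)


def reverse_complement(strand):
    """Return the partner strand, read in its own direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```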

Genes are arranged linearly along long chains of DNA base-pair sequences. Eukaryotic organisms, which include plants and animals, have their DNA arranged in multiple linear chromosomes. These DNA strands are often extremely long; the largest human chromosome is about 247 million base pairs in length.

The full set of hereditary material in an organism (the combined DNA sequences of all chromosomes) is called the genome.

Well-known, registered, dynamic and private ports

Posted on February 24, 2012 by Marco Barnig

In computer networks a port is a communications endpoint in a computer’s host operating system. A port is associated with an IP address of the host, as well as the type of protocol (TCP, UDP, SCTP) used for communication. A port is identified for each address and protocol by a 16-bit number, commonly known as the port number, which completes the destination address for a communications session. Different IP addresses or protocols may use the same port number for communications.

The Internet Corporation for Assigned Names and Numbers (ICANN) is responsible for the global coordination of Internet protocol resources which includes the registration of commonly used port numbers.

The port numbers are divided into three ranges :

well-known ports (0 – 1023)
registered ports (1024 – 49151)
dynamic or private ports (49152 – 65535)
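
As a small illustration, the three ranges can be expressed as a lookup in Python. This is only a sketch; the function name and the returned labels are made up for this example.

```python
def port_range(port):
    """Classify a 16-bit port number into one of the three ranges."""
    if not 0 <= port <= 65535:
        raise ValueError(f"not a valid 16-bit port number: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic or private"


for number in (80, 3306, 49152):
    print(number, ":", port_range(number))
```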

Examples :

Well-known ports :

7 : Echo
20 & 21 : File Transfer Protocol (FTP)
23 : Telnet remote login service
25 : Simple Mail Transfer Protocol (SMTP)
43 : Whois
53 : Domain Name System (DNS) service
80 : Hypertext Transfer Protocol (HTTP) used in the World Wide Web
110 : Post Office Protocol (POP3)
143 : Internet Message Access Protocol (IMAP)
194 : IRC
443 : HTTP Secure (HTTPS)
554 : RTSP
636 : LDAP over SSL/TLS (LDAPS)

Registered ports :

1234 : VLC
1220 : QuickTime Server Admin
1935 : RTMP
2948, 2949 : MMS
3306 : MySQL
5004, 5005 : RTP
5060, 5061 : SIP
5269 : XMPP
5500, 5800, 5900 : VNC
8008 : HTTP Alternate
25565 : Minecraft server

Dynamic and private ports :

The dynamic port numbers (also known as the private port numbers) are the port numbers that any application may use to communicate with any other application, using the Internet’s Transmission Control Protocol (TCP) or the User Datagram Protocol (UDP).

More information about computer ports is available at the following links :

List of TCP and UDP port numbers : Wikipedia
Ephemeral Port : Wikipedia

Concurrent connections in browsers

Posted on February 22, 2012 by Marco Barnig

The HTTP/1.1 RFC says a single-user client SHOULD NOT maintain more than 2 simultaneous connections with any server or proxy. This rule applies per server. Using multiple domain names, such as 1.mydomain.com, 2.mydomain.com, 3.mydomain.com, allows a web developer to achieve a multiple of the per-server connection limit, even if all the domain names are CNAMEs to the same IP address. This is called domain sharding. There are several issues with this technique ; the main one is that domain sharding results in more DNS lookups and takes extra time to make the initial connections.
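
A minimal Python sketch of how a site might assign resources to shards is shown below. The host names are the example names used above; the function name, the http scheme and the use of a CRC32 checksum are assumptions for illustration. The point of hashing the path is that the same resource always maps to the same host, so browser caching keeps working.

```python
import zlib

# Example shard host names taken from the text above.
SHARDS = ["1.mydomain.com", "2.mydomain.com", "3.mydomain.com"]


def shard_url(path, shards=SHARDS):
    """Map a resource path to a fixed shard host.

    CRC32 is used because it gives the same value on every run,
    unlike Python's built-in hash() for strings.
    """
    index = zlib.crc32(path.encode("utf-8")) % len(shards)
    return f"http://{shards[index]}{path}"


for resource in ("/img/logo.png", "/img/banner.jpg", "/css/site.css"):
    print(shard_url(resource))
```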

Modern browsers don’t follow the guideline and exceed 2 connections per server. The following list shows some actual values :

Browser	connections per host	max connections
IE 9	6	35
Firefox 10	6	60
Safari 5.1	6	35
Chrome 19	6	40
Opera 11	6	35
iPhone 4	4	30
Android 3	6	35
BlackBerry 7	5	5
Opera Mobile	2	6

The data has been provided by Browserscope, a community-driven project for profiling web browsers. The goals of Browserscope are to foster innovation by tracking browser functionality and to be a resource for web developers.

Every web developer can participate in the Browserscope project by gathering test results from users “in the wild”. The project was launched in September 2009. The owners of the project are Lindsey Simon and Steve Souders.

More information about concurrent (simultaneous, parallel) browser connections is available at the following links :

K. Scott Allen : A Software Developer’s Guide to HTTP Part III–Connections

Adaptive and context-aware images

Posted on February 16, 2012 by Marco Barnig

Last update : July 1, 2014

Each image has an innate, original height and width that can be derived from the image data. This derived height and width information is content, not layout, and should therefore be rendered as HTML <img> element attributes. To override height and width for adaptive images, CSS is the best approach.

Context-aware images in responsive web design are created by declaring a 100% width in CSS :

img { width: 100%; }

Modern browsers keep the image’s proportions intact. To render a context-aware image at its native dimensions and to resize it only if it exceeds the width of its container, the classic solution is the CSS code :

img { max-width: 100%; }

But scaling down images is not sufficient. Sending huge image files to low-performance devices with narrow screens wastes bandwidth and loading time. The early HTML image tags had a lowsrc attribute, which is no longer supported by modern browsers.

Sending the right-sized image to devices without wasting bandwidth is one of the challenges in responsive web design. The main problem is that the HTML img tag has only one source attribute today.
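
Whatever the delivery mechanism, the underlying decision is simple : among the pre-rendered versions of an image, send the smallest one that still fills the space it is displayed in. The Python sketch below shows that selection only. The function name, the parameters and the list of widths are hypothetical, and it assumes the displayed width and pixel ratio are already known, which is exactly the hard part the techniques in this post try to solve.

```python
def pick_image_width(available_widths, container_width, pixel_ratio=1.0):
    """Return the smallest available width that covers the container.

    Falls back to the largest version when none is wide enough.
    """
    if not available_widths:
        raise ValueError("at least one image width is required")
    needed = container_width * pixel_ratio
    candidates = sorted(available_widths)
    for width in candidates:
        if width >= needed:
            return width
    return candidates[-1]


widths = [320, 640, 1280]                   # hypothetical pre-rendered versions
print(pick_image_width(widths, 300))        # narrow phone screen
print(pick_image_width(widths, 400, 2.0))   # high-density display
print(pick_image_width(widths, 2000))       # wider than any version
```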

The different techniques proposed in the recent past by the pioneers of context-aware images, listed at the end of this post, can be grouped as follows :

client-side current technology solutions
server-side current technology solutions
standardized future-technology solutions

Client-side current technology solutions :

CSS with background-images controlled by media-queries : Harry Roberts
dirty script to load temporary images and altering the image source path before handing it to the DOM parsing (1×1 pixel transparent GIF and <noscript> tag) : Jake Archibald, (Mairead Buchan, Antti Peisa, Vasilis van Gemert)
dynamic modification of the base tag : Scott Jehl
CSS3 content & attr() : Nicolas Gallagher

Server-side current technology solutions :

php-script dealing with AJAX requests : James Fairhurst
php-script dealing with image arrays : Craig Russel
php-script dealing with cookies : Matt Wilcox, Keith Clark, Scott Jehl, (Mark Perkins, Andy Hume)
service to resize images – tinySrc, now Sencha.io Src : James Pearce, (Graham Bird, Andrea Trasatti)
service to adapt mobile content to cell-phones with Opera Mini pre-installed : Opera Software ASA

Standardized future-technology solutions :

new html attribute pointing to a list of sources : Dominique Hazael-Massieux, Anselm Hannemann
new picture tag : Bruce Lawson, Jake Archibald, Nicolas Gallagher
progressive resolution-enhanced streaming images
new HTTP headers for content negotiation

Problems :

“Starting with small images for mobiles and upgrading to large images for desktops, without loading both and working with all browsers” is the goal of responsive design with context-aware images. So far no solution meets this objective.

The common problems are listed below :

dynamic image tags : double downloads with external scripts ; race problems with internal scripts in some browsers ; inconsistencies when different scripts change the base tag
temporary image : without javascript the image never loads
noscript tag : some browsers cannot manipulate the noscript tag
CSS generated content : only supported in Opera10+
cookies : no effect when cookies are disabled
same URL : caching fails
php scripts : CDN routing not possible
device detection : not reliable for all browsers

Some pioneers in the field of “adaptive images” are presented hereafter :

Scripts, programs and tools :

New Nine Media & Advertising : WordPress plugin
Scott Jehl, Filamentgroup : javascript responsiveimgs.js
Matt Wilcox : Adaptive Images for Responsive Designs ; http://adaptive-images.com/
James Fairhurst : Responsive Images with PHP and jQuery
Sencha.io : cloud platform for building mobile web apps and optimized fast image delivery
Opera Software ASA : proxy to adapt web content provided to devices with pre-installed Opera Mini browsers

Contributions to W3C and drafts by W3C :

RICG (Responsive Images Community Group) : Use Cases and Requirements for Standardizing Responsive Images
Dominique Hazael-Massieux : Adaptive images
Asbjørn Ulsberg : Adaptive Images
Mathew Marquis : Florian’s Compromise

Blogs :

Estelle Weyl : Clown Car Technique : Solving Adaptive Images In Responsive Web Design (61 comments)
Christopher Schmitt : Adaptive Images for Responsive Web Design (10 likes) ; The problem with adaptive images (6 comments)
C. Dain Miller : Adaptive images : solving the responsive image problem (20 comments)
Jason Grigsby : Responsive Web Design Business Challenges (14 comments) ; Preferred solutions for responsive images (27 comments) ; Clarification on device detection for images (5 comments) ; Device detection as the future friendly img option (18 comments) ; Responsive IMG’s : Choosing between semantic markup and working code (5 comments) ; Other mobile first responsive web design challenges (7 comments) ; Responsive IMGs – Part 1 (22 comments) ; Part 2 – In-depth Look at Techniques (37 comments) ; Part 3 – Future of the image tag (22 comments) ; Where are the Mobile First Responsive Web Designs? (25 comments) ; Weekend Reading : responsive web design and mobile context (3 responses)
Keith Clark : Responsive images using cookies (30 comments)
Bruce Lawson : Notes on Adaptive Images (yet again!) (79 comments) ; Adaptive images : end of round one (61 comments) ; Responsive images – interim report (20 comments)
Ethan Marcotte : Fluid images (50 comments)
Chris Coyier : Techniques for Context Specific Images (37 comments) ; Which responsive images solution should you use? (39 comments)
Robert Nyman : Discussing alternatives for various sizes of the same image & introducing src property in CSS as an option (44 comments)
Jake Archibald : Adaptive Images for Responsive Designs… Again (31 comments)
Harry Roberts : Responsive images right now (29 comments)
Yoav Weiss : Simpler responsive images proposal (5 comments) ; Preloaders, cookies and race conditions (9 comments) ; Responsive images – hacks won’t cut it (8 comments) ; My take on adaptive images (9 comments)
Scott Jehl : Responsive Images: Experimenting with Context-Aware Image Sizing (25 comments)
James Pearce : First, Understand your Screen (85 comments)
Craig Russel : Responsive Images and Context Aware Image Sizing (18 comments)
Mairead Buchan : Creating responsive images using the noscript tag (112 comments)
Nicolas Gallagher : Responsive images using CSS3 (6 comments)
Tim Wright : Picturefill 2.0: Responsive Images And The Perfect Polyfill (46 comments)
Eric Portis : Responsive Images Done Right: A Guide To <picture> And srcset (31 comments)

more contributors :

Internet with a Brain

Your browser becomes your personal assistant and Internet gets a synthetic consciousness