Putative novel functional families in Human ncDNA
The material on this page supplements
our recent ISMB 2004 paper:
Gill Bejerano, David Haussler, Mathieu Blanchette.
"Into the heart of darkness: large-scale clustering of human non-coding DNA."
Bioinformatics (2004, Aug 4);20 Suppl 1:I40-I48.
Paper Abstract
MOTIVATION: It is currently believed that the human genome contains
about twice as much non-coding functional regions as it does
protein-coding genes, yet our understanding of these regions is very
limited.
RESULTS: We examine the intersection between syntenically
conserved sequences in the human, mouse and rat genomes, and sequence
similarities within the human genome itself, in search of families of
non-protein-coding elements. For this purpose we develop a graph
theoretic clustering algorithm, akin to the highly successful methods
used in elucidating protein sequence family relationships. The
algorithm is applied to a highly filtered set of about 700,000
human-rodent evolutionarily conserved regions, not resembling any known
coding sequence, which encompasses 3.7% of the human genome. From
these, we obtain roughly 12,000 non-singleton clusters, dense in
significant sequence similarities. Further analysis of genomic
location, evidence of transcription and RNA secondary structure reveals
many clusters to be significantly homogeneous in one or more
characteristics. This subset of the highly conserved non-protein-coding
elements in the human genome thus contains rich family-like structures,
which merit in-depth analysis.
Human ncDNA Clusters
Two text files are available for download:
- ismb2004.clusters.txt - a list of all clusters found.
- ismb2004.intervals.txt - a companion list of the genomic location of each interval mentioned
in the above file, relative to the human July 2003 (NCBI Build 34)
genome draft sequence.
Work is in progress to further refine the clustering method and annotate these novel clusters.
Source Code
The source code for the clustering algorithm is currently not available for download.
Please refer to the paper for an exact description of the underlying algorithm.
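For readers who only want a feel for what graph-based clustering of sequence similarities looks like, a minimal sketch follows. It is not the algorithm from the paper: it uses plain connected components (single linkage) over a similarity graph, a much simpler stand-in that ignores how dense each cluster is. The function name, the interval names and the similarity pairs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cluster_by_similarity(intervals, similar_pairs):
    """Group intervals into connected components of the similarity graph.

    Illustrative stand-in only; NOT the clustering method of the paper.
    """
    # Union-find forest: every interval starts as its own root.
    parent = {name: name for name in intervals}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each significant similarity merges the two components it touches.
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect members under their root.
    groups = defaultdict(list)
    for name in intervals:
        groups[find(name)].append(name)

    # Report only non-singleton clusters, in a stable sorted order.
    return sorted(sorted(members) for members in groups.values() if len(members) > 1)


# Hypothetical interval names and similarity pairs (not taken from the data files).
example_intervals = ["iv1", "iv2", "iv3", "iv4", "iv5", "iv6"]
example_pairs = [("iv1", "iv2"), ("iv2", "iv3"), ("iv4", "iv5")]
clusters = cluster_by_similarity(example_intervals, example_pairs)
```

In this toy input, iv1, iv2 and iv3 end up together because similarity is chained through iv2, while iv6 has no similarities and is dropped as a singleton. Chaining of this kind is exactly what a density-aware method is meant to limit, so the paper remains the reference for the method actually used.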
Gill Bejerano
Last modified: Wed Sep 1 13:56:55 PDT 2004