Stanford/Packard Scientist's Data-mining Technique Strikes Genetic Gold

January 11, 2006

http://www.medicalnewstoday.com/medicalnews.php?newsid=36009

A new method to mine existing scientific data may provide a wealth of

information about the interactions among genes, the environment and

biological processes, say researchers at the Stanford University

School of Medicine, Lucile Packard Children's Hospital and Harvard

Medical School. Like panning for gold, they used the powerful

technique to sift through millions of bits of unrelated information -

in this case, gene expression data from so-called microarray

experiments - to pinpoint genes likely to be involved in leukemia,

aging, injury and muscle development.

"This is just the tip of the iceberg," said bioinformatics specialist

Atul Butte, MD, PhD, who is also a pediatrician at Lucile Packard

Children's Hospital at Stanford. "Nearly 100 different diseases have

been studied using microarrays, spanning all of medicine. This is a

new way to explore this type of data. We can study virtually

everything that's been studied." Butte is the first author of the

study, which is published in the Jan. 6 online issue of Nature

Biotechnology.

The advance comes with a caveat, however: clinically useful nuggets

will be buried under the avalanche of data inundating international

repositories each year unless scientists come up with a way to better

classify their experiments and results.

"Libraries figured out a long time ago how to classify items using

the Dewey decimal and other systems," said Butte, who estimates that

the contents of the databases are more than doubling each year. "We

need to write software now that will help scientists assign the

proper concepts to each experiment."

Microarray experiments allow researchers to compare the expression

patterns of tens of thousands of individual genes over time in

diseased and healthy cells, or in many other experimental conditions.

Each experiment generates thousands of pieces of data about the

cell's genes. Although biologists use the technology routinely,

focusing only on the few results pertinent to their particular

research topic, most scientific journals require that their authors

submit all of their data to international databases for use by other

researchers.

Butte and his Harvard co-author, Isaac Kohane, MD, PhD, used computer

programs to automatically categorize the tens of thousands of

microarray experiments in a single database based on the terms, or

concepts, used by the submitter to describe the experiment. They then

looked for findings shared by several experiments with similar

concepts, such as tissue type. Comparing results from

many similar experiments allowed them to identify correlations that

may not be statistically significant in just one experiment.
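
The pooling idea in the paragraph above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the article does not say which statistics the authors used, so Fisher's method for combining independent p-values stands in here, and the term list, gene names, experiment layout and p-values are all invented.

```python
import math
import re
from collections import defaultdict

# Hypothetical vocabulary mapping annotation terms to a concept label.
# A real system would draw on a controlled vocabulary instead.
CONCEPT_TERMS = {
    "muscle": "Muscle",
    "skeletal muscle": "Muscle",
    "leukemia": "Leukemia",
    "leukemic": "Leukemia",
}


def tag_concepts(annotation):
    """Return the concepts whose terms occur as whole words in the text."""
    text = annotation.lower()
    return {
        concept
        for term, concept in CONCEPT_TERMS.items()
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    }


def fisher_combined_p(p_values):
    """Combine independent p-values (each > 0) with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom, whose upper tail has a closed form.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    tail_sum = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail_sum


def combine_by_concept(experiments):
    """Group per-gene p-values by concept, then combine each group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for exp in experiments:
        for concept in tag_concepts(exp["annotation"]):
            for gene, p in exp["p_values"].items():
                grouped[concept][gene].append(p)
    return {
        concept: {gene: fisher_combined_p(ps) for gene, ps in genes.items()}
        for concept, genes in grouped.items()
    }


# Invented data: no single experiment reaches p < 0.05 for GENE_A.
experiments = [
    {"annotation": "Skeletal muscle biopsy, young vs. old", "p_values": {"GENE_A": 0.06}},
    {"annotation": "Mouse muscle time course", "p_values": {"GENE_A": 0.06}},
    {"annotation": "Human muscle after exercise", "p_values": {"GENE_A": 0.06}},
]
print(combine_by_concept(experiments))
```

In this toy data the three muscle experiments each miss the conventional 0.05 cutoff, yet their combined p-value falls below it. Fisher's method assumes the experiments are independent, which may not hold for data pulled from a shared repository.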

Butte and Kohane identified several previously unknown correlations:

nine genes whose expression increased or decreased significantly with

aging, two genes that are highly expressed in response to injury, and

another gene whose expression drops significantly in leukemic

cells. They also confirmed these relationships by studying genes

known to be associated with muscle tissue in both humans and mice.

Their classification system was stymied, however, when scientists

included too much or too little information in the text annotations,

or used imprecise words such as "pool," which can mean either a body

of water or the action of combining the contents of two or more

tubes.

"As a community, we've standardized the way the data itself is

represented," said Butte, "but there are no formal requirements for

the accompanying textual descriptions of this data. Sometimes people

seem to almost copy and paste their entire scientific paper into the

text box. We need to clean up our annotations because now we're

showing that they have value."

Butte and Kohane favor using the existing Unified Medical Language

System, which consists of more than 1 million biomedical concepts, to

vastly simplify the computerized sorting of the thousands of

microarray experiments submitted to databases each year. Without such

a system, valuable information will simply be lost as the results

pile up. The National Institutes of Health recently funded the

National Center for Biomedical Ontology, a consortium led by Stanford

professor Mark Musen, MD, PhD, to develop ontologies to allow

scientists to describe their data in standardized ways.

"All the answers are already there," said Butte. "We've reached a

critical mass with this data. But unless we're careful, we're going

to end up with a big mess."

Stanford University Medical Center integrates research, medical

education and patient care at its three institutions - Stanford

University School of Medicine, Stanford Hospital & Clinics and Lucile

Packard Children's Hospital at Stanford. For more information, please

visit the Web site of the medical center's Office of Communication &

Public Affairs at mednews.stanford.edu
