
Some Assembly Required

By Roger Smith and Alexandra Weber Morales

June 01, 2001, Dr. Dobb's


Picture this: In a tiny, cluttered garage office stuffed with children's pictures and toys, an overflowing bookcase (The Idiot's Guide to Red Hat Linux, How Brains Think) and two Linux and Windows workstations, a 41-year-old doctoral candidate at the University of California at Santa Cruz is coding the program that will assemble a first draft of the human genome. He and his colleagues are motivated partially by fear: Some predict that if the international consortium he's helping doesn't speed its progress in charting the nucleotide structure of human DNA, the very material that defines us as a species could be locked up by commercial patents. In four 80-hour weeks—from May 22 to June 22, 2000—the amiable but intense Jim Kent will write most of the 10,000 lines of GigAssembler, taking time out to put ice packs on his wrists to ward off pain from the repetitive stress they must endure.

He works at this frenetic pace because a technologically endowed private venture, headed by maverick DNA researcher J. Craig Venter, is itself closing in on sequencing the three billion nucleotides that make up our genome. Though Venter disavows the accusation that he plans to patent thousands of genes, he does have something to prove. After the National Institutes of Health rejected his proposal to quickly map the genome with the "shotgun" method that he'd used to sequence the entire flu bacterium in 1995, Venter, a former NIH lab chief, founded Celera Genomics in 1998. The money and machinery behind the start-up came from Applera Corporation, parent of Applied Biosystems, which builds state-of-the-art DNA sequencing robots. "Discovery can't wait" is the Rockville, Maryland-based Celera's motto. Nor did it wait—with Celera breathing down their necks, researchers for the public Human Genome Project beat their own deadline for drafting the genome.

The first heat is over. After the histrionics had died down and a scientific publishing flap was resolved, the Human Genome Project and Celera Genomics agreed to release their results in the Feb. 15, 2001 issue of Nature and the Feb. 16, 2001 issue of Science, respectively. The story of how Jim Kent—along with computer science professor David Haussler, UCSC's coordinator for genome research—played a part in getting a once-plodding project to the finish line is one of personal motivation in the face of extremely distributed development processes. But first, a brief history.

Slipping the Schedule

The idea of tackling the human genome surfaced in 1985, starting with a brainstorming session convened by Robert Sinsheimer, then chancellor at UCSC. The next year, Nobel laureate Renato Dulbecco suggested that whole-genome sequencing could revolutionize cancer research, while Charles DeLisi, head of the office of Health and Environmental Research at the Department of Energy (DOE), proposed a crash program for meeting that goal.

By 1990, the Human Genome Project had gotten underway and was aiming for completion in 2005 at a cost of $3 billion. There were three goals: first, locate specific genes to their relative positions on the chromosomes; second, physically map the positions, in numbers of base pairs, of known genes and landmarks; and third, sequence the entire chain, using DNA from several individuals.

Early progress was rapid. Genes were identified for debilitating diseases such as muscular dystrophy, Alzheimer's and some cancers. Still, by 1998 only 3 percent of the entire human genome had been sequenced. Celera's entry into the race that year made a big splash, both in the popular press and among scientists in the emerging discipline of bioinformatics (computational biology). "It's like a private company in 1967 announcing they're going to race NASA to the moon," says Harvard professor and AIDS researcher William Haseltine.

"There is no denying that Celera was helpful in giving us the incentive to move quickly and focus our efforts," Kent told Software Development Technical Editor Roger Smith in a recent interview at his office. "On the other hand, while the organization of the public effort is quite diffuse, it's been remarkably cooperative. In some ways, it's amazing that it could possibly work, but it does."

The Human Genome Project is a truly international effort, with much of the funding coming from the Wellcome Trust, a British medical charity. The U.S. DOE's Human Genome Program and the NIH's National Human Genome Research Institute (NHGRI) oversee research in the U.S. Over half of the worldwide effort has taken place here, in various genome centers at the National Laboratories (Los Alamos, Lawrence Livermore and Lawrence Berkeley) and universities such as Baylor, MIT and Washington University in St. Louis.

Running on a Linux cluster of up to 100 Pentium III workstations, Kent's GigAssembler program analyzes data consolidated from the genome consortium's sequencing laboratories and stored in GenBank, a public DNA sequence database. Designed by the National Center for Biotechnology Information to provide the scientific community access to the most up-to-date and comprehensive DNA sequence information, GenBank contains annotated, publicly available DNA sequences. Exchanging data on a daily basis with the DNA DataBank of Japan and the European Molecular Biology Laboratory, GenBank has grown exponentially during the past few years, from two million sequence records in 1997 to approximately 11 million in February 2001. While large sets are sometimes still submitted by tape, scientists can now use a Web tool, BankIt, to submit simple sequences.

"It's a unique situation, in my experience," Kent says. "No one really has any authority over anyone else. The only way we can proceed is by consensus. Yet, at the same time, it's been going quite quickly, largely because our interests are so aligned—we all very strongly want to do the same thing."

Too Many CDs

After earning his bachelor's and master's degrees in mathematics at UCSC in 1983, Kent began writing graphics and animation programs for Amiga and Atari personal computers. He shifted to IBM PCs with the advent of VGA cards, writing the Animator program for Autodesk Inc. The software sold well, financing more academic pursuits.

"Around 1996, I got bored. It seemed to me that my job for the last three years was doing the same thing on a new variant of the Microsoft operating system. It was a shakeout period before they settled on the DirectX [graphics] standard, and their APIs were changing so fast, especially their graphic APIs: two or three major graphic APIs per operating system. I was fed up with it. The last straw was when the developers' kit for Windows 95 came out on 12 CDs," Kent remarks. "The entire human genome fits on one CD. You can't tell me it [software] needs to be that complicated."

Back in school, Kent began analyzing the DNA of C. elegans, the one millimeter-long roundworm much studied by biologists. In December 1999, he was tapped by Haussler—himself recruited by Eric Lander, director of the Genome Center at MIT's Whitehead Institute—to help analyze the genetic blueprint of a slightly more complicated organism: H. sapiens.

How Sequencing Works

At the core of each living cell, be it human, rabbit or worm, are developmental blueprints—a complete gene complement—written in deoxyribonucleic acid (DNA) molecules. DNA is a twisted ladder structure built of nucleotides: "base pairs" of adenine (A) bonded to thymine (T) or cytosine (C) bonded to guanine (G). A strand of these bases—half of the helix—is abbreviated like this:

ATTCGAGCTCGGTACCTTTTCCTGCCATG
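
Because A always pairs with T and C always pairs with G, either strand determines its partner. As a rough illustration (a sketch added here, not part of the original article), the pairing rule takes only a few lines of Python:

# Watson-Crick base pairing: A <-> T, C <-> G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the partner strand implied by base pairing."""
    return "".join(PAIR[base] for base in strand)

print(complement("ATTCGAGCTCGGTACCTTTTCCTGCCATG"))
# prints TAAGCTCGAGCCATGGAAAAGGACGGTAC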

The chain of nucleotides that comprises the genome for the H. influenzae bacterium is 1.8 million base pairs long; those of the fruit fly and the human are 120 million base pairs and 3.5 billion base pairs long, respectively. At the root of the sequencing problem is the fact that current technology can only read about 500 nucleotides at a time. The most common method of sequencing is based on Frederick Sanger's Nobel prize-winning technique, developed in 1977, for using enzymes to synthesize DNA chains of varying length in four different reactions, stopping the DNA replication at positions occupied by one of the four bases, and then determining the resulting fragment lengths.

Unpolymerized As, Cs, Ts and Gs are incubated with single-stranded DNA and DNA polymerase. A small fraction of a given base is chemically modified so that, if incorporated, it stops the DNA polymer from growing. Since millions of chains are synthesized in this reaction, you end up with a population of DNA chains that stop at each instance of that given base. Thus, if the template is:

GCAATCAGTACCACTA

you end up with the following chains:

GCA
GCAA
GCAATCA
GCAATCAGTA
GCAATCAGTACCA
GCAATCAGTACCACTA
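
A toy Python sketch (an illustration of the bookkeeping, not of the chemistry) shows how stopping replication at one base yields exactly that population of prefixes:

def termination_fragments(template, stop_base):
    """Chains that terminate at each occurrence of stop_base in the template."""
    return [template[:i + 1] for i, base in enumerate(template) if base == stop_base]

print(termination_fragments("GCAATCAGTACCACTA", "A"))
# ['GCA', 'GCAA', 'GCAATCA', 'GCAATCAGTA', 'GCAATCAGTACCA', 'GCAATCAGTACCACTA']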

Figure 1. Fluorescent Trace-Data From Ensembl Trace Server


Similar reactions are performed with the other three bases. A dedicated sequencing machine, such as the $300,000 Applied Biosystems PRISM 3700 DNA Analyzer, automates the initial reaction and analysis. It uses gel electrophoresis—placing the fragments in charged polymeric goo and watching them migrate depending on their length—to sort the chains. Each base is tagged with a specific fluorescent dye, allowing the DNA sequence to be read by the machine, as illustrated in Figure 1. Now comes the hardest part: putting millions of pieces back together in the right order.

Deconstructing Mary

The private and public human genome projects started out with two fundamentally different approaches to assembling small, accurately read sequences into the larger draft map of the genome. Taking advantage of one of the largest civilian computers in the world, an $80-million Compaq supercomputer with four terabytes of memory, Celera Genomics used Venter's "whole-genome shotgun sequencing," a method that blasts the DNA randomly into thousands of partially overlapping fragments, then uses the overlaps to put the genome back together, somewhat akin to a giant jigsaw puzzle. Kent offers the following "nursery rhyme" analogy to explain the differences between Venter's shotgun approach and the public consortium's hierarchical method.

In assembling the overlapping pieces, sometimes placement is uncertain, or pieces don't fit. One of the most challenging engineering problems, according to Kent, is coping with the numerous similar DNA sequences, or "repeats"—recurring sets of letters in the human genome.

The hierarchical approach breaks the repeats into big pieces, which are sequenced separately and then combined; that is,

maryhadalittlelamblittlelamblittlelamb

might be sequenced and assembled into one piece, and

maryhadalittlelambwhosefleecewaswhiteassnow

assembled into another; and only afterward would the two big pieces be combined. The "paired read" shotgun approach comes at the problem in a different fashion. Instead of completely sequencing a 4,000-base-long assembly, say, you sequence only the shorter, 500-base ends and record the distance between them:

maryhadali--------cewaswhite

Because this measurement is a known distance, these "paired read" markers help you both build an initial assembly and check the quality of a more finished one.
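
To make the nursery-rhyme analogy concrete, here is a small Python sketch of the two ideas: gluing fragments together at their longest overlap, and using a paired read's known span as a sanity check. It is only an illustration; GigAssembler's real logic is far more involved.

def merge_by_overlap(left, right, min_overlap=3):
    """Glue two fragments together at their longest exact overlap."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return left + right  # no usable overlap; just concatenate

merged = merge_by_overlap("maryhadalittlelamb", "littlelambwhosefleecewaswhiteassnow")
print(merged)  # maryhadalittlelambwhosefleecewaswhiteassnow

def check_pair(assembly, left_end, right_end, span, slack=2):
    """Does the paired read's recorded span roughly match the assembly?"""
    start = assembly.find(left_end)
    stop = assembly.find(right_end)
    if start < 0 or stop < 0:
        return False
    return abs((stop + len(right_end) - start) - span) <= slack

print(check_pair(merged, "maryhadali", "cewaswhite", span=37))  # True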


Assembly Required

The two methods are complementary, Kent points out. "Toward the end of the project, in March of 2000, Celera used parts of the hierarchical approach in a divide-and-conquer effort to make the problem smaller, while we used paired reads, as well." With either approach, the alignment problem (locating overlaps in the smaller DNA sequences) is computationally intensive. "At one point, Celera had a pile of 50 million reads. I had, at most, 400,000 longer pieces because, basically, I was doing the second step of a hierarchical assembly.

"Computationally, the alignment problem is this," Kent explains. "You have two strings. How can we put them together so that the most bases will match? On our [Linux] cluster, we spend three days doing alignments and maybe two hours doing the assembly on the alignments."

Similarly, Celera's process of comparing every read against every other read in search of complete end-to-end overlaps of at least 40 base pairs (and with no more than 6 percent difference in the match) took 10,000 CPU hours, or about five days, running on a suite of 40 four-processor Alpha SMPs with 4 gigabytes of RAM each. Since the chemical procedures used to sequence DNA aren't perfect (the error rate is about 5 percent), it takes more than simple string comparisons to find two overlapping pieces of DNA. A procedure that works well for finding these overlaps is to build up an index that indicates where every 12-mer (12 letters, or 24 bits of data) in the DNA database is located.

Once the index is built, multiple query sequences can be quickly located within the original database. Since the average query sequence is about 500 bases long, there are typically at least 15 or 20 12-mers inside of such a sequence, provided there is no error. Therefore, a program can use the index to quickly look in both the query and the database sequence for neighboring clusters of 12-mers. In a matter of milliseconds, this clustering reduces the problem of locating a 500-base query in a 3 billion-base database to that of locating a 500-base query in perhaps a 1000-base window of the database. More leisurely algorithms can then work out the alignment details inside of the smaller window, also in a matter of milliseconds.
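
A stripped-down Python sketch of that index (hypothetical names; the real tools also handle sequencing errors and reverse complements) looks roughly like this:

from collections import defaultdict

K = 12  # a 12-mer, at 2 bits per base, is 24 bits of data

def build_index(database):
    """Map every 12-mer in the database to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(database) - K + 1):
        index[database[pos:pos + K]].append(pos)
    return index

def candidate_windows(query, index, window=1000):
    """Database windows whose 12-mers cluster with the query's 12-mers."""
    buckets = set()
    for offset in range(len(query) - K + 1):
        for pos in index.get(query[offset:offset + K], ()):
            buckets.add((pos - offset) // window)  # approximate start of the hit
    return buckets

The slower, more careful alignment algorithms then need to run only inside each candidate window rather than across the full three billion bases.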

Asked if the public consortium's data is of poor quality—a criticism leveled by Venter—Kent responds: "Well, a third party needs to judge. Celera may be biased because of their financial involvement. We're using the same [sequencing] machines as they are. Our reads are as accurate as theirs. It's a question of the assembly.

"We took the hierarchical approach, which has some advantages. You can give out different pieces to different people to work on. This means some pieces are finished earlier than other pieces. At this point in the project, approximately one-third of the entire genome is completely finished—all the gaps have been removed, processed by humans, run through all kinds of quality checks. What we are talking about is the unfinished areas of the draft where our typical scaffold, or ordered piece size, is 10X [10 times the size of an individual gene], while with Celera, the typical scaffold size is 100X, or an order of magnitude larger."


Patenting a Discovery?

The consortium's publication this year alleviated Kent's concern that private efforts would force scientists to go through the U.S. Patent and Trademark Office in order to work on the assembled human genome.

"Until quite recently, the level of effort required to get a gene patent was quite low," he says. "You could spend two to three months in the lab generating the sequence and all of two hours on a computer analyzing it, and you might end up with the raw material for 500 to 1,000 patents." The patent office tightened up the rules last January, after complaints from leaders of the public consortium, including current NHGRI Director Francis Collins and James Watson, first director of the Human Genome Project and best-known for discovering, with Francis Crick, the double-helix structure of DNA. "But," Kent maintains, "I'm still not sure the bar is high enough."

The mapping of the genome is comparable to the discovery of the periodic table of elements in the 19th century, after which the next 100 years were spent filling in the holes.

"Say you had purified a new metal—found tungsten, perhaps—people [are being given the right to] patent anything that involves tungsten: making a foil, conducting electricity, conducting heat—anything that follows from any property of the metal," Kent points out. Next to the periodic table, mapping the human genome is a vast undertaking; instead of a hundred or so elements, it entails 30,000 to 50,000 genes.


Stone Soup


In addition to the GigAssembler program, Kent created another tool called the Human Genome Browser (http://genome.ucsc.edu/goldenPath/hgTracks.html), which gives Web users a quick display of various portions of the genome at different scales, along with more than two dozen tracks of information (genes, assembly gaps, chromosomal bands and so on) associated with the completed human genome sequence. As Kent explains, the browser he wrote follows an open, "stone soup" development process, with contributions to the kettle from the Ensembl group in England, Genoscope in France, Baylor College of Medicine in Houston, Washington University in St. Louis, University of Washington in Seattle, the International SNP Consortium, Affymetrix Inc., Softberry Inc. and NCBI.

"There are about 20 or 25 data tracks in the browser that can be turned on or off now. And since I was responsible for putting in four or five of the tracks, I might as well call these the carrots," Kent says, pointing at the screen, "since carrots are traditional incentives."

Celera, too, has created software for surfing the genome: a "Discovery System" that allows subscribers to use its databases, non-proprietary genome and biological datasets, computational tools and super-computing power.

"The model is, you don't own the data; it's not secret information," said Venter in an April 2000 television interview on the Public Broadcasting Service's Newshour. "It's such a large data set that [the model is] making it useful, making it interpretable, making it so that pharmaceutical companies, scientists, universities and government can use the human genetic code and understand what it means, how to come up with new treatments for disease.

"Just printing the Drosophila genome in very tiny print covering the entire sheet of paper is a stack of paper about 5.5 feet tall. That's the genetic code. That's just the As, Cs, Gs and Ts. That's not the interpretation. That's not the linkage out to tens of thousands of scientific articles in the literature, to disease information."

In January, Venter's company signed a multi-year subscriber agreement with the University of California system allowing UC investigators access to all of Celera's database products—an agreement that Venter calls "personally very gratifying." Venter's own 1975 Ph.D. in physiology and pharmacology comes from UC San Diego.

And, in another sign that the rivalry has begun to subside, Celera recently received $21 million from the NIH—part of a two-year, $58 million grant package split between Celera and Baylor College of Medicine in Houston—to sequence the rat genome.


The Genomic Economy

It's clear now: Our genes are but a small portion of the nucleotide chains coiled within our cells. They are isolated signposts separated by miles of base pairs—transposons and other "dark matter"—that seem to lead nowhere. In a convergence of robotics, parallel computing power, algorithmic optimization and open source collaboration, a generation of software engineers will find employment in the computationally intensive field of genomics, trying to unravel the deepening mystery of why human cells contain so much apparently useless code. It should be a nice living—as long as they ice their wrists.