Dec. 5, 2019
An old saying holds that a picture is worth a thousand words. But to a computer analyzing DNA, it could soon be worth much more.
For decades, scientists have represented the genetic code of DNA and RNA as long strings of letters, sometimes billions of them. Reading this code poses challenges for both scientists and their computers, but it's vital for understanding everything from tumors to ecology to a night of imbibing.
"For example, we can say, 'That’s an alcohol dehydrogenase,' which is of interest to college students because that's what lets you drink without dying," said Travis Wheeler, an associate professor of computer science at the University of Montana.
Wheeler, who was talking about the series of letters that define a particular protein-coding gene, is the principal investigator on a project that recently received a $1.1 million research grant from the National Institutes of Health. It aims to label sequences of genetic code with a new tool: image-recognition software.
“If we view the DNA sequences as images ... then we can pretty easily bring to bear all the powerful techniques that are typically used in computer vision to this problem of analyzing DNA as well,” said Doug Brinkerhoff, an assistant professor in UM’s computer science department and a co-investigator on the project.
Brinkerhoff previously used software to analyze farmland and glaciers, “teaching computers to extract the interesting features from satellite images.” Meanwhile, Wheeler, one of his colleagues, has used computers to analyze the complex structures within our own cells.
DNA, RNA and protein molecules contain and carry the information that drives the processes of life, and scientists analyze them as long sequences of letters. For today's biologists, it's important to identify which parts of these sequences control specific functions. Wheeler said that when sequence-labeling software gets something wrong, “it can cause a cascade of erroneous decisions downstream.”
Recently, Jack Roddy, a graduate student in the department who did not reply to a request for comment, identified a way to keep those original errors from happening in the first place.
The computer was incorrectly identifying the end stretches of two aligned DNA sequences — edges that his trained eye could easily see.
“He was looking at the edges of these sequences’ alignments and saying, ‘I as a human can really easily see where the bounds of this should be,’” Brinkerhoff said.
So “if we as humans can see where the edges of alignment ought to be, then we could teach a computer to do the same thing,” he said.
Now, the researchers aim to apply image-recognition software to stretches of genetic code. The National Institutes of Health recently bought into the idea with the four-year, $1.1 million research grant. Wheeler, the project’s principal investigator, said that money will fund a software engineer and a few student researchers to help with methods development.
“I would be pretty happy with cutting that error rate by three-quarters,” he said. “We’re not going to get rid of all of it, but if we can get rid of three-quarters or 90% of that error rate, it dramatically reduces how much we're calling incorrectly.” They’ll also use the grant money to speed up the sequence-labeling process.
However much speed the UM team adds, Wheeler said the end results will be open to others for further experiments. “I'm government-funded, and I view my work, as a result, as belonging to the community.”