(Posted Thu. Aug 29th, 2013)

Aug. 29: As the Maize Genetics and Genomics Database project moves forward, Off the Cob caught up with MaizeGDB Curator Dr. Jack Gardiner for an update on the progress made over the last quarter. In the interview, Gardiner explained how the database, which is based in Ames, Iowa, and supported by the U.S. Department of Agriculture’s Agricultural Research Service, has been exploring ways to incorporate text mining functionality as it develops.


Recently, Gardiner attended a meeting that will help MaizeGDB become an early adopter, shaping this new technology to best benefit the scientists and corn breeders who will use it.


“I just got back from a workshop on text mining at the Information Sciences Institute at the University of Southern California,” Gardiner said. “The Information Sciences Institute at USC is a group of about 350 computer scientists dedicated to information processing and communications technology.  At MaizeGDB, we are looking at text mining as a way to capture information buried deep in the literature.  At this point, text mining software tools are not a mature technology, but we know that at some point in the near future they will be. We at MaizeGDB want to be the first in line to use this technology.”


As many outside the industry may not be familiar with text mining, he went on to explain what precisely this technology does and how it relates to the mission of MaizeGDB.


“In a nutshell, text mining is a way to extract valuable information that is buried within written documents without actually having the paper be read by a pair of human eyes,” Gardiner explained. “The text undergoing text mining in our case is a scientific paper on maize, but really it could be anything: a book, a magazine or even a mail order catalog.  From the perspective of MaizeGDB, we want to capture information on corn research.  We know that there are about 7,000 to 9,000 papers published on maize every year, and we also know that there is just no way that we can read them all. Even if we could, we would make mistakes and miss things in the process.  We also know that the number of papers published is growing every year; we are going to have to address this problem at some point.  Equipped with the right software, there are just things that computers can do better than a pair of human eyes.”


Gardiner explained that, while this is a challenging problem, it is one which the scientific community must solve to best utilize the growing pool of data from which it has to draw.


“It is a tough problem but there are a lot of smart people working on it at the Information Sciences Institute and elsewhere,” he said. “The rewards are huge and, frankly, it is a research problem that has to be solved sooner or later.  I have mentioned big data in the past, but the type of data gleaned from text mining is a bit different.  While smaller in number, information retrieved from text mining is just as valuable.  Data retrieved from text mining tends to be more about biological processes such as drought tolerance, yield or disease resistance.  In other words, text mining often reveals information about the genes that underlie the plant trait or phenotype.”

Gardiner then addressed how MaizeGDB can become part of that solution and the assets this program brings to the table.


“At this meeting at the Information Sciences Institute, attendees roughly fell into two different areas of expertise: biological curators, like me, or computer scientists.  Biological database curators are charged with recruiting data into the database and making it useful for the end users, scientists and corn breeders who work in the laboratory and corn field respectively.  Computer scientists are primarily concerned with writing software programs and, in this case, work in the field of natural language processing, which combines computer science, artificial intelligence, and linguistics.  Essentially their task is to write programs that allow computers to extract meaning from human language or to essentially teach computers how to read and understand what they have read.”


By bringing these groups together, the meeting created a shared understanding that will allow curators and computer scientists to collaborate effectively.


“The whole purpose is to create synergy,” Gardiner said. “I don’t know a lot about writing software, and they don’t know a lot about the maize literature.  Both groups need each other if effective tools for text mining of the biological literature are going to be developed. When MaizeGDB participates in activities like these, not only are we helping to develop tools for text mining, we are positioning the MaizeGDB database to be an early adopter of these tools. This will give maize geneticists and breeders the tools to effectively keep abreast of the thousands of maize papers that are published each year.” 


Gardiner concluded by noting that this software will also be useful for state corn checkoffs, as it will help them avoid funding duplicate research.


“Any time you can tap into a large body of literature you have a way to avoid duplication of research,” he noted. “There is just so much corn research that has been done or is being done. I think text mining could be a useful tool to help corn checkoffs to avoid investing in research that has already been done.  In these economic times, everyone is concerned about spending their limited dollars wisely and getting the most bang for their buck.  Besides, who wants to reinvent the wheel?”