Computer-Assisted Serendipity

While I think we naturally conclude that the explosion of information in the medical research world is a good thing, it of course brings challenges. The problem is compounded when you consider the information both inside and outside an organization.

It’s always exciting to see advances in areas like big data and the semantic web applied to medical research. This interesting article on literature-based discovery work at Oak Ridge National Laboratory is worth a look. It frames the goal as supplementing or enhancing the human researcher, not replacing them, and sums the idea up simply as “computer-assisted serendipity.”

A side effect of this information explosion, however, is the fragmentation of knowledge. With thousands of new articles being published by medical journals every day, developments that could inform and add context to medicine’s global body of knowledge often go unnoticed.

Uncovering these overlooked gaps is the primary objective of literature-based discovery, a practice that seeks to connect existing knowledge. The advent of online databases and advanced search techniques has aided this pursuit, but existing methods still lean heavily on researchers’ intuition and chance discovery. Better tools could help uncover previously unrecognized relationships, such as the link between a gene and a disease, a drug and a side effect, or an individual’s environment and risk of developing cancer.


Mining Twitter Hashtags for Bad Drug Interactions

Over the years, it’s not been uncommon to get asked what value I see in Twitter. My typical answer revolves around the value I get from it personally: keeping up, observing trends, sharing items of value, and healthy stimulation from the seemingly random sharing of others. This article, “New Role for Twitter: Early Warning System for Bad Drug Interactions” from the University of Vermont, provides an example of something pretty compelling from the academic realm.

And the research team also aims to help overcome a long-standing problem in medical research: published studies are too often not linked to new scientific findings, because digital libraries “suffer infrequent tagging,” the scientists write, and updating keywords and metadata associated with studies is a laborious manual task, often delayed or incomplete.

“Mining Twitter hashtags can give us a link between emerging scientific evidence and PubMed,” the massive database run by the U.S. National Library of Medicine, Hamed said. Using their new algorithm, the Vermont team has created a website that will allow an investigator to explore the connections between search terms (say “albuterol”), existing scientific studies indexed in PubMed — and Twitter hashtags associated with the terms and studies.

Correlating the use of hashtags with real-world events, in this case drug interactions, can create a potential early warning system that feeds other, more traditional practices. This brings to mind related efforts like Google’s monitoring of flu trends, where public health institutions can also benefit, not just paid advertisers.
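To make the correlation idea a little more concrete, here is a minimal sketch of counting which hashtags show up alongside a drug name. This is not the Vermont team's algorithm, which the article does not spell out. The tweets are made up for illustration, and the function name `hashtags_for_term` is my own.

```python
import re
from collections import Counter

# Toy, made-up tweets for illustration only; this is not real data.
TWEETS = [
    "Started albuterol today, heart racing #palpitations #asthma",
    "albuterol plus coffee is a bad idea #palpitations",
    "New inhaler day #asthma",
    "Anyone else get shaky on Albuterol? #tremor #asthma",
]

HASHTAG = re.compile(r"#(\w+)")


def hashtags_for_term(term, tweets):
    """Count hashtags in tweets that mention `term` (case-insensitive)."""
    counts = Counter()
    needle = term.lower()
    for text in tweets:
        if needle in text.lower():
            counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


counts = hashtags_for_term("albuterol", TWEETS)
for tag, n in counts.most_common():
    print(f"#{tag}: {n}")
```

A real system would need far more than substring matching (drug synonyms, spam filtering, and some way to tell a side effect from the condition being treated), but even this toy count hints at how a hashtag could surface as a signal worth a closer look.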

I suppose a better answer to the value question should include the exciting thought of what innovation is to come.


Exabyte Scale of Genomics Data (and cat videos)

DNA, Image Source:

It’s no surprise that genomics represents a terrific big data challenge, but it is remarkable that its data has doubled every seven months over the last ten years, especially given how the field is poised to really explode in the coming years.

This article points out the comparison with astronomy and social media:

The authors estimate that the genomics information so far, from sequencing different organisms and a number of humans, has produced data on the petabyte scale (a petabyte is a million gigabytes). However, over the last decade, genomic sequencing data doubled about every seven months, and will grow at an even faster rate as personal genome sequencing becomes more widespread. The researchers estimate that by 2025, genomics data will explode to the exabyte scale – billions of gigabytes. This surpasses even YouTube, the current title holder among the domains studied for most data stored.
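For a feel of what doubling every seven months means, here is a quick back-of-the-envelope sketch. The seven-month figure comes from the article. Holding that rate constant is my simplifying assumption (the authors expect growth to speed up), and the function names are my own.

```python
import math

DOUBLING_MONTHS = 7  # doubling period reported in the article


def doublings_in(years, doubling_months=DOUBLING_MONTHS):
    """Number of doublings that fit into the given number of years."""
    return years * 12 / doubling_months


def growth_factor(years, doubling_months=DOUBLING_MONTHS):
    """Total multiplication of the data volume over the given years."""
    return 2 ** doublings_in(years, doubling_months)


def months_to_grow(factor, doubling_months=DOUBLING_MONTHS):
    """Months needed to grow by `factor` at a constant doubling period."""
    return doubling_months * math.log2(factor)


print(f"Doublings in 10 years: {doublings_in(10):.1f}")
print(f"Growth over 10 years: about {growth_factor(10):,.0f}x")
# Petabyte to exabyte is a factor of 1,000.
print(f"Months to grow 1,000x: {months_to_grow(1000):.0f}")
```

Ten years holds roughly 17 doublings, and at a constant seven-month pace the jump from petabyte to exabyte scale would take just under six years, so an exabyte-scale estimate for 2025 does not look far-fetched.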

Frankly, it is refreshing to see such a valuable area of study surpassing a repository of countless cat videos as a leading data management problem in our society.