The ContentMine

Big Data in Medicine: Exemplars and Opportunities in Data Science

Peter Murray-Rust, Richard Smith-Unna, Stephanie Unna, Mark MacGillivray, Jenny Molloy, Ross Mounce, Graham Steel

Abstract

The ContentMine project (contentmine.org) is designed to crawl, scrape, normalize and mine the scientific literature, extracting hundreds of millions of facts annually. ContentMining has been legal in the UK for non-commercial research since 2014-06, and we are now mining the daily output of scientific publications (>2000 articles per day). Mining is done through plugins; currently we can extract biosequences, species, chemistry, phylogenetic trees, genes, identifiers and bibliographic metadata. These are linked to biodatabases, crystallography, Wikidata, PubChem and other references. The output serves as a semantic index for rapid semantic search of the scientific literature, which traditional search engines cannot provide.
We are working with collaborators to apply this technology to the clinical trials literature and public health reports, both providing daily feeds of information and assisting in the compilation of meta-analyses and systematic reviews.


We see this as a machine-human symbiosis and are looking to collaborate with domain experts in medicine and in clinical or pre-clinical research. All systems and content are fully Open (Apache 2.0, CC BY or CC0, as appropriate).
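
As an illustration of the plugin-based mining described above, the following is a minimal sketch in Python. It is not the ContentMine codebase or its API: the registry `PLUGINS`, the `plugin` decorator, the `mine` function and both regular expressions are hypothetical and heavily simplified, chosen only to show how independent extractors could each contribute one type of fact from the same text.

```python
import re
from typing import Callable, Dict, List

# Hypothetical plugin registry (an assumption for illustration):
# maps a fact type to a function that returns matching strings.
PLUGINS: Dict[str, Callable[[str], List[str]]] = {}


def plugin(name: str):
    """Register an extractor function under the given fact type."""
    def register(func: Callable[[str], List[str]]):
        PLUGINS[name] = func
        return func
    return register


@plugin("sequence")
def find_sequences(text: str) -> List[str]:
    # Toy rule: any run of 12 or more DNA letters counts as a biosequence.
    return re.findall(r"\b[ACGT]{12,}\b", text)


@plugin("identifier")
def find_dois(text: str) -> List[str]:
    # Simplified DOI pattern; real DOIs need more careful handling.
    return re.findall(r"\b10\.\d{4,9}/[^\s\"<>]+", text)


def mine(text: str) -> Dict[str, List[str]]:
    """Run every registered plugin over the text and collect its facts."""
    return {name: func(text) for name, func in PLUGINS.items()}


sample = "The primer ACGTACGTACGTAA was reported in doi 10.1234/abcd.5678 last year."
facts = mine(sample)
print(facts)
```

In this sketch each new fact type (species, chemistry, genes and so on) would be added as one more registered function, leaving `mine` unchanged; the extracted strings could then be linked to external references in a later step.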