25 January 2019 Categories: Communication, Featured, Science

An Interactive Lecture Presented by Dr. Peter Murray-Rust,
Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, UK

Monday, April 23, 4:00 pm
Virginia Dale Room/University Club
Lory Student Center

Presentation Abstract:

Digital repository tools such as DSpace and Fedora are now being widely promoted for the capture and preservation of primary academic research output and many institutions and funders are starting to mandate such processes. In principle this effort can be extended to the data on which scientific research rests and has the potential of generating a huge resource for data-driven practice. In Cambridge, we have started to explore this, and I shall report – with interactive demonstrations – on what is currently possible.

However, data raise many problems that do not arise for research articles (usually in PDF).
These include:

  • Ownership of data. Although in principle data cannot be copyrighted, in practice many publishers require scientists to hand over their data, which the publishers can then resell. Even without this there are strong cultural issues about who owns the data – the funder, the institution, the research group, the scientist, etc.
  • Syntactic problems. PDF is a disaster – it destroys data. The technical answer is simple – use XML, but the culture must change.
  • Semantics and ontology. Digital data is only useful if we know what it means and how it can be used. This requires a community-wide effort, ideally led by learned societies.
  • Metadata. All metadata must be populated by machines. This should be possible for rights, formats, and provenance. For discovery and indexing, the only realistic approach is automatic indexing of free text (neither scientists nor librarians have the time or knowledge to do this by hand).
  • Repositories. The relative success in repositing PDFs suggests that it should be straightforward to do the same for scientific data. It is anything but easy. We need schemas, use cases, and much software.

But the major problem is getting it to happen. We believe that the best place to start is with theses. Here the institution is (or should be) in control of what is done, including the requirement to reposit semantic theses (i.e. additional to any PDF). We are starting the JISC-funded SPECTRa-T project to develop protocols and tools.

When data are properly reposited enormous new opportunities in data-driven science arise. We have developed a protocol for the automatic semantic capture of crystallography, computational chemistry and spectroscopy. All tools are Open and we are encouraging anyone to take them and adapt them for their institutions. In this way we can use social computing to tackle the problems of scale and maintenance. We have also managed to capture all currently published crystallography from those journals which allow us to do so. This creates a large knowledgebase for molecules which – if widely adopted – will replace much of the broken commercial aggregation of chemical data by out-of-date secondary abstracting services.

But academia – disciplines, libraries and computational science – and funders have so far been apathetic to these problems and opportunities. I shall show what is possible and give demonstrations of what can be done with scientific data in repositories.