Scientific Literature

There are three main choices when attempting ABS monitoring of the scientific literature:

  1. Commercial databases (such as Web of Science or Scopus)
  2. Free databases (such as Crossref)
  3. Combinations of the above

Below we list packages in Python and R for accessing the scientific literature. If you know of packages in other languages (such as Ruby or JavaScript), please raise an issue on GitHub. The list is not meant to be comprehensive, but for ABS monitoring, tools such as Crossref should be sufficient as a starting point.

Commercial Databases

Commercial databases such as Web of Science from Clarivate Analytics or Scopus are widely used in universities. If you are seeking to build up a collection, one option is to retrieve DOIs for a subject from the commercial provider and then retrieve open access metadata from a service such as Crossref.
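
The first half of that workflow usually ends with a column of DOIs exported from the commercial provider, often in mixed formats. The following is a minimal sketch, assuming the DOIs arrive as plain strings, of how they might be cleaned and deduplicated before being sent to an open metadata service. The function names and the regular expression are illustrative assumptions, not part of any provider's tooling.

```python
import re

# Loose pattern for the DOI core: "10.", a registrant code, "/", then a suffix.
# This is a pragmatic assumption, not a full implementation of the DOI syntax.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+", re.IGNORECASE)


def normalize_doi(raw):
    """Return a lowercase bare DOI, or None if no DOI is found in the string."""
    match = DOI_PATTERN.search(raw.strip())
    if match is None:
        return None
    # Drop trailing punctuation that often clings to DOIs in exported text.
    return match.group(0).rstrip(".,;").lower()


def unique_dois(raw_values):
    """Normalize DOIs and remove duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for raw in raw_values:
        doi = normalize_doi(raw)
        if doi is not None and doi not in seen:
            seen.add(doi)
            result.append(doi)
    return result
```

DOIs are case-insensitive, which is why the sketch lowercases them before comparing.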


The R community is well served by a range of packages from rOpenSci. These include fulltext for searching across multiple data sources, utilities such as pdftools or extractr for working with PDFs, and tabulizer for extracting tables from PDFs. Equivalent tools may be available in Python or other languages and will be documented here as they become available. Please raise an issue on GitHub if you have suggestions.

Open Access Databases


The main open access database for scientific literature is Crossref, which contains metadata on over 96 million publications. Use cases include looking up researchers by author name and affiliation or, where a list of DOIs is available, sending the DOIs to retrieve the metadata. Note that search capacity in Crossref is limited at present and multi-phrase searching may produce unexpected results.

  1. Crossref
  2. Crossref API
  3. Crossref Python
  4. Crossref R package
  5. Crossref Ruby gem
  6. Crossref JavaScript
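
The two use cases mentioned above (lookup by author and affiliation, and lookup by DOI) can be sketched against the Crossref REST API using only the Python standard library. This is a hedged illustration rather than a replacement for the client packages listed: the helper names are hypothetical, and the `query.author` and `query.affiliation` field queries and the `mailto` contact in the User-Agent follow the public REST API documentation as we understand it.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_API = "https://api.crossref.org/works"


def doi_url(doi):
    """Build the REST API URL for the metadata record of one DOI."""
    return CROSSREF_API + "/" + urllib.parse.quote(doi, safe="/")


def author_search_url(author, affiliation=None, rows=20):
    """Build a field query on author name and, optionally, affiliation."""
    params = {"query.author": author, "rows": rows}
    if affiliation:
        params["query.affiliation"] = affiliation
    return CROSSREF_API + "?" + urllib.parse.urlencode(params)


def fetch_json(url, contact_email=None):
    """Fetch a URL and decode the JSON body (performs a network request)."""
    headers = {}
    if contact_email:
        # Identifying yourself is the documented etiquette for API users.
        headers["User-Agent"] = "abs-monitoring-sketch (mailto:%s)" % contact_email
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarize_work(message):
    """Reduce one Crossref work record to DOI, first title and author names."""
    titles = message.get("title") or [""]
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    return {"doi": message.get("DOI"), "title": titles[0], "authors": authors}
```

A single record would then be retrieved with something like `summarize_work(fetch_json(doi_url(doi))["message"])`, since the API wraps each result in a `message` field.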

A text mining client for Crossref is also available in R as crminer.

CORE provides full text search and open access to over 133 million publications. You need to register for an API key.

  1. CORE API documentation
  2. rcoreoa Package
  3. Python pyoacore


PubMed contains publications related to medicine.

  1. PubMed API
  2. rentrez R Package
  3. easyPubMed R Package
  4. Python Biopython package and see this gist
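
For readers who prefer not to install a client package, the PubMed API (NCBI E-utilities) can also be queried directly. The following is a minimal sketch, assuming the `esearch` endpoint with JSON output, that returns the PubMed identifiers matching a search term. The function names are illustrative, and heavier use should follow NCBI's usage guidelines.

```python
import json
import urllib.parse
import urllib.request

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=20):
    """Build an esearch URL for the pubmed database with JSON output."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urllib.parse.urlencode(params)


def parse_pmids(payload):
    """Extract the list of PubMed IDs from a decoded esearch response."""
    return payload.get("esearchresult", {}).get("idlist", [])


def search_pubmed(term, retmax=20):
    """Run the search and return matching PubMed IDs (network access)."""
    with urllib.request.urlopen(esearch_url(term, retmax), timeout=30) as response:
        return parse_pmids(json.load(response))
```

The identifiers returned can then be passed to the other E-utilities endpoints, or to the packages listed above, to retrieve full records.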

arXiv databases

There is a range of open access preprint databases that have become increasingly popular:

  1. arXiv
  2. SocArXiv
  3. bioRxiv
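
Of the three, arXiv offers a simple public query API that returns results as an Atom feed. The sketch below, which covers arXiv only, builds a query URL and extracts identifiers and titles from the feed using the Python standard library. The function names are hypothetical, and the `all:` field prefix is just one of the search fields the API accepts.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by Atom feeds


def arxiv_query_url(terms, max_results=10):
    """Build a query URL that searches all fields for the given terms."""
    params = {"search_query": "all:" + terms, "start": 0, "max_results": max_results}
    return ARXIV_API + "?" + urllib.parse.urlencode(params)


def parse_entries(atom_xml):
    """Return the id and whitespace-normalized title of each feed entry."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        entries.append({
            "id": entry.findtext(ATOM + "id", default="").strip(),
            "title": " ".join(entry.findtext(ATOM + "title", default="").split()),
        })
    return entries


def search_arxiv(terms, max_results=10):
    """Query the API and parse the returned feed (network access)."""
    with urllib.request.urlopen(arxiv_query_url(terms, max_results), timeout=30) as response:
        return parse_entries(response.read())
```

Titles in the feed are often wrapped across lines, so the sketch collapses runs of whitespace into single spaces.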