Taxonomic Data Sources

Taxonomic Data

Taxonomic data is central to the ability of countries to monitor access to genetic resources and benefit-sharing. The reason is simple: you cannot monitor something you have no data on.

Here are the principal taxonomic data resources. R users should note that they are very well served by the range of taxonomic API packages from rOpenSci, thanks to the work of Scott Chamberlain and collaborators. In this article I will focus on some of the major data sources, pointing to the APIs, the R packages and the Python or other versions. You will also want to take a look at Scott’s article on taxonomic data in R here.

If you are an R user then the logical starting point is the taxize package, as this provides access to many different taxonomic data sources (including almost all of those listed below). So… start there! If you will be using the data at scale then also check out taxizedb. If you are a Python fan you should try Scott’s pytaxize. Both taxize and pytaxize are written by the legendary Scott Chamberlain. Note that one advantage of the Python version is that it can be used in a web app more easily than the R version (where Shiny would be needed, and would be slower). However, bear in mind that pytaxize may be a little behind taxize.

While taxize will normally take you where you want to go, the individual packages may provide you with more specialised data or easier access for some purposes. So bear in mind that if taxize doesn’t meet your immediate needs… or seems complicated… try the dedicated package first. That is also often a good way to get your head around the basics of how things work before launching into taxize.

We will focus here on data sources that can be used to build up a picture of national level data.

The Global Biodiversity Information Facility (GBIF)

GBIF is the major source for global species occurrence (georeferenced) data and taxonomic data, and it aggregates the major data services mentioned below.

You can find an introductory tutorial on rgbif here. My tutorial on accessing GBIF with rgbif is here; it includes importing larger datasets and exploring issues with occurrence data. A second extended piece looks at mapping GBIF data with Leaflet, some of the issues you will run into and ideas on how to handle them. A taxize tutorial is here.
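If you want a quick national-level figure without going through rgbif, the GBIF API can also be queried directly. What follows is a minimal Python sketch rather than anything taken from the tutorials above. It assumes the public GBIF v1 occurrence search endpoint and its country (ISO two-letter code), taxonKey and limit parameters. The function names are hypothetical and chosen for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the public GBIF v1 API.
GBIF_API = "https://api.gbif.org/v1"


def occurrence_count_url(country_code, taxon_key=None):
    """Build an occurrence search URL that asks only for a record count.

    limit=0 requests no records, so the response carries just the
    paging metadata, including the total "count".
    """
    params = {"country": country_code.upper(), "limit": 0}
    if taxon_key is not None:
        # taxonKey is a GBIF backbone taxonomy identifier.
        params["taxonKey"] = taxon_key
    return f"{GBIF_API}/occurrence/search?{urlencode(params)}"


def occurrence_count(country_code, taxon_key=None, timeout=30):
    """Fetch the number of occurrence records for a country (needs network)."""
    url = occurrence_count_url(country_code, taxon_key)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["count"]
```

Calling occurrence_count("KE") would return the number of occurrence records GBIF holds for Kenya at that moment; passing a backbone taxon key narrows the count to one group. For downloading the records themselves, rgbif or the GBIF download service will be a better fit than paging through search results.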

The Catalogue of Life and the Integrated Taxonomic Information System (ITIS)

You can access COL and ITIS through taxize.

Species Names

Species names are messy and it is important to capture name variants wherever possible. The main source for species names and their variants at scale form part of the Global Names Architecture project. These include:

The Global Names Index. Accessible through taxize
GNI API. Accessible through taxize
The Global Names Resolver. Accessible through taxize
The Global Names Architecture provides a suite of packages for species name identification in texts. Particular attention is drawn to gnfinder, written in Go. This uses a combination of dictionary-based approaches and machine learning to identify species names in text files. The author, Dmitry Mozzherin, has been able to identify species in 50 million pages on a 40 core machine in 3 hours. The Go packages for Linux, Mac and Windows are basically brand new and the instructions are a bit thin at the moment, but that will improve with time as Dmitry irons out any bugs and gets to the user side of this.
Biodiversity Heritage Library name finder

Marine Species

Marine species data is served through GBIF. The following are primary sources for GBIF’s coverage of marine species.

WoRMS

World Register of Marine Species. Accessible through taxize.
worms R package
Worms Python example
OBIS
OBIS Python

OBIS

For more specialised data you may also want to look at:

WoRDSS: The World Register of Deep-Sea Species
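As an illustration of going to a marine source directly, here is a minimal Python sketch of a name lookup against WoRMS. It assumes the WoRMS REST service at www.marinespecies.org/rest and its AphiaRecordsByName endpoint with the like and marine_only query parameters. The function names are hypothetical, and the handling of an empty response is my own assumption about how the service signals "no match".

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: base URL of the WoRMS REST service.
WORMS_REST = "https://www.marinespecies.org/rest"


def aphia_records_url(name, like=True, marine_only=True):
    """Build the lookup URL for a scientific name.

    like=True asks for names that start with the given string,
    marine_only=True restricts the results to marine taxa.
    """
    query = urlencode({
        "like": str(like).lower(),
        "marine_only": str(marine_only).lower(),
    })
    return f"{WORMS_REST}/AphiaRecordsByName/{quote(name)}?{query}"


def aphia_records(name, timeout=30, **kwargs):
    """Fetch matching records as a list of dicts (needs network)."""
    with urlopen(aphia_records_url(name, **kwargs), timeout=timeout) as resp:
        body = resp.read()
    # Assumption: an empty body means nothing matched.
    return json.loads(body) if body else []
```

Each record returned is a dictionary describing one name; comparing the submitted name against the accepted name in the record is one way to pick up synonyms and name variants for marine species.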

Plant Species

The Plant List. Accessible through taxize

Viruses

GBIF does not really deal with viruses, except by accident. For data on viruses you may wish to try the International Committee on Taxonomy of Viruses (ICTV), which provides an annual list in an Excel spreadsheet that can be accessed here.

DNA Sequence and DNA Barcode Data

Note that I have not yet dug into the data available from Bioconductor.

For Barcode of Life Data

BOLD website
BOLD APIs
BOLD R Package
pybold Python package. Note last release, 2016.
bold_retriever Python package. Note last release, 2014.
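Since the Python packages listed above have not been released for some time, it may be simpler to call the BOLD API directly. The sketch below is a minimal illustration that assumes the BOLD v4 public API (the API_Public/specimen endpoint with taxon, geo and format parameters). That endpoint may have changed since this was written, so check the BOLD API documentation before relying on it. The function names are hypothetical.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the BOLD v4 public API; verify before use.
BOLD_API = "http://v4.boldsystems.org/index.php/API_Public"


def specimen_url(taxon, geo=None):
    """Build a specimen query URL that asks for tab-separated output."""
    params = {"taxon": taxon, "format": "tsv"}
    if geo:
        # geo takes a country or region name, for national-level queries.
        params["geo"] = geo
    return f"{BOLD_API}/specimen?{urlencode(params)}"


def parse_tsv(text):
    """Turn tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_specimens(taxon, geo=None, timeout=60):
    """Download and parse specimen records (needs network)."""
    with urlopen(specimen_url(taxon, geo), timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return parse_tsv(text)
```

Keeping the URL builder and the parser separate from the download step makes it easy to check the query in a browser first, or to parse a file that was downloaded by hand.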