Taxonomic data is central to the ability of countries to monitor access to genetic resources and benefit-sharing. The reason is simple: you cannot monitor resources for which you have no data.
Here are the principal taxonomic data resources. R users are very well served by the range of taxonomic API packages from ROpenSci, thanks to the work of Scott Chamberlain and collaborators. In this article I will focus on some of the major data sources by pointing to the APIs, the R packages and the Python or other versions. You will also want to take a look at Scott’s article on taxonomic data in R here.
If you are an R user then the logical starting point is the taxize package, as this provides access to many different taxonomic data sources (including almost all of those listed below). So… start there! If you will be using the data at scale then also check out taxizedb. If you are a Python fan you should try Scott’s pytaxize. Both taxize and pytaxize are written by the legendary Scott Chamberlain. Note that one advantage of the Python version is that it can be used in a web app more easily than the R version (where Shiny would be needed, and would be slower). However, bear in mind that pytaxize may be a little behind taxize.
While taxize will normally take you where you want to go, the individual packages may provide you with more specialised data or easier access for some purposes. So bear in mind that if taxize doesn’t meet your immediate needs… or seems complicated… try the dedicated package first. That is also often a good way to get your head around the basics of how things work before launching into taxize.
We will focus here on data sources that can be used to build up a picture of national level data.
GBIF is the major source for global species occurrence (georeferenced) data and taxonomic data, and it aggregates the major data services mentioned below.
You can find an introductory tutorial on rgbif here. My tutorial on accessing GBIF with rgbif is here and includes importing larger datasets and exploring issues with occurrence data. A second extended piece looks at mapping GBIF data with Leaflet, some of the issues you will run into and ideas on how to handle them. A taxize tutorial is here.
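If you would rather see what sits underneath packages like rgbif, the sketch below builds request URLs for the public GBIF API directly, which is one way to start assembling national level data. It is a minimal illustration rather than a replacement for the packages: the function names are my own, the country filter assumes a two-letter ISO 3166-1 code, and `fetch_json` needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the public GBIF REST API (version 1).
GBIF_API = "https://api.gbif.org/v1"


def occurrence_search_url(country, taxon_key=None, limit=20, offset=0):
    """Build an occurrence search URL for one country (ISO 3166-1 alpha-2 code).

    taxon_key is an optional GBIF backbone key used to restrict the taxon.
    """
    params = {"country": country.upper(), "limit": limit, "offset": offset}
    if taxon_key is not None:
        params["taxonKey"] = taxon_key
    return f"{GBIF_API}/occurrence/search?{urlencode(params)}"


def species_match_url(name):
    """Build a URL that asks GBIF to match a scientific name to its backbone."""
    return f"{GBIF_API}/species/match?{urlencode({'name': name})}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the JSON body (hypothetical helper, needs network)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the URLs only; pass one to fetch_json() to retrieve actual records.
    print(occurrence_search_url("KE", limit=5))
    print(species_match_url("Panthera leo"))
```

Passing the first URL to `fetch_json` should return a page of records for Kenya, and increasing `offset` by `limit` steps through further pages. For anything beyond a quick look, the tutorials above and GBIF's own download service are the better route.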
You can access COL and ITIS through taxize.
Species names are messy and it is important to capture name variants wherever possible. The main sources for species names and their variants at scale form part of the Global Names Architecture project. These include:
Marine species data is served through GBIF. The following are the primary sources for GBIF's coverage of marine species.
WoRMS
OBIS
For more specialised data you may also want to look at:
GBIF does not really deal with viruses, except by accident. For data on viruses you may wish to try the International Committee on Taxonomy of Viruses (ICTV), which provides an annual list as an Excel spreadsheet that can be accessed here.
Note that I have not yet dug into the data available from Bioconductor.
For Barcode of Life Data