About

This application is developed and maintained by David P. Shorthouse using specimen data from the Global Biodiversity Information Facility (GBIF) and pass-through authentication provided by ORCID. It was launched in August 2018 as a submission to the annual Ebbe Nielsen Challenge.

The approximately 100M specimen records included in this project come from specimens collected between 1950 and 2018 that have content in their recordedBy (collector) or identifiedBy (determiner) Darwin Core fields. Collector and determiner names are parsed and cleaned using the dwc_agent ruby gem, now available as a stand-alone utility. The similarity of people names is scored using a graph theory method outlined by R.D.M. Page and incorporated as a method in the dwc_agent gem. These scores help expand the search for users' candidate specimens, which are presented in order from most to least probable. If you have declared alternate names in your ORCID account, such as a maiden name, these are also used to search for candidate specimen records. Fully processing all 100M specimen records is a scalable, repeatable process that takes approximately two hours on a laptop with 16GB of RAM.
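
For example, a raw recordedBy string can be split into individual, normalized names with a few lines of Ruby. This is a minimal sketch; the example string is illustrative and the parse and clean calls follow the gem's documented interface:

    require "dwc_agent"

    raw = "Smith, J.R.; van der Berg, A."   # a typical raw recordedBy entry
    DwcAgent.parse(raw).each do |name|      # split the string into candidate names
      cleaned = DwcAgent.clean(name)        # normalize initials, particles, punctuation
      puts [cleaned.given, cleaned.family].compact.join(" ")
    end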

Rationale

Despite the worldwide importance of natural history collections, most are at risk because they are critically underfunded or undervalued. A significant contributing factor to this apparent neglect is the lack of a professional reward system that quantifies and illustrates the breadth and depth of expertise required to collect and identify specimens, maintain them, digitize their labels, mobilize the data, and enhance these data as errors and omissions are identified by stakeholders. See McDade et al. (2011), who describe the necessary elements of a professional reward system in museum science. If people throughout the full value chain in natural history collections received professional credit for their efforts, ideally recognized by their administrators and funding bodies, they would prioritize traditionally unrewarded tasks and could convincingly self-advocate. Proper methods of attribution at both the individual and institutional levels are essential.

The Global Biodiversity Information Facility (GBIF) has prioritized a focus on people. The very first planned item in its Implementation Plan for 2017-2021 and Annual Work Programme states, "1.a.i: Develop mechanisms to support and reflect the skills, expertise and experience of individual and organizational contributions to the GBIF network, including revision of identity management system and integration of ORCID identifiers". ORCID is the organization best positioned to capture and resolve the identities of people in academic and professional settings. However, a bottleneck that prevents the full execution of GBIF's plan is a legacy of intractable, text-based content shared by natural history museums that ambiguously records the people or organizations implicated in specimen data, none of which includes pre-determined links to ORCID identifiers. Typical content shared under the Darwin Core terms recordedBy (collector) and identifiedBy (determiner) is unstructured and variable. These fields may contain people names in varying or no particular order, be insensitive to cultural naming preferences, mix in full or abbreviated names of organizations, or carry other annotations, all of which make the extraction of people names difficult. The full solution requires multiple approaches. A progressive approach would be to associate ORCID identifiers with specimens as they are digitized. Another, retrospective approach is to engage users by giving them the freedom to declare the actions they took on previously digitized specimens.

Access to Data

Bloodhound data are exposed as CSV downloads or JSON-LD documents on public user profile pages. Individual occurrence records are exposed as JSON-LD documents.

Occurrence Records
https://bloodhound.shorthouse.net/occurrence/477976412.json

where the identifier in /occurrence/477976412 is the one assigned by GBIF, as in https://gbif.org/occurrence/477976412
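
Because these documents are plain JSON-LD served over HTTPS, they can be retrieved with any HTTP client. A minimal sketch using only the Ruby standard library:

    require "net/http"
    require "json"
    require "uri"

    uri = URI("https://bloodhound.shorthouse.net/occurrence/477976412.json")
    doc = JSON.parse(Net::HTTP.get(uri))   # fetch and parse the JSON-LD document
    puts doc["@context"]                   # assumes the document carries a JSON-LD @context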

Code

The MIT-licensed code is available on GitHub. Technologies at play include Apache Spark, used to group occurrence records by raw entries in recordedBy and identifiedBy and to import them into MySQL; Neo4j, to store the similarity scores between similarly structured people names; Elasticsearch, to aid in searching people names once they are parsed and cleaned; Redis, to coordinate the processing queues; and Sinatra/Ruby for the application layer.
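
As an illustration of how the application layer might serve the JSON-LD documents described above, a minimal, hypothetical Sinatra route follows. The in-memory hash stands in for the MySQL-backed store; this is a sketch, not the project's actual code:

    require "sinatra"
    require "json"

    # Illustrative stand-in for the MySQL-backed occurrence store
    OCCURRENCES = {
      "477976412" => { "gbifID" => 477976412, "recordedBy" => "Smith, J.R." }
    }

    get "/occurrence/:id.json" do
      record = OCCURRENCES[params[:id]]    # look up the GBIF occurrence identifier
      halt 404 if record.nil?
      content_type "application/ld+json"
      JSON.generate(record)
    end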