2011 Abstract

From ETaxonomy


Building specimen-data curation pipelines using Kepler workflow technology in a Filtered Push network


Dou, Lei (1), Hanken, James (2), Ludaescher, Bertram (1), Macklin, James A. (3), McPhillips, Timothy M. (1), Morris, Paul J. (2,4), Morris, Robert A. (4), Wang, Zhimin (4)

(1) UC Davis Genome Center, University of California, 451 Health Sciences Drive, Davis, CA 95616, USA
(2) Museum of Comparative Zoology, Harvard University, 26 Oxford Street, Cambridge, MA 02138, USA
(3) Agriculture and Agri-Food Canada, Wm. Saunders Building, Central Experimental Farm, Ottawa, Ontario K1A 0C6, Canada
(4) Harvard University Herbaria, 22 Divinity Avenue, Cambridge, MA 02138, USA


The Filtered Push (FP) project aims to improve the quality of natural science collections data by directing assertions about data quality from consumers back to the curators of the original distributed datasets and to other interested parties. We demonstrate how data curation processes in an FP network can be automated and simplified by building curation pipelines with actors from the curation package of the Kepler workflow system. Our curation workflow imports a specimen dataset to be cleaned from a spreadsheet, a database, or an FP network. The workflow actors integrate diverse services and tools that support different aspects of data curation. Visualization services (e.g., Google Maps) show data distribution patterns that make quality problems in the input dataset easy to spot, including latitude/longitude transpositions, use of non-standard taxon names, and misspellings of collectors’ names. Curation operations are then applied to correct these problems; e.g., taxon names are normalized through name authority services such as IPNI. Related records, e.g., those for duplicate herbarium specimens, are automatically identified and curated through data clustering and fusion. The fused records, together with the original records, are imported into a Google spreadsheet. The appropriate curator is directed by email to this spreadsheet, where curation results can be confirmed or further edited. Finally, the workflow compiles all relevant information into a synthesis of proposed changes, remaining problems, and expectations about the actions to be taken on the original dataset. Interested parties can assess the credibility of the assembled curated data by examining the data lineage in a provenance browser, which can display the curatorial changes made by each curator or software agent. All operations performed in the curation pipeline are depicted as a graph, which can be intuitively traversed and queried in the provenance browser.
The curation package is available to developers now; user release is scheduled for April 2011.
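
Two of the curation steps described above, detecting latitude/longitude transpositions and clustering and fusing duplicate specimen records, can be illustrated with a minimal Python sketch. This is not the Kepler curation package or any FP network API: the function names, the record fields (collector, number, date, taxon), the range check used in place of map-based visual inspection, and the majority-vote fusion rule are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def looks_transposed(lat, lon):
    """Return True when latitude is out of range but the swapped pair is valid.

    Assumption: a simple range check stands in for the visual inspection
    on a map that the abstract describes.
    """
    return abs(lat) > 90 and abs(lat) <= 180 and abs(lon) <= 90


def duplicate_key(record):
    """Hypothetical matching key: normalized collector, collector number, date."""
    return (
        record["collector"].strip().lower(),
        record["number"].strip(),
        record["date"],
    )


def cluster_duplicates(records):
    """Group records that share a key; keep only groups with more than one record."""
    clusters = defaultdict(list)
    for record in records:
        clusters[duplicate_key(record)].append(record)
    return [group for group in clusters.values() if len(group) > 1]


def fuse(cluster):
    """Fuse a cluster by taking the most common non-empty value of each field.

    Assumption: majority vote is one plausible fusion rule, not necessarily
    the one used in the workflow.
    """
    fields = {name for record in cluster for name in record}
    fused = {}
    for name in fields:
        values = [r[name] for r in cluster if r.get(name) not in (None, "")]
        fused[name] = Counter(values).most_common(1)[0][0] if values else None
    return fused


if __name__ == "__main__":
    # Illustrative records only; not data from the project.
    records = [
        {"collector": "A. Gray", "number": "123", "date": "1850-06-01", "taxon": "Quercus alba"},
        {"collector": "a. gray ", "number": "123", "date": "1850-06-01", "taxon": "Quercus alba"},
        {"collector": "A. Gray", "number": "123", "date": "1850-06-01", "taxon": "Quercus abla"},
        {"collector": "J. Torrey", "number": "7", "date": "1841-05-12", "taxon": "Acer rubrum"},
    ]
    for group in cluster_duplicates(records):
        print(fuse(group)["taxon"])
    print(looks_transposed(-122.3, 47.6))
```

In the workflow described above, a fused record of this kind is a proposal rather than a final correction: it is placed next to the original records so that the curator can confirm or edit it.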
