Lineage Issues for Scientific Data and Information

James Frew
Rajendra Bose

(position statement for Workshop on Data Derivation and Provenance, Chicago, IL, October 17-18, 2002)

We'd like to see the workshop address the following issues:

1. Standard lineage representations

A standard representation is needed so that data lineage can be communicated reliably across system, network, and organizational boundaries. A primary motivation is the reliable assembly of ad hoc distributed processing chains spanning multiple organizations or entities whose sole interconnections are public networks and protocols. Current data lineage standards (e.g., SDTS, the Spatial Data Transfer Standard) are inadequate: they offer only general guidance for including lineage metadata as unstructured text, from which a program cannot recover the lineage graph.

What is the right way to do this? A "lineage BLOB" that travels along with a data product, continuously appended like a log or audit trail? A sequence of permanent references to out-of-band (independently maintained) lineage? What infrastructure (e.g., persistent identifiers, formats) do these solutions require?
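To make the two alternatives concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a proposed standard: the names LineageStep, DataProduct, derive, lineage_graph, and externalize, and the "lineage:" identifier scheme, are all hypothetical. The first option embeds an append-only log in each product; the second replaces the log with a reference to lineage kept in a separate store.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class LineageStep:
    """One structured record: which process turned which inputs into which output."""
    process: str
    inputs: list
    output: str
    parameters: dict = field(default_factory=dict)


@dataclass
class DataProduct:
    product_id: str
    payload: bytes
    lineage_log: list = field(default_factory=list)  # option 1: embedded, append-only
    lineage_ref: Optional[str] = None                # option 2: out-of-band reference


def derive(product_id, payload, process, sources, parameters=None):
    """Create a product whose embedded log extends the logs of its sources."""
    log = []
    for src in sources:
        for step in src.lineage_log:
            if step not in log:  # avoid duplicating shared ancestry
                log.append(step)
    log.append(LineageStep(process=process,
                           inputs=[s.product_id for s in sources],
                           output=product_id,
                           parameters=parameters or {}))
    return DataProduct(product_id, payload, lineage_log=log)


def lineage_graph(product):
    """Recover the lineage graph as (input, output) edges, in log order."""
    return [(i, step.output) for step in product.lineage_log for i in step.inputs]


def externalize(product, store):
    """Move the embedded log into `store` (a stand-in for an independently
    maintained lineage service) and keep only a reference in the product."""
    ref = "lineage:" + product.product_id  # hypothetical identifier scheme
    store[ref] = json.dumps([asdict(s) for s in product.lineage_log])
    return DataProduct(product.product_id, product.payload, lineage_ref=ref)


raw = DataProduct("raw-1", b"\x00")
calibrated = derive("cal-1", b"\x01", "calibrate", [raw], {"gain": 2.0})
gridded = derive("grid-1", b"\x02", "regrid", [calibrated])
edges = lineage_graph(gridded)

store = {}
slim = externalize(gridded, store)
```

Because each record is structured, a program can rebuild the graph from the log alone, which unstructured text does not allow. The embedded log grows with every step, while the reference stays small but only works if the identifier remains resolvable, which is the infrastructure question raised above.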

2. Automated lineage acquisition

Regardless of how lineage information is communicated, lineage standards should require a level of structure that allows lineage to be created and interpreted automatically. Machines can't interpret unstructured lineage, and (experience shows) humans won't create it. Lineage acquisition should be automated (so that we take advantage of the things computers can track easily) and unobtrusive (so that the current work methods of scientists and researchers are not disrupted).

We have taken some small steps toward this goal with the "Earth System Science Workbench", a database system that automatically tracks science data processing performed in external environments by arbitrary (i.e., not necessarily database-aware) applications. How can such capabilities be integrated into the standard infrastructure of distributed computing environments?
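As a rough illustration of what "automated and unobtrusive" could mean in practice, the sketch below wraps an arbitrary external command and records a structured lineage entry without modifying the application itself. It is not a description of how the Earth System Science Workbench works; the function names, the record fields, and the choice of a JSON-lines log file are assumptions made for this example.

```python
import hashlib
import json
import subprocess
import time
from pathlib import Path


def file_digest(path):
    """Content hash used to identify exactly which file version was read or written."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_with_lineage(command, inputs, outputs, log_path):
    """Run an external command unchanged, then append one lineage record.

    `inputs` and `outputs` are file paths declared by the caller; the wrapped
    application does not need to know that it is being tracked.
    """
    started = time.time()
    # Hash inputs before the run so that later modification cannot go unnoticed.
    input_digests = {str(p): file_digest(p) for p in inputs}
    subprocess.run([str(c) for c in command], check=True)
    record = {
        "command": [str(c) for c in command],
        "started": started,
        "finished": time.time(),
        "inputs": input_digests,
        "outputs": {str(p): file_digest(p) for p in outputs},
    }
    # One JSON object per line: append-only and machine-readable.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

A scientist would invoke the usual program through the wrapper, for example run_with_lineage(["regrid", "in.dat", "out.dat"], ["in.dat"], ["out.dat"], "lineage.jsonl"), where "regrid" stands for any existing executable. The caller still has to declare inputs and outputs here; capturing them without any declaration is the kind of capability that would need support from the computing infrastructure.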

References

Bose, R., 2002. A Conceptual Framework for Composing and Managing Scientific Data Lineage. In: J. Kennedy (Editor), 14th International Conference on Scientific and Statistical Database Management. IEEE Computer Society, Edinburgh, Scotland, pp. 15-19.

Frew, J. and Bose, R., 2001. Earth System Science Workbench: A Data Management Infrastructure for Earth Science Products. In: L. Kerschberg and M. Kafatos (Editors), 13th International Conference on Scientific and Statistical Database Management. IEEE Computer Society, Fairfax, VA, pp. 180-189.