which may lead to an ever-expanding set of statements taking the form of a graph.
Therefore, it is not surprising that Linked Data are disseminated on the Web more
and more, and both archivists and records managers are slowly following this trend,
creating and distributing information in the form of Linked Data, thus changing
system designs and descriptive practices. However, the archival community has not
yet developed an ontology modelling and representing provenance, whereas the data
science community has already created its own ontology for representing entities
and relationships with respect to the origin and provenance: the PROV ontology
defines provenance as "information about entities, activities, and people involved in
producing a piece of data or thing, which can be used to form assessments about its
quality, reliability or trustworthiness." (World Wide Web Consortium, 2013). The
basic model of the PROV ontology is simple and recursive, allowing for great
complexity and expressiveness. Its core concepts are Entities, Agents, and Activities
(Figure 1).19 Call them Objects, Agents and Functions, and the picture in Figure 1
will make perfect sense in the archival domain.20
The PROV ontology focuses on lineage, that is, on data's origins and history, in order
to provide a tool that may help tracing the objects back to their creation, which is a
critical issue for the verification of the process used to generate the data.
Provenance, particularly in the archival domain, is a broader concept that serves to
identify and possibly document any input, entity, system and process that has
affected data.21 Therefore, while the PROV ontology may be a good starting point,
there remains much space for discussion on how to best model archival provenance.
What is quite clear though is that we are very likely to adopt the Linked Data
paradigm. Therefore, it is worth highlighting some pros and cons of Linked Data,
in order to understand the effects on provenance.
Linked Data: benefits
The benefits of Linked Data are quite evident: due to their characteristics (i.e.,
the compliance with the requirements established by Tim Berners-Lee) they are
inherently shareble, extensible and re-usable.22 Resources - be they physical or
conceptual - are identified by language-agnostic URIs and connected through
properties that can be identified by a URI as well. This leads to the creation of a
gigantic network: this mechanism allows anyone not only to add further
descriptions (i.e., statements) that increase the size and the density of the original
network of relationships; but also, to establish new connections among different,
separated networks. Once a connection is made, the two networks become a whole,
that is, an enlarged and enriched network of relationships, which eventually leads to
a Giant Global Graph, as Tim Berners-Lee called it (Berners-Lee, 2007). In addition,
semantic enrichment - that is, the process of adding layers of metadata to content -
allows computers to process data and make sense of it, so that they can find, filter,
and connect information.
Linked Data in the archival domain allow for the creation of new pathways not only
for exploring archival descriptions, but also for accessing the resources themselves.
In fact, on the one hand Linked Data makes it possible to explore the complex web of
relationships that make up and define the context in which objects are placed, and
that has a fundamental value in Archival Science. In this sense, we could say that
Linked Data increase the possibilities for exploring the metalevel (that is, the
descriptive dimension). On the other hand, the fragmentation of information
operated by Linked Data creates new and numerous points of direct access to
resources. In theory, Linked Data are an unlimited source of access points.23 This is
perfectly in line with the non-specialist research practices suggested by search
engines: more and more users get to the archival sources through Google and other
search engines, with an approach that favors the process of disintermediation
archives in liquid times
wasDerivedFrom
Entity
wasAttributedTo
Agent
used
actedOnBehalfOf
wasAssociatedWith
startedAtTime
endedAtTime
wasInformedBy
xsd:dateTime
xsd:dateTime
Activity
Figure 1. The basic classes and properties of the PROV ontology.
19 The PROV ontology defines the classes Entity, Activity and Agent, and link such classes through properties,
as shown in Figure 1. An Entity is a "physical, digital, conceptual, or other kind of thing with some fixed
aspects; [it] may be real or imaginary." An Activity is "something that occurs over a period of time and acts
upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or
generating entities." An Agent is "something that bears some form of responsibility for an activity taking
place, for the existence of an entity, or for another agent's activity." See World Wide Web Consortium,
PROV-O: The PROV Ontology, W3C Recommendation 30 April 2013, eds. Timothy Lebo, Satya Sahoo and
Deborah McGuinness, accessed October 6, 2017, https://www.w3.org/TR/2013/REC-prov-o-20130430/.
20 The diagram included in ISDF is not much different. See International Council on Archives, ISDF:
International Standard for Describing Functions (Paris: ICA, 2007), 36.
238
giovanni michetti provenance in the archives: the challenge of the digital
21 Some authors distinguish between different types of Provenance, such as Why-Provenance, How-
Provenance, Where-Provenance, and Workflow-Provenance. See for example James Cheney, Laura Chiticariu
and Wang-Chiew Tan, "Provenance in Databases: Why, How, and Where," Foundations and Trends in
Databases I (2009): 379-474; Peter Buneman, Sanjeev Khanna and Wang Chiew Tan, "Why and Where:
A Characterization of Data Provenance," in Proceedings of the 8th International Conference on Database
Theory, London, 2001, eds. Jan Van den Bussche and VictorVianu (London, UK: Springer, 2001), 316-330;
Susan B. Davidson and Juliana Freire, "Provenance and Scientific Workflows: Challenges and
Opportunities," in Proceedings of the SIGMOD International Conference on Management of Data, Vancouver,
2008, ed. Jason Tsong-Li Wang (New York, NY: ACM, 2008), 1345-13 50. According to this perspective,
lineage is a type of Provenance.
22 This is especially true for Linked Open Data (LOD), that is, "Linked Data which is released under an open
licence, which does not impede its reuse for free." Berners-Lee, Linked Data.
23 Even with an advanced descriptive model like EAD we must go through a tag like <controlaccess>
to identify the access points.
239