Arnoud Glaudemans, Rienk Jonker and Frans Smit, Documents, Archives and Hyperhistorical
Societies: An Interview with Luciano Floridi
about data science being applied to obtain information from huge quantities
of data, no matter what the quality of the data is. Sentiment analysis of tweets is a
good example: analysing how people react to the news, for instance. When there is an
election, people tweet a lot of details, and then you have literally millions of messages to
analyse. Running a massive algorithmic analysis over such data is a really difficult
and slippery job. How can you take all the data and squeeze some
good information out of it? This question comes up all the time, especially when you
deal with huge databases which have not been curated. Another strategy, which you
see taking place in some corners - especially in medical research - is having access to
highly curated, high quality small datasets. A typical example here is Google working
with health organisations in England, with access to medical records that are way
more reliable, truthful and authentic. Here you do not need a million records, but
maybe a thousand, as long as they are very good. You must be able to trust them. So,
there are two strategies: take huge quantities of data, throw lots of statistics at them
and try to squeeze something good out of them, or take smaller, very highly curated
sets, and work very precisely on the sort of training of algorithms and useful
information you wish to obtain. That is where data science is now exercising
different levels of influence.
Now, when it comes to the archival world, you normally find highly curated
documents there. That is why the great companies of the world are so interested.
Archival material combines two important features: high quantity and high quality.
Remember that data science is about using the data - and in this case to train
algorithms on them - to get the kind of information you want. Once the training is
done you do not need the data anymore. For instance, the machine needs to see ten
thousand pictures of cats. Once the machine knows how to recognise a cat, the
pictures are not needed anymore. Likewise, when I have many radiographies of a
particular kind of cancer, the machine will learn to recognise that kind of cancer. Once
the training is complete, there is no longer any need for massive quantities of data. So, in that sense the
archival material is a training ground for data science, and it is very precious. All the
effort that has been put into providing high quality material is exploited to provide
good input for training the algorithms. The point here is that all the work that has
been put into it should be paid for.
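Floridi's point that the data become dispensable once training is complete can be sketched in a few lines of Python. The example below is a hypothetical toy stand-in for a real image model: a nearest-centroid classifier is fitted on a handful of labelled points, the training set is then deleted, and the fitted model still classifies new inputs.

```python
# Toy illustration of "once the training is done you do not need the data anymore".
# A minimal nearest-centroid "cat vs. not-cat" classifier over 2-D feature vectors;
# all names and numbers here are invented for illustration.

def train_centroids(labelled_points):
    """Compute one centroid per label from ((x, y), label) pairs."""
    sums, counts = {}, {}
    for (x, y), label in labelled_points:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def classify(point, centroids):
    """Return the label whose centroid is nearest to the point."""
    x, y = point
    return min(centroids,
               key=lambda lbl: (centroids[lbl][0] - x) ** 2
                             + (centroids[lbl][1] - y) ** 2)

training_data = [((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"),
                 ((5.0, 5.0), "not-cat"), ((4.8, 5.2), "not-cat")]
model = train_centroids(training_data)
del training_data                      # the "archive" is no longer needed
print(classify((1.1, 0.9), model))     # -> cat
```

The curated archive does all its work during training; afterwards, only the much smaller model (here, two centroids) is consulted, which is exactly why the up-front curation effort carries the value.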
EDITORS: A lot of those data are in the possession of governments. They are free in the
sense of open data. So, the government cannot ask for money for the data it is delivering.
FLORIDI: This is something I actually had a discussion about in the past. The
opening of national archives to free public use should be the norm. However, when
the free access to public archives generates income for companies, we might start
having a so-called freemium solution, where people pay increasingly more the more
they exploit a particular archive, up to the point where they pay
full price. Take, for example, the huge archive of an NGO that contains a massive
amount of agricultural data. There may be a discussion about whether to make it
public and freely available. Free to farmers and to the general public, maybe, but free of
charge to a private company as well? I do not think so, because of the value and potential benefit of
the data, and the cost that the community has borne to collect and to curate the
data. The materiality of the digital, as we discussed earlier, is expensive.
Another example is giving any individual free access to data about their ancestors, for
instance in the archives in the Netherlands. This is all classic and very popular, and it
is a beautiful thing. Now, if you start using these data for more than just
genealogical reasons, e.g. by combining them with the DNA database, or start
selling products, it is a different story. For who has the power to reorganise all those
data in a profitable way? Companies, yes, but not single individuals. It would
be naïve just to open everything and welcome anyone to take advantage of the data
resources made accessible. It is very expensive to collect and to curate all those data.
So some of the value should go back into the community. A private company should
pay an extra fee to use data from public archives. That funding could go back to the
archives, and more archival resources could be made available to the public. There is
an argument, at least here in the UK, in favour of opening databases and archival
material from the government for entrepreneurial use by start-ups, which could
have the opportunity to find ways of monetising the data. This is fine, but the data
will not be used only by start-up companies. This is why I think something like a
freemium model would be much preferable: free for individuals, more expensive for
companies, and the bigger the company the higher the fee may be.
EDITORS: This kind of regulation does not exist yet in the public sphere. It would be
very difficult to implement.
FLORIDI: Perhaps, but it is not unprecedented. Companies that, for instance, hold
financial data and sell them put only some bits of data online, free for you to
see. But if you want 'the real thing', then you have to pay. It is not a model that
everybody knows, and it is not in use with public databases, but it provides a good
example.
As to open data, remember that the open data movement started as a political
movement in terms of transparency of government. However, it soon became
something else, once it became coloured by financial, and no longer political,
interpretations. Initially, the open data discussion was about making the
government more transparent and hence more accountable: one may see where the
money goes, who does what, and who is responsible for what kind of program, for
example. From there, the goals slowly morphed into commercial (re-)use by
start-ups for innovation, and things ended up with potential exploitation by big
companies. What transparency is there in giving access to let's say records of
hospitals to a private company? It is not about transparency. It is not about a
start-up. It is about a company that is taking huge advantage of costly public records. I
am very much in favour of it, but I would add a price.
EDITORS: Could you elaborate on the term 'hyperhistory'? It might be that the
hyperhistorical condition results in more or different tasks and goals for the archival community.
FLORIDI: Hyperhistory is a neologism I introduced in a recent book called The
Fourth Revolution - How the Infosphere is Reshaping Human Reality. It is based on a
simple idea. Time has classically been divided into prehistory and history. Prehistory
refers to any stage of human development where there exists no means of recording
the present for future consumption; in particular, societies without writing.
Prehistory ended around 6000 years ago in Europe and China where -
archives in liquid times