Arnoud Glaudemans, Rienk Jonker and Frans Smit

Documents, Archives and Hyperhistorical Societies: An Interview with Luciano Floridi

FLORIDI: [...] about data science being applied and used to obtain information from huge quantities of data, no matter what the quality of the data is. Sentiment analysis of tweets is a good example: analysing how people react to the news, for instance. When there is an election, people tweet a lot. You then have literally millions of messages to analyse. Massive algorithmic analysis of such data is a really difficult and slippery job for data science. How can you take all the data and squeeze some good information out of them? This question comes up all the time, especially when you deal with huge databases which have not been curated.

Another strategy, which you see taking place in some corners - especially in medical research - is having access to highly curated, high-quality small datasets. A typical example here is Google working with health organisations in England, with access to medical records that are far more reliable, truthful and authentic. Here you do not need a million records, but maybe a thousand, as long as they are very good. You must be able to trust them. So, there are two strategies: take huge quantities of data, throw lots of statistics at them and try to squeeze something good out of them; or take smaller, very highly curated sets, and work very precisely on the training of algorithms and the useful information you wish to obtain. That is where data science is now exercising different levels of influence.

Now, when it comes to the archival world, you normally find highly curated documents there. That is why the great companies of the world are so interested. Archival material combines two important features: high quantity and high quality. Remember that data science is about using the data - in this case, to train algorithms on them - to get the kind of information you want. Once the training is done, you do not need the data anymore. For instance, the machine needs to see ten thousand pictures of cats; once the machine knows how to recognise a cat, the pictures are not needed anymore. Likewise, if I have many radiographs of a particular kind of cancer, the machine will learn to recognise that cancer. Once the training is complete, there is no longer a need for massive quantities of data. So, in that sense, archival material is a training ground for data science, and it is very precious. All the effort that has been put into providing high-quality material is exploited to provide good input for training the algorithms. The point here is that all the work that has been put into it should be paid for.
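[A minimal sketch of the train-then-discard pattern described above - an illustration added here, not part of the interview. It assumes Python with scikit-learn; the bundled digits dataset stands in for a curated archive of cat pictures or radiographs.]

```python
# Minimal sketch: the curated data are needed once, at training time.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                   # the "curated archive"
X_train, X_new, y_train, y_new = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)       # training is the only step that reads the archive

del X_train, y_train              # the archive can now be discarded: what was
                                  # learnt lives entirely in the model parameters

print(model.score(X_new, y_new))  # the trained model classifies unseen material
```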
EDITORS: A lot of those data are in the possession of governments. They are free in the sense of open data, so a government cannot ask money for the data it delivers.

FLORIDI: This is something I have actually discussed in the past. The opening of national archives to free public use should be the norm. However, when free access to public archives generates income for companies, we might start having a so-called freemium solution, where users pay progressively more according to how intensively they actually exploit a particular archive, up to the point where they pay full price. Take, for example, the huge archive of an NGO that contains a massive amount of agricultural data. There may be a discussion about whether to make it public and freely available. Maybe to farmers and to the general public - but free of charge to a private company? I do not think so, because of the value and potential benefit of the data, and the cost that the community has borne to collect and curate them. The materiality of the digital, as we discussed earlier, is expensive.

Another example is free access, for any individual, to data about their ancestors in, for instance, the archives in the Netherlands. This is all classic and very popular, and it is a beautiful thing. Now, if you start using these data for more than just genealogical reasons, e.g. by combining them with a DNA database, or you start selling products, it is a different story. For who has the power to reorganise all those data in a profitable way? Companies, yes, but not single individuals. It would be naïve simply to open everything and welcome anyone to take advantage of the data resources made accessible. It is very expensive to collect and to curate all those data, so some of the value should go back into the community. A private company should pay an extra fee to use data from public archives. That funding could go back to the archives, and more archival resources could be made available to the public.

There is an argument, at least here in the UK, in favour of opening government databases and archival material for entrepreneurial use by start-ups, which could have the opportunity to find ways of monetising the data. This is fine, but the data will not be used only by start-up companies. This is why I think something like a freemium model would be much preferable: free for individuals, more expensive for companies, and the bigger the company, the higher the fee may be.

EDITORS: This kind of regulation does not exist yet in the public sphere. It would be very difficult to implement.

FLORIDI: Perhaps, but it is not unprecedented. Companies that, for instance, hold financial data and sell them put online only some bits that are free for you to see; if you want 'the real thing', you have to pay. It is not a model that everybody knows, and it is not in use with public databases, but it provides a good example.

As to open data, remember that the open data movement started as a political movement, about the transparency of government. However, it soon became something else, once it became coloured by financial, and no longer political, interpretations. Initially, the open data discussion was about making the government more transparent and hence more accountable: one may see where the money goes, who does what, and who is responsible for what kind of programme, for example. From there, the goals slowly morphed into commercial (re-)use by start-ups for innovation, and things ended up with potential exploitation by big companies. What transparency is there in giving a private company access to, say, hospital records? It is not about transparency. It is not about a start-up. It is about a company taking huge advantage of costly public records. I am very much in favour of it, but I would add a price.
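[A toy illustration of the freemium access policy sketched above - again an added example, not from the interview. All names, quotas and rates are hypothetical; the point is only that individuals pay nothing, while a company's fee grows with its size and usage.]

```python
# Hypothetical fee schedule for a public archive under a freemium model.
def access_fee(requester_type: str, records_requested: int, employees: int = 0) -> float:
    FREE_QUOTA = 1_000        # assumed free monthly allowance for companies
    RATE_PER_RECORD = 0.01    # assumed base rate per record, in euros

    if requester_type == "individual":
        return 0.0                               # free for private citizens
    billable = max(0, records_requested - FREE_QUOTA)
    size_multiplier = 1 + employees / 1_000      # the bigger the company, the higher the fee
    return billable * RATE_PER_RECORD * size_multiplier

print(access_fee("individual", 50_000))                 # 0.0
print(access_fee("company", 50_000, employees=10_000))  # 5390.0
```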
EDITORS: Could you elaborate on the term 'hyperhistory'? It might be that the hyperhistorical era results in more or different tasks and goals for the archival community.

FLORIDI: Hyperhistory is a neologism I introduced in a recent book called The Fourth Revolution - How the Infosphere is Reshaping Human Reality. It is based on a simple idea. Time has classically been divided into prehistory and history. Prehistory refers to any stage of human development in which there exists no means of recording the present for future consumption; in particular, societies without writing. Prehistory ended around 6000 years ago in Europe and China, where -
