Preserving the Digital Heritage of the World
some thoughts after having collected 30 million Swedish web pages
This paper is dedicated to the tremendous task of
archiving the digital heritage of the world. Much thought, discussion and writing has been
devoted to web archiving in recent years, yet in the meantime most of the documents once
published on the Internet have disappeared and cannot be retrieved. I will discuss
different possible approaches to some of the problems relating to the
preservation of digital information, against a background of experiences from the
Kulturarw3 Project of the Swedish National Library as well as other
ongoing projects.
1. The Kulturarw3 Project
2. Collecting
3. Preservation of digital information
4. Digitisation and preservation
5. Access
6. Co-operation in its beginning
About the Author
The Kulturarw3 Project started
in September 1996, when the Royal Library hired an engineer, Johan Palmkvist, and I was
made part-time project leader. It was initially financed by a government grant of 3
million SEK (Swedish crowns) to test methods of collecting, preserving and providing
access to Swedish electronic documents.
The name Kulturarw3 means "cultural heritage" in Swedish, although the word is properly spelled
with a "v" at the end. The "w" has the same sound value in our language, and
we have indexed it with a 3 to point out that the WWW, or World Wide Web, is not only something new
and modern, but also part of our cultural heritage.
The project has made six comprehensive
harvests of the Swedish Web since January 1997: two in each year. The seventh harvest will
soon be completed. A harvesting robot is used to search and retrieve Swedish web pages
within the domains ".se", ".com", ".net", ".org"
and ".nu". A flow chart shows how it works (fig. 1).

Fig 1
The current number of Swedish web pages on
the Internet is about 8 million from 63 000 web sites (37 000 of which are ".se"
and more than 26 000 registered in other domains). Including pictures, sound etc. there
are 16 million files. The total size is under 300 GB. Our collection so far comprises
about 75 million files and 1.4 terabytes, or 70 DLT tapes. So the big problem is not the
actual size of the archive but rather the handling of the large number of files.
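A rough check of these figures supports the point. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted above; it assumes decimal units (1 GB = 10^9 bytes) and treats "under 300 GB" as an upper bound, and the variable names are illustrative.

```python
# Figures quoted in the text; decimal units are an assumption.
GB = 10**9

files_on_web = 16_000_000      # current Swedish web, all file types
size_on_web = 300 * GB         # upper bound ("under 300 GB")
archive_files = 75_000_000     # whole collection so far
archive_size = 1400 * GB       # 1.4 terabytes
tapes = 70

avg_file_bytes = size_on_web / files_on_web            # upper bound on the mean
avg_archived_file_bytes = archive_size / archive_files
bytes_per_tape = archive_size / tapes

print(f"mean file on the live web: at most {avg_file_bytes / 1000:.1f} kB")
print(f"mean file in the archive:  about {avg_archived_file_bytes / 1000:.1f} kB")
print(f"data per tape:             about {bytes_per_tape / GB:.0f} GB")
```

With a mean of less than 20 kB per file, the archive consists of tens of millions of small objects, which is why file handling rather than raw capacity is the harder problem.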
The project has received a special grant of 5
million SEK from the Knut and Alice Wallenberg Foundation for obtaining archiving
equipment. An archival computer with a disk array of 1.5 terabytes is in use to test
different ways of organising the web archive. A tape robot storage system will be
installed in 2000; it will also serve other projects dealing with
digital preservation.
There are at least a hundred different
file formats in the collection. Some are standardised or relatively standardised,
such as HTML and the picture formats JPEG and GIF; others are proprietary and probably
have a shorter life span, such as the different versions of MS Word, Excel and PowerPoint.
HTML, JPEG, GIF and plain text together make up 97% of the files. It is among the other 3%
of the files that we will find the more immediate problems. Migration, that is, conversion
to currently readable formats, will probably be the normal method of keeping old documents readable.
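To illustrate how such a format census might be taken, here is a minimal sketch. It is not the tool used by Kulturarw3: it assumes that the file extension is an acceptable proxy for the format (a real archive would more likely rely on MIME types or file signatures), and the set CORE_EXTENSIONS and both function names are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Assumption: extensions treated as the stable core formats named in the text.
CORE_EXTENSIONS = {".html", ".htm", ".jpg", ".jpeg", ".gif", ".txt"}

def format_census(paths):
    """Count files per lower-cased extension ('' when there is none)."""
    return Counter(PurePosixPath(p).suffix.lower() for p in paths)

def migration_candidates(census):
    """Return (share of core formats, other formats sorted by file count)."""
    total = sum(census.values())
    core = sum(n for ext, n in census.items() if ext in CORE_EXTENSIONS)
    others = sorted(
        ((ext, n) for ext, n in census.items() if ext not in CORE_EXTENSIONS),
        key=lambda item: item[1],
        reverse=True,
    )
    share = core / total if total else 0.0
    return share, others

if __name__ == "__main__":
    sample = ["index.html", "logo.GIF", "photo.jpg",
              "report.doc", "notes.txt", "budget.xls"]
    share, others = migration_candidates(format_census(sample))
    print(f"core formats: {share:.0%}")
    for ext, n in others:
        print(f"needs a migration plan: {ext or '(no extension)'} x {n}")
```

The list of "other" formats, sorted by frequency, is where a migration plan would start.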
Access to the Kulturarw3 documents is as a rule not allowed until a legal framework has
been created, which will be accomplished mainly by an ongoing
revision of the Swedish deposit law. The Swedish ministry of education has commissioned a
report, which was published late in 1998. The report suggests that the Royal Library and
the National Archive of Recorded Sound and Moving Images share the responsibility for
preserving and giving access to the historic Web. However, the report also suggests that
the access will be limited to researchers from established research organisations. Such a
limitation would be contrary to the democratic aim of the Swedish deposit law to guarantee
free access to information. We are now waiting for the politicians to make a long-term
decision on web archiving.
Kulturarw3
co-operates with another, later, Royal Library project called Svesök (meaning
something like Swesearch in English), which is creating access tools to the current
Swedish web. Svesök gets text files from Kulturarw3, adds descriptions in the
Dublin Core format to a very small selection of home pages (3200 at the moment) and puts
them into a subject tree structure. This selection is the electronic publication part of
the Swedish National Bibliography. All the text pages which Svesök retrieves from
Kulturarw3 are automatically indexed. A search engine is
provided which lists the pages that have DC descriptions first. Kulturarw3
intends to find ways of carrying the Svesök cataloguing work over into the historic web
archive.
At present the staff of Kulturarw3
consists of two persons, Allan Arvidson and Krister Persson.
The focus of this paper will now move from
a description of one way of preserving web material, the
Swedish project as it is today, to a discussion of some possible approaches to the
challenges of collecting web publications, and of preserving and giving access to digital
information.
What?
The first thing to decide is what
to collect. In today's projects you will find two main approaches.
The comprehensive one is represented by
the Kulturarw3 Project, by Brewster Kahle's Internet Archive and, more recently, by the Finnish EVA
Project. The aim is to collect everything published on the Internet. These projects are
collecting millions of documents. The selective approach is represented by the PANDORA
Project of the National Library of Australia and EPPP (Electronic Publications Pilot
Project) of the National Library of Canada. The scope is to collect important publications
which can be made accessible at once. They are "only" collecting thousands of
documents.
An argument for being selective is that
you should not spend your limited resources on preserving lots of trash. However, doing an
intelligent selection is difficult and researchers in the future will criticise our
choices. Even if we try our very best, important digital information will get lost.
Computer storage is getting cheaper and
cheaper, while the cost of personnel is not. It might seem a paradox, but it is a fact
that the selective projects use more staff than the comprehensive ones.
If selection is made in the indexing
process, and not in the collecting process, we have at least saved the publications, and
the inevitable mistakes we make when selecting publications for cataloguing and
indexing can be corrected in the future.
Who?
Who should preserve the digital
publications? There are at least three approaches to this problem. One is to make the
publishers and other institutions directly responsible as was advocated in the USA by the
Task Force on Archiving of Digital Information in 1996. The second is the national
approach exemplified by Denmark and by the Australian, Canadian, Finnish and Swedish
projects. The third is the international one represented by the Internet Archive.
Long-term preservation should be
undertaken by long-term institutions with stable financing, that last for hundreds of
years. Giving the task to the national library in each country, by revising the deposit
law so that its responsibility for printed publications is widened to include digital
ones, seems to be a good solution for many countries. Collection and
preservation are best done at one institution with good resources, while indexing and
selection might be done in co-operation with other institutions.
The institutional approach is not so
stable. It also combines badly with automatic, comprehensive collecting of web
publications, as each publisher and institution will find their own solution for
preservation of their own publications. Links pointing to resources on other sites will
not function.
The interactive character of the web pages
with links to other pages, regardless of national boundaries, speaks for the international
approach. But there seems to be a long way to go before it would be possible to create an
international institution for web archiving with long-term stable financing. It seems more
realistic to start co-operation between national web archives, not only to exchange
experiences and provide each other with support, but to create a forum for raising
questions of standards, exchange formats, communication between the archives, etc.
While we wait for a permanent solution, which
seems close in Sweden and Finland but so far fairly distant in most other countries,
institutions, companies and individuals have to rely on themselves if they want to
preserve their old web pages.
How?
The usual way to collect web
documents is by harvesting, i.e. using robot software which searches for documents on
the Internet and retrieves them by downloading a copy. Another way is to let the publisher
deliver the material on tape or magneto-optic disk, or via the Internet. The harvesting
approach is, of course, the only possible one for comprehensive web collecting. Dealing
with the owners of more than 60 000 web sites, as in Sweden's case, would be a
nightmare. But most selective projects also seem to take the harvesting approach, as it is
simple and practical. However, for some of the sites protected by user accounts and
passwords, and for very large sites, delivery might be used in the future.
The short life of the pages is a special
characteristic of web publications, which makes them different from printed publications
and electronic publications on CD-ROM and other carriers. It is so cheap and easy to
change a web page. The average life of publications on the Web (or rather of editions and
issues of web publications) is only some months. One must take this into account when one
decides between the snapshot approach and the continuous approach.
The snapshot approach is to take two,
four, six or another number of snapshots of the Web each year and let that represent the
web publications of that year. It is an attractive way to select automatically and reduce
the size of the web archive. The main problem with the snapshot is that you will lose
information like newspaper and journal issues and other important pages, which have a
short life. This means that you have to give certain web publications special treatment,
which will increase staff time and costs. This is for instance done by the Australian
Pandora project.
The continuous approach is not in
practical use today. The idea would be to collect as many editions and issues as possible.
To do that one needs a harvesting robot that collects information about the frequency of
change of each URL, and uses that information for its collecting strategy. According to
experts it would not be too difficult to construct such robot software.
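As an illustration of what the core of such a robot could look like, the following is a minimal sketch and not an existing harvester. It assumes a simple heuristic of my own choosing: the revisit interval for a URL is halved whenever a new edition is found and doubled whenever the page is unchanged. The class UrlState, the function record_visit and the interval bounds are all hypothetical.

```python
import hashlib
from dataclasses import dataclass

# Assumed bounds for the revisit interval, in days (illustrative values only).
MIN_INTERVAL = 1.0
MAX_INTERVAL = 180.0

@dataclass
class UrlState:
    interval_days: float = 30.0   # current guess at how often the page changes
    last_digest: str = ""         # fingerprint of the last retrieved copy
    versions_kept: int = 0        # number of distinct editions archived

def record_visit(state: UrlState, content: bytes) -> bool:
    """Update the revisit interval after a fetch; return True if a new edition was found."""
    digest = hashlib.sha256(content).hexdigest()
    changed = digest != state.last_digest
    if changed:
        # New edition: archive it and, unless this is the first visit, come back sooner.
        state.versions_kept += 1
        state.last_digest = digest
        if state.versions_kept > 1:
            state.interval_days = max(MIN_INTERVAL, state.interval_days / 2)
    else:
        # No change: back off so that stable pages cost fewer fetches.
        state.interval_days = min(MAX_INTERVAL, state.interval_days * 2)
    return changed
```

A newspaper front page would quickly converge towards the minimum interval, while a page that never changes would drift towards the maximum, so the robot's effort follows the observed frequency of change.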
The next set of problems concerns the
long-term preservation and access of digital information in general (of which web
publications constitute one subset). The amount of digital information created is
increasing drastically. The time when word processors and accounting systems were tools to
create written or printed documents is gone. Now more and more information is primarily
digital. It might be in a text format like MS Word, HTML or XML, in an image format like
TIFF or JPEG, in some kind of database or in a more specialised system. Today, not only
print-outs but also printed reports should often be regarded as secondary forms, used to
spread the information or a selection of it on paper, just as digital formats like HTML,
PDF and Excel reports can be secondary forms for spreading the information on intranets or
the Internet. But for long-term preservation, most institutions and companies still stick
to paper and in some cases microfilm, when they are not turning a blind eye to the
problem.
I will take one example from a research
library perspective: what happens to the manuscripts of today? For centuries, The Royal
Library has collected personal archives of authors and other persons related to the
publication and production of books. These are frequently used sources for studies in
literature, art, history and other academic disciplines. Today, the corresponding material
stays on the author's PC until she or he buys a new computer, at which point most of it is lost.
Therefore, on the initiative of the author and professor Sven Lindqvist, a member of the
library board, we have just started a project to find ways of preserving digital personal
archives. Such an archive might include different versions of texts reflecting the
creative process, as well as e-mail correspondence and research material collected by the
author.
The difficult preservation problem is not
the life span of tapes and other carriers of the information, as it is easy to copy the
1s and 0s of which the digital information consists, and the copy is identical to the
original. The difficulty is the short life of the software
and hardware environments. You need a new computer every third year and have problems
reading documents older than ten years, because today's software can only import a
limited selection of file formats.
The question is: how are our successors
going to read the digits we have preserved for them? There are at least three approaches
to this problem: the technological museum, migration and
emulation. The first approach would be to create a technological museum
with old computers and software. But this would be only a temporary solution, as you will soon
run out of spare parts, and the cost of maintaining the knowledge of how to run the systems
and software will rise tremendously.
The migration approach means successive
conversions of the files to current formats, when the old ones are outdated. That means a
maintenance cost for the archive, but a cost that can be controlled. One of the drawbacks
of migration is that some information or functionality will inevitably be lost at times
when a document is converted from one format to another. Even if the
textual contents will be correct and complete, some of the authenticity of the document
will get lost. So you need a good strategy, trying to use standards and as few conversions
as possible.
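The goal of "as few conversions as possible" can be made concrete. The sketch below is an illustration only: it assumes that the available converters are described as a table of source and target formats, and it uses a breadth-first search to find the chain with the fewest steps. The format labels and the CONVERTERS table are hypothetical and do not describe any real converter inventory.

```python
from collections import deque

# Hypothetical converter table: source format -> formats it can be converted to.
CONVERTERS = {
    "word2": ["word6", "rtf"],
    "word6": ["word97", "rtf"],
    "word97": ["html", "pdf"],
    "rtf": ["html"],
}

def shortest_migration(source, target, converters=CONVERTERS):
    """Breadth-first search for the conversion chain with the fewest steps."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in converters.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of converters reaches the target

if __name__ == "__main__":
    print(shortest_migration("word2", "html"))
```

Counting steps is a crude measure, since one lossy conversion can do more damage than two careful ones, but it shows how a migration strategy could be planned rather than improvised.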
The emulation approach means reading old
files by writing new software in your current computer environment,
emulating the old programs or at least the reading part of them. In a way, this is the
most comfortable approach. You save the information in the original format and rely on the
ability of future generations to create reading software for their use.
In my opinion a combination of the
migration and emulation approaches is the best way to go.
Yesterday's information can be made
accessible by digitisation of texts and images on paper and other traditional materials.
But the digitisation and digital library projects of today are focused on availability
only and seldom take the consequences for preservation into account. When the quality of the
digital images is low, the projects will generate more use of the originals for study and
reproduction and pose a threat to their survival.
On the other hand, preservation planning
seldom includes digitisation as a means of protecting the originals from wear and tear.
There is a need to bring professionals from the different spheres together for
co-operation and joint thinking.
At the Royal Library we have started such
a project, called Platform for image databases, tackling questions like the quality needed
for archival copies, presentation copies and delivery copies (in terms of e.g. resolution
and colour depth), standards, recommendable file formats for archival purposes,
presentation and delivery copies, compression, safety, permanence of digital information
and migration as well as questions concerning registration and cataloguing of images,
search methods and user interfaces. A full report in Swedish will be published before
summer.
If you search for a tree on the Internet
today, you will get the whole forest as an answer. In the long list presented, you will be
lucky if you find a relevant hit on page seven. This problem will not lessen and the list
will not be shorter in a historic web archive. Cataloguing, even if it is done at a
minimum level, can hardly be accomplished for more than a few per mille of the web pages.
(8000 is one per mille in Sweden's case.) Therefore, it is important to promote the
use of metadata in order to help and encourage the producers to do their own cataloguing
and put it onto the page.
After years of discussion, it seems that
the Internet community rallies around the metadata format Dublin Core. The Royal Library
promotes metadata by meetings and by information on the Web, by having a template for
Dublin Core creation at Svesök, and by encouraging other actors to also provide Dublin
Core templates.
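To show what a producer would actually put onto a page, here is a minimal sketch that renders a Dublin Core description as HTML meta elements, following the common convention of a "DC." prefix together with a schema link. It is not the Svesök template; the function name dublin_core_meta and the sample record are hypothetical.

```python
from html import escape

DC_SCHEMA = "http://purl.org/dc/elements/1.1/"

def dublin_core_meta(record):
    """Render a dict of Dublin Core elements as HTML lines for a page head."""
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}">']
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]  # allow a single value or a list of values
        for value in values:
            lines.append(
                f'<meta name="DC.{escape(element)}" content="{escape(value, quote=True)}">'
            )
    return "\n".join(lines)

if __name__ == "__main__":
    # Sample values are made up for the illustration.
    print(dublin_core_meta({
        "title": "Example home page",
        "creator": ["First Author", "Second Author"],
        "language": "sv",
    }))
```

Because the description travels with the page itself, a harvesting robot collects the cataloguing data at no extra cost.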
Automatic indexing and cataloguing might
also be used more for digitised material in the future. A possible development for the
retrieval of web publications is illustrated in figure 2.
Fig 2. Web retrievability. The large squares represent the Web or the Swedish Web or some
part of the Web. Labels: professional cataloguing, Dublin Core, automatic cataloguing.
To give access to web publications is part
of the much larger challenge to make all kinds of historic digital information easily
available in the future, both documents and objects which are originally digital and those
which are the results of the digitisation of older collections in archives, libraries and
museums. In a project called Automatic Indexing of Newspapers at the Royal Library we are
trying optical character recognition (OCR) on old newspapers, some of them printed in gothic type.
The resulting digital text is then used just as it is (without any expensive corrections
made) for indexing and fuzzy search. The hits are linked to the image files of the
newspaper pages, as the digital text is corrupt and almost unreadable.
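The principle can be sketched in a few lines. This is an illustration only and not the software used in the project: it assumes that each page image has a plain-text OCR result, and it uses the similarity matching in Python's standard difflib module as a stand-in for whatever fuzzy search method is actually employed. The file names and the cutoff value are hypothetical.

```python
import difflib
from collections import defaultdict

def build_index(pages):
    """Map each OCR word (lower-cased) to the set of page images containing it."""
    index = defaultdict(set)
    for image_file, ocr_text in pages.items():
        for word in ocr_text.lower().split():
            index[word].add(image_file)
    return index

def fuzzy_search(index, query, cutoff=0.75):
    """Return the page images whose OCR text has a word similar to the query."""
    matches = difflib.get_close_matches(query.lower(), list(index), n=10, cutoff=cutoff)
    hits = set()
    for word in matches:
        hits |= index[word]
    return sorted(hits)

if __name__ == "__main__":
    pages = {
        "1865-03-02_p1.tif": "rilcsdagen har i dag",   # "k" misread as "lc"
        "1865-03-02_p2.tif": "stockholm den 2 mars",
    }
    print(fuzzy_search(build_index(pages), "riksdagen"))
```

The reader never sees the corrupt text: the query tolerates the OCR errors, and the answer is the image of the page.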
Let us hope that techniques developed
within different specific projects can be scaled up and applied to large segments of digital
information in the future. The important thing is to find methods that are as automated as
possible for the retrieval of and access to the information, as the cost of manual
handling is much too high for such a large number of objects.
In 1997 Kulturarw3 initiated an
informal group for technical co-operation within the Nordic countries called the Nordic Web
Archive. We have also downloaded web pages for seven Central American countries, viz.
Belize, Costa Rica, El Salvador, Guatemala, Honduras, Nicaragua and Panama. Within IFLA,
the International Federation of Library Associations and Institutions, there is a growing
interest in web archiving. An open session on the subject will be held this year in
Jerusalem.
There is certainly a need for much more
co-operation in the future both on preservation of digital information as a whole and on
web archiving. Then, just to take one example, it will perhaps be possible to follow an
old link on a web page in one national web archive to the proper document from the same
time in another.
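Following a link "to the proper document from the same time" amounts to choosing, among the captures another archive holds for a URL, the one closest in time to the linking page. The sketch below shows only that selection step, under the assumption that each archive can supply a sorted list of capture times for a URL; no such exchange protocol exists yet, and the function name closest_capture is hypothetical.

```python
import bisect
from datetime import datetime

def closest_capture(capture_times, wanted):
    """Return the capture time nearest to `wanted` from a sorted list, or None."""
    if not capture_times:
        return None
    i = bisect.bisect_left(capture_times, wanted)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = capture_times[max(0, i - 1): i + 1]
    return min(candidates, key=lambda t: abs(t - wanted))

if __name__ == "__main__":
    # Made-up capture dates for one URL in another national archive.
    captures = [datetime(1997, 1, 15), datetime(1997, 8, 1), datetime(1998, 2, 10)]
    print(closest_capture(captures, datetime(1997, 7, 1)))
```

Agreeing on how archives exchange such lists of capture times is exactly the kind of standards question that co-operation between national web archives would have to settle.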
Johan Mannerheim holds a fil. kand. degree (Bachelor of Arts) covering both science and
humanities subjects. He is a librarian, division director and head of the Data and IT unit
at the Royal Library (Kungl. biblioteket, KB). He built up the national microfilming
of the Swedish daily press under KB's auspices and has contributed to standardisation in
the fields of microfilming and imaging. He is a book collector with a focus on
the history of the book and has taught the subject at Grafiska institutet. Since the
mid-1990s he has worked on developing KB's IT infrastructure and is, among other things,
responsible for KB's collecting of Swedish web pages in order to preserve them
for the future.
Literature
E-plikt: att säkra det elektroniska kulturarvet, 1998 (SOU 1998:111), governmental committee report on securing the electronic cultural heritage
http://utbildning.regeringen.se/propositionermm/sou/pdf/1998/sou98_111.pdf (in Swedish)
Internet Archive
http://www.archive.org/
Kulturarw3
http://kulturarw3.kb.se/html/kulturarw3.eng.html (in English)
http://kulturarw3.kb.se/index.html (in Swedish)
Kungl. bibliotekets yttrande angående pliktutredningen, pronouncement of the Royal Library on the committee report E-plikt
http://www.kb.se/BIBSAM/EPLIKT/kbsvar.htm (in Swedish)
Lagen om pliktexemplar (SFS 1993:1392), the Swedish deposit law
http://www.notisum.se/rnp/sls/lag/19931392.HTM (in Swedish)
Metadata och Dublin Core
http://www.kb.se/bus/dc/dcstart.htm (in Swedish)
National Library of Canada Electronic Collection
http://collection.nlc-bnc.ca/e-coll-e/index-e.htm
PANDORA Project
http://pandora.nla.gov.au/
Platform for image databases
http://www.kb.se/DoIT/bildbas_eng.htm
Preserving Digital Information. Report of the Task Force on Archiving of Digital Information, commissioned by The Commission on Preservation and Access and The Research Libraries Group, Inc., May 1, 1996
http://www.rlg.org/ArchTF/
Project EVA
http://linnea.helsinki.fi/eva/english.html
The Royal Library
http://www.kb.se/ENG/kbstart.htm (in English)
http://www.kb.se/ (in Swedish)
Svesök, search for current Swedish web pages
http://www.svesok.kb.se/ (in Swedish)
© Johan Mannerheim 2000
Human IT 1/2000