One-way roads and dead-ends
along the information highway

Mikael Gunnarsson

Inst. Bibliotekshögskolan, Högskolan i Borås


Abstract

The graphical World Wide Web, which is no older than a few years, is often regarded as either a new mass medium or a vast resource of information. This text presents a different view of the web, which stresses the web's hypertextuality.

The author touches upon one taxonomy for hypertext and, on that basis, characterizes the web as a collaborative hypertext. This shows that the epithets commonly attributed to the web - unstructured, unreliable - are somewhat inadequate and overshadow its true problems: its lack of metadata, reliable object identifiers and human coordination.


Introduction

Whenever you come across some article on the web these days, you will probably see complaints about the disorder and absent structure of the web. I always wonder what is meant by "structure" in these cases.

In a library you may come across some public work stations for web access with a sign telling you that the maximum time for searching is 15 or 30 minutes or so. So what would be the maximum time when reading on the web?

When discussing the importance and benefits of future electronic publishing one may be confronted with arguments stating the improperness and untrustworthiness of new technology.

All this bears witness - in my opinion - to a vast misunderstanding of what the web is and what its purposes are. So, in this text I'll try to give a different, perhaps clarifying view of what the web is, or rather, of the meaning of the distributed, collaborative hypertext of the web.

My aim is to describe the concept of hypertext and to relate it to the actual shortcomings of the web when it comes to finding meaning in searching, reading and browsing it.


Documents and texts

We who are focussing on electronic communication and documentation constantly talk about documents or texts while others may talk about books, magazines or newspapers. We try to escape the problematic concepts of information, knowledge and data. I would say that this is most often really necessary, since every attempt to define them seems to elude the pincers of one's mind, as it must involve regarding human interaction with messages, signs and symbols on a cognitive, psychological or sociological level.

But at least we should elaborate a little more on what we mean by documents and texts, for the contextual matters of our missions.

There is nowadays a lot of talk about the notion and mechanisms of hypertext.

The word hypertext was coined back in the sixties by Ted Nelson in his 1965 paper A file structure for the complex, the changing and the indeterminate. His unfinished project was named Xanadu alluding to both the poetic fragment of Samuel Coleridge's Kubla Khan (1798) and the palace Xanadu in Welles' Citizen Kane. Xanadu was a palace which was never completed, just like the project of Ted Nelson, which attempts to "design and develop" a system for the containment of our total docuverse. (Nelson, 1993, preface)

The phenomenon of hypertext though, is not new, so what is it really about?

First I will state that by documents I mean on the one hand containers of more or less complex messages, intended to overcome the communication difficulties of distance in time and/or space. The containers are then carried on some physical media, like paper or magnetic disks, and represented by a bundle of binary signs or ink stains forming alphanumeric, structural or pictorial signs. On the other hand I mean by documents an instance of collected smaller documents grouped together, and thus forming a new message.

There is no assumption that a document has fixed physical delimitations.

This notion of document is in both cases synonymous with text and thus clings to the ideas of hypertext people like Roy Rada and Ted Nelson. (Rada, 1991, p. 2, Nelson, 1993, pp. 1/14-1/19) There are of course reasons for having two words like text and document, but I won't argue about that here.


Hypertext

Documents may be said to act as communicators themselves and may therefore be considered equivalent to one person giving messages to one or many persons. Interactive communication though, is not usually possible with documents as communicators, as it is with living and responding actors.

Part of these conditions is explained when describing ordinary paper bound texts as having a linear structure, just as many lectures are strictly linear if the audience is quiet and just follows the trails of the lecturer.

A novel is nearly always read from the beginning to the end, but that is actually not often the case with non-fiction just as the tight sequential structure of a lecture may be broken up by questions from the audience.

The paper media for text-storage and/or the time-line force the creator of a message to present it sequentially. A highly complex issue though, may always be approached from different directions and it is almost necessary to do so in order to get acquainted with it. This is really what I am trying to do in this text, to give a sequential instance of another approach to understanding the web.

A recent master's thesis on the concept of relevance emphasizes that whether a message is relevant or not depends on the reader's preconditional suppositions, theories and hypotheses rather than on close matching of the reader's expressions with those of the text. (Philipson, 1996) With non-fiction literature that statement is true when the reader tries to find particular passages of a text which fit his temporary needs rather than reading it sequentially.

But the point is that, when using paper or planning a lecture, the creator must try to think sequentially.

For many of the hypertext stakeholders hypertext holds a promise of freedom from sequentiality.

/.../ a non-linear way of presenting information. Rather than reading or learning about things in the order that an author, or editor, or publisher sets out for us, readers of hypertext may follow their own path, create their own order -- their own meaning out of the material.

(Amaral 1996)

One reason for characterizing documents as containers is that documents may contain anything; they are like buckets or carts which the reader fills with whatever (s)he wants or comes across.

This freedom of choice that the reader is offered by many documents points to the essence of what hypertext stakeholders argue for. The forced sequentiality of traditional text is from their point of view a strait jacket for the writer.

From the reader's point of view though, almost every document can in fact be characterized as hypertext. In this text you will probably come across bewildering concepts which force you to turn to other sources in order to explore the meaning of them. That is also my intention and in that respect I am writing hypertext. But this view of hypertext is not particularly fruitful and doesn't explain why there is so much talk about hypertext, so let's return to documents and their properties for a while, to end up in the usually perceived concept of hypertext.

Documents - as well as texts - may contain textual messages as well as pictures, sounds, movies or other kinds of objects.1. The web shows astonishing evidence for this nowadays, as its technology moves on with integration of new applications and programming languages like Java. It is probably true that in a while we may come across even smelling objects on the web, if our machines are furnished with appropriate equipment for that.

In the paper paradigm documents are restricted to contain a few kinds of objects: text and pictures. If you want to deliver a message by text and moving pictures with sound, you must choose another physical carrier for your document and thus you will end up in another infrastructure for preserving, distributing and presenting your message. Paper messages - regarded with almost religious reverence, especially among many librarians - are handled by a different organisation than the one that takes care of electronic documents like video and audio. The binary document though, is a promise of unity. Almost any kind of message can be represented by binary signs, and thus a binary represented hypertext extends the traditional notion of document.

Rada (1991) has elaborated one taxonomy for characterizing hypertext. He states that there are four types of hypertext, namely Microtext, Macrotext, Collaborative hypertext and Intelligent hypertext. I will not argue for this terminology here, just state that it is the third type that is really interesting for our mission and which I will end up talking about shortly. It is also necessary to say that I make use of this taxonomy, even though I am uncertain at this point as to the usefulness of it.

Hypertext consists of mainly two components, nodes and links.2. Nodes may be synonymous with a fixed document or a part of that document. Remember that a document is formed by its context, not by itself. The links then connect two or more nodes together in some way. If the connection is done in some automated way (most often with the help of binary technology) we are inclined to talk about it as hypertext in its usual sense, but even the use of footnotes or phrases like "as mentioned before" serves as pointers from one node to another.

It is evident that links between nodes are of different types, which are of great importance for the structural context. A link may point to a node which verifies a statement or it may point to a node which elaborates on one concept. Many hypertext systems don't support this kind of link typing, which consequently may become a considerable disadvantage for large hypertexts like the web. (Allan, 1995)
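The two components and the idea of link typing can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class names, the node names and the type labels "verifies" and "elaborates" are hypothetical, chosen to mirror the two examples given above, and are not taken from any existing hypertext system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node: a whole document or a part of one."""
    name: str
    text: str = ""


@dataclass
class Link:
    """A directed link from one node to another, carrying a type label."""
    source: str
    target: str
    kind: str  # hypothetical labels, e.g. "verifies" or "elaborates"


@dataclass
class Hypertext:
    nodes: dict = field(default_factory=dict)
    links: list = field(default_factory=list)

    def add_node(self, name, text=""):
        self.nodes[name] = Node(name, text)

    def add_link(self, source, target, kind):
        # Refuse links whose endpoints are not known nodes.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both ends of a link must be existing nodes")
        self.links.append(Link(source, target, kind))

    def targets(self, source, kind=None):
        """Names of nodes linked from `source`, optionally of one kind only."""
        return [link.target for link in self.links
                if link.source == source and (kind is None or link.kind == kind)]


ht = Hypertext()
ht.add_node("claim", "Links between nodes are of different types.")
ht.add_node("evidence", "A study that supports the claim.")
ht.add_node("glossary", "A longer explanation of the word 'link'.")
ht.add_link("claim", "evidence", "verifies")
ht.add_link("claim", "glossary", "elaborates")

print(ht.targets("claim"))              # every outgoing link
print(ht.targets("claim", "verifies"))  # only the links that verify
```

A system without link typing corresponds to calling targets() without the second argument: the reader sees that two nodes are connected, but not why.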


Microtext and Macrotext

The 17th century encyclopedia Dictionnaire historique et critique (1695-1697) by Pierre Bayle is an extremely illuminating example of a highly elaborated hyperdocument, where every page consists of a few lines of primary text with footnotes pointing to explorative remarks on some concept in the primary text. With the taxonomy of Rada we would call this a microtext or with his other term small-volume hypertext. (Rada, 1991, p. 22ff.) With Monk we will characterize the links as internal. (Monk, 1990, p. 20)

If we then move on to think about a thesis with an accompanying bibliography to which the author makes references, then we may say that the contents of the thesis in some way incorporate the contents of the cited works. Readers are of course not forced to include the cited works in their reading, but they should do so if they wish to understand the complete message of the thesis. It is then almost clear that a traditional bibliography just forms some kind of topnode which consists of links and pointers to other nodes - or documents - with something in common, maybe just a topic or the fact that they are all published in one country, in which case it is called a national bibliography. The complete mass of published documents of one country accessed from that topnode may then be seen as one Macrotext or in other words, a large-volume hypertext. With Monk we will characterize the links as external, bringing together different physical documents. (Monk, 1990, p. 20)

The difference between Microtext and Macrotext is not that of extension. It is a matter of responsibilities for the creation of nodes and links. In a Microtext you have a rather clear notion of who is responsible for the creation and distribution of the document. In the case of paper carried documents you may need only one physical object to read the complete microtext, even though that is not always the case, as with continuing thematic newspaper articles.

It should be mentioned that the great promise of current hypertext in binary representation is that hyperdocuments may be created, which just consist of a bunch of links to already existing nodes, thus forming a new document. A new message will be created from old ones. A frightening fact for some, but think about the sampling principle of modern music, the journalist's reporting or the student's overview of recent research. Aren't these just packets of old messages in a new wrapping, consequently given new dimensions? So what's so frightening about reuse?


Collaborative hypertext

Since the possibilities of recording messages arose, documents have been used to support human cooperation. A large part of an organisation's potential to act lies in the proliferation of different kinds of recorded messages - documents - and methods for creating and distributing them. Moreover, it would be nearly impossible to interact nowadays without the possibility to communicate with recorded messages.

These kinds of messages may or may not be intended to be of everyone's interest, but they may nevertheless be of extreme importance for the cooperative organisation. The documents containing these messages may be memos, schedules or PMs, but they may also be manuals, documents describing an organisation's goal or budget calculations. It is plausible to think of the creation of these documents as a process of constant refinement. One person may begin by presenting an issue. This issue is then followed by another person's position or argument on the issue, which will probably generate comments, personal or public annotations and so on.

All these stages of the process will generate documents. Whether they will be preserved for a long time is another question, but in any case they present you with a whole bundle of nodes tied together with something which may be characterized as contextual and loosely defined relations. The nodes are all in some way related to each other, but the relationships are not always declared as messages and carried with the documents. The links will cease to exist when they are forgotten, as well as the nodes themselves if they are not archived.

The technology to support these processes has sometimes been called groupware and the collection of the nodes and links of this grouptext is what Rada has termed Collaborative hypertext. The difference with this category as opposed to the other two mentioned earlier, is that the nodes are very much smaller, the links may be ephemeral and a final document is aimed at which is probably exposed to constant revisions. The responsibilities for the creation of the nodes are not always clear, and the reading of the final document or retrieval of a particular message presents the user with more or less different difficulties than in the two earlier mentioned types of hyperdocuments. In order to access and read this grouptext we need other tools than before.

The Augmentation system of Douglas Engelbart and W K English in the 60s3. and the graphical IBIS of Conklin and Begeman (Conklin, 1988) in the 80s are two examples of systems that intend to support collaboration in this way.

Nor is this kind of hypertext essentially new. Anyone acquainted with the Jewish religion might see the parallel in the documents which make up the platform for its religious and cultural code, the shariah of the Jews. The Talmud is a very good example of a collaborative hypertext. The Talmud is a cumulative collection of commentaries on the Torah and on other comments as well as on life itself. It may be interesting to note that as the Torah - that is mainly the Holy Bible - is regarded as the "written Torah", the Talmud - together with Midrash - is regarded as the "oral Torah". (Nigosian, 1986) The importance of oral tradition is seldom recognized in our European world of sequential writing. From our point of view, that which is not fixed is of less value. This poses a great problem when it comes to the ever changing web.

In conclusion I must approach the question of "what about all this talk of hypertext just now?". The simplest answer is that the binary technologies have presented us with methods and mechanisms that allow us to make the links between nodes almost invisible.4. The idea of hypertext in fact goes back to the thirties and forties, when Vannevar Bush wrote his essay As we may think (1945) and designed his Memex machine. His aim was to apply modern technology to build a machine that could handle macrotext with the links automated. The shortcomings of his project though, lie in the fact that his machine was designed on analog principles and not on binary. The Memex was never completed.

The binary principles used for storage of nodes and links seem to have fed the notion of Nelson's compound hypertext, where a document only consists of pointers into a global repository of a vast amount of data furnished with links between chunks of data. (Nelson, 1993) It is then possible to talk about dynamic documents which will be created on-the-fly at the reader's request. These dynamic documents may or may not be sequential, but nevertheless hyperdocuments from the system's point of view.
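A minimal sketch may make the idea of a document made only of pointers more tangible. It assumes a drastically simplified repository (a single string) and pointers in the form of start and end positions; the names REPOSITORY, span_of and render are hypothetical, and the sketch does not describe how Xanadu itself is designed.

```python
# A shared repository: here simply one long store of text.
REPOSITORY = ("Hypertext is not new. "
              "The web is a collaborative hypertext. "
              "Links may be typed.")


def span_of(phrase):
    """Return a (start, end) pointer to `phrase` inside the repository."""
    start = REPOSITORY.index(phrase)
    return (start, start + len(phrase))


def render(pointers):
    """Assemble a document on the fly from a list of pointers."""
    return " ".join(REPOSITORY[start:end] for start, end in pointers)


# The "document" stores no text of its own, only pointers.
document = [span_of("Links may be typed."), span_of("Hypertext is not new.")]
print(render(document))
```

The rendered result reads as a sequential text, although from the system's point of view it exists only as two pointers into the repository.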

The importance and the difficulties of collaborative hypertext lie in how we put it together, store it and make it available. Two of the most prominent difficulties of a grouptext involve keeping track of a document's history and making it readable to anyone who may have the need for it. These are questions which will not be answered here, but I am using them to bring our problems with the web into focus.


The web

Alas, the web is not in the restricted sense an information retrieval system, it is a collaborative hypertext. It was never intended to be a searching-tool in itself, it is just something which in a rather primitive way has been given ineffective but necessary searching-tools through the efforts of corporations and researchers.

The proposal of Tim Berners-Lee, the "father" of the web, to the CERN community in 1989-1990 points to the original intentions of the web.

The actual observed working structure of the organisation is a multiply connected "web" whose interconnections evolve with time. In this environment, a new person arriving, or someone taking on a new task, is normally given a few hints as to who would be useful people to talk to. Information about what facilities exist and how to find out about them travels in the corridor gossip and occasional newsletters, and the details about what is required to be done spread in a similar way /.../

/.../ We should work toward a universal linked information system, in which generality and portability are more important than fancy graphics techniques and complex extra facilities.

(Berners-Lee, 1996) 5.

A hyperdocument, whichever category it belongs to, is just like sequential texts intended to be decoded by the perceiver. The difference is that the hyperdocument is, from the system's point of view, fragmented into chunks of linked nodes, a structure which may be difficult to decode. So the argument that the web is unstructured seems rather silly from a systems-oriented view. It has a structure, although a complex and uncoordinated one.

The system or user interface has to present the reader with some front-end that makes the hypertext readable, browsable and understandable, just as it has to when it comes to sequential text. The difference here from the paper paradigm is that binary documents have to be presented on a computer-screen, where you don't turn pages, you scroll through the text just as was done in ancient times with scrolls of papyrus.


The indices of the web

When it comes to vast amounts of text it is also necessary that the user interface presents the reader with tools that make the text searchable on keywords, an index. This latter is what Monk among others has termed navigation external to the hypertext, which really points to the fact that the index is not necessarily a part of the hypertext which it points to. (Monk, 1990, p. 20f.)

Tim Berners-Lee, though, in his World Wide Web - summary includes the indices in his description of the web.

The WWW world consists of documents, and links. Indexes are special documents which, rather than being read, may be searched. The result of such a search is another ("virtual") document containing links to the documents found

(Berners-Lee, 1991-1992)
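Read literally, the quoted description treats a search as the creation of a new document whose only content is links. A minimal sketch, assuming a handful of hypothetical addresses under example.org and the simplest possible word index, could look like this:

```python
from collections import defaultdict

# Hypothetical nodes: address -> text.
PAGES = {
    "http://example.org/a": "hypertext nodes and links",
    "http://example.org/b": "metadata for the web",
    "http://example.org/c": "links need metadata",
}


def build_index(pages):
    """Map each word to the set of addresses whose text contains it."""
    index = defaultdict(set)
    for address, text in pages.items():
        for word in text.lower().split():
            index[word].add(address)
    return index


def search(index, word):
    """Return a 'virtual document': a sorted list of links to the nodes found."""
    return sorted(index.get(word.lower(), set()))


index = build_index(PAGES)
print(search(index, "metadata"))
```

The index here is kept outside the pages it points to, which is in line with Monk's notion of navigation external to the hypertext.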

The index serves the reader of a sequential paper carried text with a tool for finding certain passages of the text, when (s)he doesn't want to read the whole text. Remember that this in fact qualifies the sequential text and the index together for the epithet hypertext (macrotext). An index may serve as an access point to nodes of a hyperdocument as well. The real problem here is that the indices of the web often don't point to nodes, they point to physical files. One may say that embedded objects like pictures and sounds reside on the same level as the textfiles and are included within the indexpointer (in fact by the embedding of immediate links in the textfiles) but the links of textnodes are not bidirectional, and consequently parent nodes are left out.

The reader is directed to a node which forms a part of another more encompassing node, and as node-creators still try to use sequential ways to present messages, that node may not be usable out of its context.6. Thanks to the many style guides accessible over the web and the global striving towards standardized ways of publishing on the web, many nodes give you a pointer back to their parent node, and all is well. No, it's not! We still need some way to give the reader a clue about our hybrid between true hypertext and sequential writing. It is still extremely difficult to grasp the structure and extent of a complex web-node.
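The missing parent nodes can be illustrated with a short sketch. Since the links are stored only in the node they start from, finding the parents of a node requires knowledge of all the forward links, which can then be inverted. The file names below are hypothetical, and the sketch assumes that every forward link is known, which is exactly what cannot be taken for granted on the web.

```python
from collections import defaultdict

# Hypothetical forward links: node -> nodes it points to.
FORWARD = {
    "thesis.html": ["chapter1.html", "chapter2.html"],
    "chapter1.html": ["figure.gif"],
    "chapter2.html": ["figure.gif"],
}


def backlinks(forward):
    """Invert the forward links so each node lists the nodes pointing to it."""
    parents = defaultdict(list)
    for source, targets in forward.items():
        for target in targets:
            parents[target].append(source)
    return dict(parents)


PARENTS = backlinks(FORWARD)
print(PARENTS["chapter1.html"])  # the encompassing node of chapter1.html
print(PARENTS["figure.gif"])     # an embedded object shared by two nodes
```

An index that returned chapter1.html together with its entry in PARENTS would at least tell the reader which encompassing node the hit belongs to.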

So what we really should want the indices to do, is to make up a new dynamic and unique hyperdocument, every time we search it. This is in fact true with other resources accessible over the web, like the Krakatoa chronicle : an interactive, personalized, newspaper on the web. (Kamba, 1995)

Moreover, as the web has grown so much, the indices are often incomplete and out of date, and the only way to find a relevant access-point is to begin in one of its indices, find one link and hope to find the way to a relevant node by reading and browsing. I wonder why people always talk about searching the web, when it is really a question of reading and browsing.

Searching occurs when a person knows the label for some information and wants only that specific information. 7.

(Rada, 1991, p. 13)

Many people seem to trust the indices, when they really should trust themselves. You are in fact using the web for reading and the indices - that's not really the web, in a strict sense - for getting one or more starting points.

The indices are not so much to blame either. There are almost no metadata other than the URL for these indices to act upon, and there must be, since the growth of the web requires automatic indexing. Of course the methods for automated indexing create some metadata on the basis of keyword position and frequency, but many indices also neglect the important metadata actually furnished by the creators.8. Alta Vista is one exception; it handles metadata of two kinds, manually embedded keywords and descriptions.
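How an indexing program might pick up creator-furnished keywords and descriptions can be sketched with the HTML parser of the Python standard library. The sample page and the class name MetaExtractor are made up for the illustration, and nothing is claimed here about how Alta Vista or any other index actually processes pages.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect NAME/CONTENT pairs from META tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


# A made-up page carrying the two kinds of metadata mentioned above.
PAGE = """<HTML><HEAD>
<TITLE>One-way roads</TITLE>
<META NAME="keywords" CONTENT="hypertext, metadata, web">
<META NAME="description" CONTENT="A view of the web as collaborative hypertext.">
</HEAD><BODY>...</BODY></HTML>"""

extractor = MetaExtractor()
extractor.feed(PAGE)
print(extractor.metadata)
```

An index that ignores these tags has only the address and the raw words of the text to act upon.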

Another comparison for illuminating web-use may be made with the use of encyclopedias, which may be said to act as both microtext and macrotext. Most people choose one particular encyclopedia for finding out about a topic. If the topic is not treated by that encyclopedia one may move on to another one, but if it is treated it is hopefully furnished with references to relevant further reading. The encyclopedia is then an analogy to the web's indices and what they point to, so why do people argue about the ineffectiveness of the web? The encyclopedia is probably furnished with a lot fewer access-points and pointers to further reading. It may be that the relative constraints of an encyclopedia are satisfactory, as they keep the reader away from the anxiety of information overload, and that the web's "overload" is unsatisfactory.

Of course, it is also the characteristics of the nodes to which the indices point. As the web is a collaborative hypertext and the encyclopedia is not (strictly speaking), then it is true that one may not have a clue at all about the context, integrity, version, accuracy and consistency of the node to which the index points. Academic theses, advertisements and personal home-pages are treated as equals by the indices.


The rise of the web

The web was proposed at CERN (Conseil Européen pour la Recherche Nucléaire) in 1989 as a technology for distributing collaborative hypertext. It was not originally intended for global electronic publishing, as Nelson's Xanadu was.

The release of the graphical NCSA Mosaic in early 1993 though, cleared the web's way into the public arena. The technology was in some ways astonishingly well suited for collaborative hypertext on a global scale, but it had some shortcomings for which CERN is not to blame.


Its main shortcomings and future promises for information retrieval

Many people use the web for information retrieval (IR) purposes, even though it is not designed for that purpose and not well suited for it, at least if we by the term IR mean the use of reporting systems and not systems for reading and browsing.

The first major shortcoming is that the web's address-scheme - which identifies a node - depends on location instead of node-identification. Attempts have long been made to establish another way of referring to its nodes. The W3C proposal of URN is probably the most important one, but the problem remains, as no global agreement has been reached on its use.

If a node is transferred to another location on a machine or to another machine, which is sometimes necessary for several reasons, the URL may point to an empty place. "404 File not found" is nowadays the equivalent of "out on loan" or "at the bindery". I have for some time let students create topnodes or macrotexts as part of their Library and Information education, thus including in their texts a compilation of topic-related material found in the vast repository of web-accessible documents. But the unreliability of the URLs forces me to recommend microtext creation as a better alternative.

The second shortcoming is that a global collection of collaborative hyperdocuments or just one global collaborative hyperdocument - you may look at the web from both views - just like a large sequential book or catalog, needs some thorough indexing. For this process of indexing and later retrieval, whether based upon manual or automatic processing, there must exist some metadata on the nodes. But as the web as a document is ever changing and growing, without any coordinating control, this is not possible for the whole web, unless it is done by the creator himself.

It is most important to recognize the characteristics of the web, since the three-year evolution of the graphical web has presented us with a state-of-the-art that does not much resemble that of the 1989 web. The four fundamental concepts on which the web rests - HTTP, HTML, URL and "webcompliant software" - may surely change or be superseded by other fundamentals in the near future. Even today the specifications of HTML and the possibilities of web-compliant software have changed radically since 1993. What will remain though, is the use of hypertext principles and of open networks like the Internet. Even the word web may be replaced by some other term.

What we may hope to see in the near future is the rise of a sub-web that is really thoroughly indexed and furnished with coordinated registration of node-identifiers (as the DNS works today). This would free us from the two major shortcomings for real information retrieval. This may be seen in the near future with the establishment of interim-solutions for alternative addressing like the PURL-server idea and for metadata-furnishing like the Dublin Core Metadata set.9.
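The principle behind coordinated registration of node-identifiers can be sketched as a registry that maps a persistent identifier to the current location of a node. This is a toy model under stated assumptions: the class Resolver, the identifier scheme "node:..." and the example.org addresses are hypothetical, and the sketch is not a description of how a PURL server or the DNS is implemented.

```python
class Resolver:
    """A toy registry mapping persistent node identifiers to current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        # Registering an identifier again simply updates its location.
        self._locations[identifier] = url

    def resolve(self, identifier):
        """Return the current URL, or None if the identifier is unknown."""
        return self._locations.get(identifier)


resolver = Resolver()
resolver.register("node:gunnarsson-1997", "http://old.example.org/paper.html")

# A linking document stores the identifier, not the location.
citation = "node:gunnarsson-1997"

# The node moves to another machine; only the registry has to be updated.
resolver.register("node:gunnarsson-1997", "http://new.example.org/texts/paper.html")
print(resolver.resolve(citation))
```

The citation survives the move because it never contained the location. What the sketch cannot show is the human coordination needed to keep such a registry up to date.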


Conclusions

The point of view reflected in this text has tried to distinguish between the hypertext phenomenon, the web, its document and the indices of this same document.

Presented with this state-of-the-art one may compare the preconditions for electronic publishing today with the preconditions of old times for paper publishing, when coordination, information literacy and bibliographical control were scarce. Remember the bibliographical gap of 18th century Swedish publishing.

I've tried to show that the shortcomings and disadvantages of the web really have to do not with its technology but with the absence of human coordination and the misinterpretation of what it can be used for. Do not ever trust the web as something which in itself may serve as a source of information.

On the other hand it is also wrong to talk about the web in the light of its contents. The web is nothing more than a technology which forms a distribution channel for whatever is to be distributed, and it is really how nodes are created that is the problem of the web.


Sources

Allan, James, Automatic hypertext construction. - Cornell Univ., 1995.

Amaral, Kimberly, Hypertext and writing: An overview of the hypertext medium. - Univ. of Massachusetts Dartmouth, 1995.
http://www.umassd.edu/Public/People/KAmaral/Thesis/hypertext.html
Retrieved 1996-10-31

Berners-Lee, Tim, An executive summary of the World-Wide Web initiative. - 1991-1992.
http://www.w3.org/pub/WWW/Summary.html
Retrieved 1996-11-04

Berners-Lee, Tim, The original proposal of the WWW, HTMLized. - 1989, 1990, 1996.
http://www.w3.org/pub/WWW/History/1989/proposal.html
Retrieved 1996-11-04

Conklin, Jeff, Begeman, Michael, gIBIS : a hypertext tool for exploratory policy discussion // ACM transactions on office information systems. - Vol. 6, no. 4 (1988), pp. 303-331.

Ingwersen, Peter, Cognitive perspectives of information retrieval interaction : elements of a cognitive theory // Journal of documentation. - Vol. 52, no. 1 (1996), pp. 3-50.

Kamba, Tomonari et al, The Krakatoa chronicle - an interactive, personalized, newspaper on the web. - 1995
http://www.w3.org/pub/Conferences/WWW4/Papers/93/
Retrieved 1996-10-31

Nelson, Theodor Holm, Literary machines. - 1993. 1 ed. - Mindful press, 1993.

Nigosian, Solomon, Judaism. - Crucible, 1986.

McKnight, Cliff et al, Hypertext in context. - Cambridge Univ. Press, 1991.

Monk, Andrew F., Getting to known locations in a hypertext // Hypertext : state of the art. - Oxford : Intellect, 1990. - Pp. 20-27

Philipson, Joakim, The relevance of citation. - Borås : BHS, 1996.

Rada, Roy, Hypertext : from text to expertext. - McGraw-Hill, 1991.


Footnotes

1. See for example Ingwersen, 1996, p. 7 for a Library and Information Scientist's agreement on this
2. The anchor is also important, but that is mainly in the context of computer-assisted hypertext.
3. In this project the mouse was invented (McKnight, 1991, p. 9)
4. It is of course possible to furnish a paper bound node with a lot of links (internal and external), but it will surely make it impossible to decode. Consider how papers like this one will seem awkward to a newspaper publisher, thanks to its (not so) many footnotes and citations. Consider, furthermore, how difficult it is to instruct undergraduate students to write papers with thorough citations.
5. It is worth noting that developmental trends in software and technical specifications are to a considerable extent market-driven and therefore ignore these intentions and force the integration of exactly "fancy graphics" and "extra facilities" on the web.
6. The reason for not creating documents which only would consist of links, which would be true to the Nelsonic compound hypertext, is obviously the transitory nature of the URLs.
7. There is no reason for talking about the browsers as searching-tools of the web, which is sometimes done even in serious texts. The browser is a reading and browsing tool.
8. The HTML-specification allows the tag <META NAME="" CONTENT="">, which some editors require or propose
9. See for example Beckett, Dave, Proposed Encodings for Dublin Core Metadata at URL http://www.hensa.ac.uk/pub/metadata/dc-encoding.html and Persistent URL Home Page at URL http://purl.org



About the author

Mikael Gunnarsson has been employed at the Swedish Library School of Information Science since 1992, and has been teaching subjects mostly related to networking and electronic documentation.
Beginning as an undergraduate engineer in electronics, MG moved on to studies in theatre, acting, foreign languages, and history of religion, until finally receiving his diploma in Library and Information Science. His academic interests lie in the hypertextuality of electronic documentation systems.


© Mikael Gunnarsson, 1997