Memory failure detected

A coalition of the willing is battling legal, logistical and technical obstacles to archive the riches of the mercurial World Wide Web for the benefit of future scholars. Zoë Corbyn reports

September 1, 2011

It is 2031 and a researcher wants to study what London’s bloggers were saying about the riots taking place in their city in 2011. Many of the relevant websites have long since disappeared, so she turns to the archives to find out what has been preserved. But she comes up against a brick wall: much of the material was never stored or has been only partially archived. It will be impossible to get the full picture.

This scenario highlights an important issue for future research - and one that has received scant attention. How can the massive number of websites on the internet - which exist for just 100 days on average before being changed or deleted - be safeguarded for future scholars to explore?

The extent to which content disappears without trace from the web is worrying, says Kath Woodward, head of the department of sociology at The Open University and a participant in the British Library’s Researchers and the UK Web Archive project, which aims to involve researchers in building special collections.

Not enough academics, she believes, are engaging with the topic. “We are taking it for granted that such material will be there, but we need to be attentive. We have a responsibility to future generations of researchers.”

Eric Meyer, a research Fellow at the University of Oxford’s Oxford Internet Institute, studies web archiving. He says that “because the internet is so integral to so much that goes on in the world today, we have to be serious about keeping track of it”.

And the past can erode very quickly, observes Mia Consalvo, associate professor in communication studies at Concordia University in Montreal, Canada, and president of the Association of Internet Researchers. “These issues are long term and worthy of investment,” she says.

Of the web archives in existence, the not-for-profit Internet Archive’s Wayback Machine is the oldest and most comprehensive. Established by Californian internet pioneer Brewster Kahle in 1996, five years after the World Wide Web began, it “crawls” the entire web, taking regular snapshots of all websites that are not hidden behind passwords or paywalls.

The archive, which now contains more than 150 billion pages from more than 100 million sites, is free to access. Anyone who visits the site can retrieve material by typing in a web address of interest. The aim is to copy the entire World Wide Web every two months, Kahle says. From shopping to porn sites, the undertaking is meant to capture the “whole breadth” of who we are.

The Internet Archive does not seek permission from website owners before it archives their sites, although material can be removed if an owner requests it. But the archive has a limitation that future researchers may well lament: the sheer size of the web means that its regular crawls are shallow. Although many websites are captured, the Wayback Machine may record only their home page. It is a record with breadth but not depth.

“We do what we can, but we are not doing enough,” Kahle says. He had hoped that other organisations would see the “obvious need” for his project and come to its aid, but this has not really happened, he says. “Other organisations are doing (web archiving), but basically for their own purposes.”

Since the early 2000s, many national libraries have been attempting to preserve the web. Their focus is on archiving websites that fall within their national domains (in the UK, for example, those with .uk addresses, and .fr in France).

Libraries in different countries take different approaches depending on the legislative framework. Some, such as the national libraries of France, Denmark and Norway, harvest their entire national domain. Such efforts, like that of the Wayback Machine, achieve only shallow capture. Like the Internet Archive, they do not ask permission, which is an approach that is possible where “legal deposit” legislation for online publications has been enacted. This legislation is equivalent to the long-established statutory obligation for publishers to deposit copies of printed material in national libraries. It allows libraries to crawl, collect and republish the freely available websites in their country’s domain automatically without breaching copyright law.

But other countries, such as the UK and the US, rely on smaller-scale selective archiving. Websites are collected around topics, themes or events chosen by library curators, with sites harvested only when the copyright holder’s permission has been obtained. The approach lacks breadth, but as the operation is smaller, individual websites can be captured more comprehensively.

The British Library began permission-based selective web archiving in 2004, four years after the US Library of Congress initiated its own programme. Today, the British Library’s UK Web Archive contains material from more than 10,000 websites. But that is still only a tiny fraction of the estimated 4.5 million sites that are either part of the freely accessible content in the UK’s web domain or relevant to it.

A disappointingly poor response rate for permissions also means that the resulting collection has holes.

“It is like Swiss cheese,” acknowledges Helen Hockx-Yu, head of web archiving at the British Library. “We get only about 30 per cent of the people we ask giving us permission. Most we just don’t hear from; and without the resources to chase them, we end up with a patchy collection.”

It is not that the UK lacks appropriate legal deposit legislation - the Legal Deposit Libraries Act was extended in 2003 to cover online publications. Rather it is that the regulations necessary to put the legislation into effect have not been forthcoming, eight years down the line. While the reasons for the delay are multifaceted, commercial publishers are among those to have raised concerns about the legislation, fearing that web archives could undermine their business models.

Both the British Library and the Library of Congress steer clear of trying to collect material, such as the content of news websites, that could impinge on commercial publishers’ business models. Thus there is no archived copy of the now-defunct News of the World website, even though researchers might one day wish to study online comments by readers of the tabloid newspaper.

Indeed, most news content that is published only online is simply falling through the cracks. In the US, a national working group has been set up to look at content deemed to be highly “at risk”, including news content, notes Abigail Grotke, leader of the Library of Congress’ web archiving team.

Hockx-Yu says the British Library is doing all it can, but she argues that it is “thoroughly about time” that the measures needed to implement rules for the legal deposit of web publications are put into place.

“There are websites that we haven’t been able to collect that have disappeared,” she says.

William Kilbride, executive director of the Digital Preservation Coalition, a membership organisation for UK bodies with an interest in digital preservation, agrees: “It really is a matter of urgency to have the regulations finalised.”

But even if the legal deposit regulations come into effect, they are unlikely to satisfy UK researchers. To take the concerns of copyright holders into account, the rules are likely to contain a requirement - thus far common to all countries that require the legal deposit of online publications - that access to the websites archived under the legislation will be restricted.

To view the material, researchers might have to go in person to one of the UK’s legal deposit libraries, in the same way that they often do to examine print publications.

Researchers describe this potential stipulation as nonsensical. “The whole point about the internet is that you can access it from wherever you are,” Woodward says.

While libraries are currently digitising 19th-century documents and making them available via the web, it is “deeply ironic” that websites from two years ago are being made less accessible, Meyer notes.

It is not only a question of which sites are kept and how they are accessed. Preserving the material can be a major technical challenge, too. New web formats - for example rich interactive pages built using Flash or JavaScript - and new technologies for displaying video and audio content are evolving all the time. This means that it is a constant battle to make sure the websites can be crawled, copied and then displayed in the archives in such a way that they look just as they did to their first online viewers.

For example, Hockx-Yu says, it was assumed that the British Library would be able to preserve 2,400 hours of video footage from UK artist Antony Gormley’s Fourth Plinth commission, in which members of the public were given a platform in Trafalgar Square. However, she says, “the content was streamed over a different protocol that our crawler didn’t understand”. Fortunately, the British Library’s web archiving team cracked the challenge in the end.

Similarly, although the Library of Congress has recently been given Twitter’s archive (see box below), earlier attempts to preserve segments of Twitter have met with difficulty, says Grotke.

Thomas Risse, senior researcher at the L3S Research Centre, a web science research centre at Leibniz University in Hanover, Germany, knows the problems only too well. He was the lead researcher on the European Living Web Archives Project, which was set up to improve crawling technologies and ran from 2008 until 2011. “We have made big steps, but constant development is necessary,” he explains.

In August, the Oxford Internet Institute’s Meyer published a report on researcher engagement with web archives for the International Internet Preservation Consortium (IIPC), an international body that brings together national libraries and other organisations involved in web archiving.

That report, Web Archives: The Future(s), sets out a number of possible scenarios. “Nirvana” would be a future in which usable and useful web archives form part of researchers’ standard toolkits. At the other end of the scale is “apocalypse”, in which web archiving technology has been so far outpaced by new formats that the archive is as unreadable as 1960s-era computer punch cards.

But the web archiving community’s current practices, the report continues, are producing something that is in danger of ending up as a “dusty archive”. In this scenario, archiving technology keeps pace with the latest developments and archives are well curated and maintained, but they sit largely unused, gathering “digital dust”.

Meyer asks a probing set of questions of today’s efforts. Who is going to want to travel to multiple fragmented archives to find material? It would make far more sense if it could all be accessed from one point. Who is going to want to study only specifically selected sites? History suggests that often it is the material that does not make it into official collections that is the most fascinating. Who is going to want to study only sites from one country’s domain? The web, after all, is global and interconnected. And in a world where we increasingly work remotely, who will be content with on-site, restricted-access archives?

Furthermore, Meyer points out, future researchers will want archived material that can be searched and analysed in the same way as current web content. The material stored should allow researchers to examine and elucidate patterns and trends.

The heart of the problem, Meyer believes, is that the web archiving community is stuck in a “preservation mindset”.

“As is too often the case with those who build resources, they are preserving websites without giving any real thought to how they might be used in the future,” he argues.

But according to Sean Martin, current chair of the IIPC and head of architecture and development at the British Library, things are changing. Libraries, he believes, are increasingly thinking about future use and the kinds of services that can be built on top of their archives to assist researchers.

“The objective in the early days had to simply be to collect the material, because if it wasn’t collected, there would be no possibility of future research. But we do now see an evolution.”

He points to one encouraging recent example. Memento, a tool developed by the Los Alamos National Laboratory Research Library, pulls together web pages from different archives accessible over the web to show how a particular site has changed over time.
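For readers curious about the mechanics, the idea behind Memento can be sketched in a few lines of Python. A client states the date it is interested in by sending an Accept-Datetime request header, and the service answers with the archived copy closest to that date. The function names and the sample snapshot dates below are illustrative assumptions, not part of any archive's own software, and the sketch makes no network requests.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when):
    """Build the Accept-Datetime request header used for Memento-style date negotiation.

    `when` is the moment the reader wants to see; naive datetimes are
    assumed (for this sketch) to be in UTC.
    """
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # The header value is an HTTP date expressed in GMT.
    return {"Accept-Datetime": format_datetime(when.astimezone(timezone.utc), usegmt=True)}


def closest_snapshot(snapshots, when):
    """Pick the archived capture nearest in time to the requested moment.

    `snapshots` is a hypothetical list of capture datetimes gathered from
    one or more archives.
    """
    return min(snapshots, key=lambda captured: abs(captured - when))


# Hypothetical capture dates for a single page, pooled from several archives.
captures = [
    datetime(2011, 6, 1, tzinfo=timezone.utc),
    datetime(2011, 8, 9, tzinfo=timezone.utc),
    datetime(2011, 12, 25, tzinfo=timezone.utc),
]
wanted = datetime(2011, 8, 15, tzinfo=timezone.utc)
best = closest_snapshot(captures, wanted)
headers = memento_headers(wanted)
```

Pooling the capture dates before choosing is what lets a single request draw on several archives at once, rather than requiring the reader to search each one in turn.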

In other promising developments, the British Library has added new functions to make it possible to produce “word clouds” and N-grams (graphs showing how frequently specific words or phrases are used over time) from the data in the UK Web Archive. Meanwhile, the UK Government Web Archive, which archives UK central government websites under Crown copyright, has introduced web continuity software that automatically redirects visitors arriving at old government websites to the relevant page in the archive.
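To give a sense of what an N-gram analysis involves, the following Python sketch counts how often a phrase appears in archived text from each year, as a share of all phrases of the same length. It is an illustration under simple assumptions (text split on whitespace, punctuation left in place, invented sample snapshots) and not a description of the British Library's own tools.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in the text, lower-cased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_frequency_by_year(snapshots, phrase):
    """Relative frequency of `phrase` per year.

    `snapshots` is an iterable of (year, text) pairs, a hypothetical
    stand-in for pages drawn from a web archive.
    """
    n = len(phrase.split())
    target = phrase.lower()
    hits = Counter()    # occurrences of the phrase, per year
    totals = Counter()  # all n-word runs seen, per year
    for year, text in snapshots:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Invented sample data for illustration only.
sample = [
    (2010, "the web archive grows"),
    (2011, "web archive web archive now"),
]
trend = phrase_frequency_by_year(sample, "web archive")
```

Plotting the resulting values against the year gives the kind of graph described above. Dividing by the total number of phrases keeps years with more archived text from dominating the trend.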

Last year, a European project, the Longitudinal Analytics of Web Archive Data, began to look at how large-scale data analysis could be applied to archives.

Organisations are also trying to work together to cover broader territory, says Martha Anderson, director of program management for the US National Digital Information Infrastructure and Preservation Program, which is run from the Library of Congress.

For Meyer, however, all this is only a start on the job at hand. “Maybe I have unrealistic expectations, but we are behind where I would like us to be,” he says.

But his hope is that in 20 years, decisions made today on how to preserve the content of the World Wide Web will be, in the words of his report, “lauded by the researchers of the future who have come to rely on the information and evidence of human endeavour embodied in the internet”.

Private lives, public benefits

Social media spaces such as Twitter and Facebook are bound to be of interest to researchers of the future, but their content changes by the second.

So how are they being preserved?

In April last year, Twitter donated its archive to the US Library of Congress. Every public tweet made since the inception of the website will be archived digitally, although the rules on how researchers will be able to access the material are still being drawn up.

But the case of Facebook is rather different because much of its content is password protected. However, researchers hope that in the future its archive will also be donated to a library. Privacy concerns could be addressed by means of a proviso that material would be made available only many years hence, suggests Eric Meyer of the Oxford Internet Institute.

Key online archives of web content

Internet Archive Wayback Machine

British Library’s UK Web Archive

UK Government Web Archive

Library of Congress Web Archives