Nearly 40% of Webpages from 2013 Vanished, Highlighting Digital Decay

A recent observation by Wharton professor Ethan Mollick has drawn attention to the pervasive problem of "link rot" and the ephemeral nature of online content, suggesting that Large Language Models (LLMs) may end up serving as the primary "memory" of the internet. Mollick, a prominent voice in artificial intelligence, pointed to the steady decay of web links, particularly in historical news archives and on social media.

The phenomenon of digital decay is widespread, and studies confirm a substantial loss of online information. A Pew Research Center study found that roughly 38% of webpages that existed in 2013 were no longer accessible as of October 2023. The problem extends across online sources, including news articles, government websites, and academic references.

The problem is particularly acute in news archives: a Harvard Law School study found that a quarter of the deep links in New York Times articles were rotten, with decay rates climbing sharply for older content; 72% of links from 1998 articles were inaccessible. Content disappears because pages are removed, URLs change, or entire websites vanish. Social media posts, as Mollick noted, are even more ephemeral, succumbing to platform changes and account deletions.
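
For readers who want to gauge link rot in their own bookmarks or citation lists, the basic check is straightforward: request each URL and treat network failures and HTTP error statuses as dead links. The sketch below, written in Python with only the standard library, illustrates the idea. The URLs and the `is_rotten` helper are illustrative, not part of the cited studies' methodologies, which also account for subtler failures such as "soft 404" pages that return a 200 status but no longer carry the original content.

```python
import urllib.error
import urllib.request

def is_rotten(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL appears dead: unreachable or an HTTP error status."""
    req = urllib.request.Request(
        url,
        method="HEAD",  # headers only; some servers reject HEAD, so a more
                        # thorough checker would retry with a full GET
        headers={"User-Agent": "link-rot-check/0.1"},
    )
    try:
        urllib.request.urlopen(req, timeout=timeout)  # follows redirects
        return False
    except urllib.error.HTTPError:
        return True  # 4xx/5xx response: the page is gone or blocked
    except (urllib.error.URLError, TimeoutError):
        return True  # DNS failure, refused connection, vanished host, timeout

# Placeholder URLs for illustration only.
for url in ["https://example.com/", "https://example.com/no-such-page"]:
    print(url, "->", "rotten" if is_rotten(url) else "alive")
```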

"We let the web rot away well before LLMs... Over 60% of older links are now broken. And consider that social media posts are even more ephemeral. Likely only LLMs will “remember” that content," Mollick stated in his tweet.

This decay poses significant challenges for historical research, journalism, and legal documentation, as crucial context and source material become irretrievable. Preservation efforts such as the Internet Archive's Wayback Machine, Perma.cc, and Arweave capture what they can, but the sheer volume and dynamic nature of the web make comprehensive archiving a continuous challenge. In Mollick's view, the vast datasets used to train LLMs may inadvertently end up as the more enduring record of the internet's past.
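
When a link has already rotted, recovery usually starts with the Wayback Machine, which exposes a public availability endpoint (documented at https://archive.org/help/wayback_api.php) reporting the closest archived snapshot of a given URL. Below is a minimal sketch in Python; the example URL is a placeholder, and a dead link is assumed to be already in hand.

```python
import json
import urllib.parse
import urllib.request

def closest_snapshot(url: str) -> str | None:
    """Return the Wayback Machine URL of the closest snapshot of `url`, if any."""
    api = ("https://archive.org/wayback/available?url="
           + urllib.parse.quote(url, safe=""))
    with urllib.request.urlopen(api, timeout=10) as resp:
        data = json.load(resp)
    # The API returns {"archived_snapshots": {"closest": {...}}} when a
    # snapshot exists, and an empty "archived_snapshots" object otherwise.
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# Placeholder URL for illustration only.
print(closest_snapshot("https://example.com/") or "no snapshot found")
```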