Wayback Machine and Cloudflare team up to archive more of the Web


Screenshot of the Internet Archive’s home page, including the Wayback Machine’s search box.

The Internet Archive and Cloudflare have teamed up to archive the content of websites that use Cloudflare’s Always Online service, increasing the odds that users will be able to view a recent version of a website during outages. The partnership will increase the number of webpages scanned by the Internet Archive, making the organization’s Wayback Machine more useful to Internet users in general.

“Websites that enable Cloudflare’s Always Online service will now have their content automatically archived, and if by chance the original host is not available to Cloudflare, then the Internet Archive will step in to make sure the pages get through to users,” said an announcement by Mark Graham, director of the Internet Archive’s Wayback Machine.

Cloudflare says its Always Online feature saves “a limited copy of your cached website to keep it online for your visitors” when the origin server is unavailable, ensuring that a website’s “most popular pages are represented.” Using the Wayback Machine will improve the Always Online service, Cloudflare CEO Matthew Prince said.

“The Internet Archive’s Wayback Machine has an impressive infrastructure that can archive the Web at scale,” Prince said.

The partnership will in turn improve the Wayback Machine’s ability to archive the Web. The nonprofit Internet Archive’s system doesn’t crawl the entire Web but has made more than 468 billion archived webpages available and is adding over 1 billion new archived URLs a day, Graham wrote. It does this “via a variety of different methods, such as ‘crawling’ from lists of millions of sites, as submitted by users via the Wayback Machine’s ‘Save Page Now’ feature, [websites] added to Wikipedia articles, referenced in Tweets, and based on a number of other ‘signals’ and sources, such [as] multiple feeds of ‘news’ stories,” Graham explained.

Cloudflare’s Always Online service is now one additional avenue for the Wayback Machine to find and archive websites. “As new URLs are added to sites that use that service they are submitted for archiving to the Wayback Machine,” Graham wrote. “In some cases this will be the first time a URL will be seen by our system and result in a ‘First Archive’ event.” In all cases, these newly archived URLs “will be available to anyone who uses the Wayback Machine.”
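As a rough illustration of what a “First Archive” event means, the sketch below checks whether a URL already has a snapshot, based on the JSON shape returned by the Wayback Machine’s public availability endpoint. The helper names and both sample responses are hypothetical, and nothing here reflects how Cloudflare or the Internet Archive actually implement the integration. The code only builds the lookup URL and parses a response; it makes no network call.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint (assumed unchanged).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url):
    """Build the lookup URL for a page (hypothetical helper)."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(response_text):
    """Return the closest snapshot URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def would_be_first_archive(response_text):
    """True when no snapshot exists yet, so a capture would be the first."""
    return closest_snapshot(response_text) is None


# Illustrative responses only, not real captures.
SAMPLE_ARCHIVED = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20200917000000/http://example.com/",
            "timestamp": "20200917000000",
            "status": "200",
        }
    }
})
SAMPLE_NEW = json.dumps({"archived_snapshots": {}})
```

With these samples, `would_be_first_archive(SAMPLE_NEW)` is true, while `SAMPLE_ARCHIVED` already points to an existing snapshot.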

Graham predicts that the partnership will let the Internet Archive do a “better job of backing up more of the public Web, and in so doing help make the Web more useful and reliable.”

Users will get static webpages

Users who reach an archived version of a website when a server is offline will see only static pages. “Visitors who interact with dynamic parts of a website, such as a shopping cart or comment box, will see an error page caused by the offline origin web server,” Cloudflare said in a new support page that describes how the integration works. When a website is unreachable, Cloudflare says it will first check “Cloudflare’s cache for a stale or expired version of your website. When none exists, Cloudflare will go to the Internet Archive to fetch and serve static portions of your website.”
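The fallback order Cloudflare describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Cloudflare’s implementation: the function `serve_page`, its parameters, and the idea of modeling each source as a callable that returns `None` when it has nothing to serve are all hypothetical.

```python
from typing import Callable, Optional, Tuple

# A fetcher returns page content, or None when it has nothing to serve.
Fetcher = Callable[[str], Optional[str]]

OFFLINE_ERROR = "origin web server is offline"


def serve_page(
    url: str,
    origin: Fetcher,
    edge_cache: Fetcher,
    archive: Fetcher,
    dynamic: bool = False,
) -> Tuple[str, str]:
    """Return (source, content) using the fallback order from the article."""
    content = origin(url)
    if content is not None:
        return "origin", content

    # Dynamic features (shopping cart, comment box) need the origin server,
    # so there is no static copy to fall back on.
    if dynamic:
        return "error", OFFLINE_ERROR

    # Static pages: try the stale or expired cached copy first, then the
    # Internet Archive copy.
    for source, fetch in (("cache", edge_cache), ("archive", archive)):
        content = fetch(url)
        if content is not None:
            return source, content

    return "error", OFFLINE_ERROR
```

For example, with an offline origin and an empty cache, `serve_page("/about", lambda u: None, lambda u: None, lambda u: "archived copy")` returns `("archive", "archived copy")`.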

The Internet Archive integration is available to Cloudflare’s free users, but their websites will be backed up only once every 30 days. Cloudflare’s paying customers will get more frequent backups: every 15 days for Pro users and every five days for Business and Enterprise users.
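To make the schedule concrete, here is a small sketch that maps each plan to the backup interval reported above and computes the date of the next backup. The dictionary and function names are hypothetical, and the assumption that backups happen exactly one interval after the previous one is a simplification for illustration.

```python
from datetime import date, timedelta

# Backup intervals in days, as reported in the article.
BACKUP_INTERVAL_DAYS = {
    "free": 30,
    "pro": 15,
    "business": 5,
    "enterprise": 5,
}


def next_backup(plan, last_backup):
    """Return the date of the next backup for a plan (hypothetical helper)."""
    interval = BACKUP_INTERVAL_DAYS[plan.lower()]
    return last_backup + timedelta(days=interval)
```

Under these assumptions, a Pro site last backed up on September 1, 2020, would next be backed up on September 16, while a free site would wait until October 1.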

Cloudflare said its users must enable Internet Archive integration with the following steps:

  1. Log in to your Cloudflare account.
  2. Choose the domain for which you want to enable Always Online with Internet Archive integration. The Cloudflare dashboard displays.
  3. Click the Caching app.
  4. In the Caching app, select the Configuration tab.
  5. To enable Always Online, scroll to the Always Online Beta card and toggle it to On.
  6. To enable Internet Archive integration, click Update.
