Creating a Safari webarchive from the command line

Recently I’ve been trying to create a local archive of my bookmarked web pages. I already have tools to take screenshots, and I love them as a way to take quick snapshots and skim the history of a site, but bitmap images aren’t a great archival representation of a website. What if I also want to save the HTML, CSS, and JavaScript, and keep an interactive copy of the page?

There are lots of tools in this space; for my personal stuff I’ve come to like Safari webarchives. There are several reasons I find them appealing:

The one thing that’s missing is a way to create webarchive files programmatically. Although I could open each page and save it in Safari individually, I have about 6000 bookmarks – I’d like a way to automate this process.

I was able to write a short script in Swift that does this for me. In the rest of this article I’ll explain how it works, or you can skip to the GitHub repo.

Prior art: newzealandpaul/webarchiver

I found an existing tool for creating Safari webarchives on the command line, written by newzealandpaul.

I did some brief testing and it seems to work okay, but I had a few issues. The error messages aren’t very helpful – some of my bookmarks failed to save with an error like “invalid URL”, even though the URL opens just fine. I went to read the code to work out what was happening, but it’s written in Objective‑C and uses deprecated classes like WebView and WebArchive.

Given that it’s only about 350 lines, I wanted to see if I could rewrite it using Swift and the newest classes. I thought that might be easier than trying to understand a language and classes that I’m not super familiar with.

Playing with WKWebView and createWebArchiveData

It didn’t take much googling to learn that WebView has been replaced by WKWebView, and that class has a method createWebArchiveData which “creates a web archive of the web view’s current contents asynchronously”. Perfect!

I watched a WWDC session by Brady Eidson, a WebKit engineer, where the createWebArchiveData API was introduced. It gave me some useful context about the purpose of WKWebView – it’s for showing web content inside Mac and iOS apps. If you’ve ever used an in-app browser, there was probably an instance of WKWebView somewhere underneath.

The session included some sample code for using this API, which I fashioned into an initial script:

import WebKit

let url = URL(string: "https://example.com/")
let savePath = URL(fileURLWithPath: "example.webarchive")

let webView = WKWebView()
let request = URLRequest(url: url!)

webView.load(request)

// https://developer.apple.com/videos/play/wwdc2020/10188/?time=1327
webView.createWebArchiveData(completionHandler: { result in
  do {
    let data = try result.get()
    try data.write(to: savePath)
  } catch {
    print("Unable to save webarchive file: \(error.localizedDescription)")
  }
})

I saved this code as create_webarchive.swift, and ran it on the command line:

$ swift create_webarchive.swift

I was hoping that this would load https://example.com/, and save a webarchive of the page to example.webarchive. The script did run, but it only created an empty file.

I did a little debugging, and I realised that my WKWebView was never actually loading the web page. I pointed it at a local web server, and I could see it wasn’t fetching any data. Hmm.

We need a loop-de-loop

A standalone Swift script isn’t how WKWebView is normally used. Most of the time, it appears as part of a web browser inside a Mac or iOS app. In that context, you don’t want fetching web pages to be a blocking operation – you want the rest of the app to remain responsive and usable, and download the web page as a background operation.

This made me wonder if my problem was that my script doesn’t have “background operations”. When I ask WKWebView to load my page, it’s getting shoved in a queue of background tasks, but nothing is picking up work from that queue. I don’t fully understand what I did next, but I think I’ve got the gist of the problem.

I had another look at newzealandpaul’s code, and I found some lines that look a bit like they’re solving the same problem. I think the NSRunLoop is doing work that’s on that background queue, and it’s waiting until the page has finished loading:

// Wait until the site has finished loading.
NSRunLoop *currentRunLoop = [NSRunLoop currentRunLoop];
NSTimeInterval resolution = _localResourceLoadingOnly ? 0.1 : 0.01;
BOOL isRunning = YES;

while (isRunning && _finishedLoading == NO) {
  NSDate *next = [NSDate dateWithTimeIntervalSinceNow:resolution];
  isRunning = [currentRunLoop runMode:NSDefaultRunLoopMode beforeDate:next];
}

I was able to adapt this idea for my Swift script, using RunLoop.main.run(). I can track the progress of WKWebView with the isLoading attribute, so I kept running the main loop for short intervals until I could see this attribute change. I realised that createWebArchiveData is also an asynchronous operation that runs in the background, so I need to wait for that to finish too.
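The polling pattern can be tried without WebKit at all. Here’s a minimal sketch, assuming that work queued with DispatchQueue.main.async is a fair stand-in for the callbacks WKWebView schedules (that’s an assumption for illustration, not a claim about WebKit’s internals). The two-second deadline is an arbitrary safety limit so the loop can’t spin forever:

```swift
import Foundation

var didRun = false

// Queue some work on the main queue. This is only a stand-in for the
// background work a web view schedules; it is not how WebKit is implemented.
DispatchQueue.main.async {
  didRun = true
}

// Nothing has serviced the main queue yet, so the work hasn't happened.
let ranBeforeLoop = didRun

// Run the main run loop in short bursts until the work has been done,
// or until an (arbitrary) two-second deadline passes.
let deadline = Date(timeIntervalSinceNow: 2)
while !didRun && Date() < deadline {
  RunLoop.main.run(until: Date(timeIntervalSinceNow: 0.01))
}

print("ran before the loop: \(ranBeforeLoop), ran after the loop: \(didRun)")
```

The queued closure only gets a chance to run once something services the main run loop, which matches the behaviour I saw with the web view never fetching the page.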

I wrapped these two waits in methods on a WKWebView extension: one that loads a URL, and one that saves the archive. Here’s my updated script:

import WebKit

let urlString = "https://www.example.com"
let savePath = URL(fileURLWithPath: "example.webarchive")

extension WKWebView {

  /// Load the given URL in the web view.
  ///
  /// This method will block until the URL has finished loading.
  func load(_ urlString: String) {
    if let url = URL(string: urlString) {
      let request = URLRequest(url: url)
      self.load(request)

      while (self.isLoading) {
        RunLoop.main.run(until: Date(timeIntervalSinceNow: 0.1))
      }
    } else {
      fputs("Unable to use \(urlString) as a URL\n", stderr)
      exit(1)
    }
  }

  /// Save a copy of the web view's contents as a webarchive file.
  ///
  /// This method will block until the webarchive has been saved,
  /// or the save has failed for some reason.
  func saveAsWebArchive(savePath: URL) {
    var isSaving = true

    self.createWebArchiveData(completionHandler: { result in
      do {
        let data = try result.get()
        try data.write(to: savePath)
        isSaving = false
      } catch {
        fputs("Unable to save webarchive file: \(error.localizedDescription)\n", stderr)
        exit(1)
      }
    })

    while (isSaving) {
      RunLoop.main.run(until: Date(timeIntervalSinceNow: 0.1))
    }
  }
}

let webView = WKWebView()

webView.load(urlString)
webView.saveAsWebArchive(savePath: savePath)

This works, but there’s a fairly glaring hole – it will archive whatever got loaded into the web view, even if the page wasn’t loaded successfully. Let’s fix that next.

Checking the page loaded successfully with WKNavigationDelegate

If there’s some error getting the page – say, my Internet connection is down or the remote server doesn’t respond – the WKWebView will still complete loading and set isLoading = false. My code will then proceed to archive the error page, which is unhelpful. I’d rather the script threw an error, and prompted me to investigate.

While I was reading more about WKWebView, I came across the WKNavigationDelegate protocol. If you implement this protocol, you can track the progress of a page load, and get detailed events like “the page has started to load” and “the page failed to load with an error”.

There are two methods you can implement, which will be called if an error occurs at different points during the page load. Because I’m working in a standalone script, I just have them print an error and then terminate the process – I don’t need more sophisticated error handling than that.

I also wrote a method that checks the HTTP status code of the response, and terminates the script if it’s not an HTTP 200 OK. This means that 404 pages and server errors won’t be automatically archived – I can do that manually in Safari if I think they’re really important.

Here’s the delegate I wrote:

/// Print an error message and terminate the process if there are
/// any errors while loading a page.
class ExitOnFailureDelegate: NSObject, WKNavigationDelegate {
  var urlString: String

  init(_ urlString: String) {
    self.urlString = urlString
  }

  func webView(
    _: WKWebView,
    didFail: WKNavigation!,
    withError error: Error
  ) {
    fputs("Failed to load \(self.urlString): \(error.localizedDescription)\n", stderr)
    exit(1)
  }

  func webView(
    _: WKWebView,
    didFailProvisionalNavigation: WKNavigation!,
    withError error: Error
  ) {
    fputs("Failed to load \(self.urlString): \(error.localizedDescription)\n", stderr)
    exit(1)
  }

  func webView(
    _: WKWebView,
    decidePolicyFor navigationResponse: WKNavigationResponse,
    decisionHandler: (WKNavigationResponsePolicy) -> Void
  ) {
    if let httpUrlResponse = (navigationResponse.response as? HTTPURLResponse) {
      if httpUrlResponse.statusCode != 200 {
        fputs("Failed to load \(self.urlString): got status code \(httpUrlResponse.statusCode)\n", stderr)
        exit(1)
      }
    }

    decisionHandler(.allow)
  }
}

let webView = WKWebView()

// navigationDelegate is a weak reference, so keep the delegate in a variable.
let delegate = ExitOnFailureDelegate(urlString)
webView.navigationDelegate = delegate

To check this error handling worked correctly, I tried loading a website while I was offline, loading a URL with a domain name that doesn’t have DNS, and loading a page that 404s on my own website. All three failed as I wanted:

$ swift create_webarchive.swift
Failed to load web page: The Internet connection appears to be offline.

$ swift create_webarchive.swift
Failed to load web page: A server with the specified hostname could not be found.

$ swift create_webarchive.swift
Failed to load web page: got status code 404

Adding some command-line arguments

Right now the URL string and save location are both hard-coded; I wanted to make them command-line arguments. I can do this by inspecting CommandLine.arguments:

guard CommandLine.arguments.count == 3 else {
  print("Usage: \(CommandLine.arguments[0]) <URL> <OUTPUT_PATH>")
  exit(1)
}

let urlString = CommandLine.arguments[1]
let savePath = URL(fileURLWithPath: CommandLine.arguments[2])

And then I can call the script with my two arguments:

$ swift create_webarchive.swift "https://www.example.com/" "example.webarchive"

For more complex command-line interfaces, Apple has an open-source ArgumentParser library, but I’m not sure how I’d use that in a standalone script.

Running it over my bookmarks

Once I’d written the initial version of this script and put all the pieces together, I used it to create webarchives for 6000 or so bookmarks in my Pinboard account. It worked pretty well, and captured 85% of my bookmarks – the remaining 15% are broken due to link rot. I did a spot check of a few dozen archives that did get saved, and they all look good.

My script worked correctly in the happy path, but I went back and improved some of the error messages. I saw a lot of different failures when archiving such a wide variety of URLs, including esoteric HTTP status codes, expired TLS certificates, and a couple of redirect loops. Now those errors are reported in a bit more detail and not just “something went wrong”.
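As a rough illustration of what more detailed reporting could look like, here’s a sketch that maps a few URLError codes to specific messages. The describeLoadFailure helper and its message wording are hypothetical, and it assumes the load errors arrive as URLError values; the code in the repo may well do this differently:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // only needed on non-Apple platforms
#endif

/// Hypothetical helper: turn a page-load error into a more specific message.
/// Anything that isn't a recognised URLError falls back to the generic text.
func describeLoadFailure(_ error: Error) -> String {
  guard let urlError = error as? URLError else {
    return error.localizedDescription
  }

  switch urlError.code {
  case .notConnectedToInternet:
    return "no Internet connection"
  case .cannotFindHost:
    return "could not resolve the hostname"
  case .httpTooManyRedirects:
    return "too many redirects (possible redirect loop)"
  case .serverCertificateHasBadDate:
    return "TLS certificate is expired or not yet valid"
  default:
    // Include the numeric code so unusual failures are easier to look up.
    return "\(urlError.localizedDescription) (URLError code \(urlError.code.rawValue))"
  }
}

print(describeLoadFailure(URLError(.httpTooManyRedirects)))
```

A helper like this could be called from the two error methods in the delegate, in place of error.localizedDescription.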

I also tweaked the code so it won’t replace an existing webarchive file. I do this by adding .withoutOverwriting to my write() call. I don’t want to risk overwriting a known-good archive of a page with a copy that’s now broken.
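Here’s a minimal sketch of that behaviour using only Foundation. The temporary file name and the placeholder contents are made up for the example; the point is that the second write throws rather than replacing the first file:

```swift
import Foundation

// A throwaway path in the temporary directory (name is illustrative only).
let path = FileManager.default.temporaryDirectory
  .appendingPathComponent("example-\(UUID().uuidString).webarchive")

let firstCopy = Data("known-good archive".utf8)
let secondCopy = Data("broken archive".utf8)

// The first write succeeds, because nothing exists at the path yet.
try firstCopy.write(to: path, options: .withoutOverwriting)

// The second write throws, because the file is already there.
var secondWriteFailed = false
do {
  try secondCopy.write(to: path, options: .withoutOverwriting)
} catch {
  secondWriteFailed = true
  fputs("Refusing to overwrite \(path.path): \(error.localizedDescription)\n", stderr)
}

// The original contents are still on disk; clean up afterwards.
let contentsOnDisk = try Data(contentsOf: path)
try FileManager.default.removeItem(at: path)
```

The failed write surfaces as an ordinary thrown error, so it can be handled by the same catch block that reports other save failures.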

The finished script

I’ve put the script in a new GitHub repository: alexwlchan/safari-webarchiver. This repo will be the canonical home for this code, and I’ll post any updates there.

It includes the final copy of the code in this post, a small collection of tests, and some instructions on how to download and use the finished script.