Downloading an entire website on a Mac using wget

I recently had to take a copy of a client’s website before they transferred from another provider. It was running an old copy of Joomla, and getting backend access proved difficult. So we opted to grab a static copy of the site and keep that live until we had their new WordPress website ready.

There are plenty of apps out there that will download whole websites for you, but the simplest way is to use wget. If you don’t have a copy, you can install wget on a Mac without using MacPorts or Homebrew by following this guide from OS X Daily.

Once it’s installed, open Terminal and type:

wget --help

You’ll see there are a ton of options. At its simplest, you can just type:

wget example.com

That will download a copy of the index page of example.com to whichever directory you’re calling wget from in Terminal. But I wanted to get a copy of the whole website, and have it work locally, i.e. using relative URLs, rather than referring back to example.com live on the web.

So here’s the code:

wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --random-wait --domains example.com --no-parent www.example.com

Let’s step through the options used:

--recursive

Recursively download the directories, up to a maximum depth of 5 (the default).
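
If the site is nested more deeply than that, you can raise the limit with the --level option, with something like:

wget --recursive --level=10 example.com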

--no-clobber

Can also be written as “-nc”. Stops files that already exist locally being downloaded again if you have to re-run the command.

--page-requisites

Causes Wget to download all the files that are necessary to properly display a given HTML page, including things like inlined images, sounds and referenced stylesheets.

--html-extension

Saves HTML pages with a .html extension. Handy for converting PHP-based sites, such as the Joomla one I needed to copy. (Newer versions of wget call this option --adjust-extension.)

--convert-links

After the download is complete, converts the links in the documents to make them suitable for local viewing.

--restrict-file-names=windows

Escapes characters to make them safe on your local system.

--random-wait

Varies the time between requests, so we don’t act like we’re downloading the whole site…
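
The random delay is based on the --wait option, so it’s worth setting a base wait as well if you want it to have much effect, something like:

wget --wait=1 --random-wait --recursive example.com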

--domains example.com

Restricts the recursive download to the listed domain, so wget doesn’t wander off and start downloading any other sites it finds links to.

--no-parent

Do not ever ascend to the parent directory when retrieving recursively. (The final www.example.com in the command isn’t part of this option; it’s the URL wget starts downloading from.)

After all that you’re left with a folder that should be a complete copy of the domain you’ve targeted. Very handy.
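
To check it, you can open the pages straight from that folder, or serve it locally; for example, assuming the copy ended up in a www.example.com folder and you have Python 3 to hand:

cd www.example.com
python3 -m http.server 8000

Then browse to http://localhost:8000 and click around to make sure the converted links behave.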

However, typing all that out is a bit of a pain. I think a bash script taking the domain as an input would save the effort, and it could maybe even be wrapped up into an app using Appify. Hmm, one for the to-do list.
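
As a rough starting point, a minimal sketch of such a script (the wget-site.sh name and the www. prefix are just my assumptions) might look like this:

#!/bin/bash
# wget-site.sh: grab a static copy of a site using the options described above
# Usage: ./wget-site.sh example.com
DOMAIN="$1"
if [ -z "$DOMAIN" ]; then
  echo "Usage: $0 example.com" >&2
  exit 1
fi
wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --random-wait --domains "$DOMAIN" --no-parent "www.$DOMAIN"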