A Practical Guide to Archiving Web Pages Offline with Monolith


Learn how to use the `monolith` CLI tool to convert any web page—complete with images, CSS, and scripts—into a single, self-contained HTML file. This guide covers installation, essential commands, dynamic page handling, batch processing, and troubleshooting for reliable offline archiving.


Ditch "Save As": A Practical Guide to Archiving Complete Web Pages with Monolith

Last week, a colleague sent me a dozen links for a technical reference doc, asking me to "archive them when I have time." My usual workflow? Open browser → Right-click "Save As" → End up with an HTML file plus a messy folder of assets → Zip it up → Send it → Hope it opens correctly on their end (spoiler: relative paths often break).

It's too clunky. Today, I want to introduce monolith, a lightweight CLI tool written in Rust that bundles an entire web page (images, CSS, and JavaScript) into a single .html file. Open it offline, and a server-rendered page looks just as it does live. With 15k+ GitHub stars, zero runtime dependencies, and cross-platform support, it's a game-changer for developers, ops engineers, and content creators.

By the end of this guide, you'll be able to install monolith, archive a full-featured web page into a single HTML file, filter resources, process URLs in batches, and handle dynamic SPAs.


Prerequisites

  • OS: macOS, Linux, or Windows
  • Experience: No programming background required; basic terminal knowledge is enough
  • Optional: If compiling from source, ensure you have cargo (Rust package manager) installed

Quick Installation

The developers have made installation seamless across all major platforms. Pick the method that suits your environment:

macOS / Linux (Homebrew)

bash
brew install monolith

Windows (Winget)

bash
winget install --id=Y2Z.Monolith -e

Global Install (Requires Rust)

bash
cargo install monolith

Verify the installation:

bash
monolith -V

If you see the version number, you're ready to go.
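
If other scripts will depend on monolith, the same check can be automated. This is a minimal sketch; `require_tool` is a hypothetical helper name used here for illustration, not something monolith ships:

```bash
# require_tool NAME: succeed if NAME is on PATH, otherwise print a hint and fail.
# (Illustrative helper; the name is an assumption, not part of monolith.)
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "error: '$1' not found on PATH; install it first" >&2
  return 1
}

# Example: print the version only when the binary is actually available.
# require_tool monolith && monolith -V
```

Putting this at the top of a script means it fails early with a readable message instead of partway through a batch.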

Why recommend package managers? Because monolith is a static binary. brew or winget fetches the precompiled version and adds it to your $PATH. No need to wrestle with build dependencies like libssl. Switch to cargo install or download binaries from the Releases page only if you're on an isolated network or need the latest edge features.


Core Usage: One-Liner Web Archiving

The most basic usage requires just a URL and an output path:

bash
monolith https://lyrics.github.io/db/P/Portishead/Dummy/Roads/ -o portishead.html

What happens under the hood?

  1. monolith fetches the target HTML via HTTP.
  2. It parses the document, identifying all <link>, <img>, <script>, and <style> references.
  3. It downloads each asset and converts it into a data URL (e.g., data:image/png;base64,...).
  4. The inlined assets replace the original references, and the final output is saved to portishead.html.
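
To make step 3 concrete, here is a rough sketch of what turning a file into a data URL looks like. It only illustrates the idea and is not monolith's actual code; `to_data_url` is a hypothetical helper, and the MIME type is passed in by hand rather than detected:

```bash
# to_data_url MIME FILE: print FILE as a data URL with the given MIME type.
# (Illustrative helper only; it mirrors the idea in step 3, not monolith's code.)
to_data_url() {
  local mime="$1" file="$2"
  # base64 may wrap long output, so strip newlines to keep the URL on one line.
  printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$file" | tr -d '\n')"
}

# Example: to_data_url image/png logo.png
```

The output can be pasted straight into an `src` or `href` attribute, which is essentially what the inlined references in the saved file look like.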

Double-click the file, open it offline, and it renders identically to the live version. This is the fundamental difference between monolith and your browser's "Save As": it doesn't rely on external asset folders. Everything lives in one file.

Quick Reference for Common Flags

| Flag | Purpose | Use Case |
| --- | --- | --- |
| -o FILE | Write output to FILE | Almost always; without it the result goes to stdout |
| -i | Remove images | Reduce file size when only text is needed |
| -c | Remove CSS | Pure data extraction |
| -j | Remove JavaScript | Strip tracking scripts/ads |
| -I | Isolate the saved document from the Internet | Recommended for archives: the saved page cannot load remote resources |
| -t 30 | Set 30s timeout | Unstable networks |
| -k | Skip TLS verification | Internal testing with self-signed certs |

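You don't need to memorize all of these. Just remember -I. monolith already inlines CSS, fonts, and images as data URLs by default; -I additionally isolates the saved document from the Internet, so anything that could not be inlined is never fetched from the network when you open the file.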


🛠 Practical Example: Archiving Tech Docs & Stripping Ads

Let's say you want to save the Hacker News homepage for offline reading, but you want to skip Google Analytics tracking scripts and ad networks. Here's how:

Step 1: Filter domains with a blocklist

bash
monolith -I \
  -B -d .google-analytics.com \
  -d .google.com \
  https://news.ycombinator.com/ \
  -o hn-offline.html

💡 Note the -B syntax: -B is a single switch that turns the -d list into a blocklist, so pass it once and repeat -d for each domain. monolith will skip fetching any resources from those domains. If you only want to allow specific domains, drop -B and -d acts as an allowlist instead.
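
If you block the same domains often, the flag juggling can be wrapped in a small function. A minimal sketch, assuming only the -I, -B, -d, and -o flags discussed here; `archive_blocking` is a hypothetical name:

```bash
# archive_blocking OUT URL DOMAIN...: archive URL to OUT, skipping assets
# hosted on any of the listed domains.
# (Hypothetical wrapper; only the monolith flags come from this guide.)
archive_blocking() {
  local out="$1" url="$2" domain
  shift 2
  # Rewrite the argument list in place: each DOMAIN becomes "-d DOMAIN".
  for domain in "$@"; do
    shift
    set -- "$@" -d "$domain"
  done
  # -B is passed once and flips the whole -d list into a blocklist.
  monolith -I -B "$@" "$url" -o "$out"
}

# Example:
# archive_blocking hn-offline.html https://news.ycombinator.com/ .google-analytics.com .google.com
```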

Step 2: Verify the output

bash
ls -lh hn-offline.html

Open the file and confirm that nothing from the blocked domains was embedded, while Hacker News' content and styling remain intact. The archive is a single self-contained HTML file, without the weight of the blocked third-party resources.
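
For a rough automated check, you can count how many lines of the saved file still mention a given domain. This is only a sketch, and `count_refs` is a hypothetical helper. Keep in mind that ordinary links in the page text also count, so a non-zero result is a prompt to look closer, not proof that something was embedded:

```bash
# count_refs FILE DOMAIN: print how many lines of FILE still mention DOMAIN.
# (Illustrative helper; a plain text search, not an HTML-aware check.)
count_refs() {
  # -F: match DOMAIN literally (dots are not wildcards); -c: count matching lines.
  grep -F -c -- "$2" "$1"
}

# Example: count_refs hn-offline.html google-analytics.com
```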


Advanced Scenarios

Scenario 1: Handling SPAs & Dynamically Rendered Pages

monolith lacks a JavaScript engine. It only processes the initial HTML returned by the server. For Vue/React SPAs (where content is injected via JS), running monolith directly will yield a blank skeleton.

The Fix: Let a headless browser render the DOM first, then pipe it to monolith.

bash
chromium --headless \
  --window-size=1920,1080 \
  --run-all-compositor-stages-before-draw \
  --virtual-time-budget=9000 \
  --incognito \
  --dump-dom \
  https://github.com \
  | monolith - -I -b https://github.com -o github-rendered.html

This does two things:

  1. Headless Chromium opens the page, waits ~9 seconds for JS to execute, and outputs the final DOM (--dump-dom).
  2. monolith reads the HTML from stdin (-), uses https://github.com as the base URL (-b), and inlines all resolved resources.

This is the officially recommended workflow for dynamic content. Save it as a shell function in your .bashrc (e.g., archive-dynamic <url>) for one-click archiving.
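
Here is one way such a function might look. Treat it as a sketch under a few assumptions: the browser binary is called `chromium` (override it with the hypothetical `CHROMIUM_BIN` variable if yours is `chromium-browser` or `google-chrome`), the page URL itself is a suitable base URL, and the default output name `rendered.html` is arbitrary:

```bash
# archive_dynamic URL [OUT]: render URL in headless Chromium, then let
# monolith inline the assets of the rendered DOM.
# Assumptions: binary name "chromium" (override via CHROMIUM_BIN), and the
# page URL doubles as the base URL for resolving relative links.
archive_dynamic() {
  local url="$1"
  local out="${2:-rendered.html}"
  "${CHROMIUM_BIN:-chromium}" --headless \
    --window-size=1920,1080 \
    --run-all-compositor-stages-before-draw \
    --virtual-time-budget=9000 \
    --incognito \
    --dump-dom \
    "$url" \
    | monolith - -I -b "$url" -o "$out"
}

# Example: archive_dynamic https://github.com github-rendered.html
```

Pages that take longer to settle may need a larger --virtual-time-budget than the 9000 ms used above.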

Scenario 2: Batch Archiving Multiple Pages

Rarely do you archive just one page. A simple shell loop handles batches efficiently:

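bash
#!/bin/bash
# archive-list.sh: read urls.txt line by line and archive each URL
mkdir -p archive   # make sure the output directory exists
while IFS= read -r url || [ -n "$url" ]; do   # also handle a last line without a newline
  [ -z "$url" ] && continue                   # skip blank lines
  # Hash the URL to get a filename free of characters such as / and ?
  name=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)
  echo "[Archiving] $url"
  monolith "$url" -I -t 30 -o "archive/${name}.html"
done < urls.txt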

Using MD5 for filenames avoids illegal characters (/, ?, etc.) in URLs. The -t 30 flag ensures each request times out after 30 seconds, preventing hangs.
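
In longer runs, a few URLs will fail, and it helps to know which ones. The variant below is a sketch that records failures for a later retry. It assumes monolith exits with a non-zero status when a page cannot be saved, and it swaps md5sum for the POSIX `cksum` tool purely for portability (its 32-bit checksum is fine for short lists but can collide on large ones). The names `archive_list` and `failed.txt` are hypothetical:

```bash
# archive_list URLS_FILE OUT_DIR: archive every URL listed in URLS_FILE and
# record the ones that fail in OUT_DIR/failed.txt so they can be retried.
# Assumptions: monolith returns non-zero on failure; cksum stands in for md5sum.
archive_list() {
  local list="$1" dir="$2" url name
  mkdir -p "$dir"
  : > "$dir/failed.txt"                     # start with an empty failure log
  while IFS= read -r url || [ -n "$url" ]; do
    [ -z "$url" ] && continue               # skip blank lines
    name=$(printf '%s' "$url" | cksum | cut -d' ' -f1)
    # </dev/null keeps monolith from consuming the URL list on stdin.
    if ! monolith "$url" -I -t 30 -o "$dir/$name.html" < /dev/null; then
      echo "$url" >> "$dir/failed.txt"
    fi
  done < "$list"
}

# Example: archive_list urls.txt archive
# Retry:   archive_list archive/failed.txt archive-retry
```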

Scenario 3: Pages with Authentication

bash
monolith https://username:password@internal.example.com/dashboard -o dashboard.html

Standard Basic Auth is fully supported using the user:pass@host URL format.

Scenario 4: Routing Through a Proxy

monolith automatically respects environment variables. No extra flags needed:

bash
export https_proxy=http://proxy.corp.com:8080
monolith https://some.site/ -o saved.html
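
If you would rather not export the proxy for the whole session, the variables can be scoped to a single command. A minimal sketch; `with_proxy` is a hypothetical helper, and proxy.corp.com:8080 is just the placeholder address from the example above:

```bash
# with_proxy PROXY_URL COMMAND...: run COMMAND with the proxy variables set
# for that one invocation only.
# (Illustrative helper; it sets both https_proxy and http_proxy.)
with_proxy() {
  local proxy="$1"
  shift
  https_proxy="$proxy" http_proxy="$proxy" "$@"
}

# Example:
# with_proxy http://proxy.corp.com:8080 monolith https://some.site/ -o saved.html
```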

⚠️ Common Pitfalls & Troubleshooting

Q1: The generated HTML file is huge?
A: Expected behavior. Inlining every image, font, and video will bloat the file. If you only need the text, use -i -c -j to strip images, CSS, and JS. You can often shrink a multi-MB file down to a few dozen KB.
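
To see the difference on a page you care about, you could save a text-only copy and compare sizes. The helpers below are a sketch with hypothetical names (`archive_text_only`, `size_kb`); only the -i, -c, -j, and -o flags come from this guide:

```bash
# archive_text_only URL OUT: save URL without images, CSS, or JavaScript.
archive_text_only() {
  monolith "$1" -i -c -j -o "$2"
}

# size_kb FILE: print the size of FILE in whole kilobytes (rounded up).
size_kb() {
  local bytes
  # tr removes the padding some wc implementations put before the number.
  bytes=$(wc -c < "$1" | tr -d ' ')
  echo $(( (bytes + 1023) / 1024 ))
}

# Example:
# archive_text_only https://news.ycombinator.com/ hn-text.html
# echo "text-only: $(size_kb hn-text.html) KB"
```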

Q2: Some pages show garbled text/mojibake?
A: Force the encoding with -E utf-8. Some legacy sites declare incorrect charsets in their meta tags, causing parsing errors.

Q3: The saved content looks different from the live site?
A: Likely an SPA or async-loaded content. Use the headless Chrome pre-rendering method outlined in "Scenario 1".

Q4: Piped input fails with relative paths?
A: When reading HTML from stdin (cat local.html | monolith - ...), monolith doesn't know the original URL. Relative paths will break. Always pair - with -b <base-url> to resolve relative assets correctly.
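
A small wrapper makes it harder to forget the base URL. This is a sketch; `archive_local` is a hypothetical name, and example.com stands in for the page's original address:

```bash
# archive_local FILE BASE_URL OUT: inline the assets of a local HTML file,
# resolving its relative links against BASE_URL.
# (Illustrative helper; "-" tells monolith to read the HTML from stdin.)
archive_local() {
  monolith - -b "$2" -o "$3" < "$1"
}

# Example:
# archive_local local.html https://example.com/docs/ local-archived.html
```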


Summary

Let's recap what we've covered:

  1. Installed monolith using the package manager best suited for your OS.
  2. Ran monolith <url> -o output.html to perform your first archive.
  3. Used -B -d <domain> to filter out unwanted third-party trackers.
  4. Combined headless Chrome with a pipe to handle dynamic SPAs.
  5. Automated batch archiving with a simple shell script.

The real power of monolith isn't flashy features; it's distilling a clunky, multi-step process into a single command. Licensed under CC0, it comes with no licensing strings attached, which makes it easy to integrate into corporate pipelines or personal toolchains.

Next Steps: Wrap your frequent workflows into shell aliases or Makefile targets. For advanced scheduling or conditional filtering, explore monolith's Apify Actor integration, or use it as a downstream processor in a Go/Node.js orchestration script.

Struggling with a specific archiving scenario? Drop a comment below and let's figure it out.

Last Updated: 2026-06-21 10:05:01
