A Practical Guide to Archiving Web Pages Offline with Monolith
Learn how to use the `monolith` CLI tool to convert any web page—complete with images, CSS, and scripts—into a single, self-contained HTML file. This guide covers installation, essential commands, dynamic page handling, batch processing, and troubleshooting for reliable offline archiving.

Ditch "Save As": A Practical Guide to Archiving Complete Web Pages with Monolith
Last week, a colleague sent me a dozen links for a technical reference doc, asking me to "archive them when I have time." My usual workflow? Open browser → Right-click "Save As" → End up with an HTML file plus a messy folder of assets → Zip it up → Send it → Hope it opens correctly on their end (spoiler: relative paths often break).
It's too clunky. Today, I want to introduce monolith, a lightweight CLI tool written in Rust that bundles an entire webpage—images, CSS, and JavaScript—into a single .html file. Open it offline, and it renders exactly like the live site. With 15k+ GitHub stars, zero runtime dependencies, and cross-platform support, it's a game-changer for developers, ops engineers, and content creators.
By the end of this guide, you'll be able to install monolith, archive a full-featured webpage into a single HTML file, filter resources, process pages in batches, and handle dynamic SPAs.
Prerequisites
- OS: macOS, Linux, or Windows
- Experience: No programming background required; basic terminal knowledge is enough
- Optional: If compiling from source, ensure you have `cargo` (Rust's package manager) installed
Quick Installation
The developers have made installation seamless across all major platforms. Pick the method that suits your environment:
macOS / Linux (Homebrew)
bash
brew install monolith
Windows (Winget)
bash
winget install --id=Y2Z.Monolith -e
Global Install (Requires Rust)
bash
cargo install monolith
Verify the installation:
bash
monolith -V
If you see the version number, you're ready to go.
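If you plan to call monolith from scripts, the same check can be wrapped in a small pre-flight function. This is a minimal sketch; `require_monolith` is a hypothetical helper name, not something monolith ships.

```shell
# Hypothetical pre-flight check for scripts that depend on monolith.
require_monolith() {
  # command -v succeeds only if "monolith" resolves to something runnable.
  if ! command -v monolith >/dev/null 2>&1; then
    echo "monolith not found on PATH; install it first" >&2
    return 1
  fi
  # Print the version so logs record which build produced the archive.
  monolith -V
}
```

Call `require_monolith || exit 1` at the top of a script so it stops early instead of failing halfway through a batch.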
Why recommend package managers? Because monolith ships as a single precompiled binary. brew or winget fetches it and puts it on your $PATH, so there's no need to wrestle with build dependencies like libssl. Use cargo install if you want to build the latest crates.io release yourself, or download a binary from the Releases page if you're on an isolated network.
Core Usage: One-Liner Web Archiving
The most basic usage requires just a URL and an output path:
bash
monolith https://lyrics.github.io/db/P/Portishead/Dummy/Roads/ -o portishead.html
What happens under the hood?
- `monolith` fetches the target HTML via HTTP.
- It parses the document, identifying all `<link>`, `<img>`, `<script>`, and `<style>` references.
- It downloads each asset and converts it into a data URL (e.g., `data:image/png;base64,...`).
- The inlined assets replace the original references, and the final output is saved to `portishead.html`.
Double-click the file, open it offline, and it renders identically to the live version. This is the fundamental difference between monolith and your browser's "Save As": it doesn't rely on external asset folders. Everything lives in one file.
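To make the data URL step concrete, here is a rough sketch of the same encoding done by hand. `to_data_url` is a hypothetical helper written for illustration; it is not part of monolith, and the MIME type is supplied by the caller rather than detected.

```shell
# Sketch: build a data URL for a local file, the kind of string that
# replaces an external reference in the archived HTML.
to_data_url() {
  local mime file
  mime=$1
  file=$2
  # base64 may wrap long output, so strip newlines to keep the URL on one line.
  printf 'data:%s;base64,%s\n' "$mime" "$(base64 < "$file" | tr -d '\n')"
}
```

For example, `to_data_url image/png logo.png` prints a string you could paste into an `<img src="...">` attribute. Base64 encoding grows the payload by roughly a third, which is one reason archived files are larger than the sum of their original assets.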
Quick Reference for Common Flags
| Flag | Purpose | Use Case |
|---|---|---|
| `-o FILE` | Write output to a file | Almost always; without it the HTML goes to stdout |
| `-i` | Remove images | Reduce file size when only text is needed |
| `-c` | Remove CSS | Pure data extraction |
| `-j` | Remove JavaScript | Strip tracking scripts/ads |
| `-I` | Isolate the document from the Internet | Archives that must never load remote resources |
| `-t 30` | Set 30s timeout | Unstable networks |
| `-k` | Skip TLS verification | Internal testing with self-signed certs |

You don't need to memorize all of these. Assets are inlined as data URLs by default, so no flag is needed for that. Add -I when the saved file must stay cut off from the network: it prevents the archived page from loading anything remote when opened.
🛠 Practical Example: Archiving Tech Docs & Stripping Ads
Let's say you want to save the Hacker News homepage for offline reading, but you want to skip Google Analytics tracking scripts and ad networks. Here's how:
Step 1: Filter domains with a blocklist
bash
monolith -I \
-B -d .google-analytics.com \
-d .google.com \
https://news.ycombinator.com/ \
-o hn-offline.html
💡 Note the `-B` syntax: `-B` turns the `-d <domain>` list into a blocklist, and `monolith` will skip fetching any resources from those domains. Pass `-B` once and repeat `-d` for each domain. If you only want to allow specific domains, use `-d` without `-B`, which works as an allowlist instead.
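If you maintain a longer list of domains, the flags can be generated rather than typed. The sketch below assumes a hypothetical helper, `domain_flags`, that only prints the arguments; it does not run monolith itself.

```shell
# Hypothetical helper: print monolith domain-filter flags for the given domains.
# Mode "block" emits -B once (blocklist); mode "allow" omits it (allowlist).
domain_flags() {
  local mode flags domain
  mode=$1
  shift
  flags=""
  if [ "$mode" = "block" ]; then
    flags="-B"
  fi
  for domain in "$@"; do
    flags="$flags -d $domain"
  done
  # Trim the leading space left behind in allow mode.
  printf '%s\n' "${flags# }"
}

# Example (not run here); the unquoted substitution is split into words on purpose:
# monolith -I $(domain_flags block .google-analytics.com .google.com) \
#   https://news.ycombinator.com/ -o hn-offline.html
```

Keeping the mode as an explicit argument makes it harder to flip a blocklist into an allowlist by accidentally dropping `-B`.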
Step 2: Verify the output
bash
ls -lh hn-offline.html
Open the file and check that nothing from the blocked domains was embedded, while Hacker News' content and styling remain intact. The archive is now a self-contained HTML file without those third-party dependencies.
Advanced Scenarios
Scenario 1: Handling SPAs & Dynamically Rendered Pages
monolith lacks a JavaScript engine. It only processes the initial HTML returned by the server. For Vue/React SPAs (where content is injected via JS), running monolith directly will yield a blank skeleton.
The Fix: Let a headless browser render the DOM first, then pipe it to monolith.
bash
chromium --headless \
--window-size=1920,1080 \
--run-all-compositor-stages-before-draw \
--virtual-time-budget=9000 \
--incognito \
--dump-dom \
https://github.com \
| monolith - -I -b https://github.com -o github-rendered.html
This does two things:
- Headless Chromium opens the page, gives JavaScript a virtual time budget of 9000 ms (about 9 seconds of page time) to execute, and outputs the final DOM (`--dump-dom`).
- `monolith` reads the HTML from stdin (`-`), uses `https://github.com` as the base URL (`-b`), and inlines all resolved resources.
This is the officially recommended workflow for dynamic content. Save it as a shell function in your .bashrc (e.g., archive-dynamic <url>) for one-click archiving.
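One possible shape for that shell function is sketched below. The name `archive_dynamic`, the `CHROMIUM_BIN` variable, and the default output filename are assumptions made for illustration; on your system the browser binary may be called `chromium-browser` or `google-chrome` instead.

```shell
# Hypothetical wrapper: render a page in headless Chromium, then inline it with monolith.
# Usage: archive_dynamic <url> [output.html]
archive_dynamic() {
  local url out
  url=$1
  out=${2:-rendered.html}
  if [ -z "$url" ]; then
    echo "usage: archive_dynamic <url> [output.html]" >&2
    return 2
  fi
  # Same flags as the command above; the page URL doubles as the base URL for -b.
  "${CHROMIUM_BIN:-chromium}" --headless \
    --window-size=1920,1080 \
    --run-all-compositor-stages-before-draw \
    --virtual-time-budget=9000 \
    --incognito \
    --dump-dom "$url" \
    | monolith - -I -b "$url" -o "$out"
}
```

After adding it to your .bashrc, `archive_dynamic https://github.com github-rendered.html` reproduces the pipeline above. Raise the virtual time budget for pages that load content slowly.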
Scenario 2: Batch Archiving Multiple Pages
Rarely do you archive just one page. A simple shell loop handles batches efficiently:
bash
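#!/bin/bash
## archive-list.sh: Read urls.txt line by line and archive each
mkdir -p archive   # the output directory must exist before monolith writes to it
while IFS= read -r url || [ -n "$url" ]; do   # also handle a last line without a newline
[ -z "$url" ] && continue   # skip blank lines
name=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)
echo "[Archiving] $url"
monolith "$url" -I -t 30 -o "archive/${name}.html" || echo "[Failed] $url" >&2
done < urls.txt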
Using MD5 for filenames avoids illegal characters (/, ?, etc.) in URLs. The -t 30 flag ensures each request times out after 30 seconds, preventing hangs.
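Hashed filenames are opaque, so it helps to record which URL each file came from. The helpers below are a sketch under that assumption; `archive_name`, `manifest_line`, and the tab-separated manifest format are hypothetical, and `md5sum` is assumed to be available (as on most Linux systems). The hash is computed with `printf '%s'`, so no trailing newline enters it.

```shell
# Hypothetical helper: derive a filesystem-safe name from a URL.
archive_name() {
  # Hash the URL text only, with no trailing newline.
  printf '%s' "$1" | md5sum | cut -d' ' -f1
}

# Hypothetical helper: one tab-separated record, "<hash>.html<TAB><original URL>".
manifest_line() {
  printf '%s.html\t%s\n' "$(archive_name "$1")" "$1"
}
```

Inside the loop, `manifest_line "$url" >> archive/manifest.tsv` would build an index you can later search with grep to find the file for a given page.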
Scenario 3: Pages with Authentication
bash
monolith https://username:password@internal.example.com/dashboard -o dashboard.html
Standard Basic Auth is fully supported using the user:pass@host URL format.
Scenario 4: Routing Through a Proxy
monolith automatically respects environment variables. No extra flags needed:
bash
export https_proxy=http://proxy.corp.com:8080
monolith https://some.site/ -o saved.html
⚠️ Common Pitfalls & Troubleshooting
Q1: The generated HTML file is huge?
A: Expected behavior. Inlining every image, font, and video will bloat the file. If you only need the text, use -i -c -j to strip images, CSS, and JS. You can often shrink a multi-MB file down to a few dozen KB.
Q2: Some pages show garbled text/mojibake?
A: Force the encoding with -E utf-8. Some legacy sites declare incorrect charsets in their meta tags, causing parsing errors.
Q3: The saved content looks different from the live site?
A: Likely an SPA or async-loaded content. Use the headless Chrome pre-rendering method outlined in "Scenario 1".
Q4: Piped input fails with relative paths?
A: When reading HTML from stdin (cat local.html | monolith - ...), monolith doesn't know the original URL. Relative paths will break. Always pair - with -b <base-url> to resolve relative assets correctly.
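A small wrapper can enforce that pairing so the base URL is never forgotten. This is a sketch; `archive_local` is a hypothetical name, and the base URL in the usage note is a placeholder.

```shell
# Hypothetical wrapper: inline a local HTML file, refusing to run without a base URL.
# Usage: archive_local <file.html> <base-url> <output.html>
archive_local() {
  local file base out
  file=$1
  base=$2
  out=$3
  if [ -z "$file" ] || [ -z "$base" ] || [ -z "$out" ]; then
    echo "usage: archive_local <file.html> <base-url> <output.html>" >&2
    return 2
  fi
  # "-" makes monolith read the HTML from stdin; -b tells it where relative links point.
  monolith - -b "$base" -o "$out" < "$file"
}
```

For instance, `archive_local local.html https://example.com/docs/ saved.html` resolves a reference like `img/logo.png` against `https://example.com/docs/`.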
Summary
Let's recap what we've covered:
- Installed `monolith` using the package manager best suited for your OS.
- Ran `monolith <url> -o output.html` to perform your first archive.
- Used `-B -d <domain>` to filter out unwanted third-party trackers.
- Combined headless Chrome with a pipe to handle dynamic SPAs.
- Automated batch archiving with a simple shell script.
The real power of monolith isn't flashy features; it's distilling a clunky, multi-step process into a single command. It is licensed under CC0, a public-domain dedication, so licensing is no obstacle to integrating it into corporate pipelines or personal toolchains.
Next Steps: Wrap your frequent workflows into shell aliases or Makefile targets. For advanced scheduling or conditional filtering, explore monolith's Apify Actor integration, or use it as a downstream processor in a Go/Node.js orchestration script.
Struggling with a specific archiving scenario? Drop a comment below and let's figure it out.