Tool to make print-friendly PDF copies of Substack articles. It can archive entire Substacks, and there is an option (with instructions) to supply the session cookie for paid Substacks so you can archive content you paid for.

Substack Archival Tool

A Python tool to archive Substack publications as print-friendly PDF files with proper margins to prevent text cutoff during printing.

Features

  • Automatic article discovery: Fetches all articles from a Substack publication
  • Proper print margins: 0.75-inch margins on all sides to prevent text cutoff
  • Complete content: Includes article text, images, and metadata (title, author, date)
  • Individual PDFs: Creates one PDF file per article for easy printing
  • Embedded images: Downloads and embeds images directly in PDFs
  • Clean formatting: Professional, readable layout optimized for printing
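As an illustration of the "Embedded images" feature, one common way to make an HTML page self-contained before PDF conversion is to inline each image as a base64 data URI. This is a minimal sketch under that assumption; the helper names `to_data_uri` and `embed_image` are hypothetical and are not taken from this project's source.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data: URI usable in an <img src="...">."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def embed_image(html: str, src_url: str, image_bytes: bytes, mime_type: str) -> str:
    """Replace every occurrence of a remote image URL with its inline data URI."""
    return html.replace(src_url, to_data_uri(image_bytes, mime_type))
```

Once the images are inlined this way, the rendered document no longer depends on the network.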

Requirements

  • Python 3.8 or higher
  • Internet connection (to fetch articles and images)

Installation

  1. Extract the project and navigate to the project directory:

    cd path/to/substack-archival-project
    

    Replace path/to/substack-archival-project with wherever you extracted the files.

  2. Install dependencies:

    pip install -r requirements.txt
    

    Note: WeasyPrint requires additional system dependencies on some platforms:

    • Windows: Usually works out of the box
    • macOS: brew install cairo pango gdk-pixbuf libffi
    • Linux (Debian/Ubuntu): apt-get install python3-dev python3-pip python3-cffi libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info

Usage

Basic Usage

Archive all articles from a Substack publication:

python substack_archiver.py --url https://example.substack.com

or without the https://:

python substack_archiver.py --url example.substack.com
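
Accepting the URL with or without the scheme implies some normalization step. The following is a rough sketch of how that could look, not the script's actual code; the function name `normalize_url` is an assumption made for illustration.

```python
from urllib.parse import urlparse


def normalize_url(raw: str) -> str:
    """Add https:// when no scheme is given and drop any path (hypothetical helper)."""
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw
    parsed = urlparse(raw)
    if not parsed.netloc:
        raise ValueError(f"Not a valid publication URL: {raw!r}")
    # Keep only scheme and host so later requests can append their own paths.
    return f"{parsed.scheme}://{parsed.netloc}"
```

With this sketch, `example.substack.com` and `https://example.substack.com/` both resolve to the same base URL.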

Advanced Options

Specify output directory:

python substack_archiver.py --url example.substack.com --output ./my_archive

Verbose mode (show detailed progress):

python substack_archiver.py --url example.substack.com --verbose

Limit number of articles (useful for testing):

python substack_archiver.py --url example.substack.com --limit 5

Adjust delay between requests:

python substack_archiver.py --url example.substack.com --delay 3.0
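
To show what a delay between requests means in practice, here is a small sketch of a fetch loop that pauses between consecutive requests. It is an assumption about the general technique, not the project's implementation; `fetch_all` and its `fetch` and `sleep` parameters are hypothetical names.

```python
import time
from typing import Callable, Iterable, List


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in order, waiting `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        results.append(fetch(url))
    return results
```

A larger delay makes the archive run slower but puts less load on the server.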

Combine multiple options:

python substack_archiver.py --url example.substack.com --output ./my_archive --verbose --limit 10

Archiving Paid Content (With Authentication)

If you have a paid subscription and want to archive subscriber-only articles, you'll need to provide your Substack session cookie:

Step 1: Get your Substack authentication cookie

IMPORTANT: The cookie name depends on how the Substack is hosted:

  • Custom domain (e.g., example.io): Use the connect.sid cookie
  • Standard Substack domain (e.g., example.substack.com): Use the substack.sid cookie

  1. In Chrome/Edge:

    • Open the specific Substack you want to archive
    • Make sure you're logged in
    • Press F12 to open Developer Tools
    • Go to the "Application" tab
    • In the left sidebar, expand "Cookies" and click on your Substack domain
    • Find the cookie named connect.sid or substack.sid (depending on domain type above)
    • Copy the entire "Value" field
  2. In Firefox:

    • Open the specific Substack you want to archive
    • Make sure you're logged in
    • Press F12 to open Developer Tools
    • Go to the "Storage" tab
    • Expand "Cookies" and click on your Substack domain
    • Find connect.sid or substack.sid (depending on domain type above) and copy its value

Step 2: Use the cookie with the archiver

For custom domains:

python substack_archiver.py --url example.io --cookie "connect.sid=YOUR_COOKIE_VALUE_HERE"

For standard Substack domains:

python substack_archiver.py --url example.substack.com --cookie "substack.sid=YOUR_COOKIE_VALUE_HERE"

Example with a (truncated) cookie value:

python substack_archiver.py --url example.io --cookie "connect.sid=s%3AJ7thyI5Z..." --verbose

Important notes:

  • Keep your cookie private - it's like a password
  • Cookies expire after some time; you may need to get a new one periodically
  • The cookie only authenticates you: it gives access to paid content on Substacks you are subscribed to, and nothing else
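
The cookie rules above can be sketched in a few lines of Python. This is an illustrative assumption about how a NAME=VALUE argument might be handled, not the script's real parsing code; `cookie_name_for` and `parse_cookie_arg` are hypothetical helpers.

```python
def cookie_name_for(host: str) -> str:
    """Pick the cookie name from the host, following the rule in Step 1."""
    if host.lower().endswith(".substack.com"):
        return "substack.sid"
    return "connect.sid"


def parse_cookie_arg(value: str) -> dict:
    """Split a 'NAME=VALUE' string into a one-entry dict.

    Only the first '=' separates name and value, so any '=' characters
    inside the cookie value itself are preserved.
    """
    name, sep, cookie_value = value.strip().partition("=")
    if not sep or not name or not cookie_value:
        raise ValueError("Expected the form NAME=VALUE, e.g. substack.sid=...")
    return {name: cookie_value}
```

A dict of this shape is what HTTP client libraries such as requests accept through their `cookies` parameter.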

Command-Line Options

Option          Description                                Default
--url           Substack publication URL (required)        -
--output        Output directory for PDF files             ./output
--verbose, -v   Show detailed progress information         False
--limit         Limit number of articles to archive        All articles
--delay         Delay between requests (seconds)           2.0
--cookie        Substack session cookie for paid content   None
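
For readers who want to see how these options map onto code, the following is a sketch of an equivalent `argparse` definition. It mirrors the table only; the real script may define its arguments differently (for example, the EXE can prompt for the URL instead of requiring it), and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults mirror the options table."""
    parser = argparse.ArgumentParser(
        prog="substack_archiver.py",
        description="Archive a Substack publication as print-friendly PDFs.",
    )
    parser.add_argument("--url", required=True, help="Substack publication URL")
    parser.add_argument("--output", default="./output", help="Output directory for PDF files")
    parser.add_argument("--verbose", "-v", action="store_true", help="Show detailed progress")
    parser.add_argument("--limit", type=int, default=None, help="Maximum number of articles")
    parser.add_argument("--delay", type=float, default=2.0, help="Seconds between requests")
    parser.add_argument("--cookie", default=None, help="Session cookie as NAME=VALUE")
    return parser
```

Here `--limit` defaults to `None`, which stands for "all articles" in the table.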

Output

PDFs are saved to the output directory (default: ./output/) with filenames based on article titles. Special characters are removed for filesystem compatibility.

Example output structure:

output/
├── Article Title One.pdf
├── Another Great Article.pdf
├── How To Do Something.pdf
└── ...
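
The exact cleaning rules are not documented here, so the following is only a sketch of how a title could be turned into a safe filename. The set of removed characters and the `safe_filename` name are assumptions, chosen to cover characters that Windows and most other filesystems reject.

```python
import re


def safe_filename(title: str, max_length: int = 150) -> str:
    """Turn an article title into a filesystem-friendly PDF filename."""
    # Drop characters that are invalid on Windows, plus control characters.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", title)
    # Collapse runs of whitespace and trim leading/trailing spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Truncate long titles and fall back to a placeholder if nothing is left.
    return (cleaned[:max_length].rstrip(" .") or "untitled") + ".pdf"
```

Two titles that differ only in removed characters would map to the same filename under this sketch, so a real implementation might also need a de-duplication step.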

Print Settings

The PDFs are generated with 0.75-inch margins on all sides, which should work with most printers. When printing:

  1. Use the "Fit to page" or "Actual size" option
  2. Ensure the printer's own margins are no larger than 0.75 inches, so they do not cut into the page content
  3. Print a test page first to verify no text cutoff
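
For context, page margins in WeasyPrint are normally set with a CSS `@page` rule. The sketch below shows that approach under the assumption that the tool works in a similar way; `build_print_css` and `render_pdf` are hypothetical names, and the image rule is an illustrative extra rather than something stated in this README.

```python
def build_print_css(margin_in: float = 0.75) -> str:
    """Return a stylesheet that sets uniform page margins in inches."""
    return (
        f"@page {{ margin: {margin_in}in; }}\n"
        # Keep wide images inside the printable area.
        "img { max-width: 100%; height: auto; }\n"
    )


def render_pdf(html: str, out_path: str, margin_in: float = 0.75) -> None:
    """Render an HTML string to a PDF file with the margins above."""
    # Imported lazily so the CSS helper works without WeasyPrint installed.
    from weasyprint import CSS, HTML

    stylesheet = CSS(string=build_print_css(margin_in))
    HTML(string=html).write_pdf(out_path, stylesheets=[stylesheet])
```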

Troubleshooting

No articles found

  • Verify the Substack URL is correct
  • Try accessing the URL in a web browser
  • Some Substacks may be private or require authentication

Images not appearing

  • Ensure you have internet connection during archival
  • Some images may be behind authentication
  • Check the verbose output for image download errors

PDF generation errors

  • Ensure WeasyPrint dependencies are installed correctly
  • Check that you have write permissions to the output directory
  • Try with --verbose to see detailed error messages
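
To rule out a permissions problem quickly, the output directory can be tested by creating a throwaway file in it. This is a small diagnostic sketch, independent of the tool itself; `ensure_writable_dir` is a hypothetical helper.

```python
import os
import tempfile


def ensure_writable_dir(path: str) -> bool:
    """Create the directory if needed and report whether files can be written there."""
    try:
        os.makedirs(path, exist_ok=True)
        # The temporary file is deleted again as soon as the block exits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False
```

For example, `ensure_writable_dir("./output")` returning `False` points at a directory or permissions problem rather than a WeasyPrint problem.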

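WeasyPrint installation issues

  • Install the platform-specific system dependencies listed in the note under Installation, then re-run pip install -r requirements.txt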

Limitations

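  • Paywalled content is archived only when you supply a valid session cookie for a subscription you hold (see Archiving Paid Content); without one, subscriber-only articles are not available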
  • Videos are not embedded (shows link instead)
  • Comments are not included
  • Some complex formatting may be simplified

Building for Distribution

If you want to create a standalone Windows EXE that others can use without installing Python:

  1. Run the build script:

    build_exe.bat
    

    This will automatically:

    • Install PyInstaller and dependencies
    • Clean previous builds
    • Create a single EXE file in the dist/ folder
  2. Distribute the EXE:

    • The file will be located at dist/SubstackArchiver.exe
    • This is a standalone executable (50-100 MB)
    • Users do NOT need Python installed
    • Simply share this single file

Manual Build

If you prefer to build manually:

  1. Install build dependencies:

    pip install -r requirements-build.txt
    
  2. Build with PyInstaller:

    pyinstaller --clean SubstackArchiver.spec
    
  3. Find your EXE:

    dist/SubstackArchiver.exe
    

Usage of the EXE

Users can run the EXE from Command Prompt:

SubstackArchiver.exe --url example.substack.com

Or double-click it and it will prompt for the URL and authentication interactively.
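
The double-click behavior amounts to falling back to interactive prompts when no command-line arguments are present. A minimal sketch of that pattern follows; the prompt wording and the names `needs_prompt` and `prompt_for_inputs` are assumptions, not the EXE's actual text or code.

```python
from typing import Callable, List, Optional, Tuple


def needs_prompt(argv: List[str]) -> bool:
    """True when the program was started without any arguments."""
    return len(argv) <= 1


def prompt_for_inputs(ask: Callable[[str], str] = input) -> Tuple[str, Optional[str]]:
    """Ask for the publication URL (required) and a cookie (optional)."""
    url = ""
    while not url:
        url = ask("Substack URL: ").strip()
    cookie = ask("Session cookie (leave blank for public content): ").strip()
    return url, (cookie or None)
```

A main function could call `needs_prompt(sys.argv)` first and only parse command-line options when it returns `False`.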

Note on Antivirus Software

Some antivirus programs may flag PyInstaller executables as suspicious. This is a false positive common with bundled Python applications. Users may need to:

  • Add an exception in their antivirus software
  • Or use the Python script version instead

Project Structure

Substack Archival Project/
├── README.md                   # This file
├── requirements.txt            # Python dependencies
├── requirements-build.txt      # Build dependencies (for creating EXE)
├── substack_archiver.py        # Main CLI script
├── SubstackArchiver.spec       # PyInstaller configuration
├── build_exe.bat               # Automated build script for Windows
├── modules/
│   ├── __init__.py             # Module initialization
│   ├── fetcher.py              # Article URL fetching
│   ├── parser.py               # Content extraction
│   └── pdf_generator.py        # PDF generation
├── templates/
│   └── article_template.html   # HTML template reference
├── output/                     # Generated PDFs (created automatically)
└── dist/                       # Built EXE location (after running build)

License

Free to use and modify for personal and educational purposes.

Support

For issues or questions, check the troubleshooting section above or review the source code comments.