jschubert/substack_archival_tool


Tool for making print-friendly PDF copies of Substack articles. It can archive an entire publication, and there is an option (with instructions) to supply the session cookie for a paid Substack so you can archive content you have paid for.


jschubert 3f049946c3 Upload files to "/"		2025-12-21 09:42:28 -05:00
build_exe.bat	Upload files to "/"	2025-12-21 09:42:28 -05:00
LICENSE	Initial commit	2025-12-21 09:35:53 -05:00
README.md	Upload files to "/"	2025-12-21 09:41:38 -05:00
requirements-build.txt	Upload files to "/"	2025-12-21 09:41:38 -05:00
requirements.txt	Upload files to "/"	2025-12-21 09:41:38 -05:00
substack_archiver.py	Upload files to "/"	2025-12-21 09:41:38 -05:00
SubstackArchiver.spec	Upload files to "/"	2025-12-21 09:41:38 -05:00

README.md

Substack Archival Tool

A Python tool to archive Substack publications as print-friendly PDF files with proper margins to prevent text cutoff during printing.

Features

Automatic article discovery: Fetches all articles from a Substack publication
Proper print margins: 0.75-inch margins on all sides to prevent text cutoff
Complete content: Includes article text, images, and metadata (title, author, date)
Individual PDFs: Creates one PDF file per article for easy printing
Embedded images: Downloads and embeds images directly in PDFs
Clean formatting: Professional, readable layout optimized for printing

Requirements

Python 3.8 or higher
Internet connection (to fetch articles and images)

Installation

Extract the project and navigate to the project directory:
```
cd path/to/substack-archival-project
```
Replace path/to/substack-archival-project with wherever you extracted the files.
Install dependencies:
```
pip install -r requirements.txt
```
Note: WeasyPrint requires additional system dependencies on some platforms:
- Windows: Usually works out of the box
- macOS: brew install cairo pango gdk-pixbuf libffi
- Linux: apt-get install python3-dev python3-pip python3-cffi libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info

Usage

Basic Usage

Archive all articles from a Substack publication:

python substack_archiver.py --url https://example.substack.com

or without the https://:

python substack_archiver.py --url example.substack.com
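Both forms can be accepted by normalizing the value before any request is made. The following is a minimal sketch of that idea, not the tool's actual code; the function name `normalize_publication_url` is hypothetical.

```python
from urllib.parse import urlparse


def normalize_publication_url(raw: str) -> str:
    """Return a base URL with an explicit scheme and no path.

    Hypothetical helper: the real script may normalize differently.
    """
    raw = raw.strip()
    # Without a scheme, urlparse treats the host as a path, so add one first.
    if "://" not in raw:
        raw = "https://" + raw
    parsed = urlparse(raw)
    if not parsed.netloc:
        raise ValueError(f"not a valid publication URL: {raw!r}")
    # Keep only scheme and host; any path or trailing slash is dropped.
    return f"{parsed.scheme}://{parsed.netloc}"
```

With this sketch, `example.substack.com` and `https://example.substack.com/` both resolve to the same base URL.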

Advanced Options

Specify output directory:

python substack_archiver.py --url example.substack.com --output ./my_archive

Verbose mode (show detailed progress):

python substack_archiver.py --url example.substack.com --verbose

Limit number of articles (useful for testing):

python substack_archiver.py --url example.substack.com --limit 5

Adjust delay between requests:

python substack_archiver.py --url example.substack.com --delay 3.0

Combine multiple options:

python substack_archiver.py --url example.substack.com --output ./my_archive --verbose --limit 10
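To illustrate how `--limit` and `--delay` might interact, here is a small sketch of a rate-limited archive loop. It is an assumption about the general approach, not the script's implementation; `archive_all` and `archive_one` are hypothetical names, and the `sleep` parameter exists only so the loop can be exercised without real waiting.

```python
import time
from itertools import islice
from typing import Callable, Iterable, List, Optional


def archive_all(
    urls: Iterable[str],
    archive_one: Callable[[str], None],
    limit: Optional[int] = None,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Archive up to `limit` articles, pausing `delay` seconds between them."""
    done: List[str] = []
    # islice with a stop of None yields every item, so limit=None means "all".
    for index, url in enumerate(islice(urls, limit)):
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        archive_one(url)
        done.append(url)
    return done
```

A larger delay is gentler on the server at the cost of a longer run, which is why a small `--limit` is handy while testing.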

Archiving Paid Content (With Authentication)

If you have a paid subscription and want to archive subscriber-only articles, you'll need to provide your Substack session cookie:

Step 1: Get your Substack authentication cookie

IMPORTANT: The cookie name depends on how the Substack is hosted:

Custom domain (e.g., example.io): Use the connect.sid cookie
Standard Substack domain (e.g., example.substack.com): Use the substack.sid cookie

In Chrome/Edge:
- Open the specific Substack you want to archive
- Make sure you're logged in
- Press F12 to open Developer Tools
- Go to the "Application" tab
- In the left sidebar, expand "Cookies" and click on your Substack domain
- Find the cookie named connect.sid or substack.sid (depending on domain type above)
- Copy the entire "Value" field
In Firefox:
- Open the specific Substack you want to archive
- Make sure you're logged in
- Press F12 to open Developer Tools
- Go to the "Storage" tab
- Expand "Cookies" and click on your Substack domain
- Find connect.sid or substack.sid (depending on domain type above) and copy its value

Step 2: Use the cookie with the archiver

For custom domains:

python substack_archiver.py --url example.io --cookie "connect.sid=YOUR_COOKIE_VALUE_HERE"

For standard Substack domains:

python substack_archiver.py --url example.substack.com --cookie "substack.sid=YOUR_COOKIE_VALUE_HERE"

Example with full cookie:

python substack_archiver.py --url example.io --cookie "connect.sid=s%3AJ7thyI5Z..." --verbose

Important notes:

Keep your cookie private - it's like a password
Cookies expire after some time; you may need to get a new one periodically
Only works for Substacks you have paid access to
The cookie authenticates you, giving access to paid content you're subscribed to
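The cookie-name rule from Step 1 can be expressed in a few lines. This is a sketch under the assumption that any host ending in `.substack.com` counts as a standard domain and everything else as a custom domain; the helper names `cookie_name_for` and `cookie_argument` are hypothetical and not part of the tool.

```python
from urllib.parse import urlparse


def cookie_name_for(url: str) -> str:
    """Pick the cookie name based on how the publication is hosted."""
    # Add a scheme if missing so urlparse can find the hostname.
    host = urlparse(url if "://" in url else "https://" + url).hostname or ""
    if host == "substack.com" or host.endswith(".substack.com"):
        return "substack.sid"  # standard Substack domain
    return "connect.sid"  # custom domain


def cookie_argument(url: str, value: str) -> str:
    """Build the string to pass to --cookie, in name=value form."""
    return f"{cookie_name_for(url)}={value}"
```

For example, `cookie_argument("example.io", "YOUR_COOKIE_VALUE_HERE")` gives the `connect.sid=...` form shown in the commands above.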

Command-Line Options

| Option | Description | Default |
|---|---|---|
| `--url` | Substack publication URL (required) | - |
| `--output` | Output directory for PDF files | `./output` |
| `--verbose`, `-v` | Show detailed progress information | `False` |
| `--limit` | Limit number of articles to archive | All articles |
| `--delay` | Delay between requests (seconds) | `2.0` |
| `--cookie` | Substack session cookie for paid content | None |
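For readers who want to see how these options map onto code, the sketch below defines the same flags and defaults with `argparse`. It mirrors the table only; the real parser in `substack_archiver.py` may differ in help text and validation, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Define the documented options with the defaults from the table."""
    parser = argparse.ArgumentParser(
        prog="substack_archiver.py",
        description="Archive a Substack publication as print-friendly PDFs.",
    )
    parser.add_argument("--url", required=True, help="Substack publication URL")
    parser.add_argument("--output", default="./output", help="Output directory for PDF files")
    parser.add_argument("--verbose", "-v", action="store_true", help="Show detailed progress")
    # None means no limit, i.e. archive all articles.
    parser.add_argument("--limit", type=int, default=None, help="Limit number of articles")
    parser.add_argument("--delay", type=float, default=2.0, help="Delay between requests (seconds)")
    parser.add_argument("--cookie", default=None, help="Session cookie for paid content")
    return parser
```

Calling `build_parser().parse_args(["--url", "example.substack.com", "-v"])` returns a namespace with `verbose=True` and every other option at its default.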

Output

PDFs are saved to the output directory (default: ./output/) with filenames based on article titles. Special characters are removed for filesystem compatibility.

Example output structure:

output/
├── Article Title One.pdf
├── Another Great Article.pdf
├── How To Do Something.pdf
└── ...
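As a rough illustration of how a title could become a filename like those above, here is a sketch that strips characters commonly rejected by filesystems. The exact rules the tool applies are not documented here, so the character set, the length cap, and the name `title_to_filename` are all assumptions.

```python
import re


def title_to_filename(title: str, max_length: int = 150) -> str:
    """Turn an article title into a filesystem-safe PDF filename."""
    # Remove characters that Windows (the strictest common case) disallows,
    # plus ASCII control characters.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", title)
    # Collapse runs of whitespace and trim leading/trailing spaces and dots.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Cap the length, then fall back to a placeholder if nothing is left.
    cleaned = cleaned[:max_length].rstrip(" .")
    return (cleaned or "untitled") + ".pdf"
```

Note that two titles differing only in removed characters would map to the same filename under this sketch, so a real implementation may need a de-duplication step.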

Print Settings

The PDFs are generated with 0.75-inch margins on all sides, which should work with most printers. When printing:

Use the "Fit to page" or "Actual size" option
Ensure printer margins are set to at least 0.75 inches
Print a test page first to verify no text cutoff
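Because the PDFs are rendered with WeasyPrint, page margins like these are typically expressed as a CSS `@page` rule. The sketch below shows one way that could look; it is not taken from the project's template, and `build_print_html` and `write_pdf` are hypothetical helpers. WeasyPrint is imported lazily so the HTML builder can be used without it installed.

```python
from html import escape


def build_print_html(title: str, body_html: str, margin_in: float = 0.75) -> str:
    """Wrap article HTML in a page whose print margins are set via @page."""
    # The @page rule sets the margin on every side of each printed page;
    # the img rule keeps wide images inside the text area.
    css = f"@page {{ margin: {margin_in}in; }} img {{ max-width: 100%; }}"
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<title>{escape(title)}</title><style>{css}</style></head>"
        f"<body><h1>{escape(title)}</h1>{body_html}</body></html>"
    )


def write_pdf(html: str, path: str) -> None:
    """Render the HTML to a PDF file (requires WeasyPrint to be installed)."""
    from weasyprint import HTML  # imported here so the module loads without it

    HTML(string=html).write_pdf(path)
```

Changing `margin_in` is the only adjustment needed if a particular printer requires wider margins.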

Troubleshooting

No articles found

Verify the Substack URL is correct
Try accessing the URL in a web browser
Some Substacks may be private or require authentication

Images not appearing

Ensure you have internet connection during archival
Some images may be behind authentication
Check the verbose output for image download errors

PDF generation errors

Ensure WeasyPrint dependencies are installed correctly
Check that you have write permissions to the output directory
Try with --verbose to see detailed error messages

WeasyPrint installation issues

Windows: Install the GTK+ runtime if WeasyPrint reports missing libraries
Refer to WeasyPrint documentation: https://doc.courtbouillon.org/weasyprint/stable/first_steps.html

Limitations

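Paywalled content is only archived when you supply a valid session cookie for a subscription you hold (see Archiving Paid Content); without one, subscriber-only articles are not retrieved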
Videos are not embedded (shows link instead)
Comments are not included
Some complex formatting may be simplified

Building for Distribution

If you want to create a standalone Windows EXE that others can use without installing Python:

Quick Build (Recommended)

Run the build script:
```
build_exe.bat
```
This will automatically:
- Install PyInstaller and dependencies
- Clean previous builds
- Create a single EXE file in the dist/ folder
Distribute the EXE:
- The file will be located at dist/SubstackArchiver.exe
- This is a standalone executable (50-100 MB)
- Users do NOT need Python installed
- Simply share this single file

Manual Build

If you prefer to build manually:

Install build dependencies:
```
pip install -r requirements-build.txt
```

Build with PyInstaller:

pyinstaller --clean SubstackArchiver.spec

Find your EXE:
```
dist/SubstackArchiver.exe
```

Usage of the EXE

Users can run the EXE from Command Prompt:

SubstackArchiver.exe --url example.substack.com

Or double-click it and it will prompt for the URL and authentication interactively.

Note on Antivirus Software

Some antivirus programs may flag PyInstaller executables as suspicious. This is a false positive common with bundled Python applications. Users may need to:

Add an exception in their antivirus software
Or use the Python script version instead

Project Structure

Substack Archival Project/
├── README.md                   # This file
├── requirements.txt            # Python dependencies
├── requirements-build.txt      # Build dependencies (for creating EXE)
├── substack_archiver.py        # Main CLI script
├── SubstackArchiver.spec       # PyInstaller configuration
├── build_exe.bat               # Automated build script for Windows
├── modules/
│   ├── __init__.py             # Module initialization
│   ├── fetcher.py              # Article URL fetching
│   ├── parser.py               # Content extraction
│   └── pdf_generator.py        # PDF generation
├── templates/
│   └── article_template.html   # HTML template reference
├── output/                     # Generated PDFs (created automatically)
└── dist/                       # Built EXE location (after running build)

License

Free to use and modify for personal and educational purposes.

Support

For issues or questions, check the troubleshooting section above or review the source code comments.