Substack Archival Tool
A Python tool to archive Substack publications as print-friendly PDF files with proper margins to prevent text cutoff during printing.
Features
- Automatic article discovery: Fetches all articles from a Substack publication
- Proper print margins: 0.75-inch margins on all sides to prevent text cutoff
- Complete content: Includes article text, images, and metadata (title, author, date)
- Individual PDFs: Creates one PDF file per article for easy printing
- Embedded images: Downloads and embeds images directly in PDFs
- Clean formatting: Professional, readable layout optimized for printing
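As an illustration of the "Embedded images" feature, one common way to make a rendered document independent of the network is to inline each downloaded image as a base64 data URI. The sketch below shows only that idea; `to_data_uri` is a hypothetical helper, not a function taken from this project, and the tool may embed images differently.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable in an <img src="..."> attribute.

    Hypothetical helper for illustration; not part of the project's modules.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Example: inline a (tiny, fake) image payload into an HTML tag.
uri = to_data_uri(b"abc", "image/png")
img_tag = '<img src="' + uri + '">'
```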
Requirements
- Python 3.8 or higher
- Internet connection (to fetch articles and images)
Installation
- Extract the project and navigate to the project directory:
  cd path/to/substack-archival-project
  Replace path/to/substack-archival-project with wherever you extracted the files.
- Install dependencies:
  pip install -r requirements.txt
  Note: WeasyPrint requires additional system dependencies on some platforms:
  - Windows: Usually works out of the box
  - macOS: brew install cairo pango gdk-pixbuf libffi
  - Linux: apt-get install python3-dev python3-pip python3-cffi libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info
Usage
Basic Usage
Archive all articles from a Substack publication:
python substack_archiver.py --url https://example.substack.com
or without the https://:
python substack_archiver.py --url example.substack.com
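Accepting the URL with or without the scheme implies some normalization step. A minimal sketch of how that could work is shown here, assuming HTTPS is the default when no scheme is given; `normalize_url` is a hypothetical name and the script's own handling may differ.

```python
from urllib.parse import urlparse


def normalize_url(raw: str) -> str:
    """Return scheme://host for a publication URL, adding https:// if missing.

    Hypothetical helper; any path or trailing slash is dropped in this sketch.
    """
    raw = raw.strip()
    if "://" not in raw:
        raw = "https://" + raw  # assumption: default to HTTPS
    parsed = urlparse(raw)
    return f"{parsed.scheme}://{parsed.netloc}"
```

With this sketch, `example.substack.com` and `https://example.substack.com/` both resolve to the same base URL.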
Advanced Options
Specify output directory:
python substack_archiver.py --url example.substack.com --output ./my_archive
Verbose mode (show detailed progress):
python substack_archiver.py --url example.substack.com --verbose
Limit number of articles (useful for testing):
python substack_archiver.py --url example.substack.com --limit 5
Adjust delay between requests:
python substack_archiver.py --url example.substack.com --delay 3.0
Combine multiple options:
python substack_archiver.py --url example.substack.com --output ./my_archive --verbose --limit 10
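To show how --limit and --delay might interact, here is a rough sketch of a processing loop that stops after a fixed number of articles and pauses between requests. The function and parameter names are hypothetical and are not taken from the project's source; the `sleep` parameter exists only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Iterable, List, Optional


def archive_all(
    article_urls: Iterable[str],
    archive_one: Callable[[str], None],
    limit: Optional[int] = None,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Process up to `limit` articles, pausing `delay` seconds between them."""
    done: List[str] = []
    for index, url in enumerate(article_urls):
        if limit is not None and index >= limit:
            break
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        archive_one(url)
        done.append(url)
    return done
```

A longer delay is gentler on the server at the cost of a slower run, which is why a small --limit is handy while testing.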
Archiving Paid Content (With Authentication)
If you have a paid subscription and want to archive subscriber-only articles, you'll need to provide your Substack session cookie:
Step 1: Get your Substack authentication cookie
IMPORTANT: The cookie name depends on how the Substack is hosted:
- Custom domain (e.g., example.io): Use the connect.sid cookie
- Standard Substack domain (e.g., example.substack.com): Use the substack.sid cookie
In Chrome/Edge:
- Open the specific Substack you want to archive
- Make sure you're logged in
- Press F12 to open Developer Tools
- Go to the "Application" tab
- In the left sidebar, expand "Cookies" and click on your Substack domain
- Find the cookie named connect.sid or substack.sid (depending on domain type above)
- Copy the entire "Value" field
In Firefox:
- Open the specific Substack you want to archive
- Make sure you're logged in
- Press F12 to open Developer Tools
- Go to the "Storage" tab
- Expand "Cookies" and click on your Substack domain
- Find connect.sid or substack.sid (depending on domain type above) and copy its value
Step 2: Use the cookie with the archiver
For custom domains:
python substack_archiver.py --url example.io --cookie "connect.sid=YOUR_COOKIE_VALUE_HERE"
For standard Substack domains:
python substack_archiver.py --url example.substack.com --cookie "substack.sid=YOUR_COOKIE_VALUE_HERE"
Example with full cookie:
python substack_archiver.py --url example.io --cookie "connect.sid=s%3AJ7thyI5Z..." --verbose
Important notes:
- Keep your cookie private - it's like a password
- Cookies expire after some time; you may need to get a new one periodically
- Only works for Substacks you have paid access to
- The cookie authenticates you, giving access to paid content you're subscribed to
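The cookie rules above can be sketched in a few lines: pick the expected cookie name from the host, split the "name=value" string passed to --cookie, and turn it into a request header. All three helpers are hypothetical illustrations written for this README, not the script's actual code.

```python
from typing import Dict, Tuple
from urllib.parse import urlparse


def expected_cookie_name(url: str) -> str:
    """Pick the cookie name from the host, following the rule described above."""
    host = urlparse(url if "://" in url else "https://" + url).hostname or ""
    return "substack.sid" if host.endswith(".substack.com") else "connect.sid"


def split_cookie(cookie: str) -> Tuple[str, str]:
    """Split 'name=value' on the first '=' only, since values may contain '='."""
    name, sep, value = cookie.partition("=")
    if not sep or not name.strip() or not value.strip():
        raise ValueError("cookie must look like 'name=value'")
    return name.strip(), value.strip()


def cookie_header(cookie: str) -> Dict[str, str]:
    """Build an HTTP Cookie header from the --cookie argument."""
    name, value = split_cookie(cookie)
    return {"Cookie": f"{name}={value}"}
```

Checking the name against `expected_cookie_name` before making requests is one way to catch a connect.sid value pasted for a substack.com URL, or the reverse.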
Command-Line Options
| Option | Description | Default |
|---|---|---|
| --url | Substack publication URL (required) | - |
| --output | Output directory for PDF files | ./output |
| --verbose, -v | Show detailed progress information | False |
| --limit | Limit number of articles to archive | All articles |
| --delay | Delay between requests (seconds) | 2.0 |
| --cookie | Substack session cookie for paid content | None |
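For readers who want to see how the options table maps onto code, the following is a minimal argparse sketch that mirrors the flags and defaults listed above. It is a reconstruction from the table, not the script's actual parser; in particular, the real script may not mark --url as required at the parser level, since the EXE can prompt for it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="substack_archiver.py")
    parser.add_argument("--url", required=True, help="Substack publication URL")
    parser.add_argument("--output", default="./output", help="Output directory for PDF files")
    parser.add_argument("--verbose", "-v", action="store_true", help="Show detailed progress")
    parser.add_argument("--limit", type=int, default=None, help="Limit number of articles")
    parser.add_argument("--delay", type=float, default=2.0, help="Delay between requests (seconds)")
    parser.add_argument("--cookie", default=None, help="Session cookie for paid content")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the example self-contained.
args = build_parser().parse_args(["--url", "example.substack.com", "--limit", "5", "-v"])
```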
Output
PDFs are saved to the output directory (default: ./output/) with filenames based on article titles. Special characters are removed for filesystem compatibility.
Example output structure:
output/
├── Article Title One.pdf
├── Another Great Article.pdf
├── How To Do Something.pdf
└── ...
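A sketch of the title-to-filename step might look like the following. The exact character set removed and the length cap are assumptions made for illustration; `safe_filename` is a hypothetical name, and the script's real rules may differ.

```python
import re


def safe_filename(title: str, max_length: int = 150) -> str:
    """Drop characters that are invalid on common filesystems and tidy whitespace.

    Hypothetical helper: the removed set and max_length are assumptions.
    """
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    cleaned = cleaned[:max_length].rstrip(" .")
    return (cleaned or "untitled") + ".pdf"
```

Note that two different titles can collapse to the same filename after cleaning, so a real implementation would likely need some way to avoid overwriting an existing file.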
Print Settings
The PDFs are generated with 0.75-inch margins on all sides, which should work with most printers. When printing:
- Use the "Fit to page" or "Actual size" option
- Ensure printer margins are set to at least 0.75 inches
- Print a test page first to verify no text cutoff
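In WeasyPrint, page margins are normally set with a CSS @page rule. The sketch below shows how a uniform 0.75-inch margin could be expressed and passed to the renderer; `page_css` and `render_pdf` are hypothetical names, and the project's pdf_generator.py may structure this differently.

```python
def page_css(margin_inches: float = 0.75) -> str:
    """Build a CSS @page rule with the same margin on all four sides."""
    return f"@page {{ margin: {margin_inches}in; }}"


def render_pdf(html: str, output_path: str) -> None:
    """Render an HTML string to a PDF using the margins above (needs WeasyPrint)."""
    # Imported lazily so page_css() can be used without WeasyPrint installed.
    from weasyprint import CSS, HTML

    HTML(string=html).write_pdf(output_path, stylesheets=[CSS(string=page_css())])
```

Because the margin lives in the PDF itself, printing at "Actual size" preserves it exactly, while "Fit to page" may scale the page and add the printer's own margins on top.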
Troubleshooting
No articles found
- Verify the Substack URL is correct
- Try accessing the URL in a web browser
- Some Substacks may be private or require authentication
Images not appearing
- Ensure you have internet connection during archival
- Some images may be behind authentication
- Check the verbose output for image download errors
PDF generation errors
- Ensure WeasyPrint dependencies are installed correctly
- Check that you have write permissions to the output directory
- Try with --verbose to see detailed error messages
WeasyPrint installation issues
- Windows: Install GTK+ if needed
- Refer to WeasyPrint documentation: https://doc.courtbouillon.org/weasyprint/stable/first_steps.html
Limitations
- Paywalled content is archived only if you have a paid subscription and supply a valid session cookie (see "Archiving Paid Content" above)
- Videos are not embedded (shows link instead)
- Comments are not included
- Some complex formatting may be simplified
Building for Distribution
If you want to create a standalone Windows EXE that others can use without installing Python:
Quick Build (Recommended)
- Run the build script:
  build_exe.bat
  This will automatically:
  - Install PyInstaller and dependencies
  - Clean previous builds
  - Create a single EXE file in the dist/ folder
- Distribute the EXE:
  - The file will be located at dist/SubstackArchiver.exe
  - This is a standalone executable (50-100 MB)
  - Users do NOT need Python installed
  - Simply share this single file
Manual Build
If you prefer to build manually:
- Install build dependencies:
  pip install -r requirements-build.txt
- Build with PyInstaller:
  pyinstaller --clean SubstackArchiver.spec
- Find your EXE:
  dist/SubstackArchiver.exe
Usage of the EXE
Users can run the EXE from Command Prompt:
SubstackArchiver.exe --url example.substack.com
Or double-click it and it will prompt for the URL and authentication interactively.
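The double-click behaviour suggests a fallback from command-line arguments to an interactive prompt. A rough sketch of that pattern follows; `resolve_url` and the prompt wording are hypothetical and not taken from the script, and the `prompt` parameter exists only so the logic can be tried without typing.

```python
from typing import Callable, List


def resolve_url(argv: List[str], prompt: Callable[[str], str] = input) -> str:
    """Use the value after --url if present; otherwise ask the user for it."""
    if "--url" in argv:
        position = argv.index("--url")
        if position + 1 < len(argv):
            return argv[position + 1]
    answer = prompt("Substack URL: ").strip()  # hypothetical prompt text
    if not answer:
        raise SystemExit("A Substack URL is required.")
    return answer
```

In a real entry point this would typically be called as `resolve_url(sys.argv[1:])`, with a similar fallback for the cookie.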
Note on Antivirus Software
Some antivirus programs may flag PyInstaller executables as suspicious. This is a false positive common with bundled Python applications. Users may need to:
- Add an exception in their antivirus software
- Or use the Python script version instead
Project Structure
Substack Archival Project/
├── README.md # This file
├── requirements.txt # Python dependencies
├── requirements-build.txt # Build dependencies (for creating EXE)
├── substack_archiver.py # Main CLI script
├── SubstackArchiver.spec # PyInstaller configuration
├── build_exe.bat # Automated build script for Windows
├── modules/
│ ├── __init__.py # Module initialization
│ ├── fetcher.py # Article URL fetching
│ ├── parser.py # Content extraction
│ └── pdf_generator.py # PDF generation
├── templates/
│ └── article_template.html # HTML template reference
├── output/ # Generated PDFs (created automatically)
└── dist/ # Built EXE location (after running build)
License
Free to use and modify for personal and educational purposes.
Support
For issues or questions, check the troubleshooting section above or review the source code comments.