Reddit Media Collector
A powerful, self-hosted media collector for Reddit that automatically downloads images, videos, and GIFs from your favourite subreddits and users. Features a built-in web interface for management and seamless integration with Immich for photo organisation.
Features
Collection
- Multi-source — subreddits and user profiles
- Smart deduplication — MD5 hash-based; never downloads the same file twice
- Gallery support — handles Reddit galleries with multiple images
- Multiple extractors — Reddit, Imgur, Gfycat, Redgifs
- No API keys — uses Reddit's public JSON endpoints
Web dashboard
- Modular HTMX + Alpine.js frontend (vanilla, zero build step)
- Gallery with infinite scroll, multi-select, bulk delete, sorting/filtering
- Tag taxonomy (Stash-inspired) — auto-tags every post by subreddit / performer / genre / nsfw; manual tags preserved across reruns; colored chips on every card
- NSFW gate — blur thumbnails by default; toggle 👁/🙈 in the header (persisted in localStorage)
- Discreet mode 🤫 — compact thumbnails for screen-shoulder privacy; auto-activates after 60 s of idle
- PIN lock (optional) — HMAC-signed session cookie on top of Basic Auth; idle timeout configurable
- Favourites + per-author view + sync favourites to user targets
- Blacklist — authors, subreddits, title keywords, domains
- Scheduler — interval or specific times, run history, "run now" button
- last_error banner — failed scheduled runs surface at the top until dismissed
- Keyboard shortcuts in the modal: j/k navigate, f favourite, b blacklist author, Esc close
- /health endpoint — public JSON with DB/ffmpeg/scheduler/writable status, ready for Container Manager monitors
Operations
- Docker — single-image deployment, published to ghcr.io/richardnixondev/reddit-media-collector
- Synology DSM Container Manager ready (one-click deploy via compose)
- SQLite WAL — crash-safe on power loss, fast concurrent reads
- Rotating log file (10 MB × 5 backups) — caps disk usage on NAS
- HTTP Basic Auth (optional) + per-IP rate limiting on every mutation
- Immich integration — JSON sidecar with metadata for seamless import
Quick Start
Prerequisites
- Python 3.11 or higher
- FFmpeg (optional, for video thumbnails)
- yt-dlp (optional, for Gfycat/Redgifs support)
Installation
- Clone the repository
  git clone https://github.com/richardnixondev/reddit-media-collector.git
  cd reddit-media-collector
- Create virtual environment
  python3 -m venv venv
  source venv/bin/activate   # Linux/macOS
  # or
  .\venv\Scripts\activate    # Windows
- Install the package (editable, with dev tooling)
  pip install -e ".[dev]"
- Configure
  cp config.yaml.example config.yaml
  # Edit config.yaml with your preferences
- Run the collector
  python -m src.main
Development
pip install -e ".[dev]"
pre-commit install
# Tests + coverage
pytest -v
pytest --cov=src --cov-report=term-missing
# Lint, format, type-check
ruff check src/ tests/
ruff format src/ tests/
mypy src/
Configuration
Create a config.yaml file based on the example:
targets:
  subreddits:
    - name: "earthporn"
      limit: 100            # Posts per request (max 100)
      sort: "top"           # hot, new, top, rising
      time_filter: "week"   # For "top" sort: hour, day, week, month, year, all
    - name: "pics"
      limit: 50
      sort: "new"
  users:
    - name: "username"
      limit: 100
download:
  output_dir: "./downloads"
  media_types:
    - "image"
    - "video"
    - "gif"
  min_score: 10                      # Minimum upvotes required
  skip_nsfw: false                   # Skip NSFW content
  max_file_size_mb: 200              # Maximum file size
  flat_structure: true               # All files in a single folder
  generate_sidecar: true             # Generate .json for Immich
  videos_only_from_favorites: false  # Only download videos from favourited users
rate_limit:
  requests_per_minute: 20
  download_delay_seconds: 2
logging:
  level: "INFO"
  file: "collector.log"
blacklist:
  authors: []          # Usernames to ignore
  subreddits: []       # Subreddits to ignore
  title_keywords: []
  domains: []
Web Interface
The collector includes a FastAPI-powered web dashboard for easy management.
Starting the Web Server
# Using the run script
./run_web.sh
# Or directly with uvicorn
uvicorn src.web.app:app --host 0.0.0.0 --port 8000
Access the dashboard at http://localhost:8000
Web Features
- Dashboard — collection statistics, trends chart, top authors, recent downloads
- Gallery — infinite scroll, filtering (subreddit / author / type / favourites / NSFW), sorting, bulk select, multi-delete, tag chips per card, NSFW blur with hover/global toggle
- Authors — grid grouped by author with per-author modal
- Sources — add/remove subreddits and users (HTMX, no page reload)
- Settings — download options, blacklist, scheduler config, individual collection
- Scheduler — interval or specific times, run history, "run now"
- Collector control — manual trigger or per-target collection
- Header toggles — NSFW 👁/🙈, Discreet 🤫, 🔒 Lock (when PIN enabled)
- Keyboard shortcuts — in the media modal: j/k next/prev, f favourite, b blacklist author, Esc close
Security & Limits
HTTP Basic Auth (optional)
Set both RMC_AUTH_USER and RMC_AUTH_PASS to require Basic credentials on every route except /health. With either variable unset, the API stays public — appropriate for trusted local/intranet deployments.
RMC_AUTH_USER=alice RMC_AUTH_PASS=s3cret uvicorn src.web.app:app
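The "both variables or nothing" rule can be sketched as a small check. This is an illustrative sketch only, not the code in src/web/auth.py; the function name basic_auth_ok and its signature are assumptions made for the example.

```python
import base64
import secrets
from typing import Optional


def basic_auth_ok(
    authorization: Optional[str],
    user: Optional[str],
    password: Optional[str],
) -> bool:
    """Return True when a request may proceed (hypothetical helper)."""
    # Auth is enforced only when BOTH credentials are configured.
    if not user or not password:
        return True
    if not authorization or not authorization.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(authorization[6:], validate=True).decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False
    supplied_user, sep, supplied_pass = decoded.partition(":")
    if not sep:
        return False
    # compare_digest runs in constant time for equal-length inputs.
    user_ok = secrets.compare_digest(supplied_user.encode(), user.encode())
    pass_ok = secrets.compare_digest(supplied_pass.encode(), password.encode())
    return user_ok and pass_ok
```

In a real deployment the two configured values would come from RMC_AUTH_USER and RMC_AUTH_PASS, and /health would skip this check entirely.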
PIN lock (optional, UI-only)
Set RMC_PIN (4-6 digits) to gate the web UI behind a numeric PIN screen on top of Basic Auth. Useful when others might use the device but you don't want to leak the API password.
- Session cookie is signed with HMAC-SHA256 using a key generated at boot — restart invalidates every session.
- Idle window: 10 min default; override with RMC_PIN_TIMEOUT (seconds).
- /health, /static/*, /favicon bypass the lock so monitors and CSS keep working.
- Clicking "🔒 Lock" in the header locks immediately.
RMC_PIN=1234 RMC_PIN_TIMEOUT=900 uvicorn src.web.app:app
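For readers curious how an HMAC-signed session cookie with an idle window can work, here is a minimal sketch. It is not the contents of src/web/session.py: the cookie layout (timestamp, dot, hex MAC) and the names sign_session and session_valid are assumptions for illustration.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

# Generated once per process, so a restart invalidates every cookie.
SECRET_KEY = secrets.token_bytes(32)
IDLE_TIMEOUT = 600  # seconds; mirrors the documented RMC_PIN_TIMEOUT default


def sign_session(now: Optional[float] = None) -> str:
    """Build a cookie value of the assumed form '<unix_ts>.<hex mac>'."""
    issued = str(int(time.time() if now is None else now))
    mac = hmac.new(SECRET_KEY, issued.encode(), hashlib.sha256).hexdigest()
    return f"{issued}.{mac}"


def session_valid(
    cookie: str, now: Optional[float] = None, timeout: int = IDLE_TIMEOUT
) -> bool:
    """Check the signature first, then the age of the timestamp."""
    issued, sep, mac = cookie.partition(".")
    if not sep or not (issued.isascii() and issued.isdigit()):
        return False
    expected = hmac.new(SECRET_KEY, issued.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return current - int(issued) <= timeout
```

To make the window behave as an idle timeout rather than a fixed lifetime, a server would re-issue the cookie with a fresh timestamp on each authenticated request.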
Rate limiting (per IP, per endpoint)
Heavy and mutation endpoints are throttled in-process to prevent runaway clients:
| Endpoint | Limit |
|---|---|
| GET /api/stats, /api/stats/enhanced | 30 / minute |
| POST /api/collector/run | 3 / minute |
| POST /api/media/cleanup-blacklist | 5 / minute |
| POST /api/media/cleanup-by-type | 5 / minute |
| POST/PUT/DELETE /api/{subreddits,users,blacklist/*,settings/*,posts/*/tags} | 60 / minute |
Limits are per (client IP, route path) and reset on a sliding window. For multi-worker deployments, swap the in-memory buckets for a shared backend.
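A sliding-window limiter keyed by (client IP, route path) can be sketched in a few lines. This is an illustration of the technique, not the code in src/web/rate_limit.py; the class name SlidingWindowLimiter and its allow method are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class SlidingWindowLimiter:
    """In-memory limiter: at most `limit` hits per key in any window."""

    def __init__(self, limit: int, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str, route: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[(client_ip, route)]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Because the buckets live in process memory, each worker would count separately, which is why a shared backend is needed for multi-worker deployments.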
Gallery performance
- The gallery keeps at most 500 items in memory/DOM at once; older items are evicted as you scroll. Use filters or sort to reach older media.
- Status polling pauses when the tab is hidden.
- Filter dropdowns are debounced (200 ms) so rapid changes only fire one request.
- Bulk DELETE runs 5 requests in parallel.
Running as a Service (systemd)
# Copy the service file
sudo cp reddit-collector-web.service /etc/systemd/system/
# Reload systemd and enable service
sudo systemctl daemon-reload
sudo systemctl enable reddit-collector-web
sudo systemctl start reddit-collector-web
# Check status
sudo systemctl status reddit-collector-web
# View logs
sudo journalctl -u reddit-collector-web -f
Docker
Using Docker Compose
# Build and run
docker-compose up -d
# View logs
docker-compose logs -f
# Stop
docker-compose down
Docker Configuration
The docker-compose.yml mounts the following volumes:
- ./config.yaml → /app/config.yaml (must be writable — the UI mutates it)
- ./downloads → /app/downloads (media files)
- ./data → /app/data (SQLite DB, scheduler DB/config — survives upgrades)
Relevant environment variables (all optional, with sensible defaults inside the image):
| Variable | Default | Purpose |
|---|---|---|
| RMC_DOWNLOAD_DIR | /app/downloads | Where media files are written |
| RMC_DB_PATH | /app/data/media.db | Main SQLite database |
| RMC_SCHEDULER_DB | /app/scheduler.db | APScheduler jobstore — set to /app/data/scheduler.db in containers so history survives restarts |
| RMC_SCHEDULER_CONFIG | /app/scheduler_config.yaml | Scheduler interval/cron config — same advice as above |
| RMC_CONFIG_PATH | /app/config.yaml | YAML with subreddits/users/blacklist |
| RMC_TIMEZONE | UTC | Timezone used by the scheduler |
| RMC_AUTH_USER / RMC_AUTH_PASS | unset | Enable HTTP Basic Auth on every route except /health when both are set |
| RMC_PIN | unset | 4-6 digit PIN gating the web UI (HMAC cookie); leave unset to disable |
| RMC_PIN_TIMEOUT | 600 | Idle seconds before the PIN cookie expires |
Synology DSM Deployment
Tested on DSM 7.2 with Container Manager on x86_64 Plus models. The published image lives at ghcr.io/richardnixondev/reddit-media-collector:latest.
- Prepare the host directories (one-time, via SSH or File Station):
  sudo mkdir -p /volume1/docker/reddit-media-collector/{downloads,data}
  sudo cp /path/to/your/config.yaml /volume1/docker/reddit-media-collector/config.yaml
- Create the project in Container Manager:
  - Open Container Manager → Project → Create.
  - Project name: reddit-media-collector.
  - Path: /volume1/docker/reddit-media-collector.
  - Source: Create docker-compose.yml, paste the contents of docker-compose.synology.yml (adjust RMC_TIMEZONE to your zone).
  - Next → Build — DSM does docker compose pull && up -d.
- Access the UI: http://<nas-ip>:8000. Downloaded media appears in /volume1/docker/reddit-media-collector/downloads/, which you can point Synology Photos or an Immich instance at.
- HTTPS via DSM Reverse Proxy (optional, recommended if exposed): Control Panel → Login Portal → Advanced → Reverse Proxy → Create. Source: reddit.yourdomain.com (HTTPS:443). Destination: localhost:8000 (HTTP). Attach a Let's Encrypt cert. When exposing publicly, uncomment RMC_AUTH_USER / RMC_AUTH_PASS in the compose file — and consider adding RMC_PIN=NNNN for a second factor on the UI.
- Updating: Container Manager → Project → reddit-media-collector → Action → Build re-pulls latest. The data/ volume keeps the database, scheduler history, tags and config.
- First boot (one-time, if you already had a library): to retroactively tag every existing post (subreddit / performer / genre / nsfw):
  curl -X POST http://<nas-ip>:8000/api/tags/backfill
  Idempotent — safe to re-run.
Permissions note: if the container can't write to the bind mounts, check the owner
of /volume1/docker/reddit-media-collector/ with ls -ln. Either chown it to a user
the container can write as, or set user: "UID:GID" in the compose (commented out at the
bottom of docker-compose.synology.yml).
File Naming Convention
Downloaded files follow a descriptive naming pattern:
{subreddit}_{author}_{YYYYMMDD}_{HHmmss}_{post_id}[_{gallery_index}].{ext}
Examples:
earthporn_photographer123_20260118_143052_abc123.jpg
pics_user456_20260118_091523_xyz789_1.jpg (gallery image 1)
pics_user456_20260118_091523_xyz789_2.jpg (gallery image 2)
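The pattern above can be reproduced with a short helper. This is a sketch under assumptions: the function name build_filename is hypothetical, the timestamp is taken to be the post's creation time in UTC (which matches the sidecar example further down), and no sanitising of unusual characters is attempted.

```python
from datetime import datetime, timezone
from typing import Optional


def build_filename(
    subreddit: str,
    author: str,
    created_utc: float,
    post_id: str,
    ext: str,
    gallery_index: Optional[int] = None,
) -> str:
    """Assemble {subreddit}_{author}_{YYYYMMDD}_{HHmmss}_{post_id}[_{n}].{ext}."""
    stamp = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y%m%d_%H%M%S")
    parts = [subreddit, author, stamp, post_id]
    # Gallery posts get a trailing index so each image has a unique name.
    if gallery_index is not None:
        parts.append(str(gallery_index))
    return "_".join(parts) + "." + ext.lstrip(".")
```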
Immich Integration
The collector generates JSON sidecar files compatible with Immich:
{
"dateTimeOriginal": "2026-01-18T14:30:52+00:00",
"description": "Post title from Reddit",
"albums": ["r/earthporn"],
"tags": ["reddit", "earthporn", "image"],
"rating": 4,
"people": ["photographer123"],
"externalUrl": "https://reddit.com/r/earthporn/comments/abc123/title"
}
Rating System
Ratings are automatically assigned based on post score:
| Score | Rating |
|---|---|
| 0-9 | 1 star |
| 10-49 | 2 stars |
| 50-199 | 3 stars |
| 200-999 | 4 stars |
| 1000+ | 5 stars |
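The table maps directly onto a threshold lookup. The sketch below uses a hypothetical function name, score_to_rating, and assumes that negative scores (which the table does not cover) also get 1 star.

```python
def score_to_rating(score: int) -> int:
    """Map a Reddit post score to a 1-5 star rating per the table above."""
    # Checked from the highest threshold down; first match wins.
    thresholds = ((1000, 5), (200, 4), (50, 3), (10, 2))
    for minimum, stars in thresholds:
        if score >= minimum:
            return stars
    # 0-9, plus negative scores by assumption.
    return 1
```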
Importing to Immich
Point Immich to your downloads folder as an external library, or use the Immich CLI:
immich upload --album "Reddit Collection" ./downloads/
Scheduled Collection
Using Cron
# Edit crontab
crontab -e
# Add entry (runs every 6 hours)
0 */6 * * * /path/to/reddit-media-collector/run_collector.sh
Example run_collector.sh
#!/bin/bash
cd /path/to/reddit-media-collector || exit 1
source venv/bin/activate
export PATH="$HOME/.local/bin:$PATH" # For yt-dlp
timeout 4h python -m src.main >> cron.log 2>&1
status=$? # Capture the exit code before any other command, including $(date), runs
echo "$(date): Collector finished with exit code $status" >> cron.log
API Reference
The web interface exposes a REST API. FastAPI also serves interactive docs at /docs and OpenAPI at /openapi.json.
Configuration
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/config | Full config snapshot |
| GET | /api/subreddits | List configured subreddits |
| POST | /api/subreddits | Add a subreddit |
| DELETE | /api/subreddits/{name} | Remove a subreddit |
| GET | /api/users | List configured users |
| POST | /api/users | Add a user |
| DELETE | /api/users/{name} | Remove a user |
| GET | /api/settings | Download + rate-limit settings |
| PUT | /api/settings | Update settings |
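As a rough illustration of calling these endpoints from a script, the sketch below builds a POST /api/subreddits request with the standard library. The JSON body shape (name, limit, sort) is an assumption that simply mirrors the keys in config.yaml; check the interactive docs at /docs for the real schema. The function name is hypothetical.

```python
import base64
import json
import urllib.request
from typing import Optional


def build_add_subreddit_request(
    base_url: str,
    name: str,
    limit: int = 100,
    sort: str = "hot",
    user: Optional[str] = None,
    password: Optional[str] = None,
) -> urllib.request.Request:
    """Prepare (but do not send) a request that adds a subreddit target."""
    # ASSUMPTION: body fields mirror config.yaml; verify against /docs.
    payload = {"name": name, "limit": limit, "sort": sort}
    headers = {"Content-Type": "application/json"}
    # Only needed when RMC_AUTH_USER / RMC_AUTH_PASS are set on the server.
    if user and password:
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        headers["Authorization"] = f"Basic {token}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/subreddits",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; keep in mind the 60 / minute limit on mutation endpoints.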
Media
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/media | List downloaded media (paginated, filterable) |
| GET | /api/media/{id}/info | Get media details |
| DELETE | /api/media/{id}?blacklist_author=&blacklist_subreddit= | Delete media file (and optionally blacklist) |
| GET | /api/media/subreddits[?limit&offset] | Subreddits with downloaded content |
| GET | /api/media/authors[?limit&offset] | Authors with downloaded content |
| GET | /api/media/file/{filename:path} | Serve raw file (Range-aware for video) |
| GET | /api/media/thumb/{filename:path} | Serve / generate video thumbnail |
| GET | /api/media/blacklist-preview | Files affected by current blacklist |
| POST | /api/media/cleanup-blacklist | Delete blacklisted media |
| GET | /api/media/cleanup-preview?media_type= | Files affected by type cleanup |
| POST | /api/media/cleanup-by-type?media_type= | Delete by media type |
Blacklist
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/blacklist | Full blacklist |
| POST | /api/blacklist/{authors,subreddits,keywords,domains} | Add entry |
| DELETE | /api/blacklist/{kind}/{name} | Remove entry |
Favorites & Authors
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/favorites | Favorited posts (paginated) |
| POST | /api/favorites/{post_id} | Add to favorites |
| DELETE | /api/favorites/{post_id} | Remove from favorites |
| GET | /api/favorites/authors[?limit&offset] | Distinct favorited authors |
| POST | /api/favorites/sync-users | Add favorite authors as user targets |
| GET | /api/authors | Authors with stats (paginated, sortable, favorites filter) |
| GET | /api/authors/{author}/media | Media for one author |
Stats
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/stats | Disk + counts (30 s server-side cache) |
| GET | /api/stats/enhanced | Trends, top authors, scores |
| GET | /api/stats/recent | Recent downloads |
Collector & Scheduler
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/collector/run | Trigger collection run |
| GET | /api/collector/status | Collector status (includes last_error) |
| POST | /api/collector/clear-error | Dismiss the last_error banner |
| POST | /api/collect/individual | Collect from a single subreddit/user |
| GET | /api/collect/targets | List available targets |
| GET | /api/scheduler/status | Scheduler state + next run |
| PUT | /api/scheduler/config | Update schedule |
| GET | /api/scheduler/history | Past scheduler runs |
| POST | /api/scheduler/run-now | Execute schedule immediately |
Tags
Stash-style taxonomy. Auto-tags are recreated on every collect (idempotent); user-added tags survive reruns.
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/tags?category= | List all tags (filter by category: performer/source/genre/meta) |
| GET | /api/posts/{id}/tags | Tags attached to a single post |
| POST | /api/posts/{id}/tags | Attach a user-curated tag |
| DELETE | /api/posts/{id}/tags/{tag_id} | Detach a tag (auto or user) |
| POST | /api/tags/backfill | One-time pass to retroactively auto-tag the whole library |
GET /api/media already returns each post's tags: [{name, category}] array (batched join — O(1) extra queries per page).
The gallery filters now also accept nsfw=all|hide|only.
Health & Auth
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Public JSON: db, ffmpeg, scheduler, downloads_writable, version, auth_enabled |
| GET | /unlock | PIN entry screen (only when RMC_PIN is set) |
| POST | /unlock | Submit PIN; sets the signed session cookie |
| POST | /lock | Invalidate the session cookie immediately |
Database Schema
The SQLite database (media.db) stores all metadata:
CREATE TABLE posts (
    id TEXT PRIMARY KEY,
    subreddit TEXT NOT NULL,
    author TEXT,
    title TEXT,
    url TEXT NOT NULL,
    media_url TEXT,
    media_type TEXT,
    score INTEGER DEFAULT 0,
    created_utc REAL,
    downloaded_at TIMESTAMP,
    local_path TEXT,
    file_hash TEXT,
    permalink TEXT,
    source_type TEXT,        -- 'subreddit' or 'user'
    flair TEXT,
    nsfw INTEGER DEFAULT 0   -- Reddit's over_18 mirrored locally
);

CREATE TABLE favorites (
    post_id TEXT PRIMARY KEY,
    favorited_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (post_id) REFERENCES posts(id)
);

-- Tag taxonomy (Stash-inspired)
CREATE TABLE tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    category TEXT,           -- 'performer' | 'source' | 'genre' | 'meta'
    is_nsfw INTEGER DEFAULT 0,
    description TEXT,
    UNIQUE(name, category)
);

CREATE TABLE post_tags (
    post_id TEXT NOT NULL,
    tag_id INTEGER NOT NULL,
    source TEXT DEFAULT 'auto',  -- 'auto' | 'user'
    PRIMARY KEY (post_id, tag_id),
    FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE,
    FOREIGN KEY (tag_id) REFERENCES tags(id) ON DELETE CASCADE
);

-- Scheduler history (in-app cron alternative)
CREATE TABLE scheduler_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at TIMESTAMP,
    finished_at TIMESTAMP,
    status TEXT,             -- 'success' | 'error' | 'timeout' | 'running'
    posts_processed INTEGER DEFAULT 0,
    posts_downloaded INTEGER DEFAULT 0,
    error_message TEXT
);
The database uses SQLite WAL (PRAGMA journal_mode=WAL) for crash safety on power loss and concurrent reads while a collect is running.
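A minimal sketch of how WAL mode and hash-based deduplication might fit together with Python's sqlite3 module follows. The helper names (open_db, file_md5, is_duplicate) are hypothetical, and whether the real collector hashes before or after writing a file is not stated here, so treat the flow as an assumption.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_db(path: str) -> sqlite3.Connection:
    """Open the database and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    # WAL is stored in the database file, so it persists across connections.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos are never fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(conn: sqlite3.Connection, file_hash: str) -> bool:
    """True when some post already recorded this file_hash."""
    row = conn.execute(
        "SELECT 1 FROM posts WHERE file_hash = ? LIMIT 1", (file_hash,)
    ).fetchone()
    return row is not None
```

An index on posts(file_hash) would keep this lookup fast on large libraries; the schema above does not show one, so that is a suggestion rather than a description.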
Troubleshooting
Videos saving as .html
Cause: yt-dlp not installed or not in PATH
Solution:
pip install yt-dlp
export PATH="$HOME/.local/bin:$PATH"
Rate limited (429 errors)
Cause: Too many requests to Reddit
Solution: Increase download_delay_seconds or decrease requests_per_minute in config
Incomplete galleries
Cause: Gallery metadata not available from Reddit API
Solution: Verify the post still exists on Reddit
Web interface not starting
Cause: Port already in use or missing dependencies
Solution:
# Check if port is in use
lsof -i :8000
# Reinstall dependencies (editable)
pip install -e ".[dev]"
Project Structure
reddit-media-collector/
├── src/
│   ├── main.py                 # Collector entry point
│   ├── config.py               # Configuration dataclasses + rotating logger
│   ├── database.py             # SQLite (WAL) + tags taxonomy + auto-tagger
│   ├── downloader.py           # Downloader with retry + dedupe
│   ├── reddit_client.py        # Reddit JSON-API client
│   ├── sidecar.py              # Immich-compatible JSON sidecars
│   ├── extractors/             # URL extractors per host (reddit, imgur, gfycat)
│   └── web/                    # FastAPI app
│       ├── app.py              # App, lifespan, PIN-lock middleware
│       ├── auth.py             # Optional HTTP Basic auth
│       ├── session.py          # HMAC-signed PIN session cookie
│       ├── config_manager.py
│       ├── deps.py
│       ├── rate_limit.py       # Per-IP throttle dependency
│       ├── routers/
│       │   ├── config.py       # Config CRUD (HTMX-aware fragments)
│       │   ├── favorites.py
│       │   ├── health.py       # Public /health JSON
│       │   ├── media.py        # Gallery, file serving, cleanups
│       │   ├── scheduler.py    # In-app scheduler + collector control
│       │   ├── stats.py
│       │   └── tags.py         # Tag taxonomy CRUD + backfill
│       ├── static/
│       │   ├── css/app.css     # Extracted from inline (themed via CSS vars)
│       │   └── js/
│       │       ├── api.js      # Shared fetch helpers
│       │       └── app.js      # All app logic
│       └── templates/
│           ├── index.html      # Composition: extends layout + includes
│           ├── unlock.html     # PIN entry screen
│           └── partials/       # tab_*.html, _modal_*.html, _item_*.html, _tag.html
├── tests/                      # pytest (unit + API contract; 151 tests)
├── downloads/                  # Downloaded media (gitignored)
├── config.yaml                 # Configuration
├── pyproject.toml              # Project metadata + tooling config
├── .pre-commit-config.yaml     # ruff + mypy hooks
├── .github/workflows/
│   ├── ci.yml                  # lint, types, test, docker (per PR/push)
│   └── release.yml             # GHCR publish on every v* tag
├── Dockerfile
├── docker-compose.yml          # Local dev
├── docker-compose.synology.yml # NAS-flavored (image from GHCR)
└── README.md
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Disclaimer
This tool is for personal use only. Please respect Reddit's Terms of Service and the content creators' rights. Do not use this tool to redistribute copyrighted content.