reddit-media-collector/README.md
Richard Nixon a6ee86c4bb docs: refresh README covering v1.1.0 → v1.3.1
The documentation was stuck at v1.0; five releases later it needs a
general refresh.

Features
- Reorganized into Collection / Web dashboard / Operations.
- Adds Tag taxonomy, NSFW gate, Discreet mode, PIN lock, keyboard
  shortcuts, last_error banner, /health endpoint, modular HTMX+Alpine
  frontend.

Web Features
- Covers the new header toggles (NSFW, Discreet, Lock), NSFW filters
  in the gallery, keyboard shortcuts in the modal, tag chips.

Security & Limits
- New PIN lock subsection covering HMAC cookie semantics, idle timeout,
  and the /health and /static bypass.
- Rate-limit table gains a row for config + tag mutations
  (60/min per IP+path).

Env vars (Docker)
- RMC_PIN, RMC_PIN_TIMEOUT documented.

Synology DSM
- Step 4 mentions the PIN as a second factor when exposed publicly.
- New step 6: curl POST /api/tags/backfill to retroactively tag
  a pre-existing library.

API Reference
- New Tags section (5 endpoints) with a note that /api/media already
  includes tags per post.
- New Health & Auth section (/health, /unlock, /lock).
- Collector gains /api/collector/clear-error.
- Gallery filters gain nsfw=all|hide|only.

Database Schema
- Adds the posts.nsfw column and the tags + post_tags + scheduler_history
  tables, with category/source comments.
- Note on PRAGMA journal_mode=WAL.

Project Structure
- Updates the tree: routers/health.py, routers/tags.py, session.py,
  static/css/app.css, static/js/app.js, templates/partials/,
  unlock.html, docker-compose.synology.yml, .github/workflows/release.yml.
- Short comments per file.

Docs only: no code change, no version bump.
2026-05-17 21:54:32 +01:00


Reddit Media Collector


A self-hosted media collector for Reddit that automatically downloads images, videos, and GIFs from your favourite subreddits and users. It includes a built-in web interface for management and writes Immich-compatible JSON sidecars for photo organisation.

Features

Collection

  • Multi-source — subreddits and user profiles
  • Smart deduplication — MD5 hash-based; never downloads the same file twice
  • Gallery support — handles Reddit galleries with multiple images
  • Multiple extractors — Reddit, Imgur, Gfycat, Redgifs
  • No API keys — uses Reddit's public JSON endpoints
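
The hash-based deduplication above can be illustrated with a minimal sketch. The function names `md5_of_file` and `is_duplicate` and the in-memory `known_hashes` set are hypothetical; the schema further down suggests the project keeps the hash in `posts.file_hash`, but its actual lookup code is not shown here.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large videos are never read whole into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, known_hashes: set[str]) -> bool:
    """Return True if this content was seen before; otherwise record its hash."""
    file_hash = md5_of_file(path)
    if file_hash in known_hashes:
        return True
    known_hashes.add(file_hash)
    return False
```

Because the check is on content rather than URL, the same image reposted to two subreddits would be detected as a duplicate under this approach.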

Web dashboard

  • Modular HTMX + Alpine.js frontend (vanilla, zero build step)
  • Gallery with infinite scroll, multi-select, bulk delete, sorting/filtering
  • Tag taxonomy (Stash-inspired) — auto-tags every post by subreddit / performer / genre / nsfw; manual tags preserved across reruns; colored chips on every card
  • NSFW gate — blur thumbnails by default; toggle 👁/🙈 in the header (persisted in localStorage)
  • Discreet mode 🤫: compact thumbnails that guard against shoulder surfing; auto-activates after 60 s of idle
  • PIN lock (optional) — HMAC-signed session cookie on top of Basic Auth; idle timeout configurable
  • Favourites + per-author view + sync favourites to user targets
  • Blacklist — authors, subreddits, title keywords, domains
  • Scheduler — interval or specific times, run history, "run now" button
  • last_error banner — failed scheduled runs surface at the top until dismissed
  • Keyboard shortcuts in the modal: j/k navigate, f favourite, b blacklist author, Esc close
  • /health endpoint — public JSON with DB/ffmpeg/scheduler/writable status, ready for Container Manager monitors

Operations

  • Docker — single-image deployment, published to ghcr.io/richardnixondev/reddit-media-collector
  • Synology DSM Container Manager ready (one-click deploy via compose)
  • SQLite WAL — crash-safe on power loss, fast concurrent reads
  • Rotating log file (10 MB × 5 backups) — caps disk usage on NAS
  • HTTP Basic Auth (optional) + per-IP rate limiting on every mutation
  • Immich integration — JSON sidecar with metadata for seamless import
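
As an illustration of the rotating log file mentioned above, here is a sketch using the standard library's `RotatingFileHandler`. The logger name `collector_sketch`, the helper name, and the reading of "10 MB" as 10 MiB are assumptions; the project's own setup lives in `src/config.py` and may differ.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(log_path: str, level: str = "INFO") -> logging.Logger:
    """Attach a size-capped file handler; call once per process to avoid duplicate handlers."""
    logger = logging.getLogger("collector_sketch")  # hypothetical logger name
    logger.setLevel(level)
    handler = RotatingFileHandler(
        log_path,
        maxBytes=10 * 1024 * 1024,  # rotate once the active file reaches 10 MiB
        backupCount=5,              # keep <log>.1 through <log>.5, delete older ones
        encoding="utf-8",
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With these settings the log footprint stays at roughly 60 MiB in the worst case (one active file plus five backups).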

Quick Start

Prerequisites

  • Python 3.11 or higher
  • FFmpeg (optional, for video thumbnails)
  • yt-dlp (optional, for Gfycat/Redgifs support)

Installation

  1. Clone the repository

    git clone https://github.com/richardnixondev/reddit-media-collector.git
    cd reddit-media-collector
    
  2. Create virtual environment

    python3 -m venv venv
    source venv/bin/activate  # Linux/macOS
    # or
    .\venv\Scripts\activate   # Windows
    
  3. Install the package (editable, with dev tooling)

    pip install -e ".[dev]"
    
  4. Configure

    cp config.yaml.example config.yaml
    # Edit config.yaml with your preferences
    
  5. Run the collector

    python -m src.main
    

Development

pip install -e ".[dev]"
pre-commit install

# Tests + coverage
pytest -v
pytest --cov=src --cov-report=term-missing

# Lint, format, type-check
ruff check src/ tests/
ruff format src/ tests/
mypy src/

Configuration

Create a config.yaml file based on the example:

targets:
  subreddits:
    - name: "earthporn"
      limit: 100           # Posts per request (max 100)
      sort: "top"          # hot, new, top, rising
      time_filter: "week"  # For "top" sort: hour, day, week, month, year, all

    - name: "pics"
      limit: 50
      sort: "new"

  users:
    - name: "username"
      limit: 100

download:
  output_dir: "./downloads"
  media_types:
    - "image"
    - "video"
    - "gif"
  min_score: 10              # Minimum upvotes required
  skip_nsfw: false           # Skip NSFW content
  max_file_size_mb: 200      # Maximum file size
  flat_structure: true       # All files in a single folder
  generate_sidecar: true     # Generate .json for Immich
  videos_only_from_favorites: false  # Only download videos from favourited users

rate_limit:
  requests_per_minute: 20
  download_delay_seconds: 2

logging:
  level: "INFO"
  file: "collector.log"

blacklist:
  authors: []      # Usernames to ignore
  subreddits: []   # Subreddits to ignore
  title_keywords: []
  domains: []

Web Interface

The collector includes a FastAPI-powered web dashboard for easy management.

Starting the Web Server

# Using the run script
./run_web.sh

# Or directly with uvicorn
uvicorn src.web.app:app --host 0.0.0.0 --port 8000

Access the dashboard at http://localhost:8000

Web Features

  • Dashboard — collection statistics, trends chart, top authors, recent downloads
  • Gallery — infinite scroll, filtering (subreddit / author / type / favourites / NSFW), sorting, bulk select, multi-delete, tag chips per card, NSFW blur with hover/global toggle
  • Authors — grid grouped by author with per-author modal
  • Sources — add/remove subreddits and users (HTMX, no page reload)
  • Settings — download options, blacklist, scheduler config, individual collection
  • Scheduler — interval or specific times, run history, "run now"
  • Collector control — manual trigger or per-target collection
  • Header toggles — NSFW 👁/🙈, Discreet 🤫, 🔒 Lock (when PIN enabled)
  • Keyboard shortcuts — in the media modal: j/k next/prev, f favourite, b blacklist author, Esc close

Security & Limits

HTTP Basic Auth (optional)

Set both RMC_AUTH_USER and RMC_AUTH_PASS to require Basic credentials on every route except /health. With either variable unset, the API stays public — appropriate for trusted local/intranet deployments.

RMC_AUTH_USER=alice RMC_AUTH_PASS=s3cret uvicorn src.web.app:app

PIN lock (optional, UI-only)

Set RMC_PIN (4-6 digits) to gate the web UI behind a numeric PIN screen on top of Basic Auth. This is useful when other people may use the same device and you would rather not share the Basic Auth password with them.

  • Session cookie is signed with HMAC-SHA256 using a key generated at boot — restart invalidates every session.
  • Idle window: 10 min default; override with RMC_PIN_TIMEOUT (seconds).
  • /health, /static/*, /favicon bypass the lock so monitors and CSS keep working.
  • Clicking "🔒 Lock" in the header locks immediately.

RMC_PIN=1234 RMC_PIN_TIMEOUT=900 uvicorn src.web.app:app

Rate limiting (per IP, per endpoint)

Heavy and mutation endpoints are throttled in-process to prevent runaway clients:

| Endpoint | Limit |
|---|---|
| GET /api/stats, /api/stats/enhanced | 30 / minute |
| POST /api/collector/run | 3 / minute |
| POST /api/media/cleanup-blacklist | 5 / minute |
| POST /api/media/cleanup-by-type | 5 / minute |
| POST/PUT/DELETE /api/{subreddits,users,blacklist/*,settings/*,posts/*/tags} | 60 / minute |

Limits are per (client IP, route path) and reset on a sliding window. For multi-worker deployments, swap the in-memory buckets for a shared backend.

  • The gallery keeps at most 500 items in memory/DOM at once; older items are evicted as you scroll. Use filters or sort to reach older media.
  • Status polling pauses when the tab is hidden.
  • Filter dropdowns are debounced (200 ms) so rapid changes only fire one request.
  • Bulk DELETE runs 5 requests in parallel.

Running as a Service (systemd)

# Copy the service file
sudo cp reddit-collector-web.service /etc/systemd/system/

# Reload systemd and enable service
sudo systemctl daemon-reload
sudo systemctl enable reddit-collector-web
sudo systemctl start reddit-collector-web

# Check status
sudo systemctl status reddit-collector-web

# View logs
sudo journalctl -u reddit-collector-web -f

Docker

Using Docker Compose

# Build and run
docker-compose up -d

# View logs
docker-compose logs -f

# Stop
docker-compose down

Docker Configuration

The docker-compose.yml mounts the following volumes:

  • ./config.yaml → /app/config.yaml (must be writable; the UI mutates it)
  • ./downloads → /app/downloads (media files)
  • ./data → /app/data (SQLite DB, scheduler DB/config; survives upgrades)

Relevant environment variables (all optional, with sensible defaults inside the image):

| Variable | Default | Purpose |
|---|---|---|
| RMC_DOWNLOAD_DIR | /app/downloads | Where media files are written |
| RMC_DB_PATH | /app/data/media.db | Main SQLite database |
| RMC_SCHEDULER_DB | /app/scheduler.db | APScheduler jobstore; set to /app/data/scheduler.db in containers so history survives restarts |
| RMC_SCHEDULER_CONFIG | /app/scheduler_config.yaml | Scheduler interval/cron config; same advice as above |
| RMC_CONFIG_PATH | /app/config.yaml | YAML with subreddits/users/blacklist |
| RMC_TIMEZONE | UTC | Timezone used by the scheduler |
| RMC_AUTH_USER / RMC_AUTH_PASS | unset | Enable HTTP Basic Auth on every route except /health when both are set |
| RMC_PIN | unset | 4-6 digit PIN gating the web UI (HMAC cookie); leave unset to disable |
| RMC_PIN_TIMEOUT | 600 | Idle seconds before the PIN cookie expires |

Synology DSM Deployment

Tested on DSM 7.2 with Container Manager on x86_64 Plus models. The published image lives at ghcr.io/richardnixondev/reddit-media-collector:latest.

  1. Prepare the host directories (one-time, via SSH or File Station):

    sudo mkdir -p /volume1/docker/reddit-media-collector/{downloads,data}
    sudo cp /path/to/your/config.yaml /volume1/docker/reddit-media-collector/config.yaml
    
  2. Create the project in Container Manager:

    • Open Container Manager → Project → Create.
    • Project name: reddit-media-collector.
    • Path: /volume1/docker/reddit-media-collector.
    • Source: Create docker-compose.yml, paste the contents of docker-compose.synology.yml (adjust RMC_TIMEZONE to your zone).
    • Next → Build. DSM runs docker compose pull followed by docker compose up -d.
  3. Access the UI: http://<nas-ip>:8000. Downloaded media appears in /volume1/docker/reddit-media-collector/downloads/, which you can point Synology Photos or an Immich instance at.

  4. HTTPS via DSM Reverse Proxy (optional, recommended if exposed): Control Panel → Login Portal → Advanced → Reverse Proxy → Create. Source: reddit.yourdomain.com (HTTPS:443). Destination: localhost:8000 (HTTP). Attach a Let's Encrypt cert. When exposing publicly, uncomment RMC_AUTH_USER/RMC_AUTH_PASS in the compose file — and consider adding RMC_PIN=NNNN for a second factor on the UI.

  5. Updating: Container Manager → Project → reddit-media-collector → Action → Build re-pulls latest. The data/ volume keeps the database, scheduler history, tags and config.

  6. First boot (one-time, only if you already had a library): retroactively tag every existing post (subreddit / performer / genre / nsfw) with:

    curl -X POST http://<nas-ip>:8000/api/tags/backfill
    

    Idempotent — safe to re-run.

Permissions note: if the container can't write to the bind mounts, check the owner of /volume1/docker/reddit-media-collector/ with ls -ln. Either chown it to a user the container can write as, or set user: "UID:GID" in the compose (commented out at the bottom of docker-compose.synology.yml).

File Naming Convention

Downloaded files follow a descriptive naming pattern:

{subreddit}_{author}_{YYYYMMDD}_{HHmmss}_{post_id}[_{gallery_index}].{ext}

Examples:

earthporn_photographer123_20260118_143052_abc123.jpg
pics_user456_20260118_091523_xyz789_1.jpg  (gallery image 1)
pics_user456_20260118_091523_xyz789_2.jpg  (gallery image 2)

Immich Integration

The collector generates JSON sidecar files compatible with Immich:

{
  "dateTimeOriginal": "2026-01-18T14:30:52+00:00",
  "description": "Post title from Reddit",
  "albums": ["r/earthporn"],
  "tags": ["reddit", "earthporn", "image"],
  "rating": 4,
  "people": ["photographer123"],
  "externalUrl": "https://reddit.com/r/earthporn/comments/abc123/title"
}

Rating System

Ratings are automatically assigned based on post score:

| Score | Rating |
|---|---|
| 0-9 | 1 star |
| 10-49 | 2 stars |
| 50-199 | 3 stars |
| 200-999 | 4 stars |
| 1000+ | 5 stars |
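
The score bands translate directly into a threshold lookup. The function name below is hypothetical, and treating negative scores as 1 star is an assumption, since the table starts at 0.

```python
def score_to_rating(score: int) -> int:
    """Map a Reddit post score to a 1-5 star rating using the documented bands."""
    thresholds = [(1000, 5), (200, 4), (50, 3), (10, 2)]
    for minimum, stars in thresholds:
        if score >= minimum:
            return stars
    return 1  # 0-9, and (by assumption) anything below zero


print(score_to_rating(250))
```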

Importing to Immich

Point Immich to your downloads folder as an external library, or use the Immich CLI:

immich upload --album-name "Reddit Collection" ./downloads/

Scheduled Collection

Using Cron

# Edit crontab
crontab -e

# Add entry (runs every 6 hours)
0 */6 * * * /path/to/reddit-media-collector/run_collector.sh

Example run_collector.sh

#!/bin/bash
cd /path/to/reddit-media-collector || exit 1
source venv/bin/activate
export PATH="$HOME/.local/bin:$PATH"  # For yt-dlp
timeout 4h python -m src.main >> cron.log 2>&1
status=$?  # Capture now; the $(date) substitution below would overwrite $?
echo "$(date): Collector finished with exit code $status" >> cron.log

API Reference

The web interface exposes a REST API. FastAPI also serves interactive docs at /docs and OpenAPI at /openapi.json.

Configuration

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/config | Full config snapshot |
| GET | /api/subreddits | List configured subreddits |
| POST | /api/subreddits | Add a subreddit |
| DELETE | /api/subreddits/{name} | Remove a subreddit |
| GET | /api/users | List configured users |
| POST | /api/users | Add a user |
| DELETE | /api/users/{name} | Remove a user |
| GET | /api/settings | Download + rate-limit settings |
| PUT | /api/settings | Update settings |

Media

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/media | List downloaded media (paginated, filterable) |
| GET | /api/media/{id}/info | Get media details |
| DELETE | /api/media/{id}?blacklist_author=&blacklist_subreddit= | Delete media file (and optionally blacklist) |
| GET | /api/media/subreddits[?limit&offset] | Subreddits with downloaded content |
| GET | /api/media/authors[?limit&offset] | Authors with downloaded content |
| GET | /api/media/file/{filename:path} | Serve raw file (Range-aware for video) |
| GET | /api/media/thumb/{filename:path} | Serve / generate video thumbnail |
| GET | /api/media/blacklist-preview | Files affected by current blacklist |
| POST | /api/media/cleanup-blacklist | Delete blacklisted media |
| GET | /api/media/cleanup-preview?media_type= | Files affected by type cleanup |
| POST | /api/media/cleanup-by-type?media_type= | Delete by media type |

Blacklist

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/blacklist | Full blacklist |
| POST | /api/blacklist/{authors,subreddits,keywords,domains} | Add entry |
| DELETE | /api/blacklist/{kind}/{name} | Remove entry |

Favorites & Authors

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/favorites | Favorited posts (paginated) |
| POST | /api/favorites/{post_id} | Add to favorites |
| DELETE | /api/favorites/{post_id} | Remove from favorites |
| GET | /api/favorites/authors[?limit&offset] | Distinct favorited authors |
| POST | /api/favorites/sync-users | Add favorite authors as user targets |
| GET | /api/authors | Authors with stats (paginated, sortable, favorites filter) |
| GET | /api/authors/{author}/media | Media for one author |

Stats

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/stats | Disk + counts (30 s server-side cache) |
| GET | /api/stats/enhanced | Trends, top authors, scores |
| GET | /api/stats/recent | Recent downloads |

Collector & Scheduler

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/collector/run | Trigger collection run |
| GET | /api/collector/status | Collector status (includes last_error) |
| POST | /api/collector/clear-error | Dismiss the last_error banner |
| POST | /api/collect/individual | Collect from a single subreddit/user |
| GET | /api/collect/targets | List available targets |
| GET | /api/scheduler/status | Scheduler state + next run |
| PUT | /api/scheduler/config | Update schedule |
| GET | /api/scheduler/history | Past scheduler runs |
| POST | /api/scheduler/run-now | Execute schedule immediately |

Tags

Stash-style taxonomy. Auto-tags are recreated on every collect (idempotent); user-added tags survive reruns.

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/tags?category= | List all tags (filter by category: performer/source/genre/meta) |
| GET | /api/posts/{id}/tags | Tags attached to a single post |
| POST | /api/posts/{id}/tags | Attach a user-curated tag |
| DELETE | /api/posts/{id}/tags/{tag_id} | Detach a tag (auto or user) |
| POST | /api/tags/backfill | One-time pass to retroactively auto-tag the whole library |

GET /api/media already returns each post's tags: [{name, category}] array (batched join — O(1) extra queries per page).
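
The batched join mentioned above could look roughly like the sketch below: one extra query per page, regardless of how many posts the page holds, instead of one query per post. The function name `attach_tags` is hypothetical; the table and column names follow the schema in the Database Schema section.

```python
import sqlite3


def attach_tags(conn: sqlite3.Connection, post_ids: list[str]) -> dict[str, list[dict[str, str]]]:
    """Fetch tags for a whole page of posts with a single IN (...) query."""
    tags: dict[str, list[dict[str, str]]] = {pid: [] for pid in post_ids}
    if not post_ids:
        return tags
    placeholders = ",".join("?" * len(post_ids))  # one bound parameter per post id
    rows = conn.execute(
        "SELECT pt.post_id, t.name, t.category "
        "FROM post_tags pt JOIN tags t ON t.id = pt.tag_id "
        f"WHERE pt.post_id IN ({placeholders}) "
        "ORDER BY t.name",
        post_ids,
    )
    for post_id, name, category in rows:
        tags[post_id].append({"name": name, "category": category})
    return tags
```

Posts without tags still appear in the result with an empty list, so the caller can merge the mapping into the page without extra checks.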

The gallery filters now also accept nsfw=all|hide|only.

Health & Auth

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Public JSON: db, ffmpeg, scheduler, downloads_writable, version, auth_enabled |
| GET | /unlock | PIN entry screen (only when RMC_PIN is set) |
| POST | /unlock | Submit PIN; sets the signed session cookie |
| POST | /lock | Invalidate the session cookie immediately |

Database Schema

The SQLite database (media.db) stores all metadata:

CREATE TABLE posts (
    id TEXT PRIMARY KEY,
    subreddit TEXT NOT NULL,
    author TEXT,
    title TEXT,
    url TEXT NOT NULL,
    media_url TEXT,
    media_type TEXT,
    score INTEGER DEFAULT 0,
    created_utc REAL,
    downloaded_at TIMESTAMP,
    local_path TEXT,
    file_hash TEXT,
    permalink TEXT,
    source_type TEXT,         -- 'subreddit' or 'user'
    flair TEXT,
    nsfw INTEGER DEFAULT 0    -- Reddit's over_18 mirrored locally
);

CREATE TABLE favorites (
    post_id TEXT PRIMARY KEY,
    favorited_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (post_id) REFERENCES posts(id)
);

-- Tag taxonomy (Stash-inspired)
CREATE TABLE tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    category TEXT,            -- 'performer' | 'source' | 'genre' | 'meta'
    is_nsfw INTEGER DEFAULT 0,
    description TEXT,
    UNIQUE(name, category)
);

CREATE TABLE post_tags (
    post_id TEXT NOT NULL,
    tag_id INTEGER NOT NULL,
    source TEXT DEFAULT 'auto',  -- 'auto' | 'user'
    PRIMARY KEY (post_id, tag_id),
    FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE,
    FOREIGN KEY (tag_id) REFERENCES tags(id) ON DELETE CASCADE
);

-- Scheduler history (in-app cron alternative)
CREATE TABLE scheduler_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at TIMESTAMP,
    finished_at TIMESTAMP,
    status TEXT,              -- 'success' | 'error' | 'timeout' | 'running'
    posts_processed INTEGER DEFAULT 0,
    posts_downloaded INTEGER DEFAULT 0,
    error_message TEXT
);

The database uses SQLite WAL (PRAGMA journal_mode=WAL) for crash safety on power loss and concurrent reads while a collect is running.
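
For reference, enabling WAL from Python takes one pragma per database file. The sketch below is an assumption about how a connection helper might look, not the code in `src/database.py`; the `foreign_keys` pragma is included because SQLite leaves foreign-key enforcement off by default, and the `ON DELETE CASCADE` clauses in `post_tags` only fire when it is on.

```python
import sqlite3


def open_wal_connection(db_path: str) -> sqlite3.Connection:
    """Open the database, switch it to WAL, and enable foreign-key enforcement."""
    conn = sqlite3.connect(db_path)
    # The pragma returns the journal mode actually in effect; WAL persists in the file.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"could not enable WAL, journal mode is {mode!r}")
    conn.execute("PRAGMA foreign_keys=ON")  # per-connection setting, off by default
    return conn
```

WAL is not available for in-memory databases, so the check above would raise for `":memory:"`; it is meant for an on-disk file such as media.db.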

Troubleshooting

Videos saving as .html

Cause: yt-dlp not installed or not in PATH

Solution:

pip install yt-dlp
export PATH="$HOME/.local/bin:$PATH"

Rate limited (429 errors)

Cause: Too many requests to Reddit

Solution: Increase download_delay_seconds or decrease requests_per_minute in config

Incomplete galleries

Cause: Gallery metadata not available from Reddit API

Solution: Verify the post still exists on Reddit

Web interface not starting

Cause: Port already in use or missing dependencies

Solution:

# Check if port is in use
lsof -i :8000

# Reinstall dependencies (editable)
pip install -e ".[dev]"

Project Structure

reddit-media-collector/
├── src/
│   ├── main.py                # Collector entry point
│   ├── config.py              # Configuration dataclasses + rotating logger
│   ├── database.py            # SQLite (WAL) + tags taxonomy + auto-tagger
│   ├── downloader.py          # Downloader with retry + dedupe
│   ├── reddit_client.py       # Reddit JSON-API client
│   ├── sidecar.py             # Immich-compatible JSON sidecars
│   ├── extractors/            # URL extractors per host (reddit, imgur, gfycat)
│   └── web/                   # FastAPI app
│       ├── app.py             # App, lifespan, PIN-lock middleware
│       ├── auth.py            # Optional HTTP Basic auth
│       ├── session.py         # HMAC-signed PIN session cookie
│       ├── config_manager.py
│       ├── deps.py
│       ├── rate_limit.py      # Per-IP throttle dependency
│       ├── routers/
│       │   ├── config.py      # Config CRUD (HTMX-aware fragments)
│       │   ├── favorites.py
│       │   ├── health.py      # Public /health JSON
│       │   ├── media.py       # Gallery, file serving, cleanups
│       │   ├── scheduler.py   # In-app scheduler + collector control
│       │   ├── stats.py
│       │   └── tags.py        # Tag taxonomy CRUD + backfill
│       ├── static/
│       │   ├── css/app.css    # Extracted from inline (themed via CSS vars)
│       │   └── js/
│       │       ├── api.js     # Shared fetch helpers
│       │       └── app.js     # All app logic
│       └── templates/
│           ├── index.html     # Composition: extends layout + includes
│           ├── unlock.html    # PIN entry screen
│           └── partials/      # tab_*.html, _modal_*.html, _item_*.html, _tag.html
├── tests/                     # pytest (unit + API contract; 151 tests)
├── downloads/                 # Downloaded media (gitignored)
├── config.yaml                # Configuration
├── pyproject.toml             # Project metadata + tooling config
├── .pre-commit-config.yaml    # ruff + mypy hooks
├── .github/workflows/
│   ├── ci.yml                 # lint, types, test, docker (per PR/push)
│   └── release.yml            # GHCR publish on every v* tag
├── Dockerfile
├── docker-compose.yml         # Local dev
├── docker-compose.synology.yml # NAS-flavored (image from GHCR)
└── README.md

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This tool is for personal use only. Please respect Reddit's Terms of Service and the content creators' rights. Do not use this tool to redistribute copyrighted content.

Acknowledgments

  • Reddit for their public JSON API
  • Immich for the excellent self-hosted photo management
  • yt-dlp for video extraction support