Richard Nixon a6ee86c4bb docs: refresh README covering v1.1.0 → v1.3.1

The documentation was stuck at v1.0; five releases later it needs a
general refresh.

Features
- Reorganised into Collection / Web dashboard / Operations.
- Adds Tag taxonomy, NSFW gate, Discreet mode, PIN lock, keyboard
  shortcuts, last_error banner, /health endpoint, modular HTMX+Alpine
  frontend.

Web Features
- Covers the new header toggles (NSFW, Discreet, Lock), NSFW filters
  in the gallery, keyboard shortcuts in the modal, tag chips.

Security & Limits
- New PIN lock subsection covering HMAC cookie semantics, idle timeout,
  and the /health and /static bypass.
- Rate-limit table gains a row for config and tag mutations
  (60/min per IP+path).

Env vars (Docker)
- RMC_PIN and RMC_PIN_TIMEOUT are now documented.

Synology DSM
- Step 4 mentions the PIN as a second factor when exposed publicly.
- New step 6: curl POST /api/tags/backfill to retroactively tag a
  pre-existing library.

API Reference
- New Tags section (5 endpoints) with a note that /api/media already
  includes tags per post.
- New Health & Auth section (/health, /unlock, /lock).
- Collector gains /api/collector/clear-error.
- Gallery filters gain nsfw=all|hide|only.

Database Schema
- Adds the posts.nsfw column and the tags + post_tags + scheduler_history
  tables, with category/source comments.
- Note about PRAGMA journal_mode=WAL.

Project Structure
- Updates the tree: routers/health.py, routers/tags.py, session.py,
  static/css/app.css, static/js/app.js, templates/partials/,
  unlock.html, docker-compose.synology.yml, .github/workflows/release.yml.
- Short comments per file.

Docs only: no code changes, no version bump.

2026-05-17 21:54:32 +01:00


Reddit Media Collector

A powerful, self-hosted media collector for Reddit that automatically downloads images, videos, and GIFs from your favourite subreddits and users. Features a built-in web interface for management and seamless integration with Immich for photo organisation.

Features

Collection

Multi-source — subreddits and user profiles
Smart deduplication — MD5 hash-based; never downloads the same file twice
Gallery support — handles Reddit galleries with multiple images
Multiple extractors — Reddit, Imgur, Gfycat, Redgifs
No API keys — uses Reddit's public JSON endpoints
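
As a rough illustration of the hash-based deduplication listed above, the sketch below hashes each file in chunks and checks the digest against a set of known hashes. The helper names `md5_of_file` and `is_duplicate` and the in-memory set are assumptions made for this example; the collector presumably persists hashes instead (the schema has a `file_hash` column), and its actual code may differ.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path: Path, seen_hashes: set[str]) -> bool:
    """Return True if the file's hash is already known; otherwise record it."""
    file_hash = md5_of_file(path)
    if file_hash in seen_hashes:
        return True
    seen_hashes.add(file_hash)
    return False
```

Because the comparison is on content rather than URL, the same image reposted under a different link would be caught under this approach.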

Web dashboard

Modular HTMX + Alpine.js frontend (vanilla, zero build step)
Gallery with infinite scroll, multi-select, bulk delete, sorting/filtering
Tag taxonomy (Stash-inspired) — auto-tags every post by subreddit / performer / genre / nsfw; manual tags preserved across reruns; colored chips on every card
NSFW gate — blur thumbnails by default; toggle 👁/🙈 in the header (persisted in localStorage)
Discreet mode 🤫: compact thumbnails that guard against shoulder surfing; auto-activates after 60 s of inactivity
PIN lock (optional) — HMAC-signed session cookie on top of Basic Auth; idle timeout configurable
Favourites + per-author view + sync favourites to user targets
Blacklist — authors, subreddits, title keywords, domains
Scheduler — interval or specific times, run history, "run now" button
last_error banner — failed scheduled runs surface at the top until dismissed
Keyboard shortcuts in the modal: j/k navigate, f favourite, b blacklist author, Esc close
/health endpoint — public JSON with DB/ffmpeg/scheduler/writable status, ready for Container Manager monitors

Operations

Docker — single-image deployment, published to ghcr.io/richardnixondev/reddit-media-collector
Synology DSM Container Manager ready (one-click deploy via compose)
SQLite WAL — crash-safe on power loss, fast concurrent reads
Rotating log file (10 MB × 5 backups) — caps disk usage on NAS
HTTP Basic Auth (optional) + per-IP rate limiting on every mutation
Immich integration — JSON sidecar with metadata for seamless import
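
The rotating log mentioned above (10 MB × 5 backups) can be approximated with the standard library. This is a minimal sketch, not the project's own logger setup: the function name `build_rotating_logger`, the log format, and the choice to read "10 MB" as 10 × 1024 × 1024 bytes are all assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler

MAX_LOG_BYTES = 10 * 1024 * 1024  # assumed interpretation of "10 MB"
BACKUP_COUNT = 5


def build_rotating_logger(log_path: str, name: str = "collector") -> logging.Logger:
    """Attach a size-capped rotating file handler to a named logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(
        log_path,
        maxBytes=MAX_LOG_BYTES,
        backupCount=BACKUP_COUNT,
        encoding="utf-8",
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With these settings the active file plus five backups stay at roughly 60 MB in total, which is the point of the cap on a NAS.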

Quick Start

Prerequisites

Python 3.11 or higher
FFmpeg (optional, for video thumbnails)
yt-dlp (optional, for Gfycat/Redgifs support)

Installation

Clone the repository

git clone https://github.com/richardnixondev/reddit-media-collector.git
cd reddit-media-collector

Create virtual environment

python3 -m venv venv
source venv/bin/activate  # Linux/macOS
# or
.\venv\Scripts\activate   # Windows

Install the package (editable, with dev tooling)
```
pip install -e ".[dev]"
```

Configure

cp config.yaml.example config.yaml
# Edit config.yaml with your preferences

Run the collector
```
python -m src.main
```

Development

pip install -e ".[dev]"
pre-commit install

# Tests + coverage
pytest -v
pytest --cov=src --cov-report=term-missing

# Lint, format, type-check
ruff check src/ tests/
ruff format src/ tests/
mypy src/

Configuration

Create a config.yaml file based on the example:

targets:
  subreddits:
    - name: "earthporn"
      limit: 100           # Posts per request (max 100)
      sort: "top"          # hot, new, top, rising
      time_filter: "week"  # For "top" sort: hour, day, week, month, year, all

    - name: "pics"
      limit: 50
      sort: "new"

  users:
    - name: "username"
      limit: 100

download:
  output_dir: "./downloads"
  media_types:
    - "image"
    - "video"
    - "gif"
  min_score: 10              # Minimum upvotes required
  skip_nsfw: false           # Skip NSFW content
  max_file_size_mb: 200      # Maximum file size
  flat_structure: true       # All files in a single folder
  generate_sidecar: true     # Generate .json for Immich
  videos_only_from_favorites: false  # Only download videos from favourited users

rate_limit:
  requests_per_minute: 20
  download_delay_seconds: 2

logging:
  level: "INFO"
  file: "collector.log"

blacklist:
  authors: []      # Usernames to ignore
  subreddits: []   # Subreddits to ignore
  title_keywords: []
  domains: []
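
The constraints noted in the comments above (limit capped at 100, a fixed set of sort modes and time filters) could be checked before a run. The following is a hypothetical validator, written as a sketch: `validate_subreddit_target` is not part of the project, and the defaults it assumes for missing keys (limit 100, sort "hot") are guesses.

```python
VALID_SORTS = {"hot", "new", "top", "rising"}
VALID_TIME_FILTERS = {"hour", "day", "week", "month", "year", "all"}
MAX_LIMIT = 100


def validate_subreddit_target(target: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the entry looks valid."""
    problems = []
    if not target.get("name"):
        problems.append("name is required")
    limit = target.get("limit", MAX_LIMIT)  # assumed default
    if not isinstance(limit, int) or not 1 <= limit <= MAX_LIMIT:
        problems.append(f"limit must be an integer between 1 and {MAX_LIMIT}")
    sort = target.get("sort", "hot")  # assumed default
    if sort not in VALID_SORTS:
        problems.append(f"sort must be one of {sorted(VALID_SORTS)}")
    time_filter = target.get("time_filter")
    if time_filter is not None and time_filter not in VALID_TIME_FILTERS:
        problems.append(f"time_filter must be one of {sorted(VALID_TIME_FILTERS)}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a config entry at once.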

Web Interface

The collector includes a FastAPI-powered web dashboard for easy management.

Starting the Web Server

# Using the run script
./run_web.sh

# Or directly with uvicorn
uvicorn src.web.app:app --host 0.0.0.0 --port 8000

Access the dashboard at http://localhost:8000

Web Features

Dashboard — collection statistics, trends chart, top authors, recent downloads
Gallery — infinite scroll, filtering (subreddit / author / type / favourites / NSFW), sorting, bulk select, multi-delete, tag chips per card, NSFW blur with hover/global toggle
Authors — grid grouped by author with per-author modal
Sources — add/remove subreddits and users (HTMX, no page reload)
Settings — download options, blacklist, scheduler config, individual collection
Scheduler — interval or specific times, run history, "run now"
Collector control — manual trigger or per-target collection
Header toggles — NSFW 👁/🙈, Discreet 🤫, 🔒 Lock (when PIN enabled)
Keyboard shortcuts — in the media modal: j/k next/prev, f favourite, b blacklist author, Esc close

Security & Limits

HTTP Basic Auth (optional)

Set both RMC_AUTH_USER and RMC_AUTH_PASS to require Basic credentials on every route except /health. With either variable unset, the API stays public — appropriate for trusted local/intranet deployments.

RMC_AUTH_USER=alice RMC_AUTH_PASS=s3cret uvicorn src.web.app:app
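
To make the "both variables must be set" rule concrete, here is a small client-side sketch. The helpers `basic_auth_enabled` and `build_request` are illustrative names only and are not taken from the project; the Authorization header format is standard HTTP Basic.

```python
import base64
import urllib.request


def basic_auth_enabled(env: dict[str, str]) -> bool:
    """Mirror the documented rule: auth is on only when both variables are set."""
    return bool(env.get("RMC_AUTH_USER")) and bool(env.get("RMC_AUTH_PASS"))


def build_request(url: str, user: str, password: str, method: str = "GET") -> urllib.request.Request:
    """Build a request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        method=method,
        headers={"Authorization": f"Basic {token}"},
    )
```

A script could then call `urllib.request.urlopen(build_request("http://localhost:8000/api/stats", "alice", "s3cret"))` against a running instance.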

PIN lock (optional, UI-only)

Set RMC_PIN (4-6 digits) to gate the web UI behind a numeric PIN screen on top of Basic Auth. Useful when others might use the device but you don't want to leak the API password.

Session cookie is signed with HMAC-SHA256 using a key generated at boot — restart invalidates every session.
Idle window: 10 min default; override with RMC_PIN_TIMEOUT (seconds).
/health, /static/*, /favicon bypass the lock so monitors and CSS keep working.
Clicking "🔒 Lock" in the header locks immediately.

RMC_PIN=1234 RMC_PIN_TIMEOUT=900 uvicorn src.web.app:app
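
The cookie semantics described above (HMAC-SHA256, key generated at boot, idle timeout) can be sketched as follows. This is an assumption-laden illustration rather than the contents of `session.py`: the cookie layout `<timestamp>.<signature>` and the function names `sign_session` and `verify_session` are invented for the example.

```python
import hashlib
import hmac
import secrets

# Generated once per process, so a restart invalidates every existing cookie.
SECRET_KEY = secrets.token_bytes(32)
IDLE_TIMEOUT_SECONDS = 600  # documented default for RMC_PIN_TIMEOUT


def _signature(payload: str, key: bytes) -> str:
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_session(last_seen: float, key: bytes = SECRET_KEY) -> str:
    """Return a cookie value of the (hypothetical) form '<timestamp>.<hex signature>'."""
    payload = str(int(last_seen))
    return f"{payload}.{_signature(payload, key)}"


def verify_session(
    cookie: str,
    now: float,
    key: bytes = SECRET_KEY,
    timeout: int = IDLE_TIMEOUT_SECONDS,
) -> bool:
    """Accept the cookie only if the signature matches and it is not idle-expired."""
    payload, _, signature = cookie.partition(".")
    expected = _signature(payload, key)
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return False
    return now - int(payload) <= timeout
```

In this sketch an idle window would be implemented by re-issuing the cookie with a fresh timestamp on each authenticated request.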

Rate limiting (per IP, per endpoint)

Heavy and mutation endpoints are throttled in-process to prevent runaway clients:

Endpoint	Limit
`GET /api/stats`, `/api/stats/enhanced`	30 / minute
`POST /api/collector/run`	3 / minute
`POST /api/media/cleanup-blacklist`	5 / minute
`POST /api/media/cleanup-by-type`	5 / minute
`POST/PUT/DELETE /api/{subreddits,users,blacklist/,settings/,posts/*/tags}`	60 / minute

Limits are per (client IP, route path) and reset on a sliding window. For multi-worker deployments, swap the in-memory buckets for a shared backend.

Gallery performance

The gallery keeps at most 500 items in memory/DOM at once; older items are evicted as you scroll. Use filters or sort to reach older media.
Status polling pauses when the tab is hidden.
Filter dropdowns are debounced (200 ms) so rapid changes only fire one request.
Bulk DELETE runs 5 requests in parallel.

Running as a Service (systemd)

# Copy the service file
sudo cp reddit-collector-web.service /etc/systemd/system/

# Reload systemd and enable service
sudo systemctl daemon-reload
sudo systemctl enable reddit-collector-web
sudo systemctl start reddit-collector-web

# Check status
sudo systemctl status reddit-collector-web

# View logs
sudo journalctl -u reddit-collector-web -f

Docker

Using Docker Compose

# Build and run
docker-compose up -d

# View logs
docker-compose logs -f

# Stop
docker-compose down

Docker Configuration

The docker-compose.yml mounts the following volumes:

./config.yaml → /app/config.yaml (must be writable — the UI mutates it)
./downloads → /app/downloads (media files)
./data → /app/data (SQLite DB, scheduler DB/config — survives upgrades)

Relevant environment variables (all optional, with sensible defaults inside the image):

Variable	Default	Purpose
`RMC_DOWNLOAD_DIR`	`/app/downloads`	Where media files are written
`RMC_DB_PATH`	`/app/data/media.db`	Main SQLite database
`RMC_SCHEDULER_DB`	`/app/scheduler.db`	APScheduler jobstore — set to `/app/data/scheduler.db` in containers so history survives restarts
`RMC_SCHEDULER_CONFIG`	`/app/scheduler_config.yaml`	Scheduler interval/cron config — same advice as above
`RMC_CONFIG_PATH`	`/app/config.yaml`	YAML with subreddits/users/blacklist
`RMC_TIMEZONE`	`UTC`	Timezone used by the scheduler
`RMC_AUTH_USER` / `RMC_AUTH_PASS`	unset	Enable HTTP Basic Auth on every route except `/health` when both are set
`RMC_PIN`	unset	4-6 digit PIN gating the web UI (HMAC cookie); leave unset to disable
`RMC_PIN_TIMEOUT`	`600`	Idle seconds before the PIN cookie expires

Synology DSM Deployment

Tested on DSM 7.2 with Container Manager on x86_64 Plus models. The published image lives at ghcr.io/richardnixondev/reddit-media-collector:latest.

1. Prepare the host directories (one-time, via SSH or File Station):

sudo mkdir -p /volume1/docker/reddit-media-collector/{downloads,data}
sudo cp /path/to/your/config.yaml /volume1/docker/reddit-media-collector/config.yaml

2. Create the project in Container Manager:
- Open Container Manager → Project → Create.
- Project name: reddit-media-collector.
- Path: /volume1/docker/reddit-media-collector.
- Source: Create docker-compose.yml, then paste the contents of docker-compose.synology.yml (adjust RMC_TIMEZONE to your zone).
- Next → Build. DSM then runs docker compose pull followed by docker compose up -d.
3. Access the UI: http://<nas-ip>:8000. Downloaded media appears in /volume1/docker/reddit-media-collector/downloads/, which you can point Synology Photos or an Immich instance at.
4. HTTPS via DSM Reverse Proxy (optional, recommended if exposed): Control Panel → Login Portal → Advanced → Reverse Proxy → Create. Source: reddit.yourdomain.com (HTTPS:443). Destination: localhost:8000 (HTTP). Attach a Let's Encrypt cert. When exposing publicly, uncomment RMC_AUTH_USER/RMC_AUTH_PASS in the compose file, and consider adding RMC_PIN=NNNN as a second factor on the UI.
5. Updating: Container Manager → Project → reddit-media-collector → Action → Build re-pulls the latest image. The data/ volume keeps the database, scheduler history, tags and config.
6. First boot (one-time, only if you already have a library): retroactively tag every existing post (subreddit / performer / genre / nsfw) with:
```
curl -X POST http://<nas-ip>:8000/api/tags/backfill
```
Idempotent — safe to re-run.

Permissions note: if the container can't write to the bind mounts, check the owner of /volume1/docker/reddit-media-collector/ with ls -ln. Either chown it to a user the container can write as, or set user: "UID:GID" in the compose (commented out at the bottom of docker-compose.synology.yml).

File Naming Convention

Downloaded files follow a descriptive naming pattern:

{subreddit}_{author}_{YYYYMMDD}_{HHmmss}_{post_id}[_{gallery_index}].{ext}

Examples:

earthporn_photographer123_20260118_143052_abc123.jpg
pics_user456_20260118_091523_xyz789_1.jpg  (gallery image 1)
pics_user456_20260118_091523_xyz789_2.jpg  (gallery image 2)

Immich Integration

The collector generates JSON sidecar files compatible with Immich:

{
  "dateTimeOriginal": "2026-01-18T14:30:52+00:00",
  "description": "Post title from Reddit",
  "albums": ["r/earthporn"],
  "tags": ["reddit", "earthporn", "image"],
  "rating": 4,
  "people": ["photographer123"],
  "externalUrl": "https://reddit.com/r/earthporn/comments/abc123/title"
}
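
For illustration, a sidecar with the fields shown above could be assembled like this. The function names, the shape of the `post` dictionary, and the choice to write the JSON next to the media file with a `.json` suffix are assumptions; the project's `sidecar.py` is the authority on the real behaviour.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_sidecar(post: dict, rating: int) -> dict:
    """Map a post record (hypothetical dict shape) onto the sidecar fields."""
    taken = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    return {
        "dateTimeOriginal": taken.isoformat(),
        "description": post["title"],
        "albums": [f"r/{post['subreddit']}"],
        "tags": ["reddit", post["subreddit"], post["media_type"]],
        "rating": rating,
        "people": [post["author"]],
        # Assumes permalink is stored with a leading slash.
        "externalUrl": "https://reddit.com" + post["permalink"],
    }


def write_sidecar(media_path: Path, sidecar: dict) -> Path:
    """Write the sidecar beside the media file (assumed naming: same stem, .json)."""
    target = media_path.with_suffix(".json")
    target.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    return target
```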

Rating System

Ratings are automatically assigned based on post score:

Score	Rating
0-9	1 star
10-49	2 stars
50-199	3 stars
200-999	4 stars
1000+	5 stars
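
The table above translates directly into a threshold lookup. The function name `rating_for_score` is hypothetical, and treating negative scores as 1 star is an assumption, since the table starts at 0.

```python
def rating_for_score(score: int) -> int:
    """Map a Reddit post score to a 1-5 star rating per the table above."""
    thresholds = ((1000, 5), (200, 4), (50, 3), (10, 2))
    for minimum, stars in thresholds:
        if score >= minimum:
            return stars
    return 1  # 0-9, and (by assumption) anything negative
```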

Importing to Immich

Point Immich to your downloads folder as an external library, or use the Immich CLI:

immich upload --album "Reddit Collection" ./downloads/

Scheduled Collection

Using Cron

# Edit crontab
crontab -e

# Add entry (runs every 6 hours)
0 */6 * * * /path/to/reddit-media-collector/run_collector.sh

Example run_collector.sh

#!/bin/bash
cd /path/to/reddit-media-collector
source venv/bin/activate
export PATH="$HOME/.local/bin:$PATH"  # For yt-dlp
timeout 4h python -m src.main >> cron.log 2>&1
echo "$(date): Collector finished with exit code $?" >> cron.log

API Reference

The web interface exposes a REST API. FastAPI also serves interactive docs at /docs and OpenAPI at /openapi.json.

Configuration

Method	Endpoint	Description
GET	`/api/config`	Full config snapshot
GET	`/api/subreddits`	List configured subreddits
POST	`/api/subreddits`	Add a subreddit
DELETE	`/api/subreddits/{name}`	Remove a subreddit
GET	`/api/users`	List configured users
POST	`/api/users`	Add a user
DELETE	`/api/users/{name}`	Remove a user
GET	`/api/settings`	Download + rate-limit settings
PUT	`/api/settings`	Update settings

Media

Method	Endpoint	Description
GET	`/api/media`	List downloaded media (paginated, filterable)
GET	`/api/media/{id}/info`	Get media details
DELETE	`/api/media/{id}?blacklist_author=&blacklist_subreddit=`	Delete media file (and optionally blacklist)
GET	`/api/media/subreddits[?limit&offset]`	Subreddits with downloaded content
GET	`/api/media/authors[?limit&offset]`	Authors with downloaded content
GET	`/api/media/file/{filename:path}`	Serve raw file (Range-aware for video)
GET	`/api/media/thumb/{filename:path}`	Serve / generate video thumbnail
GET	`/api/media/blacklist-preview`	Files affected by current blacklist
POST	`/api/media/cleanup-blacklist`	Delete blacklisted media
GET	`/api/media/cleanup-preview?media_type=`	Files affected by type cleanup
POST	`/api/media/cleanup-by-type?media_type=`	Delete by media type

Blacklist

Method	Endpoint	Description
GET	`/api/blacklist`	Full blacklist
POST	`/api/blacklist/{authors,subreddits,keywords,domains}`	Add entry
DELETE	`/api/blacklist/{kind}/{name}`	Remove entry

Favorites & Authors

Method	Endpoint	Description
GET	`/api/favorites`	Favorited posts (paginated)
POST	`/api/favorites/{post_id}`	Add to favorites
DELETE	`/api/favorites/{post_id}`	Remove from favorites
GET	`/api/favorites/authors[?limit&offset]`	Distinct favorited authors
POST	`/api/favorites/sync-users`	Add favorite authors as user targets
GET	`/api/authors`	Authors with stats (paginated, sortable, favorites filter)
GET	`/api/authors/{author}/media`	Media for one author

Stats

Method	Endpoint	Description
GET	`/api/stats`	Disk + counts (30 s server-side cache)
GET	`/api/stats/enhanced`	Trends, top authors, scores
GET	`/api/stats/recent`	Recent downloads

Collector & Scheduler

Method	Endpoint	Description
POST	`/api/collector/run`	Trigger collection run
GET	`/api/collector/status`	Collector status (includes `last_error`)
POST	`/api/collector/clear-error`	Dismiss the `last_error` banner
POST	`/api/collect/individual`	Collect from a single subreddit/user
GET	`/api/collect/targets`	List available targets
GET	`/api/scheduler/status`	Scheduler state + next run
PUT	`/api/scheduler/config`	Update schedule
GET	`/api/scheduler/history`	Past scheduler runs
POST	`/api/scheduler/run-now`	Execute schedule immediately

Tags

Method	Endpoint	Description
GET	`/api/tags?category=`	List all tags (filter by category: `performer`/`source`/`genre`/`meta`)
GET	`/api/posts/{id}/tags`	Tags attached to a single post
POST	`/api/posts/{id}/tags`	Attach a user-curated tag
DELETE	`/api/posts/{id}/tags/{tag_id}`	Detach a tag (auto or user)
POST	`/api/tags/backfill`	One-time pass to retroactively auto-tag the whole library

Health & Auth

Method	Endpoint	Description
GET	`/health`	Public JSON: `db`, `ffmpeg`, `scheduler`, `downloads_writable`, `version`, `auth_enabled`
GET	`/unlock`	PIN entry screen (only when `RMC_PIN` is set)
POST	`/unlock`	Submit PIN; sets the signed session cookie
POST	`/lock`	Invalidate the session cookie immediately
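
A monitor script might poll `/health` and confirm that the documented fields are present. The sketch below only checks for the keys listed in the table; it makes no assumption about their value types, which are not specified here. The function names are hypothetical.

```python
import json
import urllib.request

EXPECTED_KEYS = {"db", "ffmpeg", "scheduler", "downloads_writable", "version", "auth_enabled"}


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET /health and decode the JSON body (no credentials needed; the route is public)."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/health", timeout=timeout) as response:
        return json.load(response)


def missing_health_keys(payload: dict) -> set[str]:
    """Return the documented keys that are absent from a /health payload."""
    return EXPECTED_KEYS - payload.keys()
```

Typical use against a running instance would be `missing_health_keys(fetch_health("http://localhost:8000"))`, where an empty set means every documented field came back.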

Database Schema

The SQLite database (media.db) stores all metadata:

CREATE TABLE posts (
    id TEXT PRIMARY KEY,
    subreddit TEXT NOT NULL,
    author TEXT,
    title TEXT,
    url TEXT NOT NULL,
    media_url TEXT,
    media_type TEXT,
    score INTEGER DEFAULT 0,
    created_utc REAL,
    downloaded_at TIMESTAMP,
    local_path TEXT,
    file_hash TEXT,
    permalink TEXT,
    source_type TEXT,         -- 'subreddit' or 'user'
    flair TEXT,
    nsfw INTEGER DEFAULT 0    -- Reddit's over_18 mirrored locally
);

CREATE TABLE favorites (
    post_id TEXT PRIMARY KEY,
    favorited_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (post_id) REFERENCES posts(id)
);

-- Tag taxonomy (Stash-inspired)
CREATE TABLE tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    category TEXT,            -- 'performer' | 'source' | 'genre' | 'meta'
    is_nsfw INTEGER DEFAULT 0,
    description TEXT,
    UNIQUE(name, category)
);

CREATE TABLE post_tags (
    post_id TEXT NOT NULL,
    tag_id INTEGER NOT NULL,
    source TEXT DEFAULT 'auto',  -- 'auto' | 'user'
    PRIMARY KEY (post_id, tag_id),
    FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE,
    FOREIGN KEY (tag_id) REFERENCES tags(id) ON DELETE CASCADE
);

-- Scheduler history (in-app cron alternative)
CREATE TABLE scheduler_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    started_at TIMESTAMP,
    finished_at TIMESTAMP,
    status TEXT,              -- 'success' | 'error' | 'timeout' | 'running'
    posts_processed INTEGER DEFAULT 0,
    posts_downloaded INTEGER DEFAULT 0,
    error_message TEXT
);

The database uses SQLite WAL (PRAGMA journal_mode=WAL) for crash safety on power loss and concurrent reads while a collect is running.
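
To show how the tag tables fit together, here is a sketch that opens a database in WAL mode and lists the tags for one post. `open_db` and `tags_for_post` are example names, and enabling `PRAGMA foreign_keys` is an assumption: SQLite needs it per connection for the `ON DELETE CASCADE` clauses to take effect, but whether the project sets it is not stated here.

```python
import sqlite3


def open_db(path: str) -> sqlite3.Connection:
    """Open the database with WAL journaling and foreign-key enforcement."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")  # assumed; required for ON DELETE CASCADE
    return conn


def tags_for_post(conn: sqlite3.Connection, post_id: str) -> list[tuple[str, str, str]]:
    """Return (name, category, source) for every tag attached to a post."""
    cursor = conn.execute(
        """
        SELECT t.name, t.category, pt.source
        FROM post_tags AS pt
        JOIN tags AS t ON t.id = pt.tag_id
        WHERE pt.post_id = ?
        ORDER BY t.category, t.name
        """,
        (post_id,),
    )
    return cursor.fetchall()
```

The `source` column is what allows manual tags to survive reruns: an auto-tagging pass can limit itself to rows marked 'auto' and leave 'user' rows alone.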

Troubleshooting

Videos saving as .html

Cause: yt-dlp not installed or not in PATH

Solution:

pip install yt-dlp
export PATH="$HOME/.local/bin:$PATH"

Rate limited (429 errors)

Cause: Too many requests to Reddit

Solution: Increase download_delay_seconds or decrease requests_per_minute in config

Incomplete galleries

Cause: Gallery metadata not available from Reddit API

Solution: Verify the post still exists on Reddit

Web interface not starting

Cause: Port already in use or missing dependencies

Solution:

# Check if port is in use
lsof -i :8000

# Reinstall dependencies (editable)
pip install -e ".[dev]"

Project Structure

reddit-media-collector/
├── src/
│   ├── main.py                # Collector entry point
│   ├── config.py              # Configuration dataclasses + rotating logger
│   ├── database.py            # SQLite (WAL) + tags taxonomy + auto-tagger
│   ├── downloader.py          # Downloader with retry + dedupe
│   ├── reddit_client.py       # Reddit JSON-API client
│   ├── sidecar.py             # Immich-compatible JSON sidecars
│   ├── extractors/            # URL extractors per host (reddit, imgur, gfycat)
│   └── web/                   # FastAPI app
│       ├── app.py             # App, lifespan, PIN-lock middleware
│       ├── auth.py            # Optional HTTP Basic auth
│       ├── session.py         # HMAC-signed PIN session cookie
│       ├── config_manager.py
│       ├── deps.py
│       ├── rate_limit.py      # Per-IP throttle dependency
│       ├── routers/
│       │   ├── config.py      # Config CRUD (HTMX-aware fragments)
│       │   ├── favorites.py
│       │   ├── health.py      # Public /health JSON
│       │   ├── media.py       # Gallery, file serving, cleanups
│       │   ├── scheduler.py   # In-app scheduler + collector control
│       │   ├── stats.py
│       │   └── tags.py        # Tag taxonomy CRUD + backfill
│       ├── static/
│       │   ├── css/app.css    # Extracted from inline (themed via CSS vars)
│       │   └── js/
│       │       ├── api.js     # Shared fetch helpers
│       │       └── app.js     # All app logic
│       └── templates/
│           ├── index.html     # Composition: extends layout + includes
│           ├── unlock.html    # PIN entry screen
│           └── partials/      # tab_*.html, _modal_*.html, _item_*.html, _tag.html
├── tests/                     # pytest (unit + API contract; 151 tests)
├── downloads/                 # Downloaded media (gitignored)
├── config.yaml                # Configuration
├── pyproject.toml             # Project metadata + tooling config
├── .pre-commit-config.yaml    # ruff + mypy hooks
├── .github/workflows/
│   ├── ci.yml                 # lint, types, test, docker (per PR/push)
│   └── release.yml            # GHCR publish on every v* tag
├── Dockerfile
├── docker-compose.yml         # Local dev
├── docker-compose.synology.yml # NAS-flavored (image from GHCR)
└── README.md

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Disclaimer

This tool is for personal use only. Please respect Reddit's Terms of Service and the content creators' rights. Do not use this tool to redistribute copyrighted content.

Acknowledgments

Reddit for their public JSON API
Immich for the excellent self-hosted photo management
yt-dlp for video extraction support
