Closes Dependabot PR #2.
v3 changelog: adds an optional commit-message parameter (we do not
use it, default is fine), removes the Endbug dependency that caused
issues on github-enterprise, and bumps its own internal checkout /
setup-python actions. None of the inputs we pass (lint-path,
python-version, requirements-path, pylintrc-path, readme-path,
badge-text, color-*) changed.
Re-pinned by full commit SHA, same hardening pattern as v2.1.
Closes Dependabot PRs #7 (recharts), #6 (globals) and #4
(eslint-plugin-react-refresh).
- recharts ^3.2.0 -> ^3.8.1 (runtime chart lib used by App.jsx;
bundle grew from 489 kB to 550 kB gzipped — within the 1 MB
soft budget),
- globals ^16.3.0 -> ^17.6.0 (eslint flat-config peer; no API
surface used directly),
- eslint-plugin-react-refresh ^0.4.20 -> ^0.5.2.
Verified locally with `npm install && npm run lint && npm run build`:
zero lint errors, build completes in 814 ms, npm audit reports no
vulnerabilities.
Two Dependabot PRs intentionally not included here:
- #12 @eslint/js -> ^10 needs eslint -> ^10 first (peer dep),
- #9 @vitejs/plugin-react -> ^6 needs vite -> ^8 first (peer dep).
Both will be revisited once Dependabot opens the matching core
bumps.
Closes Dependabot PRs #13 (websockets) and #1 (setup-python).
websockets: 15.0.1 -> 16.0
Major bump but the API we use (serve(handler), handler arg
exposing request.path, ws.send, close(code, reason)) is
unchanged. Verified by running the existing 89-test suite
against websockets==16.0 locally — _ws_token_ok still reads
the query string as before.
actions/setup-python: v5 -> v6
First-party action, low risk. The previous version was already
tag-pinned (acceptable for first-party). cache: 'pip' input is
preserved.
pip-audit remains clean (0 known vulnerabilities).
34 new tests (89 total, still ~0.1s).
test_settings.py — exercises BackendSettings directly with _env_file=None
so the developer's local .env does not leak in:
- default port ranges and invariants,
- non-integer / out-of-range port rejection,
- cpu_alert_th out-of-range rejection,
- env override roundtrip,
- extra="ignore" tolerates typos (regression: an unknown env var
should not crash startup).
test_metrics_schema.py — black-box tests of parse_metrics() with each
case named after the attack it guards against:
- happy path with full and partial (optional fields) payloads,
- every required field individually missing,
- every percentage field individually out of [0, 100],
- extra="forbid" rejects smuggled keys (e.g. {"injected": "<!channel>"}),
- unsafe device_id patterns (slashes, newlines, path traversal,
65-char overflow, empty string),
- invalid raw JSON,
- NaN / Infinity / -Infinity which json.loads accepts but the
schema (Field + the finite-value validator) rejects.
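The last case is easy to miss because Python's json module parses the
non-standard tokens NaN, Infinity and -Infinity by default. A minimal
stdlib sketch of the gap and of a finite-value check; the helper name
is_finite_number is hypothetical and is not the project's validator:

```python
import json
import math


def is_finite_number(value):
    """Return True only for finite int/float values (bools excluded)."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isfinite(value)


# json.loads accepts these non-standard tokens without complaint.
frame = json.loads(
    '{"cpu_percent": NaN, "mem_percent": Infinity, "disk_percent": 42.5}'
)

# Fields that a finite-value check would reject.
rejected = sorted(k for k, v in frame.items() if not is_finite_number(v))
```

Here `rejected` holds the two non-finite fields, while disk_percent
passes.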
Previously only device_id and cpu_percent had explicit checks. The
rest of the payload — mem_percent, disk_percent, gpu_percent,
timestamp, agent_cpu_percent, agent_mem_mb — was trusted as long
as json.loads accepted it, so a malicious or buggy publisher could
push:
- mem_percent: "<script>alert(1)</script>" (rendered later in
the WS dashboard / Slack summary as if numeric),
- disk_percent: NaN (which compares False everywhere and breaks
downstream chart aggregation),
- extra keys ("evil": "<!channel>"), persisted in device_state
and forwarded verbatim to WS clients.
Pydantic Metrics model now enforces the whole frame:
- device_id pattern (same regex as validate_device_id),
- percentages bounded to [0, 100],
- explicit NaN rejection (a finite-value @field_validator on top
of Field(ge/le), which already excludes inf),
- timestamp >= 0,
- extra="forbid" so frames carrying unknown keys are rejected at the
door.
on_message now goes through parse_metrics() which logs a WARNING
with the structured pydantic error list on rejection.
Replace the scattered os.getenv() + int()/float() pattern with a
BaseSettings class on both modules. Wins:
- bad config now fails at import with a readable pydantic error
(WS_PORT=abc no longer produces a ValueError stack from inside
main()); ports are bounded to [1, 65535], cpu_alert_th to [0,100],
backoff_min/interval to >= 1,
- .env loading moves into pydantic-settings (env_file in
SettingsConfigDict), so the manual load_dotenv() call is gone,
- every callback now reads from a single ``settings`` instance, so
runtime overrides are possible (tests use monkeypatch on
backend.settings instead of patching module-level constants).
The ws_token test now patches backend.settings.ws_auth_token rather
than the old WS_AUTH_TOKEN module constant; the contract is
unchanged, so all 55 tests still pass.
Pydantic stack pinned: pydantic==2.13.4, pydantic-core==2.46.4,
pydantic-settings==2.14.1 (plus annotated-types and typing-inspection
as transitives). pip-audit remains clean.
pip-audit now fails the CI on known CVEs (added earlier in the lint
workflow), but that only protects against regressions in *new* PRs.
Dependabot closes the loop on the rest: it raises weekly PRs to bump
pinned versions, often before any CVE is disclosed against them,
including SHA-pinned bumps for the Silleellie third-party action so
the SHA pin does not become a maintenance trap.
Three ecosystems are configured:
- pip (repo root, runtime + requirements-dev),
- npm (frontend/),
- github-actions (workflows/ and reusable actions).
Before this change a single connect() failure (bad cert, broker
down, transient network glitch) made mqtt_loop return after a one-
line error, while the surrounding asyncio.gather kept the WS server
and the Slack summary loop running blind — the bridge looked alive
but no metrics flowed.
The loop now:
- keeps the SSL context outside the retry body so reconnects do
not re-read cert files on every attempt,
- awaits an asyncio.Event flipped by on_disconnect, so it only
enters the backoff sleep when the broker actually dropped us,
- retries with exponential backoff capped at MQTT_BACKOFF_MAX
(default 60s), resetting after each successful connect,
- lets asyncio.CancelledError propagate so shutdown still works.
next_backoff() is a tiny pure helper so the doubling/ceiling logic
is unit tested in isolation (tests/test_backoff.py, 7 cases).
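A minimal sketch of what such a helper could look like. The signature,
the doubling factor and the 1s starting delay are assumptions made for
illustration; only the doubling/ceiling behavior and the 60s default
cap come from the description above:

```python
def next_backoff(current, maximum, factor=2.0):
    """Grow `current` by `factor`, never exceeding `maximum`."""
    return min(current * factor, maximum)


# Walk the schedule from an assumed 1s start with a 60s ceiling.
delays = []
delay = 1.0
for _ in range(8):
    delays.append(delay)
    delay = next_backoff(delay, 60.0)
```

The resulting schedule is 1, 2, 4, 8, 16, 32, 60, 60 seconds. Keeping
the helper free of sleeps and I/O is what makes it testable without an
event loop.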
Also hoisted `import ssl` to module top-level — the previous in-
function import was a leftover from earlier copy/paste and tripped
pylint's import-outside-toplevel check.
print() across both modules made production observability painful:
no levels, no timestamps under the developer's control, and the
common `except Exception as e: print(e)` pattern dropped the
traceback. A single grep for `[MQTT ERROR]` could not tell whether
the failure was a parse error, a TLS handshake, or an OOM.
Now both backend.py and collect_metrics.py use a module logger:
- basicConfig with `%(asctime)s %(levelname)s %(name)s: %(message)s`,
- level driven by the LOG_LEVEL env var (defaults to INFO),
- log.exception() in catch blocks so the stack trace is preserved,
- debug-level for noisy frames (raw payload, agent self stats,
per-message broadcast log) so prod runs stay readable.
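The setup described above could be sketched as follows. The function
names configure_logging and parse_int are hypothetical, and the
fallback to INFO for unrecognized LOG_LEVEL values is an assumption:

```python
import logging
import os

LOG_FORMAT = "%(asctime)s %(levelname)s %(name)s: %(message)s"


def configure_logging():
    """Configure the root logger; LOG_LEVEL defaults to INFO."""
    level_name = os.getenv("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)
    if not isinstance(level, int):
        # Assumption: unknown names fall back to INFO rather than fail.
        level = logging.INFO
    logging.basicConfig(level=level, format=LOG_FORMAT)
    return logging.getLogger(__name__)


log = configure_logging()


def parse_int(raw):
    """Parse an integer, logging the full traceback on failure."""
    try:
        return int(raw)
    except ValueError:
        # log.exception logs at ERROR level and appends the traceback.
        log.exception("could not parse %r", raw)
        return None
```

Unlike `print(e)`, the log.exception call records where the failure
happened, not just its message.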
Also rename the agent's local CPU threshold to AGENT_CPU_WARN (env
overridable) so the magic 90 in collect_metrics no longer drifts
silently from the backend's CPU_ALERT_TH.
Install requirements-dev.txt (which transitively pulls runtime deps)
and run pytest before pylint. The suite is fast (~0.1s) so it does
not affect total CI time, but it now gates merges on the validation
and alert-predicate contracts.
48 tests, ~0.1s total. Each case targets a specific bug class so the
file reads as a contract:
- test_validation.py
accepts well-formed device ids, rejects whitespace, slashes,
colons, newlines, zero-width spaces, oversize values, HTML.
- test_alert_predicate.py
threshold boundary, bool-vs-int trap, NaN / inf / out-of-range,
non-numeric payloads, per-device cooldown window.
- test_active_snapshot.py
recent vs stale, the "<= prune_seconds" boundary (inclusive),
missing last_seen treated as ancient, empty state.
- test_ws_token.py
open mode, missing/wrong/empty/extra-param query strings, plus
the happy path with the correct token.
conftest.py stubs MQTT_BROKER and prepends the repo root to sys.path
so `import backend` works without a .env file. Dev deps split into
requirements-dev.txt to keep the runtime image lean.
Pull the validation, snapshot pruning and alert predicate out of the
MQTT callback so they can be unit tested without mocking gmqtt:
- filter_active(state, seen, now, prune_seconds) — pure pruning
rule, now also the body of active_snapshot();
- validate_device_id(raw) — single source of truth for the
[A-Za-z0-9._-]{1,64} contract;
- should_alert(cpu, device_id, now, last_alert, cooldown, threshold)
— boolean predicate that captures the bool-vs-int trap, the
[0,100] range check, and the per-device cooldown.
Behavior is unchanged; on_message now calls these helpers instead of
inlining the same logic.
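As an illustration of why the extraction helps, here is a sketch of a
pruning rule with the filter_active signature. The shapes of `state`
and `seen` (dicts keyed by device id) and the treatment of missing
timestamps are assumptions, not the actual implementation:

```python
def filter_active(state, seen, now, prune_seconds):
    """Return the entries of `state` seen within the last prune_seconds.

    `state` maps device_id -> latest metrics and `seen` maps
    device_id -> last-seen timestamp. A device without a timestamp
    is treated as stale.
    """
    active = {}
    for device_id, metrics in state.items():
        last_seen = seen.get(device_id)
        # The boundary is inclusive: exactly prune_seconds old is kept.
        if last_seen is not None and now - last_seen <= prune_seconds:
            active[device_id] = metrics
    return active


state = {"a": {"cpu_percent": 10}, "b": {"cpu_percent": 20}, "c": {}}
seen = {"a": 100.0, "b": 70.0}  # "c" has no timestamp
snapshot = filter_active(state, seen, now=130.0, prune_seconds=30)
```

Only "a" survives: "b" is 60s old and "c" was never seen. No broker,
clock or gmqtt mock is needed to check this.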
Without auth, the WS server at 0.0.0.0:6789 exposed every device's
metrics to anyone on the network — useful reconnaissance for an
attacker (saturated nodes are easier DoS targets) and a trivial
pivot from a compromised host.
Server side:
- WS_AUTH_TOKEN env defaults to empty (open mode for local dev),
- when set, ws_handler reads ?token=... from the handshake target
and rejects with WS close 1008 unless secrets.compare_digest
matches; the comparison is constant-time to avoid timing oracles.
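A stdlib-only sketch of the token check. The name ws_token_ok and its
signature are hypothetical, and how the real handler obtains the
handshake path is not shown here:

```python
import secrets
from urllib.parse import parse_qs, urlsplit


def ws_token_ok(path, expected):
    """Check the ?token=... query parameter of a handshake path.

    An empty `expected` means open mode: every client is accepted.
    """
    if not expected:
        return True
    params = parse_qs(urlsplit(path).query)
    supplied = params.get("token", [""])[0]
    # compare_digest runs in time independent of where the inputs differ.
    return secrets.compare_digest(supplied.encode(), expected.encode())
```

For example, `ws_token_ok("/?token=s3cret", "s3cret")` is True, while a
missing, empty or wrong token yields False and the caller would then
close the socket with code 1008.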
Client side:
- frontend reads VITE_WS_URL and VITE_WS_TOKEN, so the same build
works in dev (localhost, no token) and prod (proxied wss, token).
- frontend/.env.sample documents the variables; .gitignore extended
to keep .env / .env.* out of the repo while allowing .env.sample.
env_sample also documents ALERT_COOLDOWN, MAX_PAYLOAD_BYTES and
MAX_DEVICES that the previous commits introduced.
- Declare permissions: contents: write explicitly. The repository-wide
default GITHUB_TOKEN scope is broader than badge updates require,
which violates least privilege.
- Pin Silleellie/pylint-github-action by full commit SHA instead of
the mutable v2.1 tag, removing the supply-chain risk where a tag
re-point would run arbitrary code with our GITHUB_TOKEN.
- Add a pip-audit step so new CVEs in pinned deps fail the build.
- Enable pip cache to cut ~30s off cold runs.
The previous per-client asyncio.create_task(ws.send) had two problems:
- tasks were created without a reference, so the garbage collector
could drop them before they finished (the event loop keeps only weak
references to tasks) and exceptions vanished silently,
- the surrounding try/except could only catch *synchronous* failures
of create_task itself, never an actual send failure, so dead
sockets stayed in `clients` forever and the discard branch was
effectively dead code.
Use a single _broadcast coroutine that fans out with asyncio.gather
(return_exceptions=True) and prunes the clients set based on real
send results. Schedule fire-and-forget work through _schedule, which
keeps a strong reference to the task until it completes.
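The pattern can be sketched as below. The bodies of _broadcast and
_schedule are a reconstruction from the description, and FakeSocket is
a hypothetical stand-in for a WebSocket connection:

```python
import asyncio

clients = set()
_tasks = set()


def _schedule(coro):
    """Start `coro` as a task and hold a strong reference until done."""
    task = asyncio.create_task(coro)
    _tasks.add(task)
    task.add_done_callback(_tasks.discard)
    return task


async def _broadcast(message):
    """Send to every client and drop those whose send failed."""
    targets = list(clients)
    results = await asyncio.gather(
        *(ws.send(message) for ws in targets), return_exceptions=True
    )
    for ws, result in zip(targets, results):
        if isinstance(result, Exception):
            clients.discard(ws)


class FakeSocket:
    """Hypothetical stand-in for a WebSocket connection."""

    def __init__(self, fail=False):
        self.fail = fail
        self.sent = []

    async def send(self, message):
        if self.fail:
            raise ConnectionError("peer went away")
        self.sent.append(message)


async def demo():
    good, dead = FakeSocket(), FakeSocket(fail=True)
    clients.update({good, dead})
    await _schedule(_broadcast("hello"))
    return good.sent, dead in clients


sent, dead_still_registered = asyncio.run(demo())
```

After the run, the healthy socket has received the message and the
failing one is no longer in `clients`.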
Each MQTT message with cpu_percent >= CPU_ALERT_TH used to schedule a
post_slack task immediately. A bursty (or hostile) publisher could
spam the channel, burn the webhook's 1 req/s quota, and bury real
alerts in noise.
Also harden the predicate:
- reject bool values (True passes isinstance check for int otherwise),
- bound cpu to [0, 100] so NaN / inf / 1e308 cannot trigger,
- re-alert per device only after ALERT_COOLDOWN seconds.
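A sketch of a predicate with these three properties, using the
should_alert signature. The shape of `last_alert` (a dict of device id
to last alert time) and the inclusive cooldown boundary are
assumptions:

```python
def should_alert(cpu, device_id, now, last_alert, cooldown, threshold):
    """Return True when `cpu` warrants a new alert for `device_id`."""
    # bool is a subclass of int, so it must be excluded explicitly.
    if isinstance(cpu, bool) or not isinstance(cpu, (int, float)):
        return False
    # NaN fails both comparisons; inf and 1e308 fall outside the range.
    if not 0 <= cpu <= 100:
        return False
    if cpu < threshold:
        return False
    previous = last_alert.get(device_id)
    # Assumption: an alert is allowed again once `cooldown` has elapsed.
    return previous is None or now - previous >= cooldown
```

The function only reads `last_alert`; in this sketch the caller is
expected to record `now` for the device after an alert is sent.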
A compromised device (or a stolen cert) could previously:
- publish a giant payload to exhaust backend memory in json.loads,
- spoof messages claiming arbitrary device_id values (rendered into
Slack alerts as mrkdwn, enabling content injection / channel
impersonation with link unfurls and broadcast keywords),
- flood device_state with random ids to drive unbounded memory growth
since the prune is only applied on read in active_snapshot().
Add three guards in on_message:
- MAX_PAYLOAD_BYTES (default 16 KiB) — rejects oversize frames before
json parsing,
- _DEVICE_ID_RE — accepts only [A-Za-z0-9._-]{1,64}, rejecting
newlines, slack mrkdwn metacharacters and absurd lengths,
- MAX_DEVICES cap — refuses new device_ids once the active set is
full, so a misbehaving publisher cannot grow the dict without bound.
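The order of the guards matters: the size check runs before any
parsing. A sketch under stated assumptions follows; accept_frame is a
hypothetical name, the MAX_DEVICES value is illustrative (its default
is not stated here), and the cap is checked against the size of
device_state:

```python
import json
import re

MAX_PAYLOAD_BYTES = 16 * 1024
MAX_DEVICES = 1000  # illustrative value only
_DEVICE_ID_RE = re.compile(r"[A-Za-z0-9._-]{1,64}")


def accept_frame(payload, device_state):
    """Return the parsed frame, or None if any guard rejects it."""
    if len(payload) > MAX_PAYLOAD_BYTES:
        return None  # rejected before json.loads sees the bytes
    try:
        frame = json.loads(payload)
    except (ValueError, RecursionError):
        return None
    if not isinstance(frame, dict):
        return None
    device_id = frame.get("device_id")
    # fullmatch also rejects a trailing newline, unlike match with "$".
    if not isinstance(device_id, str):
        return None
    if not _DEVICE_ID_RE.fullmatch(device_id):
        return None
    # Known devices may keep reporting; only new ids hit the cap.
    if device_id not in device_state and len(device_state) >= MAX_DEVICES:
        return None
    return frame
```

A frame from an already-known device is still accepted when the cap is
reached, so a flood of random ids cannot lock out existing devices.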
- Remove paho-mqtt, pynvml, nvidia-ml-py and the redundant dotenv shim
(the code only imports gmqtt, psutil and python-dotenv; GPU stats come
from a nvidia-smi subprocess, not pynvml).
- Bump aiohttp 3.12.15 -> 3.13.4 (closes 18 CVEs; mostly server-side
but defense-in-depth for the Slack client too).
- Bump python-dotenv 1.1.1 -> 1.2.2 (CVE-2026-28684; not exploitable
here since only load_dotenv is used, but keeps scanners clean).