Product specification: news.lihor.ro

Purpose

A self-hosted, open-source news dashboard. It combines:

News inbox — Clau/cron collects useful sources automatically via RSS, GitHub release feeds, trending feeds, and scraped pages.
Reading tracker — records what was read, saved, skipped, and archived with timestamps.
Summary archive — stores contextual summaries and "why this matters" blurbs per article card.
Search — full-text search across titles, summaries, tags, and sources.
Source health — each feed shows last check time, success/error state, and item counts.
Future AI question layer — ask Clau questions over saved/read history (see AI layer section).

Source taxonomy

See backend/news_dashboard/sources.py for the canonical source list.

Sources are split by category:

python — Python Insider, Astral blog, Ruff/uv/mypy/pyright/scikit-learn/scipy/PyTorch/TensorFlow releases
ai-llm — Anthropic (scraped), OpenAI, Google AI, Hugging Face, Augment Code, Simon Willison, Latent Space, Import AI, InfoQ
agents — LangChain, LangGraph, Langfuse releases
cloud-infra — Kubernetes, Docker, AWS ML blog
engineering — Pragmatic Engineer, GitHub Changelog, GitHub Engineering
trending — Hacker News Best, Hacker News AI search
repositories — GitHub Trending (All, Python, TypeScript)

Source kinds:

rss_feed — standard RSS/Atom feed via feedparser
github_release_feed — GitHub releases Atom feed
trending_feed — HN/GitHub trending RSS
scraped_page — custom HTML scraper (stdlib urllib + html.parser, no extra deps)
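
As a rough illustration of how a category and a kind could attach to each source, here is a minimal sketch. The `Source` dataclass, the `sources_by_category` helper, and the placeholder URLs are assumptions made for this example; the canonical definitions live in backend/news_dashboard/sources.py and may be shaped differently.

```python
from dataclasses import dataclass
from enum import Enum


class SourceKind(str, Enum):
    """The four source kinds listed above."""
    RSS_FEED = "rss_feed"
    GITHUB_RELEASE_FEED = "github_release_feed"
    TRENDING_FEED = "trending_feed"
    SCRAPED_PAGE = "scraped_page"


@dataclass(frozen=True)
class Source:
    # Hypothetical shape; the real model in sources.py may differ.
    name: str
    category: str
    kind: SourceKind
    url: str


# Placeholder URLs on purpose: the real feed URLs are not part of this spec.
EXAMPLE_SOURCES = [
    Source("Python Insider", "python", SourceKind.RSS_FEED,
           "https://example.invalid/python-insider.xml"),
    Source("Ruff releases", "python", SourceKind.GITHUB_RELEASE_FEED,
           "https://example.invalid/ruff/releases.atom"),
    Source("Anthropic", "ai-llm", SourceKind.SCRAPED_PAGE,
           "https://example.invalid/anthropic-news"),
]


def sources_by_category(sources):
    """Group sources by their category string, preserving input order."""
    grouped = {}
    for source in sources:
        grouped.setdefault(source.category, []).append(source)
    return grouped
```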

Article status model

new → read
new → saved → read
new → skipped
any → archived
archived → read  (restore)

Status transitions are tracked with timestamps (read_at, saved_at, skipped_at, archived_at).
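
The transitions above can be sketched as a small lookup table plus a function that stamps the matching timestamp. This is an illustration only: `ALLOWED_TRANSITIONS` and `apply_transition` are hypothetical names, articles are plain dicts here, and treating "any → archived" as excluding the no-op archived → archived is an assumption.

```python
from datetime import datetime, timezone

# Allowed targets per current status, following the arrows listed above.
ALLOWED_TRANSITIONS = {
    "new": {"read", "saved", "skipped", "archived"},
    "saved": {"read", "archived"},
    "read": {"archived"},
    "skipped": {"archived"},
    "archived": {"read"},  # restore
}


def apply_transition(article, new_status, now=None):
    """Return a copy of the article with the new status and its timestamp set."""
    current = article["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    updated = dict(article)
    updated["status"] = new_status
    # Field names follow the spec: read_at, saved_at, skipped_at, archived_at.
    updated[f"{new_status}_at"] = now or datetime.now(timezone.utc)
    return updated
```

For example, `apply_transition({"status": "new"}, "saved")` returns a dict with `status` set to `"saved"` and `saved_at` filled in, while moving a skipped article straight to read raises `ValueError`.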

Noise filtering

Per-source limits are configured in ingest.py::NOISE_FILTERS:

Broad feeds (HN, GitHub trending): capped at 15–20 items per run
Newsletter feeds (Import AI, Latent Space): capped at 5 per run
Curated blog feeds: capped at 50 per run (the default cap)

A keyword include-list hook is available per-source for future tighter filtering.
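
A minimal sketch of how a per-source cap and an optional keyword include-list might be applied. The dictionary shape, the `include_keywords` key, and the `apply_noise_filter` function are assumptions for illustration; the real structure of NOISE_FILTERS in ingest.py is not shown in this document.

```python
# Hypothetical filter table; the caps mirror the ranges described above.
EXAMPLE_NOISE_FILTERS = {
    "Hacker News Best": {"max_items": 20},
    "Import AI": {"max_items": 5},
}
DEFAULT_MAX_ITEMS = 50  # cap for curated blog feeds without an explicit rule


def apply_noise_filter(source_name, entries, filters=EXAMPLE_NOISE_FILTERS):
    """Keep entries matching the include-list (if any), then apply the cap."""
    rule = filters.get(source_name, {})
    keywords = [k.lower() for k in rule.get("include_keywords", [])]
    if keywords:
        entries = [
            entry for entry in entries
            if any(k in entry.get("title", "").lower() for k in keywords)
        ]
    return entries[: rule.get("max_items", DEFAULT_MAX_ITEMS)]
```

Filtering by keyword before truncating means the cap counts only relevant items, which seems the more useful order for broad feeds.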

Summary / reason generation

Summaries are the first 280 characters of feed description (after HTML cleaning).
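
One way to produce such a summary with only the standard library is sketched below. `make_summary` and `_TextExtractor` are hypothetical names, and collapsing whitespace before truncating is an assumption about what "HTML cleaning" covers.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags; entities are decoded by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def make_summary(description, limit=280):
    """Strip tags, collapse whitespace, and keep the first `limit` characters."""
    parser = _TextExtractor()
    parser.feed(description or "")
    parser.close()  # flush any text still buffered by the parser
    text = " ".join("".join(parser.parts).split())
    return text[:limit]
```

Note that this sketch cuts at a fixed character count, so a summary can end mid-word.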

The reason field is a contextual blurb generated by make_reason():

Release feeds → "New release vX.Y.Z from source."
Security content → "Security update from source — review recommended."
Tutorial/how-to → "How-to or deep-dive on category from source."
Trending (HN) → "Trending on Hacker News."
Trending (GitHub) → "Trending {Language} repository on GitHub today."
AI/agent content → "AI/agent development news from source."
Python content → "Python ecosystem update from source."
Fallback → "{Category} — {Source}."
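
The rule list above could be implemented as a first-match chain, sketched here. The function name `make_reason_sketch`, its parameters, the keyword heuristics, and the assumption that rules apply in the listed order are all illustrative; the actual make_reason() signature and detection logic are not given in this document.

```python
import re

VERSION_RE = re.compile(r"v?\d+\.\d+(?:\.\d+)?")


def make_reason_sketch(title, source, category, kind, language=None):
    """Return a one-line reason blurb using the first matching rule."""
    lowered = title.lower()
    if kind == "github_release_feed":
        match = VERSION_RE.search(title)
        version = f" {match.group(0)}" if match else ""
        return f"New release{version} from {source}."
    if "security" in lowered or "cve-" in lowered:
        # \u2014 reproduces the dash used in the documented template.
        return f"Security update from {source} \u2014 review recommended."
    if any(k in lowered for k in ("how to", "how-to", "tutorial", "guide")):
        return f"How-to or deep-dive on {category} from {source}."
    if kind == "trending_feed":
        if "hacker news" in source.lower():
            return "Trending on Hacker News."
        prefix = f"{language} " if language else ""
        return f"Trending {prefix}repository on GitHub today."
    if category in ("ai-llm", "agents"):
        return f"AI/agent development news from {source}."
    if category == "python":
        return f"Python ecosystem update from {source}."
    return f"{category} \u2014 {source}."
```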

Source health tracking

Each source stores:

last_checked_at — timestamp of last fetch attempt
last_success_at — timestamp of last successful fetch
last_error — last error message (null if no error)
last_fetched_count — items found in last run
last_inserted_count — new items inserted in last run

The Sources tab in the UI shows a colored health badge (ok / stale / error) and truncated error message when present.
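
The badge could be derived from the stored fields roughly as follows. The 24-hour staleness threshold, the 80-character truncation limit, and the function names are assumptions chosen for illustration; the document does not state the actual rules.

```python
from datetime import datetime, timedelta, timezone


def health_badge(source, now, stale_after=timedelta(hours=24)):
    """Classify a source as 'error', 'stale', or 'ok' from its health fields."""
    if source.get("last_error"):
        return "error"
    last_success = source.get("last_success_at")
    if last_success is None or now - last_success > stale_after:
        return "stale"
    return "ok"


def truncate_error(message, limit=80):
    """Shorten an error message for display, marking the cut with an ellipsis."""
    if message is None:
        return ""
    if len(message) <= limit:
        return message
    return message[: limit - 1] + "…"
```

Checking `last_error` first means a source that failed on its latest run shows as error even if it succeeded recently, which matches the field description (null if no error).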

Non-goals for v1

No public unauthenticated access.
No broad all-tech firehose.
No complex AI Q&A before article history exists.
No manual article creation as a primary workflow.

Future AI layer (v1.1+)

To be built once at least 100 saved/read articles exist:

Data model (ready now)

Articles already store summary, tags, reason — sufficient for keyword search.
PostgreSQL full-text search (search_vector tsvector generated column and idx_articles_search GIN index) is created and indexes title/summary/reason/tags/source_name/body.
Backend /api/search?q=... endpoint is live.

Full-text extraction (v1.1)

Add optional full_text column to articles table.
Fetch full article body for saved/read articles (via Trafilatura or goose3).
Run FTS over full_text when available.

Embeddings / semantic search (v1.2)

Embed title + summary via the configured OpenAI-compatible embeddings provider.
Store embeddings in PostgreSQL-managed article data structures; runtime storage remains PostgreSQL only.
Enables semantic search and similarity grouping without adding a second runtime database.
Configure with FREE_LLM_API_KEY / FREE_LLM_BASE_URL (or the OPENAI_API_KEY / OPENAI_BASE_URL fallback); see README.md and backend/news_dashboard/embeddings.py.
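
The variable fallback described above might be resolved along these lines. This is a sketch, not the code in backend/news_dashboard/embeddings.py: `resolve_llm_config` and `embedding_text` are hypothetical names, and treating each key and base URL as a pair (rather than mixing the two sets) is an assumption.

```python
import os


def resolve_llm_config(env=None):
    """Pick the FREE_LLM_* pair if its key is set, else the OPENAI_* pair.

    Returns None when no key is configured, so callers can skip the feature
    and avoid sending article content to an external API.
    """
    env = os.environ if env is None else env
    for prefix in ("FREE_LLM", "OPENAI"):
        api_key = env.get(f"{prefix}_API_KEY")
        if api_key:
            return {"api_key": api_key, "base_url": env.get(f"{prefix}_BASE_URL")}
    return None


def embedding_text(title, summary):
    """Build the text to embed from title + summary, as described above."""
    return f"{title}\n\n{summary}"
```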

Ask Clau endpoint (v1.3)

Scope: questions over articles with status saved or read only (not new/skipped/archived).

Proposed API:

POST /api/ask
{ "question": "What LangGraph updates happened last month?", "limit": 20 }
→ { "answer": "...", "citations": [{ "id": 42, "title": "...", "url": "..." }] }

Implementation:

Run /api/search?q=<question_keywords> to retrieve candidate articles.
Bundle up to N articles (title + summary + date + source) into a prompt.
Call OpenAI with the bundle and question.
Return answer + article IDs used as citations.
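
The four steps above can be sketched with the search and model calls injected as plain callables, which keeps the flow testable without network access. Everything here is illustrative: `answer_question`, the `search` and `complete` parameters, the prompt wording, and the article dict keys are assumptions, not the planned implementation.

```python
ASKABLE_STATUSES = {"saved", "read"}


def answer_question(question, search, complete, limit=20):
    """Retrieve candidates, bundle them into a prompt, and return answer + citations.

    search(question) -> list of article dicts (stand-in for /api/search)
    complete(prompt) -> answer string (stand-in for the LLM call)
    """
    candidates = [
        article for article in search(question)
        if article.get("status") in ASKABLE_STATUSES
    ][:limit]
    bundle = "\n".join(
        f"[{a['id']}] {a['title']} ({a['source']}, {a['date']}): {a['summary']}"
        for a in candidates
    )
    prompt = (
        "Answer the question using only the articles below.\n\n"
        f"{bundle}\n\nQuestion: {question}"
    )
    citations = [
        {"id": a["id"], "title": a["title"], "url": a["url"]} for a in candidates
    ]
    return {"answer": complete(prompt), "citations": citations}
```

This sketch cites every bundled article. Citing only the ones the model actually used would need the model to return article IDs, which is left open here.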

Privacy/security:

Endpoint requires the app authentication boundary, either local password sessions or optional Keycloak SSO.
No article content is sent to external APIs unless the relevant AI feature is configured with an API key.
OpenAI API key stored as an environment secret, never in source.
Briefing generation uses FREE_LLM_API_KEY / FREE_LLM_BASE_URL (or the OPENAI_API_KEY fallback) and OPENAI_BRIEFING_MODEL; see README.md and backend/news_dashboard/briefings.py.

Privacy/security boundaries

No article content leaves the server unless the relevant feature-specific AI provider variables are configured.
Search index uses PostgreSQL full-text search.
Local password auth is built into the app, and production can enable Keycloak SSO while preserving local user_id authorization boundaries. See Authentication (Keycloak).
Caddy is a reverse proxy for the app and Keycloak paths, not the primary authentication layer.

Deployment

Container

Built via Dockerfile at repo root. Published to GHCR via GitHub Actions on every push to main.

ghcr.io/lihor-hub/news-dashboard:latest
ghcr.io/lihor-hub/news-dashboard:<sha>

Kubernetes (Helm)

# Deploy with NodePort for host-side Caddy proxying
helm upgrade --install news-dashboard ./helm/news-dashboard \
  --set service.type=NodePort \
  --set service.nodePort=30088

# Caddy proxies to 127.0.0.1:30088 (never to a mutable ClusterIP)

Caddyfile pattern

Use reverse_proxy 127.0.0.1:<nodePort> with a fixed NodePort value. Do NOT use a ClusterIP — it can change when the Service is recreated.

The NodePort value (e.g., 30088) is stable across pod restarts and Helm upgrades as long as the service.nodePort value in values.yaml stays the same.
