Product specification: news.lihor.ro
Purpose
A self-hosted, open-source news dashboard. It combines:
- News inbox — Clau/cron collects useful sources automatically via RSS, GitHub release feeds, trending feeds, and scraped pages.
- Reading tracker — records what was read, saved, skipped, and archived with timestamps.
- Summary archive — stores contextual summaries and "why this matters" blurbs per article card.
- Search — full-text search across titles, summaries, tags, and sources.
- Source health — each feed shows last check time, success/error state, and item counts.
- Future AI question layer — ask Clau questions over saved/read history (see AI layer section).
Source taxonomy
See backend/news_dashboard/sources.py for the canonical source list.
Sources are split by category:
- python: Python Insider, Astral blog, Ruff/uv/mypy/pyright/scikit-learn/scipy/PyTorch/TensorFlow releases
- ai-llm: Anthropic (scraped), OpenAI, Google AI, Hugging Face, Augment Code, Simon Willison, Latent Space, Import AI, InfoQ
- agents: LangChain, LangGraph, Langfuse releases
- cloud-infra: Kubernetes, Docker, AWS ML blog
- engineering: Pragmatic Engineer, GitHub Changelog, GitHub Engineering
- trending: Hacker News Best, Hacker News AI search
- repositories: GitHub Trending (All, Python, TypeScript)
Source kinds:
- rss_feed: standard RSS/Atom feed via feedparser
- github_release_feed: GitHub releases Atom feed
- trending_feed: HN/GitHub trending RSS
- scraped_page: custom HTML scraper (stdlib urllib + html.parser, no extra deps)
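A minimal sketch of how one entry in the source list might be modelled. The `Source` dataclass, its field names, and the `ruff-releases` slug are assumptions for illustration only; the canonical definitions live in backend/news_dashboard/sources.py and may look different.

```python
from dataclasses import dataclass
from enum import Enum


class SourceKind(str, Enum):
    # The four source kinds listed above.
    RSS_FEED = "rss_feed"
    GITHUB_RELEASE_FEED = "github_release_feed"
    TRENDING_FEED = "trending_feed"
    SCRAPED_PAGE = "scraped_page"


# Category slugs from the taxonomy above.
CATEGORIES = frozenset({
    "python", "ai-llm", "agents", "cloud-infra",
    "engineering", "trending", "repositories",
})


@dataclass(frozen=True)
class Source:
    """Hypothetical shape of one source entry (all field names are assumptions)."""
    slug: str
    name: str
    category: str
    kind: SourceKind
    url: str

    def __post_init__(self) -> None:
        # Reject categories that are not part of the taxonomy.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


# Example entry; the URL follows GitHub's releases Atom feed pattern.
ruff = Source(
    slug="ruff-releases",
    name="Ruff releases",
    category="python",
    kind=SourceKind.GITHUB_RELEASE_FEED,
    url="https://github.com/astral-sh/ruff/releases.atom",
)
```

Validating the category at construction time means a typo in the source list fails at import rather than producing an uncategorised feed.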
Article status model
new → read
new → saved → read
new → skipped
any → archived
archived → read (restore)
Status transitions are tracked with timestamps (read_at, saved_at, skipped_at, archived_at).
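The transition rules above can be sketched as a lookup table plus a guard function. This is an illustration, not the app's implementation: the `transition` helper and the dict-based article are hypothetical, and "any → archived" is read here as "any non-archived status".

```python
from datetime import datetime, timezone
from typing import Optional

# Allowed transitions, transcribed from the status model above.
ALLOWED = {
    "new": {"read", "saved", "skipped", "archived"},
    "saved": {"read", "archived"},
    "read": {"archived"},
    "skipped": {"archived"},
    "archived": {"read"},  # restore
}


def transition(article: dict, new_status: str, now: Optional[datetime] = None) -> dict:
    """Return a copy of `article` moved to `new_status`, stamping `<status>_at`."""
    current = article["status"]
    if new_status not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    updated = dict(article)
    updated["status"] = new_status
    # read_at, saved_at, skipped_at, archived_at all follow the same naming pattern.
    updated[f"{new_status}_at"] = now or datetime.now(timezone.utc)
    return updated


# Usage: new -> saved -> read keeps both timestamps.
article = {"id": 1, "status": "new"}
article = transition(article, "saved")
article = transition(article, "read")
```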
Noise filtering
Per-source limits are configured in ingest.py::NOISE_FILTERS:
- Broad feeds (HN, GitHub trending): capped at 15–20 items per run
- Newsletter feeds (Import AI, Latent Space): capped at 5 per run
- Curated blog feeds: capped at 50 per run (feedparser default)
A keyword include-list hook is available per-source for future tighter filtering.
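A sketch of how per-source caps and the keyword include-list hook could fit together. The dict layout, the source slugs, the `include_keywords` key, and the `apply_noise_filter` helper are all assumptions; only the cap sizes come from the list above, and the real structure is whatever ingest.py::NOISE_FILTERS defines.

```python
# Hypothetical layout; slugs and key names are illustrative.
NOISE_FILTERS = {
    "hacker-news-best": {"max_items": 20},
    "github-trending-python": {"max_items": 15},
    "import-ai": {"max_items": 5},
    # Example of the include-list hook: keep only titles mentioning a keyword.
    "hacker-news-ai": {"max_items": 20, "include_keywords": ["llm", "agent"]},
}
DEFAULT_MAX_ITEMS = 50  # cap for curated blog feeds


def apply_noise_filter(source_slug, items):
    """Apply the optional keyword include-list, then the per-run cap."""
    rule = NOISE_FILTERS.get(source_slug, {})
    keywords = [k.lower() for k in rule.get("include_keywords", [])]
    if keywords:
        items = [
            item for item in items
            if any(k in item["title"].lower() for k in keywords)
        ]
    return items[: rule.get("max_items", DEFAULT_MAX_ITEMS)]
```

Filtering by keyword before capping means the cap counts only relevant items, so a noisy feed cannot crowd out matches that appear further down.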
Summary / reason generation
Summaries are the first 280 characters of the feed description (after HTML cleaning).
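One way the cleaning and truncation step could look, using only the standard library. The `make_summary` name and the choice of block-level tags are assumptions; the actual cleaning rules are not described in this specification.

```python
from html.parser import HTMLParser

SUMMARY_LIMIT = 280  # character limit stated above
_BLOCK_TAGS = {"p", "br", "div", "li"}  # assumed set of tags that separate words


class _TextExtractor(HTMLParser):
    """Collects text nodes, drops tags, and keeps a space at block boundaries."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def make_summary(description: str) -> str:
    """Strip HTML, collapse whitespace, and cut to SUMMARY_LIMIT characters."""
    parser = _TextExtractor()
    parser.feed(description)
    parser.close()  # flush any buffered trailing text
    text = " ".join("".join(parser.parts).split())
    return text[:SUMMARY_LIMIT]
```

A hard cut at 280 characters can end mid-word; trimming back to the last space would be a small refinement if that matters in the UI.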
The reason field is a contextual blurb generated by make_reason():
- Release feeds → "New release vX.Y.Z from source."
- Security content → "Security update from source — review recommended."
- Tutorial/how-to → "How-to or deep-dive on category from source."
- Trending (HN) → "Trending on Hacker News."
- Trending (GitHub) → "Trending {Language} repository on GitHub today."
- AI/agent content → "AI/agent development news from source."
- Python content → "Python ecosystem update from source."
- Fallback → "{Category} — {Source}."
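The rule list above can be sketched as a first-match-wins function. The signature, the keyword heuristics for security and how-to content, and the version regex are assumptions; the real make_reason() may detect these cases differently or in a different order.

```python
import re

# Matches versions such as "1.2", "v0.5.1", "2.0.0".
_VERSION = re.compile(r"v?(\d+(?:\.\d+)+)")


def make_reason(title, source, category, kind, language=None):
    """Return a short blurb; rules are checked in the order listed above."""
    lowered = title.lower()
    if kind == "github_release_feed":
        match = _VERSION.search(title)
        if match:
            return f"New release v{match.group(1)} from {source}."
        return f"New release from {source}."
    if "security" in lowered or "cve-" in lowered:  # assumed heuristic
        return f"Security update from {source} \u2014 review recommended."
    if any(k in lowered for k in ("how to", "how-to", "tutorial", "guide")):
        return f"How-to or deep-dive on {category} from {source}."
    if kind == "trending_feed":
        if "github" in source.lower():
            label = f"{language} " if language else ""
            return f"Trending {label}repository on GitHub today."
        return "Trending on Hacker News."
    if category in ("ai-llm", "agents"):
        return f"AI/agent development news from {source}."
    if category == "python":
        return f"Python ecosystem update from {source}."
    return f"{category} \u2014 {source}."
```

Because the first matching rule wins, a release from a Python tool gets the release blurb rather than the generic Python one.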
Source health tracking
Each source stores:
- last_checked_at: timestamp of last fetch attempt
- last_success_at: timestamp of last successful fetch
- last_error: last error message (null if no error)
- last_fetched_count: items found in last run
- last_inserted_count: new items inserted in last run
The Sources tab in the UI shows a colored health badge (ok / stale / error) and truncated error message when present.
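A sketch of how the badge could be derived from the stored fields. The 24-hour staleness threshold and the `health_badge` helper are assumptions; the specification names the three states but not the rules that select them.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)  # assumed threshold, not given in the spec


def health_badge(last_success_at, last_error, now=None):
    """Map a source's health fields to 'ok', 'stale', or 'error'."""
    now = now or datetime.now(timezone.utc)
    if last_error:
        return "error"
    # Never fetched successfully, or the last success is too old.
    if last_success_at is None or now - last_success_at > STALE_AFTER:
        return "stale"
    return "ok"
```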
Non-goals for v1
- No public unauthenticated access.
- No broad all-tech firehose.
- No complex AI Q&A before article history exists.
- No manual article creation as a primary workflow.
Future AI layer (v1.1+)
Planned once at least 100 saved/read articles exist:
Data model (ready now)
- Articles already store summary, tags, and reason, which is sufficient for keyword search.
- PostgreSQL full-text search (search_vector tsvector generated column and idx_articles_search GIN index) is created and indexes title/summary/reason/tags/source_name/body.
- Backend /api/search?q=... endpoint is live.
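For orientation, a query against the search_vector column might be built roughly as follows. This is a sketch under assumptions: the 'english' text search configuration, the user_id column on articles, the psycopg-style named placeholders, and the `build_search_query` helper are not confirmed by this specification.

```python
# Parameterised SQL; websearch_to_tsquery and ts_rank are built-in PostgreSQL functions.
SEARCH_SQL = """
SELECT id, title, url
FROM articles
WHERE user_id = %(user_id)s
  AND search_vector @@ websearch_to_tsquery('english', %(q)s)
ORDER BY ts_rank(search_vector, websearch_to_tsquery('english', %(q)s)) DESC
LIMIT %(limit)s
"""

MAX_LIMIT = 100  # assumed upper bound on page size


def build_search_query(q, user_id, limit=20):
    """Normalise the query text and return (sql, params) for a DB driver."""
    q = " ".join(q.split())
    if not q:
        raise ValueError("empty query")
    return SEARCH_SQL, {"q": q, "user_id": user_id, "limit": min(limit, MAX_LIMIT)}
```

Passing the user text as a bound parameter keeps it out of the SQL string, and websearch_to_tsquery tolerates free-form input such as quoted phrases.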
Full-text extraction (v1.1)
- Add optional full_text column to articles table.
- Fetch full article body for saved/read articles (via Trafilatura or goose3).
- Run FTS over full_text when available.
Embeddings / semantic search (v1.2)
- Embed title + summary via the configured OpenAI-compatible embeddings provider.
- Store embeddings in PostgreSQL-managed article data structures; runtime storage remains PostgreSQL only.
- Enables semantic search and similarity grouping without adding a second runtime database.
- Configure with FREE_LLM_API_KEY / FREE_LLM_BASE_URL (or the OPENAI_API_KEY / OPENAI_BASE_URL fallback); see README.md and backend/news_dashboard/embeddings.py.
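The variable fallback described above could be resolved along these lines. The `resolve_llm_config` helper is hypothetical, and whether the real code falls back per variable (as here) or as a pair is not stated; see backend/news_dashboard/embeddings.py for the actual behaviour.

```python
import os


def resolve_llm_config(env=None):
    """Prefer FREE_LLM_* variables, fall back to OPENAI_*; None means disabled."""
    env = os.environ if env is None else env
    api_key = env.get("FREE_LLM_API_KEY") or env.get("OPENAI_API_KEY")
    base_url = env.get("FREE_LLM_BASE_URL") or env.get("OPENAI_BASE_URL")
    if not api_key:
        # No key configured: the feature stays off and nothing is sent out.
        return None
    return {"api_key": api_key, "base_url": base_url}
```

Returning None when no key is set gives callers a single check for "feature disabled", which matches the rule that no article content leaves the server unless a provider is configured.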
Ask Clau endpoint (v1.3)
Scope: questions over articles with status saved or read only (not new/skipped/archived).
Proposed API:
POST /api/ask
{ "question": "What LangGraph updates happened last month?", "limit": 20 }
→ { "answer": "...", "citations": [{ "id": 42, "title": "...", "url": "..." }] }
Implementation:
- Run /api/search?q=<question_keywords> to retrieve candidate articles.
- Bundle up to N articles (title + summary + date + source) into a prompt.
- Call OpenAI with the bundle and question.
- Return answer + article IDs used as citations.
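The four steps above can be sketched with the search and model calls injected as plain callables, which keeps the flow testable without a network. The `answer_question` helper, the prompt wording, and the article dict keys are assumptions; `search` and `complete` stand in for the /api/search lookup and the LLM call.

```python
ASKABLE_STATUSES = ("saved", "read")  # scope stated for this endpoint


def answer_question(question, search, complete, limit=20):
    """search(question, limit) -> list of article dicts; complete(prompt) -> str."""
    # Step 1: retrieve candidates and keep only saved/read articles.
    candidates = [
        a for a in search(question, limit) if a["status"] in ASKABLE_STATUSES
    ][:limit]
    # Step 2: bundle title + summary + date + source into the prompt.
    context = "\n".join(
        f"[{a['id']}] {a['title']} ({a['source']}, {a['date']}): {a['summary']}"
        for a in candidates
    )
    prompt = (
        "Answer using only the articles below and cite ids in brackets.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    # Step 3: call the model. Step 4: return the answer with citations.
    return {
        "answer": complete(prompt),
        "citations": [
            {"id": a["id"], "title": a["title"], "url": a["url"]}
            for a in candidates
        ],
    }
```

This version cites every article placed in the prompt; narrowing citations to the ids the model actually references would need a parsing step on the answer text.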
Privacy/security:
- Endpoint requires the app authentication boundary, either local password sessions or optional Keycloak SSO.
- No article content is sent to external APIs unless the relevant AI feature is configured with an API key.
- OpenAI API key stored as an environment secret, never in source.
- Briefing generation uses FREE_LLM_API_KEY / FREE_LLM_BASE_URL (or the OPENAI_API_KEY fallback) and OPENAI_BRIEFING_MODEL; see README.md and backend/news_dashboard/briefings.py.
Privacy/security boundaries
- No article content leaves the server unless the relevant feature-specific AI provider variables are configured.
- Search index uses PostgreSQL full-text search.
- Local password auth is built into the app, and production can enable Keycloak SSO while preserving local user_id authorization boundaries. See Authentication (Keycloak).
- Caddy is a reverse proxy for the app and Keycloak paths, not the primary authentication layer.
Deployment
Container
Built via Dockerfile at repo root. Published to GHCR via GitHub Actions on every push to main.
ghcr.io/lihor-hub/news-dashboard:latest
ghcr.io/lihor-hub/news-dashboard:<sha>
Kubernetes (Helm)
# Deploy with NodePort for host-side Caddy proxying
helm upgrade --install news-dashboard ./helm/news-dashboard \
--set service.type=NodePort \
--set service.nodePort=30088
# Caddy proxies to 127.0.0.1:30088 (never to a mutable ClusterIP)
Caddyfile pattern
Use reverse_proxy 127.0.0.1:<nodePort> with a fixed NodePort value.
Do NOT use a ClusterIP — it can change when the Service is recreated.
The NodePort value (e.g., 30088) is stable across pod restarts and Helm upgrades
as long as the service.nodePort value in values.yaml stays the same.