
Product specification: news.lihor.ro

Purpose

A self-hosted, open-source news dashboard. It combines:

  1. News inbox — Clau/cron automatically collects items from useful sources via RSS, GitHub release feeds, trending feeds, and scraped pages.
  2. Reading tracker — records what was read, saved, skipped, and archived with timestamps.
  3. Summary archive — stores contextual summaries and "why this matters" blurbs per article card.
  4. Search — full-text search across titles, summaries, tags, and sources.
  5. Source health — each feed shows last check time, success/error state, and item counts.
  6. Future AI question layer — ask Clau questions over saved/read history (see AI layer section).

Source taxonomy

See backend/news_dashboard/sources.py for the canonical source list.

Sources are split by category:

  • python — Python Insider, Astral blog, Ruff/uv/mypy/pyright/scikit-learn/scipy/PyTorch/TensorFlow releases
  • ai-llm — Anthropic (scraped), OpenAI, Google AI, Hugging Face, Augment Code, Simon Willison, Latent Space, Import AI, InfoQ
  • agents — LangChain, LangGraph, Langfuse releases
  • cloud-infra — Kubernetes, Docker, AWS ML blog
  • engineering — Pragmatic Engineer, GitHub Changelog, GitHub Engineering
  • trending — Hacker News Best, Hacker News AI search
  • repositories — GitHub Trending (All, Python, TypeScript)

Source kinds:

  • rss_feed — standard RSS/Atom feed via feedparser
  • github_release_feed — GitHub releases Atom feed
  • trending_feed — HN/GitHub trending RSS
  • scraped_page — custom HTML scraper (stdlib urllib + html.parser, no extra deps)
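A minimal sketch of how a source entry could be modeled, assuming a dataclass per source and an enum for the four kinds. The `Source` and `SourceKind` names, the field layout, and the placeholder URL are illustrative assumptions; the canonical definitions live in backend/news_dashboard/sources.py and may differ.

```python
from dataclasses import dataclass
from enum import Enum


class SourceKind(str, Enum):
    RSS_FEED = "rss_feed"
    GITHUB_RELEASE_FEED = "github_release_feed"
    TRENDING_FEED = "trending_feed"
    SCRAPED_PAGE = "scraped_page"


@dataclass(frozen=True)
class Source:
    # Hypothetical field layout, not the project's actual schema.
    name: str
    category: str
    kind: SourceKind
    url: str


def github_release_feed_url(owner: str, repo: str) -> str:
    """Return the Atom feed GitHub serves for a repository's releases."""
    return f"https://github.com/{owner}/{repo}/releases.atom"


# Illustrative entries only; the second URL is a placeholder.
EXAMPLE_SOURCES = [
    Source("Ruff releases", "python", SourceKind.GITHUB_RELEASE_FEED,
           github_release_feed_url("astral-sh", "ruff")),
    Source("Anthropic", "ai-llm", SourceKind.SCRAPED_PAGE,
           "https://example.invalid/anthropic-news"),
]


def sources_by_category(sources):
    """Group sources by their category, preserving input order."""
    grouped = {}
    for source in sources:
        grouped.setdefault(source.category, []).append(source)
    return grouped
```

Keeping the kind as an enum lets the ingest step dispatch to the right fetcher without string comparisons scattered through the code.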

Article status model

new → read
new → saved → read
new → skipped
any → archived
archived → read (restore)

Status transitions are tracked with timestamps (read_at, saved_at, skipped_at, archived_at).

Noise filtering

Per-source limits are configured in ingest.py::NOISE_FILTERS:

  • Broad feeds (HN, GitHub trending): capped at 15–20 items per run
  • Newsletter feeds (Import AI, Latent Space): capped at 5 per run
  • Curated blog feeds: capped at 50 per run (the default cap for feedparser-based sources)

A keyword include-list hook is available per-source for future tighter filtering.
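As a rough illustration of the per-source cap and the keyword hook, the sketch below assumes a dict keyed by source name. The real shape of ingest.py::NOISE_FILTERS is not shown in this document, so `EXAMPLE_NOISE_FILTERS`, `max_items`, `include_keywords`, and `apply_noise_filter` are all hypothetical names.

```python
DEFAULT_MAX_ITEMS = 50  # cap for curated blog feeds, per the list above

# Hypothetical structure; the actual NOISE_FILTERS may be laid out differently.
EXAMPLE_NOISE_FILTERS = {
    "Hacker News Best": {"max_items": 20},
    "GitHub Trending (Python)": {"max_items": 15},
    "Import AI": {"max_items": 5},
    "Hacker News AI search": {"max_items": 15, "include_keywords": ["llm", "agent"]},
}


def apply_noise_filter(source_name, items, filters=EXAMPLE_NOISE_FILTERS):
    """Apply the optional keyword include-list, then the per-run cap."""
    rule = filters.get(source_name, {})
    keywords = [k.lower() for k in rule.get("include_keywords", [])]
    if keywords:
        items = [
            item for item in items
            if any(k in item["title"].lower() for k in keywords)
        ]
    return items[: rule.get("max_items", DEFAULT_MAX_ITEMS)]
```

Filtering by keyword before applying the cap means the cap counts only relevant items, which is one possible design; capping first would be cheaper but could discard matches.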

Summary / reason generation

Summaries are the first 280 characters of the feed description (after HTML cleaning).
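One way to do the cleaning and truncation with only the standard library is sketched below. The `make_summary` name and the whitespace normalization are assumptions; the project's actual cleaning step may handle more cases.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and drop tags; entities are unescaped by default."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def make_summary(description: str, limit: int = 280) -> str:
    """Strip HTML, collapse whitespace, and keep the first `limit` characters."""
    parser = _TextExtractor()
    parser.feed(description)
    parser.close()  # flush any buffered trailing text
    text = re.sub(r"\s+", " ", "".join(parser.parts)).strip()
    return text[:limit]
```

Truncating after cleaning matters: cutting the raw HTML at 280 characters could split a tag or entity and leave debris in the summary.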

The reason field is a contextual blurb generated by make_reason():

  • Release feeds → "New release vX.Y.Z from {Source}."
  • Security content → "Security update from {Source} — review recommended."
  • Tutorial/how-to → "How-to or deep-dive on {Category} from {Source}."
  • Trending (HN) → "Trending on Hacker News."
  • Trending (GitHub) → "Trending {Language} repository on GitHub today."
  • AI/agent content → "AI/agent development news from {Source}."
  • Python content → "Python ecosystem update from {Source}."
  • Fallback → "{Category} — {Source}."
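The rules above could be implemented as a simple cascade. The sketch below is a guess at the shape of make_reason(): the signature, the keyword lists, the version regex, and the rule order (taken from the list order) are all assumptions, and only the output strings come from this specification.

```python
import re


def make_reason(title, source, category, kind, language=None):
    """Return a short contextual blurb; first matching rule wins (assumed order)."""
    lowered = title.lower()
    if kind == "github_release_feed":
        match = re.search(r"\bv?(\d+(?:\.\d+)+)", title)
        if match:
            return f"New release v{match.group(1)} from {source}."
        return f"New release from {source}."  # no version found in the title
    if any(word in lowered for word in ("security", "vulnerability", "cve-")):
        return f"Security update from {source} — review recommended."
    if any(word in lowered for word in ("how to", "tutorial", "deep dive")):
        return f"How-to or deep-dive on {category} from {source}."
    if kind == "trending_feed":
        if "github" in source.lower():
            # `language` is a hypothetical parameter carrying the trending filter.
            return f"Trending {language or 'All'} repository on GitHub today."
        return "Trending on Hacker News."
    if category in ("ai-llm", "agents"):
        return f"AI/agent development news from {source}."
    if category == "python":
        return f"Python ecosystem update from {source}."
    return f"{category} — {source}."
```

A release title such as "v1.2.3" from a source named Ruff produces "New release v1.2.3 from Ruff." under these assumptions.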

Source health tracking

Each source stores:

  • last_checked_at — timestamp of last fetch attempt
  • last_success_at — timestamp of last successful fetch
  • last_error — last error message (null if no error)
  • last_fetched_count — items found in last run
  • last_inserted_count — new items inserted in last run

The Sources tab in the UI shows a colored health badge (ok / stale / error) and a truncated error message when present.
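The badge can be derived from the stored fields alone. The sketch below assumes a 24-hour staleness threshold and an 80-character error preview; neither value is stated in this specification, and `health_badge` and `truncate_error` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)  # assumed threshold, not from the spec


def health_badge(last_success_at, last_error, now=None):
    """Map stored health fields to one of: 'ok', 'stale', 'error'."""
    now = now or datetime.now(timezone.utc)
    if last_error:
        return "error"
    if last_success_at is None or now - last_success_at > STALE_AFTER:
        return "stale"
    return "ok"


def truncate_error(message, limit=80):
    """Shorten an error message for display, marking the cut with '...'."""
    if message is None or len(message) <= limit:
        return message
    return message[: limit - 3] + "..."
```

Checking `last_error` first means a source that failed on its latest run shows as an error even if it succeeded recently, which assumes last_error is cleared on the next successful fetch.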

Non-goals for v1

  • No public unauthenticated access.
  • No broad all-tech firehose.
  • No complex AI Q&A before article history exists.
  • No manual article creation as a primary workflow.

Future AI layer (v1.1+)

Planned for once at least 100 saved/read articles exist:

Data model (ready now)

  • Articles already store summary, tags, reason — sufficient for keyword search.
  • PostgreSQL full-text search is already set up: the search_vector tsvector generated column and the idx_articles_search GIN index cover title, summary, reason, tags, source_name, and body.
  • Backend /api/search?q=... endpoint is live.
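A query against the search_vector column might look like the sketch below. It is an assumption about how /api/search could be backed, not the endpoint's actual code: the selected columns, the 'english' text search configuration, the limit bounds, and the `build_search_query` helper are illustrative. `websearch_to_tsquery` and `ts_rank` are standard PostgreSQL functions, and the named placeholders follow the psycopg style.

```python
SEARCH_SQL = """
SELECT id, title, url, ts_rank(search_vector, query) AS rank
FROM articles, websearch_to_tsquery('english', %(q)s) AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT %(limit)s
"""


def build_search_query(q: str, limit: int = 20):
    """Return (sql, params) for a parameterized full-text search."""
    q = q.strip()
    if not q:
        raise ValueError("search query must not be empty")
    # Clamp the limit so a client cannot request an unbounded result set.
    return SEARCH_SQL, {"q": q, "limit": max(1, min(limit, 100))}
```

Passing the user's text as a bound parameter keeps it out of the SQL string, and the GIN index on search_vector is what lets the `@@` match avoid a sequential scan.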

Full-text extraction (v1.1)

  • Add optional full_text column to articles table.
  • Fetch full article body for saved/read articles (via Trafilatura or goose3).
  • Run FTS over full_text when available.

Embeddings / semantic search (v1.2)

  • Embed title + summary via the configured OpenAI-compatible embeddings provider.
  • Store embeddings in PostgreSQL alongside the article data; PostgreSQL remains the only runtime datastore.
  • Enables semantic search and similarity grouping without adding a second runtime database.
  • Configure with FREE_LLM_API_KEY / FREE_LLM_BASE_URL (or the OPENAI_API_KEY / OPENAI_BASE_URL fallback); see README.md and backend/news_dashboard/embeddings.py.
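The variable fallback described above can be sketched as follows. The `resolve_embedding_config` helper and its return shape are hypothetical; the authoritative logic is in backend/news_dashboard/embeddings.py.

```python
import os


def resolve_embedding_config(env=None):
    """Prefer FREE_LLM_* variables, fall back to OPENAI_*, else return None."""
    env = os.environ if env is None else env
    api_key = env.get("FREE_LLM_API_KEY") or env.get("OPENAI_API_KEY")
    base_url = env.get("FREE_LLM_BASE_URL") or env.get("OPENAI_BASE_URL")
    if not api_key:
        # No key configured: treat the embeddings feature as disabled.
        return None
    return {"api_key": api_key, "base_url": base_url}
```

Returning `None` when no key is set gives callers a single check for whether the feature is enabled, so no article text is sent anywhere by default.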

Ask Clau endpoint (v1.3)

Scope: questions over articles with status saved or read only (not new/skipped/archived).

Proposed API:

POST /api/ask
{ "question": "What LangGraph updates happened last month?", "limit": 20 }
→ { "answer": "...", "citations": [{ "id": 42, "title": "...", "url": "..." }] }

Implementation:

  1. Run /api/search?q=<question_keywords> to retrieve candidate articles.
  2. Bundle up to N articles (title + summary + date + source) into a prompt.
  3. Call the configured OpenAI-compatible provider with the bundle and the question.
  4. Return answer + article IDs used as citations.
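The four steps above can be sketched with the search and model calls injected as plain callables, which keeps the example independent of any HTTP client or SDK. Everything here is hypothetical: `ask`, `build_prompt`, the prompt wording, and the article dict keys are assumptions about a proposed endpoint that does not exist yet.

```python
ALLOWED_STATUSES = ("saved", "read")  # scope of the Ask Clau endpoint


def build_prompt(question, articles):
    """Bundle title, summary, date, and source of each article into one prompt."""
    lines = ["Answer the question using only the articles below.", ""]
    for a in articles:
        lines.append(
            f"[{a['id']}] {a['title']} ({a['source']}, {a['date']}): {a['summary']}"
        )
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def ask(question, search, complete, limit=20):
    """Retrieve candidates, prompt the model, and return answer plus citations.

    `search(question)` returns article dicts; `complete(prompt)` returns text.
    """
    candidates = [
        a for a in search(question) if a["status"] in ALLOWED_STATUSES
    ][:limit]
    answer = complete(build_prompt(question, candidates))
    citations = [
        {"id": a["id"], "title": a["title"], "url": a["url"]} for a in candidates
    ]
    return {"answer": answer, "citations": citations}
```

In this sketch every bundled article is listed as a citation; a stricter version could ask the model to name the IDs it relied on and cite only those.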

Privacy/security:

  • The endpoint sits behind the app's authentication boundary: local password sessions or, optionally, Keycloak SSO.
  • No article content is sent to external APIs unless the relevant AI feature is configured with an API key.
  • The OpenAI API key is stored as an environment secret, never in source.
  • Briefing generation uses FREE_LLM_API_KEY / FREE_LLM_BASE_URL (or the OPENAI_API_KEY fallback) and OPENAI_BRIEFING_MODEL; see README.md and backend/news_dashboard/briefings.py.

Privacy/security boundaries

  • No article content leaves the server unless the relevant feature-specific AI provider variables are configured.
  • Search index uses PostgreSQL full-text search.
  • Local password auth is built into the app, and production can enable Keycloak SSO while preserving local user_id authorization boundaries. See Authentication (Keycloak).
  • Caddy is a reverse proxy for the app and Keycloak paths, not the primary authentication layer.

Deployment

Container

Built from the Dockerfile at the repo root and published to GHCR via GitHub Actions on every push to main.

ghcr.io/lihor-hub/news-dashboard:latest
ghcr.io/lihor-hub/news-dashboard:<sha>

Kubernetes (Helm)

# Deploy with NodePort for host-side Caddy proxying
helm upgrade --install news-dashboard ./helm/news-dashboard \
--set service.type=NodePort \
--set service.nodePort=30088

# Caddy proxies to 127.0.0.1:30088 (never to a mutable ClusterIP)

Caddyfile pattern

Use reverse_proxy 127.0.0.1:<nodePort> with a fixed NodePort value. Do NOT use a ClusterIP — it can change when the Service is recreated.

The NodePort value (e.g., 30088) is stable across pod restarts and Helm upgrades as long as the service.nodePort value in values.yaml stays the same.