RAG (retrieval-augmented generation) pipeline
Document ingestion, embedding, vector search, LLM generation, and response streaming for a production RAG application.
The prompt
Production RAG architecture. Ingestion: documents (PDFs, HTML, Markdown) land in S3. A worker chunks them, generates embeddings via an embedding model API, and writes vectors plus metadata to a vector database (Pinecone or pgvector). Query path: the user submits a question to the API. The API embeds the query, performs a top-k vector search, reranks results with a cross-encoder, and constructs a prompt with retrieved context plus the user question. The prompt is sent to an LLM (OpenAI or Anthropic). The response streams back to the user. Show the conversation memory store (Redis), the prompt-injection guardrails, and the evaluation/feedback loop that captures user thumbs-up/down feedback and feeds it into a periodic eval pipeline.
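For orientation, the query path described in the prompt might look roughly like the sketch below. It is not part of the prompt itself; it assumes the OpenAI Python SDK for embeddings and generation, a Pinecone index already populated by the ingestion worker, and a sentence-transformers cross-encoder for reranking. The index name, model names, and the "text" metadata field are illustrative placeholders.

```python
# Rough sketch of the RAG query path: embed -> vector search -> rerank ->
# prompt construction -> streamed LLM response. Names and models are
# placeholder assumptions, not prescribed by the template.
import os
from openai import OpenAI
from pinecone import Pinecone
from sentence_transformers import CrossEncoder

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("docs")  # hypothetical index name
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def answer(question: str, top_k: int = 20, keep: int = 5):
    # 1. Embed the user question with the same model used at ingestion time.
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # 2. Top-k vector search against the document index.
    hits = index.query(vector=emb, top_k=top_k, include_metadata=True).matches

    # 3. Rerank candidates with the cross-encoder and keep the best few.
    scores = reranker.predict([(question, h.metadata["text"]) for h in hits])
    ranked = [h for _, h in sorted(zip(scores, hits), key=lambda p: -p[0])][:keep]

    # 4. Construct the prompt from retrieved context plus the user question.
    context = "\n\n".join(h.metadata["text"] for h in ranked)
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

    # 5. Stream the LLM response back to the caller token by token.
    stream = openai_client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, stream=True
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

In a production deployment, the same function would also read and write conversation turns in Redis, run the question through prompt-injection guardrails before retrieval, and log the response plus user feedback for the periodic eval pipeline.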
What it generates
A diagram of a production-grade RAG system including ingestion, retrieval, generation, memory, guardrails, and evaluation.
When to use it
When you're building any LLM-powered application that needs to answer questions over your own data — internal docs, customer support, knowledge bases.
Generate this diagram in seconds
Copy the prompt above, sign in for free, and paste it into the generator.
Related data & AI templates
ETL data pipeline
Batch + streaming ETL into a lakehouse: source → ingestion → transformation → warehouse → BI/ML consumers.
Multi-agent LLM system
Hierarchical multi-agent architecture: orchestrator agent dispatches to specialist agents with shared memory and tool access.
MLOps training & inference pipeline
End-to-end ML lifecycle: feature store, training pipeline, model registry, online inference, and monitoring.