Question 1

What is a RAG pipeline and why is it vulnerable?

Accepted Answer

A RAG (Retrieval-Augmented Generation) pipeline connects an AI model to an external knowledge base. Documents are converted into numerical representations called embeddings and stored in a vector database; when a user asks a question, the fragments whose embeddings are closest to the query are retrieved and passed to the model. The vulnerability arises because vector similarity search ranks results by mathematical distance between embeddings, not by document permissions. If access controls from the source system are not replicated in the vector database and enforced at query time, any user can potentially retrieve fragments of documents they should not have access to.
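
The gap between similarity and permissions can be sketched in a few lines of Python. This is a minimal illustration, not a real vector database: the three-dimensional vectors, the `INDEX` list, the `allowed_groups` field, and the `search` function are all hypothetical names invented for the example.

```python
import math

# Toy index: each fragment carries a hand-written 3-dimensional "embedding"
# and the groups allowed to read it. A real pipeline would obtain the vectors
# from an embedding model; these values are made up for illustration.
INDEX = [
    {"text": "Office hours are 9 to 5.",
     "vec": [1.0, 0.0, 0.0],
     "allowed_groups": {"staff", "hr"}},
    {"text": "Salary review notes for the HR team.",
     "vec": [0.0, 1.0, 0.0],
     "allowed_groups": {"hr"}},
]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, k=1, group=None):
    """Return the k most similar fragments.

    With group=None the search ranks purely by distance (the vulnerable case).
    With a group, fragments the caller may not read are dropped before ranking.
    """
    candidates = INDEX
    if group is not None:
        candidates = [c for c in INDEX if group in c["allowed_groups"]]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in ranked[:k]]


query = [0.2, 0.9, 0.0]  # hypothetical embedding of a question about salaries
print(search(query))                 # no permission check: ranks by distance only
print(search(query, group="staff"))  # permission filter applied before ranking
```

Without the filter, the HR-only fragment is the nearest neighbour and is returned to whoever asks. Filtering candidates by the caller's group before ranking is one common mitigation; the sketch assumes group membership has been copied from the source system into the index.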

Question 2

How does embedding inversion work as an attack?

Accepted Answer

Embedding inversion is a technique in which an attacker uses a document's stored embedding vector to reconstruct its original text. Although embeddings are designed to capture semantic meaning rather than exact wording, research has shown that significant portions of the original text can be recovered, especially when the attacker has access to the same embedding model that produced the vectors. This means that even if the RAG system never returns the full document, the stored embeddings themselves can leak data if the vector database is not properly secured.
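
The role of "access to the same embedding model" can be shown with a deliberately simplified sketch. Published inversion attacks typically train a decoder model against a real embedding model; the example below swaps in a much weaker stand-in, a greedy word search against a toy bag-of-words embedding, purely to show the feedback loop. `VOCAB`, `toy_embed`, and `greedy_invert` are hypothetical names, and the toy embedding ignores word order, so only the set of words is recovered.

```python
import math

# Small vocabulary the attacker is assumed to know (hypothetical).
VOCAB = ["retention", "period", "seven", "years", "office", "hours", "policy"]


def toy_embed(text):
    """Toy 'embedding model': word counts over VOCAB. Not a real model."""
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_invert(target_vec, max_words=10):
    """Guess words whose embedding approaches target_vec.

    At each step, try appending every vocabulary word, re-embed the guess with
    the same model, and keep the word that raises similarity the most. Stop
    when no word improves the score.
    """
    words, score = [], 0.0
    for _ in range(max_words):
        best_word, best_score = None, score
        for w in VOCAB:
            s = cosine(toy_embed(" ".join(words + [w])), target_vec)
            if s > best_score + 1e-9:
                best_word, best_score = w, s
        if best_word is None:
            break
        words.append(best_word)
        score = best_score
    return words


# The attacker sees only the stored vector, never the original text.
leaked_vec = toy_embed("retention period seven years")
print(greedy_invert(leaked_vec))
```

The point of the sketch is the loop, not the toy model: anyone who can read a stored vector and call the same embedding function can test guesses against it until the similarity is high, which is why the vector store needs the same protection as the documents it was built from.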

RAG Pipeline Exploitation

What You'll Learn in RAG Pipeline Exploitation

RAG Pipeline Exploitation — Training Steps

Targeting the Knowledge Base

Logging In with Stolen Credentials

Reconnaissance: Finding the Target

Opening the Legitimate Policy

Studying the Real Document

Crafting the Competing Document

Setting the Wrong Retention Period

The Secret Weapon: Keyword Stuffing

Uploading to the Knowledge Base

Selecting the Poisoned Document