# RepoReaper: Autonomous Architectural Analysis and Bilingual Semantic Search via Dynamic RAG Cache

RepoReaper redefines "Chat with Code" by treating an LLM as the CPU and a vector store as a high-speed L2 cache, enabling autonomous traversal and on-demand enrichment of a repository's context. Leveraging AST-aware chunking, hybrid BM25/vector retrieval, and a ReAct-based just-in-time agent, it delivers precise, multilingual code-level insights without static indexing. Designed for production, the system ships in Docker, supports the DeepSeek and SiliconFlow APIs, and offers a polished Live Demo with intelligent language handling.
## Introduction
Modern software maintenance increasingly demands tooling that can understand large codebases in real time, without the latency of full indexing or the brittleness of static search pipelines. **RepoReaper** addresses this challenge by treating the Large Language Model (LLM) itself as the CPU while the vector store functions as an adaptive L2 cache. The framework parses the repository's Abstract Syntax Tree (AST) to build a lightweight symbol map, dynamically pre-fetches the most architecturally relevant files, and finally employs a ReAct loop to fetch missing context on demand.
## Core Philosophy: RAG as a Dynamic Cache
Unlike conventional Retrieval-Augmented Generation (RAG) systems, which perform static look-ups, RepoReaper's RAG layer acts as a real-time, just-in-time cache:
* **Cold Start → Repo Map**: A one-time AST traversal generates a global map of classes, functions, and modules, enabling instant navigation of the code tree.
* **Prefetching → Analysis Phase**: The agent autonomously selects the 10–20 files most impactful for architectural comprehension, parses them, and pre-loads their embeddings into the cache.
* **Cache-Miss Handling → ReAct Loop**: During user queries, if the BM25+vector retrieval returns insufficient context, the agent issues a tool command to pull the missing files via the GitHub API, updates the cache, and regenerates the answer seamlessly.
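The cache-miss flow above can be sketched in a few lines. This is a minimal illustration, not RepoReaper's actual code: `retrieve`, `fetch_from_github`, and the answer format are hypothetical stand-ins for the real hybrid retriever and GitHub fetcher.

```python
def retrieve(query, cache):
    """Stand-in for the hybrid BM25+vector lookup against the warm cache."""
    return [doc for doc in cache if query.lower() in doc.lower()]

def fetch_from_github(missing_paths):
    """Stand-in for the GitHub API fetch triggered on a cache miss."""
    return [f"contents of {path}" for path in missing_paths]

def answer_with_cache(query, cache, needed_paths, max_rounds=2):
    for _ in range(max_rounds):
        context = retrieve(query, cache)
        if context:                        # cache hit: answer directly
            return f"answer({query}) using {len(context)} snippets"
        # cache miss: pull missing files, add them to the cache, retry
        cache.extend(fetch_from_github(needed_paths))
    return "insufficient context"

cache = ["def parse_ast(src): ..."]
print(answer_with_cache("parse_ast", cache, ["app/parser.py"]))
```

The key property is that the retry happens inside one request cycle: the user never sees the miss, only the regenerated answer.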
## Architectural Innovations
1. **AST-Aware Semantic Chunking**
* **Logical Boundaries**: Code is split at class and method definitions rather than by raw token windows, preserving logical cohesion.
* **Context Injection**: Parent class signatures and docstrings are embedded in each method chunk, giving the LLM insight into both purpose ("why") and implementation ("how").
2. **Asynchronous Concurrency Pipeline**
* Built atop *asyncio* and *httpx*, the system performs repository parsing, AST extraction, and vector embedding in a non-blocking fashion.
* Deployment uses *Gunicorn* with *Uvicorn* workers; the *VectorStoreManager* synchronizes context via persistent ChromaDB instances, ensuring stateless workers without race conditions.
3. **Just-In-Time ReAct Agent**
* **Query Rewrite**: The LLM translates ambiguous or bilingual queries into canonical, English-only technical terms for optimal BM25/vector search.
* **Self-Correction**: When context is insufficient, the agent emits a tool invocation, fetches the exact file snippets, re-indexes them, and re-invokes the model within the same inference cycle.
4. **Hybrid Search Mechanism**
* Dense retrieval applies BAAI/bge-m3 embeddings to capture conceptual similarity.
* Sparse retrieval (BM25Okapi from the rank-bm25 package) preserves exact-name matching for function signatures and error codes.
* Results are fused via Reciprocal Rank Fusion (RRF) to rank the most relevant snippets for the LLM.
5. **Native Bilingual Support**
* The prompt engineering module detects the language of the user input and swaps the System Prompt accordingly, ensuring that the tone, terminology, and output format honor the user's locale.
* A language toggle in the UI propagates through the entire pipeline, including the initial architectural report and subsequent Q&A.
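To make the AST-aware chunking concrete, here is a minimal sketch of the idea using Python's standard `ast` module (the stack table lists `ast` as the parser). The function names and the context-injection format are illustrative assumptions, not RepoReaper's actual implementation.

```python
import ast

SOURCE = '''
class Cache:
    """A tiny LRU-style cache."""
    def get(self, key):
        return self._data.get(key)
'''

def ast_chunks(source):
    """Split code at class/function boundaries, prefixing each method
    chunk with its parent class signature and docstring for context."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            header = f"class {node.name}:"
            doc = ast.get_docstring(node) or ""
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    body = ast.get_source_segment(source, item)
                    chunks.append(f"{header}  # {doc}\n{body}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks

for chunk in ast_chunks(SOURCE):
    print(chunk)
```

Each chunk carries the enclosing class header and docstring, so an embedding of `get` alone still encodes that it belongs to a cache.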
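Reciprocal Rank Fusion, used above to merge the sparse and dense rankings, is simple enough to show in full. This is the standard RRF formula (score(d) = Σ 1/(k + rank)), with hypothetical file names as documents; k=60 is the commonly used default, not a value stated by the project.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each document scores 1/(k + rank)
    in every ranking that contains it; scores are summed across rankings."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits  = ["utils.py", "parser.py", "main.py"]   # sparse (exact-match) ranking
dense_hits = ["parser.py", "main.py", "cache.py"]   # dense (semantic) ranking
print(rrf_fuse([bm25_hits, dense_hits]))
# "parser.py" ranks first: it appears near the top of both lists
```

Because RRF only uses ranks, not raw scores, it fuses BM25 and cosine-similarity results without any score normalization.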
## Technical Stack
| Layer | Technology |
|---|---|
| Core | Python 3.10+, FastAPI, AsyncIO |
| LLM | OpenAI SDK (DeepSeek/SiliconFlow) |
| Vector DB | ChromaDB (persistent disk) |
| Search | BM25Okapi (rank-bm25), RRF |
| Parsing | Python ast |
| Frontend | HTML5, ServerâSent Events, Mermaid.js |
| Deployment | Docker, Gunicorn, Uvicorn |
## Performance & Reliability
* **Session Management**: Combines browser sessionStorage with server-side persistence, allowing warm cache state to survive page refreshes.
* **Network Resilience**: Graceful handling of GitHub API throttling (403/429) and timeouts ensures a consistent user experience.
* **Memory Efficiency**: The VectorStoreManager maintains state on disk only, preventing memory leaks in long-running containers.
## Quick Start Guide
> **Prerequisites**: Python 3.10+, a valid GitHub Personal Access Token, and LLM API keys (DeepSeek-V3 or SiliconFlow recommended).
1. **Clone**
   ```bash
   git clone https://github.com/tzzp1224/RepoReaper.git
   cd RepoReaper
   ```
2. **Create Virtual Environment** (recommended)
   ```bash
   python -m venv venv
   source venv/bin/activate  # Windows: venv\Scripts\activate
   ```
3. **Install Dependencies**
   ```bash
   pip install -r requirements.txt
   ```
4. **Configure Environment**: create `.env` in the project root:
   ```dotenv
   GITHUB_TOKEN=ghp_your_token_here
   DEEPSEEK_API_KEY=sk_your_key_here
   SILICON_API_KEY=sk_your_key_here
   ```
5. **Run Locally**: universal option:
   ```bash
   python -m app.main
   ```
   For production, you may use Gunicorn:
   ```bash
   gunicorn -c gunicorn_conf.py app.main:app
   ```
6. **Docker Deployment**
   ```bash
   docker build -t reporeaper .
   docker run -d -p 8000:8000 --env-file .env --name reporeaper reporeaper
   ```
7. **Access**: open the application in your browser (the Docker command above maps port 8000) and input a GitHub repository URL to trigger autonomous analysis.
## Live Demo & Availability
The project hosts a public demo on a server in Seoul; however, shared API quotas may cause rate-limit errors (403/429). For a smooth experience, especially for users in China, clone the repository and run it locally.
---
RepoReaper exemplifies how an LLM-driven analysis agent can move beyond static indexing to provide a real-time, bilingual, architecture-aware code exploration experience, making it a potent tool for senior technical leads and platform teams seeking deeper repository introspection without the overhead of traditional tooling.