mirror of
https://github.com/Dvorinka/Devour.git
synced 2026-06-03 20:13:03 +00:00
---
name: devour
description: >
  Context ingestion and management system for AI. Scrapes, indexes, and serves
  documentation from GitHub repos, OpenAPI specs, web docs, and local files.
  Provides semantic search via vector embeddings to feed relevant context to
  AI models. Runs in local mode (stdio) or remote mode (HTTP MCP server).
  Supports automatic updates via configurable scheduler. Integrates with
  OpenAI for embeddings and LLM context injection. Triggers on: "devour",
  "scrape docs", "index documentation", "context for AI", "vector search docs",
  "semantic search", "ingest documentation", "documentation to AI".
allowed-tools:
  - Read
  - Write
  - Edit
  - Glob
  - Grep
  - Bash
  - WebFetch
---

# Devour — Context Ingestion Skill

Comprehensive documentation scraping, indexing, and retrieval system for
feeding structured context to AI models. Orchestrates 5 specialized modules
and supports both local (stdio) and remote (HTTP) MCP modes.

## Quick Reference

| Command | What it does |
|---------|-------------|
| `/devour init [path]` | Initialize Devour for a project |
| `/devour get <language> <keyword>` | **NEW** Quick docs fetch for popular languages |
| `/devour scrape <source>` | Scrape docs from URL, GitHub, or local path |
| `/devour serve` | Start MCP server (local or remote) |
| `/devour query <text>` | Search indexed documentation |
| `/devour status` | Show index stats and health |
| `/devour sync` | Fetch updates from all configured sources |
| `/devour push <path>` | Push docs to remote MCP server |
| `/devour sources` | Manage documentation sources |

## Orchestration Logic

When the user invokes `/devour get <language> <keyword>`:

1. **Map language to base URL**:
   - `go http` → `https://pkg.go.dev/http`
   - `python asyncio` → `https://docs.python.org/3/library/asyncio.html`
   - `react hooks` → `https://react.dev/reference/react/hooks`
   - `docker compose` → `https://docs.docker.com/compose`

2. **Auto-detect source type** based on language:
   - Go → `godocs` parser
   - Python → `pythondocs` parser
   - React → `reactdocs` parser
   - Docker → `dockerdocs` parser

3. **Execute enhanced scrape** with pre-configured parameters:
   - Automatic language-specific parsing
   - Enhanced markdown formatting (if requested)
   - Metadata extraction and enrichment

4. **Return structured documentation**:
   - Rich markdown with TOC (if `--format markdown`)
   - JSON with full metadata (default)
   - Ready for AI context injection

When the user invokes `/devour scrape`:

1. **Detect source type** from URL/path:
   - GitHub: `github.com/org/repo` → Clone, extract docs
   - OpenAPI: Ends in `.json`/`.yaml` with OpenAPI spec → Parse endpoints
   - Web: HTTP/HTTPS URL → Crawl with Colly
   - Local: File path → Scan directory

2. **Scrape with appropriate parser**:
   - Extract content (markdown, HTML, code structure)
   - Clean and normalize text
   - Extract metadata (title, headings, code blocks)

3. **Generate embeddings**:
   - Chunk content appropriately (512-1024 tokens)
   - Call OpenAI embedding API
   - Store in vector database

4. **Update metadata**:
   - Track source, timestamp, content hash
   - Enable future update detection

When the user invokes `/devour query`:

1. Generate embedding for query text
2. Perform vector similarity search
3. Return top-K results with metadata
4. Optionally inject into AI context

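The similarity search and top-K steps of the query flow can be sketched in Go as follows. This is a minimal illustration, not Devour's actual code: `scoredChunk`, `cosine`, and `topK` are hypothetical names, the vectors are toy data, and a brute-force scan stands in for whatever the configured vector database does internally.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// scoredChunk is a hypothetical result type: a chunk ID plus its similarity score.
type scoredChunk struct {
	ID    string
	Score float64
}

// cosine returns the cosine similarity of two equal-length vectors.
// It returns 0 when either vector has zero magnitude.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scores every stored vector against the query and keeps the k best.
func topK(query []float32, index map[string][]float32, k int) []scoredChunk {
	results := make([]scoredChunk, 0, len(index))
	for id, vec := range index {
		results = append(results, scoredChunk{ID: id, Score: cosine(query, vec)})
	}
	// Highest score first.
	sort.Slice(results, func(i, j int) bool { return results[i].Score > results[j].Score })
	if k >= 0 && k < len(results) {
		results = results[:k]
	}
	return results
}

func main() {
	index := map[string][]float32{
		"auth-guide": {1, 0, 0},
		"rate-limit": {0, 1, 0},
		"auth-faq":   {0.9, 0.1, 0},
	}
	for _, r := range topK([]float32{1, 0, 0}, index, 2) {
		fmt.Printf("%s %.3f\n", r.ID, r.Score)
	}
}
```

A linear scan like this is only practical for small indexes; it is shown here because it makes the ranking logic explicit.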
## Enhanced Features

### 🎯 Language-Aware Documentation Access

The `devour get` command provides intelligent, language-specific documentation retrieval:

**Supported Languages & Mappings:**
- `go`, `golang` → Go packages (pkg.go.dev)
- `rust` → Rust crates (docs.rs)
- `python`, `py` → Python modules (docs.python.org)
- `java` → Java packages (docs.oracle.com)
- `spring` → Spring Boot (docs.spring.io)
- `typescript`, `ts` → TypeScript (typescriptlang.org)
- `react` → React (react.dev)
- `vue` → Vue.js (vuejs.org)
- `nuxt` → Nuxt (nuxt.com)
- `docker` → Docker (docs.docker.com)
- `cloudflare`, `cf` → Cloudflare (developers.cloudflare.com)
- `astro` → Astro (docs.astro.build)

**Usage Examples:**
```bash
/devour get go http          # Go HTTP package docs
/devour get python asyncio   # Python asyncio module
/devour get react hooks      # React Hooks reference
/devour get docker compose   # Docker Compose guide
/devour get rust tokio       # Rust Tokio crate docs
```

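The alias list above amounts to a lookup table. A minimal Go sketch of that lookup might look like the following; `docHosts` and `resolveHost` are hypothetical names, and the sketch stops at the documentation host because the rule for building the full page URL from the keyword is not specified here.

```go
package main

import (
	"fmt"
	"strings"
)

// docHosts maps each alias from the list above to its documentation host.
var docHosts = map[string]string{
	"go":         "pkg.go.dev",
	"golang":     "pkg.go.dev",
	"rust":       "docs.rs",
	"python":     "docs.python.org",
	"py":         "docs.python.org",
	"java":       "docs.oracle.com",
	"spring":     "docs.spring.io",
	"typescript": "typescriptlang.org",
	"ts":         "typescriptlang.org",
	"react":      "react.dev",
	"vue":        "vuejs.org",
	"nuxt":       "nuxt.com",
	"docker":     "docs.docker.com",
	"cloudflare": "developers.cloudflare.com",
	"cf":         "developers.cloudflare.com",
	"astro":      "docs.astro.build",
}

// resolveHost normalizes the language argument (case, whitespace)
// and looks up its documentation host.
func resolveHost(language string) (string, error) {
	host, ok := docHosts[strings.ToLower(strings.TrimSpace(language))]
	if !ok {
		return "", fmt.Errorf("unsupported language %q", language)
	}
	return host, nil
}

func main() {
	for _, lang := range []string{"golang", "PY", "cobol"} {
		host, err := resolveHost(lang)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(lang, "->", host)
	}
}
```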
### 📝 Rich Markdown Enhancement

When using `--format markdown`, Devour automatically enhances documentation:

**Auto-Generated Structure:**
- 📋 Document metadata tables (source, type, timestamp)
- 📑 Table of contents from headings
- 🎨 Visual indicators for important content
- 🔗 Automatic URL-to-link conversion
- 📚 Proper heading hierarchy

**Content Enhancement:**
- `Example:` → 💡 **Example:**
- `Note:` → 📝 **Note:**
- `Warning:` → ⚠️ **Warning:**
- `Important:` → ❗ **Important:**
- `TODO:` → 📋 **TODO:**

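The label rewrites listed under Content Enhancement could be approximated with a single `strings.Replacer` from the Go standard library. This is only a sketch under assumptions: `labelReplacer` and `enhance` are hypothetical names, and Devour's real enhancement step may be more selective about where it substitutes.

```go
package main

import (
	"fmt"
	"strings"
)

// labelReplacer rewrites the plain labels listed above into their decorated forms.
var labelReplacer = strings.NewReplacer(
	"Example:", "💡 **Example:**",
	"Note:", "📝 **Note:**",
	"Warning:", "⚠️ **Warning:**",
	"Important:", "❗ **Important:**",
	"TODO:", "📋 **TODO:**",
)

// enhance applies every label rewrite in one pass over the text.
func enhance(markdown string) string {
	return labelReplacer.Replace(markdown)
}

func main() {
	fmt.Println(enhance("Note: tokens expire after one hour.\nTODO: document refresh flow."))
}
```

Two limitations of this simple approach are worth knowing: it also rewrites labels that appear mid-sentence or inside code blocks, and it is not idempotent, because the decorated output still contains the original label text.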
**Example Output Structure:**
```markdown
# Package Name

## 📋 Document Information
| Property | Value |
|----------|-------|
| **Source** | https://pkg.go.dev/http |
| **Type** | `godocs` |
| **Scraped** | 2026-02-19 12:30:00 |

## 📑 Table of Contents
- [Functions](#functions)
- [Types](#types)
- [Examples](#examples)

## 📚 Content
# Functions

💡 **Example:** Usage example here...
```

## Source Type Detection

| Pattern | Type | Parser |
|---------|------|--------|
| `github.com/*/*` | GitHub | Git clone + markdown parser |
| `*.json`, `*.yaml` + OpenAPI keys | OpenAPI | Swagger parser |
| `http://*`, `https://*` | Web | Colly crawler |
| `./path`, `/path` | Local | Directory scanner |
| `*.md`, `*.rst`, `*.txt` | File | Direct parse |

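The detection table can be read as an ordered series of checks. The Go sketch below is one way to express it, with several assumptions: `detectSourceType` is a hypothetical name, the OpenAPI case is decided from the file extension alone (a real check would also fetch the document and look for OpenAPI keys), and single files are matched before directories so that `./README.md` is treated as a file.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// detectSourceType classifies a source string using only its textual shape.
func detectSourceType(src string) string {
	// Strip the scheme, if any, so "github.com/org/repo" and
	// "https://github.com/org/repo" are handled the same way.
	trimmed := strings.TrimPrefix(strings.TrimPrefix(src, "https://"), "http://")
	isURL := trimmed != src
	ext := strings.ToLower(path.Ext(src))

	switch {
	case strings.HasPrefix(trimmed, "github.com/") &&
		strings.Count(strings.Trim(trimmed, "/"), "/") >= 2:
		return "github"
	case isURL && (ext == ".json" || ext == ".yaml"):
		// Candidate only: confirm by checking the document for OpenAPI keys.
		return "openapi"
	case isURL:
		return "web"
	case ext == ".md" || ext == ".rst" || ext == ".txt":
		return "file"
	default:
		return "local"
	}
}

func main() {
	for _, src := range []string{
		"github.com/org/repo",
		"https://api.example.com/openapi.json",
		"https://docs.example.com",
		"./docs",
		"notes/README.md",
	} {
		fmt.Printf("%-40s %s\n", src, detectSourceType(src))
	}
}
```

URLs with query strings would need proper parsing (for example with `net/url`) before the extension check; that is left out to keep the sketch short.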
## Module Reference

### 1. Scraper Module (`internal/scraper`)

Responsible for fetching and parsing content from various sources.

**Supported sources:**
- GitHub repositories (clone, extract docs/, README.md)
- OpenAPI/Swagger specs (parse endpoints, schemas)
- HTML documentation sites (crawl, extract content)
- Markdown files (parse structure, code blocks)
- JSON/YAML configuration files

**Output format:**
```json
{
  "id": "doc-uuid",
  "source": "https://...",
  "type": "markdown",
  "title": "Document Title",
  "content": "Extracted text...",
  "metadata": {
    "headings": ["H1", "H2"],
    "code_blocks": ["go", "bash"],
    "links": ["url1", "url2"]
  },
  "timestamp": "2025-01-15T10:00:00Z"
}
```

### 2. Indexer Module (`internal/indexer`)

Converts documents into vector embeddings for semantic search.

**Features:**
- OpenAI embedding integration (text-embedding-3-small/large)
- Intelligent chunking (512-1024 tokens, respect boundaries)
- Metadata preservation
- Batch processing for efficiency

**Chunking strategy:**
```go
type Chunk struct {
	ID       string
	DocID    string
	Content  string
	Vector   []float32
	Metadata map[string]any
	Position int // Position in original doc
}
```

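To illustrate what boundary-respecting chunking can mean in practice, here is a minimal Go sketch that packs whole paragraphs into chunks up to a size limit. It rests on stated assumptions: `chunkText` is a hypothetical name, word count is used as a rough stand-in for token count, and paragraphs (blank-line separated) are the only boundary considered.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkText groups paragraphs into chunks of at most maxWords words.
// Word count approximates token count (an assumption of this sketch).
// A single paragraph longer than maxWords is kept whole rather than split.
func chunkText(text string, maxWords int) []string {
	var chunks []string
	var current []string
	words := 0
	for _, para := range strings.Split(text, "\n\n") {
		para = strings.TrimSpace(para)
		if para == "" {
			continue
		}
		n := len(strings.Fields(para))
		// Close the current chunk if this paragraph would overflow it.
		if words > 0 && words+n > maxWords {
			chunks = append(chunks, strings.Join(current, "\n\n"))
			current, words = nil, 0
		}
		current = append(current, para)
		words += n
	}
	if len(current) > 0 {
		chunks = append(chunks, strings.Join(current, "\n\n"))
	}
	return chunks
}

func main() {
	doc := "Intro paragraph here.\n\nSecond paragraph with more words in it.\n\nThird one."
	for i, c := range chunkText(doc, 8) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

Each returned string would then become the `Content` of one `Chunk`, with `Position` set to its index in the slice. A production chunker would count tokens with the embedding model's tokenizer and split oversized paragraphs.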
### 3. Server Module (`internal/server`)

Exposes context via the Model Context Protocol (MCP).

**Local mode (stdio):**
```
STDIN → JSON-RPC → Handler → Response → STDOUT
```

**Remote mode (HTTP):**
```
HTTP Request → Handler → Response → HTTP Response
```

**MCP Tools exposed:**
- `devour_query` - Semantic search
- `devour_add` - Add documents
- `devour_status` - Get stats
- `devour_sync` - Trigger update

**MCP Resources:**
- `devour://documents` - All indexed docs
- `devour://sources` - Configured sources
- `devour://stats` - Index statistics

### 4. Scheduler Module (`internal/scheduler`)

Manages automatic updates from configured sources.

**Default schedule:** Every 72 hours (3 days)

**Change detection methods:**
- Content hash comparison (default)
- Last-Modified timestamp
- ETag header
- Git commit hash (for repos)

**Configuration:**
```yaml
scheduler:
  enabled: true
  interval: 72h
  check_method: hash
  retry_count: 3
  retry_delay: 1h
```

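The default `hash` check method can be pictured as comparing a digest of freshly fetched content with the digest stored at the last sync. The Go sketch below shows that idea; `contentHash` and `hasChanged` are hypothetical names, and SHA-256 is an assumption, since the hash algorithm is not named in this document.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash returns the hex-encoded SHA-256 digest of the fetched content.
func contentHash(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// hasChanged compares new content against the hash stored at the last sync.
// An empty stored hash means the source has never been synced.
func hasChanged(storedHash string, content []byte) bool {
	return storedHash == "" || storedHash != contentHash(content)
}

func main() {
	stored := contentHash([]byte("# Docs v1"))
	fmt.Println("same content changed:", hasChanged(stored, []byte("# Docs v1")))
	fmt.Println("new content changed: ", hasChanged(stored, []byte("# Docs v2")))
}
```

Hashing requires downloading the full content on every check, which is why the cheaper `Last-Modified`, `ETag`, and Git commit checks exist as alternatives.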
### 5. AI Module (`internal/ai`)

Handles AI integrations for embeddings and context injection.

**Supported providers:**
- OpenAI (primary)
- Ollama (local, planned)
- Custom endpoints

**Context injection format:**
```go
type Context struct {
	Query        string
	Results      []SearchResult
	SystemPrompt string
}

func (c *Context) ToPrompt() string {
	// Format for LLM consumption
}
```

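The `ToPrompt` method above is only a skeleton. One possible way to fill it in is sketched below. The `SearchResult` fields are assumptions borrowed from the query result JSON in the Output Formats section (`content`, `source`, `score`), and the prompt layout itself is illustrative rather than Devour's actual format.

```go
package main

import (
	"fmt"
	"strings"
)

// SearchResult fields are assumed from the query result JSON in this document.
type SearchResult struct {
	Content string
	Source  string
	Score   float64
}

type Context struct {
	Query        string
	Results      []SearchResult
	SystemPrompt string
}

// ToPrompt renders the optional system prompt, numbered context excerpts,
// and finally the user's query.
func (c *Context) ToPrompt() string {
	var b strings.Builder
	if c.SystemPrompt != "" {
		b.WriteString(c.SystemPrompt + "\n\n")
	}
	b.WriteString("Context:\n")
	for i, r := range c.Results {
		fmt.Fprintf(&b, "[%d] (%s, score %.2f)\n%s\n\n", i+1, r.Source, r.Score, r.Content)
	}
	b.WriteString("Question: " + c.Query)
	return b.String()
}

func main() {
	ctx := &Context{
		Query:        "How do I authenticate?",
		SystemPrompt: "Answer using only the context below.",
		Results: []SearchResult{
			{Content: "Send the API key in the Authorization header.", Source: "https://docs.example.com/auth", Score: 0.89},
		},
	}
	fmt.Println(ctx.ToPrompt())
}
```

Numbering the excerpts and including the source gives the model something concrete to cite, which tends to make answers easier to verify.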
## Configuration Schema

### devour.yaml

```yaml
# Core configuration
version: 1

# Storage paths
storage:
  docs_dir: ./devour_data/docs
  index_dir: ./devour_data/index
  metadata_dir: ./devour_data/metadata

# Embedding configuration
embeddings:
  provider: openai
  model: text-embedding-3-small
  dimensions: 1536
  api_key: ${OPENAI_API_KEY}
  batch_size: 100

# Vector database
vector_db:
  type: chromem  # chromem, weaviate, faiss
  persist: true
  similarity_metric: cosine

# Scraping configuration
scraper:
  user_agent: "Devour/1.0 (+https://github.com/yourorg/devour)"
  timeout: 30s
  retry_count: 3
  retry_delay: 5s
  concurrency: 10
  rate_limit: 500ms
  max_depth: 3
  cache_dir: ./devour_data/cache

# Scheduler configuration
scheduler:
  enabled: true
  interval: 72h
  check_method: hash
  on_startup: false

# Server configuration
server:
  mode: local       # local, remote
  transport: stdio  # stdio, http
  host: localhost
  port: 8080
  cors:
    enabled: false
    origins: []

# Source definitions
sources:
  - name: example-docs
    type: url
    url: https://docs.example.com
    include:
      - "**/*.md"
      - "**/*.html"
    exclude:
      - "**/api/**"
      - "**/legacy/**"
    schedule: 24h  # Override global schedule

  - name: api-spec
    type: openapi
    url: https://api.example.com/openapi.json
    schedule: 168h  # Weekly

  - name: internal-repo
    type: github
    repo: myorg/myrepo
    branch: main
    paths:
      - docs/
      - README.md
    auth_token: ${GITHUB_TOKEN}
```

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `OPENAI_API_KEY` | OpenAI API key | Required |
| `DEVOUR_CONFIG` | Config file path | `./devour.yaml` |
| `DEVOUR_DATA_DIR` | Data directory | `./devour_data` |
| `GITHUB_TOKEN` | GitHub auth token | Optional |
| `DEVOUR_LOG_LEVEL` | Log level (debug, info, warn, error) | `info` |
| `DEVOUR_PORT` | Server port | `8080` |

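A Go program might read the variables in the table, applying the listed defaults, roughly as follows. `envOr`, `settings`, and `loadSettings` are hypothetical names used for illustration; only the variable names and default values come from the table.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// envOr returns the value of key, or fallback when it is unset or empty.
func envOr(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

// settings is a hypothetical holder for the variables in the table above.
type settings struct {
	APIKey      string
	ConfigPath  string
	DataDir     string
	GitHubToken string // optional, may be empty
	LogLevel    string
	Port        string
}

// loadSettings reads the environment and fails only when the required key is missing.
func loadSettings() (settings, error) {
	key := os.Getenv("OPENAI_API_KEY")
	if key == "" {
		return settings{}, errors.New("OPENAI_API_KEY is required")
	}
	return settings{
		APIKey:      key,
		ConfigPath:  envOr("DEVOUR_CONFIG", "./devour.yaml"),
		DataDir:     envOr("DEVOUR_DATA_DIR", "./devour_data"),
		GitHubToken: os.Getenv("GITHUB_TOKEN"),
		LogLevel:    envOr("DEVOUR_LOG_LEVEL", "info"),
		Port:        envOr("DEVOUR_PORT", "8080"),
	}, nil
}

func main() {
	s, err := loadSettings()
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("log level:", s.LogLevel, "port:", s.Port)
}
```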
## Quality Gates

Built-in validation rules:

- ⚠️ **WARNING** if document count < 10 (the scrape may be incomplete)
- ⚠️ **WARNING** if average chunk size < 100 tokens (content is over-fragmented)
- 🛑 **HARD STOP** if embedding API fails (cannot index without vectors)
- 🛑 **HARD STOP** if storage is not writable (cannot persist)

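The four rules above separate soft warnings from hard stops. A minimal Go sketch of that split is shown below; `indexStats` and `checkGates` are hypothetical names, and the thresholds are the ones stated in the list.

```go
package main

import (
	"errors"
	"fmt"
)

// indexStats is a hypothetical snapshot of the values the gates inspect.
type indexStats struct {
	Documents       int
	AvgChunkTokens  float64
	EmbeddingErr    error
	StorageWritable bool
}

// checkGates returns warnings for the soft rules and an error for a hard stop.
// Hard stops are checked first because nothing can be indexed past them.
func checkGates(s indexStats) ([]string, error) {
	if s.EmbeddingErr != nil {
		return nil, fmt.Errorf("hard stop: embedding API failed: %w", s.EmbeddingErr)
	}
	if !s.StorageWritable {
		return nil, errors.New("hard stop: storage is not writable")
	}
	var warnings []string
	if s.Documents < 10 {
		warnings = append(warnings, "document count < 10: scrape may be incomplete")
	}
	if s.AvgChunkTokens < 100 {
		warnings = append(warnings, "average chunk size < 100 tokens: over-fragmented")
	}
	return warnings, nil
}

func main() {
	warnings, err := checkGates(indexStats{Documents: 4, AvgChunkTokens: 250, StorageWritable: true})
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, w := range warnings {
		fmt.Println("warning:", w)
	}
}
```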
## Output Formats

### Query Results (JSON)
```json
{
  "query": "authentication",
  "results": [
    {
      "id": "chunk-uuid",
      "document_id": "doc-uuid",
      "content": "Relevant text excerpt...",
      "score": 0.89,
      "source": "https://docs.example.com/auth",
      "metadata": {
        "title": "Authentication Guide",
        "section": "Getting Started"
      }
    }
  ],
  "total": 15,
  "took_ms": 45
}
```

### Status Output
```
Devour Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Index Health: ✅ Healthy
Documents: 1,247 indexed
Chunks: 8,392 total
Vector Dimension: 1536
Last Updated: 2025-01-15 10:30:00
Storage Used: 124 MB

Sources (3):
  ✅ example-docs (234 docs, synced 2h ago)
  ✅ api-spec (12 docs, synced 1d ago)
  ⚠️ internal-repo (pending first sync)

Next Scheduled Sync: 2025-01-18 10:30:00
```

## Integration Patterns

### With OpenCode

```
# In OpenCode session
> /devour init
> /devour scrape https://docs.myframework.com
> /devour serve

# In another terminal or session
> /devour query "how to handle authentication"
# Returns relevant context for AI
```

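### With AI Assistant

```go
// AI assistant queries Devour automatically.
// Requires the bytes, encoding/json, and net/http packages.
func getRelevantContext(query string) (string, error) {
	// Marshal the payload so quotes in the query cannot break the JSON.
	body, err := json.Marshal(map[string]string{"query": query})
	if err != nil {
		return "", err
	}
	resp, err := http.Post("http://localhost:8080/query",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var result QueryResponse
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}

	// Inject into prompt
	return formatContextForAI(result.Results), nil
}
```
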
### As MCP Tool

AI calls via MCP:

```json
{
  "method": "tools/call",
  "params": {
    "name": "devour_query",
    "arguments": {
      "query": "API rate limiting",
      "limit": 5
    }
  }
}
```

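On the server side, a `tools/call` request like the one above has to be decoded before the query can run. The Go sketch below shows one way to do that with `encoding/json`; `toolCall`, `queryArgs`, and `parseQueryCall` are hypothetical names, and only the fields visible in the example request are modeled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCall models only the fields shown in the example request.
type toolCall struct {
	Method string `json:"method"`
	Params struct {
		Name      string          `json:"name"`
		Arguments json.RawMessage `json:"arguments"`
	} `json:"params"`
}

// queryArgs holds the arguments of a devour_query call.
type queryArgs struct {
	Query string `json:"query"`
	Limit int    `json:"limit"`
}

// parseQueryCall validates the method and tool name, then decodes the arguments.
func parseQueryCall(raw []byte) (queryArgs, error) {
	var call toolCall
	if err := json.Unmarshal(raw, &call); err != nil {
		return queryArgs{}, err
	}
	if call.Method != "tools/call" || call.Params.Name != "devour_query" {
		return queryArgs{}, fmt.Errorf("unexpected call: method %q, tool %q", call.Method, call.Params.Name)
	}
	var args queryArgs
	if err := json.Unmarshal(call.Params.Arguments, &args); err != nil {
		return queryArgs{}, err
	}
	return args, nil
}

func main() {
	raw := []byte(`{"method":"tools/call","params":{"name":"devour_query","arguments":{"query":"API rate limiting","limit":5}}}`)
	args, err := parseQueryCall(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("query=%q limit=%d\n", args.Query, args.Limit)
}
```

Keeping `arguments` as `json.RawMessage` defers decoding until the tool name is known, so each tool can define its own argument struct.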
## Sub-Skills

This skill can delegate to specialized modules:

1. **devour-scrape** — Scraping operations
2. **devour-index** — Indexing and embeddings
3. **devour-query** — Search and retrieval
4. **devour-sync** — Synchronization tasks
5. **devour-serve** — Server management

## Error Handling

| Error | Cause | Resolution |
|-------|-------|------------|
| `E001` | OpenAI API error | Check API key, rate limits |
| `E002` | Source unreachable | Verify URL, check network |
| `E003` | Storage write failure | Check permissions, disk space |
| `E004` | Invalid source type | Use supported: url, github, openapi, local |
| `E005` | Index corruption | Rebuild index with `devour sync --rebuild` |

## Performance Tuning

### Scraping
```yaml
scraper:
  concurrency: 20    # Parallel workers
  rate_limit: 200ms  # Between requests
  timeout: 60s       # Per request
```

### Indexing
```yaml
embeddings:
  batch_size: 200   # API batch size
vector_db:
  index_type: hnsw  # Fast similarity search
  m: 16             # HNSW connectivity
```

### Querying
```yaml
query:
  ef_search: 64  # HNSW search depth
  limit: 10      # Default result count
```

## Troubleshooting

### Common Issues

**Slow queries:**
- Lower `ef_search` to speed up searches (a higher value improves recall but slows each query)
- Use smaller `limit` values
- Consider index type (HNSW vs Flat)

**API rate limits:**
- Reduce `batch_size`
- Add delays between batches
- Use caching

**Memory usage:**
- Reduce `concurrency`
- Process in smaller batches
- Use disk-backed storage

---

*Devour: Feed your AI the context it craves.*