mirror of https://github.com/Dvorinka/Devour.git synced 2026-06-03 20:13:03 +00:00

Tomas Dvorak 55885a0e8f first commit

2026-02-22 10:42:17 +01:00

Field	Value
name	devour
description	Context ingestion and management system for AI. Scrapes, indexes, and serves documentation from GitHub repos, OpenAPI specs, web docs, and local files. Provides semantic search via vector embeddings to feed relevant context to AI models. Runs in local mode (stdio) or remote mode (HTTP MCP server). Supports automatic updates via configurable scheduler. Integrates with OpenAI for embeddings and LLM context injection. Triggers on: "devour", "scrape docs", "index documentation", "context for AI", "vector search docs", "semantic search", "ingest documentation", "documentation to AI".
allowed-tools	Read, Write, Edit, Glob, Grep, Bash, WebFetch

Devour — Context Ingestion Skill

Comprehensive documentation scraping, indexing, and retrieval system for feeding structured context to AI models. Orchestrates 5 specialized modules and supports both local (stdio) and remote (HTTP) MCP modes.

Quick Reference

Command	What it does
`/devour init [path]`	Initialize Devour for a project
`/devour get <language> <keyword>`	Quick docs fetch for popular languages (new)
`/devour scrape <source>`	Scrape docs from URL, GitHub, or local path
`/devour serve`	Start MCP server (local or remote)
`/devour query <text>`	Search indexed documentation
`/devour status`	Show index stats and health
`/devour sync`	Fetch updates from all configured sources
`/devour push <path>`	Push docs to remote MCP server
`/devour sources`	Manage documentation sources

Orchestration Logic

When the user invokes /devour get <language> <keyword>:

Map language to base URL:
- go http → https://pkg.go.dev/net/http
- python asyncio → https://docs.python.org/3/library/asyncio.html
- react hooks → https://react.dev/reference/react/hooks
- docker compose → https://docs.docker.com/compose
Auto-detect source type based on language:
- Go → godocs parser
- Python → pythondocs parser
- React → reactdocs parser
- Docker → dockerdocs parser
Execute enhanced scrape with pre-configured parameters:
- Automatic language-specific parsing
- Enhanced markdown formatting (if requested)
- Metadata extraction and enrichment
Return structured documentation:
- Rich markdown with TOC (if --format markdown)
- JSON with full metadata (default)
- Ready for AI context injection

When the user invokes /devour scrape:

Detect source type from URL/path:
- GitHub: github.com/org/repo → Clone, extract docs
- OpenAPI: Ends in .json/.yaml with OpenAPI spec → Parse endpoints
- Web: HTTP/HTTPS URL → Crawl with Colly
- Local: File path → Scan directory
Scrape with appropriate parser:
- Extract content (markdown, HTML, code structure)
- Clean and normalize text
- Extract metadata (title, headings, code blocks)
Generate embeddings:
- Chunk content appropriately (512-1024 tokens)
- Call OpenAI embedding API
- Store in vector database
Update metadata:
- Track source, timestamp, content hash
- Enable future update detection

When the user invokes /devour query:

Generate embedding for query text
Perform vector similarity search
Return top-K results with metadata
Optionally inject into AI context
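
The similarity search and top-K steps above amount to ranking stored vectors against the query vector. Below is a minimal brute-force sketch, assuming cosine similarity (the `similarity_metric` shown in the configuration) and an in-memory map of vectors. The names `scored`, `cosine`, and `topK` are illustrative and not taken from Devour's code.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// scored pairs a chunk ID with its similarity to the query.
type scored struct {
	ID    string
	Score float64
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 if either vector has zero magnitude.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scores every stored vector against the query and keeps the k best.
func topK(query []float32, index map[string][]float32, k int) []scored {
	results := make([]scored, 0, len(index))
	for id, vec := range index {
		results = append(results, scored{ID: id, Score: cosine(query, vec)})
	}
	sort.Slice(results, func(i, j int) bool { return results[i].Score > results[j].Score })
	if k >= 0 && k < len(results) {
		results = results[:k]
	}
	return results
}

func main() {
	index := map[string][]float32{
		"auth":    {1, 0, 0},
		"billing": {0, 1, 0},
		"login":   {0.9, 0.1, 0},
	}
	for _, r := range topK([]float32{1, 0, 0}, index, 2) {
		fmt.Printf("%s %.3f\n", r.ID, r.Score)
	}
}
```

A linear scan like this is only practical for small indexes; the HNSW settings under Performance Tuning exist to avoid it at scale.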

Enhanced Features

🎯 Language-Aware Documentation Access

The /devour get command resolves a language alias to its official documentation site and fetches the page matching the keyword:

Supported Languages & Mappings:

go, golang → Go packages (pkg.go.dev)
rust → Rust crates (docs.rs)
python, py → Python modules (docs.python.org)
java → Java packages (docs.oracle.com)
spring → Spring Boot (docs.spring.io)
typescript, ts → TypeScript (typescriptlang.org)
react → React (react.dev)
vue → Vue.js (vuejs.org)
nuxt → Nuxt (nuxt.com)
docker → Docker (docs.docker.com)
cloudflare, cf → Cloudflare (developers.cloudflare.com)
astro → Astro (docs.astro.build)
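
The alias list above can be read as a lookup table. Here is a minimal sketch of that idea, with hosts copied from the list. `docHosts` and `resolveDocHost` are hypothetical names, and case-insensitive matching is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// docHosts maps each language alias to its documentation host
// (hypothetical table; hosts copied from the list above).
var docHosts = map[string]string{
	"go":         "pkg.go.dev",
	"golang":     "pkg.go.dev",
	"rust":       "docs.rs",
	"python":     "docs.python.org",
	"py":         "docs.python.org",
	"java":       "docs.oracle.com",
	"spring":     "docs.spring.io",
	"typescript": "typescriptlang.org",
	"ts":         "typescriptlang.org",
	"react":      "react.dev",
	"vue":        "vuejs.org",
	"nuxt":       "nuxt.com",
	"docker":     "docs.docker.com",
	"cloudflare": "developers.cloudflare.com",
	"cf":         "developers.cloudflare.com",
	"astro":      "docs.astro.build",
}

// resolveDocHost looks up an alias, ignoring case and surrounding spaces.
func resolveDocHost(alias string) (string, bool) {
	host, ok := docHosts[strings.ToLower(strings.TrimSpace(alias))]
	return host, ok
}

func main() {
	for _, alias := range []string{"Golang", "cf", "cobol"} {
		host, ok := resolveDocHost(alias)
		fmt.Println(alias, host, ok)
	}
}
```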

Usage Examples:

/devour get go http              # Go HTTP package docs
/devour get python asyncio      # Python asyncio module
/devour get react hooks         # React Hooks reference
/devour get docker compose      # Docker Compose guide
/devour get rust tokio          # Rust Tokio crate docs

📝 Rich Markdown Enhancement

When using --format markdown, Devour automatically enhances documentation:

Auto-Generated Structure:

📋 Document metadata tables (source, type, timestamp)
📑 Table of contents from headings
🎨 Visual indicators for important content
🔗 Automatic URL-to-link conversion
📚 Proper heading hierarchy

Content Enhancement:

Example: → 💡 Example:
Note: → 📝 Note:
Warning: → ⚠️ Warning:
Important: → ❗ Important:
TODO: → 📋 TODO:
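
The label rewrites above could be applied line by line. This is a sketch under the assumption that a label is only decorated when it starts a line; the source does not say how Devour matches labels. `markers` and `enhance` are illustrative names.

```go
package main

import (
	"fmt"
	"strings"
)

// markers lists the label-to-icon rewrites from the list above.
var markers = []struct{ label, icon string }{
	{"Example:", "💡"},
	{"Note:", "📝"},
	{"Warning:", "⚠️"},
	{"Important:", "❗"},
	{"TODO:", "📋"},
}

// enhance prefixes the matching icon to every line that starts with a known
// label. Lines that already carry an icon no longer start with the label,
// so running enhance twice does not add a second icon.
func enhance(doc string) string {
	lines := strings.Split(doc, "\n")
	for i, line := range lines {
		for _, m := range markers {
			if strings.HasPrefix(line, m.label) {
				lines[i] = m.icon + " " + line
				break
			}
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(enhance("Note: tokens expire hourly.\nPlain text stays as it is."))
}
```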

Example Output Structure:

# Package Name

## 📋 Document Information
| Property | Value |
|----------|-------|
| **Source** | https://pkg.go.dev/net/http |
| **Type** | `godocs` |
| **Scraped** | 2026-02-19 12:30:00 |

## 📑 Table of Contents
- [Functions](#functions)
- [Types](#types)
- [Examples](#examples)

## 📚 Content
# Functions

💡 **Example:** Usage example here...

Source Type Detection

Pattern	Type	Parser
`github.com/<org>/<repo>`	GitHub	Git clone + markdown parser
`*.json`, `*.yaml` + OpenAPI keys	OpenAPI	Swagger parser
`http://`, `https://`	Web	Colly crawler
`./path`, `/path`	Local	Directory scanner
`*.md`, `*.rst`, `*.txt`	File	Direct parse
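
The table above could be approximated by a single classification function. This is a rough sketch, not Devour's detector: `detectSourceType` and its return labels are hypothetical, and OpenAPI detection here looks only at the file extension, whereas the table also requires OpenAPI keys inside the file.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// detectSourceType classifies a source string by the patterns in the table.
// The order of the cases matters: GitHub and spec files are checked before
// the generic web and local fallbacks.
func detectSourceType(src string) string {
	lower := strings.ToLower(src)
	trimmed := strings.TrimPrefix(strings.TrimPrefix(lower, "https://"), "http://")
	isURL := trimmed != lower
	ext := path.Ext(lower)
	switch {
	case strings.HasPrefix(trimmed, "github.com/"):
		return "github"
	case ext == ".json" || ext == ".yaml" || ext == ".yml":
		return "openapi-candidate" // still needs a content check for OpenAPI keys
	case isURL:
		return "web"
	case ext == ".md" || ext == ".rst" || ext == ".txt":
		return "file"
	default:
		return "local"
	}
}

func main() {
	for _, src := range []string{
		"https://github.com/org/repo",
		"https://api.example.com/openapi.json",
		"https://docs.example.com",
		"README.md",
		"./docs",
	} {
		fmt.Println(src, "=>", detectSourceType(src))
	}
}
```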

Module Reference

1. Scraper Module (`internal/scraper`)

Responsible for fetching and parsing content from various sources.

Supported sources:

GitHub repositories (clone, extract docs/, README.md)
OpenAPI/Swagger specs (parse endpoints, schemas)
HTML documentation sites (crawl, extract content)
Markdown files (parse structure, code blocks)
JSON/YAML configuration files

Output format:

{
  "id": "doc-uuid",
  "source": "https://...",
  "type": "markdown",
  "title": "Document Title",
  "content": "Extracted text...",
  "metadata": {
    "headings": ["H1", "H2"],
    "code_blocks": ["go", "bash"],
    "links": ["url1", "url2"]
  },
  "timestamp": "2025-01-15T10:00:00Z"
}

2. Indexer Module (`internal/indexer`)

Converts documents into vector embeddings for semantic search.

Features:

OpenAI embedding integration (text-embedding-3-small/large)
Intelligent chunking (512-1024 tokens, respect boundaries)
Metadata preservation
Batch processing for efficiency

Chunking strategy:

type Chunk struct {
    ID       string
    DocID    string
    Content  string
    Vector   []float32
    Metadata map[string]any
    Position int // Position in original doc
}
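
One way to "respect boundaries" while chunking is to pack whole paragraphs until a size budget is reached. The sketch below does that, using word count as a rough stand-in for tokens (an assumption, since real token counts depend on the embedding model's tokenizer). `chunkDocument` is an illustrative name, and the `Chunk` type here is a trimmed copy of the struct above.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a reduced version of the struct above, without vector or metadata.
type Chunk struct {
	DocID    string
	Content  string
	Position int // index of the chunk within its document
}

// chunkDocument packs whole paragraphs into chunks of at most maxWords words.
// A single paragraph longer than maxWords becomes its own oversized chunk
// rather than being split mid-paragraph.
func chunkDocument(docID, text string, maxWords int) []Chunk {
	var chunks []Chunk
	var current []string
	words := 0
	flush := func() {
		if len(current) == 0 {
			return
		}
		chunks = append(chunks, Chunk{
			DocID:    docID,
			Content:  strings.Join(current, "\n\n"),
			Position: len(chunks),
		})
		current, words = nil, 0
	}
	for _, para := range strings.Split(text, "\n\n") {
		para = strings.TrimSpace(para)
		if para == "" {
			continue
		}
		n := len(strings.Fields(para))
		if words+n > maxWords {
			flush() // the next paragraph would exceed the budget
		}
		current = append(current, para)
		words += n
	}
	flush()
	return chunks
}

func main() {
	text := "a b c\n\nd e\n\nf g h i"
	for _, c := range chunkDocument("doc-1", text, 5) {
		fmt.Printf("%d: %q\n", c.Position, c.Content)
	}
}
```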

3. Server Module (`internal/server`)

Exposes context via MCP protocol.

Local mode (stdio):

STDIN → JSON-RPC → Handler → Response → STDOUT

Remote mode (HTTP):

HTTP Request → Handler → Response → HTTP Response
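
The local-mode pipeline above can be sketched as a loop over newline-delimited JSON-RPC 2.0 messages. This is a minimal illustration rather than Devour's server: the type and function names are hypothetical, only the `ping` method is handled, and the demo reads from a string where a real stdio server would read `os.Stdin`.

```go
package main

import (
	"bufio"
	"encoding/json"
	"io"
	"os"
	"strings"
)

type request struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Method  string          `json:"method"`
}

type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type response struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Result  any             `json:"result,omitempty"`
	Error   *rpcError       `json:"error,omitempty"`
}

// handleLine turns one request line into a response. The bool is false for
// notifications (requests without an id), which get no reply.
func handleLine(line []byte) (response, bool) {
	var req request
	if err := json.Unmarshal(line, &req); err != nil {
		return response{JSONRPC: "2.0", ID: json.RawMessage("null"),
			Error: &rpcError{Code: -32700, Message: "parse error"}}, true
	}
	if len(req.ID) == 0 {
		return response{}, false
	}
	resp := response{JSONRPC: "2.0", ID: req.ID}
	switch req.Method {
	case "ping": // answered with an empty result object
		resp.Result = map[string]any{}
	default:
		resp.Error = &rpcError{Code: -32601, Message: "method not found"}
	}
	return resp, true
}

// serve reads one request per line and writes one response per line.
func serve(in io.Reader, out io.Writer) error {
	enc := json.NewEncoder(out) // Encode appends a newline after each value
	sc := bufio.NewScanner(in)  // default limit is 64 KiB per line
	for sc.Scan() {
		if resp, ok := handleLine(sc.Bytes()); ok {
			if err := enc.Encode(resp); err != nil {
				return err
			}
		}
	}
	return sc.Err()
}

func main() {
	// Demo input; pass os.Stdin instead for a real stdio server.
	in := strings.NewReader(`{"jsonrpc":"2.0","id":1,"method":"ping"}` + "\n" +
		`{"jsonrpc":"2.0","id":2,"method":"unknown"}` + "\n")
	if err := serve(in, os.Stdout); err != nil {
		os.Exit(1)
	}
}
```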

MCP Tools exposed:

devour_query - Semantic search
devour_add - Add documents
devour_status - Get stats
devour_sync - Trigger update

MCP Resources:

devour://documents - All indexed docs
devour://sources - Configured sources
devour://stats - Index statistics

4. Scheduler Module (`internal/scheduler`)

Manages automatic updates from configured sources.

Default schedule: Every 72 hours (3 days)

Change detection methods:

Content hash comparison (default)
Last-Modified timestamp
ETag header
Git commit hash (for repos)
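
Content hash comparison, the default method above, can be sketched in a few lines. SHA-256 is an assumption here, since the source does not name the hash function, and `contentHash` and `changed` are illustrative names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentHash returns the hex-encoded SHA-256 digest of a document body.
func contentHash(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

// changed reports whether body differs from the version whose hash was
// stored at the last sync. An empty stored hash means "never synced".
func changed(storedHash string, body []byte) bool {
	return storedHash == "" || storedHash != contentHash(body)
}

func main() {
	stored := contentHash([]byte("# Guide\nversion 1"))
	fmt.Println(changed(stored, []byte("# Guide\nversion 1"))) // same bytes
	fmt.Println(changed(stored, []byte("# Guide\nversion 2"))) // edited
}
```

Hashing requires downloading the full body on every check; the Last-Modified and ETag methods can skip the download when the server supports conditional requests.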

Configuration:

scheduler:
  enabled: true
  interval: 72h
  check_method: hash
  retry_count: 3
  retry_delay: 1h

5. AI Module (`internal/ai`)

Handles AI integrations for embeddings and context injection.

Supported providers:

OpenAI (primary)
Ollama (local, planned)
Custom endpoints

Context injection format:

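type Context struct {
    Query        string
    Results      []SearchResult
    SystemPrompt string
}

// ToPrompt formats the context for LLM consumption.
// Sketch only: assumes SearchResult has Source and Content string fields.
func (c *Context) ToPrompt() string {
    prompt := c.SystemPrompt + "\n\n"
    for _, r := range c.Results {
        prompt += "Source: " + r.Source + "\n" + r.Content + "\n\n"
    }
    return prompt + "Query: " + c.Query
}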

Configuration Schema

devour.yaml

# Core configuration
version: 1

# Storage paths
storage:
  docs_dir: ./devour_data/docs
  index_dir: ./devour_data/index
  metadata_dir: ./devour_data/metadata

# Embedding configuration
embeddings:
  provider: openai
  model: text-embedding-3-small
  dimensions: 1536
  api_key: ${OPENAI_API_KEY}
  batch_size: 100

# Vector database
vector_db:
  type: chromem          # chromem, weaviate, faiss
  persist: true
  similarity_metric: cosine

# Scraping configuration
scraper:
  user_agent: "Devour/1.0 (+https://github.com/yourorg/devour)"
  timeout: 30s
  retry_count: 3
  retry_delay: 5s
  concurrency: 10
  rate_limit: 500ms
  max_depth: 3
  cache_dir: ./devour_data/cache

# Scheduler configuration
scheduler:
  enabled: true
  interval: 72h
  check_method: hash
  on_startup: false

# Server configuration
server:
  mode: local            # local, remote
  transport: stdio       # stdio, http
  host: localhost
  port: 8080
  cors:
    enabled: false
    origins: []

# Source definitions
sources:
  - name: example-docs
    type: url
    url: https://docs.example.com
    include:
      - "**/*.md"
      - "**/*.html"
    exclude:
      - "**/api/**"
      - "**/legacy/**"
    schedule: 24h        # Override global schedule

  - name: api-spec
    type: openapi
    url: https://api.example.com/openapi.json
    schedule: 168h       # Weekly

  - name: internal-repo
    type: github
    repo: myorg/myrepo
    branch: main
    paths:
      - docs/
      - README.md
    auth_token: ${GITHUB_TOKEN}
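
Two conventions in the schema above lend themselves to a short illustration: `${VAR}` placeholders and per-source `schedule` overrides. The sketch below shows one plausible way to interpret them with the Go standard library. How Devour actually resolves these values is not stated in this document, and `expandEnv` and `effectiveInterval` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// expandEnv replaces ${VAR} references using the supplied environment.
// os.Expand also accepts the bare $VAR form.
func expandEnv(raw string, env map[string]string) string {
	return os.Expand(raw, func(name string) string { return env[name] })
}

// effectiveInterval returns the per-source schedule when one is set,
// otherwise the global scheduler interval.
func effectiveInterval(global, override string) (time.Duration, error) {
	if override != "" {
		return time.ParseDuration(override)
	}
	return time.ParseDuration(global)
}

func main() {
	env := map[string]string{"OPENAI_API_KEY": "sk-demo"} // demo value only
	fmt.Println(expandEnv("${OPENAI_API_KEY}", env))

	for _, override := range []string{"24h", "168h", ""} {
		d, err := effectiveInterval("72h", override)
		if err != nil {
			fmt.Println("invalid duration:", err)
			continue
		}
		fmt.Println(d)
	}
}
```

`time.ParseDuration` has no day unit, which fits the schema's habit of writing a week as `168h`.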

Environment Variables

Variable	Description	Default
`OPENAI_API_KEY`	OpenAI API key	Required
`DEVOUR_CONFIG`	Config file path	`./devour.yaml`
`DEVOUR_DATA_DIR`	Data directory	`./devour_data`
`GITHUB_TOKEN`	GitHub auth token	Optional
`DEVOUR_LOG_LEVEL`	Log level (debug, info, warn, error)	`info`
`DEVOUR_PORT`	Server port	`8080`
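
The defaults in the table above could be applied as follows. This is a sketch, not Devour's loader: the `settings` struct and helper names are hypothetical, and treating an empty variable as unset is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// settings holds the values from the environment variable table.
type settings struct {
	OpenAIKey, ConfigPath, DataDir, GitHubToken, LogLevel, Port string
}

// lookupFunc has the same shape as os.LookupEnv so it can be swapped in tests.
type lookupFunc func(key string) (string, bool)

// valueOr returns the variable's value, or fallback when unset or empty.
func valueOr(lookup lookupFunc, key, fallback string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return fallback
}

// loadSettings applies the documented defaults and enforces the one
// required variable.
func loadSettings(lookup lookupFunc) (settings, error) {
	s := settings{
		OpenAIKey:   valueOr(lookup, "OPENAI_API_KEY", ""),
		ConfigPath:  valueOr(lookup, "DEVOUR_CONFIG", "./devour.yaml"),
		DataDir:     valueOr(lookup, "DEVOUR_DATA_DIR", "./devour_data"),
		GitHubToken: valueOr(lookup, "GITHUB_TOKEN", ""),
		LogLevel:    valueOr(lookup, "DEVOUR_LOG_LEVEL", "info"),
		Port:        valueOr(lookup, "DEVOUR_PORT", "8080"),
	}
	if s.OpenAIKey == "" {
		return s, errors.New("OPENAI_API_KEY is required")
	}
	return s, nil
}

func main() {
	s, err := loadSettings(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("config=%s data=%s log=%s port=%s\n", s.ConfigPath, s.DataDir, s.LogLevel, s.Port)
}
```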

Quality Gates

Built-in validation rules:

⚠️ WARNING if document count < 10 (may be incomplete scrape)
⚠️ WARNING if average chunk size < 100 tokens (over-fragmented)
🛑 HARD STOP if embedding API fails (cannot index without vectors)
🛑 HARD STOP if storage is not writable (cannot persist)
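
The four rules above separate warnings, which let the run continue, from hard stops, which abort it. A minimal sketch of that split, using a hypothetical `indexStats` struct and `checkGates` function:

```go
package main

import (
	"errors"
	"fmt"
)

// indexStats carries the inputs the quality gates look at.
type indexStats struct {
	Documents       int
	AvgChunkTokens  float64
	EmbeddingOK     bool
	StorageWritable bool
}

// checkGates returns an error for hard stops and a list of warnings otherwise.
func checkGates(s indexStats) ([]string, error) {
	switch {
	case !s.EmbeddingOK:
		return nil, errors.New("hard stop: embedding API failed, cannot index without vectors")
	case !s.StorageWritable:
		return nil, errors.New("hard stop: storage is not writable, cannot persist")
	}
	var warnings []string
	if s.Documents < 10 {
		warnings = append(warnings,
			fmt.Sprintf("only %d documents indexed, scrape may be incomplete", s.Documents))
	}
	if s.AvgChunkTokens < 100 {
		warnings = append(warnings,
			fmt.Sprintf("average chunk size %.0f tokens, index may be over-fragmented", s.AvgChunkTokens))
	}
	return warnings, nil
}

func main() {
	warnings, err := checkGates(indexStats{
		Documents: 3, AvgChunkTokens: 250, EmbeddingOK: true, StorageWritable: true,
	})
	fmt.Println(warnings, err)
}
```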

Output Formats

Query Results (JSON)

{
  "query": "authentication",
  "results": [
    {
      "id": "chunk-uuid",
      "document_id": "doc-uuid",
      "content": "Relevant text excerpt...",
      "score": 0.89,
      "source": "https://docs.example.com/auth",
      "metadata": {
        "title": "Authentication Guide",
        "section": "Getting Started"
      }
    }
  ],
  "total": 15,
  "took_ms": 45
}
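
A client could decode the JSON above into Go structs along these lines. The field names follow the sample response; the type names `QueryResponse` and `QueryResult` and the helper `parseQueryResponse` are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// QueryResult mirrors one entry of the "results" array.
type QueryResult struct {
	ID         string         `json:"id"`
	DocumentID string         `json:"document_id"`
	Content    string         `json:"content"`
	Score      float64        `json:"score"`
	Source     string         `json:"source"`
	Metadata   map[string]any `json:"metadata"`
}

// QueryResponse mirrors the top-level query response.
type QueryResponse struct {
	Query   string        `json:"query"`
	Results []QueryResult `json:"results"`
	Total   int           `json:"total"`
	TookMS  int           `json:"took_ms"`
}

// parseQueryResponse decodes a raw JSON response body.
func parseQueryResponse(data []byte) (QueryResponse, error) {
	var r QueryResponse
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	raw := []byte(`{"query":"authentication","results":[{"id":"chunk-uuid",
		"document_id":"doc-uuid","content":"Relevant text excerpt...","score":0.89,
		"source":"https://docs.example.com/auth",
		"metadata":{"title":"Authentication Guide","section":"Getting Started"}}],
		"total":15,"took_ms":45}`)
	resp, err := parseQueryResponse(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, r := range resp.Results {
		fmt.Printf("%.2f %s (%v)\n", r.Score, r.Source, r.Metadata["title"])
	}
}
```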

Status Output

Devour Status
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Index Health:      ✅ Healthy
Documents:         1,247 indexed
Chunks:            8,392 total
Vector Dimension:  1536
Last Updated:      2025-01-15 10:30:00
Storage Used:      124 MB

Sources (3):
  ✅ example-docs     (234 docs, synced 2h ago)
  ✅ api-spec         (12 docs, synced 1d ago)
  ⚠️  internal-repo   (pending first sync)

Next Scheduled Sync: 2025-01-18 10:30:00

Integration Patterns

With OpenCode

# In OpenCode session
> /devour init
> /devour scrape https://docs.myframework.com
> /devour serve

# In another terminal or session
> /devour query "how to handle authentication"
# Returns relevant context for AI

With AI Assistant

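// AI assistant queries Devour automatically
// Requires imports: bytes, encoding/json, net/http
func getRelevantContext(query string) (string, error) {
    // Marshal the query so quotes in user input cannot break the JSON body
    body, err := json.Marshal(map[string]string{"query": query})
    if err != nil {
        return "", err
    }
    resp, err := http.Post("http://localhost:8080/query",
        "application/json", bytes.NewReader(body))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var result QueryResponse
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return "", err
    }

    // Inject into prompt
    return formatContextForAI(result.Results), nil
}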

As MCP Tool

// AI calls via MCP
{
  "method": "tools/call",
  "params": {
    "name": "devour_query",
    "arguments": {
      "query": "API rate limiting",
      "limit": 5
    }
  }
}
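
The snippet above shows only `method` and `params`. On the wire, a JSON-RPC 2.0 request also carries `"jsonrpc": "2.0"` and an `id` so the response can be matched to it. Below is a small sketch of building the full envelope; `toolCall`, `callParams`, and `newQueryCall` are illustrative names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// callParams is the params object of a tools/call request.
type callParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// toolCall is a complete JSON-RPC 2.0 request envelope.
type toolCall struct {
	JSONRPC string     `json:"jsonrpc"`
	ID      int        `json:"id"`
	Method  string     `json:"method"`
	Params  callParams `json:"params"`
}

// newQueryCall builds a devour_query tool call with the given request id.
func newQueryCall(id int, query string, limit int) ([]byte, error) {
	return json.Marshal(toolCall{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params: callParams{
			Name:      "devour_query",
			Arguments: map[string]any{"query": query, "limit": limit},
		},
	})
}

func main() {
	msg, err := newQueryCall(1, "API rate limiting", 5)
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(string(msg))
}
```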

Sub-Skills

This skill can delegate to specialized modules:

devour-scrape — Scraping operations
devour-index — Indexing and embeddings
devour-query — Search and retrieval
devour-sync — Synchronization tasks
devour-serve — Server management

Error Handling

Error	Cause	Resolution
`E001`	OpenAI API error	Check API key, rate limits
`E002`	Source unreachable	Verify URL, check network
`E003`	Storage write failure	Check permissions, disk space
`E004`	Invalid source type	Use supported: url, github, openapi, local
`E005`	Index corruption	Rebuild index with `devour sync --rebuild`

Performance Tuning

Scraping

scraper:
  concurrency: 20        # Parallel workers
  rate_limit: 200ms      # Between requests
  timeout: 60s           # Per request

Indexing

embeddings:
  batch_size: 200        # API batch size
vector_db:
  index_type: hnsw       # Fast similarity search
  m: 16                  # HNSW connectivity

Querying

query:
  ef_search: 64          # HNSW search depth
  limit: 10              # Default result count

Troubleshooting

Common Issues

Slow queries:

Lower ef_search for faster queries (raising it improves recall but slows each search)
Use smaller limit values
Consider index type (HNSW vs Flat)

API rate limits:

Reduce batch_size
Add delays between batches
Use caching

Memory usage:

Reduce concurrency
Process in smaller batches
Use disk-backed storage

Devour: Feed your AI the context it craves.
