Back to Proxysql

Vector Features - Embedding-Based Similarity for ProxySQL

doc/VECTOR_FEATURES/README.md

4.0.811.7 KB
Original Source

Vector Features - Embedding-Based Similarity for ProxySQL

Overview

Vector Features provide semantic similarity capabilities for ProxySQL using vector embeddings and sqlite-vec for efficient similarity search. This enables:

  • NL2SQL Vector Cache: Cache natural language queries by semantic meaning, not just exact text
  • Anomaly Detection: Detect SQL threats using embedding similarity against known attack patterns

Features

FeatureDescriptionBenefit
Semantic CachingCache queries by meaning, not exact textHigher cache hit rates for similar queries
Threat DetectionDetect attacks using embedding similarityCatch variations of known attack patterns
Vector Storagesqlite-vec for efficient KNN searchFast similarity queries on embedded vectors
GenAI IntegrationUses existing GenAI module for embeddingsNo external embedding service required
Configurable ThresholdsAdjust similarity sensitivityBalance between false positives and negatives

Architecture

Query Input
    |
    v
+-----------------+
| GenAI Module    | -> Generate 1536-dim embedding
| (llama-server)  |
+-----------------+
    |
    v
+-----------------+
| Vector DB       | -> Store embedding in SQLite
| (sqlite-vec)    | -> Similarity search via KNN
+-----------------+
    |
    v
+-----------------+
| Result          | -> Similar items within threshold
+-----------------+

Quick Start

1. Enable AI Features

sql
-- Via admin interface
SET ai_features_enabled='true';
LOAD MYSQL VARIABLES TO RUNTIME;

2. Configure Vector Database

sql
-- Set vector DB path (default: /var/lib/proxysql/ai_features.db)
SET ai_vector_db_path='/var/lib/proxysql/ai_features.db';

-- Set vector dimension (default: 1536 for text-embedding-3-small)
SET ai_vector_dimension='1536';

3. Configure NL2SQL Vector Cache

sql
-- Enable NL2SQL
SET ai_nl2sql_enabled='true';

-- Set cache similarity threshold (0-100, default: 85)
SET ai_nl2sql_cache_similarity_threshold='85';

4. Configure Anomaly Detection

sql
-- Enable anomaly detection
SET ai_anomaly_detection_enabled='true';

-- Set similarity threshold (0-100, default: 85)
SET ai_anomaly_similarity_threshold='85';

-- Set risk threshold (0-100, default: 70)
SET ai_anomaly_risk_threshold='70';

NL2SQL Vector Cache

How It Works

  1. User submits NL2SQL query: NL2SQL: Show all customers
  2. Generate embedding: Query text → 1536-dimensional vector
  3. Search cache: Find semantically similar cached queries
  4. Return cached SQL if similarity > threshold
  5. Otherwise call LLM and store result in cache

Configuration Variables

VariableDefaultDescription
ai_nl2sql_enabledtrueEnable/disable NL2SQL
ai_nl2sql_cache_similarity_threshold85Semantic similarity threshold (0-100)
ai_nl2sql_timeout_ms30000LLM request timeout
ai_vector_db_path/var/lib/proxysql/ai_features.dbVector database file path
ai_vector_dimension1536Embedding dimension

Example: Semantic Cache Hit

sql
-- First query - calls LLM
NL2SQL: Show me all customers from USA;

-- Similar query - returns cached result (no LLM call!)
NL2SQL: Display customers in the United States;

-- Another similar query - cached
NL2SQL: List USA customers;

All three queries are semantically similar and will hit the cache after the first one.

Cache Statistics

sql
-- View cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';

Anomaly Detection

How It Works

  1. Query intercepted during session processing
  2. Generate embedding of normalized query
  3. KNN search against threat pattern embeddings
  4. Calculate risk score: (severity / 10) * (1 - distance / 2)
  5. Block or flag if risk > threshold

Configuration Variables

VariableDefaultDescription
ai_anomaly_detection_enabledtrueEnable/disable anomaly detection
ai_anomaly_similarity_threshold85Similarity threshold for threat matching (0-100)
ai_anomaly_risk_threshold70Risk score threshold for blocking (0-100)
ai_anomaly_rate_limit100Max anomalies per minute before rate limiting
ai_anomaly_auto_blocktrueAutomatically block high-risk queries
ai_anomaly_log_onlyfalseIf true, log but don't block

Threat Pattern Management

Add a Threat Pattern

Via C++ API:

cpp
anomaly_detector->add_threat_pattern(
    "OR 1=1 Tautology",
    "SELECT * FROM users WHERE username='admin' OR 1=1--'",
    "sql_injection",
    9  // severity 1-10
);

Via MCP (future):

json
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "ai_add_threat_pattern",
    "arguments": {
      "pattern_name": "OR 1=1 Tautology",
      "query_example": "SELECT * FROM users WHERE username='admin' OR 1=1--'",
      "pattern_type": "sql_injection",
      "severity": 9
    }
  }
}

List Threat Patterns

cpp
std::string patterns = anomaly_detector->list_threat_patterns();
// Returns JSON array of all patterns

Remove a Threat Pattern

cpp
bool success = anomaly_detector->remove_threat_pattern(pattern_id);

Built-in Threat Patterns

See scripts/add_threat_patterns.sh for 10 example threat patterns:

PatternTypeSeverity
OR 1=1 Tautologysql_injection9
UNION SELECTsql_injection8
Comment Injectionsql_injection7
Sleep-based DoSdos6
Benchmark-based DoSdos6
INTO OUTFILEdata_exfiltration9
DROP TABLEprivilege_escalation10
Schema Probingreconnaissance3
CONCAT Injectionsql_injection8
Hex Encodingsql_injection7

Detection Example

sql
-- Known threat pattern in database:
-- "SELECT * FROM users WHERE id=1 OR 1=1--"

-- Attacker tries variation:
SELECT * FROM users WHERE id=5 OR 2=2--';

-- Embedding similarity detects this as similar to OR 1=1 pattern
-- Risk score: (9/10) * (1 - 0.15/2) = 0.86 (86% risk)
-- Since 86 > 70 (risk_threshold), query is BLOCKED

Anomaly Statistics

sql
-- View anomaly statistics
SHOW STATUS LIKE 'ai_anomaly_%';
-- ai_detected_anomalies
-- ai_blocked_queries
-- ai_flagged_queries

Via API:

cpp
std::string stats = anomaly_detector->get_statistics();
// Returns JSON with detailed statistics

Vector Database

Schema

The vector database (ai_features.db) contains:

Main Tables

nl2sql_cache

sql
CREATE TABLE nl2sql_cache (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    natural_language TEXT NOT NULL,
    generated_sql TEXT NOT NULL,
    schema_context TEXT,
    embedding BLOB,
    hit_count INTEGER DEFAULT 0,
    last_hit INTEGER,
    created_at INTEGER
);

anomaly_patterns

sql
CREATE TABLE anomaly_patterns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pattern_name TEXT,
    pattern_type TEXT,  -- 'sql_injection', 'dos', 'privilege_escalation'
    query_example TEXT,
    embedding BLOB,
    severity INTEGER,  -- 1-10
    created_at INTEGER
);

query_history

sql
CREATE TABLE query_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query_text TEXT NOT NULL,
    generated_sql TEXT,
    embedding BLOB,
    execution_time_ms INTEGER,
    success BOOLEAN,
    timestamp INTEGER
);

Virtual Vector Tables (sqlite-vec)

sql
CREATE VIRTUAL TABLE nl2sql_cache_vec USING vec0(
    embedding float(1536)
);

CREATE VIRTUAL TABLE anomaly_patterns_vec USING vec0(
    embedding float(1536)
);

CREATE VIRTUAL TABLE query_history_vec USING vec0(
    embedding float(1536)
);

Similarity Search Algorithm

Cosine Distance is used for similarity measurement:

distance = 2 * (1 - cosine_similarity)

where:
cosine_similarity = (A . B) / (|A| * |B|)

Distance range: 0 (identical) to 2 (opposite)
Similarity = (2 - distance) / 2 * 100

Threshold Conversion:

similarity_threshold (0-100) → distance_threshold (0-2)
distance_threshold = 2.0 - (similarity_threshold / 50.0)

Example:
  similarity = 85 → distance = 2.0 - (85/50.0) = 0.3

KNN Search Example

sql
-- Find similar cached queries
SELECT c.natural_language, c.generated_sql,
       vec_distance_cosine(v.embedding, '[0.1, 0.2, ...]') as distance
FROM nl2sql_cache c
JOIN nl2sql_cache_vec v ON c.id = v.rowid
WHERE v.embedding MATCH '[0.1, 0.2, ...]'
AND distance < 0.3
ORDER BY distance
LIMIT 1;

GenAI Integration

Vector Features use the existing GenAI Module for embedding generation.

Embedding Endpoint

  • Module: lib/GenAI_Thread.cpp
  • Global Handler: GenAI_Threads_Handler *GloGATH
  • Method: embed_documents({text})
  • Returns: GenAI_EmbeddingResult with float* data, embedding_size, count

Configuration

GenAI module connects to llama-server for embeddings:

cpp
// Endpoint: http://127.0.0.1:8013/embedding
// Model: nomic-embed-text-v1.5 (or similar)
// Dimension: 1536

Memory Management

cpp
// GenAI returns malloc'd data - must free after copying
GenAI_EmbeddingResult result = GloGATH->embed_documents({text});

std::vector<float> embedding(result.data, result.data + result.embedding_size);
free(result.data);  // Important: free the original data

Performance

Embedding Generation

OperationTimeNotes
Generate embedding~100-300msVia llama-server (local)
Vector cache search~10-50msKNN search with sqlite-vec
Pattern similarity check~10-50msKNN search with sqlite-vec

Cache Benefits

  • Cache hit: ~10-50ms (vs 1-5s for LLM call)
  • Semantic matching: Higher hit rate than exact text cache
  • Reduced LLM costs: Fewer API calls to cloud providers

Storage

  • Embedding size: 1536 floats × 4 bytes = ~6 KB per query
  • 1000 cached queries: ~6 MB + overhead
  • 100 threat patterns: ~600 KB

Troubleshooting

Vector Features Not Working

  1. Check AI features enabled:

    sql
    SELECT * FROM runtime_mysql_servers
    WHERE variable_name LIKE 'ai_%_enabled';
    
  2. Check vector DB exists:

    bash
    ls -la /var/lib/proxysql/ai_features.db
    
  3. Check GenAI handler initialized:

    bash
    tail -f proxysql.log | grep GenAI
    
  4. Check llama-server running:

    bash
    curl http://127.0.0.1:8013/embedding
    

Poor Similarity Detection

  1. Adjust thresholds:

    sql
    -- Lower threshold = more sensitive (more false positives)
    SET ai_anomaly_similarity_threshold='80';
    
  2. Add more threat patterns:

    cpp
    anomaly_detector->add_threat_pattern(...);
    
  3. Check embedding quality:

    • Ensure llama-server is using a good embedding model
    • Verify query normalization is working

Cache Issues

sql
-- Clear cache (via API, not SQL yet)
anomaly_detector->clear_cache();

-- Check cache statistics
SHOW STATUS LIKE 'ai_nl2sql_cache_%';

Security Considerations

  • Embeddings are stored locally in SQLite database
  • No external API calls for similarity search
  • Threat patterns are user-defined - ensure proper access control
  • Risk scores are heuristic - tune thresholds for your environment

Future Enhancements

  • Automatic threat pattern learning from flagged queries
  • Embedding model fine-tuning for SQL domain
  • Distributed vector storage for large-scale deployments
  • Real-time embedding updates for adaptive learning
  • Multi-lingual support for embeddings

API Reference

See API.md for complete API documentation.

Architecture Details

See ARCHITECTURE.md for detailed architecture documentation.

Testing Guide

See TESTING.md for testing instructions.