The Anomaly Detection handler implements supervised, semi-supervised, and unsupervised anomaly detection algorithms using the pyod, catboost, xgboost, and sklearn libraries. The models were chosen based on the results in the ADBench benchmark paper.
<Info> **Additional information** If there is no labelled data, we use an unsupervised learner with the syntax CREATE ANOMALY DETECTION MODEL <model_name>, without specifying the target to predict. MindsDB then adds a column called outlier when generating results.
If we have labelled data, we use the regular model creation syntax. Backend logic then chooses between a semi-supervised algorithm (currently XGBOD) and a supervised algorithm (currently CatBoost).
If multiple models are provided, then we create an ensemble and use majority voting.
See the anomaly detection proposal document for more information.
Supervised: we have inlier/outlier labels, so we can train a classifier the normal way. This is very similar to a standard classification problem.
Semi-supervised: we have inlier/outlier labels, and we run an unsupervised preprocessing step followed by a supervised classification algorithm.
Unsupervised: we don’t have inlier/outlier labels and cannot assume all training data are inliers. These methods construct inlier criteria from distributional traits, so some of the training data will also be classified as outliers. New observations are classified against these criteria. However, without labels it’s not possible to evaluate how well the model detects outliers.
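To make the unsupervised case concrete, here is a minimal Python sketch of an inlier criterion learned from unlabelled data. It is not what the handler runs (the handler relies on pyod models); a plain z-score is swapped in to keep the example dependency-free. The class `ZScoreDetector`, the function `fit`, the `contamination` argument, and the sample values are all illustrative assumptions.

```python
import statistics
from dataclasses import dataclass


@dataclass
class ZScoreDetector:
    """Inlier criterion: points whose |z-score| reaches `cutoff` are outliers."""
    mean: float
    stdev: float
    cutoff: float

    def score(self, x: float) -> float:
        return abs(x - self.mean) / self.stdev

    def is_outlier(self, x: float) -> bool:
        return self.score(x) >= self.cutoff


def fit(train: list[float], contamination: float = 0.1) -> ZScoreDetector:
    """Learn the criterion from unlabelled data.

    The cutoff is placed so that about `contamination` of the training
    points (at least one) are themselves classified as outliers.
    """
    mean = statistics.fmean(train)
    stdev = statistics.pstdev(train)
    if stdev == 0:
        raise ValueError("training data has no spread")
    n_flagged = max(1, round(contamination * len(train)))
    scores = sorted((abs(x - mean) / stdev for x in train), reverse=True)
    return ZScoreDetector(mean, stdev, scores[n_flagged - 1])


# Made-up, unlabelled training values; the last one sits far from the rest.
train = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1, 25.0]
detector = fit(train, contamination=0.1)

# Some of the training data is flagged, then new observations are checked.
train_flags = [detector.is_outlier(x) for x in train]
print(train_flags)
print(detector.is_outlier(10.05), detector.is_outlier(30.0))
```

Because no labels are involved, nothing in this sketch can tell whether the flagged training point is a true anomaly, which is the evaluation limitation noted above.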
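We propose the following logic to determine the type of learning: if labels are available and the dataset contains at least 3000 samples, use supervised learning; if labels are available and the dataset contains fewer than 3000 samples, use semi-supervised learning; if the dataset is unlabelled, use unsupervised learning.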
We’ve chosen 3000 based on the results of the NeurIPS AD Benchmark paper (linked above). The authors report that semi-supervised learning outperforms supervised learning when the number of samples used is less than 5% of the size of the training dataset. The average size of the training datasets in their study is 60,000, therefore this 5% corresponds to 3000 samples on average.
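The selection between learner types can be sketched in a few lines of Python. This is a hedged illustration, not the handler's actual code: it reads the 3000 figure as a cut-off on the number of training samples, sends labelled datasets below it to the semi-supervised learner, and sends unlabelled data to the unsupervised learner. The function name `choose_learning_type`, its arguments, and the returned strings are assumptions made for this example.

```python
# 5% of the 60,000-sample average training set size quoted above.
AVERAGE_TRAIN_SIZE = 60_000
SAMPLE_THRESHOLD = AVERAGE_TRAIN_SIZE * 5 // 100  # 3000


def choose_learning_type(has_labels: bool, n_samples: int,
                         threshold: int = SAMPLE_THRESHOLD) -> str:
    """Pick a learning type from label availability and dataset size."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    if not has_labels:
        # No target column: only unsupervised detection is possible.
        return "unsupervised"
    if n_samples < threshold:
        # Few samples: assumed to favour the semi-supervised learner.
        return "semi-supervised"
    return "supervised"


print(choose_learning_type(has_labels=False, n_samples=10_000))
print(choose_learning_type(has_labels=True, n_samples=500))
print(choose_learning_type(has_labels=True, n_samples=3000))
```

Keeping the threshold as a parameter makes it easy to experiment with a cut-off other than 3000 for datasets that differ from the benchmark average.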
</Info>

<Info> **Reasoning for default models on each type** We refer to the NeurIPS AD Benchmark paper (linked above) to make these choices:
</Info>
Before proceeding, ensure the following prerequisites are met:
Create an AI engine from the Anomaly Detection handler.
CREATE ML_ENGINE anomaly_detection_engine
FROM anomaly_detection;
Create a model using anomaly_detection_engine as an engine.
CREATE ANOMALY DETECTION MODEL anomaly_detection_model
FROM datasource
(SELECT * FROM data_table)
PREDICT target_column
USING
engine = 'anomaly_detection_engine', -- engine name as created via CREATE ML_ENGINE
...; -- other parameters shown in usage examples below
To run example queries, use the data from this CSV file.
CREATE ANOMALY DETECTION MODEL mindsdb.unsupervised_ad
FROM files
(SELECT * FROM anomaly_detection)
USING
engine = 'anomaly_detection_engine';
DESCRIBE MODEL mindsdb.unsupervised_ad.model;
SELECT t.class, m.outlier as anomaly
FROM files.anomaly_detection as t
JOIN mindsdb.unsupervised_ad as m;
CREATE MODEL mindsdb.semi_supervised_ad
FROM files
(SELECT * FROM anomaly_detection)
PREDICT class
USING
engine = 'anomaly_detection_engine';
DESCRIBE MODEL mindsdb.semi_supervised_ad.model;
SELECT t.carat, t.category, t.class, m.class as anomaly
FROM files.anomaly_detection as t
JOIN mindsdb.semi_supervised_ad as m;
CREATE MODEL mindsdb.supervised_ad
FROM files
(SELECT * FROM anomaly_detection)
PREDICT class
USING
engine = 'anomaly_detection_engine',
type = 'supervised';
DESCRIBE MODEL mindsdb.supervised_ad.model;
SELECT t.carat, t.category, t.class, m.class as anomaly
FROM files.anomaly_detection as t
JOIN mindsdb.supervised_ad as m;
CREATE ANOMALY DETECTION MODEL mindsdb.unsupervised_ad_knn
FROM files
(SELECT * FROM anomaly_detection)
USING
engine = 'anomaly_detection_engine',
model_name = 'knn';
DESCRIBE MODEL mindsdb.unsupervised_ad_knn.model;
SELECT t.class, m.outlier as anomaly
FROM files.anomaly_detection as t
JOIN mindsdb.unsupervised_ad_knn as m;
CREATE ANOMALY DETECTION MODEL mindsdb.unsupervised_ad_local
FROM files
(SELECT * FROM anomaly_detection)
USING
engine = 'anomaly_detection_engine',
anomaly_type = 'local';
DESCRIBE MODEL mindsdb.unsupervised_ad_local.model;
SELECT t.class, m.outlier as anomaly
FROM files.anomaly_detection as t
JOIN mindsdb.unsupervised_ad_local as m;
CREATE ANOMALY DETECTION MODEL mindsdb.ad_ensemble
FROM files
(SELECT * FROM anomaly_detection)
USING
engine = 'anomaly_detection_engine',
ensemble_models = ['knn','ecod','lof'];
DESCRIBE MODEL mindsdb.ad_ensemble.model;
SELECT t.class, m.outlier as anomaly
FROM files.anomaly_detection as t
JOIN mindsdb.ad_ensemble as m;
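The ensemble above lists three detectors, and the handler is described as combining multiple models by majority voting. A minimal Python sketch of that idea follows, assuming each model emits a 0/1 flag per row (1 = outlier) and that ties count as inliers. The function name `majority_vote` and the sample flags are made up for illustration and are not taken from the handler's code.

```python
from typing import Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> list[int]:
    """Combine per-model outlier flags (1 = outlier, 0 = inlier) row by row.

    `predictions` holds one sequence of flags per model, all of equal length.
    A row is flagged only when strictly more than half of the models flag it,
    so a tie counts as an inlier (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_models = len(predictions)
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all models must score the same number of rows")
    return [
        1 if 2 * sum(model[i] for model in predictions) > n_models else 0
        for i in range(n_rows)
    ]


# Hypothetical flags from three detectors (e.g. knn, ecod, lof) on five rows.
knn_flags = [0, 1, 1, 0, 0]
ecod_flags = [0, 1, 0, 0, 1]
lof_flags = [0, 1, 1, 0, 0]

combined = majority_vote([knn_flags, ecod_flags, lof_flags])
print(combined)
```

Using an odd number of models, as in the three-model ensemble above, avoids ties altogether.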
<Tip> **Next Steps**
Watch demo 1 and demo 2 to see usage examples.
Go to the Use Cases section to see more examples. </Tip>