Machine Learning Functions

docs/en/sql-reference/functions/machine-learning-functions.md

evalMLMethod {#evalmlmethod}

Use the evalMLMethod function to make predictions with fitted regression models. See the link in linearRegression.

stochasticLinearRegression {#stochasticlinearregression}

The stochasticLinearRegression aggregate function implements the stochastic gradient descent method using a linear model and the MSE loss function. Use evalMLMethod to predict on new data.
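To make the idea concrete, here is a minimal Python sketch of stochastic gradient descent on a linear model with squared-error loss. It is not ClickHouse's implementation; the function names, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(rows, targets, learning_rate=0.05, epochs=500, seed=0):
    """Fit weights and a bias with per-sample gradient steps on squared error."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            prediction = bias + sum(w * x for w, x in zip(weights, rows[i]))
            error = prediction - targets[i]
            # Gradient of 0.5 * error**2 with respect to each parameter.
            for j in range(n_features):
                weights[j] -= learning_rate * error * rows[i][j]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, row):
    """Apply the fitted linear model to one row of features."""
    return bias + sum(w * x for w, x in zip(weights, row))


# Toy, noise-free data generated from y = 2*x + 1 (illustrative only).
xs = [[0.0], [1.0], [2.0], [3.0], [4.0]]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
weights, bias = sgd_linear_regression(xs, ys)
```

On this toy data the fitted slope and intercept approach 2 and 1. In ClickHouse the fitting happens inside the aggregate function, and prediction goes through evalMLMethod rather than a helper like `predict`.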

stochasticLogisticRegression {#stochasticlogisticregression}

The stochasticLogisticRegression aggregate function implements the stochastic gradient descent method for binary classification problems. Use evalMLMethod to predict on new data.
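As an illustration of the technique, the following Python sketch trains a logistic model for binary classification with per-sample gradient steps on the log loss. It is a textbook version under assumed hyperparameters and toy data, not ClickHouse's implementation; all names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic_regression(rows, labels, learning_rate=0.1, epochs=500, seed=0):
    """Fit weights and a bias with per-sample gradient steps on the log loss."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            logit = bias + sum(w * x for w, x in zip(weights, rows[i]))
            # For log loss, the gradient with respect to the logit is p - y.
            error = sigmoid(logit) - labels[i]
            for j in range(n_features):
                weights[j] -= learning_rate * error * rows[i][j]
            bias -= learning_rate * error
    return weights, bias


def predict_proba(weights, bias, row):
    """Probability that the row belongs to class 1."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


# Toy data: small x values are class 0, large x values are class 1.
xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = sgd_logistic_regression(xs, labels)
```

Thresholding `predict_proba` at 0.5 gives a class label; the threshold is a modelling choice, not something this page specifies.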

naiveBayesClassifier {#naivebayesclassifier}

Classifies input text using a Naive Bayes model with n-grams and Laplace smoothing. The model must be configured in ClickHouse before use.

Syntax

sql
naiveBayesClassifier(model_name, input_text);

Arguments

  • model_name — Name of the pre-configured model. String. The model must be defined in ClickHouse's configuration files (see below).
  • input_text — Text to classify. String. Input is processed exactly as provided (case and punctuation are preserved).

Returned Value

  • Predicted class ID as an unsigned integer. UInt32. Class IDs correspond to categories defined during model construction.

Example

Classify text with a language detection model:

sql
SELECT naiveBayesClassifier('language', 'How are you?');
response
┌─naiveBayesClassifier('language', 'How are you?')─┐
│ 0                                                │
└──────────────────────────────────────────────────┘

Result 0 might represent English, while 1 could indicate French - class meanings depend on your training data.


Implementation Details {#implementation-details}

Algorithm: Uses the Naive Bayes classification algorithm with Laplace smoothing to handle unseen n-grams. Classification relies on n-gram probabilities, based on this.
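The scoring step can be sketched in Python using the textbook form of Naive Bayes with additive (Laplace) smoothing. The exact formula ClickHouse applies is not given on this page, so the normalisation by vocabulary size, the function name, and the in-memory data layout below are assumptions.

```python
import math


def classify(ngrams, counts, priors, alpha=1.0):
    """Return the class with the highest smoothed log-probability.

    counts: {class_id: {ngram: count}}  (hypothetical in-memory layout)
    priors: {class_id: prior probability}
    """
    # Assumed vocabulary: every distinct n-gram seen in any class.
    vocabulary = set()
    for per_class in counts.values():
        vocabulary.update(per_class)
    vocab_size = len(vocabulary)

    best_class, best_score = None, -math.inf
    for class_id in sorted(counts):
        per_class = counts[class_id]
        total = sum(per_class.values())
        score = math.log(priors[class_id])
        for ngram in ngrams:
            # Unseen n-grams get a count of 0, so alpha keeps the log finite.
            smoothed = (per_class.get(ngram, 0) + alpha) / (total + alpha * vocab_size)
            score += math.log(smoothed)
        if score > best_score:
            best_class, best_score = class_id, score
    return best_class


counts = {0: {"excellent": 15}, 1: {"refund": 28}}
priors = {0: 0.6, 1: 0.4}
```

Working in log space avoids underflow when many small probabilities are multiplied; without smoothing, a single unseen n-gram would drive a class probability to zero.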

Key Features

  • Supports n-grams of any size
  • Three tokenization modes:
    • byte: Operates on raw bytes. Each byte is one token.
    • codepoint: Operates on Unicode scalar values decoded from UTF‑8. Each codepoint is one token.
    • token: Splits on runs of Unicode whitespace (regex \s+). Tokens are substrings of non‑whitespace; punctuation is part of the token if adjacent (e.g., "you?" is one token).
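The three modes can be approximated in Python as follows. This is only a sketch: the helper name is hypothetical, and Python's notion of whitespace in `\S` may differ in edge cases from the one ClickHouse uses.

```python
import re


def tokenize(text, mode):
    """Split text into tokens for the given mode (illustrative helper)."""
    if mode == "byte":
        data = text.encode("utf-8")
        # One token per raw byte.
        return [data[i:i + 1] for i in range(len(data))]
    if mode == "codepoint":
        # A Python str iterates over Unicode code points.
        return list(text)
    if mode == "token":
        # Maximal runs of non-whitespace; adjacent punctuation stays attached.
        return re.findall(r"\S+", text)
    raise ValueError(f"unknown mode: {mode}")


word_tokens = tokenize("How are you?", "token")
byte_tokens = tokenize("\u00e9", "byte")
codepoint_tokens = tokenize("\u00e9", "codepoint")
```

The character U+00E9 shows the difference between the first two modes: it is a single codepoint but two UTF-8 bytes.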

Model Configuration {#model-configuration}

You can find sample source code for creating a Naive Bayes model for language detection here.

Additionally, sample models and their associated config files are available here.

Here is an example configuration for a naive Bayes model in ClickHouse:

xml
<clickhouse>
    <nb_models>
        <model>
            <name>sentiment</name>
            <path>/etc/clickhouse-server/config.d/sentiment.bin</path>
            <n>2</n>
            <mode>token</mode>
            <alpha>1.0</alpha>
            <priors>
                <prior>
                    <class>0</class>
                    <value>0.6</value>
                </prior>
                <prior>
                    <class>1</class>
                    <value>0.4</value>
                </prior>
            </priors>
        </model>
    </nb_models>
</clickhouse>

Configuration Parameters

| Parameter | Description | Example | Default |
|-----------|-------------|---------|---------|
| name | Unique model identifier | language_detection | Required |
| path | Full path to model binary | /etc/clickhouse-server/config.d/language_detection.bin | Required |
| mode | Tokenization method: byte (byte sequences), codepoint (Unicode characters), token (word tokens) | token | Required |
| n | N-gram size (token mode): 1 = single word, 2 = word pairs, 3 = word triplets | 2 | Required |
| alpha | Laplace smoothing factor used during classification to address n-grams that do not appear in the model | 0.5 | 1.0 |
| priors | Class probabilities (% of the documents belonging to a class) | 60% class 0, 40% class 1 | Equal distribution |

Model Training Guide

File Format: In human-readable format, for n=1 and token mode, the model might look like this:

text
<class_id> <n-gram> <count>
0 excellent 15
1 refund 28

For n=3 and codepoint mode, it might look like:

text
<class_id> <n-gram> <count>
0 exc 15
1 ref 28

Human-readable format is not used by ClickHouse directly; it must be converted to the binary format described below.
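Before conversion, the human-readable lines have to be parsed. The Python sketch below assumes one entry per line with single spaces separating the three fields, as in the examples above; because an n-gram may itself contain spaces (token mode with n > 1), it splits off the first and last fields only. The helper name is hypothetical.

```python
def parse_line(line):
    """Parse '<class_id> <n-gram> <count>' into (class_id, ngram, count)."""
    class_id, rest = line.rstrip("\n").split(" ", 1)
    # The n-gram is everything between the first and the last field.
    ngram, count = rest.rsplit(" ", 1)
    return int(class_id), ngram, int(count)


unigram_entry = parse_line("0 excellent 15")
bigram_entry = parse_line("1 how are 3")
```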

Binary Format Details: Each n-gram is stored as:

  1. 4-byte class_id (UInt, little-endian)
  2. 4-byte n-gram bytes length (UInt, little-endian)
  3. Raw n-gram bytes
  4. 4-byte count (UInt, little-endian)
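The per-entry layout above can be written and read back with Python's `struct` module. This sketch assumes that entries are simply concatenated with no file header or footer, which this page does not confirm; the function names are hypothetical.

```python
import struct


def pack_entry(class_id, ngram_bytes, count):
    """Encode one entry: class_id, byte length, raw bytes, count."""
    return (
        struct.pack("<I", class_id)            # 4-byte unsigned, little-endian
        + struct.pack("<I", len(ngram_bytes))  # length of the n-gram in bytes
        + ngram_bytes
        + struct.pack("<I", count)
    )


def serialize(entries):
    """Concatenate entries (assumed framing: no header, no separators)."""
    return b"".join(pack_entry(c, g, n) for c, g, n in entries)


def parse(data):
    """Decode a byte string produced by serialize back into entries."""
    entries, offset = [], 0
    while offset < len(data):
        class_id, length = struct.unpack_from("<II", data, offset)
        offset += 8
        ngram_bytes = data[offset:offset + length]
        offset += length
        (count,) = struct.unpack_from("<I", data, offset)
        offset += 4
        entries.append((class_id, ngram_bytes, count))
    return entries


entries = [(0, b"excellent", 15), (1, b"refund", 28)]
blob = serialize(entries)
```

Reading the bytes back with `parse` is a cheap way to check a model file before pointing the `path` setting at it.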

Preprocessing Requirements: Before the model is created from the document corpus, the documents must be preprocessed to extract n-grams according to the specified mode and n. The following steps outline the preprocessing:

  1. Add boundary markers at the start and end of each document based on tokenization mode:

    • Byte: 0x01 (start), 0xFF (end)
    • Codepoint: U+10FFFE (start), U+10FFFF (end)
    • Token: <s> (start), </s> (end)

    Note: (n - 1) tokens are added at both the beginning and the end of the document.

  2. Example for n=3 in token mode:

    • Document: "ClickHouse is fast"
    • Processed as: <s> <s> ClickHouse is fast </s> </s>
    • Generated trigrams:
      • <s> <s> ClickHouse
      • <s> ClickHouse is
      • ClickHouse is fast
      • is fast </s>
      • fast </s> </s>

To simplify model creation for byte and codepoint modes, it may be convenient to first tokenize the document into tokens (a list of bytes for byte mode and a list of codepoints for codepoint mode). Then, append n - 1 start tokens at the beginning and n - 1 end tokens at the end of the document. Finally, generate the n-grams and write them to the serialized file.
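The padding and n-gram extraction described above can be sketched in Python. The boundary markers are taken from the list above; the helper name and the tuple representation of an n-gram are assumptions, and how tokens are joined when written to the model file is not covered here.

```python
# Boundary markers per tokenization mode, as listed in the steps above.
START = {"byte": b"\x01", "codepoint": "\U0010FFFE", "token": "<s>"}
END = {"byte": b"\xff", "codepoint": "\U0010FFFF", "token": "</s>"}


def extract_ngrams(tokens, n, mode):
    """Pad with (n - 1) start and end markers, then slide a window of size n."""
    padded = [START[mode]] * (n - 1) + list(tokens) + [END[mode]] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


trigrams = extract_ngrams("ClickHouse is fast".split(), 3, "token")
for gram in trigrams:
    print(" ".join(gram))
```

For the document "ClickHouse is fast" this yields the same five trigrams as the worked example. With n=1 no markers are added, so the output is just the tokens themselves.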


<!-- The inner content of the tags below are replaced at doc framework build time with docs generated from system.functions. Please do not modify or remove the tags. See: https://github.com/ClickHouse/clickhouse-docs/blob/main/contribute/autogenerated-documentation-from-source.md --> <!--AUTOGENERATED_START--> <!--AUTOGENERATED_END-->