presto-docs/src/main/sphinx/admin/verifier.rst
Presto Verifier is a tool to run queries and verify correctness. It can be used to test whether a new Presto version produces the correct query results, or to test if pairs of Presto queries have the same semantics.
During each Presto release, Verifier is run to ensure that there is no correctness regression.
In a MySQL database, create the following table and load it with the queries you would like to run:
.. code-block:: sql
CREATE TABLE verifier_queries (
id int(11) unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
suite varchar(256) NOT NULL,
name varchar(256) DEFAULT NULL,
control_catalog varchar(256) NOT NULL,
control_schema varchar(256) NOT NULL,
control_query text NOT NULL,
control_username varchar(256) DEFAULT NULL,
control_password varchar(256) DEFAULT NULL,
control_session_properties text DEFAULT NULL,
test_catalog varchar(256) NOT NULL,
test_schema varchar(256) NOT NULL,
test_query text NOT NULL,
test_username varchar(256) DEFAULT NULL,
test_password varchar(256) DEFAULT NULL,
test_session_properties text DEFAULT NULL)
Next, create a config.properties file:
.. code-block:: none
source-query.suites=suite
source-query.database=jdbc:mysql://localhost:3306/mydb?user=my_username&password=my_password
control.hosts=127.0.0.1
control.http-port=8080
control.jdbc-port=8080
control.application-name=verifier-test
test.hosts=127.0.0.1
test.http-port=8081
test.jdbc-port=8081
test.application-name=verifier-test
test-id=1
Verifier can run in either query-bank or control-test mode by setting configuration
running-mode.
control-test: this is the default mode. Both the control query and the test query are
executed and their result checksums are compared.
query-bank: in this mode, the control query is skipped and the comparison is done between
a saved snapshot result and the test result.
Create a verifier_snapshots table:
.. code-block:: sql
CREATE TABLE verifier_snapshots (
id int(11) unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
suite varchar(256) NOT NULL,
name varchar(256) NOT NULL DEFAULT '.',
is_explain BOOLEAN NOT NULL DEFAULT false,
snapshot json NOT NULL,
updated_at datetime NOT NULL DEFAULT now(),
UNIQUE(suite, name, is_explain));
Download :maven_download:verifier and rename it to verifier.jar. To run the Verifier:
.. code-block:: none
java -Xmx1G -jar verifier.jar verify config.properties
Before running in query-bank mode, snapshots must be saved. Add configurations:
.. code-block:: none
running-mode=query-bank
save-snapshot=true
Run the verifier and the snapshots will be saved to the table verifier_snapshots.
To run in query-bank mode, set save-snapshot=false or just delete it:
.. code-block:: none
running-mode=query-bank
#save-snapshot=true
The following steps summarize the workflow of Verifier.
Importing Source Queries
Query Pre-processing and Filtering
Select, Insert,
CreateTableAsSelect, create table and create view are supported.Query rewriting
Select queries to CreateTableAsSelect
Select query with LIMIT 0.Insert and CreateTableAsSelect queries to have their table names replaced.
Insert query.nondeterministic-function-substitutes
if the configuration is set.Query Execution
Query-bank mode, the control
cluster is skipped and a saved snapshot is used instead.Failure Resolution_ for auto-resolving of query failures.Results Comparison
Select, Insert, and CreateTableAsSelect queries, results are written into
temporary tables.query-bank mode, Control checksums will be restored from snapshots saved in mysql
database.mysql database if configuration save-snapshot is
set to true.Column Checksums_ for special handling
of different column types.Determinism_ for handling of non-deterministic queries.Emitting Results
JSON, or human readable text.For each column in the control/test query, one or more columns are generated in the checksum queries.
DOUBLE and REAL columns, 4 columns are generated for verification:
NAN count of the columnNAN count, positive and negative infinity count matches.VARCHAR columns, a simple checksum column is generated for verification using the :func:!checksum.validate-string-as-double is set, seven columns below are generated. If NULL counts are equal before and after casting all values to DOUBLE, apply the floating point validation. Otherwise, check if the simple checksum matches.
NULL valuesNULL values after all values are casted to DOUBLE
After casting all values to DOUBLENAN count of the valuesarr of type array(E), three columns are generated for verification:
E is not orderable, array checksum is checksum(arr).E is orderable, array checksum is coalesce(checksum(try(array_sort(arr))), checksum(arr)).use-error-margin-for-floating-point-arrays is set and E is DOUBLE or REAL, these six columns below are generated instead. Check if the sum of the cardinality matches and if the checksum of the cardinality matches. Apply the floating point validation to the rest of results.
NAN elements of all array valuesvalidate-string-as-double is set and E is VARCHAR, these nine columns below are generated instead. Check if the sum and the checksum of the cardinality match. If NULL counts are equal before and after casting all array elements to DOUBLE, apply the floating point validation. Otherwise, check if the array checksum matches.
checksum(array_sort(arr))NULL elements of all array valuesNULL elements after all array elements are casted to DOUBLE
After casting all array elements to DOUBLENAN elements of all array valuesmap(K, V), four columns are generated for verification:
validate-string-as-double is set and K is VARCHAR, six additional columns are generated:
NULL elements of all key setsNULL elements of the key sets after all map keys are casted to DOUBLE
After casting all map keys to DOUBLENAN elements of all key setsvalidate-string-as-double is set and V is VARCHAR, six additional columns are generated:
NULL elements of all value setsNULL elements of the value sets after all map values are casted to DOUBLE
After casting all map values to DOUBLENAN elements of all value setsFor all other column types, generates a simple checksum using the :func:!checksum function.
A result mismatch, either a row count mismatch or a column mismatch, can be caused by non-deterministic query features. To avoid false alerts, we perform determinism analysis for the control query. If a query is found non-deterministic, we skip the verification as it does not provide insights.
Determinism analysis follows the following steps. If a query is found non-deterministic at any point, the analysis will conclude.
determinism.non-deterministic-catalog.
If a query references any table from those catalogs, the query is considered non-deterministic.LIMIT n clause but no ORDER BY clause at the top level:
LIMIT
clause.n, treats the control query as
non-deterministic.The differences in configuration, including cluster size, can cause a query to succeed on the control cluster but fail on the test cluster. A checksum query can also fail, which may be due to limitation of Presto or Presto Verifier. Thus, we allow Verifier to automatically resolve certain query failures.
EXCEEDED_GLOBAL_MEMORY_LIMIT: Resolves if the control query uses more memory than the test
query.EXCEEDED_TIME_LIMIT: Resolves unconditionally.TOO_MANY_HIVE_PARTITIONS: Resolves if the test cluster does not have enough workers to make
sure the number of partitions assigned to each worker stays within the limit.COMPILER_ERROR, GENERATED_BYTECODE_TOO_LARGE: Resolves if the control checksum query
fails with this error. If the control query has too many columns, generated checksum queries
might be too large in certain cases.In cases of result mismatches, Verifier may be giving noisy signals, and we allow Verifier to automatically resolve certain mismatches.
In explain mode, Verifier checks whether source queries can be explained instead of whether they produces the same results. Verification is marked as succeeded when both control query and test query can be explained.
The field matchType in the output event can be used as an indicator whether there are
plan differences between the control run and the test run.
For non-DML queries, the control query and the plan comparison are skipped.
Verifier can be extended for further behavioral changes in addition to configuration properties.
AbstractVerifyCommand <https://github.com/prestodb/presto/blob/master/presto-verifier/src/main/java/com/facebook/presto/verifier/framework/AbstractVerifyCommand.java>_
shows the components that be extended. Implement the abstract class and create a command line wrapper similar to
PrestoVerifier <https://github.com/prestodb/presto/blob/master/presto-verifier/src/main/java/com/facebook/presto/verifier/PrestoVerifier.java>_.
General Configuration
=========================================== ===============================================================================
Name Description
=========================================== ===============================================================================
``whitelist`` A comma-separated list specifying the names of the queries within the suite
to verify.
``blacklist`` A comma-separated list specifying the names of the queries to be excluded
from the suite. ``blacklist`` is applied after ``whitelist``.
``source-query-supplier`` The name of the source query supplier. Supports ``mysql``.
``source-query.table-name`` The name of the table that holds verifier queries. Available only when
``source-query-supplier`` is ``mysql``.
``event-clients`` A comma-separated list specifying where the output events should be emitted.
Supports ``json`` and ``human-readable``.
``json.log-file`` The output files of ``JSON`` events. If not set, ``JSON`` events are emitted to
``stdout``.
``human-readable.log-file`` The output files for human-readable events. If not set, human-readable events
are emitted to ``stdout``.
``control.table-prefix`` The table prefix to be appended to the control target table.
``test.table-prefix`` The table prefix to be appended to the test target table.
``control.reuse-table`` If ``true``, reuse the output table of the control source Insert and CreateTableAsSelect
query. Otherwise, run the control source query and write to a temporary table.
``test.reuse-table`` If ``true``, reuse the output table of the test source Insert and CreateTableAsSelect
query. Otherwise, run the test source query and write to a temporary table.
``test-id`` A string to be attached to output events.
``max-concurrency`` Maximum number of concurrent verifications.
``suite-repetition`` How many times a suite is verified.
``query-repetition`` How many times a source query is verified.
``relative-error-margin`` Maximum tolerable relative error between control sum and test sum of a
floating point column.
``absolute-error-margin`` Floating point averages that are below this threshold are treated as ``0``.
``run-teardown-on-result-mismatch`` Whether to run teardown query in case of result mismatch.
``verification-resubmission.limit`` A limit on how many times a source query can be re-submitted for verification.
``running-mode`` Set to ``query-bank`` to make the Verifier run in ``query-bank`` mode. Supports
``query-bank`` and ``control-test``. Defaults to ``control-test``.
``save-snapshot`` Set to ``true`` to save checksums to ``mysql`` database.
``extended-verification`` Set to ``true`` to run extended table layout verification for written tables.
It only applies to ``Insert`` and ``CreateTableAsSelect`` queries.
It would verify each partition's data checksum if the inserted table is partitioned.
It would verify each bucket's data checksum if the inserted table is bucketed.
``function-substitutes`` Specification of function substitutions, in the format of
``/foo(c0,_)/bar(c0)/,/fred(c0,c1)/baz(qux(c1,c0))/,/foobar(c0)/if(qux(c1),bar(c0),baz(c1))/,...``,
where ``foo(c0, _)`` would be substituted by ``bar(c0)``,
with the declared arguments applied to the corresponding positions.
Concatenate function substitutions with a comma.
Select a function substitute that has the return type and argument types
compatible with those of the original function, to produce a valid source query. For
example, ``/array_agg(z)/array_sort(array_agg(z))/,/approx_percentile(x,y)/avg(x)/``.
Declare the function arguments as identifiers if they need to be applied to
the function substitute.
=========================================== ===============================================================================
Query Override Configuration
The following configurations control the behavior of query metadata modification before verification starts.
Counterparts are also available for test queries with prefix control being replaced with test.
================================================ ===============================================================================
Name Description
================================================ ===============================================================================
control.catalog-override The catalog to be applied to all queries if specified.
control.schema-override The schema to be applied to all queries if specified.
control.username-override The username to be applied to all queries if specified.
control.password-override The password to be applied to all queries if specified.
control.session-properties-override-strategy Supports 3 values. NO_ACTION: Use the session properties as specified for
each query. OVERRIDE: Merge the session properties of each query with the
override, with override being the dominant. SUBSTITUTE, The session
properties of each query is replaced with the override.
control.session-properties-override The session property to be applied to all queries.
================================================ ===============================================================================
Query Execution Configuration
The following configurations control the behavior of query execution on the control cluster.
Counterparts are also available for test clusters with prefix ``control`` being replaced with ``test``.
=========================================== ===============================================================================
Name Description
=========================================== ===============================================================================
``control.hosts`` Comma-separated list of the control cluster hostnames or IP addresses.
``control.jdbc-port`` JDBC port of the control cluster.
``control.http-host`` HTTP port of the control cluster.
``control.jdbc-url-parameters`` A ``JSON`` map representing the additional URL parameters for control JDBC.
``control.query-timeout`` The execution time limit of the control and the test queries.
``control.metadata-timeout`` The execution time limit of ``DESC`` queries and ``LIMIT 0`` queries.
``control.checksum-timeout`` The execution time limit of checksum queries.
``control.application-name`` ApplicationName to be passed in ClientInfo. Can be used to set source.
=========================================== ===============================================================================
Determinism Analyzer Configuration
=========================================== ===============================================================================
Name Description
=========================================== ===============================================================================
determinism.run-teardown Whether to run teardown queries for tables produced in determinism analysis.
determinism.max-analysis-runs Maximum number of additional control runs to check for the determinism of the
control query.
determinism.handle-limit-query Whether to enable the special handling for queries with a top level LIMIT
clause.
determinism.non-deterministic-catalogs A comma-separated list of non-deterministic catalogs. Queries referencing table
from those catalogs are treated as non-deterministic.
=========================================== ===============================================================================
Failure Resolution Configuration
========================================================= ======================================================================
Name Description
========================================================= ======================================================================
``exceeded-global-memory-limit.failure-resolver.enabled`` Whether to enable the failure resolver for test query failures with
``EXCEEDED_GLOBAL_MEMORY_LIMIT``.
``exceeded-time-limit.failure-resolver.enabled`` Whether to enable the failure resolver for test query failures with
``EXCEEDED_TIME_LIMIT``.
``verifier-limitation.failure-resolver.enabled`` Whether to enable the failure resolver for failures due to Verifier
limitations.
``too-many-open-partitions.failure-resolver.enabled`` Whether to enable the failure resolver for test query failures with
``HIVE_TOO_MANY_OPEN_PARTITIONS``.
``too-many-open-partitions.max-buckets-per-writer`` The maximum buckets count per writer configured on the control and the
test cluster.
``too-many-open-partitions.cluster-size-expiration`` The time limit of the test cluster size being cached.
``structured-column.failure-resolver.enabled`` Whether to enable the failure resolver for column mismatches of
structured-type columns.
``ignored-functions.failure-resolver.enabled`` Whether to enable the ``IgnoredFunctions`` result mismatch failure
resolver.
``ignored-functions.functions`` A comma-separated list of functions. Resolves mismatches if a query
uses any functions in the list.
========================================================= ======================================================================