.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. highlight:: console

.. _cpp-fuzzing:

=================
Fuzzing Arrow C++
=================

To make the handling of invalid input more robust, we have enabled fuzz testing on several parts of the Arrow C++ feature set, currently:

* the IPC stream reader
* the IPC file reader
* the Parquet file reader
* the Parquet encoders and decoders
* the CSV file reader

We welcome any contribution to expand the scope of fuzz testing and cover areas ingesting potentially invalid or malicious data.

Fuzz Targets and Utilities
==========================

By passing the ``-DARROW_FUZZING=ON`` CMake option (or equivalently, using the ``fuzzing`` preset), you will build the fuzz targets corresponding to the aforementioned Arrow features, as well as additional related utilities.

Generating the seed corpus
--------------------------

Fuzzing essentially explores the domain space by randomly mutating previously tested inputs, without having any high-level understanding of the area being fuzz-tested. However, the domain space is so huge that this strategy alone may fail to actually produce any "interesting" inputs.

To guide the process, it is therefore important to provide a seed corpus of valid (or invalid, but remarkable) inputs from which the fuzzing infrastructure can derive new inputs for testing. A script is provided to automate that task. Assuming the fuzzing executables can be found in ``build/debug``, the seed corpus can be generated as follows:

.. code-block::

   $ ./build-support/fuzzing/generate_corpuses.sh build/debug
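
As a minimal illustration of why a seed corpus matters, the sketch below shows a toy fuzz target. The ``ParseToy`` function and its ``TOY1`` format are hypothetical and not part of Arrow; only ``LLVMFuzzerTestOneInput`` is the standard libFuzzer entry point. Purely random byte strings almost never begin with the four magic bytes, so a seed that already contains them can let the fuzzer reach the deeper length check much sooner.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <cstring>

   // Hypothetical toy format (not an Arrow format): 4 magic bytes "TOY1",
   // one length byte N, then exactly N payload bytes.
   enum class ParseResult { kOk, kTooShort, kBadMagic, kBadLength };

   ParseResult ParseToy(const uint8_t* data, size_t size) {
     static const char kMagic[4] = {'T', 'O', 'Y', '1'};
     // Reject inputs too short to hold the magic bytes and the length byte.
     if (size < 5) return ParseResult::kTooShort;
     if (std::memcmp(data, kMagic, 4) != 0) return ParseResult::kBadMagic;
     // The declared payload length must match the bytes actually present.
     const size_t declared = data[4];
     if (size - 5 != declared) return ParseResult::kBadLength;
     return ParseResult::kOk;
   }

   // Standard libFuzzer entry point: called once per generated input.
   // Invalid input must be rejected with an error, never with a crash.
   extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
     (void)ParseToy(data, size);
     return 0;
   }

With clang, a file like this can typically be built as a standalone fuzzer using ``clang++ -g -fsanitize=fuzzer,address``; the Arrow fuzz targets themselves are built through CMake instead.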

Continuous fuzzing infrastructure
=================================

The process of fuzz testing is computationally intensive and therefore benefits from dedicated computing facilities. Arrow C++ is exercised by the OSS-Fuzz_ continuous fuzzing infrastructure operated by Google.

Issues found by OSS-Fuzz are reported to, and visible only to, a `limited set of core developers <https://github.com/google/oss-fuzz/blob/master/projects/arrow/project.yaml>`_. If you are an Arrow core developer and want to be added to that list, you can ask on the :ref:`mailing-list <contributing>`.

.. _OSS-Fuzz: https://google.github.io/oss-fuzz/

Reproducing locally
===================

When fuzzing finds a crash, it is often useful to download the input that triggered it and reproduce the crash locally in order to debug and investigate.

Assuming you are in a subdirectory inside ``cpp``, the following command would allow you to build the fuzz targets with debug information and the various sanitizer checks enabled.

.. code-block::

   $ cmake .. --preset=fuzzing

Then, assuming you have downloaded the crashing data file (let's call it ``testcase-arrow-ipc-file-fuzz-123465``), you can reproduce the crash by running the affected fuzz target on that file:

.. code-block::

   $ build/debug/arrow-ipc-file-fuzz testcase-arrow-ipc-file-fuzz-123465

(you may want to run that command under a debugger so as to inspect the program state more closely)

Using conda
-----------

The fuzzing executables must be compiled with clang and linked to libraries which provide a fuzzing runtime. If you are using conda to provide your dependencies, you may need to install these before building the fuzz targets:

.. code-block::

   $ conda install clang clangxx compiler-rt
   $ cmake .. --preset=fuzzing

.. _fuzz-regression-files:

Regression files
================

When a fuzzer-detected bug is found and fixed, the corresponding reproducer must be stored in the `arrow-testing <https://github.com/apache/arrow-testing/>`__ repository to ensure that the code doesn't regress.

The locations for these files are as follows:

* IPC streams: in the `data/arrow-ipc-stream directory <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream>`__
* IPC files: in the `data/arrow-ipc-file directory <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-file>`__
* Parquet files: in the `data/parquet/fuzzing directory <https://github.com/apache/arrow-testing/tree/master/data/parquet/fuzzing>`__
* CSV files: in the `data/csv/fuzzing directory <https://github.com/apache/arrow-testing/tree/master/data/csv/fuzzing>`__

Most of those files are invalid files for their respective formats and stress proper error detection and reporting in the implementation code.
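
One way to exercise a directory of stored reproducers is to feed each file to the fuzz entry point in turn and check that none of them crashes. The sketch below assumes a hypothetical ``ReplayDirectory`` helper and uses a stand-in ``LLVMFuzzerTestOneInput`` that does nothing. It is not the mechanism Arrow itself uses, and a real fuzz target would pass the bytes to the reader under test.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <filesystem>
   #include <fstream>
   #include <iterator>
   #include <vector>

   // Stand-in for a real fuzz target (assumption for this sketch); a real
   // one would hand the bytes to the reader under test.
   extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
     (void)data;
     (void)size;
     return 0;
   }

   // Hypothetical helper: feed every regular file in `dir` to the fuzz
   // entry point and return how many files were replayed.
   size_t ReplayDirectory(const std::filesystem::path& dir) {
     size_t replayed = 0;
     for (const auto& entry : std::filesystem::directory_iterator(dir)) {
       if (!entry.is_regular_file()) continue;  // skip subdirectories etc.
       std::ifstream in(entry.path(), std::ios::binary);
       // Read the whole file into memory as raw bytes.
       std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());
       LLVMFuzzerTestOneInput(reinterpret_cast<const uint8_t*>(bytes.data()),
                              bytes.size());
       ++replayed;
     }
     return replayed;
   }

Because most regression files are deliberately invalid, a successful replay means that every file was rejected or read without a crash or sanitizer report, not that every file parsed correctly.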