ERRATA.md
Affected version range: 8.78.54 (first release with documented support for the feature triggering the bug) up to 8.116.25 (first release with a complete fix).
In early November 2022, Vespa-8.78.54 officially added support for multi-threaded
evaluation of filters in combination with the nearestNeighbor query operator.
This feature caused a subtle behavior change that could trigger a long-present,
latent bug in the boundary-condition handling of low-level bit vector code. The
bug allowed memory corruption during query evaluation when a particular set of
conditions was satisfied. It was fixed in Vespa-8.116.25 on 2023-01-25, and
upgrading to at least that version is strongly advised.
Using multiple threads per query is not enabled by default.
Due to the many conditions required to trigger this bug, the Vespa team has confirmed observations of it on only a single application.
Vespa supports efficient multi-threaded query evaluation where the internal
document space is partitioned across multiple threads and data structures
are dynamically chosen based on the underlying matching document data.
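As an illustration of the partitioning described above, the document id space can be split into contiguous per-thread ranges. This is a minimal sketch; the function name and signature are assumptions for this example, not Vespa's actual internals.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Split [0, doc_count) into up to `threads` contiguous partitions,
// one per query thread. Each thread evaluates its partition independently.
std::vector<std::pair<uint32_t, uint32_t>>
partition(uint32_t doc_count, uint32_t threads) {
    std::vector<std::pair<uint32_t, uint32_t>> parts;
    uint32_t per = (doc_count + threads - 1) / threads;  // ceiling division
    for (uint32_t t = 0; t < threads; ++t) {
        uint32_t begin = t * per;
        uint32_t end = std::min(doc_count, begin + per);
        if (begin < end) parts.emplace_back(begin, end);  // skip empty tails
    }
    return parts;
}
```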
When evaluating an OR clause where the field is an index and the term is
present in more than 3% of the documents, Vespa uses a posting list represented
as a bit vector. When this bit vector had no overlap with the partition a
particular query thread was responsible for, erroneous boundary-tagging code
could set bits in an 8-byte area that did not belong to the bit vector.
This could happen if the following conditions were met:

- The nearestNeighbor operator was used in the same query as searching the index field.

Depending on what these 8 bytes were originally used for, memory corruption could manifest itself in many ways. Most commonly, the observed effect would be random, otherwise unexplainable crashes in indirectly affected code: segmentation faults caused by corrupted pointers, or assertion failures caused by broken invariants.
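The boundary-tagging failure can be sketched as follows. The BitVector layout, method names, and the clamping fix shown here are illustrative assumptions, not Vespa's actual code; the point is that an unclamped partition end computes a word index outside the allocation.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct BitVector {
    std::vector<uint64_t> words;   // backing storage, 64 bits per word
    uint32_t capacity;             // number of valid bit positions

    explicit BitVector(uint32_t docs)
        : words((docs + 63) / 64, 0), capacity(docs) {}

    // Buggy variant: trusts the caller-supplied partition end. When the
    // thread's partition lies past this vector's capacity, end / 64 indexes
    // a word outside the allocation. (.at() throws here for demonstration;
    // raw pointer arithmetic in low-level code would silently set bits in
    // memory that does not belong to the bit vector.)
    void tag_end_unclamped(uint32_t end) {
        words.at(end / 64) |= uint64_t{1} << (end % 64);
    }

    // Fixed variant: clamp the tag to the last valid bit position first.
    void tag_end_clamped(uint32_t end) {
        uint32_t pos = std::min(end, capacity - 1);
        words[pos / 64] |= uint64_t{1} << (pos % 64);
    }
};
```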
If memory was corrupted prior to being written to disk, it's possible that corrupted data has become persistent in the cluster. If you suspect this to be the case, the most robust solution is to re-feed your document set after upgrading to a Vespa version containing the fix.
If corrupted data was written to the transaction log, the content nodes containing a replica of the corrupted document may enter a crash loop since they can never successfully replay the log upon startup. Wiping the index on the affected nodes and re-feeding after upgrading Vespa is the suggested remediation.
This bug was introduced more than 10 years ago and remained unobserved until very recently. It was fixed in Vespa-7.306.16.
The following needs to happen to trigger the bug:
Tracking of successfully merged documents is done by exchanging bitmasks between the nodes, where bit position n corresponds to document presence on the nth node involved in the merge operation. The bug caused bitmasks for sub-operations that were optimized to focus on source-only nodes not to be correctly transformed onto the bitmask tracking documents across all nodes. Even when not all documents could be transferred from the source-only node because a chunk transfer limit was exceeded, the system would believe all remaining such documents had been transferred, due to the erroneous bitmask transform.
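A minimal sketch of the missing transform, with illustrative names (remap_mask is not Vespa's actual function): a sub-operation's mask is expressed against its own shorter node list, so its bits must be remapped onto full-node-list positions before being merged into the global tracking mask. Omitting this remap (the bug) treats sub-operation bit positions as if they were full-list positions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Mask = uint16_t;  // bit n set = document present on node n of some node list

// subset_positions[i] gives the index, in the full node list, of the ith
// node of the sub-operation's (e.g. source-only) node list.
Mask remap_mask(Mask sub_mask, const std::vector<int>& subset_positions) {
    Mask full_mask = 0;
    for (std::size_t i = 0; i < subset_positions.size(); ++i) {
        if (sub_mask & (Mask{1} << i)) {
            full_mask |= Mask{1} << subset_positions[i];
        }
    }
    return full_mask;
}
```

For example, if a 4-node merge runs a sub-operation against only nodes 2 and 3, then bit 0 of the sub-operation's result mask must become bit 2 in the full mask; interpreted without the remap, it would wrongly be taken to mean node 0.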
This is a silent data loss bug; it would be observed as the global document count in the system decreasing while data merges were taking place.
This bug was introduced in Vespa-7.277.38, and fixed in Vespa-7.292.82. The following needs to happen to trigger the bug:
Solution:
There exists a regression introduced in Vespa 7.141 where updates marked as create: true (i.e. create if missing)
may cause data loss or undetected inconsistencies in certain edge cases.
This regression was introduced as part of an optimization effort to greatly reduce the common-case overhead of updates
when replicas are out of sync.
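For context, create-if-missing update semantics can be sketched as follows. The Store type and apply_update function are hypothetical stand-ins, not Vespa's API: an update normally has no effect when the target document is absent, while create: true first creates an empty document and then applies the update to it.

```cpp
#include <map>
#include <string>

struct Store {
    std::map<std::string, int> docs;  // doc id -> a single integer field

    // Apply a partial update (here: add `delta` to the field). Without
    // create-if-missing, the update is dropped when the document is absent;
    // with it, an empty document is created first and then updated.
    bool apply_update(const std::string& id, int delta, bool create) {
        auto it = docs.find(id);
        if (it == docs.end()) {
            if (!create) return false;        // update not applied
            it = docs.emplace(id, 0).first;   // create-if-missing path
        }
        it->second += delta;
        return true;
    }
};
```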
Fixed in Vespa 7.157.9 and beyond. If you are running an affected version (7.141 up to and including 7.147), you are strongly advised to upgrade.
See #11686 for details.