Data compression is a technology that reorganizes and processes data using specific algorithms without losing effective information, aiming to reduce the storage space occupied by data and improve data transmission efficiency. TDengine employs this technology in both the storage and transmission processes to optimize the use of storage resources and accelerate data exchange.
TDengine adopts columnar storage technology in its storage architecture, meaning that data is stored continuously by column in the storage medium. This is different from traditional row-based storage, where data is stored continuously by row in the storage medium. Columnar storage, combined with the characteristics of time-series data, is particularly suitable for handling steadily changing time-series data.
To further improve storage efficiency, TDengine uses differential encoding technology. This technique stores data by calculating the difference between adjacent data points, rather than storing the raw values, thereby significantly reducing the amount of information needed for storage. After differential encoding, TDengine also uses general compression techniques to further compress the data, achieving a higher compression rate.
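The idea of differential encoding can be illustrated with a minimal sketch. This is not TDengine's internal code; the function names and the sample readings below are hypothetical and chosen only to show why differences are cheaper to store than raw values.

```python
def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum over the deltas."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# Hypothetical water-temperature readings, in hundredths of a degree Celsius.
readings = [2150, 2151, 2151, 2153, 2152, 2152, 2154, 2155]
deltas = delta_encode(readings)

# After the first entry, every delta has a much smaller magnitude than the raw readings.
largest_delta = max(abs(d) for d in deltas[1:])
```

For these readings the raw values need at least 12 bits each, while every difference after the first entry has a magnitude of at most 2, which is what makes the encoded column easier to compress.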
For stable time-series data collected by devices, TDengine's compression is particularly effective: compressed data typically occupies no more than 10% of its original size, and even less in some cases. This efficient compression saves users significant storage costs while also improving the efficiency of data storage and access.
After time-series data is collected from devices, following TDengine's data modeling rules, each collection device is constructed as a subtable. Thus, all time-series data generated by a device is recorded in the same subtable. During the data storage process, data is stored in blocks, each block containing data from only one subtable. Compression is also performed on a block basis, compressing each column of data in the subtable separately, and the compressed data is still stored on the disk by block.
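The block-and-column layout described above can be sketched as follows. This is only an illustration under stated assumptions: the column names, the fixed-width layouts in `COLUMN_FORMATS`, and the use of `zlib` as the compressor are all hypothetical and do not reflect TDengine's on-disk format.

```python
import struct
import zlib

# Hypothetical block of rows from a single subtable: (timestamp_ms, current, voltage).
rows = [
    (1_700_000_000_000 + 1000 * i, 10.0 + (i % 5) * 0.25, 220 + (i % 3))
    for i in range(4096)
]

# Assumed fixed-width layout for each column: int64, float64, int32.
COLUMN_FORMATS = ("q", "d", "i")


def compress_block(block_rows):
    """Transpose the rows into columns and compress every column independently."""
    columns = list(zip(*block_rows))
    compressed = []
    for fmt, column in zip(COLUMN_FORMATS, columns):
        raw = struct.pack(f"<{len(column)}{fmt}", *column)
        compressed.append(zlib.compress(raw))
    return compressed


def decompress_block(compressed, row_count):
    """Decompress each column, then zip the columns back into rows."""
    columns = []
    for fmt, blob in zip(COLUMN_FORMATS, compressed):
        raw = zlib.decompress(blob)
        columns.append(struct.unpack(f"<{row_count}{fmt}", raw))
    return list(zip(*columns))


block = compress_block(rows)
```

Because each column is compressed on its own, values of the same type and similar magnitude sit next to each other in the compressor's input, which is the property the columnar layout is meant to exploit.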
Stability is one of the main characteristics of time-series data: collected values such as atmospheric temperature or water temperature usually fluctuate within a certain range. Exploiting this characteristic, data can be re-encoded, with different encoding techniques adopted for different data types to achieve the highest compression efficiency. The following introduces the compression methods for various data types.
After completing the specialized compression for specific data types, TDengine applies a general compression technique to compress the data a second time, treating it as undifferentiated binary data. Compared to primary compression, secondary compression focuses on eliminating information redundancy between data blocks. The two stages, one simplifying data locally and the other eliminating overlap across the data as a whole, work together to achieve TDengine's very high compression rate.
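A rough sketch of the two-stage idea follows. It assumes delta encoding as the type-specific first stage and `zlib` as the general-purpose second stage; both choices, along with the function names and the sample column, are illustrative assumptions rather than TDengine's actual pipeline.

```python
import struct
import zlib


def stage_one(values):
    """Type-specific step (assumed here): delta-encode a non-empty list of integers."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def stage_two(values):
    """General-purpose step: compress the encoded integers as opaque bytes."""
    return zlib.compress(struct.pack(f"<{len(values)}q", *values))


def restore(blob, count):
    """Undo both stages: decompress, then take the running sum of the deltas."""
    deltas = struct.unpack(f"<{count}q", zlib.decompress(blob))
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# Hypothetical sensor column: a slowly drifting integer reading.
readings = [20_000 + (i // 7) + (i % 3) for i in range(5000)]

only_general = stage_two(readings)             # general compression alone
both_stages = stage_two(stage_one(readings))   # encoding first, then general compression
```

Comparing `len(only_general)` with `len(both_stages)` on data like this shows why the stages are combined: the first stage turns a drifting signal into a short repeating pattern, which the general-purpose compressor then removes almost entirely.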
TDengine supports multiple compression algorithms, including LZ4, ZLIB, ZSTD, XZ, etc. Users can flexibly balance between compression rate and write speed according to specific application scenarios and needs, choosing the most suitable compression scheme.
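The trade-off between compression rate and speed can be explored with a small measurement harness. The sketch below uses only codecs from the Python standard library (`zlib` at two levels and `lzma`, whose default container is XZ); LZ4 and ZSTD are left out because they need third-party packages. The sample column and the `measure` helper are hypothetical, and the numbers it produces say nothing about TDengine's own performance.

```python
import lzma
import struct
import time
import zlib

# Hypothetical column: 20,000 integer readings within a narrow range, packed as int32.
values = [1000 + (i * 7919) % 50 for i in range(20000)]
payload = struct.pack(f"<{len(values)}i", *values)

# Each entry maps a label to a (compress, decompress) pair.
CODECS = {
    "zlib-1": (lambda data: zlib.compress(data, 1), zlib.decompress),
    "zlib-9": (lambda data: zlib.compress(data, 9), zlib.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def measure(data):
    """Return {label: (compressed_size / original_size, seconds spent compressing)}."""
    results = {}
    for label, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        blob = compress(data)
        elapsed = time.perf_counter() - start
        if decompress(blob) != data:
            raise ValueError(f"{label} did not round-trip")
        results[label] = (len(blob) / len(data), elapsed)
    return results


results = measure(payload)
```

Running this on representative data from a real workload, rather than on the synthetic column here, is the more useful way to decide which algorithm suits a given scenario.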
The TDengine engine provides two modes for floating-point data: lossless compression and lossy compression. The precision of a floating-point number is usually determined by the number of digits after the decimal point. In some cases, devices collect floating-point values at high precision, but the application only cares about a lower precision. In such cases, lossy compression can effectively save storage space. TDengine's lossy compression algorithm, named TSZ, is based on a prediction model whose core idea is to use the trend of previous data points to predict the trend of subsequent data points. This algorithm can significantly improve the compression rate, and its compression effect far exceeds that of lossless compression.
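To make the prediction-based idea concrete, here is a generic error-bounded predictive quantizer. It is not the TSZ algorithm: the linear-extrapolation predictor, the `max_error` parameter, and all function names are assumptions chosen to illustrate how predicting from previous points lets a codec store only small integer corrections.

```python
import math


def predict(history):
    """Extrapolate linearly from the last two reconstructed values (assumed predictor)."""
    if len(history) >= 2:
        return 2.0 * history[-1] - history[-2]
    if history:
        return history[-1]
    return 0.0


def lossy_encode(values, max_error):
    """Store, for each value, the prediction residual quantized in steps of 2 * max_error."""
    step = 2.0 * max_error
    codes, history = [], []
    for value in values:
        guess = predict(history)
        code = round((value - guess) / step)
        codes.append(code)
        # Track the reconstructed value, not the original, so the decoder stays in sync.
        history.append(guess + code * step)
    return codes


def lossy_decode(codes, max_error):
    """Replay the same predictions and add back the quantized residuals."""
    step = 2.0 * max_error
    history = []
    for code in codes:
        history.append(predict(history) + code * step)
    return history


# Hypothetical readings: a slow upward trend with a small oscillation.
values = [20.0 + 0.01 * i + 0.003 * math.sin(i) for i in range(500)]
codes = lossy_encode(values, max_error=0.01)
restored = lossy_decode(codes, max_error=0.01)
```

Each restored value differs from the original by at most `max_error` (up to floating-point rounding), and for smooth data most codes are small integers, which a general-purpose compressor can then pack tightly.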
TDengine provides compression during data transmission to reduce network bandwidth consumption. When a native connection is used to transmit data from the client (such as taosc) to the server, compressed transmission can be configured to save bandwidth. This is done by setting the compressMsgSize option in the configuration file taos.cfg. The configurable values are as follows:
When using RESTful and WebSocket connections to communicate with taosAdapter, taosAdapter supports industry-standard compression protocols, allowing the connecting end to enable or disable compression during the transmission process according to industry-standard protocols. Here are the specific implementation methods:
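For the RESTful case, one standard mechanism is HTTP content negotiation through the `Accept-Encoding` header. The sketch below only builds a request and shows how a client might decode a compressed body; it sends nothing. The URL is an assumed local taosAdapter address, authentication is omitted, and the helper names are hypothetical.

```python
import gzip
import urllib.request

# Assumed endpoint for illustration only; adjust host, port, and path for a real deployment.
URL = "http://localhost:6041/rest/sql"


def build_request(sql):
    """Build (but do not send) a POST request that asks for a gzip-compressed response."""
    request = urllib.request.Request(URL, data=sql.encode("utf-8"), method="POST")
    request.add_header("Accept-Encoding", "gzip")
    return request


def decode_body(body, content_encoding):
    """Decompress a response body only if the server reports that it compressed it."""
    if content_encoding == "gzip":
        return gzip.decompress(body)
    return body


request = build_request("SHOW DATABASES")
```

A client would pass the response's `Content-Encoding` header value to `decode_body`, so that uncompressed responses from a server that declined compression are still handled correctly.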
The diagram below shows the compression and decompression process of the TDengine engine in the entire transmission and storage process of time-series data, to better understand the entire handling process.