docs/sort_facet.md
Document Format used for the test scenario:
<div style="overflow-x: auto;">
<table>
<tr>
<th>Document 1</th>
<th>Document 2</th>
<th>... Document i</th>
<th>Document 5000</th>
</tr>
<tr>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "dummyTerm": "Term",
  "dummyDate": "2000-01-01T00:00:00",
  "dummyNumber": 1
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "dummyTerm": "Term",
  "dummyDate": "2000-01-01T01:00:00",
  "dummyNumber": 2
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "dummyTerm": "Term",
  "dummyDate": "2000-01-01T01:00:00"+(i hours),
  "dummyNumber": i
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "dummyTerm": "Term",
  "dummyDate": "2000-01-01T01:00:00 + (5000 hours)",
  "dummyNumber": 5000
}
</pre>
</td>
</tr>
</table>
</div>
<p align="justify">I then ran the following set of search requests against both indexes while increasing the number of indexed documents from 2000 to 4000.</p>
<div style="overflow-x: auto;">
<table>
<tr>
<th>Request 1</th>
<th>Request 2</th>
<th>... Request i</th>
<th>Request 1000</th>
</tr>
<tr>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": [ "*" ],
  "highlight": {},
  "query": {
    "match": "term",
    "field": "dummyTerm"
  },
  "facets": {
    "myDate": {
      "field": "dummyDate",
      "size": 100000,
      "date_ranges": [
        { "start": "2000-01-01T00:00:00", "end": "2000-01-01T01:00:00" }
      ]
    },
    "myNum": {
      "field": "dummyNumber",
      "size": 100000,
      "numeric_ranges": [
        { "min": 1000, "max": 1001 }
      ]
    }
  },
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": [ "*" ],
  "highlight": {},
  "query": {
    "match": "term",
    "field": "dummyTerm"
  },
  "facets": {
    "myDate": {
      "field": "dummyDate",
      "size": 100000,
      "date_ranges": [
        { "start": "2000-01-01T01:00:00", "end": "2000-01-01T02:00:00" }
      ]
    },
    "myNum": {
      "field": "dummyNumber",
      "size": 100000,
      "numeric_ranges": [
        { "min": 999, "max": 1000 }
      ]
    }
  },
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": [ "*" ],
  "highlight": {},
  "query": {
    "match": "term",
    "field": "dummyTerm"
  },
  "facets": {
    "myDate": {
      "field": "dummyDate",
      "size": 100000,
      "date_ranges": [
        { "start": "2000-01-01T00:00:00" + i hour, "end": "2000-01-01T00:00:00" + (i+1) hour }
      ]
    },
    "myNum": {
      "field": "dummyNumber",
      "size": 100000,
      "numeric_ranges": [
        { "min": 1000-i, "max": 1000-i+1 }
      ]
    }
  },
  "size": 10,
  "from": 0
}
</pre>
</td>
<td style="vertical-align: top; width: 20%;">
<pre>
{
  "explain": true,
  "fields": [ "*" ],
  "highlight": {},
  "query": {
    "match": "term",
    "field": "dummyTerm"
  },
  "facets": {
    "myDate": {
      "field": "dummyDate",
      "size": 100000,
      "date_ranges": [
        { "start": "2000-01-01T01:00:00" + 1000 hour, "end": "2000-01-01T02:00:00" + 1001 hour }
      ]
    },
    "myNum": {
      "field": "dummyNumber",
      "size": 100000,
      "numeric_ranges": [
        { "min": 0, "max": 1 }
      ]
    }
  },
  "size": 10,
  "from": 0
}
</pre>
</td>
</tr>
</table>
</div>
<div style="overflow-x: auto;">
<table>
<tr>
<th>Bleve index size growth with increase in indexed documents</th>
<th>Total query time for 1000 queries with increase in number of indexed documents</th>
</tr>
<tr>
<td></td>
<td></td>
</tr>
</table>
</div>
<div style="overflow-x: auto;">
<table>
<tr>
<th style="width:50%">Average increase in index size (in bytes) by enabling DocValues</th>
<th style="width:50%">Average reduction in time taken to perform 1000 queries (in milliseconds) by enabling DocValues</th>
</tr>
<tr>
<td align="center"><code>7762.47</code></td>
<td align="center"><code>27.034</code></td>
</tr>
</table>
</div>
<p align="justify">Even at this small scale, with a small document size and a very limited number of indexed documents, we still observe a noticeable tradeoff. With just a slight increase in the index size (under 8 KB on average), we obtain a reduction of roughly 27 ms in the total execution time, on average, for only 1000 queries.</p>
<h3>Technical Information</h3>
<p align="justify">When a search request involves faceting or sorting on a field F, these operations run after the main search query is executed.
For instance, if the main query matches 200 documents, sorting and faceting are applied to those 200 documents. However, the main query result provides only a set of document IDs, not the document contents.</p>
<p align="justify">This is where docValues become essential. If the field mapping for F has docValues enabled, the system can read the field's values directly from the docValues section stored in the index file. For each document ID returned in the search result, the field values are readily available.</p>
<p align="justify">If docValues are not enabled for field F, the system must take a different approach. It needs to "fetch the document" from the index file, read the value for field F, and cache this field-document pair in memory for further processing. In the worst case, this means retrieving every document in the search result, which can be a substantial amount of data, and caching it in memory, which increases memory usage. Query latency suffers because fetching and processing all of these documents is both time-consuming and resource-intensive. Enabling docValues for the relevant fields is therefore a crucial optimization that improves query performance and reduces memory overhead in such situations.</p>
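
The two access paths described above can be illustrated with a small, self-contained Go sketch. It is a toy model, not Bleve's implementation or on-disk format: `toyIndex`, `newToyIndex`, and `numericRangeCount` are hypothetical names introduced only for this example, and the range is assumed to be inclusive of the lower bound and exclusive of the upper bound. The sketch counts how many hits fall into one numeric range and reports how many full documents had to be fetched and decoded to do so.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toyIndex is a hypothetical stand-in for an index. It only contrasts
// the two access paths; it does not mirror Bleve's storage layout.
type toyIndex struct {
	stored    map[int][]byte             // docID -> full stored document (JSON)
	docValues map[string]map[int]float64 // field -> docID -> value (columnar)
}

// newToyIndex indexes documents 1..numDocs. When withDocValues is true,
// a per-field column for "dummyNumber" is kept next to the stored documents.
func newToyIndex(numDocs int, withDocValues bool) *toyIndex {
	idx := &toyIndex{stored: make(map[int][]byte, numDocs)}
	if withDocValues {
		idx.docValues = map[string]map[int]float64{
			"dummyNumber": make(map[int]float64, numDocs),
		}
	}
	for i := 1; i <= numDocs; i++ {
		doc, err := json.Marshal(map[string]interface{}{
			"dummyTerm":   "Term",
			"dummyNumber": i,
		})
		if err != nil {
			panic(err) // not expected for this fixed document shape
		}
		idx.stored[i] = doc
		if withDocValues {
			idx.docValues["dummyNumber"][i] = float64(i)
		}
	}
	return idx
}

// numericRangeCount counts hits whose field value v satisfies lo <= v < hi.
// It also returns how many full documents were fetched and decoded.
func (idx *toyIndex) numericRangeCount(hits []int, field string, lo, hi float64) (count, docsFetched int) {
	column, hasDocValues := idx.docValues[field]
	for _, id := range hits {
		var v float64
		if hasDocValues {
			// Fast path: read the value straight from the column.
			val, ok := column[id]
			if !ok {
				continue
			}
			v = val
		} else {
			// Slow path: fetch and decode the whole document for one field.
			var doc map[string]interface{}
			if err := json.Unmarshal(idx.stored[id], &doc); err != nil {
				continue
			}
			docsFetched++
			f, ok := doc[field].(float64)
			if !ok {
				continue
			}
			v = f
		}
		if v >= lo && v < hi {
			count++
		}
	}
	return count, docsFetched
}

func main() {
	// Every document matches the term query, so all IDs are hits.
	hits := make([]int, 0, 5000)
	for i := 1; i <= 5000; i++ {
		hits = append(hits, i)
	}
	for _, enabled := range []bool{true, false} {
		idx := newToyIndex(5000, enabled)
		count, fetched := idx.numericRangeCount(hits, "dummyNumber", 1000, 1001)
		fmt.Printf("docValues=%t count=%d documentsFetched=%d\n", enabled, count, fetched)
	}
}
```

Both runs return the same facet count, but only the run without the column has to fetch and decode every hit, which is the cost the section attributes to leaving docValues disabled.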