src/stirling/protocol_contribution_guide_part2.md
In "How to contribute a protocol parser in Pixie", we walked through the overall parsing pipeline and the five core functions that need to be implemented. We are now ready to integrate the parser with the rest of Stirling.
Before we can parse anything, we first need to recognize what type of traffic it is. Currently, when a connection opens, Stirling runs a rule-based model to predict which protocol the connection uses. The model takes in a buffer of bytes (the TCP/UDP payload) and returns the protocol and the message type (request or response). Each protocol has its own inference rules, and they are all implemented in src/stirling/source_connectors/socket_tracer/bcc_bpf/protocol_inference.h. For example, the HTTP inference rules look like the following:
```c
static __inline enum message_type_t infer_http_message(const char* buf, size_t count) {
  // Smallest HTTP response is 17 characters:
  // HTTP/1.1 200 OK\r\n
  // Smallest HTTP request is 16 characters:
  // GET x HTTP/1.1\r\n
  if (count < 16) {
    return kUnknown;
  }

  if (buf[0] == 'H' && buf[1] == 'T' && buf[2] == 'T' && buf[3] == 'P') {
    return kResponse;
  }
  if (buf[0] == 'G' && buf[1] == 'E' && buf[2] == 'T') {
    return kRequest;
  }
  if (buf[0] == 'H' && buf[1] == 'E' && buf[2] == 'A' && buf[3] == 'D') {
    return kRequest;
  }
  if (buf[0] == 'P' && buf[1] == 'O' && buf[2] == 'S' && buf[3] == 'T') {
    return kRequest;
  }
  if (buf[0] == 'P' && buf[1] == 'U' && buf[2] == 'T') {
    return kRequest;
  }
  if (buf[0] == 'D' && buf[1] == 'E' && buf[2] == 'L' && buf[3] == 'E' && buf[4] == 'T' &&
      buf[5] == 'E') {
    return kRequest;
  }

  return kUnknown;
}
```
The rules are run sequentially, and the first protocol rule to return kRequest or kResponse becomes the classification result.
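To make the "first match wins" behavior concrete, here is a minimal, self-contained sketch of a sequential dispatcher. It is a simplified model, not Stirling's code. The enum values, protocol_message_t, this infer_protocol() signature, and the made-up "Foo" protocol (magic bytes "FOO1" followed by a type byte) are assumptions for illustration. The HTTP rule is also trimmed down to a few prefixes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Simplified stand-ins for the inference types (assumed, not Stirling's definitions).
enum message_type_t { kUnknown, kRequest, kResponse };
enum traffic_protocol_t { kProtocolUnknown, kProtocolHTTP, kProtocolFoo };

struct protocol_message_t {
  traffic_protocol_t protocol;
  message_type_t type;
};

// Trimmed-down HTTP rule: same minimum length as the real rule, fewer prefixes.
static message_type_t infer_http_message(const char* buf, size_t count) {
  if (count < 16) return kUnknown;
  if (std::memcmp(buf, "HTTP", 4) == 0) return kResponse;
  if (std::memcmp(buf, "GET ", 4) == 0 || std::memcmp(buf, "POST", 4) == 0) return kRequest;
  return kUnknown;
}

// Hypothetical "Foo" protocol: magic "FOO1", then 0x01 (request) or 0x02 (response).
static message_type_t infer_foo_message(const char* buf, size_t count) {
  if (count < 5 || std::memcmp(buf, "FOO1", 4) != 0) return kUnknown;
  if (buf[4] == 0x01) return kRequest;
  if (buf[4] == 0x02) return kResponse;
  return kUnknown;
}

// Runs the rules in a fixed order. The first rule that does not return kUnknown wins.
static protocol_message_t infer_protocol(const char* buf, size_t count) {
  message_type_t type = infer_http_message(buf, count);
  if (type != kUnknown) return {kProtocolHTTP, type};
  type = infer_foo_message(buf, count);
  if (type != kUnknown) return {kProtocolFoo, type};
  return {kProtocolUnknown, kUnknown};
}
```

Because infer_protocol() returns on the first non-kUnknown result, a loose rule placed early in the chain can claim traffic that a later, correct rule would have recognized.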
Caveats:
For these reasons, we generally do the following:
- Keep the rules tight. Checking for 0x00 or 0x01 alone is not tight enough to confidently classify a protocol, because those bytes occur so often in other traffic.
- Prefer returning kUnknown when in doubt. Missing some traffic at the beginning of a connection is much better than misclassifying the connection.

The Pixie team can help test a new protocol inference rule against a network traffic dataset. Please file a feature request issue on GitHub or send a message in the Pixie Slack channel.
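The "not tight enough" point can be illustrated with a made-up binary protocol. The sketch below assumes a hypothetical 8-byte header: a type byte, a version byte, two reserved zero bytes, and a big-endian 32-bit length. The field layout, infer_loose(), infer_tight(), and the 1 MiB length bound are all assumptions. The loose rule looks only at the first byte. The tight rule commits only when every header field is consistent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum message_type_t { kUnknown, kRequest, kResponse };

// Hypothetical 8-byte header for a made-up binary protocol:
//   [0]    message type: 0x00 = request, 0x01 = response
//   [1]    version, must be 0x01
//   [2..3] reserved, must be zero
//   [4..7] payload length, big-endian
constexpr size_t kHeaderSize = 8;
constexpr uint32_t kMaxPayloadLen = 1u << 20;  // Assumed sanity bound (1 MiB).

// Too loose: 0x00 and 0x01 show up constantly in unrelated traffic.
static message_type_t infer_loose(const char* buf, size_t count) {
  if (count < 1) return kUnknown;
  if (buf[0] == 0x00) return kRequest;
  if (buf[0] == 0x01) return kResponse;
  return kUnknown;
}

// Tighter: commit only when every header field is consistent.
static message_type_t infer_tight(const char* buf, size_t count) {
  if (count < kHeaderSize) return kUnknown;
  const auto* b = reinterpret_cast<const uint8_t*>(buf);
  if (b[1] != 0x01) return kUnknown;                  // Unexpected version.
  if (b[2] != 0x00 || b[3] != 0x00) return kUnknown;  // Reserved bytes set.
  const uint32_t len = (uint32_t{b[4]} << 24) | (uint32_t{b[5]} << 16) |
                       (uint32_t{b[6]} << 8) | uint32_t{b[7]};
  if (len > kMaxPayloadLen) return kUnknown;          // Implausible length.
  if (b[0] == 0x00) return kRequest;
  if (b[0] == 0x01) return kResponse;
  return kUnknown;
}
```

Each extra check lowers the odds that unrelated bytes pass by accident. The cost is occasionally returning kUnknown for a legitimate but unusual message, which is the trade-off the guidance on missing traffic versus misclassifying favors.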
Next, we should decide which columns the table should have. Add a table spec for the new protocol in src/stirling/source_connectors/socket_tracer; see http_table.h or mysql_table.h for examples. Accordingly, in src/stirling/source_connectors/socket_tracer/socket_trace_connector.cc, overload the AppendMessage() function for the new protocol. This function appends a single record to the data table.
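As a rough illustration of the one-record-per-call contract, the sketch below models a table as a vector of rows. Everything in it is hypothetical: FooRequest, FooResponse, FooRecord, and this AppendMessage() signature are made up. The real overloads write into Stirling's data table using the columns declared in the protocol's table spec header, not into a std::vector. The existing tables also carry connection-level columns that this sketch omits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical parsed messages for a made-up "Foo" protocol.
struct FooRequest {
  uint64_t timestamp_ns;
  std::string command;
};
struct FooResponse {
  uint64_t timestamp_ns;
  int status;
  std::string body;
};

// Hypothetical row layout: one row per stitched request/response pair.
struct FooRecord {
  uint64_t time_ns;  // When the request was seen.
  std::string req_cmd;
  int resp_status;
  std::string resp_body;
  int64_t latency_ns;  // Response time minus request time.
};

using FooTable = std::vector<FooRecord>;

// Appends exactly one record for one request/response pair.
void AppendMessage(const FooRequest& req, const FooResponse& resp, FooTable* table) {
  const int64_t latency_ns =
      static_cast<int64_t>(resp.timestamp_ns) - static_cast<int64_t>(req.timestamp_ns);
  table->push_back(FooRecord{req.timestamp_ns, req.command, resp.status, resp.body, latency_ns});
}
```

Derived columns such as latency are computed here, at append time, because this is the first point where the request and response are both available.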
Search for PROTOCOL_LIST under src/stirling/source_connectors/socket_tracer for the complete list of places that need to be updated for the new protocol.
You'll need to read the comments and existing code associated with PROTOCOL_LIST for the exact changes needed.
For example, in conn_tracker.cc, you'll need to add the roles to trace for the new protocol to CreateTraceRoles().
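Conceptually, CreateTraceRoles() records which side or sides of a connection should be traced for each protocol. The sketch below models that as a bitmask map. The enum values, the made-up kProtocolFoo and kProtocolBar, and ShouldTrace() are hypothetical. Read conn_tracker.cc for the actual types and how the roles are registered.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins; the real enums live in Stirling's socket tracer headers.
enum traffic_protocol_t { kProtocolFoo, kProtocolBar, kProtocolUnregistered };
enum EndpointRole : uint8_t { kRoleNone = 0, kRoleClient = 1 << 0, kRoleServer = 1 << 1 };

// Hypothetical registry: which endpoint roles to trace for each protocol.
std::map<traffic_protocol_t, uint8_t> CreateTraceRoles() {
  std::map<traffic_protocol_t, uint8_t> roles;
  roles[kProtocolFoo] = kRoleClient | kRoleServer;  // Trace both sides.
  roles[kProtocolBar] = kRoleServer;                // Trace only the server side.
  // kProtocolUnregistered is deliberately left out.
  return roles;
}

// In this sketch, a protocol missing from the registry is never traced.
bool ShouldTrace(traffic_protocol_t protocol, EndpointRole role) {
  static const std::map<traffic_protocol_t, uint8_t> roles = CreateTraceRoles();
  auto it = roles.find(protocol);
  return it != roles.end() && (it->second & role) != 0;
}
```

In a setup like this, forgetting to register a new protocol produces no records at all rather than an error. That failure mode is worth keeping in mind when working through the PROTOCOL_LIST locations.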
With everything hooked up, we should test the implementation with stirling_wrapper, which runs only Stirling, without the other Pixie components. See src/stirling/binaries/stirling_wrapper.cc for details.
```shell
bazel build //src/stirling/binaries:stirling_wrapper
sudo bazel-bin/src/stirling/binaries/stirling_wrapper --print_record_batches=<protocol> --timeout_secs=-1
```

Next, we should test tracing the protocol end-to-end by tracing an actual application. We need to find or write a simple client-server application that uses the protocol. Take a look at mysql_trace_bpf_test.cc and the other trace BPF tests as examples. Similarly, we can create a custom ContainerRunner in socket_tracer/testing/container_images.h. The test spins up the client and server, generates some traffic, and checks that the records in the table are as expected.
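One detail worth planning for in the trace BPF test is that the tracer can also capture traffic from processes other than the ones the test launched. A check can therefore narrow the records to the server under test before comparing them. The sketch below shows that shape in plain C++. TracedRecord, its pid column, FilterByPID(), and ContainsInOrder() are hypothetical helpers, not the utilities used by mysql_trace_bpf_test.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical traced record, reduced to the fields this check needs.
struct TracedRecord {
  uint32_t pid;  // Assumed column identifying the traced process.
  std::string req;
  std::string resp;
};

using ExpectedPair = std::pair<std::string, std::string>;  // (request, response)

// Keeps only the records produced by the process under test.
std::vector<TracedRecord> FilterByPID(const std::vector<TracedRecord>& records, uint32_t pid) {
  std::vector<TracedRecord> out;
  for (const auto& r : records) {
    if (r.pid == pid) out.push_back(r);
  }
  return out;
}

// True if the expected pairs appear in order among the records.
// Unrelated records in between are allowed (greedy subsequence match).
bool ContainsInOrder(const std::vector<TracedRecord>& records,
                     const std::vector<ExpectedPair>& expected) {
  size_t next = 0;
  for (const auto& r : records) {
    if (next < expected.size() && r.req == expected[next].first &&
        r.resp == expected[next].second) {
      ++next;
    }
  }
  return next == expected.size();
}
```

Matching a subsequence instead of requiring exact equality keeps the test from breaking when the client library sends extra handshake or keep-alive messages that the test did not explicitly generate.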
The trace BPF test almost never passes on the first try. To debug it, we first need to figure out whether the error happened in eBPF or in user space.
Some suggestions:
1. In src/stirling/source_connectors/socket_tracer/conn_tracker.cc, add SetDebugTrace(2); to the top of ConnTracker::CheckTracker(). This turns on debug tracing, so we can see all the traffic received by ConnTracker in user space.
2. With ConnTrace turned on in step 1, check the log of the trace BPF test and see whether ConnTrace logs everything you would expect. If all the expected traffic shows up in ConnTrace, eBPF is tracing correctly and the data has reached user space safely. Otherwise, I would suspect a misclassification of traffic in eBPF and double-check the protocol inference rules. Issues in eBPF can be tricky to debug, so don't hesitate to ask the Pixie team for help.
3. If the traffic shows up as expected in ConnTrace, there is probably something wrong with the protocol parser:
   - Add a log statement at the top of ParseFrame to check that the buffer passed into it is what you expect. A common bug is that the buffer isn't correctly aligned with the start of a frame. Make sure ParseFrame strips exactly the length of the parsed frame off the buffer before it returns.
   - Add logs in StitchFrames to check that all the tags are correctly matched.
   - Check whether the full body parsing module produced any errors. Consider incorporating the failing cases into unit tests.
4. Cross-checking the traced records with Wireshark can also be very helpful.
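To make the alignment advice concrete, here is a minimal ParseFrame sketch for a made-up length-prefixed protocol: a 4-byte big-endian length, then the payload. The ParseState and Frame types, this two-argument signature, and the 1 MiB limit are simplified assumptions rather than Stirling's interfaces. Note the debug logs at the top, and that on success the function removes both the header and the payload from the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <string_view>

// Simplified stand-ins (assumed, not Stirling's definitions).
enum class ParseState { kSuccess, kNeedsMoreData, kInvalid };
struct Frame {
  std::string payload;
};

// Hypothetical wire format: 4-byte big-endian payload length, then the payload.
constexpr size_t kLenFieldSize = 4;
constexpr uint32_t kMaxFrameLen = 1u << 20;  // Assumed sanity bound (1 MiB).

ParseState ParseFrame(std::string_view* buf, Frame* frame) {
  // Debug logs: if the buffer is misaligned, the decoded length is usually garbage.
  std::clog << "ParseFrame: " << buf->size() << " bytes buffered\n";
  if (buf->size() < kLenFieldSize) return ParseState::kNeedsMoreData;
  const auto* b = reinterpret_cast<const uint8_t*>(buf->data());
  const uint32_t len = (uint32_t{b[0]} << 24) | (uint32_t{b[1]} << 16) |
                       (uint32_t{b[2]} << 8) | uint32_t{b[3]};
  std::clog << "ParseFrame: header says " << len << " payload bytes\n";
  if (len > kMaxFrameLen) return ParseState::kInvalid;  // Likely not a real header.
  // Do not consume anything until the whole frame has arrived.
  if (buf->size() < kLenFieldSize + len) return ParseState::kNeedsMoreData;
  frame->payload.assign(buf->data() + kLenFieldSize, len);
  // Strip the header AND the payload so the next call starts at a frame boundary.
  buf->remove_prefix(kLenFieldSize + len);
  return ParseState::kSuccess;
}
```

Stripping only the payload length and forgetting the 4-byte header would leave every later call misaligned by four bytes. That is exactly the kind of bug the log at the top helps expose.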
### State
In some protocols, such as MySQL and Kafka, a State struct is useful for retaining information across time. The state can serve as an additional signal in ParseFrame or FindFrameBoundary to make the implementation more robust. For example, in Kafka, it stores the set of all correlation_ids (tags) seen on the request buffer, and FindFrameBoundary only returns positions on the response buffer where the correlation_id has been seen before. The state is stored in ConnTracker and remains available as long as the connection is open.
We should consider adding a state if the current frame alone doesn't contain enough information to tell whether it is valid or where it ends, and there is additional information across frames that we can use.
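The Kafka example can be sketched as follows. The frame layout is loosely modeled on Kafka, where a response starts with a 4-byte size followed by a 4-byte correlation_id. The State struct, ReadBE32(), and this FindFrameBoundary() signature are simplified assumptions, not Stirling's code. The function scans forward and only accepts a position whose correlation_id was already recorded on the request side.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <unordered_set>

// Hypothetical per-connection state. In Stirling, ConnTracker keeps the state
// available for as long as the connection is open.
struct State {
  std::unordered_set<int32_t> seen_correlation_ids;  // Filled while parsing requests.
};

// Reads a big-endian 32-bit integer at buf[pos..pos+3]; the caller checks bounds.
static int32_t ReadBE32(std::string_view buf, size_t pos) {
  const auto* b = reinterpret_cast<const uint8_t*>(buf.data() + pos);
  const uint32_t v = (uint32_t{b[0]} << 24) | (uint32_t{b[1]} << 16) |
                     (uint32_t{b[2]} << 8) | uint32_t{b[3]};
  return static_cast<int32_t>(v);
}

// Assumed response layout (Kafka-like): [0..3] size, [4..7] correlation_id.
// Returns the first position >= start_pos whose correlation_id was seen on the
// request side, or std::string_view::npos if there is none.
size_t FindFrameBoundary(std::string_view buf, size_t start_pos, const State& state) {
  constexpr size_t kMinHeaderSize = 8;
  for (size_t pos = start_pos; pos + kMinHeaderSize <= buf.size(); ++pos) {
    const int32_t correlation_id = ReadBE32(buf, pos + 4);
    if (state.seen_correlation_ids.count(correlation_id) > 0) return pos;
  }
  return std::string_view::npos;
}
```

Without the state, any 8 bytes could be taken as a response header. With it, a false boundary requires the bytes at offset 4 to also match a correlation_id that is actually outstanding on this connection.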
Congratulations on successfully adding a new protocol to Pixie! The new protocol parser will benefit many others thanks to your contribution. Make sure to test Pixie on your own applications; the Pixie team would greatly appreciate any feedback or bug reports.