docs/subsystems/logging.md
Having a good system for logging error reporting is essential to making a large project like Zulip successful. Without reliable error reporting, one has to rely solely on bug reports from users in order to produce a working product.
Our goal as a project is to have zero known 500 errors on the backend and zero known JavaScript exceptions on the frontend. While there will always be new bugs being introduced, that goal is impossible without an efficient and effective error reporting framework.
We provide integration with Sentry to make it easier for very large installations like zulip.com to manage their exceptions and ensure they are all tracked down, but our default email-based system is great for small installations.
The Django framework provides much of the infrastructure needed by our error reporting system:
django.utils.log.AdminEmailHandler._RateLimitFilter in
zerver/lib/logging_util.py)zerver/filters.py.JsonableError, which is our standard
system for API code to return a JSON-format HTTP error response.Since 500 errors in any Zulip server are usually a problem the server administrator should investigate and/or report upstream, we have this email reporting system configured to report errors by default.
Zulip's optional backend Sentry integration will aggregate errors to show which users and realms are affected, any logging which happened prior to the exception, local variables in each frame of the exception, and the full request headers which triggered it.
You can enable it by:
/etc/zulip/settings.py
as SENTRY_DSN:
## Controls the DSN used to report errors to Sentry.io
SENTRY_DSN = "https://[email protected]/1234"
zulip user, restart Zulip by running:
/home/zulip/deployments/current/scripts/restart-server
You may also want to enable Zulip's Sentry deploy hook.
Django's logging system uses the standard
Python logging infrastructure. We have configured
them so that logging.exception and logging.error get emailed to
the server maintainer, while logging.warning will just appear in
/var/log/zulip/errors.log. Lower log levels just appear in the main
server log (as well as in the log for corresponding process, be it
django.log for the main Django processes or the appropriate
events_* log file for a queue worker).
The main Zulip server log contains a line for each backend request.
It also contains warnings, errors, and the full tracebacks for any
Python exceptions. In production, it goes to
/var/log/zulip/server.log; in development, it goes to the terminal
where you run run-dev.
In development, it's good to keep an eye on the run-dev console
as you work on backend changes, since it's a great way to notice bugs
you just introduced.
In production, one usually wants to look at errors.log for errors
since the main server log can be very verbose, but the main server log
can be extremely valuable for investigating performance problems.
2016-05-20 14:50:22.056 INFO [zr] 127.0.0.1 GET 302 528ms (db: 1ms/1q) (+start: 123ms) / (unauth@zulip via ?)
[20/May/2016 14:50:22]"GET / HTTP/1.0" 302 0
2016-05-20 14:50:22.272 INFO [zr] 127.0.0.1 GET 200 124ms (db: 3ms/2q) /login/ (unauth@zulip via ?)
2016-05-20 14:50:26.333 INFO [zr] 127.0.0.1 POST 302 37ms (db: 6ms/7q) /accounts/login/local/ (unauth@zulip via ?)
[20/May/2016 14:50:26]"POST /accounts/login/local/ HTTP/1.0" 302 0
2016-05-20 14:50:26.538 INFO [zr] 127.0.0.1 POST 200 12ms (db: 1ms/2q) (+start: 53ms) /api/v1/events/internal [1463769771:0/0] (8@zulip via internal)
2016-05-20 14:50:26.657 INFO [zr] 127.0.0.1 POST 200 10ms (+start: 8ms) /api/v1/events/internal [1463769771:0/0] (8@zulip via internal)
2016-05-20 14:50:26.959 INFO [zr] 127.0.0.1 GET 200 588ms (db: 26ms/21q) / [1463769771:0] (8@zulip via website)
The format of this output is:
The performance data details are particularly useful for investigating performance problems, since one can see at a glance whether a slow request was caused by delays in the database, in the Markdown processor, in memcached, or in other Python code.
One useful thing to note, however, is that the database time is only the time spent connecting to and receiving a response from the database. Especially when response are large, there can often be a great deal of Python processing overhead to marshall the data from the database into Django objects that is not accounted for in these numbers.
Zulip comes with a tool, ./scripts/log-search, to quickly search
through the main server.log log file based on a number of different
axes -- including client IP address, client user-id, request path,
response code. It can also search the NGINX logs, which provide
similar information.
Because often the requests to static resources, or to things like user avatars, are not as important, they are also filtered out of the output by default.
The output shows timestamp, request duration, client IP address, response code, and request method, hostname, and path; any property which is limited by the tool is not displayed, for conciseness:
zulip@example-prod:~/deployments/current$ ./scripts/log-search realm-name
22:30:36.593 1ms 2606:2800:220:1:248:1893:25c8:1946 302 GET /
22:30:42.508 366ms 2606:2800:220:1:248:1893:25c8:1946 200 GET /login/
23:18:30.977 1ms 93.184.216.34 302 GET /
23:18:31.286 132ms 93.184.216.34 200 GET /login/
23:18:48.094 20ms 93.184.216.34 200 GET /accounts/password/reset/
23:18:51.520 149ms 93.184.216.34 200 GET /login/
23:18:59.420 20ms 93.184.216.34 200 GET /accounts/password/reset/
23:19:02.929 1300ms 93.184.216.34 302 POST /accounts/password/reset/
23:19:03.056 93ms 93.184.216.34 200 GET /accounts/password/reset/done/
23:19:08.911 26ms 93.184.216.34 302 GET /accounts/password/reset/OA/b56jfp-bd80ee99b98e703456b3bdcd91892be2/
23:19:09.189 116ms 93.184.216.34 200 GET /accounts/password/reset/OA/set-password/
23:19:18.743 215ms 93.184.216.34 302 POST /accounts/password/reset/OA/set-password/
23:19:18.779 12ms 93.184.216.34 200 GET /accounts/password/done/
23:19:20.796 12ms 93.184.216.34 200 GET /accounts/login/
23:19:29.323 295ms 8@ 93.184.216.34 302 POST /accounts/login/
23:19:29.704 362ms 8@ 93.184.216.34 200 GET /
23:20:04.980 110ms 8@ 93.184.216.34 200 DELETE /json/users/me/subscriptions
zulip@example-prod:~/deployments/current$ ./scripts/log-search 2606:2800:220:1:248:1893:25c8:1946
22:30:36.593 1ms 302 GET https://realm-one.example-prod.example.com/
22:30:42.508 366ms 200 GET https://realm-one.example-prod.example.com/login/
See ./scripts/log-search --help for complete documentation.
We have a custom library, called blueslip (named after the form used at MIT to
report problems with the facilities), that logs and reports JavaScript errors in
the browser. In production, this uses Sentry (if configured) to report and
aggregate errors. In development, this means displaying a highly visible overlay
over the message view area, to make exceptions in testing a new feature hard to
miss.
web/src/blueslip.ts.error event on window.blueslip.error calls are sent to
Sentry, if configured.blueslip = require("./src/blueslip"); blueslip.get_log().Blueslip supports several error levels:
throw new Error(…): For fatal errors that cannot be easily recovered
from. We try to avoid using it, since it kills the current JS thread, rather
than returning execution to the caller.blueslip.error: For logging of events that are definitely caused by a bug
and thus sufficiently important to be reported, but where we can handle the
error without creating major user-facing problems (e.g., an exception when
handling a presence update).blueslip.warn: For logging of events that are a problem but not important
enough to log an error to Sentry in production. They are, however, highlighted
in the JS console in development, and appear as breadcrumb logs in Sentry in
production.blueslip.log (and blueslip.info): Logged to the JS console in development
and also in the Sentry breadcrumb log in production. Useful for data that
might help discern what state the browser was in during an error (e.g., whether
the user was in a narrow).blueslip.debug: Similar to blueslip.log, but are not printed to
the JS console in development.Zulip's optional JavaScript Sentry integration will aggregate errors to show which users and realms are affected, any logging which happened prior to the exception, and any DOM interactions which happened prior to the error.
You can enable it by:
/etc/zulip/settings.py
as SENTRY_FRONTEND_DSN:
## Controls the DSN used to report JavaScript errors to Sentry.io
SENTRY_FRONTEND_DSN = "https://[email protected]/1234"
SENTRY_FRONTEND_SAMPLE_RATE down from 1.0.zulip user, restart Zulip by running:
/home/zulip/deployments/current/scripts/restart-server
You may also want to enable Zulip's Sentry deploy hook.
In order to make it easier to debug potential performance problems in the critically latency-sensitive message sending code pathway, we log and report to the server the following whenever a message is sent:
send_message response returned from the server.get_events protocol (these last two race with each other).The code is all in zerver/lib/report.py and web/src/sent_messages.ts.
We have similar reporting for the time it takes to narrow / switch to a new view: