doc/developer/diagnostic-questions.md
This is a running checklist of things to think about when a non-engineer (a "user" hereafter in the document) reports an issue.
I find it useful to think about user issues holistically because:
Parts of the checklist may also help remind engineers of the various ways Materialize can end up behaving unexpectedly.
Are the user that is running Materialize, the user that set up Materialize, and the user reporting the issue all the same person?
Is the version of Materialize:
`select version()`.
If it is an old version, check if the issue reported existed back then.
If the issue existed but has since been fixed, tell the user to update to a newer version.
mzbuild-... because that's somebody's random PR.
Are the command line arguments reasonable?
In the case of a slow select query:
materialized.log for the creation/destruction of indexes, or you can ask the user to run `SHOW INDEXES ON <source/view name>` or `SHOW FULL VIEWS|SOURCES`.
What is the user doing?
What is the user's goal?
What are the column types?
Is the plan sane? (Ask for raw/decorrelated/optimized plans)
Is the rendering sane? (Ask for the memory graph)
TopK needs to reorder the entire set with every new timestamp.
Is one worker (or a few workers) behind the others?
TopK or a Reduce operator on blank keys or vacuous columns?
TopK or Reduce that is really busy on those workers?
Is the failover behavior sane?
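For the question of whether one worker (or a few workers) is behind the others, a quick numeric check can help before digging into individual operators. The following is a minimal sketch, not a Materialize tool: it assumes you have already collected a per-worker busy time for the operator in question by whatever means is available, and both the function name `lagging_workers` and the `ratio` threshold are hypothetical choices for illustration.

```python
from statistics import median


def lagging_workers(busy_seconds, ratio=2.0):
    """Return the ids of workers whose busy time exceeds `ratio` times the median.

    `busy_seconds` maps a worker id to the time that worker spent in the
    operator under suspicion. How those numbers are collected is an
    assumption left to the caller.
    """
    if not busy_seconds:
        return []
    typical = median(busy_seconds.values())
    if typical <= 0:
        # Most workers did nothing at all, so any busy worker stands out.
        return sorted(w for w, s in busy_seconds.items() if s > 0)
    return sorted(w for w, s in busy_seconds.items() if s > ratio * typical)


# Illustrative numbers only: worker 3 is far busier than its peers.
sample = {0: 1.1, 1: 0.9, 2: 1.0, 3: 7.5}
print(lagging_workers(sample))
```

If a small set of workers is flagged, that is a hint to look for skew, for example a TopK or Reduce whose keys all hash to the same worker.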
In the case of OOM or Materialize falling over:
In the case of a performance problem that Prometheus shows gradually getting worse over a substantial period of time:
Does Materialize behave like we think it does?
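For the case of a problem that gets gradually worse over a long period, it can help to put a number on "gradually". The sketch below fits a least-squares line to a series of samples and reports the slope per hour. It assumes the samples were exported from Prometheus as `(unix_seconds, value)` pairs; that input format and the function name `trend_per_hour` are assumptions for illustration, not part of any Materialize or Prometheus API.

```python
def trend_per_hour(samples):
    """Return the least-squares slope of value over time, in units per hour.

    `samples` is a list of (unix_seconds, value) pairs, for example memory
    usage scraped at regular intervals (a hypothetical input format).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    # The slope is per second; scale it to per hour for readability.
    return cov / var_t * 3600.0


# Illustrative numbers only: a value that grows by about 10 units every hour.
samples = [(0, 100.0), (3600, 110.0), (7200, 120.0), (10800, 130.0)]
print(trend_per_hour(samples))
```

A slope that stays clearly positive across several days of samples suggests steady growth rather than a one-off spike, which changes which questions in this checklist are worth asking first.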