modules_poc/README.md
This folder contains a POC implementation of module metrics tracking and enforcement. This documentation covers basic information about modules and the commands that run the scanner across the entire first-party codebase and merge the results. All commands are assumed to run at the root of the checkout, inside a correctly activated Python virtual environment.
A module:
TODO
The file modules_poc/modules.yaml contains a list of modules, each containing
a list of files. Each file must be contained in only one module. Note that
module assignment is not required to map neatly to team ownership.
In cases where multiple globs match a file, the current rule is that the longest glob wins. This is used as a simpler-to-implement version of most-specific glob wins, which we may switch to in the future.
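The longest-glob-wins rule can be sketched as follows. This is only an illustration, not the code in mod_mapping.py: the resolve_module helper, the sample module names and globs, and the use of fnmatch for matching are all assumptions.

```python
from fnmatch import fnmatch

# Hypothetical module -> globs mapping; the real list lives in modules_poc/modules.yaml.
MODULES = {
    "core": ["src/mongo/*"],
    "bson": ["src/mongo/bson/*"],
}


def resolve_module(path, modules=MODULES):
    """Return the module owning the longest glob that matches path, or None."""
    best_glob, best_module = "", None
    for module, globs in modules.items():
        for glob in globs:
            # Longest glob wins: a cheap stand-in for "most specific glob wins".
            if fnmatch(path, glob) and len(glob) > len(best_glob):
                best_glob, best_module = glob, module
    return best_module


print(resolve_module("src/mongo/bson/bsonobj.h"))  # bson (both globs match, the longer wins)
print(resolve_module("src/mongo/util/modules.h"))  # core
print(resolve_module("README.md"))                 # None (unassigned)
```

Comparing glob lengths is easy to implement, but length is only a proxy for specificity, which is why a switch to a true most-specific rule remains possible.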
When submitting a review, you are strongly encouraged to include a generated diff of the changes to the modules list. This can be done by running:
modules_poc/mod_mapping.py --dump-modules-list > modules.old
# make your changes
diff -u0 modules.old <(modules_poc/mod_mapping.py --dump-modules-list)
GitHub will nicely format the diff if you put it in a block like this:
```diff
@@ -1234 +1234 @@
-path/to/file -- old_module
+path/to/file -- new_module
```
Run modules_poc/mod_mapping.py --dump-modules to produce a modules_dump.yaml
file in the current directory. This file is a multi-level map from
module name to team name to directory path to list of file names.
For unassigned files it uses __NONE__ as the module name, and for unowned
files it uses __NO_OWNER__ as the team, both of which conveniently sort first.
For owned files it uses the part of the team-name after @10gen/ with -
replaced with _ to be friendlier to querying. In cases where multiple teams
own a file, the file is duplicated to each team's list.
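The nesting and naming rules above can be sketched in a few lines. This is not the actual dump code: the team_key and build_dump helpers and the (path, module, teams) input tuples are hypothetical, chosen only to illustrate the shape of the output.

```python
import posixpath
from collections import defaultdict

TEAM_PREFIX = "@10gen/"


def team_key(team):
    """Turn '@10gen/server-programmability' into 'server_programmability'."""
    name = team[len(TEAM_PREFIX):] if team.startswith(TEAM_PREFIX) else team
    return name.replace("-", "_")


def build_dump(files):
    """Build module -> team -> directory -> [file names].

    files is a hypothetical iterable of (path, module_or_None, [team names]).
    """
    dump = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for path, module, teams in files:
        directory, name = posixpath.split(path)
        module_key = module or "__NONE__"
        # A file owned by several teams is listed once under each of them.
        for team in [team_key(t) for t in teams] or ["__NO_OWNER__"]:
            dump[module_key][team][directory].append(name)
    return dump


dump = build_dump([
    ("src/mongo/bson/bsonobj.cpp", "core", ["@10gen/server-programmability"]),
    ("src/example/unowned.h", None, []),
])
print(dump["core"]["server_programmability"]["src/mongo/bson"])  # ['bsonobj.cpp']
print(dump["__NONE__"]["__NO_OWNER__"]["src/example"])           # ['unowned.h']
```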
This file can be viewed directly in VSCode. The yaml plugin's breadcrumbs and
folding are very helpful. yq (jq for yaml) is also a powerful tool. Here are a few
examples using it, some of which produce enough output to be worth opening in VSCode:
# list of teams
yq '[.[] | keys] | add | sort | unique[]' -r modules_dump.yaml
# unassigned files owned by server-programmability
yq '.__NONE__.server_programmability' modules_dump.yaml
# files owned by server-programmability across all modules (or lack thereof)
yq '.[] |= (.server_programmability | values)' modules_dump.yaml
# assigned files owned by server-programmability outside of the core module
yq '.[] |= (.server_programmability | values) | del(.core) | del(.__NONE__)' modules_dump.yaml
# assigned files owned by server-programmability in modules that don't start with core
yq '.[] |= (.server_programmability | values) | with_entries(select(.key | startswith("core") | not)) | del(.__NONE__)' modules_dump.yaml
# unowned files as a flat list
yq '.[].__NO_OWNER__ | values | to_entries | map("\(.key)/\(.value[])") | .[] ' modules_dump.yaml -r | sort
# unowned files grouped by directory
yq '[.[].__NO_OWNER__ | to_entries? | .[]] | group_by(.key) | map({key: .[0].key, value: ([.[].value] | add | sort)}) | from_entries' modules_dump.yaml
# assigned files owned by non server-programmability teams inside of the core module (grouped by teams, then directory)
yq '.core | with_entries(select(.key != "server_programmability"))' modules_dump.yaml
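If you prefer scripting over yq, the same kind of query is easy to write in Python once the file is loaded into nested dicts (for example with a YAML parser such as PyYAML's yaml.safe_load). The sketch below uses a small hand-written sample instead of the real file; the helper names and all paths in the sample are made up.

```python
# Hand-written sample with the same nesting as modules_dump.yaml:
# module -> team -> directory -> list of file names.
SAMPLE_DUMP = {
    "__NONE__": {
        "__NO_OWNER__": {"src/example/misc": ["b.h", "a.h"]},
        "server_programmability": {"src/example/util": ["c.h"]},
    },
    "core": {
        "__NO_OWNER__": {"src/example/base": ["d.h"]},
    },
}


def unowned_files(dump):
    """Unowned files across all modules as a flat, sorted list of paths."""
    paths = []
    for teams in dump.values():
        for directory, names in teams.get("__NO_OWNER__", {}).items():
            paths.extend(f"{directory}/{name}" for name in names)
    return sorted(paths)


def files_for_team(dump, team):
    """Per-module view of one team's files, skipping modules where it owns nothing."""
    return {module: teams[team] for module, teams in dump.items() if team in teams}


print(unowned_files(SAMPLE_DUMP))
print(files_for_team(SAMPLE_DUMP, "server_programmability"))
```

These two helpers roughly correspond to the "unowned files as a flat list" and "files owned by server-programmability across all modules" queries above.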
To make an API or class available for use by other modules, add a tag to its header declaration.
class MONGO_MOD_PUBLIC Foo {
};
MONGO_MOD_PUBLIC int foo();
Availability specification can also be done at the namespace level.
namespace MONGO_MOD_PRIVATE my_details {
} // namespace MONGO_MOD_PRIVATE my_details
Elements inside a class or namespace default to the visibility of the enclosing scope. Note that the canonical version of "inside" can be subtle: for example, member functions are "inside" the class definition, not the location where the member function is defined. All forward declarations of the same function or class should have the same visibility tags, and forward declarations across module boundaries should be avoided.
If visibility is not specified at any containing scope,
it defaults to MONGO_MOD_PRIVATE (except in cases where the
header doesn't include mongo/util/modules.h, where the default is UNKNOWN
to facilitate incrementally tagging APIs).
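The defaulting rules can be summarized with a small model. This is a simplified sketch, not the scanner's logic: the effective_visibility helper is hypothetical, the strings stand in for the real MONGO_MOD_* tags, and it does not model the subtleties of what counts as "inside" a scope.

```python
def effective_visibility(scope_tags, includes_modules_h):
    """Resolve the visibility of a declaration.

    scope_tags lists the explicit tag of each enclosing scope from outermost
    to innermost, ending with the declaration itself; None means untagged.
    """
    for tag in reversed(scope_tags):
        if tag is not None:
            return tag  # the innermost explicit tag wins
    # No scope is tagged: private, unless the header never included
    # mongo/util/modules.h, in which case the visibility is unknown.
    return "PRIVATE" if includes_modules_h else "UNKNOWN"


# Untagged member of a public class inherits PUBLIC.
print(effective_visibility(["PUBLIC", None], includes_modules_h=True))
# Nothing tagged, header includes modules.h.
print(effective_visibility([None, None], includes_modules_h=True))
# Nothing tagged, header not yet converted.
print(effective_visibility([None], includes_modules_h=False))
```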
Documentation for individual MONGO_MOD_* tags is present in
mongo/util/modules.h.
This will build the merged_decls.json file in the current directory:
buildscripts/poetry_sync.sh # make sure the python env has the right packages installed
python modules_poc/merge_decls.py
merge_decls.py takes an optional flag --[no-]intra-module to indicate whether you want to include
intra-module accesses and declarations that are only used from within their module or submodules. It
defaults to --intra-module, which provides the most information to consumers.
Running merge_decls.py also validates that private APIs aren't being used
outside of where they are permitted. If any are, the script will fail, though
merged_decls.json will still be generated, and the
invalid uses will be printed to stdout.
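Conceptually, the validation is a pass over the merged declarations that flags private ones used from elsewhere. The sketch below is a guess at the idea, not the real check: the record fields (mod, visibility, used_from), the rule that only the owning module and its submodules may use a private declaration, and the dotted parent.child submodule naming are all assumptions.

```python
def is_within(user_module, owner_module):
    """True if user_module is owner_module or (by assumption) a dotted submodule of it."""
    return user_module == owner_module or user_module.startswith(owner_module + ".")


def invalid_uses(decls):
    """Return (display_name, using_module) pairs for private decls used elsewhere."""
    problems = []
    for decl in decls:
        if decl["visibility"] != "private":
            continue
        for user in decl["used_from"]:
            if not is_within(user, decl["mod"]):
                problems.append((decl["display_name"], user))
    return problems


# Hypothetical records; the real merged_decls.json may be shaped differently.
SAMPLE = [
    {"display_name": "detail::helper()", "mod": "core", "visibility": "private",
     "used_from": ["core", "core.unittest", "query"]},
    {"display_name": "Foo", "mod": "core", "visibility": "public",
     "used_from": ["query"]},
]

for name, user in invalid_uses(SAMPLE):
    print(f"private API {name} used from module {user}")
```

A real script would exit with a non-zero status after printing the violations, which matches the behavior described above of failing while still writing the output file.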
private_headers.py can be used to find all files where there are no currently detected external usages
and automatically mark them as private. This does not necessarily mean that all automatically marked headers
are intended to be private. A human should review to ensure that the marked headers match intent. Filters
can be applied to adjust the files that are modified.
For headers matching the filter, the script will also warn of usages of _forTest external to the file
family.
Example usage:
./modules_poc/private_headers.py --team=server_programmability --module=core --glob="src/mongo/executor/*"
--dry-run can be used to view all of the changes without applying them.
You can also scan a single file, which is useful when iterating. You can
either pass it the same flags used to compile, or pass it just a cpp file and it
will figure out the flags from your compile_commands.json. It will create a file
called decls.yaml in the current directory when run this way.
modules_poc/mod_scanner.py src/mongo/bson/bsonobj.cpp
You can start the browser by running modules_poc/browse.py. If merged_decls.json is not present, it will
ask if you want it to run the scanner for you. You can rescan at any time by pressing <kbd>r</kbd>.
It will show the available keybindings on the right, which can be toggled
by pressing <kbd>?</kbd>. If you are running from a VSCode or neovim terminal, you can press
<kbd>g</kbd> to go to any location in your editor. You can also press <kbd>p</kbd> to toggle an
embedded preview of the location for the currently selected line (you probably want to hide the
help when doing this). You can press <kbd>Tab ↹</kbd> to switch between the tree and preview.
The browser is primarily intended to assist in labeling public APIs, so the files are sorted with the largest number of unlabeled declarations ("unknowns") first. Only declarations that are used outside of their module are counted and shown. You can search for a file by pressing <kbd>f</kbd> or press <kbd>m</kbd> to filter the files by module.
As an advanced feature, you can pass a custom file to browse.py and it will
use it rather than the default merged_decls.json. It does still need to have
the same shape as the original. This works best with jq for
advanced filtering. For example, here is a command that will only show
declarations where some TUs will only see a forward declaration from another
module, and will assume that that module is the owner (we need to fix this):
./modules_poc/browse.py <(jq '[.[] | select(.other_mods)]' merged_decls.json)
In general, your jq query should be of the form [.[] | select( SOME QUERY )]
to avoid breaking the format expectations. For more advanced analysis, using
jq directly is a good idea.
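The same shape-preserving filter can be written in Python if a query outgrows jq. This sketch assumes the top level of merged_decls.json is a list of objects, as the jq form above suggests; the helper names and the contents of the sample records are made up.

```python
import json


def jq_truthy(value):
    """Mimic jq truthiness: everything except null and false is true."""
    return value is not None and value is not False


def select_decls(decls, predicate):
    """Equivalent of jq's [.[] | select(...)]: filter, but keep the list shape."""
    return [decl for decl in decls if predicate(decl)]


# Hypothetical records; only the other_mods field name comes from the example above.
SAMPLE = [
    {"display_name": "foo()", "other_mods": {"example_mod": ["src/example/fwd.h"]}},
    {"display_name": "Bar"},
]

filtered = select_decls(SAMPLE, lambda decl: jq_truthy(decl.get("other_mods")))
print(json.dumps(filtered, indent=2))
```

Writing the filtered list to a file with json.dump gives you something you can pass to browse.py in place of the default file. Note that jq treats empty objects, empty arrays, and 0 as true, which is why the sketch does not rely on Python's own truthiness.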
Run the following command to upload:
python modules_poc/upload.py $MONGO_URI # fill this in
If the upload fails with a connection error and you need to update the IP
whitelist for your virtual workstation, curl -4 wtfismyip.com/text is a good
way to see your public IP address.
Once we no longer have errant forward declarations in the wrong module, we can
make the processing a lot faster by having the scanner only write out things
that are used across modules (or in the __NONE__ module if we still want to
track that).
We should explore whether using the indexing API (e.g. clang_indexSourceFile) will
yield better results. In particular, there is a flag to opt-in to visiting all
implicit instantiations which I think is currently a blind spot. Unfortunately
it isn't exposed in the python API yet, so we would need to add it there
first.
Other interesting options would be a clang-tidy plugin or a clang plugin. We
already have a lot of infrastructure to support clang-tidy plugins, but they
will ignore any lines with // NOLINT comments. A clang plugin is
particularly interesting if it will be able to run inside of clangd so we can
show warnings when accessing an unfortunately-public API across modules (we
may want to mark it as deprecated in that case) and errors when accessing
private.
Parallelize the merge script. Right now it is single threaded. We can use
multiprocessing or similar to parallelize it. It should use a queue of input
files and have workers merge them into a local all_decls map and then the
main thread should merge the results from each worker. If we just pass the
path to the workers, this will also parallelize reading the files and parsing
the json.
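One possible shape for this, as a rough sketch rather than a design: workers receive only file paths, read and parse them, and merge into a local map, and the main process then merges the partial maps. For simplicity it splits the paths into static chunks instead of using a shared queue. The record fields (id, used_from) and the merge rule (union of used_from) are hypothetical stand-ins for whatever merge_decls.py really does.

```python
import json
from concurrent.futures import ProcessPoolExecutor


def merge_records(target, records):
    """Merge records into target, keyed by the hypothetical 'id' field."""
    for rec in records:
        merged = target.setdefault(rec["id"], {"id": rec["id"], "used_from": []})
        merged["used_from"] = sorted(set(merged["used_from"]) | set(rec["used_from"]))
    return target


def merge_chunk(paths):
    """Worker: read, parse, and merge its share of the input files."""
    local = {}
    for path in paths:
        with open(path) as f:
            merge_records(local, json.load(f))
    return local


def parallel_merge(paths, workers=4):
    """Main process: fan out paths, then merge each worker's local map."""
    chunks = [paths[i::workers] for i in range(workers)]
    all_decls = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for local in pool.map(merge_chunk, chunks):
            merge_records(all_decls, local.values())
    return all_decls
```

Because only paths cross the process boundary on the way in, file reading and JSON parsing happen in the workers. The partial maps still have to be pickled on the way back, so the gain depends on how much the local merge shrinks the data.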
We should try to report loc as the header declaring the entity, but right
now it will report the cpp file where it is defined. This is currently
important since we use the definition to decide the canonical location and
module when merging. This may cause issues for free functions if the namespace
is marked public in the header. The latter issue can be worked around when
merging by using the visibility from other files if the current visibility is
unknown. But we should pick the right location for loc regardless.
Try to collapse template instantiations that clang's specialized_template
helper fails to. I don't know why it fails to, but it seems to on many of
our templates. Maybe we should just merge all declarations with the same loc?
If we do that, we should try to prefer decls where kind is a template.
Split "unfortunately public" into 3 categories:
- NEEDS_REPLACEMENT: The current API isn't ideal for a public API, but consumers need its functionality. It is on the module owner to provide a better API.
- USE_INSTEAD("replacement"): A replacement for this API has been provided that external consumers should switch to. It is on the module consumer to update their code.
- CURRENTLY_USED: This is a marker we can put on code as we improve the scanner if it finds new usages of private APIs that were hidden in older versions. The module owner should examine these and decide if they should be public or marked for replacement.

Browser enhancements:
- [...] spelling in the output from the scanner, and use that to highlight that part of the decl in the list
- [...] pretty_location() but with spelling rather than display_name
- [...] select() and another that goes into a sort_by().
For now you can do browse.py <(jq '[.[] | select( ... )]' merged_decls.json), but we
should come up with something better.