services/web/scripts/delete-orphaned-docs/README.md
Because of the large number of documents and projects, orphaned docs have to be detected from bulk exports of the raw data.
Follow the directions in google-ops/README.md for exporting data from mongo
and copying the files to your local machine.
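Run the following doc export command to export all doc ids and their associated project ids in batches of 10,000,000. Repeat it for each batch, increasing --skip by 10,000,000 and writing to a new output file each time: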
mongoexport --uri "$READ_ONLY_MONGO_CONNECTION_STRING" --collection docs --fields '_id,project_id' --skip 0 --limit 10000000 --type=csv --out docs.00000000.csv
This will produce files like:
_id,project_id
ObjectId(5babb6f864c952737a9a4c32),ObjectId(5b98bba5e2f38b7c88f6a625)
ObjectId(4eecaffcbffa66588e000007),ObjectId(4eecaffcbffa66588e00000d)
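Covering the whole collection means repeating the export with a different --skip for each batch. The loop below is a minimal sketch of that, not part of the original tooling. It assumes the output file name encodes the zero-padded skip value, as docs.00000000.csv suggests, and BATCHES is a hypothetical count that you would adjust to the size of the collection. It only prints the commands so they can be reviewed before anything is run.

```shell
# Sketch: print one mongoexport command per batch of 10,000,000 docs.
# BATCH_SIZE matches the batch size used above; BATCHES is a hypothetical
# value, raise it until the last export comes back empty.
BATCH_SIZE=10000000
BATCHES=3

export_commands() {
  i=0
  while [ "$i" -lt "$BATCHES" ]; do
    skip=$((i * BATCH_SIZE))
    # Zero-pad the skip value to 8 digits so the files sort in batch order.
    out=$(printf 'docs.%08d.csv' "$skip")
    # Single quotes keep the variable name literal in the printed command.
    cmd='mongoexport --uri "$READ_ONLY_MONGO_CONNECTION_STRING" --collection docs'
    cmd="$cmd --fields '_id,project_id' --skip $skip --limit $BATCH_SIZE"
    cmd="$cmd --type=csv --out $out"
    printf '%s\n' "$cmd"
    i=$((i + 1))
  done
}

export_commands
# prints three commands; the first one ends in --out docs.00000000.csv
```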
Concatenate these into a single file: cat docs.*csv > all-docs-doc_id-project_id.csv
For object ids the script will accept either plain hex strings or the ObjectId(...)
format used by mongoexport.
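Since both formats are accepted, no conversion is required. If plain hex ids are preferred anyway, for example to shrink the files, the wrapper can be stripped with sed. This is an optional sketch, and strip_objectid is a hypothetical helper name.

```shell
# Strip the ObjectId(...) wrapper written by mongoexport, leaving the
# 24-character hex id. Lines without a wrapper pass through unchanged.
strip_objectid() {
  sed -E 's/ObjectId\(([0-9a-fA-F]{24})\)/\1/g'
}

printf '%s\n' 'ObjectId(5babb6f864c952737a9a4c32),ObjectId(5b98bba5e2f38b7c88f6a625)' | strip_objectid
# prints 5babb6f864c952737a9a4c32,5b98bba5e2f38b7c88f6a625
```

If you do this, apply it to the doc and project exports alike, because the later comparison steps match ids as literal text.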
Export the project ids from both the projects and deletedProjects collections:
mongoexport --uri "$READ_ONLY_MONGO_CONNECTION_STRING" --collection projects --fields '_id' --type=csv --out projects.csv
mongoexport --uri "$READ_ONLY_MONGO_CONNECTION_STRING" --collection deletedProjects --fields 'project._id' --type=csv --out deleted-projects.csv
Concatenate these: cat projects.csv deleted-projects.csv > all-projects-project_id.csv
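Find the project ids that docs refer to but that exist in neither collection, then select the docs that belong to them:
cut -d, -f 2 all-docs-doc_id-project_id.csv | sort | uniq > all-docs-project_ids.sorted.uniq.csv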
sort all-projects-project_id.csv | uniq > all-projects-project_id.sorted.uniq.csv
comm --check-order -23 all-docs-project_ids.sorted.uniq.csv all-projects-project_id.sorted.uniq.csv > orphaned-doc-project_ids.csv
grep -F -f orphaned-doc-project_ids.csv all-docs-doc_id-project_id.csv > orphaned-doc-doc_id-project_id.csv
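Taken together, these commands compute a set difference (project ids referenced by docs, minus project ids that exist) and then pull the matching doc rows. The sketch below runs the same steps on a toy dataset to make that visible. The ids are made up, header rows are left out for brevity, and find_orphans is a hypothetical wrapper, not part of the script.

```shell
# Toy walk-through of the orphan detection steps.
# $1: file with lines of "doc_id,project_id"
# $2: file with lines of "project_id" (projects plus deletedProjects)
find_orphans() {
  docs=$1
  projects=$2
  workdir=$(mktemp -d)
  # Project ids referenced by docs, sorted and de-duplicated.
  cut -d, -f 2 "$docs" | sort | uniq > "$workdir/doc-projects"
  # Project ids that still exist, sorted and de-duplicated.
  sort "$projects" | uniq > "$workdir/projects"
  # -23 keeps only lines unique to the first file: ids referenced by docs
  # but missing from the project list. (--check-order is a GNU extension
  # and is omitted here.)
  comm -23 "$workdir/doc-projects" "$workdir/projects" > "$workdir/orphans"
  # Select the full doc rows for the orphaned project ids.
  grep -F -f "$workdir/orphans" "$docs"
  rm -r "$workdir"
}

demo=$(mktemp -d)
printf '%s\n' 'd1,pA' 'd2,pB' 'd3,pA' 'd4,pC' > "$demo/docs.csv"
printf '%s\n' 'pA' 'pC' > "$demo/projects.csv"
find_orphans "$demo/docs.csv" "$demo/projects.csv"
# prints d2,pB
rm -r "$demo"
```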
Run the script on the resulting file:
node delete-orphaned-docs orphaned-doc-doc_id-project_id.csv
By default the script will only print the list of project ids and doc ids to be
deleted. To actually delete docs, run it with the --commit argument.
The --limit and --offset arguments can be used to specify which lines to
process. There is one doc per line, so a single project will often span multiple
lines. Deletion is based on project id, however, so if one doc for a project is
deleted, all of that project's docs will be deleted, even if not all of its
input lines are processed.
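Because of this, a window of N lines can remove more than N docs. The sketch below previews that effect for a given window without touching the database. It assumes --offset is a zero-based count of lines to skip from the top of the file, which this README does not confirm. Both function names are hypothetical.

```shell
# List the distinct project ids found in a window of input lines.
# Assumption: offset is the number of lines skipped from the top of the file.
projects_in_window() {
  file=$1
  offset=$2
  limit=$3
  tail -n "+$((offset + 1))" "$file" | head -n "$limit" | cut -d, -f 2 | sort -u
}

# Count every line in the file that belongs to a project in the window,
# i.e. the number of docs a project-based deletion would cover.
docs_affected() {
  ids=$(mktemp)
  projects_in_window "$1" "$2" "$3" > "$ids"
  grep -F -c -f "$ids" "$1"
  rm "$ids"
}

sample=$(mktemp)
printf '%s\n' 'd1,pA' 'd2,pA' 'd3,pB' 'd4,pA' > "$sample"
docs_affected "$sample" 0 1
# prints 3: the one-line window contains pA, which owns three docs
rm "$sample"
```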