rfcs/nosql-schema-sampling/readme.md
As we operate in an environment encompassing both SQL and NoSQL databases, the inherent unstructured nature of NoSQL presents specific challenges.
Despite the flexibility and scalability advantages of NoSQL databases, managing data efficiently becomes a complex task, particularly in the context of ensuring a predictable API and improved coding ergonomics.
To address these challenges, we propose an automatic schema generation tool leveraging NoSQL sampling techniques.
A proof of concept for MongoDB is below; the same approach can be applied to other NoSQL databases.
Essentially, the idea is to use sampling to take either all of a collection's documents or a subset of them, and then run an analysis to get an idea of the shapes and types found within that universe of documents.
That analysis would then be used to generate a schema which could serve as an onboarding starting point to power the GraphQL schema in Hasura.
This schema would then be customizable, allowing the end-user to make changes as needed.
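
As a rough illustration of the sampling step (a minimal mongosh sketch, not the proof-of-concept code; the `movies` collection and sample size are illustrative), `$sample` can pull a random subset of documents whose field types we then tally:

```javascript
// Minimal sketch: randomly sample documents and tally the JavaScript
// type observed for each top-level field. A real analyzer (such as
// Variety) also recurses into nested documents and tracks BSON types.
const SAMPLE_SIZE = 1000; // illustrative; sampling the entire collection is also an option

const fieldTypes = {};
db.movies.aggregate([{ $sample: { size: SAMPLE_SIZE } }]).forEach((doc) => {
  for (const [field, value] of Object.entries(doc)) {
    const type = Array.isArray(value) ? "array" : typeof value;
    fieldTypes[field] = fieldTypes[field] || {};
    fieldTypes[field][type] = (fieldTypes[field][type] || 0) + 1;
  }
});

printjson(fieldTypes); // e.g. { year: { number: 987, string: 13 }, ... }
```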
One open question: if a field's sampled values contain mostly int types and a couple of string types, which should be taken?
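
One possible answer (an assumption on our part, not something the proof of concept currently decides) is to emit a union of the observed types, since `$jsonSchema`'s `bsonType` accepts an array:

```javascript
// Sketch: a validator fragment for a field whose samples contained
// both ints and strings. The union keeps all observed documents valid;
// a stricter policy could take the majority type and flag the outliers.
const validator = {
  $jsonSchema: {
    bsonType: "object",
    properties: {
      year: { bsonType: ["int", "string"] } // both int and string samples seen
    }
  }
};
```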
We've created a proof of concept for a schema sampler that can be run against a MongoDB database.
It's built using a combination of [mongosh](https://www.mongodb.com/docs/mongodb-shell/), the [Variety](https://github.com/variety/variety) MongoDB schema analyzer, and Node.js.
It generates a MongoDB validation schema based on an analysis of the documents in a collection, and can then optionally apply that schema to the database.
[Later, this same work can also be used to generate Hasura representations of the schema using our logical models.]
That schema can then be used by Hasura to generate a GraphQL schema on top of the MongoDB data source.
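
To make that concrete, applying a generated validator in mongosh might look like the following (the `$jsonSchema` shown is a hand-written stand-in rather than actual sampler output):

```javascript
// Sketch: applying a generated validator to an existing collection via
// collMod. validationLevel "moderate" only validates inserts and
// updates to documents that already satisfy the schema, a gentler
// default for schemas inferred from a sample.
db.runCommand({
  collMod: "movies",
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["title"],
      properties: {
        title: { bsonType: "string" },
        year: { bsonType: "int" },
        genres: { bsonType: "array", items: { bsonType: "string" } }
      }
    }
  },
  validationLevel: "moderate"
});
```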
To run the proof of concept, start the stack with:

```sh
docker compose up
```

which brings up MongoDB loaded with the sample_mflix sample database.
Then run `./schema_sampler/archive.sh`, which bootstraps introspecting the collections, running them through Variety, converting the analysis to a validation schema, and then updating that schema back into MongoDB.
The Docker container currently exposes the environment variables listed below, which can be used for selecting the collections to analyze and sample.
If you'd like to customize the sampling method, you can edit the `./schema_sampler/analyze.sh` file.
On line 29 there's:

```sh
mongosh ${MONGO_DATABASE} --quiet --eval "var collection = '${collection//\'/}', outputFormat='json'" --username ${MONGO_USERNAME} --password ${MONGO_PASSWORD} --authenticationDatabase=admin /schema_sampler/variety.js > "/schema_exports/analysis/${collection//\'/}.json"
```
which is where the data is retrieved for sampling using mongosh. You can edit the `--eval` command to change the sampling method (for example, adding a query that matches only records on a particular version of a schema, or a limit so that only the first 5000 records are returned).
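
For instance, if Variety's `query` and `limit` options work as its README describes (an assumption worth verifying against a real run), the `--eval` contents could be changed to something like:

```javascript
// Hypothetical replacement for the --eval contents in analyze.sh:
// restrict the analysis to documents stamped with schemaVersion 2
// (schemaVersion is an invented example field) and cap the analysis
// at the first 5000 records.
var collection = 'movies',
    outputFormat = 'json',
    query = { schemaVersion: 2 },
    limit = 5000;
```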
Running `docker logs mongodb_sampling` lets you see what was run, and when, inside the sampling container.
The `/schema_exports` directory will contain the intermediate analysis files and the JSON files for the validation schema export.
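
For orientation, an entry in an intermediate analysis file looks roughly like this (shape based on Variety's documented JSON output; treat it as an assumption and inspect a real export):

```json
{
  "_id": { "key": "year" },
  "value": { "types": { "Number": 987, "String": 13 } },
  "totalOccurrences": 1000,
  "percentContaining": 100
}
```

Per key, Variety records the types it saw and how many documents contained the key; the conversion step maps these tallies to `$jsonSchema` types.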
The `mongodb_sampler` container has a few environment variable helpers which you can set:

- `MONGO_DATABASE`: The MongoDB connection string.
- `MONGO_USERNAME`: The MongoDB username.
- `MONGO_PASSWORD`: The MongoDB password.
- `MONGO_SELECT_COLLECTIONS`: Which collections to analyze and sample. Use `''` for all collections, or a comma-separated list such as `movies,comments`.
- `MONGO_UPDATE_COLLECTIONS`: Automatically update collections in the database with the generated validation schemas. `true` or `false` (blank defaults to `false`).
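
When `MONGO_UPDATE_COLLECTIONS` is set to `true`, one way to confirm a validator was applied (a suggested check, not part of the tooling) is to inspect the collection metadata in mongosh:

```javascript
// Prints the collection's options, including any applied $jsonSchema
// validator, to confirm the sampler's update took effect.
db.getCollectionInfos({ name: "movies" })[0].options.validator;
```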