hphp/hack/doc/HIPs/gradual_modularity.md
Status: draft, not actively being worked on. If this changes, this pre-HIP proposal should be updated to match the current HIP template.
Last updated: 2019-10-09
Shared as a HIP for external visibility
In production builds of Facebook WWW, certain directories are dropped, such as internal tools and test directories. This means that any code attempting to access symbols in those directories will fail at runtime with an autoloader fatal. Hack is not currently able to enforce this.
As a related problem, due to the global nature of our repo, framework developers struggle to hide the implementation details of their frameworks. Often, we rely on naming conventions like _DO_NOT_USE. But, again, there is no static analysis to enforce this, and these symbols end up getting used anyway.
Then, there are similar problems that arise when attempting to define 'black box' APIs - we may want hard isolation while maintaining access to core infrastructure: for example, there may be a blessed library to interact with a data store, and direct access should be banned except by the library.
Lastly, there's the classic problem of open-sourcing Hack frameworks whose source of truth is part of a proprietary monorepo: for example, all of Facebook WWW can use the HSL, but the HSL cannot depend on any proprietary code.
These problems are manifestations of a more general problem: the lack of modularity in the Hack language—everything is globally accessible. Upon closer inspection, we can see that there are multiple granularities to this problem: two examples are the "environment" level and the "library" level.
Note that these granularities are not mutually exclusive—for example, HSL is both a library and an environment.
In the same way that we slowly introduced types into an untyped Hack, we can introduce modularity gradually, from coarsest to finest granularity. Accordingly, this document explores a solution to the problem of "environments," while leaving the possibility open to solving the "libraries" problem in the future.
This section describes the environments feature in an abstract sense, using pseudocode, to bring the semantics of the feature to the forefront. Later sections propose a concrete syntax for the feature.
An "environment" is an isolated subdivision of a monorepo, where "isolated" means that code defined outside of a particular environment may not access code within that environment (barring certain exceptions). The boundaries are enforced both by the typechecker and the runtime.
The build system will select environments to include in the build, but not every environment has to correspond to a hard runtime boundary. This allows environments to be used for massive sections of a repository (e.g. intern/prod), but also for libraries (e.g. HSL).
An environment is specified inside a special configuration file in a directory. A special filename isn't strictly necessary, but it makes environments easy to find within a codebase containing extremely large numbers of files, and makes it harder for users to accidentally define environments.
```
// Example: in ~/www/flib/environment.x:
environment {
  name = Prod
}
```
All code in source files in said directory and recursive subdirectories belongs to this environment. A source file belongs to exactly one environment: the nearest one defined. For example, if flib/ defines an environment "Prod" and flib/intern/ defines an environment "Intern", then all code under flib/intern/ belongs only to Intern. Source files which do not have a nearest environment definition belong to a “default environment”.
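The nearest-definition rule above can be sketched as a simple walk up the directory tree. The following is illustrative Python, not an implementation; the directory map stands in for parsed environment.x files, and the names are taken from the examples in this document:

```python
import os

def nearest_environment(path, env_dirs):
    """Walk up from the file's directory; the first directory with an
    environment definition wins. env_dirs maps directory -> environment
    name (a stand-in for parsed environment.x files)."""
    current = os.path.dirname(path)
    while True:
        if current in env_dirs:
            return env_dirs[current]
        parent = os.path.dirname(current)
        if parent == current:     # reached the top without a definition
            return "Default"      # the "default environment"
        current = parent

env_dirs = {"flib": "Prod", "flib/intern": "Intern"}
nearest_environment("flib/intern/tool.php", env_dirs)  # -> "Intern"
nearest_environment("flib/core/util.php", env_dirs)    # -> "Prod"
nearest_environment("scripts/run.php", env_dirs)       # -> "Default"
```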
Environments can depend on multiple other environments. For migration purposes, all environments implicitly depend on the “default environment,” with the assumption that all code will eventually move to a defined environment. To define an environment that depends on another, simply define the environment normally and include its dependencies:
```
// Example: in ~/www/flib/intern/environment.x:
environment {
  name = Intern
  dependencies = [Prod]
}
```
As with the first example, all code under flib/intern/ belongs only to Intern. However, now that Intern depends on Prod, code belonging to Intern may access code in Prod, but not vice-versa! The dependency relationship is not transitive; for example, if there were an environment "Scripts" that depended on Intern, it would only be able to access code in Prod if it also depended on Prod.
That flib/intern/ is a subdirectory of flib/ is irrelevant—dependent environments don't have to be defined in subdirectories of the environments they depend on. Conversely, Intern doesn't have to depend on Prod just because flib/intern/ is a subdirectory of flib/. In fact, the relationship can be inverted—for example, Prod could depend on an "HSL" environment defined in flib/core/hack/lib/!
All environments implicitly depend on the aforementioned “default environment”. This allows for environments to be gradually migrated into a monorepo, without having to do it all at once. It also handles the case of HHVM builtins, which are currently provided by HHIs. Eventually, they may all be inside of a “builtin” environment, but initially they will simply live in the default environment and be accessible everywhere.
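Putting these rules together — access is allowed within an environment, to the implicit default environment, or to a direct (never transitive) dependency — the access check can be sketched as follows. This is illustrative Python under those stated rules, not the actual typechecker:

```python
def may_access(caller_env, callee_env, deps):
    """deps maps environment name -> set of *direct* dependencies.
    Access is allowed within an environment, to the implicit default
    environment, or to a direct dependency -- never transitively."""
    if caller_env == callee_env or callee_env == "Default":
        return True
    return callee_env in deps.get(caller_env, set())

deps = {"Intern": {"Prod"}, "Scripts": {"Intern"}}
may_access("Intern", "Prod", deps)    # True: direct dependency
may_access("Prod", "Intern", deps)    # False: dependencies are one-way
may_access("Scripts", "Prod", deps)   # False: deps are not transitive
may_access("Prod", "Default", deps)   # True: implicit dependency
```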
Because __tests__ directories are currently scattered around the codebase, any proposed feature would be restricted to running regexes on filenames, which would hamstring its ability to integrate into the language. Ideally, tests in WWW would be moved from __tests__ directories into one top-level tests/ directory. Failing that, we may be able to reach a compromise: one environment.x file could define multiple environments via regular expressions that apply only to files under that directory.
```
// Example: in ~/www/flib/environment.x:
environments {
  ProdTests {
    regex = "#/__tests__/#"
    dependencies = [Prod]
  }
  Prod {
    regex = "#.*#"
  }
}
```
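Under that compromise, classifying a file into an environment could work roughly as follows. This is an illustrative Python sketch: the first-match-wins ordering is an assumption (the proposal doesn't pin it down), and the `#...#` delimiters of the HDF-style patterns are dropped for Python's `re`:

```python
import re

def classify(path, envs):
    """envs is an ordered list of (name, pattern) pairs from one
    environment definition file; the first matching pattern wins
    (an assumed ordering, not specified by the proposal)."""
    relative = "/" + path  # patterns like /__tests__/ match path segments
    for name, pattern in envs:
        if re.search(pattern, relative):
            return name
    return "Default"  # no pattern matched: the default environment

envs = [("ProdTests", r"/__tests__/"), ("Prod", r".*")]
classify("flib/__tests__/FooTest.php", envs)  # -> "ProdTests"
classify("flib/foo.php", envs)                # -> "Prod"
```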
A “build” is a collection of environments that indicates which code is available in a particular deployment of the repository. It is specified inside a special configuration file in the root directory of the repository. For example, Facebook’s builds might look like this. Note that our builds evolved to be hierarchical, allowing us to specify only one environment per build, but that is not required.
```
// Example: in ~/www/builds.x:
build {
  name = Prod
  environments = [Prod]
}
build {
  name = Intern
  environments = [Intern]
}
build {
  name = Scripts
  environments = [Scripts]
}
```
There are two classes of issues in which existing boundaries in WWW are violated. The first is simpler than the other: certain classes are defined in Intern but really should be in Prod. The solution to this problem is to make environments typechecker-only initially, and use the static analysis to move definitions to where they need to be before turning on runtime enforcement.
The second issue is trickier. Sometimes definitions can't be moved, and instead require explicit checks on the current environment at runtime, which deliberately break the abstraction. To support this permeability, we need to be able to introspect on which environments are available in the current build.
```
// Example: in some prod file:
if (<current build includes Intern>) {
  // Access intern code here...
}
```
Or, if we’re refining with some other mechanism (for example, being in a script context implies that the Scripts environment is available), then users can assert an environment is available with `invariant` to provide a useful error message.
```
// Example: in some prod file:
if (Environment::isScript()) {
  invariant(
    <current build includes Scripts>,
    'Being in a script implies that Scripts is available',
  );
  // Access scripts code here...
}
```
Environment accessibility is enforced whenever an identifier is referenced. Enforcement is based entirely on the source files of the “caller” and “callee”—the permeability construct doesn’t affect functions called within a permeability block.
```
// Example: in some prod file:
class ProdClass extends InternClass  // Error
  implements InternInterface {       // Error
  use InternTrait;                   // Error
}

function f(mixed $x): void {
  intern_function();          // Error
  intern_function<>;          // Error
  InternClass::SOME_CONST;    // Error
  h<InternClass>();           // Error
  new InternClass();          // Error
  $x is InternClass;          // Error
  if (<current build includes Intern>) {
    intern_function();        // OK
    intern_function<>;        // OK
    InternClass::SOME_CONST;  // OK
    h<InternClass>();         // OK
    new InternClass();        // OK
    $x is InternClass;        // OK
    g();  // Note that the usages in g() are still errors
  }
}

function g(): void {
  intern_function();             // Error
  echo InternClass::SOME_CONST;  // Error
  // etc...
}

function h<reify T>(): void {}
```
When typechecking a file, the typechecker maintains a list of available environments. Initially, the list has exactly one element. Whenever a symbol is referenced, the typechecker compares the environment of the current source file against the environment of the referenced symbol. If they’re incompatible, an error is raised. The permeability construct will add the checked environment to the otherwise singleton list of current environments, and the typechecker will iterate over the current environments within the permeating block.
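The algorithm in the preceding paragraph can be sketched as follows. This is illustrative Python, not the actual typechecker; `refined` stands in for the environments appended by enclosing permeability blocks, and `deps` for the declared direct dependencies:

```python
def check_reference(file_env, callee_env, deps, refined=None):
    """Typechecker sketch: the list of available environments starts as
    the singleton [file_env]; each enclosing permeability block appends
    the environment it checked. deps maps environment -> direct deps."""
    available = [file_env] + (refined or [])
    for env in available:
        if (env == callee_env
                or callee_env == "Default"   # implicitly depended upon
                or callee_env in deps.get(env, set())):
            return "OK"
    return "Error"

deps = {"Intern": {"Prod"}}
check_reference("Prod", "Intern", deps)                      # "Error"
check_reference("Prod", "Intern", deps, refined=["Intern"])  # "OK"
```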
When a boundary violation is found, the typechecker's error message will list the two environments, including an explanation as to why each source file is in its respective environment. The explanation can be computed from the two ways environments can be defined (directory plus optional string pattern). However, the error message will not point to the environment file, as we do not want to advertise this feature to WWW users (but they can easily find it by looking in the directory specified).
For example, consider the environments defined above (Prod, ProdTests, Intern). If code from Intern attempted to access code from ProdTests, the error message would say:
```
Cannot reference a symbol in another environment
  This use-site is in the Intern environment because it is in directory flib/intern/
  The symbol X is in the ProdTests environment because it is in directory flib/ and matches the pattern __tests__
  Intern does not depend on ProdTests
```
In repo-authoritative mode, the current build should be known when the repository is compiled (e.g. we know if we are going to deploy to prod, intern, etc). This means that environments that aren’t part of the build will simply be dropped from the repository, any environment checks on identifiers can be elided, and permeability conditions can be statically checked and compiled out (akin to ifdef). However, if a function is annotated with the __DynamicallyCallable attribute, then the checks must remain, because we don’t know which environments it may be called from.
In sandbox mode, the available environments must be defined per-request, so that they behave identically to how such a request would behave in production. For example, in development, a user should not be able to access scripts in a web request. The native autoloader can be environment-aware, and refuse to load symbols if the environments are incompatible. Similarly to the typechecker, the permeability construct will “refine” the current environment to the one checked within the scope of the checked block.
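A sketch of such an environment-aware autoloader follows. All names here are hypothetical, written in Python purely for illustration; this is not HHVM's actual autoloader API:

```python
class Autoloader:
    """Environment-aware autoloader sketch: refuses to load a symbol
    whose environment is outside the current request's available set."""

    def __init__(self, symbol_envs, available):
        self.symbol_envs = symbol_envs   # symbol name -> environment name
        self.available = set(available)  # environments in this request

    def load(self, symbol):
        env = self.symbol_envs.get(symbol, "Default")
        # The default environment is implicitly available everywhere.
        if env != "Default" and env not in self.available:
            raise RuntimeError(
                f"autoload failure: {symbol} is in {env}, "
                f"which is unavailable in this request")
        return f"<loaded {symbol}>"

loader = Autoloader({"intern_function": "Intern"}, available=["Prod"])
loader.load("prod_function")     # OK: default environment
# loader.load("intern_function") # would raise RuntimeError
```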
Some ideas on how to determine the available environments:
Facebook has benefited tremendously by betting on the monorepo model. This feature is intended to mitigate some shortcomings of the monorepo without cutting into the benefits. We want to help users understand when they’re writing code that will behave differently in production, but we do not want to encourage carving up the WWW repo into silos. We want to avoid a world in which developers do not feel empowered to contribute to parts of the codebase because it “belongs” to someone else. We also don’t want to enable an intractable dependency graph to develop in the WWW repo.
Therefore, for the initial rollout of this feature, we will only allow environments that correspond to a form of hard isolation (e.g. a build boundary, or an OSS library). Framework maintainers who wish to hide implementation details will have to wait—environments are not intended to be a packages feature.
What if we decide that environments aren't the right solution for WWW? Perhaps they'd proliferate too quickly or make the WWW developer experience too cumbersome. Recall this property of environment definitions discussed above:
> Source files which do not have a nearest environment definition belong to a “default environment”.
It follows that to roll back to a pre-environment state in WWW, it would be sufficient to delete all environment files from WWW. Then, all declarations would be in the default environment, and would be accessible to each other, both in the typechecker and the runtime (if we get that far before rolling back).
The environment checking code could then be removed from the typechecker, followed by the code that records environments into saved states.
This is one possible initial configuration which adheres to the principle defined above:
- Prod: flib/ but not flib/intern/.
- Intern: flib/intern/.
- Scripts: scripts/.
Then, our three builds would be:

- Prod: [Prod]
- Intern: [Intern]
- Scripts: [Scripts]
There are a few fundamental points that determine how environments will affect Hack's incremental typechecking model.
Given those points, most working-copy changes related to environments can be reduced to one or more files moving from environment A to environment B, in which case we re-typecheck those files and their dependents. This means that the only machinery we must add to the server is watching for environment changes, and mapping those changes to the list of files that moved environments. Consider the following situations:
Another concern is how the addition of environments affects saved states. TODO:
If each of our OSS libraries is to be an environment internally, there is potential to integrate environment specification files with package managers. In order to do so, the specifications would have to be extended with at least the following information:
For a specific package manager, the current frontrunner is Esy, Reason's package manager. One large factor in choosing Esy over Yarn is that it's a binary (versus Yarn requiring Node on the machine, or us having to package Node with HHVM). We'd need to implement a plugin to read HDF and make sure environment definitions are flexible enough to be used by package managers, and the developers of Esy have been open to collaborating on the design process.
A pitfall to consider is that Esy and Yarn install packages in a global cache on the system (instead of in the project root). This means that the typechecker will need to be taught about this cache in some way to still work in OSS.
For a concrete proposal for environments, we’ll choose HDF. It’s consistent with HHVM’s current configuration format, and it supports dictionaries and lists. Other options considered were YAML, TOML, and JSON, but each was either inappropriate or suboptimal.
To define environments, define named nodes inside an environments field in a file named __environments.hdf in a directory. The special filename isn't strictly necessary, but it makes environments easy to find within a 2.5M-file codebase, and makes it harder for users to accidentally define environments. To define an environment that depends on another, simply define the environment normally and add a dependencies field:
```
// Example: in ~/www/flib/__environments.hdf:
environments {
  ProdTests {
    regex = "#/__tests__/#"
    dependencies {
      Prod
      TestInfra
    }
  }
  Prod {
    regex = "#.*#"
  }
}
```
```
// Example: in ~/www/flib/intern/__environments.hdf:
environments {
  InternTests {
    regex = "#/__tests__/#"
    dependencies {
      Intern
      TestInfra
    }
  }
  Intern {
    regex = "#.*#"
    dependencies {
      Prod
    }
  }
}
```
A "build" represents a concrete subdivision of a monorepo which is composed of environments. To define builds, define named nodes inside a builds field inside a file named __builds.hdf in the root directory of the repository.
```
// Example: in ~/www/__builds.hdf:
builds {
  Prod {
    environments {
      Prod
    }
  }
  Intern {
    environments {
      Intern
    }
  }
  Scripts {
    environments {
      Scripts
    }
  }
}
```
Note that our builds evolved to be hierarchical, allowing us to specify only one environment per build, but that is not required. A build may contain multiple completely separate environments.
To implement runtime permeability, we use a new builtin function HH\environment_available, which takes a string representing an environment name as defined above. If the environment is included in the current build, it returns true; otherwise, it returns false. The typechecker understands this function and will include the listed environment for the duration of the block. It follows that the argument passed to the function must be a string literal.
```
// Example: in some prod file:
if (HH\environment_available('Intern')) {
  // Access intern code here...
}
```
Or, if we’re refining with some other mechanism (for example, being in a script implies that the Scripts environment is available), then we can assert an environment is available with `invariant` to provide a better error message.
```
// Example: in some prod file:
if (Environment::isScript()) {
  invariant(
    HH\environment_available('Scripts'),
    'Being in a script implies that Scripts is available',
  );
  // Access scripts code here...
}
```
A Cargo workspace is a set of packages that share the same Cargo.lock and output directory. Packages within a workspace may depend on each other, and depend on the same set of external dependencies.
This is similar to this proposal in which the repo is the workspace, and each “package” is an environment. At Facebook, there is no lock file, but externally, a repo would have just one composer.lock, and each environment in the repo would depend on the same external dependencies.
One key difference is that there isn’t a notion of transitive dependencies (there was at some point, but it was a bug); dependencies must be declared explicitly:
> The top-level Cargo.lock now contains information about the dependency of `add-one` on `rand`. However, even though `rand` is used somewhere in the workspace, we can’t use it in other crates in the workspace unless we add `rand` to their Cargo.toml files as well.
The nomenclature is different here, but the concepts are similar to Cargo workspaces. A Yarn workspace is what Cargo would call a package (or we’d call an environment), but otherwise works similarly for interdependencies and external dependencies:
> Requiring `workspace-a` from a file located in `workspace-b` will now use the exact code currently located inside your project rather than what is published on npm, and the `cross-env` package has been correctly deduped and put at the root of your project to be used by both `workspace-a` and `workspace-b`.
Some key differences between Yarn workspaces and this proposal include:
> “…package.json file, your tests might still pass locally if another package already downloaded that dependency into the workspace root.”
You can think of an assembly as a collection of types and resources that form a logical unit of functionality and are built to work together. In .NET Core and .NET Framework, an assembly can be built from one or more source code files. In .NET Framework, assemblies can contain one or more modules.
Assemblies are analogous to environments in our proposal, and modules would be analogous to packages, which we may design in the future.
One key difference between assemblies and our proposal is the notion of friend assemblies: a friend assembly is an assembly that can access another assembly's internal (in C# or Friend in Visual Basic) types and members. For V1 of this proposal, we have omitted environment-level visibility and punted until we choose to design a packages feature.
Probably something to look into as well, because: