hphp/hack/doc/HSL_design/io.md
Fully implemented in Hack code (https://github.com/hhvm/hsl-experimental and Facebook WWW); widely used externally by most CLI applications in Hack, including HHAST’s LSP support. Originally derived from HHAST LSP server’s async IO implementation.
Used in Facebook www in limited places due to ‘experimental’ status.
HSL IO aims:
It is composed of several new namespaces:
- HH\Lib\OS: this is a thin layer exposing the traditional C File-Descriptor-based APIs; minimal changes are made to the C APIs.
  - If OS\write("foo") only writes 1 character, this is considered success, not a failure, and users must check for it, just like in C.
- HH\Lib\{File, Unix, TCP}: functions, classes, and interfaces that are specific to a particular kind of IO ‘handle’.
- HH\Lib\{IO, Network}: functions, classes, and interfaces that are shared or reusable between multiple kinds of IO handles.

php://input, php://output, print(), etc. See ‘Future Work’ for details.

This document is not intended to be a full API design review of the library; however, for completeness, the full APIs can be reviewed in the documentation at https://docs.hhvm.com/hsl-experimental/reference/
The primary motivations are:
- posix_get_last_error(), posix_errno(), and socket_last_error() are unreliable, especially when async or CLI server mode is being used; they also depend on mutable global state.
- resource types; file/stream/socket resources have observable destructor-like behavior

Additional motivations are:
- resource type
- T | false return types are prevalent and not supported by the Hack type system.

Making HSL IO built-in will unblock work on:
- STDIN, STDOUT, php://input, php://output
$file = File\open_write_only(
  '/tmp/foo.txt',
  File\WriteMode::OPEN_OR_CREATE, // optional
  0644 // optional
);
// Close the file handle on scope exit:
using $file->closeWhenDisposed();
$conn = await TCP\connect_async(
  'localhost',
  8080,
  shape( // optional
    'timeout_ns' => 123,
    'ip_version' => Network\IPProtocolBehavior::PREFER_IPV6,
  ),
);
using $conn->closeWhenDisposed();
// Write "foo\n" or throw:
await $file->writeAllAsync("foo\n");
// Ditto:
await $conn->writeAllAsync("foo\n");
// the OS\write()/POSIX behavior:
await $conn->writeAllowPartialSuccessAsync("foo\n");
$conn->close();
// Line- and character-based operations
$br = new IO\BufferedReader($file);
$line = await $br->readLinexAsync();
foreach ($br->linesIterator() await as $line) {
  // do stuff with each line, without awaiting in a loop
}
$chunk = await $br->readUntilAsync("\nMARK\n");
// Escape hatch in case you have an edge case we don't have a
// high-level API for:
$fd = $file->getFileDescriptor();
invariant(!OS\isatty($fd), '/tmp/foo.txt is a tty?!');
$stdin = OS\request_input();
if ($stdin is IO\FDHandle) {
  // probably CLI mode
  if (OS\isatty($stdin->getFileDescriptor())) {
    // do interactive things
  } else {
    // Maybe `myapp < /tmp/foo.txt` or `someotherapp | myapp` ?
    // do non-interactive things
  }
} else {
  // probably POST data, and $stdin->getFileDescriptor() would be a type
  // error. Equivalent to `php://input` thing, which may or may not be
  // backed by a real FD; it isn't when using proxygen.
}
$file is a File\CloseableWriteHandle; in turn, this is an IO\CloseableSeekableWriteHandle and:
- IO\Handle: this is an empty base interface
- IO\CloseableHandle: an IO\Handle with close(); currently, all concrete IO\Handles are closeable, but others have been suggested in the past; e.g. an IO\server_error() handle returning process STDERR; an individual request should not be able to close HHVM server stderr.
- IO\WriteHandle and an IO\SeekableHandle
- IO\FDHandle: this is the integration point with the OS\ namespace: this means that:
  - $file->getFileDescriptor() is available and returns an OS\FileDescriptor
  - OS\ functions, e.g. writeAllAsync() is implemented with OS\write() and OS\poll_async()

$conn is also a Closeable, Writable, and FileDescriptor Handle, but it is not a Seekable handle.
These interfaces are best thought of as intersections: a FooBarBazHandle is a FooBarHandle, FooBazHandle, BarBazHandle, FooHandle, BarHandle, and BazHandle. Concretely, a CloseableReadWriteHandle is a CloseableHandle, a ReadHandle, a WriteHandle, a ReadWriteHandle, a CloseableReadHandle, and a CloseableWriteHandle. Function authors should aim to describe what functionality they need when restricting input types, and take the least specific interface possible. For example, if a function only needs to call writeAllAsync(), it should take an IO\WriteHandle, not a File\WriteHandle.
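The "interfaces as intersections" idea can be sketched outside Hack. The following is a minimal illustration in Python using structural protocols, not the HSL IO definitions; `MemoryHandle` and `log_line` are hypothetical names introduced only for this example.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ReadHandle(Protocol):
    def read(self, max_bytes: int) -> bytes: ...


@runtime_checkable
class WriteHandle(Protocol):
    def write(self, data: bytes) -> int: ...


@runtime_checkable
class ReadWriteHandle(ReadHandle, WriteHandle, Protocol):
    """Empty body: this interface only names the combination."""


class MemoryHandle:
    """Hypothetical in-memory handle that can both read and write."""

    def __init__(self) -> None:
        self.buffer = b""

    def read(self, max_bytes: int) -> bytes:
        chunk, self.buffer = self.buffer[:max_bytes], self.buffer[max_bytes:]
        return chunk

    def write(self, data: bytes) -> int:
        self.buffer += data
        return len(data)


def log_line(out: WriteHandle, message: str) -> None:
    # Only write() is needed, so the parameter uses the narrowest interface.
    out.write(message.encode() + b"\n")
```

Because `log_line` asks only for a `WriteHandle`, any handle that can write is accepted, whether or not it can also read or seek.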
IO\Handles are not disposables; they must manually be closed, or closeWhenDisposed() should be called to get a disposable that will close on exit. This disposable is not itself an IO\Handle.
The majority of IO\Handles are IO\FDHandles, built on top of an OS\FileDescriptor. This is a native object which is a thin wrapper around the C int file descriptor concept, which ensures that:
For example, a writeAllowPartialSuccessAsync() call ends up being a call to OS\write($this->fd).
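The relationship between a partial-success write and a "write all" operation can be sketched with POSIX file descriptors. This is a rough Python illustration, not the HSL implementation: it uses blocking descriptors and omits the O_NONBLOCK and polling steps, and `write_all` and `one_byte_write` are hypothetical helpers.

```python
import os


def write_all(fd: int, data: bytes, write=os.write) -> None:
    """Call write() repeatedly until every byte has been accepted."""
    remaining = memoryview(data)
    while len(remaining) > 0:
        # Like write(2), this may accept fewer bytes than offered; that is
        # success, and the caller is responsible for sending the rest.
        accepted = write(fd, remaining)
        remaining = remaining[accepted:]


def one_byte_write(fd: int, data) -> int:
    """Hypothetical stand-in for a descriptor that accepts 1 byte per call."""
    return os.write(fd, data[:1])
```

Passing `one_byte_write` as the `write` argument simulates the case described earlier, where writing "foo" only writes 1 character and the caller has to continue.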
OS\write() is a very thin Hack wrapper around the native builtin HH\Lib\_Private\_OS\write(); the separation of responsibilities is that:
For example, _OS\write() may throw an _OS\ErrnoException(), and OS\write() may catch this and instead throw an OS\FileNotFoundException; as the user-facing exception hierarchy is a very subjective and opinionated area, it is left to the Hack code.
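The split between a low-level call that reports an errno and a thin wrapper that chooses the user-facing exception type might look like the sketch below. It is written in Python for illustration only; the class names mirror those used in this document, while `open_fd` and the mapping table are assumptions, not the HSL code.

```python
import errno
import os


class ErrnoException(Exception):
    """Base class; also raised directly when no specific subclass applies."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(message)
        self.errno = code


class FileNotFoundException(ErrnoException):
    pass


class AlreadyExistsException(ErrnoException):
    pass


class IsADirectoryException(ErrnoException):
    pass


class IsNotADirectoryException(ErrnoException):
    pass


# Assumed mapping from errno values to the more specific exception types.
_SPECIFIC = {
    errno.ENOENT: FileNotFoundException,
    errno.EEXIST: AlreadyExistsException,
    errno.EISDIR: IsADirectoryException,
    errno.ENOTDIR: IsNotADirectoryException,
}


def open_fd(path: str, flags: int) -> int:
    """User-facing wrapper: translate the low-level error into a specific type."""
    try:
        return os.open(path, flags)  # plays the role of the native builtin
    except OSError as e:
        cls = _SPECIFIC.get(e.errno, ErrnoException)
        raise cls(e.errno, os.strerror(e.errno)) from e
```

Keeping the table in the wrapper layer means the hierarchy can be changed without touching the native call.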
Async support for IO\FDHandle is built on O_NONBLOCK, and libevent/libevent2’s FD support.
The current exception hierarchy is based on Python 3’s work, which appears well received. Concretely, there is:
- OS\ErrnoException: this is both a base class, and instantiable when there is not a more specific exception
- OS\AlreadyExistsException, OS\IsADirectoryException, OS\IsNotADirectoryException

While catch (OS\ErrnoException $e) { switch ($e->getErrno()) { /* ... */ }} is possible, the hierarchy aims to make this an antipattern in the common case.
Implementing this HIP would require moving the relevant code from Facebook’s www repository and the GitHub hsl-experimental repository to HHVM builtins, and adding appropriate HHIs.
There were previously strong opinions that all IO handles should be Disposable, and closed when disposed.
The vast majority of PHP IO usage in Facebook WWW should be using Disposable-based APIs, specifically APIs focussed on temporary files. However, the majority of IO does not go through PHP IO, or things we usually think of as “the IO library” - it uses dedicated extensions, such as:
We aim for HSL IO to be usable for implementing clients instead of extensions for other services that are not currently supported (for example, gRPC, Redis) but solve similar use cases; as such, we should be asking ourselves: “would this design choice prevent us from reimplementing McRouter in Hack using HSL IO?”.
I do not believe that disposable-only Thrift/McRouter/MySQL/<censored> APIs would be practical for Facebook for the same reasons that I believe that HSL IO can not be disposable-only, detailed below.
Alternatively: Disposable is a ban on encapsulation.
If HSL IO was Disposable-based, and a 100% replacement:
TCP\Socket as an implementation detail. Instead, either:
interface Logger { public function logAsync(string $message): Awaitable<void>; } could be implemented in three ways, none of which are practical:
- logAsync() is called; this is unacceptable performance-wise even for local files, and definitely for networked logging services.
- <<__AcceptDisposable>> IO\WriteHandle parameters. This is bad for usability, prevents implementation hiding, and requires a codemod to add an extra arg to every callsite if we ever want to log to two places (e.g. when migrating logging frameworks).

I mean this in a similar sense to ‘viral licenses’: if one thing is Disposable, everything that touches it must also be Disposable, recursively.
Using the previous Logger example: if we make the IO handle a parameter, any component that could possibly want to log - or use another component that may want to log - would need to take an <<__AcceptDisposable>> $logTo parameter.
If the common ‘controller’ pattern is used:
For some use cases (e.g. live stream uploads), it must be possible to access raw IO streams for POST data; this means that the web controller stack would have similar needs to the hypothetical logger case.
Instead of making it so that STDERR or similar is passed through the application, we could keep it accessible via a free function. The next immediate problem is that the first function that logs to STDERR will acquire it as a disposable, and automatically close it when that function exits.
We could address this by special-casing the STDIO handles to not actually close on exit, or by making them global constants similar to the existing STDIN/STDOUT/STDERR constants (which would require allowing constants that are objects, and that implement IDisposable).
The major shortcoming here is that replacing “log to STDERR” with “log to file”, “log to syslog”, or “log to my favorite SAAS logging provider” becomes a massive challenge requiring refactoring of every logging callsite. It is also itself a breach of encapsulation, as an IO\WriteHandle parameter without <<__AcceptDisposable>> will effectively be the same thing as an STDIOWriteHandle.
This could be avoided by implementing a special API to ‘strip’ d
Instead of the disposable actually controlling open/close, it could be an indirect reference to the real IO handle (similar to a shared_ptr). This would address:
This is not actually a solution for Disposable-based IO handles: it is equivalent to “not Disposable-based” as it requires there to be an underlying non-disposable IO handle backing this, and requires it to be possible to acquire one from a disposable, removing the desired safety.
This is undefined behavior:
concurrent {
  await $f->writeAllAsync('FooBar');
  await $f->writeAllAsync('HerpDerp');
}
As OS\write() can partially succeed, even with a single thread, this can result in FooHerpBarDerp or many other sequences.
In an early experimental version of HSL IO, write operations were queued, meaning that “FooBarHerpDerp” and “HerpDerpFooBar” would be the only possible results from the above code. This queueing was removed when callers were added that were essentially:
concurrent {
  await async {
    await $f->seekAsync(123);
    await $f->writeAllAsync("foo");
  };
  await $f->writeAllAsync("barbaz\n");
}
This was similarly undefined behavior; additionally, there is no guarantee that the “foo” write is the next write after the seek.
This queuing was removed as the only way to make multiple writers actually safe is to add application-level locking/queuing that spans multiple IO operations (such as Async\Semaphore, or linked-list-of-awaitable-style queues). Removing this built-in queuing also removed the need for methods like seek() and close() to be async.
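The interleaving problem and the application-level fix can be simulated without real IO. The sketch below uses Python's asyncio as a stand-in for Hack's async model; `ChunkyHandle` is a hypothetical in-memory handle whose writes accept at most 3 characters, and the lock plays the role of something like Async\Semaphore.

```python
import asyncio


class ChunkyHandle:
    """Hypothetical in-memory handle whose writes accept at most 3 characters."""

    def __init__(self) -> None:
        self.data = ""

    async def write_partial(self, text: str) -> int:
        await asyncio.sleep(0)  # yield to the scheduler, as a real write may
        accepted = text[:3]
        self.data += accepted
        return len(accepted)

    async def write_all(self, text: str) -> None:
        while text:
            written = await self.write_partial(text)
            text = text[written:]


async def unlocked() -> str:
    """Two concurrent writers with no coordination: chunks may interleave."""
    handle = ChunkyHandle()
    await asyncio.gather(handle.write_all("FooBar"), handle.write_all("HerpDerp"))
    return handle.data


async def locked() -> str:
    """The same writers, serialized by a lock owned by the application."""
    handle = ChunkyHandle()
    lock = asyncio.Lock()

    async def guarded(text: str) -> None:
        async with lock:  # may span several IO operations, e.g. seek + write
            await handle.write_all(text)

    await asyncio.gather(guarded("FooBar"), guarded("HerpDerp"))
    return handle.data
```

Running `asyncio.run(unlocked())` yields a mix of the two strings, while `asyncio.run(locked())` yields one string followed entirely by the other. The lock lives in application code because only the application knows which group of operations has to stay together.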
HSL IO/OS’s exception hierarchy is heavily influenced by Python 3; this is a substantial change from Python 2, described in https://www.python.org/dev/peps/pep-3151/. These tend to indicate the cause of failure, rather than what went wrong. For example, opening a file may fail with a FileNotFoundError, not a FileOpenError.
The primary goal for HSL IO was for the exception type to be sufficient to determine the appropriate action; in particular, we did not want this to become a common pattern:
try {
  ...
} catch (IO\Exception $e) {
  switch ($e->getErrno()) {
    case ENOENT:
      ...
    ...
  }
}
Instead, we wanted most code to look like this:
try {
  ...
} catch (OS\FileNotFoundException $e) {
  ...
}
This difference is very similar to the change made between Python 2 and 3; the Python 3 hierarchy meets HSL IO’s design goals and is well received by the Python community. While the first form is still possible (and necessary for some rare cases), it is not the usual pattern. Java and Ruby take a similar approach.
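For comparison, both forms can be written in Python 3 itself, where FileNotFoundError is a subclass of OSError. The two helper functions below are hypothetical and exist only to contrast the styles.

```python
import errno


def read_text_or_default(path: str, default: str = "") -> str:
    """Preferred form: the exception type alone selects the action."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return default


def read_text_or_default_errno(path: str, default: str = "") -> str:
    """Older form: catch the broad type, then inspect errno."""
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
        return default
```

Both functions behave the same for a missing file; the first states its intent in the `except` clause, while the second needs a re-raise to avoid swallowing unrelated errors.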
There is a more in-depth discussion and cross-language comparison at https://github.com/hhvm/hsl-experimental/issues/37
An early version of HSL IO did not have read-write handles; instead, you could have both a read and write handle for the same file. This would have reduced the number of interfaces significantly, and resolved some of the issues that the locking/queuing were also meant to address, assuming the two handles operated independently.
This feature was removed and replaced with ReadWrite handles as there is not adequate portable support in libc:
- dup() call, they are not truly independent, and behavior varies by platform. For example, a seek() on the ‘read’ FD may also seek the ‘write’ FD.

Rust, Python, and the Qt C++ library all provide disposable-like behavior; however, it is either optional, or additional language features (pointers, borrowing) or weaker restrictions (e.g. composability) address the problems.
Files and Sockets in Rust have Disposable-like behavior/restrictions, however the “borrowing” language feature provides a clean way to avoid the issues raised previously.
In Python, with is strongly encouraged, but is optional. This provides safety for simple cases, but allows the flexibility needed to address others (e.g. composition is permitted).
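A small sketch of what "optional" means in practice: the same object can be scoped with `with` in the simple case, or held as a private member when composition is needed. `Handle` and `Logger` are hypothetical classes written for this illustration.

```python
class Handle:
    """Hypothetical handle whose cleanup is available both ways."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

    def __enter__(self) -> "Handle":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


class Logger:
    """Composition: the handle is a private detail with a longer lifetime."""

    def __init__(self) -> None:
        self._out = Handle()  # no `with`, so it stays open between calls

    def shutdown(self) -> None:
        self._out.close()
```

In `with Handle() as h: ...` the handle is closed when the block exits, while `Logger` keeps its handle open until `shutdown()` is called explicitly.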
The QIODevice, QFile, and QTcpSocket classes in the Qt library for C++ are non-copyable like disposables and close on destruct; however, pointers and composition are permitted, avoiding the problems above.
Rust and Python take a similar approach to HSL IO: common operations are provided via high-level classes - however the raw file descriptor is available for niche operations. HSL IO differs in that an object wrapper is used (OS\FileDescriptor) instead of directly exposing the int.
The wrapper object is needed for correctness in CLI-server mode, and in a multi-request environment.
Buffered reading was split out to a separate class as mixing buffered and unbuffered reads leads to undefined and unintuitive behavior; for example, readByteAsync() may have read and cached 1024 bytes but only returned the first 1. Unbuffered reads need to remain possible for packet-based network protocols.
Rust (std::io::BufReader), Java (java.io.BufferedReader), and Python (io.BufferedReader) all take the same solution to this problem.
Python also offers implicit buffering when opening a file, however this makes all operations buffered - it does not mix them.
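The hazard of mixing buffered and unbuffered reads on one stream can be observed with Python's io module. This is a minimal sketch, assuming a regular temporary file and a deliberately small buffer; `demo_mixed_reads` is a hypothetical helper.

```python
import io
import os
import tempfile


def demo_mixed_reads() -> tuple:
    """Read 1 byte through a buffer, then 1 byte from the raw stream."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, bytes(range(64)))  # file contents: bytes 0x00 to 0x3f
        os.close(fd)
        raw = open(path, "rb", buffering=0)  # unbuffered io.FileIO
        buffered = io.BufferedReader(raw, buffer_size=16)
        first = buffered.read(1)  # returns 1 byte, but fills the buffer
        raw_pos = raw.tell()      # how far the underlying stream advanced
        second = raw.read(1)      # skips whatever the buffer is holding
        buffered.close()          # also closes the raw stream
        return first, raw_pos, second
    finally:
        os.remove(path)
```

The raw stream's position ends up past the single byte that was returned, so the follow-up raw read does not return the next byte of the file. Keeping buffering in a separate wrapper object makes it visible which view of the stream a caller is using.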
In Rust, Read and Write are distinct interfaces - however, File::open() always returns a File, which implements both, regardless of the open mode. This allows functions to specify that a parameter only needs to be readable, but it remains possible to pass a write-only file to such a function.
Java IO is similar to HSL IO for handling of read (via InputStream), write (via OutputStream), and read write (ByteChannel) for files - however these are created via separate functions/methods, and are not part of a File or Socket object. For example, a Socket does not extend OutputStream, but provides getInputStream and getOutputStream methods. This design encourages APIs to take InputStream or OutputStream parameters instead of Sockets, providing similar benefits to HSL IO’s IO\ReadHandle and IO\WriteHandle separation.
var_dump(), print_r(), etc: setting O_NONBLOCK on stderr breaks a lot of debugging utilities, especially with large outputs. Should they set O_NONBLOCK off temporarily, or should they be async, or use a hidden HH\Asio\join()?
What do we ‘encourage’ in www, e.g. via namespace aliasing?
- IO\: yes, especially the interfaces
- OS\: yes: while the functions in this namespace should very rarely be used directly, the exception classes should, and this is the correct place for them
- File\: maybe; we may want to only encourage wrappers instead, e.g. restricting what paths can be opened.
- TCP\, Unix\, Network\: yes: these will be rarely used, but are so niche that it is unlikely that general-purpose wrappers will be written.

In order to remove the PHP STDIN, STDOUT, STDERR constants, $_POST, $_GET and similar, we need to define a replacement. This should be built on HSL IO, and will effectively tie all Hack programs to some aspects of HSL IO’s design - in particular, if HSL IO handles were disposable, this would apply severe constraints to the design of Hack programs.
Suggestions have included:
<<__CLIEntryPoint>>
async function main(
  vec<string> $argv,
  vec<string> $envp,
  IO\ReadHandle $stdin,
  IO\WriteHandle $stdout,
  IO\WriteHandle $stderr,
): Awaitable<void> {}
<<__CLIEntryPoint>>
function main(CLIRequestContext $ctx): void {
  $argv = $ctx->getArgv();
  $stdin = $ctx->getStdin();
}
<<__WebEntryPoint>>
function main(WebRequestContext $ctx): void {}
<<__EntryPoint>>
async function main(): Awaitable<void> {
  $argv = OS\argv();
  $envp = OS\envp();
  $stdin = OS\stdin(); // an IO\CloseableReadFDHandle or throw
  $input = OS\request_input(); // CLI STDIN or HTTP POST data (php://input)
}
While the specific API does not matter at this point, this both constrains - and is constrained by - HSL IO’s design, especially if Disposables were used. If Disposables were used, the ‘virality’ issues above would be severe for every Hack program that does any form of IO, including sending HTML to the browser.
IO resources inherited from PHP are:
Once HSL IO is built-in, we can start to remove these.
There are various operations that are only appropriate for some handles, e.g.:
Each set of related operations is represented by an interface, e.g.:
- IO\ReadHandle
- IO\WriteHandle
- IO\SeekableHandle
- Network\Socket

As mentioned in ‘user experience’, each combination needs its own interface in order to be usable as a parameter type; for example, if a function needs to read, write, and seek, it takes an IO\SeekableReadWriteHandle.
Early versions of HSL IO contained manual definitions of the ‘reasonable’ combinations, but this was a frequent source of bugs due to oversights. Now the interfaces are generated, which means that if any new interface is added, the number of generated interfaces roughly doubles. Currently, there are 26.
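Why the count roughly doubles can be illustrated by enumerating combinations. The sketch below is a simplified Python model that counts every non-empty combination of capability names, giving 2^n - 1 for n capabilities. It is an unconstrained upper bound and is not expected to reproduce the figures quoted in this section, which depend on the generator's own rules. The capability names are illustrative.

```python
from itertools import combinations


def combined_interfaces(capabilities: list) -> list:
    """Every non-empty combination of capabilities, as a sorted tuple."""
    names = sorted(capabilities)
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]
```

Under this model, each added capability can be combined with every existing combination, which is where the doubling comes from: the count goes from 2^n - 1 to 2^(n+1) - 1.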
There have been requests/suggestions for additional interfaces, e.g.:
If these (or any other 3) were added, there would be 245 generated interfaces; if 5 were added, we’d get to 1010. This is not scalable.
We will either need to be extremely selective about what new interfaces are added, or we will need a new way to represent these types. Denotable intersection types are the clearest fit, as these empty interfaces are effectively defining intersections:
- interface ReadWriteHandle extends ReadHandle, WriteHandle {}
+ type ReadWriteHandle = ReadHandle & WriteHandle;
The main advantage is that the intersections do not need to be predefined: currently, if a function needs to read() and seek(), it should take an IO\SeekableReadHandle, which must have been predefined in the HSL, and implemented by the concrete handle being passed.
If denotable intersection types were supported:
- If IO\SeekableReadHandle was defined as an intersection, it would not need to be explicitly present in the type hierarchy of the concrete implementations
- IO\SeekableReadHandle does not need to be predefined: a function could take an IO\SeekableHandle & IO\ReadHandle parameter instead.

This would:
While a non-disposable IO API is essential as a building block for higher-level libraries (e.g. redis clients), a Disposable API is highly desirable for operations we commonly think of as IO, like opening files and directly interacting with a TCP socket. This can be built on top of a non-Disposable API, but a non-Disposable API can not be built on top of a Disposable API.
Some ways this could be implemented:
A File\DisposableReadWriteHandle could wrap a File\ReadWriteHandle; functions should aim to specify their parameters as <<__AcceptDisposable>> wherever possible.
This requires another interface (and the corresponding doubling of generated interfaces), and hand-written wrappers for each instantiable handle class.
The user experience would be similar to Python’s with:
class File implements ..., IOptionallyDisposable {
  // ...
  public function __dispose(): void {
    $this->close();
  }
}
// No `using`, __dispose is simply not called:
$f = new File('/tmp/foo');
// ... so must manually close:
$f->close();
using $f = new File('/tmp/bar');
// automatically closed - but you can call `$f->close()` earlier if you want
This should be a separate HIP, but I believe it would be the best approach:
There is one major issue with this concept though:
function takes_rh(IO\ReadHandle $_): void {}
function takes_drh(<<__AcceptDisposable>> IO\ReadHandle $_): void {}
takes_rh(new File('/tmp/foo')); // No error
using $f = new File('/tmp/foo');
takes_drh($f); // No error
takes_rh($f); // error?
Error: it is being used as a disposable, so should act like one
No error: IOptionallyDisposable would primarily be useful for convenience; it should not be considered an ‘ownership’ model. As such, it is likely that __dispose() calls another public method (like close()), so if something keeps hold of the handle after __dispose was called, it may be in a bad state, but this state would be reachable even if it wasn’t disposable.