roadmap/implementers-guide/src/node/utility/pvf-host-and-workers.md
The PVF host is responsible for handling requests to prepare and execute PVF code blobs, which it sends to PVF workers running in their own child processes.
This system has two high-levels goals that we will touch on here: determinism and security.
One high-level goal is to make PVF operations as deterministic as possible, to reduce the rate of disputes. Disputes can happen due to e.g. a job timing out on one machine, but not another. While we do not have full determinism, there are some dispute reduction mechanisms in place right now.
If the execution request fails during preparation, we will retry if it is possible that the preparation error was transient (e.g. if the error was a panic or time out). We will only retry preparation if another request comes in after 15 minutes, to ensure any potential transient conditions had time to be resolved. We will retry up to 5 times.
If the actual execution of the artifact fails, we will retry once if it was a possibly transient error, to allow the conditions that led to the error to hopefully resolve. We use a more brief delay here (1 second as opposed to 15 minutes for preparation (see above)), because a successful execution must happen in a short amount of time.
We currently know of the following specific cases that will lead to a retried execution request:
We use timeouts for both preparation and execution jobs to limit the amount of time they can take. As the time for a job can vary depending on the machine and load on the machine, this can potentially lead to disputes where some validators successfuly execute a PVF and others don't.
One dispute mitigation we have in place is a more lenient timeout for preparation during execution than during pre-checking. The rationale is that the PVF has already passed pre-checking, so we know it should be valid, and we allow it to take longer than expected, as this is likely due to an issue with the machine and not the PVF.
Another timeout-related mitigation we employ is to measure the time taken by jobs using CPU time, rather than wall clock time. This is because the CPU time of a process is less variable under different system conditions. When the overall system is under heavy load, the wall clock time of a job is affected more than the CPU time.
In general, for errors not raising a dispute we have to be very careful. This is only sound, if we either:
Reasoning: Otherwise it would be possible to register a PVF where candidates can not be checked, but we don't get a dispute - so nobody gets punished. Second, we end up with a finality stall that is not going to resolve!
There are some error conditions where we can't be sure whether the candidate is really invalid or some internal glitch occurred, e.g. panics. Whenever we are unsure, we can never treat an error as internal as we would abstain from voting. So we will first retry the candidate, and if the issue persists we are forced to vote invalid.
With on-demand parachains, it is much easier to submit PVFs to the chain for preparation and execution. This makes it easier for erroneous disputes and slashing to occur, whether intentional (as a result of a malicious attacker) or not (a bug or operator error occurred).
Therefore, another goal of ours is to harden our security around PVFs, in order to protect the economic interests of validators and increase overall confidence in the system.
Webassembly is already sandboxed, but there have already been reported multiple CVEs enabling remote code execution. See e.g. these two advisories from Mar 2023 and Jul 2022.
So what are we actually worried about? Things that come to mind:
polkadot binary
with a polkadot-evil binary. Should again not be possible with the above
sandboxing in place.A basic security mechanism is to make sure that any thread directly interfacing with untrusted code does not have access to the file-system. This provides some protection against attackers accessing sensitive data or modifying data on the host machine.
We clear environment variables before handling untrusted code, because why give attackers potentially sensitive data unnecessarily? And even if everything else is locked down, env vars can potentially provide a source of randomness (see point 1, "Consensus faults" above).