TODO.md
Many manager configuration settings that are only applicable to user manager or system manager can be always set. It would be better to reject them when parsing config.
Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
Fedora: add an rpmlint check that verifies that all unit files in the RPM are listed in %systemd_post macros.
dbus:
fedora: suggest auto-restart on failure, but not on success and not on coredump. also, ask people to think about changing the start limit logic. Also point people to RestartPreventExitStatus=, SuccessExitStatus=
neither pkexec nor sudo initialize environ[] from the PAM environment?
fedora: update policy to declare access mode and ownership of unit files to root:root 0644, and add an rpmlint check for it
missing shell completions:
<command> <verb> -<TAB> should complete options, but currently does not
systemctl status should know about 'systemd-analyze calendar ... --iterations='
If timer has just OnInactiveSec=..., it should fire after a specified time after being started.
write blog stories about:
look for close() vs. close_nointr() vs. close_nointr_nofail()
check for strerror(r) instead of strerror(-r)
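A minimal sketch of the bug class behind the strerror() item, written in Python rather than C. It assumes the usual convention that helpers return a negative errno value on failure, so the value has to be negated before it is turned into a message. The helper names `open_config()` and `describe()` are hypothetical.

```python
import errno
import os


def open_config(path):
    """Hypothetical helper following the 'negative errno on failure' convention."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as e:
        return -e.errno
    os.close(fd)
    return 0


def describe(r):
    """Format a return value; the negation is the part that is easy to forget."""
    if r >= 0:
        return "success"
    return os.strerror(-r)
```

Passing `r` instead of `-r` hands a negative number to the error formatter, which does not correspond to any errno value.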
pahole
set_put(), hashmap_put() return values check. i.e. == 0 does not free()!
link up selected blog stories from man pages and unit files Documentation= fields
machined: make remaining machine bus calls compatible with unpriv machined + unpriv nspawn: GetAddresses(), GetSSHInfo(), GetOSRelease(), OpenPTY(), OpenLogin(), OpenShell(), BindMount(), CopyFrom(), CopyTo(), OpenRootDirectory(). Similar for images: GetHostname(), GetMachineID(), GetMachineInfo(), GetOSRelease().
rework mount.c and swap.c to follow proper state enumeration/deserialization semantics, like we do for device.c now
Replace our fstype_is_network() with a call to libmount's mnt_fstype_is_netfs()? Having two lists is not nice, but maybe it's not worth making a dependency on libmount for something so trivial.
drop set_free_free() and switch things over from string_hash_ops to string_hash_ops_free everywhere, so that destruction is implicit rather than explicit. Similarly, for other special hashmap/set/ordered_hashmap destructors.
generators sometimes apply C escaping and sometimes specifier escaping to paths and similar strings they write out. Sometimes both. We should clean this up, and should probably always apply both, i.e. introduce unit_file_escape() or so, which applies both.
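A rough sketch of what a combined helper for the escaping item above could look like, in Python for brevity. The function names are hypothetical, and `c_escape()` is a deliberately simplified stand-in that covers only a few characters, not a reimplementation of the real C escaping code.

```python
def specifier_escape(s):
    """Double every '%' so it is not interpreted as a unit file specifier."""
    return s.replace("%", "%%")


def c_escape(s):
    """Simplified C-style escaping: backslash, quotes and a few control characters."""
    table = {
        "\\": "\\\\",
        '"': '\\"',
        "'": "\\'",
        "\n": "\\n",
        "\t": "\\t",
        "\r": "\\r",
    }
    return "".join(table.get(ch, ch) for ch in s)


def unit_file_escape(s):
    """Hypothetical combined helper: always apply both kinds of escaping."""
    return specifier_escape(c_escape(s))
```

Since the two escapings touch disjoint sets of characters, the order in which they are applied does not matter in this sketch.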
xopenat() should pin the parent dir of the inode it creates before doing its thing, so that it can create, open, label somewhat atomically.
use CHASE_MUST_BE_DIRECTORY and CHASE_MUST_BE_REGULAR at more places (the majority of places that currently employ chase() probably should use this)
Remove any support for booting without /usr pre-mounted in the initrd entirely. Update INITRD_INTERFACE.md accordingly.
remove cgroups v1 support (overdue since EOY 2023). As per https://lists.freedesktop.org/archives/systemd-devel/2022-July/048120.html and then rework cgroupsv2 support around fds, i.e. keep one fd per active unit around, and always operate on that, instead of cgroup fs paths.
drop support for LOOP_CONFIGURE-less loopback block devices, once kernel baseline is 5.8.
Remove /dev/mem ACPI FPDT parsing when /sys/firmware/acpi/fpdt is ubiquitous. That requires distros to enable CONFIG_ACPI_FPDT, and have kernels v5.12 for x86 and v6.2 for arm.
Remove support for deprecated FactoryReset EFI variable in systemd-repart, replaced by FactoryResetRequest (was planned for v260).
Consider removing root=gpt-auto, and push people to use root=dissect instead.
remove any trace of "cpuacct" cgroup controller, it's a cgroupv1 thing. similar "devices"
bootctl set-tries for setting retry counters on boot entries
report: allow to compile statically (together with the basic and cgroup backends)
report: make sure backends can also be invoked via forking off
report: backend that extracts 10 most recent log msgs of a certain priority
implement enough of PCP in a new sd-pcp-client library that networkd can use to punch holes for wireguard into common NAT routers.
measure an uapi16 manifest of /etc/ during early boot (so that pre-initialized /etc/ can be detected when systems are enrolled into some subsystem)
optionally turn off import of imds on non-firstboot creds (so that IMDS can be considered an attack vector, except for TOFU)
store workload identity OIDC server contact info in cloud imds hwdb.
systemd-analyze unit-shell-me-harder that has both host and unit trees around but mostly lives in unit namespaces
os-release consumption at boot: version validation, and maybe in os-release
ed25519 authentication for sd-boot upgrades for the dm-verity key logic
change machine tags into key/value pairs instead of just labels
in sysupdate resolve %C or so as specifier in transfer files to the value of a specific machine tag channel= or so.
make vmspawn parse UKIs for direct kernel boot
portabled driving by system credential
sysinstall: add fully automatic mode that automatically picks target disk, non-interactively. Should wait to ensure system is up for a certain amount of minimal time (alternatively: certain amount of time since the last disk showed up), to ensure disks have shown up before making the decision. Use case for this: redfish style server provisioning.
nspawn: optionally provide a /dev/tpm0 + /dev/tpmrm0 that is backed by swtpm, much like we do in vmspawn. lets us minimize differences between environments systemd runs in.
nspawn/vmspawn: add a concept how we can hand into the payload some proof that it is running on a certain host, which it can then include in the report, and which allows us to put together a map about which node runs as payload of which other node. in particular useful for transient nodes, as it gives them a better location
add a small varlink service that wraps the raw sftp logic (without ssh) after a varlink protocol upgrade, which enables varlink clients to do file transfers, which is in particular useful when accessing a system via http varlink proxy
add a small varlink service that allocates a pty and then does ptyfwd stuff after a protocol upgrade on the incoming connection. Then spawn a shell/getty on it. This enables varlink clients to acquire a fully featured ssh-like interactive tty/shell via varlink, which is again useful via http varlink proxy.
add something like podman's conmon as a native systemd subsystem: i.e. allocate ptys, that can be bound to stdio/console of containers and VMs, that maintain a bit of a scrollback buffer, and one can reconnect to later. fun idea: might even make /dev/tty1 and friends accessible via /dev/vcsa1 under the same protocol. this subsystem should potentially be the same as the varlink ssh-like thing listed above.
maybe introduce a new ANSI sequence that allows propagating SIGWINCH inline. Idea would be: to enable inline notification of window sizes, the client sends a new, to-be-defined ANSI sequence with its current assumption of the terminal size. The server compares it with the current state. If it is the same, it sends nothing immediately, but does send exactly one update if the size changes, and then disables the logic. If it is not the same, it sends a correction immediately, and disables the logic. The client has to reissue the sequence immediately after getting a notification to get live updates. Benefit of all of this: better terminal experience if we just forward terminal bytes through a serial link/stream connection, as terminal sizes will be properly propagated. Write a UAPI spec for all this. ptyfwd could turn upstream SIGWINCH into upstream sequences of this type, so that every step of the way we get the right behaviour.
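The server side of the proposed window size notification can be sketched as a small one-shot state machine. This is only an illustration in Python under the rules stated in the item above. The class and method names are hypothetical, and the `sent` list stands in for whatever would actually be written to the wire.

```python
class WindowSizeNotifier:
    """Server side of the one-shot window size notification idea."""

    def __init__(self, rows, cols):
        self.size = (rows, cols)  # terminal size currently known to the server
        self.armed = False        # True while exactly one update is still owed
        self.sent = []            # notifications emitted (stand-in for the wire)

    def on_client_request(self, rows, cols):
        """The client sent the sequence carrying its assumed terminal size."""
        if (rows, cols) == self.size:
            # Client is up to date: stay quiet, but owe it one future update.
            self.armed = True
        else:
            # Client is out of date: correct it right away, logic stays off.
            self.armed = False
            self._notify()

    def on_resize(self, rows, cols):
        """The local terminal size changed (the SIGWINCH case)."""
        if (rows, cols) == self.size:
            return
        self.size = (rows, cols)
        if self.armed:
            # Send exactly one update, then disable until the client re-arms.
            self.armed = False
            self._notify()

    def _notify(self):
        self.sent.append(self.size)
```

The one-shot design means a client that never re-issues the sequence simply stops receiving updates, so a server never sends notifications to a peer that does not understand them.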
now that the kernel supports xattrs on sockets: mark varlink entrypoint sockets, server side of varlink sockets, and client sides of varlink sockets with distinct xattrs to make them recognizable (similar maybe for our other protocols, such as syslog, journal native entry point). For entrypoints might require new .socket unit setting.
implement "varlinkctl trace" or so, that watches socket traffic on a group of processes (select by pid, select by cgroup, select by all machine), and shows traffic of all sockets marked via the new varlink socket xattrs. Use BPF for all of that of course.
systemd-report: implement signing via callout varlink dir
add tooling for generating dictionary-based hostnames
do not pull dbus daemon/broker anymore, instead lazy activate it. Given how the Varlinkification has progressed various non-desktop use cases might not need D-Bus running at all anymore.
format-table: introduce the concept of a "title" for a table, which remains closely associated with the table. in most cases where we want to output multiple tables from the same tool we want to separate things with a title, hence we might as well associate the title with the table itself, and streamline a few things.
allow metrics to indicate which values mean "nothing"/"invalid"/"zero"/"please-suppress". Then use that to reduce noise in systemd-report output.
cgroup-metrics: add per-cgroup PSI metrics
sysupdate: offer reading transfer files/components/features optionally from some JSON fragment rather than transfer files, so that we can update it independently from any DDI, and it needs no activation cycle. Why? So that making additional transfers/components/features available can be done without reloading confext/sysext, and out-of-band with other configuration changes.
sysupdate: go through all components, and update them all, one by one.
sysupdate: add concept for enabling/disabling specific components explicitly, just like features.
udev: add a MACHINE_TAGS field, that augments /etc/machine-info configured tags.
hostnamectl: management, collation of all tags. four sources: udev, /etc/machine-info, credentials, and /etc/machine-tags.d/*.conf
sysupdate: add conditions to transfer files, copying what we have for unit files and .network files
pid1,sysupdate,network: add support for a new "tags" condition, that checks all of the above.
sysupdate: write out database of all files created, and support gc of it
pcrextend: we probably should measure /etc/machine-info during boot somehow
pcrextend: we should measure something when we enter developer mode, by some definition of developer mode.
firstboot: optionally accept credentials at firstboot without authentication
firstboot/sysinstall: add simple interface for prompting users to enable "features" exposed by sysupdate.
bootctl link + sysupdate integration
a tool that can prep credentials, put them in the ESP, for provisioning systems for SBC or UEFI/HTTP boot. Should be doing what sysinstall does with the credentials, and maybe even be sysinstall.
make sure we always pass O_NOFOLLOW on O_CREAT
xopenat(): maybe imply O_NOFOLLOW on O_CREAT
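A small sketch of the "imply O_NOFOLLOW on O_CREAT" idea from the two items above, in Python for brevity. The wrapper name `open_nofollow_on_create()` is hypothetical and is not the xopenat() API. It only shows the flag handling.

```python
import os


def open_nofollow_on_create(path, flags, mode=0o644):
    """Hypothetical wrapper: whenever O_CREAT is requested, also set O_NOFOLLOW.

    This keeps a create operation from following a symlink in the final
    path component and creating a file somewhere unexpected.
    """
    if flags & os.O_CREAT:
        flags |= os.O_NOFOLLOW
    return os.open(path, flags, mode)
```

With this in place, creating a file through a dangling symlink fails on Linux with ELOOP instead of creating the symlink's target.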
StorageProvider interface + storagectl
clean up credential naming a bit: let's say encrypted creds always should carry .cred suffix, and unencrypted should not.
clean up naming of sidecar files in sd-stub: let's put global ones strictly into /loader/extras/
a small tool that can do basic btrfs raid policy mgmt. i.e. gets started as part of the initial transaction for some btrfs raid fs, waits for some time, then puts message on screen (plymouth, console) that some devices apparently are not showing up, then counts down, eventually sets a flag somewhere, and retriggers the fs it was invoked for, which causes the udev rules to rerun that assemble the btrfs raid, but this time force degraded assembly.
add a report backend that simply exposes a bunch of static files that are symlinked to some dir {/run,/etc/,/var/lib/}systemd/report-files/ or so as facts. Use that for exposing SSH keys and suchlike.
report generators for:
a way for container managers to turn off getty starting via $container_headless= or so...
add "conditions" for bls type 1 and type 2 profiles that allow suppressing them under various conditions: 1. if tpm2 is available or not available; 2. if sb is on or off; 3. if we are netbooted or not; …
add "homectl export" and "homectl import" that gets you an "atomic" snapshot of your homedir, i.e. either a tarball or a snapshot of the underlying disk (use FREEZE/THAW to make it consistent, btrfs snapshots)
Add "purpose" flag to partition flags in discoverable partition spec that indicates if partition is intended for sysext, for portable service, for booting and so on. Then, when dissecting DDI allow specifying a purpose to use as additional search condition. Use case: images that combine a sysext partition with a portable service partition in one.
systemd-sysinstall:
repart: add MatchLabel= which matches against partition label, so that we truly can install different images in parallel
add "systemctl wait" or so, which does what "systemd-run --wait" does, but for all units. It should be both a way to pin units into memory as well as a way to retrieve their exit data.
add "systemd-analyze debug" + AttachDebugger= in unit files: The former specifies a command to execute; the latter specifies that an already running "systemd-analyze debug" instance shall be contacted and execution paused until it gives an OK. That way, tools like gdb or strace can safely be invoked on processes forked off PID 1.
add "systemd-sysext identify" verb, that you can point on any file in /usr/ and that determines from which overlayfs layer it originates, which image, and with what it was signed.
add --vacuum-xyz options to coredumpctl, matching those journalctl already has.
sysupdate: in .transfer files have a 2nd url that is used if we auto-rollbacked the OS before.
sysupdate: optionally enrich URL with countme=1 once a week
sysupdate: have an explicit concept of update policies: i.e. a choice of at least
Add a "systemctl list-units --by-slice" mode or so, which rearranges the output of "systemctl list-units" slightly by showing the tree structure of the slices, and the units attached to them.
Add a concept of ListenStream=anonymous to socket units: listen on a socket that is deleted in the fs. Use case would be with ConnectSocket= below.
add a ConnectSocket= setting to service unit files, that may reference a socket unit, and which will connect to the socket defined therein, and pass the resulting fd to the service program via socket activation proto.
add a dbus call to generate target from current state
add a dependency on standard-conf.xml and other included files to man pages
add a job mode that will fail if a transaction would mean stopping running units. Use this in timedated to manage the NTP service state. https://lists.freedesktop.org/archives/systemd-devel/2015-April/030229.html
add a kernel cmdline switch (and cred?) for marking a system to be "headless", in which case we never open /dev/console for reading, only for writing. This would then mean: systemd-firstboot would process creds but not ask interactively, getty would not be started and so on.
add a Load= setting which takes literal data in text or base64 format, and puts it into a memfd, and passes that. This enables some fun stuff, such as embedding bash scripts in unit files, by combining Load= with ExecStart=/bin/bash /proc/self/fd/3
add a mechanism we can drop capabilities from pid1 before transitioning from initrd to host. i.e. before we transition into the slightly lower trust domain that is the host system we might want to get rid of some caps. Example: CAP_SYS_BPF in the signed bpf loading logic above. (We already have CapabilityBoundingSet= in system.conf, but that is enforced when pid 1 initializes, rather than when it transitions to the next.)
add a new "debug" job mode, that is propagated to unit_start() and for services results in two things: we raise SIGSTOP right before invoking execve() and turn off watchdog support. Then, use that to implement "systemd-gdb" for attaching to the start-up of any system service in its natural habitat.
add a new flag to chase() that stops chasing once the first missing component is found and then allows the caller to create the rest.
add a new PE binary section ".mokkeys" or so which sd-stub will insert into Mok keyring, by overriding/extending whatever shim sets in the EFI var. Benefit: we can extend the kernel module keyring at ukify time, i.e. without recompiling the kernel, taking an upstream OS' kernel and adding a local key to it.
add a new specifier to unit files that figures out the DDI the unit file is from, tracing through overlayfs, DM, loopback block device.
add a new switch --auto-definitions=yes/no or so to systemd-repart. If specified, synthesize a definition automatically if we can: enlarge last partition on disk, but only if it is marked for growing and not read-only.
add a new syscall group "@esoteric" for more esoteric stuff such as bpf() and userfaultfd() and make systemd-analyze check for it.
Add a new verb "systemctl top"
add a pam module that on password changes updates any LUKS slot where the password matches
add a percentage syntax for TimeoutStopSec=, e.g. TimeoutStopSec=150%, and then use that for the setting used in [email protected]. It should be understood relative to the configured default value.
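A minimal sketch of how the percentage syntax from the item above could resolve against a configured default, in Python for brevity. The function name is hypothetical, and the sketch accepts only a percentage or a plain number of seconds, not the full time span syntax.

```python
def resolve_timeout(value, default_sec):
    """Resolve 'N%' relative to default_sec; otherwise parse plain seconds.

    Assumption: only percentages and bare second values are handled here.
    """
    value = value.strip()
    if value.endswith("%"):
        percent = float(value[:-1])
        if percent < 0:
            raise ValueError("percentage must not be negative")
        return default_sec * percent / 100.0
    return float(value)
```

With a default of 90 seconds, `resolve_timeout("150%", 90)` gives 135 seconds, so the unit keeps scaling with the default if an administrator changes it.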
add a plugin for factory reset logic that erases certain parts of the ESP, but leaves others in place.
add a proper concept of a "developer" mode, i.e. where cryptographic protections of the root OS are weakened after interactive confirmation, to allow hackers to run their own stuff. idea: allow entering developer mode only via explicit choice in boot menu: i.e. add explicit boot menu item for it. When developer mode is entered, generate a key pair in the TPM2, and add the public part of it automatically to keychain of valid code signature keys on subsequent boots. Then provide a tool to sign code with the key in the TPM2. Ensure that boot menu item is the only way to enter developer mode, by binding it to locality/PCRs so that keys cannot be generated otherwise.
add a system-wide seccomp filter list for syscalls, kill "acct()" "@obsolete" and a few other legacy syscalls that way.
add a test if all entries in the catalog are properly formatted. (Adding dashes in a catalog entry currently results in the catalog entry being silently skipped. journalctl --update-catalog must warn about this, and we should also have a unit test to check that all our messages are OK.)
add a utility that can be used with the kernel's CONFIG_STATIC_USERMODEHELPER_PATH and then handles them within pid1 so that security, resource management and cgroup settings can be enforced properly for all umh processes.
add a way to lock down cgroup migration: a boolean, which when set for a unit makes sure the processes in it can never migrate out of it
add ability to path_is_valid() to classify paths that refer to a dir from those which may refer to anything, and use that in various places to filter early. i.e. stuff ending in "/", "/." and "/.." definitely refers to a directory, and paths ending that way can be refused early in many contexts.
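The classification described in the item above can be sketched as a pure string check, shown here in Python. The function name is hypothetical. It only looks at the suffixes named in the item and never touches the file system.

```python
def path_refers_to_directory(path):
    """True if the form of the path alone guarantees that it names a directory.

    Paths ending in "/", "/." or "/.." can only refer to a directory;
    anything else may refer to any kind of inode.
    """
    return path.endswith(("/", "/.", "/.."))
```

A caller that expects a regular file could refuse such a path early, before attempting to open anything.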
Add ACL-based access management to .socket units. i.e. add AllowPeerUser= + AllowPeerGroup= that installs additional user/group ACL entries on AF_UNIX sockets.
Add AddUser= setting to unit files, similar to DynamicUser=1 which however creates a static, persistent user rather than a dynamic, transient user. We can leverage code from sysusers.d for this.
add an explicit parser for LimitRTPRIO= that verifies the specified range and generates sane error messages for incorrect specifications.
Add and pickup tpm2 metadata for creds structure.
add another PE section ".fname" or so that encodes the intended filename for PE file, and validate that when loading add-ons and similar before using it. This is particularly relevant when we load multiple add-ons and want to sort them to apply them in a defined order. The order should not be under control of the attacker.
add bus API for creating unit files in /etc, reusing the code for transient units
add bus api to query unit file's X fields.
add bus API to remove unit files from /etc
add bus API to retrieve current unit file contents (i.e. implement "systemctl cat" on the bus only)
Make ConditionDirectoryNotEmpty= handle non-absolute paths as a search path or add ConditionConfigSearchPathNotEmpty= or different syntax? See the discussion starting at https://github.com/systemd/systemd/pull/15109#issuecomment-607740136.
add CopyFile= or so as unit file setting that may be used to copy files or directory trees from the host to the service's RootImage= and RootDirectory= environment. Which we can use for /etc/machine-id and in particular /etc/resolv.conf. Should be smart and do something useful on read-only images, for example fall back to read-only bind mounting the file instead.
Add ELF section to make systemd main binary recognizable cleanly, the same way as we make sd-boot recognizable via PE section.
Add ExecMonitor= setting. May be used multiple times. Forks off a process in the service cgroup, which is supposed to monitor the service, and when it exits the service is considered failed by its monitor.
add field to bls type 1 and type 2 profiles that ensures an item is never considered for automatic selection
add generator that pulls in systemd-network from containers when CAP_NET_ADMIN is set, more than the loopback device is defined, even when it is otherwise off
add growvol and makevol options for /etc/crypttab, similar to x-systemd.growfs and x-systemd.makefs.
Add knob to cryptsetup, to trigger automatic reboot on failure to unlock disk. Enable this by default for rootfs, also in gpt-auto-generator
add linker script that implicitly adds symbol for build ID and new coredump json package metadata, and use that when logging
add new gpt type for btrfs volumes
add new tool that can be used in debug mode runs in very early boot, generates a random password, passes it as credential to sysusers for the root user, then displays it on screen. people can use this to remotely log in.
add option to sockets to avoid activation. Instead just drop packets/connections, see http://cyberelk.net/tim/2012/02/15/portreserve-systemd-solution/
add PR_SET_DUMPABLE service setting
add proper .osrel matching for PE addons. i.e. refuse applying an addon intended for a different OS. Take inspiration from how confext/sysext are matched against OS.
add proper dbus APIs for the various sd_notify() commands, such as MAINPID=1 and so on, which would mean we could report errors and such.
add service file setting to force the fwmark (a la SO_MARK) to some value, so that we can allowlist certain services for imds this way.
Add service unit setting ConnectStream= which takes IP addresses and connects to them.
add some optional flag to ReadWritePaths= and friends, that has the effect that we create the dir in question when the service is started. Example:
ReadWritePaths=:/var/lib/foobar
add some service that makes an atomic snapshot of PCR state and event log up to that point available, possibly even with quote by the TPM.
add some special mode to LogsDirectory=/StateDirectory=… that allows declaring these directories without necessarily pulling in deps for them, or creating them when starting up. That way, we could declare that systemd-journald writes to /var/log/journal, which could be useful when we are doing disk usage calculations and so on.
add support for "portablectl attach http://foobar.com/waaa.raw" (i.e. importd integration)
add support for activating nvme-oF devices at boot automatically via kernel cmdline, and maybe even support a syntax such as root=nvme:<trtype>:<traddr>:<trsvcid>:<nqn>:<partition> to boot directly from nvme-oF
add support for asymmetric LUKS2 TPM based encryption. i.e. allow preparing an encrypted image on some host given a public key belonging to a specific other host, so that only hosts possessing the private key in the TPM2 chip can decrypt the volume key and activate the volume. Use case: systemd-confext for a central orchestrator to generate confext images securely that can only be activated on one specific host (which can be used for installing a bunch of creds in /etc/credstore/ for example). Extending on this: allow binding LUKS2 TPM based encryption also to the TPM2 internal clock. Net result: prepare a confext image that can only be activated on a specific host that runs a specific software in a specific time window. confext would be automatically invalidated outside of it.
Add support for extra verity configuration options to systemd-repart (FEC, hash type, etc)
Add SUPPORT_END_URL= field to os-release with more actionable information what to do if support ended
Add systemd-analyze security checks for RestrictFileSystems= and RestrictNetworkInterfaces=
Add [email protected] which is instantiated for a block device and invokes systemd-mount and exits. This is then useful to use in ENV{SYSTEMD_WANTS} in udev rules, and a bit prettier than using RUN+=
Add systemd-sysupdate-initrd.service or so that runs systemd-sysupdate in the initrd to bootstrap the initrd to populate the initial partitions. Some things to figure out:
add systemd.abort_on_kill or some other such flag to send SIGABRT instead of SIGKILL (throughout the codebase, not only PID1)
Add UKI profile conditioning so that profiles are only available if secure boot is turned off, or only on. similar, add conditions on TPM availability, network boot, and other conditions.
Allocate UIDs/GIDs automatically in userdbctl load-credentials if none are included in the user/group record credentials
allow dynamic modifications of ConcurrencyHardMax= and ConcurrencySoftMax= via DBus (and with that also by daemon-reload). Similar for portabled.
also include packaging metadata (à la https://systemd.io/PACKAGE_METADATA_FOR_EXECUTABLE_FILES/) in our UEFI PE binaries, using the same JSON format.
also parse out primary GPT disk label uuid from gpt partition device path at boot and pass it as efi var to OS.
as soon as we have sender timestamps, revisit coalescing multiple parallel daemon reloads: https://lists.freedesktop.org/archives/systemd-devel/2014-December/025862.html
augment CODE_FILE=, CODE_LINE= with something like CODE_BASE= or so which contains some identifier for the project, which allows us to include clickable links to source files generating these log messages. The identifier could be some abbreviated URL prefix or so (taking inspiration from Go imports). For example, for systemd we could use CODE_BASE=github.com/systemd/systemd/blob/98b0b1123cc or so which is sufficient to build a link by prefixing "http://" and suffixing the CODE_FILE.
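A sketch of the link construction described in the CODE_BASE= item above, in Python. The function name is hypothetical. The "/" separator between the two fields and the "#L" line anchor are assumptions (the latter matches how GitHub addresses lines), and the file name in the usage note is purely illustrative.

```python
def source_link(code_base, code_file, code_line=None):
    """Build a clickable link from CODE_BASE=, CODE_FILE= and optionally CODE_LINE=.

    Prefixes "http://" to the code base and appends the file path.
    """
    url = "http://" + code_base.rstrip("/") + "/" + code_file.lstrip("/")
    if code_line is not None:
        # Assumption: GitHub-style line anchor; other hosts may differ.
        url += "#L%d" % int(code_line)
    return url
```

For example, `source_link("github.com/systemd/systemd/blob/98b0b1123cc", "src/core/main.c")` yields `http://github.com/systemd/systemd/blob/98b0b1123cc/src/core/main.c`.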
Augment MESSAGE_ID with MESSAGE_BASE, in a similar fashion so that we can make clickable links from log messages carrying a MESSAGE_ID, that lead to some explanatory text online.
automatic boot assessment: add one more default success check that just waits for a bit after boot, and blesses the boot if the system stayed up that long.
automatically ignore threaded cgroups in cg_xyz().
automatically mount one virtiofs during early boot phase to /run/host/, similar to how we do that for nspawn, based on some clear tag.
automatically propagate LUKS password credential into cryptsetup from host (i.e. SMBIOS type #11, …), so that one can unlock LUKS via VM hypervisor supplied password.
automatically reset specific EFI vars on factory reset (make this generic enough so that infra can be used to erase shim's mok vars?)
be able to specify a forced restart of service A, which service B depends on, in case B needs to be auto-respawned?
be more careful what we export on the bus as (usec_t) 0 and (usec_t) -1
beef up log.c with support for stripping ANSI sequences from strings, so that it is OK to include them in log strings. This would be particularly useful so that our log messages could contain clickable links for example for unit files and suchlike we operate on.
beef up pam_systemd to take unit file settings such as cgroups properties as parameters
blog about fd store and restartable services
bootctl:
BootLoaderSpec: define a way how an installer can figure out whether a BLS compliant boot loader is installed.
BootLoaderSpec: document @saved pseudo-entry, update mention in BLI
bootspec: permit graceful "update" from type #2 to type #1. If both a type #1 and a type #2 entry exist under otherwise the exact same name, then use the type #1 entry, and ignore the type #2 entry. This way, people can "upgrade" from the UKI with all parameters baked in to a Type #1 .conf file with manual parametrization, if needed. This matches our usual rule that admin config should win over vendor defaults.
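The precedence rule from the bootspec item above can be sketched as follows, in Python. The function name and the representation of entries as `(name, entry_type)` pairs are hypothetical. Real entries would first have to be reduced to a common name, for instance by dropping the file suffix.

```python
def select_entries(entries):
    """Pick one entry per name, preferring Type #1 over Type #2.

    entries: iterable of (name, entry_type) pairs, entry_type being 1 or 2.
    Returns a dict mapping each name to the entry type that wins.
    """
    chosen = {}
    for name, entry_type in entries:
        if name not in chosen or entry_type < chosen[name]:
            chosen[name] = entry_type
    return chosen
```

An entry that exists only as Type #2 is still used; the Type #2 entry is ignored only where a Type #1 entry of the same name exists.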
bpf: see if we can address opportunistic inode sharing of immutable fs images with BPF. i.e. if bpf gives us the power to hook into openat() and return a different inode than the one requested, which however has the same contents, then we can use that to implement opportunistic inode sharing among DDIs: make all DDIs ship xattr on all reg files with a SHA256 hash. Then, also dictate that DDIs should come with a top-level subdir where all reg files are linked into by their SHA256 sum. Then, whenever an inode is opened with the xattr set, check bpf table to find dirs with hashes for other prior DDIs and try to use inode from there.
bpf: see if we can use BPF to solve the syslog message cgroup source problem: one idea would be to patch source sockaddr of all AF_UNIX/SOCK_DGRAM to implicitly contain the source cgroup id. Another idea would be to patch sendto()/connect()/sendmsg() sockaddr on-the-fly to use a different target sockaddr.
bsod: add target "bsod.target" or so, which invokes systemd-bsod.target and waits and then reboots. Then use OnFailure=bsod.target from various jobs that should result in system reboots, such as TPM tamper detection cases.
bsod: maybe use graphical mode. Use DRM APIs directly, see https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset.c for an example for doing that.
build short web pages out of each catalog entry, build them along with man pages, and include hyperlinks to them in the journal output
busctl: maybe expose a verb "ping" for pinging a dbus service to see if it exists and responds.
bypass SIGTERM state in unit files if KillSignal is SIGKILL
cache sd_event_now() result from before the first iteration...
calendarspec: add support for week numbers and day numbers within a year. This would allow us to define "bi-weekly" triggers safely.
cgroups: use inotify to get notified when somebody else modifies cgroups owned by us, then log a friendly warning.
cgroups:
chase(): take inspiration from path_extract_filename() and return O_DIRECTORY if input path contains trailing slash.
Check that users of inotify's IN_DELETE_SELF flag are using it properly, as usually IN_ATTRIB is the right way to watch deleted files, as the former only fires when a file is actually removed from disk, i.e. the link count drops to zero and is not open anymore, while the latter happens when a file is unlinked from any dir.
Clean up "reboot argument" handling, i.e. set it through some IPC service instead of directly via /run/, so that it can be sensibly set remotely.
clean up date formatting and parsing so that all absolute/relative timestamps we format can also be parsed
complete varlink introspection comments:
confext/sysext: instead of mounting the overlayfs directly on /etc/ + /usr/, insert an intermediary bind mount on itself there. This has the benefit that services where mount propagation from the root fs is off can still have confext/sysext propagated in.
consider adding a new partition type, just for /opt/ for usage in system extensions
coredump: maybe when coredumping read a new xattr from /proc/$PID/exe that may be used to mark a whole binary as non-coredumpable. Would fix: https://bugs.freedesktop.org/show_bug.cgi?id=69447
coredump:
credentials system:
creds: add a new cred format that reuses the JSON structures we use in the LUKS header, so that we get the various newer policies for free.
cryptenroll/cryptsetup/homed: add unlock mechanism that combines tpm2 and fido2, as well as tpm2 + ssh-agent, inspired by ChromeOS' logic: encrypt the volume key with the TPM, with a policy that insists that a nonce is signed by the fido2 device's key or ssh-agent key. Thus, at unlock/login time the TPM generates a nonce, which is sent as a challenge to the fido2/ssh-agent, which returns a signature which is handed to the tpm, which then reveals the volume key to the PC.
cryptenroll/cryptsetup/homed: similar to this, implement TOTP backed by TPM.
cryptsetup/homed: implement TOTP authentication backed by TPM2 and its internal clock.
cryptsetup:
sysext: make systemd-{sys,conf}ext-sysroot.service work in the split /var configuration.
sd-varlink: add fully async modes of the protocol upgrade stuff
repart: maybe remove iso9660/eltorito superblock from disk when booting via gpt, if there is one.
crypttab/gpt-auto-generator: allow explicit control over which unlock mechs to permit, and maybe have a global headless kernel cmdline option
currently x-systemd.timeout is lost in the initrd, since crypttab is copied into dracut, but fstab is not
dbus: when a unit failed to load (i.e. is in UNIT_ERROR state), we should be able to safely try another attempt when the bus call LoadUnit() is invoked.
ddi must be listed as block device fstype
define a generic "report" varlink interface, which services can implement to provide health/statistics data about themselves. then define a dir somewhere in /run/ where components can bind such sockets. Then make journald, logind, and pid1 itself implement this and expose various stats on things there. Then issue parallel calls to these interfaces from the systemd-report tool, combine into one json document, and include measurement logs and tpm quote. tpm quote should protect the json doc via the nonce field stuff. Allow shipping this off elsewhere for analysis.
define a JSON format for units, separating out unit definitions from unit runtime state. Then, expose it:
define gpt header bits to select volatility mode
delay activation of logind until somebody logs in, or when /dev/tty0 pulls it in or lingering is on (so that containers don't bother with it until PAM is used). also exit-on-idle
deprecate RootDirectoryStartOnly= in favour of a new ExecStart= prefix char
dhcp6:
dhcp:
dissection policy should enforce that unlocking can only take place by certain means, i.e. only via pw, only via tpm2, or only via fido, or a combination thereof.
do a console daemon that takes stdio fds for services and allows to reconnect to them later
doc: prep a document explaining PID 1's internal logic, i.e. transactions, jobs, units
doc: prep a document explaining resolved's internal objects, i.e. Query vs. Question vs. Transaction vs. Stream and so on.
docs: bring https://systemd.io/MY_SERVICE_CANT_GET_REALTIME up to date
document Environment=SYSTEMD_LOG_LEVEL=debug drop-in in debugging document
document org.freedesktop.MemoryAllocation1
document:
in particular an example how to do the equivalent of switching runlevels
dot output for --test showing the 'initial transaction'
drop nss-myhostname in favour of nss-resolve?
drop NV_ORDERLY flag from the product uuid nvpcr. Effect of the flag is that it pushes the thing into TPM RAM, but a TPM usually has very little of that, less than NVRAM. hence setting the flag amplifies space issues. Unsetting the flag increases wear issues on the NVRAM, however, but this should be limited for the product uuid nvpcr, since it's only changed once per boot. this needs to be configurable by nvpcr however, as other nvpcrs are different, i.e. verity one receives many writes during system uptime quite possibly. (also, NV_ORDERLY makes stuff faster, and dropping it costs possibly up to 100ms supposedly)
EFI:
enable LockMLOCK to take a percentage value relative to physical memory
Enable RestrictFileSystems= for all our long-running services (similar: RestrictNetworkInterfaces=)
encode type1 entries in some UKI section to add additional entries to the menu.
enumerate virtiofs devices during boot-up in a generator, and synthesize mounts for rootfs, /usr/, /home/, /srv/ and some others from it, depending on the "tag". (waits for: https://gitlab.com/virtio-fs/virtiofsd/-/issues/128)
/etc/veritytab: allow that the roothash column can be specified as fs path including a path to an AF_UNIX path, similar to how we do things with the keys of /etc/crypttab. That way people can store/provide the roothash externally and provide to us on demand only.
exponential backoff in timesyncd when we cannot reach a server
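A minimal sketch of the backoff arithmetic, written in Python for brevity (timesyncd itself is C). The function name, the 2s base interval and the 2048s cap are assumptions chosen for illustration, not values taken from timesyncd.

```python
def backoff_delay(failures: int, base_sec: float = 2.0, max_sec: float = 2048.0) -> float:
    """Return the wait time before the next attempt after `failures` consecutive failures.

    base_sec and max_sec are hypothetical defaults, not timesyncd's actual settings.
    """
    if failures < 0:
        raise ValueError("failures must be non-negative")
    # Clamp the exponent first so that very large failure counts cannot overflow.
    exponent = min(failures, 64)
    # Double the delay for each consecutive failure, but never exceed the cap.
    return min(base_sec * (2 ** exponent), max_sec)
```

A caller would reset the failure counter to zero as soon as a server answers, so that a single outage does not slow down polling permanently.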
expose MS_NOSYMFOLLOW in various places
expose the handoff timestamp fully via the D-Bus properties that contain ExecStatus information
extend the smbios11 logic for passing credentials so that instead of passing the credential data literally it can also just reference an AF_VSOCK CID/port to read them from. This way the data doesn't remain in the SMBIOS blob during runtime, but only in the credentials fs.
extend the verity signature partition to permit multiple signatures for the same root hash, so that people can sign a single image with multiple keys.
figure out a nice way how we can let the admin know what child/sibling unit causes cgroup membership for a specific unit
Figure out how to do unittests of networkd's state serialization
Figure out naming of verbs in systemd-analyze: we have (singular) capability, exit-status, but (plural) filesystems, architectures.
figure out what to do about credentials sealed to PCRs in kexec + soft-reboot scenarios. Maybe insist sealing is done additionally against some keypair in the TPM to which access is updated on each boot, for the next, or so?
figure out when we can use the coarse timers
Find a solution for SMACK capabilities stuff: https://lists.freedesktop.org/archives/systemd-devel/2014-December/026188.html
fix bug around run0 background color on ls in fresh terminal
Fix DECIMAL_STR_MAX or DECIMAL_STR_WIDTH. One includes a trailing NUL, the other doesn't. What a disaster. Probably the fix is to exclude it.
fix homed/homectl confusion around terminology, i.e. "home directory" vs. "home" vs. "home area". Stick to one term for the concept, and it probably shouldn't contain "area".
fix our various hwdb lookup keys to end with ":" again. The original idea was that hwdb patterns can match arbitrary fields with expressions like ":foobar:", to wildcard match both the start and the end of the string. This only works safely for later extensions of the string if the strings always end in a colon. This requires updating our udev rules, as well as checking if the various hwdb files are fine with that.
flush_accept() should look at sockdiag queued sockets count and exit once we flushed out the specified number of connections.
flush_fd() should probably try to be smart and stop reading once we know that all further queued data was enqueued after flush_fd() was originally called. For that, try SIOCINQ if fd refers to stream socket, and look at timestamps for datagram sockets.
for better compat with major clouds: implement simple PTP device support in timesyncd
for better compat with major clouds: introduce imds mini client service that sets up primary netif in a private netns (ipvlan?) to query imds without affecting rest of the host. pick up literal credentials from there plus the fields the hwdb reports for the other fields and turn them into credentials. then write generator that uses detected virtualization info and plugs this service into the early boot, waiting for the DMI and network device to show up.
for better compat with major clouds: recognize clouds via hwdb on DMI device, and add udev properties to it that help with handling IMDS, i.e. entrypoint URL, which fields to find ip hostname, ssh key, …
For timer units: add some mechanisms so that timer units that trigger immediately on boot do not have the services they run added to the initial transaction and thus confuse Type=idle.
for vendor-built signed initrds:
foreign uid:
fstab-generator: default to tmpfs-as-root if only usr= is specified on the kernel cmdline
generator that automatically discovers btrfs subvolumes, identifies their purpose based on some xattr on them.
generic interface for varlink for setting log level and stuff that all our daemons can implement
get rid of compat with libbpf.so.0 (retain support only for libbpf.so.1)
Get rid of the symlinks in /run/systemd/units/* and exclusively use cgroupfs xattrs to convey info about invocation ids, logging settings and so on. support for cgroupfs xattrs in the "trusted." namespace was added in linux 3.7, i.e. which we don't pretend to support anymore.
go through all --help texts in our codebases, and make sure:
go through all uses of table_new() in our codebase, and make sure we support all three of:
go through our codebase, and convert "vertical tables" (i.e. things such as "systemctl status") to use table_new_vertical() for output
gpt-auto-generator:
have a signal that reloads every unit that supports reloading
hibernate/s2h: check if swap is on weird storage and refuse if so
homed/pam_systemd: allow authentication by ssh-agent, so that run0/polkit can be allowed if caller comes with the right ssh-agent keys.
homed/userdb: maybe define a "companion" dir for home directories where apps can safely put privileged stuff in. Would not be writable by the user, but still conceptually belong to the user. Would be included in user's quota if possible, even if files are not owned by UID of user. Use case: container images that are owned by arbitrary UIDs, and are owned/managed by the users, but are not directly belonging to the user's UID. Goal: we shouldn't place more privileged dirs inside of unprivileged dirs, and thus containers really should not be placed inside of traditional UNIX home dirs (which are owned by users themselves) but somewhere else, that is separate, but still close by. Inform user code about path to this companion dir via env var, so that container managers find it. the ~/.identity file is also a candidate for a file to move there, since it is managed by privileged code (i.e. homed) and not unprivileged code.
homed:
systemd-home user or so, so that we can easily set overall quota for all users
honour validatefs xattrs in dissect-image.c too
hook up journald with TPMs? measure new journal records to the TPM in regular intervals, validate the journal against current TPM state with that. (taking inspiration from IMA log)
Hook up journald's FSS logic with TPM2: seal the verification disk by time-based policy, so that the verification key can remain on host and be validated via TPM.
Hook up systemd-journal-upload with RESTART_RESET=1 logic (should probably be conditioned on the num of successfully uploaded entries?)
hostnamectl: show root image uuid
if /usr/bin/swapoff fails due to OOM, log a friendly explanatory message about it
if we fork off a service with StandardOutput=journal, and it forks off a subprocess that quickly dies, we might not be able to identify the cgroup it comes from, but we can still derive that from the stdin socket its output came from. We apparently don't do that right now.
If we try to find a unit via a dangling symlink, generate a clean error. Currently, we just ignore it and read the unit from the search path anyway.
image policy should be extended to allow dictating how a disk is unlocked, i.e. root=encrypted-tpm2+encrypted-fido2 would mean "root fs must be encrypted and unlocked via fido2 or tpm2, but not otherwise"
imds: maybe do smarter api version handling
implement a varlink registry service, similar to the one of the reference implementation, backed by /run/varlink/registry/. Then, also implement connect-via-registry-resolution in sd-varlink and varlinkctl. Care needs to be taken to do the resolution asynchronously. Also, note that the Varlink reference implementation uses a different address syntax, which needs to be taken into account.
implement Distribute= in socket units to allow running multiple service instances processing the listening socket, and open this up for ReusePort=
importd/importctl:
importd: support image signature verification with PKCS#7 + OpenBSD signify logic, as alternative to crummy gpg
In .socket units, add ConnectStream=, ConnectDatagram=, ConnectSequentialPacket= that create a socket, and then connect to rather than listen on some socket. Then, add a new setting WriteData= that takes some base64 data that systemd will write into the socket early on. This can then be used to create connections to arbitrary services and issue requests into them, as long as the data is static. This can then be combined with the aforementioned journald subscription varlink service, to enable activation-by-message id and similar.
In DynamicUser= mode: before selecting a UID, use disk quota APIs on relevant disks to see if the UID is already in use.
in journald, write out a recognizable log record whenever the system clock is changed ("stepped"), and in timesyncd whenever we acquire an NTP fix ("slewing"). Then, in journalctl for each boot time we come across, find these records, and use the structured info they include to display "corrected" wallclock time, as calculated from the monotonic timestamp in the log record, adjusted by the delta declared in the structured log record.
in journald: whenever we start a new journal file because the boot ID changed, let's generate a recognizable log record containing info about old and new ID. Then, when displaying log stream in journalctl look for these records, to be able to order them.
in networkd, when matching device types, fix up DEVTYPE rubbish the kernel passes to us
in nss-systemd, if we run inside of RootDirectory= with PrivateUsers= set, find a way to map the User=/Group= of the service to the right name. This way a user/group for a service only has to exist on the host for the right mapping to work.
in os-release define a field that can be initialized at build time from SOURCE_DATE_EPOCH (maybe even under that name?). Would then be used to initialize the timestamp logic of ConditionNeedsUpdate=.
in pid1: include ExecStart= cmdlines (and other Exec*= cmdlines) in polkit request, so that policies can match against command lines.
in sd-id128: also parse UUIDs in RFC4122 URN syntax (i.e. chop off urn:uuid: prefix)
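A minimal sketch of the parsing rule, in Python for illustration only (sd-id128 is C). The function name parse_id128() is hypothetical, and matching the "urn:uuid:" prefix case-insensitively is an assumption of this sketch.

```python
import string

URN_PREFIX = "urn:uuid:"

def parse_id128(s: str) -> bytes:
    """Parse a 128-bit ID from plain hex, dashed UUID, or urn:uuid:<UUID> syntax."""
    # Chop off the URN prefix if present (matched case-insensitively, an assumption).
    if s[:len(URN_PREFIX)].lower() == URN_PREFIX:
        s = s[len(URN_PREFIX):]
    if len(s) == 36:
        # Dashed UUID form: the dashes must sit at exactly these offsets.
        if any(s[i] != "-" for i in (8, 13, 18, 23)):
            raise ValueError("misplaced dashes in UUID")
        s = s.replace("-", "")
    # What remains must be exactly 32 hex digits.
    if len(s) != 32 or any(c not in string.hexdigits for c in s):
        raise ValueError("not a valid 128-bit ID")
    return bytes.fromhex(s)
```

Stripping the prefix before any other validation keeps the existing plain and dashed code paths unchanged.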
in sd-stub: optionally add support for a new PE section .keyring or so that contains additional certificates to include in the Mok keyring, extending what shim might have placed there. why? let's say I use "ukify" to build + sign my own fedora-based UKIs, and only enroll my personal lennart key via shim. Then, I want to include the fedora keyring in it, so that kmods work. But I might not want to enroll the fedora key in shim, because this would also mean that the key would be in effect whenever I boot an archlinux UKI built the same way, signed with the same lennart key.
in the initrd, once the rootfs encryption key has been measured to PCR 15, derive default machine ID to use from it, and pass it to host PID 1.
in the initrd: derive the default machine ID to pass to the host PID 1 via $machine_id from the same seed credential.
in the long run: permit a system with /etc/machine-id linked to /dev/null, to make it lose its identity, i.e. be anonymous. For this we'd have to patch through the whole tree to make all code deal with the case where no machine ID is available.
In vmspawn/nspawn/machined wait for X_SYSTEMD_UNIT_ACTIVE=ssh-active.target and X_SYSTEMD_SIGNALS_LEVEL=2 as indication whether/when SSH and the POSIX signals are available. Similar for D-Bus (but just use sockets.target for that). Report as property for the machine.
initialize the hostname from the fs label of /, if /etc/hostname does not exist?
initrd: when transitioning from initrd to host, validate that /lib/modules/`uname -r` exists, refuse otherwise
instead of going directly for DefineSpace when initializing nvpcrs, check if they exist first. apparently DefineSpace is broken on some tpms, and also creates log spam if the nvindex already exists.
introduce /etc/boottab or so which lists block devices that bootctl + kernel-install shall update the ESPs on (and register in EFI BootXYZ variables), in addition to whatever is currently the booted /usr/. systemd-sysupdate should also take it into consideration and update the /usr/ images on all listed devices.
introduce a .acpitable section for early ACPI table override
Introduce a CGroupRef structure, inspired by PidRef. Should contain cgroup path, cgroup id, and cgroup fd. Use it to continuously pin all v2 cgroups via a cgroup_ref field in the CGroupRuntime structure. Eventually switch things over to do all cgroupfs access only via that structure's fd.
introduce a new group to own TPM devices
introduce an option (or replacement) for "systemctl show" that outputs all properties as JSON, similar to busctl's new JSON output. In contrast to that it should skip the variant type string though.
introduce DefaultSlice= or so in system.conf that allows changing where we place our units by default, i.e. change system.slice to something else. Similar, ManagerSlice= should exist so that PID1's own scope unit could be moved somewhere else too. Finally machined and logind should get similar options so that it is possible to move user session scopes and machines to a different slice too by default. Use case: people who want to put resources on the entire system, with the exception of one specific service. See: https://lists.freedesktop.org/archives/systemd-devel/2018-February/040369.html
introduce mntid_t, and make it 64bit, as apparently the kernel switched to 64bit mount ids
introduce new ANSI sequence for communicating log level and structured error metadata to terminals.
introduce new structure Tpm2CombinedPolicy, that combines the various TPM2 policy bits into one structure, i.e. public key info, pcr masks, pcrlock stuff, pin and so on. Then pass that around in tpm2_seal() and tpm2_unseal().
introduce per-unit (i.e. per-slice, per-service) journal log size limits.
investigate whether the gnome pty helper should be moved into systemd, to provide cgroup support.
journal:
journalctl/timesyncd: whenever timesyncd acquires a synchronization from NTP, create a structured log entry that contains boot ID, monotonic clock and realtime clock (I mean, this requires no special work, as these three fields are implicit). Then in journalctl when attempting to display the realtime timestamp of a log entry, first search for the closest later log entry of this kind that has a matching boot id, and convert the monotonic clock timestamp of the entry to the realtime clock using this info. This way we can retroactively correct the wallclock timestamps, in particular for systems without RTC, i.e. where initially wallclock timestamps carry rubbish, until an NTP sync is acquired.
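The correction described in the item above can be sketched as follows, in Python for illustration only. The Entry type, its is_sync marker and the function name are hypothetical; real journal entries carry the boot ID and both timestamps implicitly.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class Entry:
    boot_id: str
    monotonic_usec: int
    realtime_usec: int
    is_sync: bool = False  # hypothetical marker for "NTP sync acquired" records

def corrected_realtime(entry: Entry, entries: Iterable[Entry]) -> int:
    """Return the entry's wallclock time, corrected via the closest later sync record."""
    # Only sync records from the same boot are usable, since the monotonic
    # clock is only comparable within a single boot.
    later_syncs = [e for e in entries
                   if e.is_sync
                   and e.boot_id == entry.boot_id
                   and e.monotonic_usec >= entry.monotonic_usec]
    if not later_syncs:
        # Nothing to correct against: fall back to the recorded timestamp.
        return entry.realtime_usec
    sync = min(later_syncs, key=lambda e: e.monotonic_usec)
    # Walk back from the trusted realtime by the monotonic distance.
    return sync.realtime_usec - (sync.monotonic_usec - entry.monotonic_usec)
```

For example, an entry logged 6s (monotonic) before the sync record is displayed with the sync record's realtime timestamp minus 6s, regardless of what the unsynchronized clock claimed at the time.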
landlock: for unprivileged systemd (i.e. systemd --user), use landlock to implement ProtectSystem=, ProtectHome= and so on. Landlock does not require privs, and we can implement pretty similar behaviour. Also, maybe add a mode where ProtectSystem= combined with an explicit PrivateMounts=no could request similar behaviour for system services, too.
landlock: lock down RuntimeDirectory= via landlock, so that services lose ability to write anywhere else below /run/. Similar for StateDirectory=. Benefit would be clear delegation via unit files: services get the directories they get, and nothing else even if they wanted to.
Lennart: big blog story about "why systemd-boot"
Lennart: big blog story about building initrds
Lennart: big blog story about DDIs
let's not GC a unit while its ratelimits are still pending
libsystemd-journal, libsystemd-login, libudev: add calls to easily attach these objects to sd-event event loops
lock down acceptable encrypted credentials at boot, via simple allowlist, maybe on kernel command line: systemd.import_encrypted_creds=foobar.waldo,tmpfiles.extra to protect locked down kernels from credentials generated on the host with a weak kernel
lock down swtpm a bit to make it harder to extract keys from it as it is running. i.e. make ptracing + termination hard from the outside. also run swtpm as unpriv user (not trivial, probably requires patching swtpm, as it needs to allocate vtpm device), to lock it down from the inside.
loginctl: show "service identifier" in tabular list-sessions output, to make run0 sessions easily visible.
loginctl: show argv[] of "leader" process in tabular list-sessions output
logind:
look at nsresourced, mountfsd, homed, importd, portabled, and try to come up with a way how the forked off worker processes can be moved into transient services with sandboxing, without breaking notify socket stuff and so on.
machined:
make "bootctl install" + "bootctl update" useful for installing shim too. For that introduce new dir /usr/lib/systemd/efi/extra/ which we copy mostly 1:1 into the ESP at install time. Then make the logic smart enough so that we don't overwrite bootx64.efi with our own if the extra tree already contains one. Also, follow symlinks when copying, so that shim rpm can symlink their stuff into our dir (which is safe since the target ESP is generally VFAT and thus does not have symlinks anyway). Later, teach the update logic to look at the ELF package metadata (which we also should include in all PE files, see above) for version info in all *.EFI files, and use it to only update if newer.
make cryptsetup lower --iter-time
Make it possible to set the keymap independently from the font on the kernel cmdline. Right now setting one resets also the other.
make killing more debuggable: when we kill a service do so setting the .si_code field with a little bit of info. Specifically, we can set a recognizable value to first of all indicate that it's systemd that did the killing. Secondly, we can give a reason for the killing, i.e. OOM or so, and also the phase we are in, and which process we think we are killing (i.e. main vs control process, useful in case of sd_notify() MAINPID= debugging). Net result: people who try to debug why their process gets killed should have some minimal, nice metadata directly on the signal event.
make MAINPID= message reception checks even stricter: if service uses User=, then check sending UID and ignore message if it doesn't match the user or root.
make nspawn containers, portable services and vmspawn VMs optionally survive soft reboot wholesale.
Make nspawn a frontend for systemd-executor, so that we have two ways into the executor: via unit files/dbus/varlink through PID1 and via cmdline/OCI through nspawn.
make persistent restarts easier by adding a new setting OpenPersistentFile= or so, which allows opening one or more files that are "persistent" across service restarts, hot reboot, cold reboots (depending on configuration): the files are created empty on first invocation, and on subsequent invocations the files are reopened. The files would be backed by tmpfs, pmem or /var depending on desired level of persistency.
make repeated alt-ctrl-del presses print a dump
make rfkill uaccess controllable by default, i.e. steal rule from gnome-bluetooth and friends
Make run0 forward various signals to the forked process so that sending signals to a child process works roughly the same regardless of whether the child process is spawned via run0 or not.
make sure systemd-ask-password-wall does not shutdown systemd-ask-password-console too early
make sure the ratelimit object can deal with USEC_INFINITY as way to turn off things
make systemd work nicely without /bin/sh, logins and associated shell tools around
make the systemd-repart "seed" value provisionable via credentials, so that confidential computing environments can set it and deterministically enforce the uuids for partitions created, so that they can calculate PCR 15 ahead of time.
make use of ethtool veth peer info in machined, for automatically finding out host-side interface pointing to the container.
make use of new glibc 2.32 APIs sigabbrev_np().
make vmspawn/nspawn/importd/machined a bit more usable in a WSL-like fashion. i.e. teach unpriv systemd-vmspawn/systemd-nspawn a reasonable --bind-user= behaviour that mounts the calling user through into the machine. Then, ship importd with a small database of well known distro images along with their pinned signature keys. Then add some minimal glue that binds this together: downloads a suitable image if not done so yet, starts it in the bg via vmspawn/nspawn if not done so yet and then requests a shell inside it for the invoking user.
man: rework os-release(5), and clearly separate our extension-release.d/ and initrd-release parts, i.e. list explicitly which fields are about what.
man: the documentation of Restart= currently is very misleading and suggests the tools from ExecStartPre= might get restarted.
maybe add a "systemd-report" tool, that generates a TPM2-backed "report" of current system state, i.e. a combination of PCR information, local system time and TPM clock, running services, recent high-priority log messages/coredumps, system load/PSI, signed by the local TPM chip, to form an enhanced remote attestation quote. Use case: a simple orchestrator could use this: have the report tool upload these reports every 3min somewhere. Then have the orchestrator collect these reports centrally over a 3min time window, and use them to determine which node should now start/stop what, and generate a small confext for each node, that uses Uphold= to pin services on each node. The confext would be encrypted using the asymmetric encryption proposed above, so that it can only be activated on the specific host, if the software is in a good state, and within a specific time frame. Then run a loop on each node that sends report to orchestrator and then sysupdate to update confext. Orchestrator would be stateless, i.e. operate on desired config and collected reports in the last 3min time window only, and thus can be trivially scaled up since all instances of the orchestrator should come to the same conclusions given the same inputs of reports/desired workload info. Could also be used to deliver Wireguard secrets and such to clients, thus permitting zero-trust networking: secrets are rolled over via confext updates, and via the time window TPM logic invalidated if node doesn't keep itself updated, or becomes corrupted in some way.
maybe add a new standard slice where processes that are started in the initrd and stick around for the whole system runtime (i.e. root fs storage daemons, the bpf loader daemon discussed above, and such) are placed. maybe protected.slice or so? Then write docs that suggest that services like this set Slice=protected.slice, RefuseManualStart=yes, RefuseManualStop=yes and a couple of other things.
maybe add call sd_journal_set_block_timeout() or so to set SO_SNDTIMEO for the sd-journal logging socket, and, if the timeout is set to 0, sets O_NONBLOCK on it. That way people can control if and when to block for logging.
maybe add kernel cmdline params: to force random seed crediting
maybe add new flags to gpt partition tables for rootfs and usrfs indicating purpose, i.e. whether something is supposed to be bootable in a VM, on baremetal, on an nspawn-style container, if it is a portable service image, or a sysext for initrd, for host os, or for portable container. Then hook portabled/… up to udev to watch block devices coming up with the flags set, and use it.
maybe add support for binding and connecting AF_UNIX sockets in the file system outside of the 108ch limit. When connecting, open O_PATH fd to socket inode first, then connect to /proc/self/fd/XYZ. When binding, create symlink to target dir in /tmp, and bind through it.
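A rough Python sketch of both tricks from the item above, for illustration only (Linux-specific, and it assumes /proc is mounted and /tmp is writable). The function names are hypothetical, and the symlink is placed in a private temporary directory under /tmp, which is one possible reading of "symlink in /tmp".

```python
import os
import socket
import tempfile

def connect_long_unix_path(path: str) -> socket.socket:
    """Connect to an AF_UNIX stream socket whose path may exceed the sun_path limit."""
    # Pin the socket inode first, then connect through the short /proc path.
    fd = os.open(path, os.O_PATH | os.O_CLOEXEC)
    try:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.connect(f"/proc/self/fd/{fd}")
        except BaseException:
            sock.close()
            raise
        return sock
    finally:
        os.close(fd)

def bind_long_unix_path(path: str) -> socket.socket:
    """Bind an AF_UNIX stream socket at a path that may exceed the sun_path limit."""
    directory, name = os.path.split(os.path.abspath(path))
    # Create a short symlink to the target directory and bind through it.
    tmp = tempfile.mkdtemp(dir="/tmp")
    link = os.path.join(tmp, "d")
    os.symlink(directory, link)
    try:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            sock.bind(os.path.join(link, name))
        except BaseException:
            sock.close()
            raise
        return sock
    finally:
        # The symlink is only needed while bind() resolves the path.
        os.unlink(link)
        os.rmdir(tmp)
```

Note that the file name component itself still has to fit into the limit together with the symlink path; only the directory part is shortened.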
Maybe add SwitchRootEx() as new bus call that takes env vars to set for new PID 1 as argument. When adding SwitchRootEx() we should maybe also add a flags param that allows disabling and enabling whether serialization is requested during switch root.
maybe allow timer units with an empty Units= setting, so that they can be used for resuming the system but nothing else.
maybe beef up sd-event: optionally, allow sd-event to query the timestamp of next pending datagram inside a SOCK_DGRAM IO fd, and order event source dispatching by that. Enable this on the native + syslog sockets in journald, so that we add correct ordering between the two. Use MSG_PEEK + SCM_TIMESTAMP for this.
maybe define a /etc/machine-info field for the ANSI color to associate with a hostname. Then use it for the shell prompt to highlight the hostname. If no color is explicitly set, hash a color automatically from the hostname as a fallback, in a reasonable way. Take inspiration from the ANSI_COLOR= field that already exists in /etc/os-release, i.e. use the same field name and syntax. When hashing the color, use the hsv_to_rgb() helper we already have, fixate S and V to something reasonable and constant, and derive the H from the hostname. Ultimate goal with this: give people a visual hint about the system they are on if they have many to deal with, by giving each a color identity. This code should be placed in hostnamed, so that clients can query the color via varlink or dbus.
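The fallback hashing could look roughly like this, sketched in Python with the standard colorsys module standing in for the hsv_to_rgb() helper. The choice of SHA-256, the S and V constants and the function names are assumptions for illustration.

```python
import colorsys
import hashlib

def hostname_color(hostname: str, saturation: float = 0.6, value: float = 0.9) -> tuple:
    """Derive a stable RGB color from a hostname: H is hashed, S and V stay constant."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Map the first 16 bits of the hash onto the hue circle, i.e. into [0, 1).
    hue = int.from_bytes(digest[:2], "big") / 65536.0
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
    return (round(r * 255), round(g * 255), round(b * 255))

def ansi_color(rgb: tuple) -> str:
    """Format an RGB triple as an SGR parameter string, as ANSI_COLOR= in os-release uses."""
    return "38;2;{};{};{}".format(*rgb)
```

A shell prompt could then wrap the hostname in "\x1b[" + ansi_color(hostname_color(name)) + "m" and "\x1b[0m". Keeping S and V fixed means every host gets a color of similar brightness, and only the hue differs.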
maybe do not install [email protected] symlink in /etc but in /usr?
maybe extend .path units to expose fanotify() per-mount change events
maybe extend the capsule concept to the per-user instance too: invokes a systemd --user instance with a subdir of $HOME as $HOME, and a subdir of $XDG_RUNTIME_DIR as $XDG_RUNTIME_DIR.
Maybe extend the service protocol to support handling of some specific SIGRT signal for setting service log level, that carries the level via the sigqueue() data parameter. Enable this via unit file setting.
maybe implicitly attach monotonic+realtime timestamps to outgoing messages in log.c and sd-journal-send
maybe introduce "@icky" as a seccomp filter group, which contains acct() and certain other syscalls that aren't quite obsolete, but certainly icky.
Maybe introduce a helper safe_exec() or so, which is to execve() what safe_fork() is to fork(). And then make it revert the RLIMIT_NOFILE soft limit to 1K implicitly, unless explicitly opted-out.
maybe introduce a new partition that we can store debug logs and similar at the very last moment of shutdown. idea would be to store reference to block device (major + minor + partition id + diskseq?) in /run somewhere, then use that from systemd-shutdown, just write a raw JSON blob into the partition. Include timestamp, boot id and such, plus kmsg. on next boot immediately import into journal. maybe use timestamp for making clock more monotonic. also use this to detect unclean shutdowns, boot into special target if detected
maybe introduce a new per-unit drop-in directory .confext.d/ that may contain symlinks to confext images to enable for the unit.
Maybe introduce an InodeRef structure inspired by PidRef, which references a specific inode, and combines: a path, an O_PATH fd, and possibly a FID into one. Why? We often pass around path and fd separately in chaseat() and similar calls. Because passing around both separately is cumbersome we sometimes pass only one, sometimes only the other, and sometimes both. It would make the code a lot simpler if we could pass both around at the same time in a simple way, via an InodeRef which both pins the inode via an fd, and gives us a friendly name for it.
maybe introduce an OSC sequence that signals when we ask for a password, so that terminal emulators can maybe connect a password manager or so, and highlight things specially.
maybe introduce [email protected] or so, to match container-getty.service but skips authentication, so you get a shell prompt directly. Use case: wsl-like stuff (they have something pretty much like that). Question: how to pick user for this. Instance parameter? somehow from credential (would probably require some binary that converts credential to User= parameter?)
maybe introduce xattrs that can be set on the root dir of the root fs partition that declare the volatility mode to use the image in. Previously I thought about marking this via GPT partition flags but that's not ideal since that's outside of the LUKS encryption/verity verification, and we probably shouldn't operate in a volatile mode unless we got told so from a trusted source.
maybe prohibit setuid() to the nobody user, to lock things down, via seccomp. the nobody is not a user any code should run under, ever, as that user would possibly get a lot of access to resources it really shouldn't be getting access to due to the userns + nfs semantics of the user. Alternatively: use the seccomp log action, and allow it.
maybe reconsider whether virtualization consoles (hvc1) are considered local or remote. i.e. are they more like an ssh login, or more like a /dev/tty1 login? Lennart used to believe the former, but maybe the latter is more appropriate? This has effect on polkit interactivity, since it would mean questions via hvc0 would suddenly use the local polkit property. But this also raises the question whether such sessions shall be considered active or not
Maybe rename pkcs7 and public verbs of systemd-keyutil to be more verb like.
maybe rework systemd-modules-load to be a generator that just instantiates [email protected] a bunch of times
maybe teach repart.d/ dropins a new setting MakeMountNodes= or so, which is just like MakeDirectories=, but uses an access mode of 0000 and sets the +i chattr bit. This is useful as protection against early uses of /var/ or /tmp/ before their contents is mounted.
maybe trigger a uevent "change" on a device if "systemctl reload xyz.device" is issued.
maybe: in PID1, when we detect we run in an initrd, make superblock read-only early on, but provide opt-out via kernel cmdline.
measure GPT and LUKS headers somewhere when we use them (i.e. in systemd-gpt-auto-generator/systemd-repart and in systemd-cryptsetup?)
measure some string via pcrphase whenever we end up booting into emergency mode.
measure some string via pcrphase whenever we resume from hibernate
Merge systemd-creds options --uid= (which accepts user names) and --user.
merge unit_kill_common() and unit_kill_context()
MessageQueueMessageSize= (and suchlike) should use parse_iec_size().
mount /tmp/ and /var/tmp with a uidmap applied that blocks out "nobody" user among other things such as dynamic uid ranges for containers and so on. That way no one can create files there with these uids and we enforce they are only used transiently, never persistently.
mount /var/ from initrd, so that we can apply sysext and stuff before the initrd transition. Specifically:
mount most file systems with a restrictive uidmap. e.g. mount /usr/ with a uidmap that blocks out anything outside 0…1000 (i.e. system users) and similar.
mount the root fs with MS_NOSUID by default, and then mount /usr/ without both so that suid executables can only be placed there. Do this already in the initrd. If /usr/ is not split out create a bind mount automatically.
mount: turn dependency information from /proc/self/mountinfo into dependency information between systemd units.
MountFlags=shared acts as MountFlags=slave right now.
mountfsd/nsresourced:
move documentation about our common env vars (SYSTEMD_LOG_LEVEL, SYSTEMD_PAGER, …) into a man page of its own, and just link it from our various man pages that so far embed the whole list again and again, in an attempt to reduce clutter and noise a bit.
move multiseat vid/pid matches from logind udev rule to hwdb
Move RestrictAddressFamilies= to the new cgroup create socket
networkd's resolved hook: optionally map all lease IP addresses handed out to the same hostname which is configured on the .network file. Optionally, even derive this single name from the network interface name (i.e. probably altname or so). This way, when spawning a VM the host could pick the hostname for it and the client gets no say.
networkd/machined: implement reverse name lookups in the resolved hook
networkd: maintain a file in /run/ that can be symlinked into /run/issue.d/ that always shows the current primary IP address
networkd:
nspawn/vmspawn/pid1: add ability to easily insert fully booted VMs/FOSC into shell pipelines, i.e. add easy to use switch that turns off console status output, and generates the right credentials for systemd-run-generator so that a program is invoked, and its output captured, with correct EOF handling and exit code propagation
nspawn/vmspawn: define hotkey that one can hit on the primary interface to ask for a friendly, acpi style shutdown.
nspawn:
oci: add support for "importctl import-oci" which implements the "OCI layout" spec (i.e. acquiring via local fs access), as opposed to the current "importctl pull-oci" which focuses on the "OCI image spec", i.e. downloads from the web (i.e. acquiring via URLs).
oci: add support for blake hashes for layers
oci: support "data" in any OCI descriptor, not just manifest config.
On boot, auto-generate an asymmetric key pair from the TPM, and use it for validating DDIs and credentials. Maybe upload it to the kernel keyring, so that the kernel does this validation for us for verity and kernel modules
on shutdown: move utmp, wall, audit logic all into PID 1 (or logind?)
once swtpm's sd_notify() support has landed in the distributions, remove the invocation in tpm2-swtpm.c and let swtpm handle it.
Once the root fs LUKS volume key is measured into PCR 15, default to binding credentials to PCR 15 in "systemd-creds"
optionally, also require WATCHDOG=1 notifications during service start-up and shutdown
optionally, collect cgroup resource data, and store it in per-unit RRD files, suitable for processing with rrdtool. Add bus API to access this data, and possibly implement a CPULoad property based on it.
optionally: turn on cgroup delegation for per-session scope units
pam_systemd: on interactive logins, maybe show SUPPORT_END information at login time, à la motd
pam_systemd_home: add module parameter to control whether to accept only password or only pkcs11/fido2 auth, and then use this to hook nicely into two of the three PAM stacks gdm provides. See discussion at https://github.com/authselect/authselect/pull/311
paranoia: whenever we process passwords, call mlock() on the memory first. i.e. look for all places we use free_and_erasep() and augment them with mlock(). Also use MADV_DONTDUMP. Alternatively (preferably?) use memfd_secret().
pcrextend/tpm2-util: add a concept of "rotation" to event log. i.e. allow trailing parts of the logs if time or disk space limit is hit. Protect the boot-time measurements however (i.e. up to some point where things are settled), since we need those for pcrlock measurements and similar. When deleting entries for rotation, place an event that declares how many items have been dropped, and what the hash was before and after that.
pcrextend: after measuring get an immediate quote from the TPM, and validate it. if it doesn't check out, i.e. the measurement we made doesn't appear in the PCR then also reboot.
pcrextend: maybe add option to disable measurements entirely via kernel cmdline
pcrextend: when we fail to measure, reboot the system (at least optionally). important because certain measurements are supposed to "destroy" tpm object access.
pcrlock: add support for multi-profile UKIs
pcrlock:
per-service sandboxing option: ProtectIds=. If used, will overmount /etc/machine-id and /proc/sys/kernel/random/boot_id with synthetic files, to make it harder for the service to identify the host. Depending on the user setting it should be fully randomized at invocation time, or a hash of the real thing, keyed by the unit name or so. Of course, there are other ways to get these IDs (e.g. journal) or similar ids (e.g. MAC addresses, DMI ids, CPU ids), so this knob would only be useful in combination with other lockdown options. Particularly useful for portable services, and anything else that uses RootDirectory= or RootImage=. (Might also over-mount /sys/class/dmi/id/*{uuid,serial} with /dev/null).
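The "hash of the real thing, keyed by the unit name" variant of ProtectIds= could look roughly like the sketch below. The helper name `synthetic_machine_id` and the choice of HMAC-SHA256 truncated to 128 bits are assumptions made for illustration only; nothing like this exists in the tree yet.

```python
import hashlib
import hmac


def synthetic_machine_id(real_machine_id: str, unit_name: str) -> str:
    """Derive a stable per-unit ID from the real machine ID (hypothetical helper).

    Assumption: HMAC-SHA256 keyed by the unit name, truncated to 128 bits and
    rendered as 32 lowercase hex digits, like the contents of /etc/machine-id.
    """
    real = bytes.fromhex(real_machine_id.strip())
    digest = hmac.new(unit_name.encode(), real, hashlib.sha256).digest()
    return digest[:16].hex()
```

The same unit always sees the same synthetic ID on a given host, while two units on that host see unrelated IDs. The fully randomized variant would simply skip the hashing and draw 16 random bytes at invocation time.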
Permit masking specific netlink APIs with RestrictAddressFamilies=
pick up creds from EFI vars
PID 1 should send out sd_notify("WATCHDOG=1") messages (for usage in the --user mode, and when run via nspawn)
pid1:
PidRef conversion work:
port copy.c over to use LabelOps for all labelling.
portable services: attach not only unit files to host, but also simple binaries to a tmpfs path in $PATH.
portabled: when extracting unit files and copying to system.attached, if a .p7s is available in the image, use it to protect the system.attached copy with fs-verity, so that it cannot be tampered with
print a nicer explanation if people use variable/specifier expansion in ExecStart= for the first word
Process credentials in:
properly handle loop back mounts via fstab, especially regards to fsck/passno
properly serialize the ExecStatus data from all ExecCommand objects associated with services, sockets, mounts and swaps. Currently, the data is flushed out on reload, which is quite a limitation.
ProtectClock= (drops CAP_SYS_TIME, adds seccomp filters for settimeofday, adjtimex), sets DeviceAllow= for /dev/rtc
ProtectKeyRing= to take keyring calls away
ProtectMount= (drop mount/umount/pivot_root from seccomp, disallow fuse via DeviceAllow, imply MountFlags=slave)
ProtectReboot= that masks reboot() and kexec_load() syscalls, prohibits kill on PID 1 with the relevant signals, and makes relevant files in /sys and /proc (such as the sysrq stuff) unavailable
ProtectTracing= (drops CAP_SYS_PTRACE, blocks ptrace syscall, makes /sys/kernel/tracing go away)
ptyfwd: use osc context information in vmspawn/nspawn/… to optionally only listen to ^]]] key when no further vmspawn/nspawn context is allocated
ptyfwd: use osc context information to propagate status messages from vmspawn/nspawn to service manager's "status" string, reporting what is currently in the fg
pull-oci: progress notification
redefine /var/lib/extensions/ as the dir one can place all three of sysext, confext as well as multi-modal DDIs that qualify as both. Then introduce /var/lib/sysexts/ which can be used to place only DDIs that shall be used as sysext
refcounting in sd-resolve is borked
refuse boot if /usr/lib/os-release is missing or /etc/machine-id cannot be set up
remove any syslog support from log.c — we probably cannot do this before split-off udev is gone for good
remove tomoyo support, it's obsolete and unmaintained apparently
RemoveKeyRing= to remove all keyring entries of the specified user
repart + cryptsetup: support file systems that are encrypted and use verity on top. Usecase: confexts that shall be signed by the admin but also be confidential. Then, add a new --make-ddi=confext-encrypted for this.
repart/gpt-auto/DDIs: maybe introduce a concept of "extension" partitions, that have a new type uuid and can "extend" earlier partitions, to work around the fact that systemd-repart can only grow the last partition defined. During activation we'd simply set up a dm-linear mapping to merge them again. A partition that is to be extended would just set a bit in the partition flags field to indicate that there's another extension partition to look for. The identifying UUID of the extension partition would be hashed in counter mode from the uuid of the original partition it extends. Inspiration for this is the "dynamic partitions" concept of new Android. This would be a minimalistic concept of a volume manager, with the extents it manages being exposed as GPT partitions. If a partition is extended multiple times they should probably grow exponentially in size to ensure O(log(n)) time for finding them on access.
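A rough sketch of how the counter-mode UUID derivation and the exponential sizing of extension partitions might fit together. The function names, the use of SHA-256, and the doubling factor are all assumptions for illustration; the item above does not fix any of them.

```python
import hashlib
import uuid


def extension_uuid(base: uuid.UUID, index: int) -> uuid.UUID:
    """UUID of the index-th extension of partition `base` (hypothetical scheme).

    Assumption: SHA-256 over the base UUID bytes followed by a 64-bit
    little-endian counter, truncated to 128 bits. No UUID version or variant
    bits are set here; a real design would have to decide on that.
    """
    digest = hashlib.sha256(base.bytes + index.to_bytes(8, "little")).digest()
    return uuid.UUID(bytes=digest[:16])


def extension_size(first_size: int, index: int) -> int:
    """Size of the index-th extension if each one doubles the previous one."""
    return first_size << index
```

Because the UUID of every extension can be computed from the original partition UUID alone, activation code can probe for extension 0, 1, 2, … until one is missing, and doubling the size keeps that probe count logarithmic in the total capacity.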
repart: introduce concept of "ghost" partitions, that we setup in almost all ways like other partitions, but do not actually register in the actual gpt table, but only tell the kernel about via BLKPG ioctl. These partitions are disk backed (hence can be large), but not persistent (as they are invisible on next boot). Could be used by live media and similar, to boot up as usual but automatically start at zero on each boot. There should also be a way to make ghost partitions properly persistent on request.
repart: introduce MigrateFileSystem= or so which is a bit like CopyFiles=/CopyBlocks= but operates via btrfs device logic: adds target as new device then removes source from btrfs. Usecase: a live medium which uses "ghost" partitions as suggested above, which can become persistent on request on another device.
replace all \x1b, \x1B, \033 C string escape sequences in our codebase with a more readable \e. It's a GNU extension, but a ton more readable than the others, and most importantly it doesn't result in confusing errors if you suffix the escape sequence with one more decimal digit, because compilers think you might actually specify a value outside the 8bit range with that.
replace all uses of fopen_temporary() by fopen_tmpfile_linkable() + flink_tmpfile() and then get rid of fopen_temporary(). Benefit: use O_TMPFILE pervasively, and avoid rename() wherever we can.
replace bootctl's PE version check to actually use APIs from pe-binary.[ch] to find binary version.
replace symlink_label(), mknodat_label(), btrfs_subvol_make_label(), mkdir_label() and related calls by flags-based calls that use label_ops_pre()/label_ops_post().
report: have something that requests cloud workload identity bearer tokens and includes it in the report
report:
Reset TPM2 DA bit on each successful boot
resolved:
revisit how we pass fs images and initrd to the kernel. take uefi http boot ramdisks as inspiration: for any confext/sysext/initrd erofs/DDI image simply generate a fake pmem region in the UEFI memory tables, that Linux then turns into /dev/pmemX. Then turn off cpio-based initrd logic in linux kernel, instead let kernel boot directly into /dev/pmem0. In order to allow our usual cpio-based parameterization, teach PID 1 to just uncompress cpio ourselves early on, from another pmem device. (Related to this, maybe introduce a new PE section .ramdisk that just synthesizes pmem devices from arbitrary blobs. Could be particularly useful in add-ons)
rework ExecOutput and ExecInput enums so that EXEC_OUTPUT_NULL loses its magic meaning and is no longer upgraded to something else if set explicitly.
rework fopen_temporary() to make use of open_tmpfile_linkable() (problem: the kernel doesn't support linkat() that replaces existing files, currently)
rework journalctl -M to be based on a machined method that generates a mount fd of the relevant journal dirs in the container with uidmapping applied to allow the host to read it, while making everything read-only.
rework loopback support in fstab: when "loop" option is used, then instantiate a new [email protected] for the source path, set the lo_file_name field for it to something recognizable derived from the fstab line, and then generate a mount unit for it using a udev generated symlink based on lo_file_name.
rework recursive read-only remount to use new mount API
rework seccomp/nnp logic so that even if User= is used in combination with a seccomp option we don't have to set NNP. For that, change uid first while keeping CAP_SYS_ADMIN, then apply seccomp, then drop cap.
rewrite bpf-devices in libbpf/C code, rather than home-grown BPF assembly, to match bpf-restrict-fs, bpf-restrict-ifaces, bpf-socket-bind
rewrite bpf-firewall in libbpf/C code
rfkill,backlight: we probably should run the load tools inside of the udev rules so that the state is properly initialized by the time other software sees it
rough proposed implementation design for remote attestation infra: add a tool that generates a quote of local PCRs and NvPCRs, along with synchronous log snapshot. use "audit session" logic for that, so that we get read-outs and signature in one step. Then turn this into a JSON object. Use the "TCG TSS 2.0 JSON Data Types and Policy Language" format to encode the signature. And CEL for the measurement log.
run0: maybe enable utmp for run0 sessions, so that they are easily visible.
sd-boot:
sd-bus:
sd-device:
sd-event:
sd-journal puts a limit on parallel journal files to view at once. journald should probably honour that same limit (JOURNAL_FILES_MAX) when vacuuming to ensure we never generate more files than we can actually view.
sd-lldp: pick up 802.3 maximum frame size/mtu, to be able to detect jumbo frame capable networks
sd-rtnl:
sd-stub:
sd_notify/vsock: maybe support binding to AF_VSOCK in Type=notify services, then passing $NOTIFY_SOCKET and $NOTIFY_GUESTCID with PID1's cid (typically fixed to "2", i.e. the official host cid) and the expected guest cid, for the two sides of the channel. The latter env var could then be used in an appropriate qemu cmdline. That way qemu payloads could talk sd_notify() directly to host service manager.
seccomp:
seems that when we follow symlinks to units we prefer the symlink destination path over /etc and /usr. We should not do that. Instead /etc should always override /run+/usr and also any symlink destination.
.service with invalid Sockets= starts successfully.
services: add support for cryptographically unlocking per-service directories via TPM2. Specifically, for StateDirectory= (and related dirs) use fscrypt to set up the directory so that it can only be accessed if host and app are in order.
shared/wall: Once more programs are taught to prefer sd-login over utmp, switch the default wall implementation to wall_logind (https://github.com/systemd/systemd/pull/29051#issuecomment-1704917074)
show whether a service has out-of-date configuration in "systemctl status" by using mtime data of ConfigurationDirectory=.
shutdown logging: store to EFI var, and store to USB stick?
signed bpf loading: to address need for signature verification for bpf programs when they are loaded, and given the bpf folks don't think this is realistic in kernel space, maybe add small daemon that facilitates this loading on request of clients, validates signatures and then loads the programs. This daemon should be the only daemon with privs to load BPF on the system. It might be a good idea to run this daemon already in the initrd, and leave it around during the initrd transition, to continue serving requests. Should then live in its own fs namespace that inherits from the initrd's fs tree, not from the host, to isolate it properly. Should set PR_SET_DUMPABLE so that it cannot be ptraced from the host. Should have CAP_BPF as only service around.
SIGRTMIN+18 and memory pressure handling should still be added to: hostnamed, localed, oomd, timedated.
socket units: allow creating a udev monitor socket with ListenDevices= or so, with matches, then activate app through that passing socket over
special case some calls of chase() to use openat2() internally, so that the kernel does what we otherwise do.
Split vconsole-setup in two, of which the second is started via udev (instead of the "restart" job it currently fires). That way, boot becomes purely positive again, and we can nicely order the two against each other.
start making use of the new --graceful switch to util-linux' umount command
start using STATX_SUBVOL in btrfs_is_subvol(). Also, make use of it generically, so that image discovery recognizes bcachefs subvols too.
storagetm: maybe also serve the specified disk via HTTP? we have glue for microhttpd anyway already. Idea would also be to serve currently booted UKI as separate HTTP resource, so that EFI http boot on another system could directly boot from our system, with full access to the hdd.
storagetm:
support boot into nvme-over-tcp: add generator that allows specifying nvme devices on kernel cmdline + credentials. Also maybe add interactive mode (where the user is prompted for nvme info), in order to boot from other system's HDD.
support crash reporting operation modes (https://live.gnome.org/GnomeOS/Design/Whiteboards/ProblemReporting)
support projid-based quota in machinectl for containers
Support ReadWritePaths/ReadOnlyPaths/InaccessiblePaths in systemd --user instances via the new unprivileged Landlock LSM (https://landlock.io)
support specifying download hash sum in systemd-import-generator expression to pin image/tarball.
sync dynamic uids/gids between host+portable service (i.e. if DynamicUser=1 is set for a service, make sure that the selected user is resolvable in the service even if it ships its own /etc/passwd)
synchronize console access with BSD locks: https://lists.freedesktop.org/archives/systemd-devel/2014-October/024582.html
sysext: before applying a sysext, do a superficial validation run so that things are not rearranged too wildly. I.e. protect against accidental fuckups, such as masking out /usr/lib/ or so. We should probably refuse if existing inodes are replaced by other types of inodes or so.
sysext: measure all activated sysext into a TPM PCR
system BPF LSM policy that enforces that block device backed mounts may only be established on top of dm-crypt or dm-verity devices, or an allowlist of file systems (which should probably include vfat, for compat with the ESP)
system BPF LSM policy that prohibits creating files owned by "nobody" system-wide
system BPF LSM policy that prohibits creating or opening device nodes outside of devtmpfs/tmpfs, except if they are the pseudo-devices /dev/null, /dev/zero, /dev/urandom and so on.
"systemctl preset-all" should probably order the unit files it operates on lexicographically before starting to work, in order to ensure deterministic behaviour if two unit files conflict (like DMs do, for example)
systemctl, machinectl, loginctl: port "status" commands over to format-table.c's vertical output logic.
systemctl:
systemd-analyze inspect-elf should show other notes too, at least build-id.
systemd-analyze netif that explains predictable interface names (or networkctl)
systemd-analyze: port "pcrs" verb to talk directly to TPM device, instead of using sysfs interface (well, or maybe not, as that would require privileges?)
systemd-boot: maybe add support for collapsing menu entries of the same OS into one item that can be opened (like in a "tree view" UI element) or collapsed. If only a single OS is installed, disable this mode, but if multiple OSes are installed might make sense to default to it, so that user is not immediately bombarded with a multitude of Linux kernel versions but only one for each OS.
systemd-creds: extend encryption logic to support asymmetric encryption/authentication. Idea: add new verb "systemd-creds public-key" which generates a priv/pub key pair on the TPM2 and stores the priv key locally in /var. It then outputs a certificate for the pub part to stdout. This can then be copied/taken elsewhere, and can be used for encrypting creds that only the host on its specific hw can decrypt. Then, support a drop-in dir with certificates that can be used to authenticate credentials. Flow of operations is then this: build image with owner certificate, then after boot up issue "systemd-creds public-key" to acquire pubkey of the machine. Then, when passing data to the machine, sign with privkey belonging to one of the dropped in certs and encrypted with machine pubkey, and pass to machine. Machine is then able to authenticate you, and confidentiality is guaranteed.
systemd-cryptenroll: add --firstboot or so, that will interactively ask user whether recovery key shall be enrolled and do so
systemd-dissect: add --cat switch for dumping files such as /etc/os-release
systemd-dissect: show available versions inside of a disk image, i.e. if multiple versions are around of the same resource, show which ones. (in other words: show partition labels).
systemd-firstboot: optionally install an ssh key for root for offline use.
systemd-gpt-auto-generator: add kernel cmdline option to override block device to dissect. also support dissecting a regular file. usecase: include encrypted/verity root fs in UKI.
systemd-inhibit: make taking delay locks useful: support sending SIGINT or SIGTERM on PrepareForSleep()
systemd-measure tool:
systemd-mount should only consider modern file systems when mounting, similar to systemd-dissect
systemd-path: Add "private" runtime/state/cache dir enum, mapping to $RUNTIME_DIRECTORY, $STATE_DIRECTORY and such
systemd-pcrextend:
systemd-repart:
systemd-stub: maybe store a "boot counter" in the ESP, and pass it down to userspace to allow ordering boots (for example in journalctl). The counter would be monotonically increased on every boot.
systemd-sysext: add "exec" command or so that is a bit like "refresh" but runs it in a new namespace and then just executes the selected binary within it. Could be useful to run one-off binaries inside a sysext as a CLI tool.
systemd-tpm2-setup should support a mode where we refuse booting if the SRK changed. (Must be opt-in, to not break systems which are supposed to be migratable between PCs)
systemd-tpm2-support: add some logic that detects if system is in DA lockout mode, and queries the user for TPM recovery PIN then.
add a networking provider API, inspired by the StorageProvider. Make networkd a provider that exposes interfaces for adding tap, tun, veth via the api, base this on .netdev logic somehow.
$SYSTEMD_EXECPID that the service manager sets should be augmented with $SYSTEMD_EXECPIDFD (and similar for other env vars we might send).
sysupdate:
sysusers: allow specifying a path to an inode and a literal UID in the UID column, so that if the inode exists it is used, and if not the literal UID is used. Use this for services such as the imds one, which run under their own UID in the initrd, and whose data should survive to the host, properly owned.
teach ConditionKernelCommandLine= globs or regexes (in order to match foobar={no,0,off})
teach nspawn/machined a new bus call/verb that gets you a shell in containers that have no sensible pid1, via joining the container, and invoking a shell directly. Then provide another new bus call/verb that is somewhat automatic: if we detect that pid1 is running and fully booted up we provide a proper login shell, otherwise just a joined shell. Then expose that as primary way into the container.
teach parse_timestamp() timezones like the calendar spec already knows it
teach systemd-nspawn the boot assessment logic: hook up vpick's try counters with success notifications from nspawn payloads. When this is enabled, automatically support reverting back to older OS version images if newer ones fail to boot.
test/:
The bind(AF_UNSPEC) construct (for resetting sockets to their initial state) should be blocked in many cases because it punches holes in many sandboxes.
the pub/priv key pair generated on the TPM2 should probably also be one you can use to get a remote attestation quote.
The udev blkid built-in should expose a property that reflects whether media was sensed in USB CF/SD card readers. This should then be used to control SYSTEMD_READY=1/0 so that USB card readers aren't picked up by systemd unless they contain a medium. This would mirror the behaviour we already have for CD drives.
There's currently no way to cancel fsck (used to be possible via C-c or c on the console)
there's probably something wrong with having user mounts below /sys, as we have for debugfs. for example, src/core/mount.c handles mounts prefixed with /sys generally special. https://lists.freedesktop.org/archives/systemd-devel/2015-June/032962.html
think about requeuing jobs when daemon-reload is issued? use case: the initrd issues a reload after fstab from the host is accessible and we might want to requeue the mounts local-fs acquired through that automatically.
timer units:
timesyncd: add ugly bus calls to set NTP servers per-interface, for usage by NM
timesyncd: when saving/restoring clock try to take boot time into account. Specifically, along with the saved clock, store the current boot ID. When starting, check if the boot id matches. If so, don't do anything (we are on the same boot and clock just kept running anyway). If not, then read CLOCK_BOOTTIME (which started at boot), and add it to the saved clock timestamp, to compensate for the time we spent booting. If EFI timestamps are available, also include that in the calculation. With this we'll then only miss the time spent during shutdown after timesync stopped and before the system actually reset.
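The restore logic described above boils down to a small decision function. This is only a sketch under stated assumptions: the function and parameter names are made up, all values are in microseconds, and `efi_usec` stands for whatever firmware and boot loader time the EFI timestamps yield (0 if unavailable).

```python
from typing import Optional


def restored_clock_usec(
    saved_usec: int,
    saved_boot_id: str,
    current_boot_id: str,
    boottime_usec: int,
    efi_usec: int = 0,
) -> Optional[int]:
    """Return the clock value to restore, or None if nothing should be done.

    Hypothetical helper: if the stored boot ID matches the current one, we are
    still on the same boot and the clock simply kept running.
    """
    if saved_boot_id == current_boot_id:
        return None
    # Different boot: compensate for time spent booting (CLOCK_BOOTTIME) and,
    # when known, the time spent in firmware and boot loader before that.
    return saved_usec + boottime_usec + efi_usec
```

What this cannot account for is the interval between timesyncd stopping and the actual reset, which is exactly the residual error the item mentions.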
tiny varlink service that takes a fd passed in and serves it via http. Then make use of that in networkd, and expose some EFI binary of choice for DHCP/HTTP based EFI boot.
tmpfiles:
To mimic the new tpm2-measure-pcr= crypttab option and tpm2-measure-nvpcr= veritytab option, add the same to integritytab (measuring the HMAC key if one is used)
tpm2-setup: reboot if we detect SRK changed
tpm2: add (optional) support for generating a local signing key from PCR 15 state. use private key part to sign PCR 7+14 policies. stash signatures for expected PCR7+14 policies in EFI var. use public key part in disk encryption. generate new sigs whenever db/dbx/mok/mokx gets updated. that way we can securely bind against SecureBoot/shim state, without having to re-enroll everything on each update (but we still have to generate one sig on each update, but that should be robust/idempotent). needs rollback protection, as usual.
TPM2: auto-reenroll in cryptsetup, as fallback for hosed firmware upgrades and such
track the per-service PAM process properly (i.e. as an additional control process), so that it may be queried on the bus and everything.
transient units: don't bother with actually setting unit properties, we reload the unit file anyway
transient units:
Turn systemd-networkd-wait-online into a small varlink service that people can talk to and specify exactly what to wait for via a method call, and get a response back once that level of "online" is reached.
tweak journald context caching. In addition to caching per-process attributes keyed by PID, cache per-cgroup attributes (i.e. the various xattrs we read) keyed by cgroup path, and guarded by ctime changes. This should provide us with a nice speed-up on services that have many processes running in the same cgroup.
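A minimal sketch of the per-cgroup cache idea, keyed by cgroup path and guarded by ctime. The class name and the `loader` callback (standing in for "read the various xattrs") are illustrative assumptions, not journald code.

```python
import os
from typing import Any, Callable, Dict, Tuple


class CgroupAttrCache:
    """Cache per-cgroup attributes, invalidated when the directory ctime changes."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader  # hypothetical callback that reads the xattrs
        self._entries: Dict[str, Tuple[int, Any]] = {}

    def get(self, cgroup_path: str) -> Any:
        # Setting or removing an xattr updates the inode's ctime, so an
        # unchanged ctime means the cached attributes are still valid.
        ctime = os.stat(cgroup_path).st_ctime_ns
        cached = self._entries.get(cgroup_path)
        if cached is not None and cached[0] == ctime:
            return cached[1]
        attrs = self._loader(cgroup_path)
        self._entries[cgroup_path] = (ctime, attrs)
        return attrs
```

With many processes in one cgroup, every log message after the first then costs one stat() instead of several xattr reads.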
tweak sd-event's child watching: keep a prioq of children to watch and use waitid() only on the children with the highest priority until one is waitable and ignore all lower-prio ones from that point on
udev-link-config:
udev:
udevadm: to make symlink querying with udevadm nicer:
udevd: extend memory pressure logic: also kill any idle worker processes
unify how blockdev_get_root() and sysupdate find the default root block device
unify on openssl:
unit files:
unit install:
update HACKING.md to suggest developing systemd with the ideas from: https://0pointer.net/blog/testing-my-system-code-in-usr-without-modifying-usr.html https://0pointer.net/blog/running-an-container-off-the-host-usr.html
use name_to_handle_at() with AT_HANDLE_FID instead of .st_ino (inode number) for identifying inodes, for example in copy.c when finding hard links, or loop-util.c for tracking backing files, and other places.
use sd-event ratelimit feature optionally for journal stream clients that log too much
userdb: allow existence checks
userdb: when synthesizing NSS records, pick "best" password from defined passwords, not just the first. i.e. if there are multiple defined, prefer unlocked over locked and prefer non-empty over empty.
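One possible reading of the "best" password rule, sketched below. Two assumptions are baked in: a leading "!" or "*" marks a locked entry (the shadow(5) convention), and "unlocked" outranks "non-empty" when the two preferences conflict. Both helper names are hypothetical.

```python
from typing import Optional, Sequence


def is_locked(hashed: str) -> bool:
    # Assumption: shadow(5)-style convention for locked password fields.
    return hashed.startswith(("!", "*"))


def pick_best_password(hashed_passwords: Sequence[str]) -> Optional[str]:
    """Pick the entry to expose via NSS, or None if there are no entries."""
    if not hashed_passwords:
        return None
    # max() returns the first maximal element, so earlier entries win ties.
    return max(hashed_passwords, key=lambda h: (not is_locked(h), h != ""))
```

For example, given a locked entry, an empty entry, and two regular hashes, the first regular hash is chosen.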
userdbd: implement an additional varlink service socket that provides the host user db in restricted form, then allow this to be bind mounted into sandboxed environments that want the host database in minimal form. All records would be stripped of all meta info, except the basic UID/name info. Then use this in portabled environments that do not use PrivateUsers=1.
validatefs: validate more things: check if image id + os id of initrd match target mount, so that we refuse early any attempts to boot into different images with the wrong kernels. check min/max kernel version too. all encoded via xattrs in the target fs.
Varlinkification of the following command line tools, to open them up to other programs via IPC:
verify that the AF_UNIX sockets of a service in the fs still exist when we start a service in order to avoid confusion when a user assumes starting a service is enough to make it accessible
vmspawn: switch default swtpm PCR bank to SHA384-only (away from SHA256), at least on 64bit archs, simply because SHA384 typically has double the hashing speed of SHA256 on 64bit archs (since it is based on 64bit words, unlike SHA256 which uses 32bit words).
vmspawn:
vmspawn disk hotplug:
virtio-blk-pci — the simplest path. Each disk is an independent PCI device. QMP sequence (two steps):
blockdev-add {driver: "raw", node-name: "disk1",
file: {driver: "file", filename: "/path/to/img"}}
device_add {driver: "virtio-blk-pci", id: "disk1", drive: "disk1"}
Removal (three steps):
device_del {id: "disk1"}
... wait for DEVICE_DELETED event (guest acknowledges unplug) ...
blockdev-del {node-name: "disk1"}
Works on both i440fx (legacy PCI) and q35 (PCIe) machine types. PCI address auto-assigned by QEMU — no topology pre-configuration needed. Each disk independently hotpluggable. Guest sees a virtio block device (/dev/vdX). Well-tested path — used by libvirt, Incus, and all major VM managers. No special boot-time setup required.
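The virtio-blk-pci sequences above use a shorthand notation. On the wire, QMP commands are JSON objects with "execute" and "arguments" members. A rough sketch of building them follows. The helper names are hypothetical and are not part of vmspawn, and sending the commands and reading the replies is left out.

```python
import json


def qmp_command(name, arguments):
    """Serialize one QMP command in the execute/arguments envelope."""
    return json.dumps({"execute": name, "arguments": arguments})


def virtio_blk_add_commands(disk_id, image_path):
    """Hypothetical helper: the two commands that hotplug a raw image."""
    return [
        qmp_command("blockdev-add", {
            "driver": "raw",
            "node-name": disk_id,
            "file": {"driver": "file", "filename": image_path},
        }),
        qmp_command("device_add", {
            "driver": "virtio-blk-pci",
            "id": disk_id,
            "drive": disk_id,
        }),
    ]


def virtio_blk_remove_commands(disk_id):
    """Hypothetical helper: returns (device_del, blockdev-del).

    The caller must wait for the DEVICE_DELETED event after sending the
    first command and before sending the second.
    """
    return (
        qmp_command("device_del", {"id": disk_id}),
        qmp_command("blockdev-del", {"node-name": disk_id}),
    )


def is_device_deleted(message, disk_id):
    """True if a decoded QMP message is the DEVICE_DELETED event for disk_id."""
    return (
        message.get("event") == "DEVICE_DELETED"
        and message.get("data", {}).get("device") == disk_id
    )
```

Keeping the removal as two separate commands makes the ordering explicit: the block node is only deleted once the guest has acknowledged the unplug.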
NVMe — two-level model: controller + namespace(s). The controller is a PCIe device; namespaces live on an internal NVMe bus attached to the controller. Key limitation: namespaces are NOT hotpluggable — TYPE_NVME_BUS has no HotplugHandler, so device_add of nvme-ns at runtime fails with "Bus does not support hotplugging". The only option is hotplugging the entire controller, which embeds one namespace via its "drive" property:
blockdev-add {driver: "raw", node-name: "disk1",
file: {driver: "file", filename: "/path/to/img"}}
device_add {driver: "nvme", id: "disk1", drive: "disk1", serial: "disk1"}
Same two-step pattern as virtio-blk, with these limitations:
virtio-scsi — shared virtio-scsi-pci controller with individual scsi-hd devices attached. Incus uses this as its default bus. The controller must exist at boot, but individual disks (LUNs) can be hotplugged onto it without burning PCI slots. Scales better than virtio-blk when many disks are needed, but adds complexity (controller management, LUN assignment).
vmspawn AcquireQMP(): implement as id-rewriting proxy with FD passing. vmspawn acts as a QMP multiplexer. When a client calls AcquireQMP():
This keeps vmspawn in full control of the QMP connection — VMControl handlers and multiple AcquireQMP clients can coexist without id collisions. The server needs SD_VARLINK_SERVER_ALLOW_FD_PASSING_OUTPUT (already set in machined's pattern).
AcquireQMP() also requires server-side Varlink protocol upgrades. mvo's WIP branch: https://github.com/systemd/systemd/compare/main...mvo5:varlink-protocol-upgrade-server-side?expand=1
we probably need .pcrpkeyrd or so as an additional PE section in UKIs, which contains a separate public key for PCR values that only apply in the initrd, i.e. in the boot phase "enter-initrd". Then, consumers in userspace can easily bind resources to just the initrd. Similarly, maybe one more for "enter-initrd:leave-initrd" for resources that shall be accessible only before unprivileged user code is allowed. (we only need this for .pcrpkey, not for .pcrsig, since the latter is a list of signatures anyway). With that, when you enroll a LUKS volume or similar, pick either the .pcrpkey (for coverage through all phases of the boot, but excluding shutdown), the .pcrpkeyrd (for coverage in the initrd only) or the .pcrpkeybt (for coverage until users are allowed to log in).
we probably should have some infrastructure to acquire sysexts with drivers/firmware for local hardware automatically. Idea: reuse the modalias logic of the kernel for this: make the main OS image install a hwdb file that matches against local modalias strings, and adds properties to relevant devices listing names of sysexts needed to support the hw. Then provide some tool that goes through all devices and tries to acquire/download the specified images.
We should probably replace /etc/rc.d/README with a symlink to doc content. After all it is constant vendor data.
We should start measuring all services, containers, and system extensions we activate, probably into PCR 13, i.e. add --tpm2-measure-pcr= or so to systemd-nspawn, and MeasurePCR= to unit files. Should contain a measurement of the activated configuration and the image that is being activated (in case verity is used, hash of the root hash).
what to do about udev db binary stability for apps? (raw access is not an option)
when importing an fs tree with machined, complain if image is not an OS
when importing an fs tree with machined, optionally apply userns-rec-chown
when isolating, try to figure out a way to implicitly order all units we stop before the isolating unit...
when killing due to service watchdog timeout, maybe detect whether the target process is being ptraced, and if so log loudly and continue instead.
when mounting disk images: if IMAGE_ID/IMAGE_VERSION is set in os-release data in the image, make sure the image filename actually matches this, so that images cannot be misused.
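One possible shape for such a check is sketched below. What "matches" should mean is not defined in the item, so this assumes a naming scheme of IMAGE_ID, optionally followed by "_" and IMAGE_VERSION, with a ".raw" suffix. The function names are hypothetical, and the os-release parsing is simplified (no shell-style escape handling).

```python
def parse_os_release(text):
    """Simplified os-release parser: KEY=value lines, surrounding quotes stripped."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip("\"'")
    return fields


def image_name_matches(filename, os_release_text):
    """Check a file name against IMAGE_ID/IMAGE_VERSION, if they are set."""
    fields = parse_os_release(os_release_text)
    image_id = fields.get("IMAGE_ID")
    version = fields.get("IMAGE_VERSION")

    stem = filename.rsplit("/", 1)[-1]
    if stem.endswith(".raw"):  # assumed suffix, other image types are ignored here
        stem = stem[: -len(".raw")]

    if image_id is not None and version is not None:
        return stem == f"{image_id}_{version}"
    if image_id is not None:
        return stem == image_id or stem.startswith(image_id + "_")
    if version is not None:
        return stem.endswith("_" + version)
    return True  # nothing declared, nothing to enforce
```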
when switching root from initrd to host, set the machine_id env var so that if the host has no machine ID set yet we continue to use the random one the initrd had set.
when systemd-sysext learns mutable /usr/ (and systemd-confext mutable /etc/) then allow them to store the result in a .v/ versioned subdir, for some basic snapshot logic
when we detect that there are waiting jobs but no running jobs, do something
whenever we receive fds via SCM_RIGHTS make sure none got dropped due to the reception limit the kernel silently enforces.
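One signal that can be checked on the receiving side is the MSG_CTRUNC flag, which Linux sets when not all passed fds fit into the control buffer. The sketch below shows that check in Python rather than in systemd's C code. The function name is hypothetical, and whether MSG_CTRUNC covers every case this item has in mind is an assumption.

```python
import array
import socket


def recv_fds_checked(sock, max_fds, bufsize=1):
    """Receive data plus up to max_fds file descriptors.

    Returns (data, fds, truncated). truncated is True when the kernel
    reported MSG_CTRUNC, i.e. some ancillary data was dropped.
    """
    fds = array.array("i")
    data, ancdata, flags, _ = sock.recvmsg(
        bufsize, socket.CMSG_LEN(max_fds * fds.itemsize)
    )
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore a trailing partial integer, if any.
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    truncated = bool(flags & socket.MSG_CTRUNC)
    return data, list(fds), truncated
```

A caller that sees truncated set should treat the message as incomplete and close whatever fds did arrive, rather than continue with a partial set.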
after option+verb introspection is added, add a test to verify that the list in proc-cmdline.c matches the actual option list in systemd and shutdown.
write a document explaining how to write correct udev rules. Mention things such as: