Forum: >>> Magnum BBS <<<

Reviving schroot as used by sbuild

From Helmut Grohne@21:1/5 to All on Tue Jun 25 10:20:01 2024

Hi,

sbuild is our primary tool for constructing a build environment to build
Debian packages. It is used on all buildds and for a long time, the
backend used with sbuild has always been schroot. More recently, a
number of buildds have been moved away from schroot towards --chroot-mode=unshare thanks to the work of at least Aurelien Jarno and
Jochen Sprickerhof and a few more working too far behind the scenes for me
to spot them directly.

In this work, limitations with --chroot-mode=unshare became apparent and
that led to Johannes, Jochen and me sitting down in Berlin pondering
ideas on how to improve the situation. That is a longer story, but
eventually Timo Röhling asked the innocuous question of why we cannot
just use schroot and make it work with namespaces.

That led me to sit down and write a proof of concept. As a result, we
now have a little script called unschroot.py that vaguely can be used as
a drop-in replacement for schroot when used with sbuild. In trixie and bookworm-backports it can now be plugged into sbuild by setting $schroot
= "path/to/unschroot.py" thanks to Johannes. It's not that long and can
be viewed at https://git.subdivi.de/~helmut/python-linuxnamespaces.git/tree/examples/unschroot.py.
It is vaguely close to reaching feature-parity with sbuild --chroot-mode=unshare and operates in a very similar way. As it is now,
it doesn't bring us any benefits beyond separating the containment
aspect from the build aspect into different tools.

The split into different tools is important in my view. I argue that it
allows easier experimentation and that its architecture may enable features
that were difficult to implement using sbuild --chroot-mode=unshare, as
sbuild is increasingly becoming a container runtime of its own and
that is where things start to get messy.

Is this a path worth pursuing further? Would we actually consider moving
back from sbuild --chroot-mode=unshare to sbuild --chroot-mode=schroot
with a different schroot implementation?

Related to that, what would be compelling features to switch?

Let me go a bit further into detail. There are two approaches to
managing an ephemeral build container using namespaces. In one approach,
we create a directory hierarchy of a container root filesystem and for
each command and hook that we invoke there, we create new namespaces on
demand. In particular, there are no background processes when nothing is running in that container and all that remains is its directory
hierarchy. Such a container session can easily survive a reboot (unless
stored on tmpfs). Both sbuild --chroot-mode=unshare and unschroot.py
follow this approach. For comparison, schroot sets up mounts (e.g. /proc)
when it begins a session and cleans them up when it ends. No such
persistent mounts exist in either sbuild --chroot-mode=unshare or
unschroot.py.
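
As a rough illustration of this first approach (not the actual code of sbuild or unschroot.py), the following sketch builds a command line for unshare(1) from util-linux that runs one command in freshly created namespaces on top of an unpacked root directory. The function name and the example directory are made up, and --map-root-user only maps a single uid, whereas the real tools map a full uid range via newuidmap.

```python
def per_command_argv(rootdir, command):
    """Build an argv that runs `command` in fresh namespaces rooted at rootdir.

    Hypothetical helper for illustration only; nothing stays behind once the
    command exits except the directory hierarchy itself.
    """
    return [
        "unshare",
        "--user", "--map-root-user",  # new user namespace, caller mapped to uid 0
        "--mount",                    # private mount namespace, discarded on exit
        "--pid", "--fork",            # new pid namespace for reliable process cleanup
        f"--root={rootdir}",          # use the unpacked directory as the root
        "--", *command,
    ]


if __name__ == "__main__":
    # The path is an assumption; any unpacked chroot directory would do.
    print(" ".join(per_command_argv("/srv/chroots/unstable", ["dpkg", "--version"])))
```

Every invocation creates and tears down its own namespaces, which is why no mounts or processes persist between commands.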

The other approach is using one set of namespaces for the entire
session. Practically, this implies having a background process keeping
this namespace alive for the duration of the session and talking to it
via some IPC mechanism. We may still spawn a new pid namespace for each
command to get reliable process cleanup, but the use of a persistent
mount namespace enables the use of fuse2fs, squashfuse, overlayfs and
bindfs to construct the root directory of the container by other means
than unpacking a tar into a directory. In particular, the use of bindfs
allows sharing e.g. the user's ccache with the build container in
principle (with proper id shifting). At the time of this writing, this
second approach is wishful thinking and not implemented at all. I merely believe that it is implementable with the schroot API already
implemented by unschroot.py above.
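
To make the second approach a bit more concrete, here is a minimal sketch, assuming util-linux's unshare(1) and nsenter(1) stand in for the background process and the IPC mechanism. The helper names are hypothetical, this is not how unschroot.py works today, and the synchronization needed to wait until the holder process has finished setting up its namespaces is deliberately left out.

```python
def holder_argv():
    """Command whose process keeps one user and mount namespace alive.

    Without --fork, unshare execs sleep directly, so the pid of the spawned
    process is the pid that owns the namespaces.
    """
    return ["unshare", "--user", "--map-root-user", "--mount", "sleep", "infinity"]


def enter_argv(holder_pid, command):
    """Run `command` in the held namespaces, but in a fresh pid namespace."""
    return [
        "nsenter", f"--target={holder_pid}", "--user", "--mount", "--",
        # A new pid namespace per command gives reliable process cleanup
        # while the mount namespace (and anything mounted in it) persists.
        "unshare", "--pid", "--fork", "--",
        *command,
    ]


if __name__ == "__main__":
    print(" ".join(holder_argv()))
    print(" ".join(enter_argv(12345, ["true"])))  # 12345 is a placeholder pid
```

Mounts created inside the held mount namespace (fuse2fs, overlayfs, bindfs and so on) would remain visible to every later command until the holder process exits.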

Another possible extension is a hooking mechanism. Regular schroot has
hooks already and I've seen requests for sbuild to use package-specific chroots. For instance, one may have a separate Haskell or Rust container
that already has a basic set of ecosystem-specific dependencies to speed
up the installation of Build-Depends. On-demand updating of chroots has
also been requested. However, it's not clear to me what a useful
interface e.g. unschroot.py could provide for such hooking yet and I
invite you to provide more use cases for such hooking. Also sketching
how you imagine interfacing with this would be helpful. For instance,
you may explain what kind of configuration files or options you'd like
to use and how you imagine them to work.

I note that this is not a promise that I am going to implement your
wishes. I intend to do more work on this and barring really useful
extensions, my next goal would be moving to that other approach.

Please allow me to thank Freexian for supporting part of this work
financially even though it has been my initiative and is not otherwise influenced by Freexian at the time of this writing.

Let me also explain the relation between "unschroot.py" and the
containing repository "python-linuxnamespaces". linuxnamespaces is a
(probably) distribution-agnostic Python module providing plumbing
functions for constructing container runtimes written by myself for lack
of better alternatives. As such unschroot.py in large parts uses linuxnamespaces (the Python module) to plug together the various parts
needed to arrive at a container useful for building with sbuild. If
unschroot takes off, it likely needs to get its own home.
linuxnamespaces is supposed to enable constructing a systemd-as-pid-1
container as a regular user, but doesn't do that as of yet. While podman
and docker allow running unprivileged application containers, they still require privileged containers when you want to run systemd-as-pid-1.

Helmut

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Johannes Schauer Marin Rodrigues@21:1/5 to All on Tue Jun 25 11:40:01 2024

Hi,

Quoting Helmut Grohne (2024-06-25 10:16:20)

In this work, limitations with --chroot-mode=unshare became apparent and that led to Johannes, Jochen and me sitting down in Berlin pondering ideas on how to improve the situation. That is a longer story, but eventually Timo Röhling
asked the innocuous question of why we cannot just use schroot and make it work with namespaces.

For those who are interested in the longer story behind all of this, let me explain it here. Maybe this background helps to give a bit more context on why we were thinking about this. If my memory serves me right, the initial trigger for all of this was an idea that Julian Andres Klode had: instead of having to run

$ mmdebstrap --variant=buildd unstable ~/.cache/sbuild/unstable-arm64.tar

manually before running sbuild for the first time and every once in a while after that, could the sbuild unshare backend not run this command automatically if the chroot tarball doesn't exist yet or became too old? If that were the default, setting up a package builder on a new system would be as simple as running:

$ sudo apt install sbuild
$ sbuild --chroot-mode=unshare -d unstable my_cool_package

Just install the package, no setup required, and just start building things. So I wrote this MR:

https://salsa.debian.org/debian/sbuild/-/merge_requests/59
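
The "create the tarball if it is missing or too old" idea can be sketched as follows. This is only an illustration under stated assumptions, not the code from the MR: the function names and the seven-day maximum age are invented for the example, while the mmdebstrap invocation mirrors the command shown above.

```python
import os
import time


def tarball_needs_refresh(path, max_age_days, now=None):
    """Return True if the chroot tarball is missing or older than max_age_days."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return True  # no tarball yet: it has to be created
    return now - mtime > max_age_days * 86400


def mmdebstrap_argv(suite, path):
    """Command line that would (re)create the tarball."""
    return ["mmdebstrap", "--variant=buildd", suite, path]


if __name__ == "__main__":
    tarball = os.path.expanduser("~/.cache/sbuild/unstable-arm64.tar")
    if tarball_needs_refresh(tarball, max_age_days=7):  # 7 days is an assumption
        print("would run:", " ".join(mmdebstrap_argv("unstable", tarball)))
```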

Besides automatically creating the chroot, updating the chroot and setting the maximum age of the chroot tarball, this MR also allows passing custom options to mmdebstrap depending on the chroot name. So somebody who wants an ubuntu buildd chroot could have in their ~/.sbuildrc:

$UNSHARE_MMDEBSTRAP_EXTRA_ARGS = {
    "focal" => ["--components=universe,multiverse"]
};

And if you want a custom chroot for the rust packages you build, you could have (using %d and %a as percent escapes for distribution and architecture):

$UNSHARE_MMDEBSTRAP_EXTRA_ARGS = {
    "debcargo-%d-%a" => ["--include=ccache,gnupg,dh-cargo,cargo,lintian,perl-openssl-defaults"]
};
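
How the %d and %a escapes in such a key might be expanded can be sketched like this. It is a guess at the behaviour for illustration, not the implementation from the MR; the function name is hypothetical and the handling of a literal %% is an added assumption.

```python
def expand_chroot_name(pattern, distribution, architecture):
    """Expand %d (distribution) and %a (architecture) in a chroot name pattern."""
    escapes = {"d": distribution, "a": architecture, "%": "%"}  # "%%" is an assumption
    result = []
    chars = iter(pattern)
    for ch in chars:
        if ch != "%":
            result.append(ch)
            continue
        key = next(chars, None)  # the character following the percent sign
        if key not in escapes:
            raise ValueError(f"unknown escape %{key}")
        result.append(escapes[key])
    return "".join(result)


if __name__ == "__main__":
    print(expand_chroot_name("debcargo-%d-%a", "unstable", "arm64"))
```

With such an expansion, a chroot requested under the expanded name for a given distribution and architecture could be matched to its extra mmdebstrap arguments.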

But the big question now is: should all of this functionality and complexity live in sbuild? Or should it be moved out of sbuild so that, for example, the Rust team can just have their custom sbuild chroot script doing all the setup and customization they require without sbuild carrying that functionality? The question of how best to let sbuild use an external chroot manager led to Timo's idea of just replacing 'schroot' with something else which provides the interface that sbuild uses to communicate with schroot but then does its own thing behind the scenes. We have such a thing now and that is Helmut's unschroot.py.

But this is not the only option forward. Another option would be to re-purpose the sbuild autopkgtest backend for this.

Ultimately, what do we want to achieve? We want to make package building easier, less tedious and more customizable. We are thinking about what the best architecture would be to achieve this. We have multiple options:

1. bolt the functionality we want into sbuild as extensions to the unshare
backend or by creating new backends with the desired functionality -- this
is https://salsa.debian.org/debian/sbuild/-/merge_requests/59
2. replace the schroot binary with something else which shares the schroot
interface -- this is unschroot.py
3. move functionality into autopkgtest backends and then make sbuild just a
wrapper around autopkgtest

Choosing one of these three options as the correct software engineering approach becomes even more tricky when we start thinking of persistence. Providing a persistent user and mount namespace (for example to be able to use overlay filesystems on top of an unpacked chroot directory) will be *very* tricky with the current design of the unshare backend and would probably need to become its own backend if option 1 is the one we want to pursue.

Persistence is also something that would be useful for a backend around qemu. Currently, there exists sbuild-qemu, maintained by Christian Kastner, which is not a new backend but a convenience wrapper on top of sbuild, driving it with the autopkgtest backend and autopkgtest-virt-qemu as the virt server. It would be great if building packages inside a qemu VM became easier, and option 2 would allow users to create a backend that starts qemu and then communicates efficiently with a process inside qemu via an AF_VSOCK socket.

Lastly, persistence is also a requirement for building packages inside a system container which runs an actual init process when it spins up.

These last two bits (qemu and system containers) also become a very interesting topic to think about because in contrast to minimal application containers or a simple chroot directory, we do not want to build packages directly in them (because we want to build packages in a minimal setup). So now sbuild would have to manage 3 environments:

1. the system on the outside
2. the system inside qemu or the system container
3. the system in the minimal build chroot inside the VM

This is also why option 3 from above (autopkgtest) is not an obvious solution: teaching autopkgtest about these environments will also take some effort. Why it matters that sbuild understands how to reach from one of these environments into the next inner one can be shown by one very simple example: why do we have apt inside our build chroots? The main reason is that the schroot interface does not provide a facility to let a tool from the outside (environment 1) operate on the inside of the chroot (environment 3), and that is one reason why option 2, unschroot.py, is not the fits-all solution. The unshare backend has this ability. If sbuild only used the unshare backend and we dropped the schroot backend, I could easily allow build chroots containing nothing other than essential, build-essential and the build dependencies.

So, this was a big brain dump. Thank you for getting this far. I've had this in my head for several weeks now and even though I'm very excited to see where Helmut's unschroot.py can take us, I do not yet see one way forward that I am entirely happy with.

Maybe you can share your thoughts.

Thanks!

cheers, josch

From Simon McVittie@21:1/5 to Helmut Grohne on Tue Jun 25 15:10:01 2024

On Tue, 25 Jun 2024 at 10:16:20 +0200, Helmut Grohne wrote:

In this work, limitations with --chroot-mode=unshare became apparent and
that led to Johannes, Jochen and me sitting down in Berlin pondering
ideas on how to improve the situation. That is a longer story, but
eventually Timo Röhling asked the innocuous question of why we cannot
just use schroot and make it work with namespaces.

I have to ask:

Could we use a container framework that is also used outside the Debian
bubble, rather than writing our own from first principles every time, and ending up with a single-maintainer project being load-bearing for Debian *again*? I had hoped that after sbuild's history with schroot becoming unmaintained, and then being revived by a maintainer-of-last-resort who
is one of the same few people who are critical-path for various other
important things, we would recognise that as an anti-pattern that we
should avoid if we can.

At the moment, rootless Podman would seem like the obvious choice. As far
as I'm aware, it has the same user namespaces requirements as the unshare backends in mmdebstrap, autopkgtest and schroot (user namespaces enabled, setuid newuidmap, 65536 uids in /etc/subuid, 65536 gids in /etc/subgid).

Podman uses the same OCI images as Docker, so it can either pull from a
trusted OCI registry, or use images that were built by importing a tarball generated by e.g. mmdebstrap or sbuild-createchroot. I assume that for
Debian we would want to do the latter, at least initially, to avoid
being forced to either trust an external registry like hub.docker.com
or operate our own.

Here's the Dockerfile/Containerfile to turn a sysroot tarball into an
OCI image (obviously it can be extended with LABELs and other
customizations, but this is fairly close to minimal):

FROM scratch
ADD sysroot.tar.gz /
CMD ["/bin/bash"]

The reason I suggest Podman rather than Docker is that Podman is normally "daemonless" (the container is an ordinary process tree, like schroot,
rather than being launched by command-execution RPC to dockerd) and
is normally used "rootless" (whereas Docker *can* be configured to be "rootless" but in practice it seems that's very uncommon).

podman is also supported as a backend by autopkgtest-virt-podman, Toolbx (podman-toolbox in Debian) and distrobox. autopkgtest's autopkgtest-build-podman does not yet support starting from a tarball
as described above, but it easily could (contributions welcome).

Or, if Podman is too "not invented here" for Debian's use, using rootless lxd/Incus is another option - although that introduces a dependency
on projects and formats that are rarely used outside the Debian/Ubuntu
bubble, which risks them becoming another schroot (and also requires us to decide whether we follow Canonical's lxd or the community fork Incus
post-fork, which could get somewhat political).

There are two approaches to
managing an ephemeral build container using namespaces. In one approach,
we create a directory hierarchy of a container root filesystem and for
each command and hook that we invoke there, we create new namespaces on demand. In particular, there are no background processes when nothing is running in that container and all that remains is its directory
hierarchy. Such a container session can easily survive a reboot (unless stored on tmpfs). Both sbuild --chroot-mode=unshare and unschroot.py
follow this approach. For comparison, schroot sets up mounts (e.g /proc)
when it begins a session and cleans them up when it ends. No such
persistent mounts exist in either sbuild --chroot-mode=unshare or unschroot.py.

Persisting a container root filesystem between multiple operations comes
with some serious correctness issues if there are "hooks" that can modify
it destructively on each operation: see <https://bugs.debian.org/499014>
and <https://bugs.debian.org/994836>. As a result of that, I think the
only model that should be used in new systems is to have some concept of
a session (like schroot type=file, but unlike schroot type=directory)
so that those "hooks" only run once, on session creation, preventing
them from arbitrarily reverting/overwriting changes that are subsequently
made by packages installed into the chroot/container (for example dbus' creation of the messagebus uid/gid in #499014, and exim4's creation of Debian-exim in #994836).

I don't know whether creating new namespaces multiple times (but without running external integration hooks the second and subsequent times)
will also lead to practical problems, but I note that outside the Debian bubble, everything that enters a new container environment seems to
operate by creating a process that encapsulates the container, and then
either letting it run to completion interactively or non-interactively
(`docker run`, etc.), or letting it run in the background (perhaps with
an init system or `sleep infinity` as its "payload" process) and then repeatedly injecting code into that pre-existing namespace
(either `docker exec`, etc., or something like ssh).

autopkgtest's Docker, Podman, lxc, lxd backends all operate by creating
a namespaced init or sleep process with `docker run` or equivalent, and
then injecting subsequent commands into the namespace that was created
for that long-running process with `docker exec` or equivalent.
I think unshare is the outlier here, and I think it would be good to
consider whether it really needs to be.
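
The "long-running payload plus injected commands" model described above can be sketched with podman command lines. This is a minimal sketch: the helper names, the container name and the image name are placeholders invented for the example, and autopkgtest's real backends do considerably more than this.

```python
def session_start_argv(image, name):
    """Start a container whose only job is to keep the namespaces alive."""
    return ["podman", "run", "--detach", "--rm", "--name", name,
            image, "sleep", "infinity"]


def session_exec_argv(name, command):
    """Inject a command into the namespaces of the already running container."""
    return ["podman", "exec", name, *command]


def session_end_argv(name):
    """Stop and remove the container, ending the session."""
    return ["podman", "rm", "--force", name]


if __name__ == "__main__":
    # "build-session" and "localhost/sysroot" are placeholder names.
    print(" ".join(session_start_argv("localhost/sysroot", "build-session")))
    print(" ".join(session_exec_argv("build-session", ["apt-get", "update"])))
    print(" ".join(session_end_argv("build-session")))
```

Any setup hooks would run once when the session starts, and every later command sees the same filesystem and mount state.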

The more like other container managers a new container manager is, the
less likely it is to break reasonable expectations in future, like
schroot regularly does.

While podman
and docker allow running unprivileged application containers, they still require privileged containers when you want to run systemd-as-pid-1.

What do you mean by "privileged containers" exactly? Do you mean a system service that runs with CAP_SYS_ADMIN and other scary privileges in the
init namespace, like the typical use of dockerd, or are you also counting
uses of the setuid newuidmap as being privileged?

If you are happy to use the setuid newuidmap (which I believe the unshare backends for schroot, mmdebstrap, autopkgtest also rely on) then my understanding is that "rootless" podman is essentially equivalent:
you need a setuid newuidmap, a range of 65536 uids in /etc/subuid,
a range of 65536 gids in /etc/subgid, and a kernel that will allow
unprivileged users to create new user namespaces, but beyond that there
are no special privileges required.
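
One of the prerequisites listed above, a sufficiently large range in /etc/subuid or /etc/subgid, can be checked with a few lines of code. This is a simplified sketch: the function name is made up, it only matches entries by login name (not by numeric uid), and it does not add up multiple smaller ranges.

```python
def has_subid_range(subid_text, user, needed=65536):
    """Check whether `user` has a subordinate id range of at least `needed` ids.

    subid_text is the content of /etc/subuid or /etc/subgid, whose lines
    have the form name:start:count.
    """
    for line in subid_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 3:
            continue  # skip malformed lines
        name, _start, count = fields
        if name == user and count.isdigit() and int(count) >= needed:
            return True
    return False


if __name__ == "__main__":
    import getpass
    with open("/etc/subuid") as f:
        print(has_subid_range(f.read(), getpass.getuser()))
```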

Please see /usr/share/doc/podman/README.Debian for details of what it needs.

For systemd-as-pid-1 specifically,
`autopkgtest-build-podman --init=systemd` and
`autopkgtest-virt-podman --init` demonstrate how this can be done, and
last time I tried, it was possible to run them unprivileged (other than
needing access to the setuid newuidmap, as above). systemd is able to
detect that it's running in a container and turn off functionality like
udev that would only be appropriate in a VM or on bare metal, and podman
knows how to tell systemd that it should do this.

smcv


From Faidon Liambotis@21:1/5 to Simon McVittie on Tue Jun 25 15:30:01 2024

On Tue, Jun 25, 2024 at 02:02:11PM +0100, Simon McVittie wrote:

Could we use a container framework that is also used outside the Debian bubble, rather than writing our own from first principles every time, and ending up with a single-maintainer project being load-bearing for Debian *again*? I had hoped that after sbuild's history with schroot becoming unmaintained, and then being revived by a maintainer-of-last-resort who
is one of the same few people who are critical-path for various other important things, we would recognise that as an anti-pattern that we
should avoid if we can.

Absolutely agreed, strong +1 on this.

At the moment, rootless Podman would seem like the obvious choice. As far
as I'm aware, it has the same user namespaces requirements as the unshare backends in mmdebstrap, autopkgtest and schroot (user namespaces enabled, setuid newuidmap, 65536 uids in /etc/subuid, 65536 gids in /etc/subgid).

I am perhaps a little biased, but I, too, think rootless Podman would be
the best for the job :)

Podman uses the same OCI images as Docker, so it can either pull from a trusted OCI registry, or use images that were built by importing a tarball generated by e.g. mmdebstrap or sbuild-createchroot. I assume that for
Debian we would want to do the latter, at least initially, to avoid
being forced to either trust an external registry like hub.docker.com
or operate our own.

Here's the Dockerfile/Containerfile to turn a sysroot tarball into an
OCI image (obviously it can be extended with LABELs and other
customizations, but this is fairly close to minimal):

Note that podman run also has --rootfs, that accepts the path to an
exploded container, and it supports both idmap and overlayfs on top of
it as well. So that's another option, one that skips image management, Dockerfiles etc. entirely, allowing for an even closer experience to the existing tooling.
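
A minimal sketch of the --rootfs variant, assuming an already unpacked chroot directory: the helper name and the path are placeholders, and the ":O" suffix asks podman to put a throwaway overlay on top of the directory so that the base stays unmodified.

```python
def rootfs_run_argv(rootfs_dir, command, overlay=True):
    """Build a podman command line that runs `command` on an exploded rootfs."""
    # With ":O", changes go to a temporary overlay instead of rootfs_dir itself.
    rootfs = f"{rootfs_dir}:O" if overlay else rootfs_dir
    return ["podman", "run", "--rm", "--rootfs", rootfs, *command]


if __name__ == "__main__":
    # The directory is an assumption; it would be an unpacked mmdebstrap tarball.
    print(" ".join(rootfs_run_argv("/srv/chroots/unstable", ["dpkg", "--version"])))
```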

Faidon


From Andrey Rakhmatullin@21:1/5 to Simon McVittie on Tue Jun 25 15:30:01 2024

On Tue, Jun 25, 2024 at 02:02:11PM +0100, Simon McVittie wrote:

In this work, limitations with --chroot-mode=unshare became apparent and that led to Johannes, Jochen and me sitting down in Berlin pondering
ideas on how to improve the situation. That is a longer story, but eventually Timo Röhling asked the innocuous question of why we cannot
just use schroot and make it work with namespaces.

I have to ask:

Could we use a container framework that is also used outside the Debian bubble, rather than writing our own from first principles every time, and ending up with a single-maintainer project being load-bearing for Debian *again*? I had hoped that after sbuild's history with schroot becoming unmaintained, and then being revived by a maintainer-of-last-resort who
is one of the same few people who are critical-path for various other important things, we would recognise that as an anti-pattern that we
should avoid if we can.

100%

--
WBR, wRAR


From Antonio Terceiro@21:1/5 to Simon McVittie on Tue Jun 25 17:30:01 2024

On Tue, Jun 25, 2024 at 02:02:11PM +0100, Simon McVittie wrote:

On Tue, 25 Jun 2024 at 10:16:20 +0200, Helmut Grohne wrote:

In this work, limitations with --chroot-mode=unshare became apparent and that led to Johannes, Jochen and me sitting down in Berlin pondering
ideas on how to improve the situation. That is a longer story, but eventually Timo Röhling asked the innocuous question of why we cannot
just use schroot and make it work with namespaces.

I have to ask:

Could we use a container framework that is also used outside the Debian bubble, rather than writing our own from first principles every time, and ending up with a single-maintainer project being load-bearing for Debian *again*? I had hoped that after sbuild's history with schroot becoming unmaintained, and then being revived by a maintainer-of-last-resort who
is one of the same few people who are critical-path for various other important things, we would recognise that as an anti-pattern that we
should avoid if we can.

At the moment, rootless Podman would seem like the obvious choice. As far
as I'm aware, it has the same user namespaces requirements as the unshare backends in mmdebstrap, autopkgtest and schroot (user namespaces enabled, setuid newuidmap, 65536 uids in /etc/subuid, 65536 gids in /etc/subgid).

Podman uses the same OCI images as Docker, so it can either pull from a trusted OCI registry, or use images that were built by importing a tarball generated by e.g. mmdebstrap or sbuild-createchroot. I assume that for
Debian we would want to do the latter, at least initially, to avoid
being forced to either trust an external registry like hub.docker.com
or operate our own.

Yes, please.

FWIW, I want to switch ci.d.n from lxc to podman at some point as well.


From Russ Allbery@21:1/5 to Simon McVittie on Tue Jun 25 18:40:01 2024

Simon McVittie writes:

Persisting a container root filesystem between multiple operations comes
with some serious correctness issues if there are "hooks" that can modify
it destructively on each operation: see <https://bugs.debian.org/499014>
and <https://bugs.debian.org/994836>. As a result of that, I think the
only model that should be used in new systems is to have some concept of
a session (like schroot type=file, but unlike schroot type=directory)
so that those "hooks" only run once, on session creation, preventing
them from arbitrarily reverting/overwriting changes that are subsequently made by packages installed into the chroot/container (for example dbus' creation of the messagebus uid/gid in #499014, and exim4's creation of Debian-exim in #994836).

I'm not entirely sure that I'm following the nuances of this discussion,
so this may be irrelevant, but I think type=btrfs-snapshot provides the
ideal properties for container file systems. This unfortunately requires
file system support and therefore cannot be used unless you've already
embraced a file system with subvolumes, but if you have, you get all of
the speed of a persistent container root file system with none of the correctness issues, because you get a fresh (and almost instant) clone of
a canonical root file system that is discarded after each build.

I use that in combination with a cron job to update the source subvolume
daily to ensure that it's fully patched.

Unfortunately, there's no way that we can rely on this, but it would be
nice to continue to support it for those who are using a supported
underlying file system already.

--
Russ Allbery <https://www.eyrie.org/~eagle/>


From Guillem Jover@21:1/5 to Russ Allbery on Tue Jun 25 18:50:01 2024

Hi!

On Tue, 2024-06-25 at 09:32:21 -0700, Russ Allbery wrote:

Simon McVittie writes:

Persisting a container root filesystem between multiple operations comes with some serious correctness issues if there are "hooks" that can modify it destructively on each operation: see <https://bugs.debian.org/499014> and <https://bugs.debian.org/994836>. As a result of that, I think the
only model that should be used in new systems is to have some concept of
a session (like schroot type=file, but unlike schroot type=directory)
so that those "hooks" only run once, on session creation, preventing
them from arbitrarily reverting/overwriting changes that are subsequently made by packages installed into the chroot/container (for example dbus' creation of the messagebus uid/gid in #499014, and exim4's creation of Debian-exim in #994836).

I'm not entirely sure that I'm following the nuances of this discussion,
so this may be irrelevant, but I think type=btrfs-snapshot provides the
ideal properties for container file systems. This unfortunately requires
file system support and therefore cannot be used unless you've already embraced a file system with subvolumes, but if you have, you get all of
the speed of a persistent container root file system with none of the correctness issues, because you get a fresh (and almost instant) clone of
a canonical root file system that is discarded after each build.

I use that in combination with a cron job to update the source subvolume daily to ensure that it's fully patched.

Unfortunately, there's no way that we can rely on this, but it would be
nice to continue to support it for those who are using a supported
underlying file system already.

I manage my chroots with schroot (but not via sbuild, for dog fooding
purposes :), and use type=directory and union-type=overlay so that I
get a fast and persistent base, independent of the underlying filesystem,
with fresh instances per session. (You can access the base via the
source:<id> names.) I never liked the type=file stuff, as it's slow to
setup and maintain.

Regards,
Guillem


From Simon McVittie@21:1/5 to Russ Allbery on Tue Jun 25 19:00:01 2024

On Tue, 25 Jun 2024 at 09:32:21 -0700, Russ Allbery wrote:

Simon McVittie writes:

I think the
only model that should be used in new systems is to have some concept of
a session (like schroot type=file, but unlike schroot type=directory)

I'm not entirely sure that I'm following the nuances of this discussion,
so this may be irrelevant, but I think type=btrfs-snapshot provides the
ideal properties for container file systems.

That's another of the "good" schroot types which don't generally cause bugs like #499014 and #994836. As of Debian 12, I believe the situation is:

Good (session-based): file, btrfs-snapshot, zfs-snapshot, lvm-snapshot

Bad by default, can be good if combined with a non-trivial union-type: directory, loopback, block-device

Usually a mistake: plain
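The classification above can be written down as a small lookup. This is only a sketch of the list as stated here; the function name and return strings are made up for illustration, and "none" is assumed to be the value meaning "no union".

```python
# schroot types from the list above, as of Debian 12.
SESSION_BASED = {"file", "btrfs-snapshot", "zfs-snapshot", "lvm-snapshot"}
NEEDS_UNION = {"directory", "loopback", "block-device"}


def assess_chroot(chroot_type, union_type="none"):
    """Return 'good', 'bad' or 'usually a mistake' for a schroot configuration."""
    if chroot_type in SESSION_BASED:
        return "good"
    if chroot_type in NEEDS_UNION:
        # Only a non-trivial union-type gives each session a fresh view.
        return "good" if union_type != "none" else "bad"
    if chroot_type == "plain":
        return "usually a mistake"
    raise ValueError(f"unknown schroot type: {chroot_type}")
```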

I mentioned file because it's the only one of the "good" choices that can
work on any system, without a specific filesystem or storage management mechanism, but the others are fine too if you happen to have the right filesystem or storage management. If you have enough RAM, the file
backend unpacked into a tmpfs also completely avoids any possible
performance issue involving fsync(), whether in dpkg or elsewhere :-)

I would also suggest not using the "source chroot" associated with one
of the good (session-based) options, and instead re-bootstrapping the
chroot from first principles whenever that's desired.

smcv

From Helmut Grohne@21:1/5 to Simon McVittie on Tue Jun 25 19:00:02 2024

Hi Simon,

On Tue, Jun 25, 2024 at 02:02:11PM +0100, Simon McVittie wrote:

Could we use a container framework that is also used outside the Debian bubble, rather than writing our own from first principles every time, and ending up with a single-maintainer project being load-bearing for Debian *again*? I had hoped that after sbuild's history with schroot becoming unmaintained, and then being revived by a maintainer-of-last-resort who
is one of the same few people who are critical-path for various other important things, we would recognise that as an anti-pattern that we
should avoid if we can.

This is a reasonable concern. I contend that while unschroot.py is very Debian-specific, the underlying plumbing layer is not. I would not have
started working on this if what I wanted to do was doable with existing
code, but maybe it was not that the code couldn't do it, but that I was not
using the existing code correctly.

Please allow me to point out that right now, sbuild contains a custom
container framework that is subject to eventually becoming a starving single-maintainer project and I am trying to extract and separate this
existing container framework from sbuild into more reusable components. Likewise, mmdebstrap contains another custom container framework that is similar but not equal to the one in sbuild.

At the moment, rootless Podman would seem like the obvious choice. As far
as I'm aware, it has the same user namespaces requirements as the unshare backends in mmdebstrap, autopkgtest and schroot (user namespaces enabled, setuid newuidmap, 65536 uids in /etc/subuid, 65536 gids in /etc/subgid).

I concur, the privilege requirements for rootless podman are exactly the
ones I am interested in. Indeed, podman was the thing investigated most thoroughly, but evidently not thoroughly enough.

Podman uses the same OCI images as Docker, so it can either pull from a trusted OCI registry, or use images that were built by importing a tarball generated by e.g. mmdebstrap or sbuild-createchroot. I assume that for
Debian we would want to do the latter, at least initially, to avoid
being forced to either trust an external registry like hub.docker.com
or operate our own.

At least for me, building container images locally is a requirement. I
have no interest in using a container registry. Faidon pointing at
--rootfs goes further in this direction.

podman is also supported as a backend by autopkgtest-virt-podman, Toolbx (podman-toolbox in Debian) and distrobox. autopkgtest's autopkgtest-build-podman does not yet support starting from a tarball
as described above, but it easily could (contributions welcome).

Thank you for pointing at these. I need to familiarize myself with them.

Or, if Podman is too "not invented here" for Debian's use, using rootless lxd/Incus is another option - although that introduces a dependency
on projects and formats that are rarely used outside the Debian/Ubuntu bubble, which risks them becoming another schroot (and also requires us to decide whether we follow Canonical's lxd or the community fork Incus post-fork, which could get somewhat political).

lxd/incus also was on my list, but my understanding is that they do not
work without their system services at all and being able to operate
containers (i.e. being incus-admin or the like) roughly becomes
equivalent to being full root on the system defeating the purpose of the exercise. If anything is "not invented here", that'd be unschroot rather
than podman.

There are two approaches to
managing an ephemeral build container using namespaces. In one approach,
we create a directory hierarchy of a container root filesystem and for
each command and hook that we invoke there, we create new namespaces on demand. In particular, there are no background processes when nothing is running in that container and all that remains is its directory
hierarchy. Such a container session can easily survive a reboot (unless stored on tmpfs). Both sbuild --chroot-mode=unshare and unschroot.py
follow this approach. For comparison, schroot sets up mounts (e.g. /proc) when it begins a session and cleans them up when it ends. No such persistent mounts exist in either sbuild --chroot-mode=unshare or unschroot.py.

Persisting a container root filesystem between multiple operations comes
with some serious correctness issues if there are "hooks" that can modify
it destructively on each operation: see <https://bugs.debian.org/499014>
and <https://bugs.debian.org/994836>. As a result of that, I think the
only model that should be used in new systems is to have some concept of
a session (like schroot type=file, but unlike schroot type=directory)
so that those "hooks" only run once, on session creation, preventing
them from arbitrarily reverting/overwriting changes that are subsequently made by packages installed into the chroot/container (for example dbus' creation of the messagebus uid/gid in #499014, and exim4's creation of Debian-exim in #994836).

I guess you understood my explanation differently than it was meant.
While the container is persisted into the filesystem, this is being done
for each package build individually. sbuild --chroot-mode=unshare and
unschroot use a tarball as their source and opening the session amounts
to extracting it. At the end of the session, the tree is disposed. The
session concept of schroot is being reused in unschroot and it very much behaves like a type=file chroot except that you can begin a session,
reboot and continue using it until you end it without requiring a system service to recover your sessions during boot.

The main difference to how everyone else does this is that in a typical
sbuild interaction it will create a new user namespace for every single
command run as part of the session. sbuild issues tens of commands
before launching dpkg-buildpackage and each of them creates new
namespaces in the Linux kernel (all of them using the same uid mappings, performing the same bind mounts and so on). The most common way to think
of containers is different: You create those namespaces once and reuse
the same namespace kernel objects for multiple commands part of the same session (e.g. installation of build dependencies and dpkg-buildpackage).
You describe this other approach in more detail:

I don't know whether creating new namespaces multiple times (but without running external integration hooks the second and subsequent times)
will also lead to practical problems, but I note that outside the Debian bubble, everything that enters a new container environment seems to
operate by creating a process that encapsulates the container, and then either letting it run to completion interactively or non-interactively (`docker run`, etc.), or letting it run in the background (perhaps with
an init system or `sleep infinity` as its "payload" process) and then repeatedly injecting code into that pre-existing namespace
(either `docker exec`, etc., or something like ssh).

Exactly, this is how everyone but sbuild --chroot-mode=unshare and
unschroot do it.

autopkgtest's Docker, Podman, lxc, lxd backends all operate by creating
a namespaced init or sleep process with `docker run` or equivalent, and
then injecting subsequent commands into the namespace that was created
for that long-running process with `docker exec` or equivalent.

Please allow me to do a tangential excursion here. There are two ways of interacting with containers that use one set of namespaces for their
entire existence. One is setting up some IPC mechanism and receiving
commands to be run inside (for instance spawning a shell and piping
commands into it or driving the container via ssh) or an external
process joins (setns) the existing container (namespaces) and injects
code into it (docker exec). That latter approach has a history of vulnerabilities closely related to vulnerabilities in setuid binaries,
because we are transitioning a process (and all of its context) from
outside the container into it and thus expose all of its context (memory
maps, open file descriptors and so on) to contained processes. As such,
I think that an approach based on an IPC mechanism should be preferred.
I am not sure whether podman exec operates in this way, but a quick
codesearch did not exhibit obvious uses of setns inside the podman
source code. Would anyone be able to tell how podman exec is
implemented here?
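To make the IPC variant concrete, here is a minimal sketch of "spawning a shell and piping commands into it". It is an illustration only: the class name and the marker protocol are invented for this example, and in a real setup the argv would be whatever enters the container once (here it is just a plain sh on the host).

```python
import subprocess


class PipedShell:
    """Long-lived shell that receives commands over a pipe (the IPC approach)."""

    def __init__(self, argv=("sh",)):
        self.proc = subprocess.Popen(
            list(argv), stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def run(self, command, marker="__done__"):
        """Run one command in the existing shell; return (exit status, stdout)."""
        # The marker line tells us where this command's output ends.
        self.proc.stdin.write(f"{command}\nprintf '%s %s\\n' {marker} $?\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if not line:
                raise RuntimeError("shell exited before the command finished")
            if line.startswith(marker + " "):
                return int(line.split()[1]), "".join(lines)
            lines.append(line)

    def close(self):
        """Close the pipe so the shell sees EOF, then wait for it to exit."""
        self.proc.stdin.close()
        return self.proc.wait()
```

No outside process ever joins the shell's context; only command text crosses the boundary. The sketch assumes each command's output ends with a newline and ignores stderr.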

I think unshare is the outlier here, and I think it would be good to
consider whether it really needs to be.

Absolutely! Did you observe that I suggested moving unschroot to that
other model where the namespace objects are reused for the entire
session? Indeed, moving sbuild --chroot-mode=unshare in this direction
was one of the primary motivations for starting this work, but doing
this inside sbuild is very difficult due to its architecture, so my
approach was first separating the container framework from sbuild and
that's how I arrived at unschroot.

The more like other container managers a new container manager is, the
less likely it is to break reasonable expectations in future, like
schroot regularly does.

Yes! I very much used the systemd container interface documentation to
avoid exactly this problem.

While podman
and docker allow running unprivileged application containers, they still require privileged containers when you want to run systemd-as-pid-1.

What do you mean by "privileged containers" exactly? Do you mean a system service that runs with CAP_SYS_ADMIN and other scary privileges in the
init namespace, like the typical use of dockerd, or are you also counting uses of the setuid newuidmap as being privileged?

I'm sorry for being imprecise here. Privileged is an overloaded term in
the container context. I was trying to use it with the "not rootless"
meaning here. The interest is in running containers with user privileges available on common installations (i.e. unprivileged user namespaces,
newuidmap being setuid, subuid allocation and systemd being your cgroup
manager and handing out delegated cgroups).

If you are happy to use the setuid newuidmap (which I believe the unshare backends for schroot, mmdebstrap, autopkgtest also rely on) then my understanding is that "rootless" podman is essentially equivalent:
you need a setuid newuidmap, a range of 65536 uids in /etc/subuid,
a range of 65536 gids in /etc/subgid, and a kernel that will allow unprivileged users to create new user namespaces, but beyond that there
are no special privileges required.
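The subuid/subgid part of these requirements is easy to check mechanically. The sketch below parses text in the /etc/subuid format (name:start:count per line) and tests whether a user has at least 65536 subordinate ids; the function names and the user name in the usage note are placeholders.

```python
def subid_count(text, user):
    """Sum the subordinate id counts allocated to `user` in subuid/subgid text."""
    total = 0
    for line in text.splitlines():
        fields = line.strip().split(":")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, _start, count = fields
        if name == user:
            total += int(count)
    return total


def has_enough_subids(text, user, needed=65536):
    """True if `user` has at least `needed` subordinate ids."""
    return subid_count(text, user) >= needed
```

For example, has_enough_subids(open("/etc/subuid").read(), "builder") checks the uid side; the same function works on /etc/subgid.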

Cool. I think you really need one more non-trivial (but very commonly available) privilege. You need a cgroup manager (such as systemd) that
allows creating and delegating a cgroup hierarchy to you. You may call
this a non-special privilege.

Please see /usr/share/doc/podman/README.Debian for details of what it needs.

It could use updating as swapaccount=1 is the default.

For systemd-as-pid-1 specifically,
`autopkgtest-build-podman --init=systemd` and
`autopkgtest-virt-podman --init` demonstrate how this can be done, and
last time I tried, it was possible to run them unprivileged (other than needing access to the setuid newuidmap, as above). systemd is able to
detect that it's running in a container and turn off functionality like
udev that would only be appropriate in a VM or on bare metal, and podman knows how to tell systemd that it should do this.

This is very cool. Running autopkgtests in system containers without
being root (or incus-admin) very much is what I'd like to do. And it's
much better if I don't have to write my own container framework for
doing it. I couldn't get it to work locally yet (facing non-obvious
error messages).

Would someone be able to document (mail/wiki/blog/...) how to set up and
use podman for running autopkgtests? Thus far, I failed to figure out
how to plug a local Debian mirror (as opposed to a container registry)
into autopkgtest-build-podman. It is quite difficult to locate podman documentation that is applicable under the assumption that you don't
want to use any container registry.

So thank you very much for pointing me hard at podman again. My podman
research dates back quite a bit and I can already tell that podman is
quite a bit different now.

Let me circle back to the question of whether podman solves the needs of sbuild. We learned that sbuild --chroot-mode=unshare and unschroot spawn
a new set of namespaces for every command. What you point out as a
limitation also is a feature. Technically, it is a lie that the
namespaces are always constructed in the same way. During installation
of build depends the network namespace is not unshared while package
builds commonly use an unshared network namespace with no interfaces but
the loopback interface. In a similar vein, constructing a pid namespace
for every command ensures reliable process cleanup: Once your build has
exited, all background processes are reliably disposed. These aspects
are very useful to how we use containers in sbuild, but the way most
container runtimes work with a single set of namespaces makes this
non-trivial. We really want to change the set of namespaces throughout
the session.
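The per-command choice of namespaces can be illustrated by building unshare(1) command lines. This is a deliberately simplified sketch and not how sbuild or unschroot.py are implemented: it uses --map-root-user, which maps only a single uid, whereas the real tools set up a full uid range via newuidmap. The function name and the paths in the usage note are made up.

```python
def unshare_argv(root, command, network):
    """Build an unshare(1) invocation that creates fresh namespaces for one command."""
    argv = [
        "unshare",
        "--user", "--map-root-user",  # simplification: only one uid is mapped
        "--mount",
        "--pid", "--fork",            # new pid namespace: leftover processes die with it
    ]
    if not network:
        argv.append("--net")          # fresh network namespace, loopback only
    argv += [f"--root={root}", "--"]
    return argv + list(command)
```

Installing build dependencies would use unshare_argv("/tmp/session", ["apt-get", "build-dep", "."], network=True), while the build itself would pass network=False, so the two commands of one session run in differently shaped namespaces.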

So I think the needs of sbuild (and piuparts) about container frameworks
are quite specific and not easily met by existing tools. Ultimately,
this is what led me to write a reusable Python module providing
container plumbing and a relatively thin implementation of schroot using namespaces on top of it.

If we can get the requested features from podman, choosing it is the
better choice to me for the maintainability reasons that you started
with. It is not clear though whether podman can be made to address our requirements.

Thank you for having taken one step back and questioning my context
instead of going into my actual questions.

Helmut

From Marco d'Itri@21:1/5 to Guillem Jover on Tue Jun 25 19:10:01 2024

On Jun 25, Guillem Jover <[email protected]> wrote:

I manage my chroots with schroot (but not via sbuild, for dog fooding purposes :), and use type=directory and union-type=overlay so that I
get a fast and persistent base, independent of the underlying filesystem, with fresh instances per session. (You can access the base via the source:<id> names.) I never liked the type=file stuff, as it's slow to
setup and maintain.

Same. So I implemented overlayfs support in pbuilder:

https://salsa.debian.org/pbuilder-team/pbuilder/-/merge_requests/28

If a tmpfs is mounted on /var/cache/pbuilder/build/ then all the actual
action will happen in RAM.

--
ciao,
Marco

From Andrey Rakhmatullin@21:1/5 to Russ Allbery on Tue Jun 25 19:40:01 2024

On Tue, Jun 25, 2024 at 10:24:12AM -0700, Russ Allbery wrote:

Guillem Jover <[email protected]> writes:

I manage my chroots with schroot (but not via sbuild, for dog fooding purposes :), and use type=directory and union-type=overlay so that I get
a fast and persistent base, independent of the underlying filesystem,
with fresh instances per session. (You can access the base via the source:<id> names.) I never liked the type=file stuff, as it's slow to setup and maintain.

Ah, thank you, I didn't realize that existed. That sounds like a nice generalization of the file system snapshot approach.

(Unless I'm missing something it's the default setup for e.g. sbuild-createchroot(8))

--
WBR, wRAR

From Russ Allbery@21:1/5 to Guillem Jover on Tue Jun 25 19:30:01 2024

Guillem Jover <[email protected]> writes:

I manage my chroots with schroot (but not via sbuild, for dog fooding purposes :), and use type=directory and union-type=overlay so that I get
a fast and persistent base, independent of the underlying filesystem,
with fresh instances per session. (You can access the base via the source:<id> names.) I never liked the type=file stuff, as it's slow to
setup and maintain.

Ah, thank you, I didn't realize that existed. That sounds like a nice generalization of the file system snapshot approach.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

From PICCA Frederic-Emmanuel@21:1/5 to All on Tue Jun 25 19:50:01 2024

Ah, thank you, I didn't realize that existed. That sounds like a nice generalization of the file system snapshot approach.

I think that this is how the sbuild-debian-developer-setup script sets up
chroots.

Fred

From Russ Allbery@21:1/5 to All on Tue Jun 25 19:50:01 2024

PICCA Frederic-Emmanuel <[email protected]>
writes:

Ah, thank you, I didn't realize that existed. That sounds like a nice
generalization of the file system snapshot approach.

I think that this is how the sbuild-debian-developer-setup script sets up
chroots.

Yeah, I think all that my contribution to this thread accomplished was to demonstrate that I set up sbuild years ago based on a wiki article for
btrfs and don't know what I'm talking about. :) Apologies for that.

--
Russ Allbery ([email protected]) <https://www.eyrie.org/~eagle/>

From Gioele Barabucci@21:1/5 to Helmut Grohne on Tue Jun 25 20:20:01 2024

On 25/06/24 18:55, Helmut Grohne wrote:

For systemd-as-pid-1 specifically,
`autopkgtest-build-podman --init=systemd` and
`autopkgtest-virt-podman --init` demonstrate how this can be done, and
last time I tried, it was possible to run them unprivileged (other than
needing access to the setuid newuidmap, as above). systemd is able to
detect that it's running in a container and turn off functionality like
udev that would only be appropriate in a VM or on bare metal, and podman
knows how to tell systemd that it should do this.

This is very cool. Running autopkgtests in system containers without
being root (or incus-admin) very much is what I'd like to do. And it's
much better if I don't have to write my own container framework for
doing it. I couldn't get it to work locally yet (facing non-obvious
error messages).

Would someone be able to document (mail/wiki/blog/...) how to set up and
use podman for running autopkgtests?

I'd like to take this chance to suggest, instead of writing more
documentation, changing the autopkgtest packaging so that it is split
into various per-backend packages, each of which provides a ready-to-go pre-configured environment. See <https://bugs.debian.org/1039958#22>.

Currently, in order to get a working autopkgtest + podman setup, one has to:

1) install autopkgtest
2) install podman
3) install a non-clearly-defined set of additional packages (including, surprisingly, dbus-user-session)
4) change various configuration files
5) learn how to use autopkgtest-build-podman
5a) BONUS: realize that, instead, you'd like to use mmdebstrap to create
the base images, but mmdebstrap-autopkgtest-build-podman does not exist.
6) learn how to properly invoke autopkgtest $dir -- podman

It would be great if the user experience on a freshly installed system
were instead more like:

$ apt install autopkgtest-podman
$ autopkgtest $dir
$ # done

I believe achieving this right now is just a matter of better packaging.
(Plus some improvements to deal with the few packages whose tests have
non-ordinary and taxing requirements.)

Regards,

--
Gioele Barabucci

From Paul Gevers@21:1/5 to All on Tue Jun 25 22:00:01 2024

Hi

On 25-06-2024 6:55 p.m., Helmut Grohne wrote:

This is very cool. Running autopkgtests in system containers without
being root (or incus-admin) very much is what I'd like to do. And it's
much better if I don't have to write my own container framework for
doing it. I couldn't get it to work locally yet (facing non-obvious
error messages).

Maybe bug #1059725?

Paul

From Holger Levsen@21:1/5 to Simon McVittie on Wed Jun 26 11:20:01 2024

hi,

On Tue, Jun 25, 2024 at 02:02:11PM +0100, Simon McVittie wrote:

I have to ask:

Could we use a container framework that is also used outside the Debian bubble, rather than writing our own from first principles every time, and ending up with a single-maintainer project being load-bearing for Debian *again*? [...]

+1

Podman uses the same OCI images as Docker, so it can either pull from a trusted OCI registry, or use images that were built by importing a tarball generated by e.g. mmdebstrap or sbuild-createchroot. I assume that for
Debian we would want to do the latter, at least initially, to avoid
being forced to either trust an external registry like hub.docker.com
or operate our own.

I'd just like to mention the lesser-known fact that https://docker.debian.net/ provides reproducible images for nine Debian architectures today...

--
cheers,
Holger

⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ holger@(debian|reproducible-builds|layer-acht).org
⢿⡄⠘⠷⠚⠋⠀ OpenPGP: B8BF54137B09D35CF026FE9D 091AB856069AAA1C
⠈⠳⣄

🔥 - this is fine.

From Simon McVittie@21:1/5 to Guillem Jover on Wed Jun 26 18:10:02 2024

On Tue, 25 Jun 2024 at 18:47:49 +0200, Guillem Jover wrote:

I manage my chroots with schroot (but not via sbuild, for dog fooding purposes :), and use type=directory and union-type=overlay so that I
get a fast and persistent base, independent of the underlying filesystem, with fresh instances per session.

type=directory *with a union-type* is OK, and avoids the persistence
issues I mentioned: it has many of the same properties as type=file
(but different performance characteristics).

type=directory *without* a union-type can trigger bugs like the ones
I mentioned.

You can access the base via the source:<id> names

This is the same as with type=file. If you do this, be careful to avoid installing software that creates/relies on new uids existing inside
the chroot, such as dbus or exim4, if a corresponding username does not
already exist outside the chroot. That's what causes bugs like the ones
I mentioned.

I would recommend usually re-bootstrapping the base instead of modifying
it in-place, to avoid having differences between a freshly-bootstrapped
base and the current state of your base chroot building up over time
(for example packages that are removed from the transitively Essential set remaining installed in your base chroot indefinitely, or non-dpkg-managed configuration files being different for new installations and upgraded
older installations), which can result in a harder-to-reproduce build environment.

smcv

From Stefano Rivera@21:1/5 to All on Wed Jun 26 18:40:01 2024

Hi Helmut (2024.06.25_16:55:45_+0000)

lxd/incus also was on my list,

Personally, I have been using LXD (and now Incus, as it made it into
Debian, yay) for my experimentation and local package builds, for a
number of years now. They have native support for btrfs snapshots,
locally built images, and make it relatively simple to block network
access for my builds. The autopkgtest-virt backend is a bit klunky, but I
don't miss schroot at all.

but my understanding is that they do not work without their system
services at all

Correct. LXC containers are essentially VMs without their own kernel.
They run their own systemd. This does mean that I build packages in a
fatter system than necessary. But that has yet to be an issue for me.

and being able to operate containers (i.e. being incus-admin or the
like) roughly becomes equivalent to being full root on the system
defeating the purpose of the exercise.

You don't have to be incus-admin to use Incus. Users get their own incus project (see the incus-user.service). But I've never played with this
much; on a single-user system, incus-admin is just much simpler (if less secure).

Of course incus still has to be root itself to add network interfaces to bridges. It's nice to be able to control networking for the containers,
but it would be even nicer for sbuild to not need setup that requires
root.

Stefano

--
Stefano Rivera
http://tumbleweed.org.za/
+1 415 683 3272

From Simon McVittie@21:1/5 to Helmut Grohne on Wed Jun 26 19:20:01 2024

On Tue, 25 Jun 2024 at 18:55:45 +0200, Helmut Grohne wrote:

At least for me, building container images locally is a requirement. I
have no interest in using a container registry.

I expected you'd say that. podman --rootfs is one way to use it without
a registry; a trivially short Dockerfile like the one I mentioned,
to convert a tarball into a container image locally, is another.
(Debian's pseudo-official Docker images on Dockerhub use the latter.)
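Both registry-free routes mentioned above can be sketched as follows. The Dockerfile text is a guess at what "trivially short" means here (the one actually referred to is not quoted in this part of the thread), and the tarball name, tag and paths in the usage note are placeholders. The helper functions only build strings and argument lists; they do not run podman.

```python
def tarball_dockerfile(tarball_name):
    """Dockerfile text that turns a root filesystem tarball into an image."""
    # ADD unpacks a local tar archive into the target directory.
    return f'FROM scratch\nADD {tarball_name} /\nCMD ["/bin/bash"]\n'


def podman_build_argv(context_dir, tag):
    """Command line to build the image from a directory holding that Dockerfile."""
    return ["podman", "build", "--tag", tag, context_dir]


def podman_rootfs_argv(rootfs_dir, command):
    """Command line to run a command directly in an unpacked root filesystem."""
    return ["podman", "run", "--rm", "--rootfs", rootfs_dir, *command]
```

For example, tarball_dockerfile("sid.tar.xz") gives the file contents to place next to the tarball, and podman_rootfs_argv("/srv/rootfs/sid", ["true"]) gives the --rootfs variant that skips image creation altogether.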

But I think it would be great if some part of Debian - perhaps the
cloud team? - could periodically publish genuinely official minbase
sysroot tarballs and/or OCI images from Debian infrastructure, like the
cloud team already does for VM images, which would avoid relying on a third-party registry while also avoiding requiring every developer to
spend thought and CPU time on building their own before they can start
on their actual development.

lxd/incus also was on my list, but my understanding is that they do not
work without their system services at all and being able to operate containers (i.e. being incus-admin or the like) roughly becomes
equivalent to being full root on the system defeating the purpose of the exercise.

Perhaps, I haven't looked into lxd/incus in detail (podman seems to have
the properties I wanted so I stopped there). I might have been misled by
the fact that lxd can run rootless containers - but maybe it can only
do that by making IPC requests to a privileged service, a bit like the
way snapd operates.

I guess you understood my explanation differently than it was meant.
While the container is persisted into the filesystem, this is being done
for each package build individually. sbuild --chroot-mode=unshare and unschroot use a tarball as their source and opening the session amounts
to extracting it. At the end of the session, the tree is disposed. The session concept of schroot is being reused in unschroot and it very much behaves like a type=file chroot except that you can begin a session,
reboot and continue using it until you end it without requiring a system service to recover your sessions during boot.

OK, good: this is "the same shape" as schroot type=file, which is not
one of the modes that has the problems I described. If you're carrying
over the underlying on-disk directory across reboots, you'll have to
be a little careful about persisting state into that directory (only
things that will still be true after a reboot can safely be stored),
but I'm sure you're doing that.

The main difference to how everyone else does this is that in a typical sbuild interaction it will create a new user namespace for every single command run as part of the session. sbuild issues tens of commands
before launching dpkg-buildpackage and each of them creates new
namespaces in the Linux kernel (all of them using the same uid mappings, performing the same bind mounts and so on). The most common way to think
of containers is different: You create those namespaces once and reuse
the same namespace kernel objects for multiple commands part of the same session (e.g. installation of build dependencies and dpkg-buildpackage).

Yes. My concern here is that there might be non-obvious reasons why
everyone else is doing this the other way, which could lead to behavioural differences between unschroot and all the others that will come back to
bite us later.

There are two ways of
interacting with containers that use one set of namespaces for their
entire existence. One is setting up some IPC mechanism and receiving
commands to be run inside (for instance spawning a shell and piping
commands into it or driving the container via ssh) or an external
process joins (setns) the existing container (namespaces) and injects
code into it (docker exec). That latter approach has a history of vulnerabilities closely related to vulnerabilities in setuid binaries, because we are transitioning a process (and all of its context) from
outside the container into it and thus expose all of its context (memory maps, open file descriptors and so on) to contained processes. As such,
I think that an approach based on an IPC mechanism should be preferred.

An IPC-based approach is certainly going to provide better security
hardening (especially if setuid helpers are used), and potentially better functionality as well.
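
As a rough illustration of the "spawn a shell and pipe commands into it"
variant, here is a minimal Python sketch. It is not how unschroot or any
existing runtime does it. The class name ShellSession and the marker
protocol are made up for this example, and a plain local `sh` stands in
for a shell that would really be started inside the container's namespaces.

```python
import subprocess
import uuid


class ShellSession:
    """Drive one long-lived shell by piping commands into its stdin.

    Assumption: in a real setup argv would start the shell inside
    already-created namespaces; here a local `sh` stands in for it.
    """

    def __init__(self, argv=("sh",)):
        self.proc = subprocess.Popen(
            list(argv),
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def run(self, command):
        """Run one command line; return (exit status, captured stdout)."""
        marker = uuid.uuid4().hex
        # After the command, print a newline, a unique marker and the exit
        # status, so the reader knows where the command's output ends.
        self.proc.stdin.write(
            f"{command}\nprintf '\\n%s %s\\n' {marker} \"$?\"\n"
        )
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if line.startswith(marker + " "):
                output = "".join(lines)
                # Drop the single newline that printf added before the marker.
                return int(line.split()[1]), output[:-1]
            lines.append(line)
        raise RuntimeError("shell exited before finishing the command")

    def close(self):
        """Close stdin so the shell exits, and return its exit status."""
        self.proc.stdin.close()
        return self.proc.wait()
```

Nothing from the caller's process (memory maps, open file descriptors)
crosses into the shell except the command text. State such as the
working directory persists between calls because the same shell process
handles all of them.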

In Flatpak (which uses namespaces too, but is not really the same sort
of container), the debugging command `flatpak enter` currently uses the
setns approach (which comes with various limitations), and one of the
items on my infinite to-do list is to make that be IPC-based instead,
possibly by reusing code written for steam-runtime-tools during $dayjob.

For whole-system containers running an OS image from init upwards,
or for virtual machines, using ssh as the IPC mechanism seems
pragmatic. Recent versions of systemd can even be given a ssh public
key via the systemd.system-credentials(7) mechanism (e.g. on the kernel
command line) to set it up to be accepted for root logins, which avoids
needing to do this setup in cloud-init, autopkgtest's setup-testbed,
or similar.

For "application" containers like the ones you would presumably want
to be using for sbuild, presumably something non-ssh is desirable.

I am not sure whether podman exec operates in this way, but a quick codesearch did not exhibit obvious uses of setns inside the podman
source code. Would anyone be able to tell how podman exec is
implemented here?

I don't know the answer to this.

I think you really need one more non-trivial (but very commonly
available) privilege. You need a cgroup manager (such as systemd) that
allows creating and delegating a cgroup hierarchy to you.

Quite possibly, yes. I don't think I ever tried running
autopkgtest-virt-podman --init on a system that didn't have
systemd-as-pid-1 and a working `systemd --user`.

Would someone be able to document (mail/wiki/blog/...) how to set up and
use podman for running autopkgtests. Thus far, I failed to figure out
how to plug a local Debian mirror (as opposed to a container registry)
into autopkgtest-build-podman. It is quite difficult to locate podman documentation that is applicable under the assumption that you don't
want to use any container registry.

If you build an image by importing a tarball that you have built in
whatever way you prefer, minimally something like this:

$ cat > Dockerfile <<EOF
FROM scratch
ADD minbase.tar.gz /
EOF
$ podman build -f Dockerfile -t local-debian:sid .

then you should be able to use localhost/local-debian:sid
as a substitute for debian:sid in the examples given in autopkgtest-virt-podman(1), either using it as-is for testing:

$ autopkgtest -U hello*.dsc -- podman localhost/local-debian:sid

or making an image that has been pre-prepared with some essentials like dpkg-source, and testing in that:

$ autopkgtest-build-podman --image localhost/local-debian:sid
...
Successfully tagged localhost/autopkgtest/localhost/local-debian:sid
$ autopkgtest hello*.dsc -- podman autopkgtest/localhost/local-debian:sid
(tests run)

Adding a mode for "start from this pre-prepared minbase tarball" to all
of the autopkgtest-build-* tools (so that they don't all need to know
how to run debootstrap/mmdebstrap from first principles, and then duplicate
the necessary options to make it do the right thing), has been on my
to-do list for literally years. Maybe one day I will get there.

We could certainly also benefit from some syntactic sugar to make the
automatic choice of an image name for localhost/* podman images nicer,
with fewer repetitions of localhost/.

By default (as per /etc/containers/registries.conf.d/shortnames.conf),
podman considers debian:sid to be short for docker.io/library/debian,
which is the closest thing we have to "official" Debian OCI images. If we
had our own self-hosted container registry with suitable scalability and security, like Red Hat and SUSE do, that file could point there instead.
Salsa does in fact provide us with a self-hosted container registry,
but probably not one that is sufficiently scalable?

podman is unlikely to provide you with a way to generate a minbase
tarball without first creating or downloading some sort of container
image in which you can run debootstrap or mmdebstrap, because you have
to be able to start from somewhere. But you can run mmdebstrap unprivileged
in unshare mode, so that's enough to get you that starting point.

We learned that sbuild --chroot-mode=unshare and unschroot spawn
a new set of namespaces for every command. What you point out as a
limitation also is a feature. Technically, it is a lie that the
namespaces are always constructed in the same way. During installation
of build depends the network namespace is not unshared while package
builds commonly use an unshared network namespace with no interfaces but
the loopback interface.

I don't think podman can do this within a single run. It might be feasible
to do the setup (installing build-dependencies) with networking enabled;
leave the root filesystem of that container intact; and reuse it as the
root filesystem of the container in which the actual build runs, this time
with --network=none?

Or the "install build-dependencies" step (and other setup) could perhaps
even be represented as a `podman build` (with a Dockerfile/Containerfile,
FROM the image you had as your starting point), outputting a temporary container image, in which the actual dpkg-buildpackage step can be invoked
by `podman run --network=none --rmi`?

smcv

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bastian Venthur@21:1/5 to Simon McVittie on Thu Jun 27 11:00:02 2024

On 25.06.24 15:02, Simon McVittie wrote:

I have to ask:

Could we use a container framework that is also used outside the Debian bubble, rather than writing our own from first principles every time, and ending up with a single-maintainer project being load-bearing for Debian *again*? I had hoped that after sbuild's history with schroot becoming unmaintained, and then being revived by a maintainer-of-last-resort who
is one of the same few people who are critical-path for various other important things, we would recognise that as an anti-pattern that we
should avoid if we can.

Great proposal!

Here's the Dockerfile/Containerfile to turn a sysroot tarball into an
OCI image (obviously it can be extended with LABELs and other
customizations, but this is fairly close to minimal):

FROM scratch
ADD sysroot.tar.gz /
CMD ["/bin/bash"]

For some time now, I have had the idea to build my Debian packages in a
clean docker container instead of using cowbuilder etc., but due to lack
of time and the complexity of available solutions I never got really far. Do you
happen to have a minimal example that would work for most projects and
does not depend too much on opinionated Debian specific tooling?

Cheers!

Bastian

--
Dr. Bastian Venthur https://venthur.de
Debian Developer venthur at debian org


From Helmut Grohne@21:1/5 to Simon McVittie on Thu Jun 27 13:50:01 2024

Hi Simon,

Thanks for having taken the time to do another extensive writeup. Much appreciated.

On Wed, Jun 26, 2024 at 06:11:09PM +0100, Simon McVittie wrote:

On Tue, 25 Jun 2024 at 18:55:45 +0200, Helmut Grohne wrote:

The main difference to how everyone else does this is that in a typical sbuild interaction it will create a new user namespace for every single command run as part of the session. sbuild issues tens of commands
before launching dpkg-buildpackage and each of them creates new
namespaces in the Linux kernel (all of them using the same uid mappings, performing the same bind mounts and so on). The most common way to think
of containers is different: You create those namespaces once and reuse
the same namespace kernel objects for multiple commands part of the same session (e.g. installation of build dependencies and dpkg-buildpackage).

Yes. My concern here is that there might be non-obvious reasons why
everyone else is doing this the other way, which could lead to behavioural differences between unschroot and all the others that will come back to
bite us later.

I do not share this concern (though I do share other concerns of yours). The risk of behavioural differences is fairly low, because we do not expect any non-filesystem state to transition from one command to the next. Much to
the contrary, the use of a pid namespace for each command ensures
reliable process cleanup, so no background processes can accidentally
stick around.

I am concerned about behavioural differences due to the reimplementation
from first principles aspect though. Jochen and Aurelien will know more
here, but I think we had a fair number of ftbfs due to such differences.
None of them was due to the architecture of creating namespaces for
each command; most of them were due to not having gotten containers
right in general. Some were due to broken packages, such as ones
skipping tests when detecting schroot.

Also note that just because I do not share your concern here does not
imply that I'd be favouring sticking to that architecture. I expressed elsewhere that I see benefits in changing it for other reasons. At this
point I more and more see this as a non-boolean question. There is a
spectrum between "create namespaces once and use them for the entire
session" and "create new namespaces for each command" and more and more
I start to believe that what would be best for sbuild is somewhere in
between.

For whole-system containers running an OS image from init upwards,
or for virtual machines, using ssh as the IPC mechanism seems
pragmatic. Recent versions of systemd can even be given a ssh public
key via the systemd.system-credentials(7) mechanism (e.g. on the kernel command line) to set it up to be accepted for root logins, which avoids needing to do this setup in cloud-init, autopkgtest's setup-testbed,
or similar.

Another excursion: systemd goes beyond this and also provides the ssh
port via an AF_VSOCK (in case of VMs) or a unix domain socket on the
outside (in case of containers) to make safe discovery of the ssh access easier.

For "application" containers like the ones you would presumably want
to be using for sbuild, presumably something non-ssh is desirable.

I partially concur, but this goes into the larger story I hinted at in
my initial mail. If we move beyond containers and look into building
inside a VM (e.g. sbuild-qemu) we are in a difficult spot, because we
need e.g. systemd for booting, but we may not want it in our build
environment. So long term, I think sbuild will have to differentiate
between three contexts:
* The system it is being run on
* The containment or virtualisation environment used to perform the
build
* The system where the build is being performed inside the containment
or virtualisation environment

At present, sbuild does not distinguish the latter two and always treats
them equal. When building inside a VM, we may eventually want to create
a chroot inside the VM to arrive at a minimal environment. The same
technique is applicable to system containers. When doing this, we
minimize the build environment and do not mind the extra ssh dependency
in the container or virtualisation environment. For now though, this is
all wishful thinking. As long as this distinction does not exist, we
pretty much want minimal application containers for building as you
said.

If you build an image by importing a tarball that you have built in
whatever way you prefer, minimally something like this:

$ cat > Dockerfile <<EOF
FROM scratch
ADD minbase.tar.gz /
EOF
$ podman build -f Dockerfile -t local-debian:sid .

I don't quite understand the need for a Dockerfile here. I suspect that
this is the obvious way that works reliably, but my impression was that
using podman import would be easier. I had success with this:

mmdebstrap --format=tar --variant=apt unstable - | podman import --change CMD=/bin/bash - local-debian/sid
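
For scripting, the same pipeline can be sketched in Python. The command
lines are exactly the ones above; the function names import_commands and
run_import are made up for this sketch, and the error handling is
deliberately crude.

```python
import subprocess


def import_commands(suite="unstable", image="local-debian/sid"):
    """Return the producer and consumer argv lists for the pipeline."""
    producer = ["mmdebstrap", "--format=tar", "--variant=apt", suite, "-"]
    consumer = ["podman", "import", "--change", "CMD=/bin/bash", "-", image]
    return producer, consumer


def run_import(suite="unstable", image="local-debian/sid"):
    """Stream the tarball from mmdebstrap straight into podman import."""
    producer, consumer = import_commands(suite, image)
    with subprocess.Popen(producer, stdout=subprocess.PIPE) as tar:
        result = subprocess.run(consumer, stdin=tar.stdout)
        # Drop our copy of the read end so mmdebstrap sees a closed pipe
        # if podman exited early.
        tar.stdout.close()
    # Leaving the with block has waited for mmdebstrap to finish.
    if tar.returncode != 0 or result.returncode != 0:
        raise RuntimeError("image import failed")
```

The tarball is never written to disk; it only passes through the pipe
between the two processes.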

then you should be able to use localhost/local-debian:sid
as a substitute for debian:sid in the examples given in autopkgtest-virt-podman(1), either using it as-is for testing:

$ autopkgtest -U hello*.dsc -- podman localhost/local-debian:sid

This did not work for me. autopkgtest failed to create a user account. I suspect that this has one of two reasons. Either autopkgtest expects
python3 to be installed and it isn't or it expects passwd to be
installed and doesn't install it when missing (as passwd is
non-essential).

or making an image that has been pre-prepared with some essentials like dpkg-source, and testing in that:

$ autopkgtest-build-podman --image localhost/local-debian:sid
...
Successfully tagged localhost/autopkgtest/localhost/local-debian:sid

Works for me.

$ autopkgtest hello*.dsc -- podman autopkgtest/localhost/local-debian:sid
(tests run)

Thank you very much. I got this working for application container based testing, which provides a significant speedup compared to virt-qemu.

I am more interested in providing isolation-container though as a number
of tests require that and I currently tend to resort to virt-qemu for
that. Sure enough, adding --init=systemd to autopkgtest-build-podman
just works and a system container can also be used as an application
container by autopkgtest (so there is no need to build both), but
running the autopkgtest-virt-qemu --init also fails here in non-obvious
ways. It appears that user creation was successful, but the user
creation script is still printed in red.

We're now deep into debugging specific problems in the
autopkgtest/podman integration and this is probably getting off-topic
for d-devel. Is the evidence thus far sufficient for turning this part
of the discussion into a bug report against autopkgtest?

Adding a mode for "start from this pre-prepared minbase tarball" to all
of the autopkgtest-build-* tools (so that they don't all need to know
how to run debootstrap/mmdebstrap from first principles, and then duplicate the necessary options to make it do the right thing), has been on my
to-do list for literally years. Maybe one day I will get there.

From my point of view, this isn't actually necessary. I expect that many people would be fine drawing images from a container registry. Those
stubborn people like me will happily go the extra mile.

We could certainly also benefit from some syntactic sugar to make the automatic choice of an image name for localhost/* podman images nicer,
with fewer repetitions of localhost/.

Let me pose a possibly stupid suggestion. Much of the time when people
interact with autopkgtest, there is a very limited set of backends and
backend options people use frequently. Rather than making the options
shorter, how about introducing an aliasing mechanism? Say I could have
some ~/.config/autopkgtest.conf and whenever I run autopkgtest ... --
$BACKEND such that there is no autopkgtest-virt-$BACKEND, autopkgtest would consult that configuration file and, if a value is assigned there, expand it to the
assigned value. Then, I can just record my commonly used backends and
options there and refer to them by memorable names of my own liking.
Automatic choice of images makes things more magic, which bears negative aspects as well.
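
To make the aliasing idea concrete, here is a minimal Python sketch of
the lookup. Nothing like this exists in autopkgtest today: the
[aliases] section, the INI syntax and the function name expand_backend
are all assumptions made for illustration.

```python
import configparser
import shlex
import shutil


def expand_backend(args, config_text, which=shutil.which):
    """Expand an alias in the part of the command line that follows "--".

    args is the backend name followed by its options. config_text is the
    content of the (hypothetical) ~/.config/autopkgtest.conf.
    """
    args = list(args)
    if not args:
        return args
    name = args[0]
    # A real backend wins: aliases only apply when no such executable exists.
    if which("autopkgtest-virt-" + name) is not None:
        return args
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    if parser.has_option("aliases", name):
        return shlex.split(parser.get("aliases", name)) + args[1:]
    return args
```

With a configuration file containing

    [aliases]
    sid = podman localhost/autopkgtest/localhost/local-debian:sid

running autopkgtest ... -- sid would then behave like spelling out the
podman backend and image by hand.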

podman is unlikely to provide you with a way to generate a minbase
tarball without first creating or downloading some sort of container
image in which you can run debootstrap or mmdebstrap, because you have
to be able to start from somewhere. But you can run mmdebstrap unprivileged in unshare mode, so that's enough to get you that starting point.

I consider this part of the problem space fully solved.

Please allow for another podman question (and more people than Simon
know the answer). Every time I run a podman container (e.g. when I run autopkgtest) my ~/.local/share/containers grows. I think autopkgtest
manages to clean up in the end, but e.g. podman run -it ... seems to
leave stuff behind. Such a growing directory is problematic for multiple reasons, but I was also hoping that podman would be using fuse-overlayfs
+ tmpfs to run my containers instead of writing tons of stuff to my slow
disk. I hoped --image-volume=tmpfs could improve this, but it did not.
Of course, when I skip podman's image management and use --rootfs, I can
side step this problem by choosing my root location on a tmpfs, but
that's not how autopkgtest uses podman.

We learned that sbuild --chroot-mode=unshare and unschroot spawn
a new set of namespaces for every command. What you point out as a limitation also is a feature. Technically, it is a lie that the
namespaces are always constructed in the same way. During installation
of build depends the network namespace is not unshared while package
builds commonly use an unshared network namespace with no interfaces but the loopback interface.

I don't think podman can do this within a single run. It might be feasible
to do the setup (installing build-dependencies) with networking enabled; leave the root filesystem of that container intact; and reuse it as the
root filesystem of the container in which the actual build runs, this time with --network=none?

Do I understand correctly that in this variant, you intend to use podman without its image management capabilities and rather just use --rootfs
spawning two podman containers on the same --rootfs (one after another)
where the first one installs dependencies and the second one isolates
the network for building?
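
If that reading is right, the two invocations might look roughly like
the following Python sketch. It only builds the command lines. The
function name two_phase_commands is made up, and it assumes that podman
run --rootfs operates directly on the given directory so that the second
container sees what the first one installed.

```python
def two_phase_commands(rootfs, install_cmd, build_cmd):
    """Return two podman command lines that share one root directory.

    The first runs with networking to install dependencies; the second
    runs the build with --network=none on the same directory.
    """
    install = ["podman", "run", "--rm", "--rootfs", rootfs, *install_cmd]
    build = [
        "podman", "run", "--rm", "--network=none",
        "--rootfs", rootfs, *build_cmd,
    ]
    return install, build
```

Here --rm only removes podman's container record after each run; keeping
or discarding the directory itself is left to the caller.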

Or the "install build-dependencies" step (and other setup) could perhaps
even be represented as a `podman build` (with a Dockerfile/Containerfile, FROM the image you had as your starting point), outputting a temporary container image, in which the actual dpkg-buildpackage step can be invoked
by `podman run --network=none --rmi`?

In this case, we build a complete container image for the purpose of
building a package. This has interesting consequences. For one thing, we
often build the same package twice, so caching such an image for some
time is an obvious feature to look into.
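
One way such caching could work is to derive the image tag from the
dependency set, so that a second build of the same package finds the
image already present. This is a sketch under assumptions: the
localhost/builddeps repository name, the hashing scheme and both
function names are made up, and a real implementation would have to
account for package versions changing in the archive.

```python
import hashlib


def builddeps_tag(base_image, packages):
    """Derive a stable image tag from the base image and dependency set."""
    key = "\n".join([base_image, *sorted(set(packages))])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]
    return "localhost/builddeps:" + digest


def containerfile(base_image, packages):
    """Render a Containerfile that installs the packages onto base_image."""
    pkgs = " ".join(sorted(set(packages)))
    return (
        f"FROM {base_image}\n"
        "RUN apt-get update && "
        f"apt-get install --yes --no-install-recommends {pkgs}\n"
    )
```

The rendered text would be fed to podman build -t with the computed tag,
and podman image exists with the same tag can tell whether that step may
be skipped. For a cached image, --rmi has to be left off the later
podman run, since it would delete the image again.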

If you go that way, you may as well use mmdebstrap to construct
containers with precisely your relevant build-dependencies on demand
(for every build). The mmdebstrap ... | podman import ... rune would
roughly work for that.

Let me try to go one step back here. The podman model (and that of many
other runtimes) is that one session equates to one set of namespaces, but
network isolation requires another set of namespaces. Your two
approaches cleverly side-step this, by doing two containers on the same directory hierarchy or on-demand construction of containers (in one
namespace) and running them (in other namespaces).

These approaches come with limitations. The first approach requires
reinventing podman's image management and doing that by hand. In
particular, that prohibits us from using overlays as a means to avoid extraction or doing the extraction on-demand via e.g. squashfs. In an
ideal world, I think we do want one user and mount namespace for the
entire session and then do pid and network namespaces per-command
as-needed. The second approach requires writing the container to disk,
very much degrading build performance. If we want to enable these use
cases, then I fear podman is not the tool of choice as its featureset
does not match these (idealized) requirements. In other words, settling
on podman limits us in what features we can implement in sbuild, but
it may still allow more features than the status quo, so it still can be
an incremental improvement of the status quo. The question kinda becomes whether it is reasonable to skip that podman step and head over to an architecture that enables more of our use cases.
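
The "one user and mount namespace per session, pid and network
namespaces per command" layout can be sketched with util-linux unshare
command lines, built here in Python. This is an illustration only, not
what unschroot or sbuild does. The function names are made up, and uid
mappings beyond --map-root-user and all bind mounts are left out.

```python
import shlex


def session_argv(shell=("sh",)):
    """Command that opens the session: one user and one mount namespace."""
    return ["unshare", "--user", "--map-root-user", "--mount", "--", *shell]


def command_line(command, network=True):
    """Shell line to feed to the session shell for a single command.

    Each command gets a fresh pid namespace; the network namespace is
    only unshared when network is False.
    """
    argv = ["unshare", "--pid", "--fork"]
    if not network:
        argv.append("--net")
    return shlex.join([*argv, "--", *command])
```

Installing build dependencies would use command_line(...) with the
default network=True, while the dpkg-buildpackage step would pass
network=False. A real tool would also need to remount /proc for each new
pid namespace.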

And then the question becomes whether unschroot is that better
architecture or not and whether trading the risk of maintenance issues
that you correctly identified is worth the additional features that we
expect from it.

Helmut


From Simon McVittie@21:1/5 to Helmut Grohne on Thu Jun 27 16:00:01 2024

On Thu, 27 Jun 2024 at 11:46:51 +0200, Helmut Grohne wrote:

I am concerned about behavioural differences due to the reimplementation
from first principles aspect though. Jochen and Aurelien will know more
here, but I think we had a fair number of ftbfs due to such differences.
None of them was due to the architecture of creating namespaces for
each command; most of them were due to not having gotten containers
right in general. Some were due to broken packages, such as ones
skipping tests when detecting schroot.

Right - this is an instance of the more general problem pattern, "if we
don't test a thing regularly, we can't assume it works". We routinely
test sbuild+schroot (on the buildds), and individual developers often
try builds without any particular isolation (on development systems or expendable test systems), but until recently sbuild's unshare backend
was not something that would be routinely tested with most packages,
and similarly most packages are not routinely built with Podman or Docker
or whatever else.

In packages that, themselves, want to do things with containers during
their build or testing (for example bubblewrap and flatpak), there will typically be a code path for "no particular isolation" that actually
runs the tests (otherwise upstream would not find the tests useful), and
a code path for sbuild+schroot that skips the tests (otherwise they'd
fail on our historical buildds), but the detection that we are in a
locked-down environment where some tests need to be skipped might not
be 100% correct. I know I've had to adjust flatpak's test suite several
times to account for things like detecting whether FUSE works (because
on DSA'd machines it intentionally doesn't, as a security hardening step).

If we move beyond containers and look into building
inside a VM (e.g. sbuild-qemu) we are in a difficult spot, because we
need e.g. systemd for booting, but we may not want it in our build environment. So long term, I think sbuild will have to differentiate
between three contexts:
* The system it is being run on
* The containment or virtualisation environment used to perform the
build
* The system where the build is being performed inside the containment
or virtualisation environment

Somewhat prior art for this: https://salsa.debian.org/smcv/vectis uses
a VM (typically running Debian stable), installs sbuild + schroot into it,
and uses sbuild + schroot for the actual build, in an attempt to replicate
the setup of the production buildds on developer machines. In this case
sbuild is in the middle layer instead of the top layer, though.

Similarly, when asked to test packages under lxc (in an attempt to
replicate the setup of ci.debian.net), vectis installs lxc into a VM,
and runs autopkgtest on the VM rather than on the host system.

Of course, I'd prefer it if Debian's production infrastructure was
something that would be easier to replicate "closely enough" on my
development system (such that packages that pass tests on my development
system are very likely to pass tests on the production infra), without
damaging my development system if I use it to build a malicious,
compromised or accidentally-low-quality package that creates side-effects outside the build environment.

I don't quite understand the need for a Dockerfile here. I suspect that
this is the obvious way that works reliably, but my impression was that
using podman import would be easier.

Honestly, the need for a Dockerfile here is: I already knew how to build containers from a Dockerfile, and I didn't read the documentation for
the lower-level `podman import` because `podman build` can already do
what I needed.

I see this as the same design principle as why we encourage package
maintainers to use dh, even when building trivial "toy" packages like
hello, and in preference to implementing debian/rules at a lower level
in trivial cases. To build a non-trivial container with multiple layers,
you'll likely need a Dockerfile (or docker-compose, or some similar thing) *anyway*, so a typical user expectation will be to have a Dockerfile, and anyone building a container will likely already have learned the basics
of how to write one; and then we might as well follow the same procedure
in the trivial case, rather than having the trivial case be different and require different knowledge.

$ autopkgtest -U hello*.dsc -- podman localhost/local-debian:sid

This did not work for me. autopkgtest failed to create a user account.

Please report a bug against autopkgtest with steps to reproduce. It worked
for me, on Debian 12 with a local git checkout of autopkgtest, and it's probably something that ought to work - although it's always going to be non-optimal, because it will waste a bunch of time doing basic setup like installing dpkg-dev and configuring the apt proxy before every test. The
reason why we have autopkgtest-build-podman is to do that setup fewer
times, cache the result, and amortize its cost across multiple runs.

I am more interested in providing isolation-container though as a number
of tests require that and I currently tend to resort to virt-qemu for
that. Sure enough, adding --init=systemd to autopkgtest-build-podman
just works and a system container can also be used as an application container by autopkgtest (so there is no need to build both), but
running the autopkgtest-virt-qemu --init also fails here in non-obvious
ways. It appears that user creation was successful, but the user
creation script is still printed in red.

(I assume you mean a-v-podman --init rather than a-v-qemu --init.
a-v-qemu always needs an init system.)

Please report a (separate) bug against autopkgtest with steps to reproduce. Unfortunately I haven't been able to spend as much time on autopkgtest
in recent months as I would like to, and I haven't done much with podman
system containers (with init) since the -docker/-podman backend was
originally merged.

I remember that at one point, shortly before the -docker/-podman backend
was merged, I did have a-v-podman --init working successfully on a system
with systemd as pid 1 on the host, and each of the three init systems
known to a-b-podman in the container: systemd, sysvinit with sysv-rc, or sysvinit with openrc (only tested extremely briefly). At the time, I think
I was able to test src:dbus successfully with at least the first two.

When testing my own packages, I usually have to prioritize -lxc because
it's de facto RC (ci.debian.net uses it when not configured otherwise),
and -qemu because it's the only way some of my packages can have good
test coverage (notably bubblewrap and flatpak, which want to create new
user namespaces during testing in a way that a container manager like
podman will not usually allow).

Of course in an ideal world I should be re-running the test suite for
each package in each of the potentially interesting autopkgtest-virt-
backends, but that would only give me fractionally better test coverage,
in exchange for making it take even longer to release a package. I am
sorry for not having been optimally thorough, but one bug that affects
many of my package uploads, which (unusually!) cannot be solved by adding
extra QA steps, is "this update took an unacceptably long time to reach
the archive".

If ci.debian.net moves away from -lxc, resulting in "tests pass under
lxc" no longer being a de facto requirement for inclusion in testing,
then I would prefer to be using -podman for all of the simpler tests
(for example flatpak's debian/tests/build, which just exercises the -dev package), because it has a much, much shorter lead time for per-test
setup than -qemu, while also having a useful level of isolation and
being straightforward to replicate on a developer system for interactive debugging.

Less-isolated backends like -schroot seem like a bad place to invest
time and effort because they have more intrusive system and privilege requirements, while not actually being significantly faster or more
capable.

Let me pose a possibly stupid suggestion. Much of the time when people interact with autopkgtest, there is a very limited set of backends and backend options people use frequently. Rather than making the options shorter, how about introducing an aliasing mechanism? Say I could have
some ~/.config/autopkgtest.conf and whenever I run autopkgtest ... -- $BACKEND such that there is no autopkgtest-virt-$BACKEND, autopkgtest would consult that configuration file and, if a value is assigned there, expand it to the
assigned value. Then, I can just record my commonly used backends and
options there and refer to them by memorable names of my own liking.

That sounds like a reasonable feature request, please open a bug. As
with most reasonable feature requests in projects I maintain, it'll go
on my list, but please don't assume that I will ever get sufficiently
far through the list within my lifetime if left to implement it myself.

A crude way to implement this would be to add something like this
to $PATH:

#!/bin/sh
# Save as ~/bin/autopkgtest-virt-sid and make it executable
set -eu
exec autopkgtest-virt-podman "$@" localhost/autopkgtest/debian:sid

and then use e.g. `autopkgtest ... -- sid`.

(But please note that some backends have more than one place where you
might wish to add arbitrary options, e.g. a-v-podman accepts a-v-podman options, followed by exactly one image, followed by "--" and arbitrary
`podman run` options. It might be better if there was an --image parameter
that can appear first as an alternative to the positional parameter.)

Automatic choice of images makes things more magic, which bears negative aspects as well.

The automatic choice of images is intended to be a matter of "have
reasonable defaults" rather than anything deeper. For example in the
example in the man page, if you tell autopkgtest-build-podman to convert debian:sid into a pre-prepared test container image, it'll default
to outputting autopkgtest/debian:sid because that seems a little more
friendly than forcing the user to choose their own arbitrary name, and establishing a convention via defaults makes it easier to write examples.

(Or if you use --init=systemd to create a bootable system-container,
you'll get autopkgtest/systemd/debian:sid, and so on.)

Every time I run a podman container (e.g. when I run
autopkgtest) my ~/.local/share/containers grows. I think autopkgtest
manages to clean up in the end, but e.g. podman run -it ... seems to
leave stuff behind.

If you are using e.g. `podman run -it debian:sid` then that is expected
to leave the container's root filesystem hanging around for future use
or inspection, even after all of its processes have exited. This is
vaguely analogous to using `schroot --begin-session` followed by
`schroot --run-session`, and then leaving the session open indefinitely.

If you want resources used by the container to be cleaned up automatically
on exit, use the `--rm` option, more like `podman run --rm -it debian:sid`. This is more like `schroot --automatic-session`.

`podman container list -a` will list all the containers that have been
kept around in this way, and `podman container rm` or
`podman container prune` will delete them. This is analogous to
`schroot --end-session`.

Of course, when I skip podman's image management and use --rootfs, I can
side step this problem by choosing my root location on a tmpfs, but
that's not how autopkgtest uses podman.

That seems like a reasonable a-v-podman feature request too. Presumably
it would only allow this when invoked as a-v-podman, and not when invoked
as a-v-docker (I don't think a-v-docker has an equivalent feature).

I don't think podman can do this within a single run. It might be feasible to do the setup (installing build-dependencies) with networking enabled; leave the root filesystem of that container intact; and reuse it as the root filesystem of the container in which the actual build runs, this time with --network=none?

Do I understand correctly that in this variant, you intend to use podman without its image management capabilities and rather just use --rootfs spawning two podman containers on the same --rootfs (one after another)
where the first one installs dependencies and the second one isolates
the network for building?

Maybe that; or maybe use its image management, tell the first podman command not to delete the container's root filesystem (don't use --rm), and then there's probably a way to tell podman to reuse the resulting filesystem
with an additional layer in its overlayfs for the network-isolated run.
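
One possible shape for that second variant, sketched in Python. It assumes that `podman commit` is an acceptable way to turn the kept container's filesystem into an image for the second run; this is not something the thread confirms as the preferred approach. The function name, the "-deps" image suffix and the example commands are hypothetical, and getting the source tree into the container is left out entirely.

```python
import shlex


def two_phase_build(base_image, name, setup_cmd, build_cmd):
    """Return the podman command lines for a network-split build.

    Phase 1 runs setup_cmd with networking and keeps the container
    (no --rm). `podman commit` then snapshots its filesystem as an
    image. Phase 2 runs build_cmd from that image with --network=none,
    and the last two commands clean up the leftovers.
    """
    snapshot = name + "-deps"  # hypothetical image name
    return [
        ["podman", "run", "--name", name, base_image, *setup_cmd],
        ["podman", "commit", name, snapshot],
        ["podman", "run", "--rm", "--network=none", snapshot, *build_cmd],
        ["podman", "rm", name],
        ["podman", "image", "rm", snapshot],
    ]


if __name__ == "__main__":
    # Example commands are placeholders, printed rather than executed.
    steps = two_phase_build(
        "debian:sid",
        "build1",
        ["sh", "-c", "apt-get update && apt-get -y install build-essential"],
        ["sh", "-c", "echo the network-isolated build would run here"],
    )
    for argv in steps:
        print(shlex.join(argv))
```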

Please note that I am far from being an expert on podman or the
"containers" family of libraries that it is based on, and I don't
know everything it is capable of. Because Debian has a lot of pieces
of infrastructure we have built for ourselves from first principles,
I've had to spend time on understanding the finer points of sbuild,
schroot, lxc and so on, so that I can replicate failure modes seen on
the buildds and therefore fix release-critical bugs in the packages that
I've taken responsibility for (and occasionally also try to improve the infrastructure itself, for example #856877 which recently passed its
7th birthday). That comes with an opportunity cost: the time I spent
learning about schroot is time that I didn't spend learning about OCI.

One of the reasons I would like to have fewer Debian-specific pieces in
our stack is so that other Debian developers don't have to do what I
did, and can instead spend their time gaining transferrable knowledge
that will be equally useful inside and outside the Debian bubble (for
example the best ways to use OCI images, and OCI-based tools like
Docker and Podman, which have a lot of overlap in how they are used
even though they are rather different behind the scenes).

smcv


From Johannes Schauer Marin Rodrigues@21:1/5 to All on Thu Jun 27 17:30:01 2024

Hi,

Quoting Simon McVittie (2024-06-27 15:59:01)

On Thu, 27 Jun 2024 at 11:46:51 +0200, Helmut Grohne wrote:

I don't quite understand the need for a Dockerfile here. I suspect that this is the obvious way that works reliably, but my impression was that using podman import would be easier.

Honestly, the need for a Dockerfile here is: I already knew how to build containers from a Dockerfile, and I didn't read the documentation for
the lower-level `podman import` because `podman build` can already do
what I needed.

I see this as the same design principle as why we encourage package maintainers to use dh, even when building trivial "toy" packages like
hello, and in preference to implementing debian/rules at a lower level
in trivial cases. To build a non-trivial container with multiple layers, you'll likely need a Dockerfile (or docker-compose, or some similar thing) *anyway*, so a typical user expectation will be to have a Dockerfile, and anyone building a container will likely already have learned the basics
of how to write one; and then we might as well follow the same procedure
in the trivial case, rather than having the trivial case be different and require different knowledge.

I have never in my life written a Dockerfile and so far I've only used "podman import" instead. Your explanation makes sense to me. I had no idea that "podman build" is on a higher plumbing level. As a container noob it was always easier for me to write:

mmdebstrap [my customizations] unstable | podman import - debian

If I understand what you are saying, then what should instead be done is to write a Dockerfile receiving a vanilla tarball and then do the customizations via the Dockerfile?

Can a Dockerfile be read from stdin? It's a small wrinkle to me that I would then need to create a private temporary directory with a Dockerfile first instead of just shoving it in over a pipe.

Do I understand correctly that in this variant, you intend to use podman without its image management capabilities and rather just use --rootfs spawning two podman containers on the same --rootfs (one after another) where the first one installs dependencies and the second one isolates the network for building?

Maybe that; or maybe use its image management, tell the first podman command not to delete the container's root filesystem (don't use --rm), and then there's probably a way to tell podman to reuse the resulting filesystem
with an additional layer in its overlayfs for the network-isolated run.

Please note that I am far from being an expert on podman or the
"containers" family of libraries that it is based on, and I don't
know everything it is capable of. Because Debian has a lot of pieces
of infrastructure we have built for ourselves from first principles,
I've had to spend time on understanding the finer points of sbuild,
schroot, lxc and so on, so that I can replicate failure modes seen on
the buildds and therefore fix release-critical bugs in the packages that
I've taken responsibility for (and occasionally also try to improve the infrastructure itself, for example #856877 which recently passed its
7th birthday). That comes with an opportunity cost: the time I spent
learning about schroot is time that I didn't spend learning about OCI.

One of the reasons I would like to have fewer Debian-specific pieces in
our stack is so that other Debian developers don't have to do what I
did, and can instead spend their time gaining transferrable knowledge
that will be equally useful inside and outside the Debian bubble (for
example the best ways to use OCI images, and OCI-based tools like
Docker and Podman, which have a lot of overlap in how they are used even though they are rather different behind the scenes).

Thank you for this text as well as the one in your initial email in which you caution against more Debian-isms with only very few maintainer(s) maintaining them. As the author of the unshare backend I am guilty of having added another Debian-specific thing instead of re-using existing solutions. Maybe my defense can be that when I wrote that code in 2018, there was no podman in Debian yet? I am not attached to the unshare code. I gladly throw it out for something better. The less code I have to maintain the better for me. I do not dislike podman either and I am happy that in contrast to docker, there is no persistent service running in the background.

What I wanted to mainly bring up in this email are the following things:

Creating build chroots from things that are signed with the Debian archive keyring is important to me. Even though, as Holger pointed out, the Debian images that one can download can be reproduced independently, I would rather make sure that I receive what I think I receive by relying on creating my chroot via mmdebstrap/apt verified by my local keyring. Maybe in the future debian.org can publish build chroots signed by the archive keyring at which point I may change my position on this. But until then, I really heavily prefer to download the GPG signed stuff from our mirrors instead of something from an image registry that we do not control.

If we change things around, I'd prefer whatever change is done comes with non-negligible advantages and few (preferably none) regressions. The unshare backend is theoretically (not in practice because the schroot backend cannot do it) able to give you an environment where you have all your namespaces unshared but you are on the outside of the chroot directory. This gives processes on the outside the opportunity to work on those on the inside (because they have the required privileges). One enticing application I see for this feature is to be able to build inside chroots that do not even have apt installed. This can be possible because apt can install build dependencies in the chroot from the outside, if given the chroot directory and the necessary privileges. As far as I'm aware (and please correct me if I'm wrong) podman does not offer this functionality? So with a podman backend, my build chroot has to include apt because the build dependencies have to be installed somehow? One solution/workaround to this problem would be something else you said earlier: create the container as part of build dependency installation. In such a scenario, the podman container could be created by running

mmdebstrap [options installing b-d] unstable | podman import -

And then the resulting container would have essential and the build dependencies installed but would not have apt in it. This would not work with a Dockerfile, right?

Last point: people on this list were very excited about using an established container technology in our tooling instead of cooking something up ourselves and rightly so. I agree with that sentiment. The excitement can probably also be seen by there existing 13 independent software packages that do "debian package building in docker": https://wiki.debian.org/SystemBuildTools#Package_build_tools

But, if everybody is so excited about this, where are the sbuild contributors implementing this? As is hopefully obvious from my above questions, I have no clue about containers, so I'd be the wrong person to work on this. But we have implementations like the one in #867176 since 2017 and nobody stepped up to maintain it since then. This is very curious to me. People (including on this list) are very excited about docker/podman but then there is no code and longterm maintainership that follows. I'd be happy to review patches and help integrating podman into sbuild in any of the three ways I outlined in my other mail. But I need help with that and that help didn't arrive yet. So like the others on this list I am with you Simon, that it would be nice to have less Debian-isms and more use of cross-distro tools. But in practice, what we have are the Debian-isms maintained by only very few and nobody putting their long term efforts into implementing something that re-uses cross-distro approaches in sbuild... :/

In essence: somebody please help! :)

Thanks!

cheers, josch

From Simon McVittie@21:1/5 to Johannes Schauer Marin Rodrigues on Thu Jun 27 19:20:01 2024

On Thu, 27 Jun 2024 at 17:26:20 +0200, Johannes Schauer Marin Rodrigues wrote:

But, if everybody is so excited about this, where are the sbuild contributors implementing this?

I'm sorry, consider it added to my list. As usual, there's no guarantee
that I will get there within my lifetime, but I'll make sure to feel
suitably guilty about my failure to achieve it.

But, having said that:

The excitement can probably also
be seen by there existing 13 independent software packages that do "debian package building in docker"

The reason we don't have 14, one of them from me[1], is the same reason
I would be reluctant to develop a new sbuild backend without knowing that
it's what the maintainers of our production infrastructure want to use:

Packages are de facto unreleasable (which is effectively a higher severity
than any RC bug) if they don't compile successfully and pass tests in the project's official build environment. Until recently, this meant stable's sbuild and schroot (or sometimes oldstable's sbuild and schroot) entering
an unstable chroot; more recently, some official buildds switched to the unshare backend, resulting in build failures in that backend becoming worse-than-RC too.

If I do my test-builds in sbuild + schroot in an (old?)stable VM[3], and
they succeed, then I can be somewhat confident that when I do the upload,
the build on the official buildds will succeed too (at least on x86).

If I do my test-builds in some other way, for example directly in a VM,
or in podman, docker, lxc, pbuilder, deb-build-snapshot or whatever other
thing I might personally prefer or find more convenient, then I run the
risk of having my upload fail to build on the official buildds for a schroot-specific reason, which of course is an unacceptable situation for
which I would rightly be held responsible; and step 1 of resolving that situation would be to try to replicate the official build environment,
so I might as well save some time by *already* attempting to replicate
the official build environment. A lot of my Debian contributions are
already guilt-based ("if I don't get this uploaded then $bug is in
some way my fault"), and I'm sorry but I am reluctant to add to that
by creating new and avoidable opportunities to fail to live up to the
project's expectations.

Ideally of course I should do my test-builds in *both* sbuild + schroot
and whatever container technology I'm (hypothetically) proposing as
the new production infrastructure, but then each package I release will
take twice as long per attempt to release, and "smcv takes too long to
release important fixes" is a failure mode that cannot be fixed by any
number of additional QA checks.

Until recently, my understanding is that DSA's policy was to lock
down all official machines by preventing unprivileged creation of user namespaces system-wide, which rules out podman, making it a poor time investment. This is clearly not entirely true any more, because if
it was, buildds would not be able to use sbuild's unshare backend -
so perhaps now is the time to be proposing a sbuild podman backend,
and I should probably be writing one instead of replying to this message.

Arguably there is already a sbuild podman backend, albeit indirectly:
tell sbuild to use an autopkgtest virt server, and then specify the
podman virt server as the one to use. (This has the limitation that it
can't use the network to install build-dependencies and then disable
networking for the actual build, which is a limitation that it shares
with the current schroot backend.) As I mentioned in another thread, unfortunately I have spent considerably less time on podman in autopkgtest
than it deserves: I have not tested it recently, so it's entirely possible
that it doesn't work. If that's the case, then I apologise.

I'm sorry that I have failed to provide a concrete solution to this
problem, and I will try to do better in future.

smcv

[1] Arguably we *do* have 14, one of them from me, because
deb-build-snapshot[2] has an "in Docker"/"in Podman" mode - although
deb-build-snapshot primarily exists to automate generation of labelled
snapshot test-builds for manual testing, and the fact that it has a
"build over there" mode is only a side-effect. It isn't intended
for production use (for example it always builds both arch-dep and
arch-indep binary packages, which of course is an unacceptably lazy
shortcut for production or QA use) and I don't maintain it with a
production-quality level of service, which is why there is no ITP
and also no wishlist bug against devscripts. I am sorry that this
tool does not yet meet the project's quality standards.

[2] https://salsa.debian.org/smcv/deb-build-snapshot

[3] ... and replicate all the other behaviours that the buildds
have, such as setting an unreachable home directory, building
:any and :all separately, and choosing the same undocumented apt
resolver for experimental and backports that the real buildds do


From Simon McVittie@21:1/5 to Reinhard Tartler on Thu Jun 27 19:40:01 2024

On Wed, 26 Jun 2024 at 18:05:15 -0400, Reinhard Tartler wrote:

I imagine that one could whip up some kind of wrapper that builds a
container, either from a tarball created via mmdebstrap or similar using
buildah, has it install all necessary build dependencies, and then uses
podman to run the actual build

Yes, one could, and many have; but not (as far as I know) within the
framework of sbuild, in a way that might be considered acceptable by the operators of our official buildds.

I also briefly started playing with debcraft, which I really like from a usability perspective

On Thu, 27 Jun 2024 at 10:52:27 +0200, Bastian Venthur wrote:

I had the idea to build my Debian packages in a clean docker container instead of using cowbuilder etc for some time now.

There are lots of options for doing this, some of which are listed in <https://wiki.debian.org/SystemBuildTools#Package_build_tools>.

All of these have the same problem as cowbuilder, pbuilder, and any
other solution that is not sbuild + schroot: it isn't (currently) what
the production Debian buildds use, therefore it is entirely possible
(perhaps even likely, depending on what packages you maintain) that your package will build successfully and pass tests in your own local builder,
but then fail to build or fail tests on the buildds as a result of some
quirk of how schroot sets up its chroots, which is a worse-than-RC bug
making the package unreleasable.

I'm sure that a better maintainer than me could avoid this source
of stress by simply recognising situations that could cause a build
failure before they happen, and ensuring that no mistakes are made;
but unfortunately the only way I have found to be able to be somewhat
confident that my packages will build successfully in the real Debian infrastructure, within my own limitations, is to replicate a real
Debian buildd (to the best of my ability) and use that replica for
my test-builds.

smcv


From Johannes Schauer Marin Rodrigues@21:1/5 to All on Thu Jun 27 23:10:02 2024

Simon,

Quoting Simon McVittie (2024-06-27 19:16:54)

On Thu, 27 Jun 2024 at 17:26:20 +0200, Johannes Schauer Marin Rodrigues wrote:

But, if everybody is so excited about this, where are the sbuild contributors
implementing this?

I'm sorry, consider it added to my list. As usual, there's no guarantee that I will get there within my lifetime, but I'll make sure to feel
suitably guilty about my failure to achieve it.

if you want to do me a favour, please do not put it on your todo list. Even more importantly: please try to not feel guilty for anything. If at all possible, I'd like to assure you that you were not even close to being on the list of people (if we imagine that such a list existed in the first place) that I would make responsible.

This is clearly not entirely true any more, because if it was, buildds would not be able to use sbuild's unshare backend - so perhaps now is the time to be proposing a sbuild podman backend, and I should probably be writing one instead of replying to this message.

Or you let other people take care of it. There are more than a dozen attempts outside of sbuild. How hard can it be? I consider you one of the most capable and clever people in the project and I greatly value your input into this discussion. But were I to choose where to put your time, it would not be into stretching your resources even more thinly by becoming the sbuild+podman maintainer. If you are really eager I do not want to stop you either. But please, please do not feel pressured by my last email.

I'm sorry that I have failed to provide a concrete solution to this problem, and I will try to do better in future.

Please accept my apology for how I phrased my last email. I did not want you to feel sorry for anything.

I'm sincerely sorry. I did not mean to make you feel guilty.

Sorry.

josch

From Otto Kekäläinen@21:1/5 to All on Fri Jun 28 05:00:01 2024

Hi Simon!

There are lots of options for doing this, some of which are listed in <https://wiki.debian.org/SystemBuildTools#Package_build_tools>.

All of these have the same problem as cowbuilder, pbuilder, and any
other solution that is not sbuild + schroot: it isn't (currently) what
the production Debian buildds use, therefore it is entirely possible
(perhaps even likely, depending on what packages you maintain) that your package will build successfully and pass tests in your own local builder,
but then fail to build or fail tests on the buildds as a result of some
quirk of how schroot sets up its chroots, which is a worse-than-RC bug
making the package unreleasable.

Could you point me to some Debian Bug # or otherwise share examples of
cases when a build succeeded locally but failed on official Debian
builders due to something that is specific for sbuild/schroot?

I have never run into such a situation despite doing Debian packaging
for 10 years with fairly complex C++ software targeting all archs
Debian supports. Also as a member of the Salsa-CI team I don't recall
ever seeing a bug report about something that built successfully in a
container on Salsa but failed to build on an actual buildd.

I am not dismissive of your claim - as a very senior DD you surely
have those experiences - I am just curious to learn what those cases
might have been.

I could imagine that buildd builds fail if the source was
prepared in a non-hermetic environment that ran as root, or had
network access, or if the build environment was unclean and debian/control
was missing some dependencies, but those are elementary properties of a
hermetic build environment and not inherently something that *only*
sbuild/schroot does.

Related, you might want to take a peek at the source code of https://salsa.debian.org/otto/debcraft to see how it supports both Podman and
Docker, how it generates the 'root.tar.gz' equivalent container automatically based on debian/control and debian/changelog contents,
and then runs the actual build as a regular non-root user in a
container that has no network access. If I learn about other
requirements for a hermetic build environment I would be happy to
incorporate it.

- Otto


From Simon McVittie@21:1/5 to All on Fri Jun 28 12:00:01 2024

On Thu, 27 Jun 2024 at 19:56:43 -0700, Otto Kekäläinen wrote:

Could you point me to some Debian Bug # or otherwise share examples of
cases when a build succeeded locally but failed on official Debian
builders due to something that is specific for sbuild/schroot?

I can't easily point you to a Debian bug number, because I try to only
upload packages that live up to Debian's quality standards, which means
I've been routinely building packages for upload in sbuild/schroot for
several years; so if a package fails in that situation, I do not upload,
and retry as many times as it takes to get it right.

(I'm sure I've failed to do that several times, but I'm sorry, I mostly
can't remember specific instances or bug numbers; I generally try to fix
the regression as quickly as I can.)

But, some examples of packages and the reasons they fail:

- bubblewrap, repeatedly. Its test suite wants to create new user
and filesystem namespaces, which is unconditionally not allowed by
the kernel while inside a chroot (because the kernel doesn't want to
allow filesystem namespaces to be used to escape from a chroot). The
relevant tests have to be skipped in situations where they can't work.

"Real" container managers that use pivot_root() instead of chroot(),
such as Docker and Podman, sometimes allow creation of nested user
namespaces (like bwrap by default, and docker --privileged), sometimes
deny it (like bwrap --disable-userns, and Docker by default), and
sometimes cannot allow it because some larger factor forces their hand:
it's non-obvious what will work.

The conditions for not being allowed to create new namespaces are
relatively complicated and poorly-documented, and the error reporting is
minimal (two or three errno values have to cover every possible failure
mode), so this is something that has to be done by trial and error.

Until recently, DSA'd machines all used
/proc/sys/kernel/unprivileged_userns_clone to disable unprivileged
creation of user namespaces anyway. This restriction has presumably
been lifted for the buildds that use sbuild in unshare mode.

- xdg-desktop-portal, repeatedly. Its test suite uses FUSE, which is
disabled (the module is prevented from loading) on official Debian
buildds as a security hardening mechanism, even though on typical
end-user or server Debian systems it works fine.

This is one that I did have to find out via FTBFS, because I don't yet
have a local build environment that replicates this restriction. I know
that I should, and it's on my list.

- ostree, at least once. The test suite historically assumed that /var/tmp
supports extended attributes, which is not true on all buildds (ordinary
on-disk filesystems usually do support them, but tmpfs doesn't or didn't
until recently, and some buildds with plenty of RAM operate in a tmpfs
root filesystem to speed up their builds).

- flatpak, repeatedly. Same as bubblewrap, ostree and x-d-p, combined.

- dbus, historically. For a long time, when using the non-default
DBUS_COOKIE_SHA1 authentication mechanism, libdbus ignored $HOME and
instead used the "official" home directory from /etc/passwd
(the equivalent of `getent passwd $(id -u) | cut -d: -f6`). Official
buildds set the user's home directory to /nonexistent, so this fails.
In production use, dbus normally uses EXTERNAL over AF_UNIX (and doesn't
even allow DBUS_COOKIE_SHA1, as a piece of security hardening), but in
its build-time tests it specifically exercises each auth mechanism and
each transport, including DBUS_COOKIE_SHA1 over TCP (which is a
terrible idea on Unix but is unfortunately necessary on Windows).

- GLib, ongoing (#972151). When the GLib test suite tests interoperability
with libdbus, it (IMO reasonably!) expects ("localhost", AF_INET) to
resolve to 127.0.0.1, but that doesn't work on IPv6-only buildds for
relatively complicated reasons involving subtleties of glibc resolver
behaviour (#952740). My local build environment still doesn't have code
to reproduce this, and I'm sorry that I haven't provided workarounds or
fixes in the GLib test suite or in libdbus' discouraged TCP code paths.
If someone wants to work on this, skipping the interop tests for TCP on
IPv6-only buildds would probably be more proportionate than adjusting
libdbus' name-resolution behaviour for a feature nobody should be
using in production anyway.

- Any package that assumes that if $XDG_RUNTIME_DIR is set, then it is
set to a usable value (because historically schroot would set it to
a value that exists/works on the host system, but does not exist and
cannot be created inside the container). This is worked around by
individual packages unsetting XDG_RUNTIME_DIR or setting it to a more
useful value, or automatically by recent debhelper in a sufficiently
high compat level (#942111).
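
Several of the conditions in the list above can be probed from inside a build environment. The following Python sketch is only an illustration: the function names are hypothetical, it covers a handful of the listed failure modes (passwd home directory, extended attributes, XDG_RUNTIME_DIR, the unprivileged_userns_clone setting, IPv4 localhost resolution), and it makes no claim to match what the official buildds actually check.

```python
import os
import pwd
import socket
import tempfile


def passwd_home():
    """Home directory from the passwd database, ignoring $HOME."""
    try:
        return pwd.getpwuid(os.getuid()).pw_dir
    except KeyError:
        return None  # the current uid has no passwd entry


def supports_user_xattr(directory):
    """True if a file created in `directory` accepts a user.* xattr."""
    if not hasattr(os, "setxattr"):
        return False  # platform without xattr support in Python
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as tmp:
            os.setxattr(tmp.name, "user.probe", b"1")
        return True
    except OSError:
        return False


def runtime_dir_usable(environ):
    """True if XDG_RUNTIME_DIR is unset, or set to a writable directory."""
    path = environ.get("XDG_RUNTIME_DIR")
    if path is None:
        return True
    return os.path.isdir(path) and os.access(path, os.W_OK)


def userns_clone_setting(path="/proc/sys/kernel/unprivileged_userns_clone"):
    """Contents of the sysctl file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def localhost_ipv4():
    """IPv4 addresses that "localhost" resolves to (possibly none)."""
    try:
        infos = socket.getaddrinfo("localhost", None, socket.AF_INET)
    except OSError:
        return []
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    print("passwd home:", passwd_home(), "| $HOME:", os.environ.get("HOME"))
    print("/var/tmp user xattrs:", supports_user_xattr("/var/tmp"))
    print("XDG_RUNTIME_DIR usable:", runtime_dir_usable(os.environ))
    print("unprivileged_userns_clone:", userns_clone_setting())
    print("localhost over AF_INET:", localhost_ipv4())
```

Running this both locally and in a replica of the buildd setup would show where the two environments differ, which is the trial-and-error part described above.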

I have never run into such a situation despite doing Debian packaging
for 10 years with fairly complex C++ software targeting all archs
Debian supports.

If your complex C++ software is doing pure computation without
side-effects, or if it's doing something that's unaffected by being in
a chroot (like file I/O to the build directory, or IPC via AF_UNIX)
then it can be extremely complex and still not hit this sort of thing. Conversely, container-adjacent tools that want to run build-time tests
will hit this sort of thing every time.

smcv


From Sam Hartman@21:1/5 to All on Fri Jun 28 15:10:01 2024

"Helmut" == Helmut Grohne writes:

Helmut> In this work, limitations with --chroot-mode=unshare became
Helmut> apparent and that lead to Johannes, Jochen and me sitting
Helmut> down in Berlin pondering ideas on how to improve the
Helmut> situation. That is a longer story, but eventually Timo
Helmut> Röhling asked the innocuous question of why we cannot just
Helmut> use schroot and make it work with namespaces.

I'll be honest, I think building a new container backend makes no sense
at all.
There's a lot of work that has gone into systemd-nspawn, podman, docker,
crun, runc, and the related ecosystems.

I think an approach that allowed sbuild to actually use a real container backend would be long-term more maintainable and would allow Debian's
DevOps practices to better align with the rest of the world.

I have some work I've been doing in this space which won't be useful to
you because it is not built on top of sbuild.
(Although I'd be happy to share under LGPL-3 for anyone interested.)

But I find that I disagree with the idea of writing a new container
runtime for sbuild so strongly that I can no longer use sbuild for
Debian work, so I started working on my own package building solution.

I realize that I have not done a good job of being constructive here.
I intended to write some blog posts on this topic, but got sucked into
work and tag2upload.

In terms of constructive feedback:

* I think your intuition that sbuild --chroot-mode=unshare is limiting is
good.

* I would move toward a persistent namespace approach because it is
more similar to broadly used container backends.

* overlayfs/fuse-overlayfs are how the rest of the world is solving
these problems (or snapshots and the like). Directories are kind of a
Debian-specific artifact that I find more and more awkward to deal with
as the rest of my work uses containers for CI/CD.


From Richard Lewis@21:1/5 to [email protected] on Sat Jun 29 18:20:01 2024

Otto Kekäläinen writes:

Could you point me to some Debian Bug # or otherwise share examples of
cases when a build succeeded locally but failed on official Debian
builders due to something that is specific for sbuild/schroot?

I believe both these uploads

https://tracker.debian.org/news/1284669/accepted-chkrootkit-055-3-source-into-unstable/
https://tracker.debian.org/news/1288719/accepted-chkrootkit-055-4-source-into-unstable/

were primarily made to fix autopkgtest failures that occurred on debian infrastructure, and were not noticed before because

https://tracker.debian.org/news/1280523/accepted-chkrootkit-055-2-source-into-unstable/

had only been tested locally with sbuild using the
--chroot-mode=schroot backend


From Sam Hartman@21:1/5 to All on Sun Jun 30 00:00:02 2024

"Richard" == Richard Lewis <[email protected]> writes:

Richard> Otto Kekäläinen <[email protected]> writes:
>> Could you point me to some Debian Bug # or otherwise share
>> examples of cases when a build succeeded locally but failed on
>> official Debian builders due to something that is specific for
>> sbuild/schroot?

Until I fixed it, krb5 would not work in a network namespace that only
had a lo interface. It ran getaddrinfo with AI_ADDRCONFIG in its
tests, because localhost is discouraged/not allowed in krb5 ticket
addresses per RFC 4120. It only talks to itself, but it really wants to
talk to itself not on localhost. I patched the sources to work around
sbuild --chroot-mode=unshare.
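
To illustrate the failure mode described here (a minimal sketch, not the
krb5 test code): with the AI_ADDRCONFIG flag, getaddrinfo only returns
addresses of a family for which the system has a non-loopback address
configured, so in a namespace that only has lo the lookup can come back
empty. The helper name resolve and the port number are made up for this
example.

```python
import socket


def resolve(host, port, addrconfig=True):
    """Return the sockaddr tuples getaddrinfo yields, or [] on failure."""
    flags = socket.AI_ADDRCONFIG if addrconfig else 0
    try:
        infos = socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM, flags=flags
        )
    except socket.gaierror:
        # With only a loopback interface, AI_ADDRCONFIG can leave no
        # address family that getaddrinfo is willing to return.
        return []
    return [info[4] for info in infos]


if __name__ == "__main__":
    print("with AI_ADDRCONFIG:   ", resolve("localhost", 88))
    print("without AI_ADDRCONFIG:", resolve("localhost", 88, addrconfig=False))
```

Running this once on the host and once inside a network namespace that
only has lo should show whether a test suite relying on AI_ADDRCONFIG is
affected.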


From Philipp Kern@21:1/5 to Christian Kastner on Mon Jul 1 09:20:01 2024

Hi,

On 2024-06-29 22:21, Christian Kastner wrote:

At the moment, rootless Podman would seem like the obvious choice. As far
as I'm aware, it has the same user namespaces requirements as the unshare
backends in mmdebstrap, autopkgtest and schroot (user namespaces enabled,
setuid newuidmap, 65536 uids in /etc/subuid, 65536 gids in /etc/subgid).
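
The subuid/subgid part of these requirements can be checked with a few
lines of Python. This is only a sketch: the function names are made up,
it assumes the usual name:start:count line format, and it ignores entries
that list the user by numeric uid.

```python
def subid_ranges(text, user):
    """Parse /etc/subuid-style text into (start, count) pairs for one user."""
    ranges = []
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3 or parts[0] != user:
            continue  # skip blank, malformed and other users' lines
        ranges.append((int(parts[1]), int(parts[2])))
    return ranges


def has_full_range(text, user, needed=65536):
    """True if the user owns at least `needed` subordinate ids in total."""
    return sum(count for _, count in subid_ranges(text, user)) >= needed


if __name__ == "__main__":
    import getpass

    for path in ("/etc/subuid", "/etc/subgid"):
        try:
            with open(path) as handle:
                ok = has_full_range(handle.read(), getpass.getuser())
        except OSError:
            ok = False
        print(path, "ok" if ok else "missing or too small")
```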

As a datapoint, I use rootless podman containers extensively both for
autopkgtest and as an sbuild backend (though the latter is affected by
#1033352 for which I still need to implement a cleaner workaround).

I think the only problem I encountered was a corner case when passing
a device into a container: at some point, autopkgtest runs su which uses
the setgroups() syscall, and group permissions get lost. The solution
was to set up the proper gidmaps. I documented my findings here [1].

Though this latter issue shouldn't be a problem on buildds, where
devices aren't passed in.

How well does this setup nest? I had a lot of trouble trying to run the
unshare backend within an unprivileged container as set up by
systemd-nspawn - mostly with device nodes. In the end I had to give up
and replaced the container with a full-blown VM. I understand that some
of the things compose a little if the submaps are set up correctly, with
fewer IDs allocated to the nested child. Is there a way to make this work
properly, or would you always run into setup issues with device nodes at
this point?

Specifically I'm concerned about what this means for tests and if they
should be able to use unprivileged containers themselves to test things.
I guess we made the decision that we just assume "root" for testing. But
right now you could - presumably - also set up more things under that
assumption that would not work in an unprivileged setup. Is that a
problem?

Relatedly it'd be great if we actually had a VM in-between us and the
build. But that only works well on some architectures, only composes
well on even fewer (e.g. arm64 not having nested virtualization yet), and
only provides a marginal benefit if you execute the build outside of the
VM as well. But it'd shield us more from supply chain issues.

Kind regards
Philipp Kern


From Simon McVittie@21:1/5 to Philipp Kern on Mon Jul 1 18:00:02 2024

On Mon, 01 Jul 2024 at 09:18:19 +0200, Philipp Kern wrote:

Specifically I'm concerned about what [advocating use of podman] means
for tests and if they should be able to use unprivileged containers
themselves to test things.

tl;dr: There's no regression here, because you already can't run those
tests on a buildd.

There's no unified definition of "container" in the Linux kernel, only
a selection of different mechanisms that are used by container managers
to do what they want to do according to their individual security models
and desired functionality, so the only fully general answer we can give
to this is: there are containers, and there are containers, so you'll
need to be more specific about which specific things you want.

One use-case that I'm familiar with is bwrap (bubblewrap, as used by
flatpak) nested inside podman. bwrap is a relatively limited container
technology with relatively "light" requirements, at the cost of imposing
harsh restrictions on the code inside the container: you only get access
to one uid, and all other uids get mapped to the overflow uid ("nobody").
You can think of it as having two possible identities, "me" and "not me".
Even with that limitation, bwrap inside podman doesn't normally work,
because podman forbids most nested container operations. I'm unsure
whether this is a functional requirement to prevent attacks where the
podman container "payload" escapes from the container and gets arbitrary
code execution on the host, or whether this is merely non-essential
security hardening to make it harder to exploit possible vulnerabilities
that podman aims to already prevent in some other way. Either way,
I would expect that buildd operators would not want to allow it.
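
The "me" and "not me" behaviour follows from how a user namespace's uid
map is applied. Here is a small model of that lookup, as a sketch only:
the triples mirror the three columns of /proc/<pid>/uid_map, 65534 is the
kernel's default overflow uid, and uid 1000 is an arbitrary example user.

```python
OVERFLOW_UID = 65534  # default value of /proc/sys/kernel/overflowuid


def to_inside(uid_map, outside_uid, overflow=OVERFLOW_UID):
    """Translate an outside uid to the uid seen inside the namespace.

    uid_map holds (inside_start, outside_start, length) triples.
    Any uid not covered by a triple shows up as the overflow uid.
    """
    for inside_start, outside_start, length in uid_map:
        if outside_start <= outside_uid < outside_start + length:
            return inside_start + (outside_uid - outside_start)
    return overflow


# A bwrap-style map: exactly one uid (here 1000) is mapped onto itself.
single_uid_map = [(1000, 1000, 1)]

if __name__ == "__main__":
    for uid in (1000, 0, 1001):
        print(uid, "->", to_inside(single_uid_map, uid))
```

With a one-entry map, files owned by root or by any other user all appear
to belong to the same overflow uid inside the container.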

podman nested inside podman is "more difficult" than bwrap nested inside
podman (because it's more capable and imposes fewer restrictions on the
payload, therefore needs a larger-than-default block of uids to be made
available, whereas bwrap only needs one uid), and almost certainly also
won't work.

But neither of these is a regression, because we can't normally do either
of those things inside schroot anyway! So packages like bubblewrap and
flatpak have no choice but to skip most of their regression tests at
build-time. This is obviously not ideal, but it's better than not being
able to ship these packages in Debian at all.

On ci.debian.net, the bubblewrap and flatpak test suites are re-run as
"as-installed" tests, and those *can* be run, using autopkgtest's qemu
backend - although I believe that's currently disabled because of some
technical issues with the qemu backend or the infrastructure, so those
tests might end up being skipped (again) on the lxc backend.

I believe bwrap nested inside `podman --privileged` *does* work. As I
said above, I don't know where that falls on the scale between "believed
to be secure, but less well-hardened" and "definitely not secure".

Relatedly it'd be great if we actually had a VM in-between us and the build.

Prior art for this includes `sbuild --chroot-mode=autopkgtest
--autopkgtest-virt-server=qemu` (which uses qemu instead of schroot
or podman as the "container" for the actual build), openSUSE's
Open Build Service (which uses a new VM for each build in at
least some configurations), and my own experimental build wrapper
<https://salsa.debian.org/smcv/vectis> (which runs the whole sbuild
instance inside the VM, in an attempt to be bug-for-bug compatible with
Debian's production infrastructure as mentioned earlier in this thread).

smcv


From Helmut Grohne@21:1/5 to Philipp Kern on Sat Jul 6 18:40:01 2024

Hi Philipp,

Let me go into some detail that is tangential to the larger discussion.

On Mon, Jul 01, 2024 at 09:18:19AM +0200, Philipp Kern wrote:

How well does this setup nest? I had a lot of trouble trying to run the
unshare backend within an unprivileged container as set up by
systemd-nspawn - mostly with device nodes. In the end I had to give up
and replaced the container with a full-blown VM. I understand that some
of the things compose a little if the submaps are set up correctly, with
fewer IDs allocated to the nested child. Is there a way to make this work
properly, or would you always run into setup issues with device nodes at
this point?

Technically speaking, nesting is possible. The individual container
implementation may limit you, but that's an implementation limit and not
a fundamental one. I'm assuming that you want to nest a rootless
container in a rootless container as that tends to be the most difficult
one. Roughly speaking your unprivileged container wants access to your
user id and a 64k allocation of subuids. This applies to the nested
container. If your outer container maps two 64k ranges (one to 0 to
65535 and the other to whatever your user has in its contained
/etc/subuid), your contained user should actually be able to spawn a
podman container unless I am missing something important. Devices
usually are not a problem (for rootless containers) as you cannot create
them anyway so you end up bind mounting them and the bind mounting
technique nests well.

A typical Debian installation only allocates a single 64k range to each
user. Your first step here is growing that range or adding another one.
(Yes, you may have multiple lines for your user in /etc/subuid.) Then
the podman-run documentation hints at --uidmap and it says that you can
specify it multiple times to map multiple ranges. This is how you
construct your outer container. Then inside, nesting should just work.
Admittedly, I've not tried this.

The takeaway should be that if your outer container is constructed in
the right way, you should be able to nest other containers (e.g. podman,
mmdebstrap, sbuild unshare, ...) without issues. It's not like this just
works out of the box, but it should be feasible.
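
The id arithmetic behind the two 64k ranges can be sketched as follows.
This is untested against any real container tool and only plans the
mapping: plan_outer_map is a made-up name, the inner start of 100000 is
an assumed value for what the contained /etc/subuid lists, and the
special handling of the invoking user's own uid is left out.

```python
RANGE = 65536  # the "64k allocation" discussed above


def plan_outer_map(host_ranges, inner_subuid_start=100000):
    """Split host subuid ranges into two 64k blocks for the outer container.

    host_ranges: (start, count) pairs as listed for the user in /etc/subuid.
    Returns (inside_start, outside_start, length) triples, the same column
    order as /proc/<pid>/uid_map. Raises ValueError if fewer than
    2 * 64k subordinate ids are available.
    """
    targets = [0, inner_subuid_start]  # where each block appears inside
    plan = []
    need = RANGE  # ids still missing for the current block
    for start, count in host_ranges:
        while count > 0 and targets:
            take = min(count, need)
            plan.append((targets[0] + RANGE - need, start, take))
            start, count, need = start + take, count - take, need - take
            if need == 0:
                targets.pop(0)
                need = RANGE
    if targets:
        raise ValueError("need at least %d subordinate ids" % (2 * RANGE))
    return plan


if __name__ == "__main__":
    # One grown range of 128k ids, or two separate /etc/subuid lines.
    print(plan_outer_map([(100000, 131072)]))
    print(plan_outer_map([(100000, 65536), (300000, 65536)]))
```

The first block provides ids 0 to 65535 inside the outer container, the
second one provides the range that the contained user hands on to the
nested container.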

Helmut


From Helmut Grohne@21:1/5 to Sam Hartman on Sat Sep 14 12:00:01 2024

Hi Sam and others,

On Fri, Jun 28, 2024 at 07:08:20AM -0600, Sam Hartman wrote:

I'll be honest, I think building a new container backend makes no sense
at all.

I looked hard at this as it was voiced by many. I have to say, I remain
unconvinced of the arguments brought forward.

There's a lot of work that has gone into systemd-nspawn, podman, docker,
crun, runc, and the related ecosystems.

I consider myself an expert user of systemd-nspawn. One thing that it
really lacks on bookworm is unprivileged execution. If you run your
builds as root, there is debspawn. In future, systemd-nspawn shall work
unprivileged - if your image is dm-verity signed. Bummer. I do not see
it as meeting our technical requirements in any way.

podman is a much more sensible suggestion and Simon gave a lot of
feedback on how to integrate it. Still its architecture is limiting in
multiple central aspects. For one thing, podman works with a static set
of namespaces per container instance, but what we want here is to use
different network namespaces for installing build-depends and performing
a build. Another aspect is that people are already complaining about the
tarball-unpack approach taken by sbuild --chroot-mode=unshare being
slow. podman will make it slower due to requiring the unpack to happen
inside the user's $HOME. My initial experiments indicate that we're in
for a factor of two whereas we could get this down significantly by using
an overlayfs approach that we cannot shoehorn into podman. podman
upstream insists on CAP_SYS_ADMIN being a no go while systemd upstream
insists on CAP_SYS_ADMIN being a requirement. Whilst this is fine for
building, we also want to run autopkgtests. Running podman also requires
a systemd-logind session - something that is not usually available on a
buildd, in an application container (where you may also want to build a
package) or when you su/sudo to a different user. My conclusion is that
morphing podman into something usable is more work than writing a
container runtime and that doesn't even account for the political
disagreements involved.

Let me skip docker as it is very similar to podman in all of the aspects
above.

Then you mention crun and runc. These are vaguely API-compatible and
they are the lower level building blocks of both podman and docker. The
issue about CAP_SYS_ADMIN mentioned for podman earlier can be resolved
with ease at this level (at the cost of having containers that do not
contain, which was the reason for podman refusing to do this). The earlier
note about network namespaces fully applies here though. By going down
to this level, we also lose quite a bit of the benefits of image
management that the podman level included.

Your vague mention of related tools probably includes slirp4netns,
passt, uidmap and others. Tools at this level do not interfere with our
requirements and as such I fully concur with reusing them.

Beyond all of this, I am taking issue with a fundamental design decision
of all the mentioned container runtimes. They all have an architecture
that allows an outside process to "join" a container (podman exec).
Whilst that is a useful feature, it is using the setuid approach of
privilege transitions that we have learned for years to be inherently
vulnerable and that systemd folks have been working hard on replacing
with IPC mechanisms. As far as I understand it, a significant portion of
container runtime escapes work by exploiting this joining architecture
and the involuntary acquisition of host resources into a container. If
this were implemented via IPC, we could sidestep an entire class of
vulnerabilities.

I think an approach that allowed sbuild to actually use a real container
backend would be long-term more maintainable and would allow Debian's
DevOps practices to better align with the rest of the world.

I have a hard time agreeing with this. I have been using rootless
containers for far longer than podman has supported them and I still feel
very limited whenever I am supposed to use podman and prefer resorting to
other tools that are more capable and performant.

I have some work I've been doing in this space which won't be useful to
you because it is not built on top of sbuild.
(Although I'd be happy to share under LGPL-3 for anyone interested.)

You can. I'm not sure we'll have to stick to sbuild. If we end up
converting our official buildds to something else, so be it. However,
I'd like to get to a point where building packages just works in a way
that doesn't require root privileges by default. We don't have this "it
just works" experience now.

But I find that I disagree with the idea of writing a new container
runtime for sbuild so strongly that I can no longer use sbuild for
Debian work, so I started working on my own package building solution.

Please bear in mind that effectively, sbuild has gained its own
container runtime already and that what I am looking into here is
extracting it into a separate package interfacing with sbuild. I would
therefore rephrase it as refactoring a container runtime rather than
writing a new one.

Then if there was an alternative to sbuild that would allow unprivileged
package building in a sane way, I'd readily switch over and stop
bothering about all of this. The problem is that they are all vaporware
while unschroot barely reached feature parity with sbuild
--chroot-mode=unshare.

In terms of constructive feedback:

* I think your intuition that sbuild --chroot-mode=unshare is limiting is
good.

At least something we agree on. :)

* I would move toward a persistent namespace approach because it is
more similar to broadly used container backends.

I agree and agree with the reason you give. I have reached the
conclusion that doing a persistent namespace requires a background
process and an IPC mechanism. (This requirement rules out
podman/docker/crun/runc.)

* overlayfs/fuse-overlayfs are how the rest of the world is solving
these problems (or snapshots and the like). Directories are kind of a
Debian-specific artifact that I find more and more awkward to deal with
as the rest of my work uses containers for CI/CD.

I don't think this is fully accurate. In particular, podman performs
extraction for every container instantiation and thus requires a lot of
storage on $HOME. I agree that overlayfs is preferable, but
unfortunately, this is not how podman works. In any case, the important
piece here is not whether to use directories or overlayfs (mind the
performance difference) but hiding the storage backend behind an
abstraction that enables a user not to think about it. And that's really
what podman and docker (but not runc and crun) do.

So the more I have let this settle and experiment with podman and stuff,
the more I am reaching the conclusion that none of the existing
container runtimes provide an architecture that solves the requirements
I would like to see met. While the unschroot approach does not provide
persistent namespaces at this time, it demonstrates that we technically
can plug a container runtime into sbuild. It is not so much that I have
to write my own (I'd rather prefer not to), but trying to plug something
into sbuild and make it work practically. And that's not because sbuild
would be the best tool around, but because so many other tools integrate
into sbuild that it effectively has become a really complex API that I
don't want to reimplement. I tried plugging podman and that just
wouldn't fit, so I'll continue looking for other options despite
everyone else telling me that this is a bad idea. Maintaining a
container runtime is hard, because maintaining a code base of a hundred
thousand lines is hard. What if you'd merely need a few thousand? Thus
far, unschroot has 0.3 thousand lines (plus libraries). This also hints
that podman is solving a lot of problems just not the ones we face.

And then as we disagreed so much about container runtimes, I think the
end goal is not building in a container, but building inside a kvm as
that provides a far better isolation between guest and host. One step at
a time.

Helmut


From Simon McVittie@21:1/5 to Helmut Grohne on Sat Sep 14 12:40:01 2024

On Fri, 13 Sep 2024 at 11:15:55 +0200, Helmut Grohne wrote:

My initial experiments indicate that we're in
for a factor two [slowdown] whereas we could get this down significantly
by using an overlayfs approach that we cannot shoehorn into podman.

Er, podman does use overlayfs, in at least some circumstances?

$ podman run --rm -it debian:sid-slim grep ' / / ' /proc/self/mountinfo
464 131 0:101 / / rw,relatime - overlay overlay rw,lowerdir=/home/smcv/.local/share/containers/storage/overlay/l/[…],upperdir=/home/smcv/.local/share/containers/storage/overlay/[…]/diff,workdir=/home/smcv/.local/share/containers/storage/overlay/[…]/work,redirect_dir=nofollow,uuid=on,volatile,userxattr

In unstable (and I think also bookworm but I haven't checked
recently), /usr/share/containers/storage.conf defaults to the
"overlay" driver - but the real default is whatever already exists in
~/.local/share/containers/storage, with the configured driver only used
for new setups, unless forced.

I think the performance characteristics you describe probably mean that
you have container storage that is already using the "vfs" driver, which
is indeed based on quite a lot of copying.
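
The grep check above can also be scripted. The sketch below parses
mountinfo text in the format documented in proc(5) and reports the
filesystem type of the root mount; the function name is made up, and the
expectation that a "vfs"-backed container would show the host's backing
filesystem instead of "overlay" is my assumption, not something verified
here.

```python
def root_fstype(mountinfo_text):
    """Return the filesystem type mounted at "/", or None if not found."""
    fstype = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # Optional fields start at index 6 and end at a lone "-".
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if len(fields) > sep + 1 and fields[4] == "/":
            fstype = fields[sep + 1]  # a later mount on "/" shadows earlier ones
    return fstype


if __name__ == "__main__":
    with open("/proc/self/mountinfo") as handle:
        print(root_fstype(handle.read()))
```

Run inside the container (for example via podman run), it prints
"overlay" for a setup like the one shown above.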

podman upstream insists on CAP_SYS_ADMIN being a no go while systemd
upstream insists on CAP_SYS_ADMIN being a requirement

Sorry, this is just not true, in either direction.

podman can be configured to allow CAP_SYS_ADMIN inside the container
(podman run --cap-add=CAP_SYS_ADMIN), but it isn't the default, because
it likely[1] means that "containers don't contain" (no effective security
boundary between root in the container, and the user whose uid was mapped
to the container's uid 0). I suspect the same is going to be equally
true for anything that retains CAP_SYS_ADMIN and maps your real uid to
a container uid, but having a uid in common is usually desirable if you
want to be able to provide files to the container, or provide a place
where the container can write files back out.

systemd doesn't "insist on" CAP_SYS_ADMIN either - it specifically
doesn't require it! - but some individual systemd features do require
it. At the moment, it will fail closed (services like polkitd whose
security-hardening settings need CAP_SYS_ADMIN fail to start), which
surprised me, because other systemd security-hardening settings tend to
fail open (if systemd doesn't have all of the necessary capabilities
or kernel features then the service still starts, but the rest of the
containerized system is less protected from the service than it could
have been).

[1] I asked podman upstream and the answer can be summarized as
"it's complicated, but probably"

I have reached the
conclusion that doing a persistent namespace requires a background
process and an IPC mechanism. (This requirement rules out
podman/docker/crun/runc.)

podman/docker can certainly run a background process that accepts
commands via IPC. They don't do this by default, sure, but if you
make the container payload include a process that accepts commands -
perhaps on an AF_UNIX or TCP socket, or through pipes - then they won't
stand in the way of doing that.

(Proof of concept 1: a podman container with an init system and
sshd. Proof of concept 2: the persistent process is a shell inside the
container, and the IPC mechanism is a pipe on stdin and another pipe on
stdout. Obviously an interactive shell makes a really bad IPC protocol,
as we already knew from autopkgtest-virt-qemu and LAVA, and for production
use it would be better to use a more structured protocol with proper
framing and error handling, like the D-Bus interface that systemd-run
uses - but that's an implementation detail.)
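
As a rough idea of what "proper framing and error handling" could look
like, here is a toy command server that speaks length-prefixed JSON over
an AF_UNIX socket. It is a sketch under several assumptions: the message
layout and all function names are invented, both ends run in one process
connected by socket.socketpair(), and getting the socket into a real
container (for example via a bind-mounted socket path) is not shown.

```python
import json
import socket
import struct
import subprocess
import threading


def send_msg(sock, obj):
    """Send one JSON message prefixed by its length (4 bytes, big-endian)."""
    data = json.dumps(obj).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)


def recv_exact(sock, n):
    """Read exactly n bytes or raise EOFError if the peer went away."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length))


def serve(sock):
    """Persistent side: run each requested argv and report the result."""
    while True:
        try:
            request = recv_msg(sock)
        except EOFError:
            return
        try:
            proc = subprocess.run(
                request["argv"], capture_output=True, text=True
            )
            reply = {"status": proc.returncode, "stdout": proc.stdout}
        except OSError as err:
            # Failures to start the command are reported, not fatal.
            reply = {"status": None, "error": str(err)}
        send_msg(sock, reply)


def run_remote(sock, argv):
    """Client side: send one command and wait for its reply."""
    send_msg(sock, {"argv": argv})
    return recv_msg(sock)


if __name__ == "__main__":
    client, server = socket.socketpair()  # AF_UNIX on Linux
    threading.Thread(target=serve, args=(server,), daemon=True).start()
    print(run_remote(client, ["echo", "hello"]))
```

Unlike a shell on a pipe, each reply has a known length and an explicit
status, so the client never has to guess where command output ends.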

smcv

