Forum: >>> Magnum BBS <<<

Bug#1106083: scilab: FTBFS: autobuilder hangs when using the kernel of

From Santiago Vila@21:1/5 to All on Fri Jul 4 10:10:01 2025

Hello.

For some reason, the hang does not happen (0% failure rate in bookworm)
or it happens a lot less (10% failure rate in trixie) when building
on single-CPU systems.

So, I'm going to try adding --max-parallel=1 to the dh call to see
if that improves things. If it does, I'll make a team upload
with that minor change.

(But help is still welcome to investigate the underlying problem,
please contact me privately if you need a VM to reproduce).

Thanks.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Étienne Mollier@21:1/5 to All on Wed Jul 9 17:50:01 2025

Hi Santiago,

Waving from IMT Atlantique, I have attempted multiple builds of
scilab 2024 in trixie, but I haven't managed to reproduce the
build hanging that you observed; I have run ten builds which
took their time but perhaps I have not gathered enough samples.
The kernel in trixie is currently Linux 6.12.33+deb13-amd64. By
contrast, I saw that your build of 2024 scilab was running on
Linux 6.12.25-cloud-amd64. Do you still observe the problem on
your side?

In hope this helps, :)
--
.''`. Étienne Mollier <[email protected]>
: :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from my alarm clock
`-


From Santiago Vila@21:1/5 to All on Wed Jul 9 18:20:01 2025

On Wed, Jul 09, 2025 at 05:46:38PM +0200, Étienne Mollier wrote:

> Hi Santiago,
>
> Waving from IMT Atlantique, I have attempted multiple builds of
> scilab 2024 in trixie, but I haven't managed to reproduce the
> build hanging that you observed; I have run ten builds which
> took their time but perhaps I have not gathered enough samples.
> The kernel in trixie is currently Linux 6.12.33+deb13-amd64. By
> contrast, I saw that your build of 2024 scilab was running on
> Linux 6.12.25-cloud-amd64. Do you still observe the problem on
> your side?

Yes. The last kernel I tried was 6.12.32-cloud-amd64 and
it still failed (randomly). I've put my recent build logs here:

https://people.debian.org/~sanvila/build-logs/202507/

I'm going to try a few more times with the current kernel to be sure
and if it fails, I hope somebody can test the VM which I always offer
to debug this.

[ Note: I don't think it's an issue with the kernel itself, but maybe
some particularity of the kernel just makes the problem happen
more easily. For that reason, I actually expect this to happen with
any trixie kernel ].

Thanks.


From Étienne Mollier@21:1/5 to All on Wed Jul 9 18:40:01 2025

Hi Santiago,

For what it's worth, I have run the builds on a beefy laptop,
and despite having most of the build process being already
serialized, I noticed that my memory consumption was pretty
high. I wonder if there could also be some resource exhaustion
at play.

I also notice that the kernel run is the cloud variant. Maybe
having a look at differences with the plain kernel could reveal
some clues? (assuming that the problem has not disappeared from
6.12.32 to 6.12.33…)

In hope this helps, :)
--
.''`. Étienne Mollier <[email protected]>
: :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from my alarm clock
`-


From Santiago Vila@21:1/5 to All on Wed Jul 9 19:00:01 2025

On Wed, Jul 09, 2025 at 06:28:44PM +0200, Étienne Mollier wrote:

> For what it's worth, I have run the builds on a beefy laptop,
> and despite having most of the build process being already
> serialized, I noticed that my memory consumption was pretty
> high. I wonder if there could also be some resource exhaustion
> at play.

That would be a possibility in theory, but my autobuilders do nothing
but build packages, always one package at a time, and they
always have enough memory.

To achieve that, I monitor Committed_AS in /proc/meminfo and collect
statistics about all packages. Then, when the central server receives
a request from one of the autobuilders to "build something", it always
assigns a package that is buildable on the requesting autobuilder,
according to its available memory.

For the particular case of scilab, it needs around 981 MB to build
on single-CPU systems and 1321 MB to build on systems with 2 CPUs.
My smallest machine these days has 4 GB of RAM, so I don't think
memory is the problem here.
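
As a rough illustration of the monitoring described above, here is a minimal Python sketch that extracts Committed_AS from /proc/meminfo-style text. The function names and the sample values are hypothetical, and this is not the code the autobuilders actually run; it only assumes the usual "Name: value kB" layout of /proc/meminfo.

```python
import re
from typing import Optional


def committed_as_kib(meminfo_text: str) -> Optional[int]:
    """Return the Committed_AS value in kiB, or None if the field is absent."""
    match = re.search(r"^Committed_AS:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_committed_as_kib(path: str = "/proc/meminfo") -> Optional[int]:
    """Read the current Committed_AS value from a live Linux system."""
    with open(path, encoding="ascii") as f:
        return committed_as_kib(f.read())


# Hypothetical sample, only to show the expected input format.
SAMPLE = """MemTotal:        4018432 kB
MemFree:         1203456 kB
Committed_AS:    1004544 kB
"""
```

Sampling read_committed_as_kib() periodically during a build and keeping the maximum would give a per-package peak figure of the kind quoted above.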

> I also notice that the kernel run is the cloud variant. Maybe
> having a look at differences with the plain kernel could reveal
> some clues? (assuming that the problem has not disappeared from
> 6.12.32 to 6.12.33…)

That would also be a possibility in theory, but I think it's unlikely.

Lucas Nussbaum has been using the cloud variant for ages for his
archive rebuilds, and so far we have never found a package which
failed for that reason.

So, it is not impossible, but I would prefer to keep that hypothesis
as a last resort.

(If you think about it, that would pose some interesting challenges
and technical issues: What legitimate reason could a package have to
FTBFS if you don't use the standard kernel, and how are we supposed to
express such a dependency in debian/control?)

In case it matters, the failure rates that I got recently were:

10% (5 out of 50) on systems with 1 CPU
78% (39 out of 50) on systems with 2 CPUs.
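
To judge whether the gap between those two rates could be sampling noise, one option is a confidence interval for each. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96); the choice of method and the function name are my assumptions, not something used in this thread.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion failures/runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = failures / runs
    denom = 1.0 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half, centre + half


# Figures quoted above: 5/50 hangs with 1 CPU, 39/50 hangs with 2 CPUs.
one_cpu = wilson_interval(5, 50)
two_cpus = wilson_interval(39, 50)
print(f"1 CPU : {one_cpu[0]:.3f} .. {one_cpu[1]:.3f}")
print(f"2 CPUs: {two_cpus[0]:.3f} .. {two_cpus[1]:.3f}")
```

The two intervals do not overlap, which is consistent with the CPU count having a real effect rather than the difference being an artifact of only 50 runs each.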

I suspect a race condition of some kind, so if you are still
willing to try different things (as opposed to directly trying
in my VM after I finish my last test build), I would try
building the package on a self-hosted qemu/kvm machine
with exactly 2 CPUs. You can probably achieve the same
effect by using GRUB_CMDLINE_LINUX="nr_cpus=2" (i.e.
modify /etc/default/grub, run update-grub and reboot).

(btw: I am building scilab in unstable 100 times as we speak; I expect
to be able to tell the outcome by tonight).
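
For anyone who wants to measure a hang rate in the same spirit, here is a minimal Python sketch that runs a command repeatedly and counts the runs that exceed a time limit. The function name, the timeout, and the example command are hypothetical; this is not how my autobuilders are driven.

```python
import subprocess
import sys
from typing import Sequence


def count_hangs(cmd: Sequence[str], runs: int, timeout_s: float) -> int:
    """Run cmd `runs` times and count the runs that exceed timeout_s seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                timeout=timeout_s,
                check=False,  # a non-zero exit status is not counted as a hang
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the direct child before raising;
            # grandchildren of a real build may need extra cleanup.
            hangs += 1
    return hangs


if __name__ == "__main__":
    # Stand-in command; a real test would invoke the package build instead.
    print(count_hangs([sys.executable, "-c", "pass"], runs=3, timeout_s=30.0))
```

A timeout only approximates "hung": it has to be chosen well above the longest legitimate build time on the machine in question.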

Thanks.


From Étienne Mollier@21:1/5 to All on Fri Jul 11 00:20:01 2025

Hi Santiago,

I've been banging my head a fair part of the day wondering why
the error seemed to never occur when I was staring at it. My
current conjecture would be to look at what happens when
stressing the entropy pool of the machine, as the random source
seems to behave slightly differently depending on real and
virtual hosts. At the moment, I have had the build hanging
twice in a row at the same stage of documentation building,
while trying to stress the entropy pool with:

$ cat /dev/random >/dev/null

The build froze in deadlock at the following command in the two
cases:

/build/reproducible-path/scilab-2024.1.0+dfsg/scilab/.libs/scilab-bin -nb -noatomsautoload -nouserstartup -quit -f ./modules/helptools/data/configuration/regen_list.sce -nw

I have a few reservations:

0. if I trust Helmut Grohne, I cannot empty the entropy pool;
   maybe I can just make it more strained, so the behavior may
   be unreproducible to some extent: for instance I have not
   managed to reproduce the behavior straight on my laptop;
1. the error occurs slightly earlier in the build log than your
   own records, so I may have triggered a different issue with
   my meddlings;
2. I probably want to try this procedure a couple more times,
   to determine a more precise frequency of occurrence (I could
   have run into "coincidences" so far, a third iteration in a
   row would be "enemy in action").

I may continue after a good night of sleep, to at least
determine whether this is a real clue or just a dead end. If
the former, then this would point to either an implementation
issue in scilab making it too fragile on the random source, or a
problem in the kernel's random source itself, since the behavior
has occurred more often with recent kernels, or a bit of both.

Have a nice day, :)
--
.''`. Étienne Mollier <[email protected]>
: :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from my alarm clock
`-


From Santiago Vila@21:1/5 to All on Fri Jul 11 02:20:02 2025

On Fri, Jul 11, 2025 at 12:09:48AM +0200, Étienne Mollier wrote:

> 0. if I trust Helmut Grohne, I cannot empty the entropy pool;
>    maybe I can just make it more strained, so the behavior may
>    be unreproducible to some extent: for instance I have not
>    managed to reproduce the behavior straight on my laptop;

I would be quite surprised if the entropy pool was the problem.

I remember the time when several packages (mostly
cryptography-related) failed their test suites because of this.

I found haveged to be the perfect workaround for that, and I always
installed it in all my autobuilders, but I stopped using it two years ago
when the kernel got a new implementation of /dev/random that made haveged
unnecessary (maybe this is what Helmut told you about the entropy pool).

Moreover, as a test, I've built scilab 20 times with haveged
installed, and the failure rate is more or less the same.

I think the entropy pool is unlikely to be the problem.

(But of course I'm glad that you are considering all possible
scenarios).

I guess that at some point we should probably tell upstream about this.

Thanks.


From Santiago Vila@21:1/5 to All on Fri Jul 11 02:30:01 2025

For the record, I've put more recent failed build logs in the same place as before:

https://people.debian.org/~sanvila/build-logs/202507/

Based on that, this is what we know:

- It fails the same way when using 6.12.33+deb13-cloud-amd64.
- It fails the same way regardless of using unshare or the old schroot backend.

Thanks.


From Santiago Vila@21:1/5 to Pierre Gruet on Mon Jul 14 02:50:01 2025

On Sun, Jul 13, 2025 at 08:03:24AM +0200, Pierre Gruet wrote:

> However, my feeling is that the bug is not RC, since it shows up only in
> VMs: in practice it does not affect the autobuilders nor the developers
> on their individual machines. What do you think?

Those seem weak reasons to me.

We don't really know if it shows up only in VMs. Just because
it fails on the VMs I've provided does not mean that it
fails *because* I'm using VMs.

But in either case lowering support for VMs looks arbitrary to me.
What will we tell the end user who wants to build this from source?
Will we say "Sorry, this is not supported because you are using a VM?"
What's wrong with using VMs? That would not fit very well with our
idea that Debian is the universal operating system, valid for
servers and desktops, and also valid for VMs and real hardware.

We also don't even know if this will happen in the autobuilders, because
this started to happen when I switched to using trixie (including
the kernel of trixie) and there has not been any build of scilab
in the official autobuilders using the kernel of trixie yet.

In cases like this one, I suggest that we ask for a trixie-ignore tag
while we await a reply from upstream. As a team member, I can file
the release.debian.org bug if nobody else volunteers.

Thanks.


From Étienne Mollier@21:1/5 to All on Mon Jul 14 09:10:02 2025

Hi both,

Pierre Gruet, on 2025-07-14:

> On 14/07/2025 at 02:36, Santiago Vila wrote:
>
> > In cases like this one, I suggest that we ask for a trixie-ignore tag
> > while we await for some reply from upstream. As a team member, I can
> > file the release.debian.org bug if nobody else volunteers.
>
> ... but certainly, this looks like a good way to go. You may file the
> bug to ask for trixie-ignore, thanks a lot for this!

That sounds reasonable to me as well.

Have a nice day, :)
--
.''`. Étienne Mollier <[email protected]>
: :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
`. `' sent from my alarm clock
`-
