• Bug#1106083: scilab: FTBFS: autobuilder hangs when using the kernel of

    From Santiago Vila@21:1/5 to All on Fri Jul 4 10:10:01 2025
    Hello.

    For some reason, the hang does not happen (0% failure rate in bookworm)
    or it happens a lot less (10% failure rate in trixie) when building
    on single-CPU systems.

    So, I'm going to try adding --max-parallel=1 to the dh call to see
    if that improves things. If it does, I'll make a team upload
    with that minor change.

    (But help is still welcome to investigate the underlying problem,
    please contact me privately if you need a VM to reproduce).

    Thanks.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Étienne Mollier@21:1/5 to All on Wed Jul 9 17:50:01 2025
    Hi Santiago,

    Waving from IMT Atlantique, I have attempted multiple builds of
    scilab 2024 in trixie, but I haven't managed to reproduce the
    build hanging that you observed; I have run ten builds which
    took their time but perhaps I have not gathered enough samples.
    The kernel in trixie is currently Linux 6.12.33+deb13-amd64. By
    contrast, I saw that your build of 2024 scilab was running on
    Linux 6.12.25-cloud-amd64. Do you still observe the problem on
    your side?

    In hope this helps, :)
    --
    .''`. Étienne Mollier <[email protected]>
    : :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
    `. `' sent from my alarm clock
    `-

  • From Santiago Vila@21:1/5 to All on Wed Jul 9 18:20:01 2025
    On Wed, Jul 09, 2025 at 05:46:38PM +0200, Étienne Mollier wrote:
    > Hi Santiago,
    >
    > Waving from IMT Atlantique, I have attempted multiple builds of
    > scilab 2024 in trixie, but I haven't managed to reproduce the
    > build hanging that you observed; I have run ten builds which
    > took their time but perhaps I have not gathered enough samples.
    > The kernel in trixie is currently Linux 6.12.33+deb13-amd64. By
    > contrast, I saw that your build of 2024 scilab was running on
    > Linux 6.12.25-cloud-amd64. Do you still observe the problem on
    > your side?

    Yes. The last kernel I tried was 6.12.32-cloud-amd64 and
    it still failed (randomly). I've put my recent build logs here:

    https://people.debian.org/~sanvila/build-logs/202507/

    I'm going to try a few more times with the current kernel to be sure
    and if it fails, I hope somebody can test the VM which I always offer
    to debug this.

    [ Note: I don't think it's an issue with the kernel itself, but maybe
    some particularity of the kernel just makes the problem happen
    more easily. For that reason, I actually expect this to happen with
    any kernel of trixie ].

    Thanks.

  • From Étienne Mollier@21:1/5 to All on Wed Jul 9 18:40:01 2025
    Hi Santiago,

    For what it's worth, I have run the builds on a beefy laptop,
    and despite most of the build process already being
    serialized, I noticed that my memory consumption was pretty
    high. I wonder if there could also be some resource exhaustion
    at play.

    I also notice that the kernel run is the cloud variant. Maybe
    having a look at differences with the plain kernel could reveal
    some clues? (assuming that the problem has not disappeared from
    6.12.32 to 6.12.33…)

    In hope this helps, :)
    --
    .''`. Étienne Mollier <[email protected]>
    : :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
    `. `' sent from my alarm clock
    `-

  • From Santiago Vila@21:1/5 to All on Wed Jul 9 19:00:01 2025
    On Wed, Jul 09, 2025 at 06:28:44PM +0200, Étienne Mollier wrote:
    > For what it's worth, I have run the builds on a beefy laptop,
    > and despite most of the build process already being
    > serialized, I noticed that my memory consumption was pretty
    > high. I wonder if there could also be some resource exhaustion
    > at play.

    That would be a possibility in theory, but my autobuilders do nothing
    other than build packages, always one package at a time, and they
    always have enough memory.

    To achieve that, I monitor Committed_AS in /proc/meminfo and collect
    statistics about all packages. Then, when the central server receives
    a request from one of the autobuilders to "build something", it always
    assigns a package that is buildable on the requesting autobuilder,
    according to that autobuilder's available memory.

    For the particular case of scilab, it needs around 981 MB to build
    on single-CPU systems and 1321 MB to build on systems with 2 CPUs.
    My smallest machines these days have 4 GB of RAM, so I don't think
    memory is the problem here.
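
    The memory-aware assignment described above could be sketched roughly
    as follows. This is only an illustration: the function names, the
    sample package table and the "largest package that still fits" policy
    are assumptions, not the actual code running on the autobuilders.

```python
import re
from typing import Dict, Optional


def committed_as_kib(meminfo_text: str) -> int:
    """Extract the Committed_AS value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^Committed_AS:\s+(\d+)\s+kB\s*$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("Committed_AS not found")
    return int(match.group(1))


def read_committed_as_kib(path: str = "/proc/meminfo") -> int:
    """Read the current Committed_AS from the running system (Linux only)."""
    with open(path, encoding="ascii") as handle:
        return committed_as_kib(handle.read())


def pick_package(peak_mb_by_package: Dict[str, int], available_mb: int) -> Optional[str]:
    """Hypothetical policy: choose the most memory-hungry package that still fits."""
    fitting = {name: peak for name, peak in peak_mb_by_package.items() if peak <= available_mb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)
```

    With the figures quoted above, a table such as {"scilab": 1321} and an
    autobuilder offering 4096 MB would make scilab eligible, while one
    offering only 1000 MB would not.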

    > I also notice that the kernel run is the cloud variant. Maybe
    > having a look at differences with the plain kernel could reveal
    > some clues? (assuming that the problem has not disappeared from
    > 6.12.32 to 6.12.33…)

    That would also be a possibility in theory, but I think it's unlikely.

    Lucas Nussbaum has been using the cloud variant for ages for his
    archive rebuilds, and so far we have never found a package which
    failed for that reason.

    So, it is not impossible, but I would prefer to leave that for the
    case that we have no other choice.

    (If you think about it, that would pose some interesting challenges
    and technical issues: What legitimate reason could a package have to
    FTBFS if you don't use the standard kernel, and how are we supposed to
    express such a dependency in debian/control?)

    In case it matters, the failure rates that I got recently were:

    10% (5 out of 50) on systems with 1 CPU
    78% (39 out of 50) on systems with 2 CPUs.
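
    To put these rates next to the ten clean builds reported earlier in
    the thread, here is a small worked calculation. It assumes that builds
    are independent and that the rates measured on these autobuilders
    would carry over to another machine, neither of which is established.

```python
def failure_rate(failures: int, builds: int) -> float:
    """Observed fraction of failed builds."""
    return failures / builds


def prob_no_failure(rate: float, builds: int) -> float:
    """Probability that `builds` independent builds all succeed at the given failure rate."""
    return (1.0 - rate) ** builds


# Figures from the message above: 5/50 on 1 CPU, 39/50 on 2 CPUs.
for label, failures in (("1 CPU", 5), ("2 CPUs", 39)):
    rate = failure_rate(failures, 50)
    print(f"{label}: rate {rate:.0%}, P(10 clean builds) = {prob_no_failure(rate, 10):.2g}")
```

    Under those assumptions, ten clean builds in a row would be fairly
    ordinary at the single-CPU rate (roughly one chance in three) but
    extremely unlikely at the two-CPU rate (well below one in a million).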

    I suspect a race condition of some kind, so if you are still
    willing to try different things (as opposed to directly trying
    in my VM after I finish my last test build), I would try
    building the package on a self-hosted qemu/kvm machine
    with exactly 2 CPUs. You can probably achieve the same
    effect by using GRUB_CMDLINE_LINUX="nr_cpus=2" (i.e.
    modify /etc/default/grub, run update-grub and reboot).

    (btw: Building scilab in unstable 100 times as we speak, I believe
    that by night I will be able to tell the outcome).
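
    A repeated build with hang detection could be scripted along these
    lines. This is a minimal sketch, not the harness actually used here:
    the helper names, the timeout value and the build command in the
    usage note are placeholders.

```python
import subprocess
from typing import Dict, Sequence


def run_once(command: Sequence[str], timeout_s: float) -> str:
    """Run the command once and classify the outcome as 'ok', 'failed' or 'hang'."""
    try:
        completed = subprocess.run(
            command,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout; grandchildren
        # left behind by a real build may need separate cleanup.
        return "hang"
    return "ok" if completed.returncode == 0 else "failed"


def tally(command: Sequence[str], runs: int, timeout_s: float) -> Dict[str, int]:
    """Repeat the command `runs` times and count each kind of outcome."""
    counts = {"ok": 0, "failed": 0, "hang": 0}
    for _ in range(runs):
        counts[run_once(command, timeout_s)] += 1
    return counts
```

    For example, tally(["dpkg-buildpackage", "-us", "-uc", "-b"], 100, 4 * 3600)
    run from an unpacked source tree would count hangs over 100 builds,
    assuming a four-hour limit is long enough for a healthy build.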

    Thanks.

  • From Étienne Mollier@21:1/5 to All on Fri Jul 11 00:20:01 2025
    Hi Santiago,

    I've been banging my head a fair part of the day wondering why
    the error seemed to never occur when I was staring at it. My
    current conjecture would be to look at what happens when
    stressing the entropy pool of the machine, as the random source
    seems to behave slightly differently depending on real and
    virtual hosts. At the moment, I have had the build hanging
    twice in a row at the same stage of documentation building,
    while trying to stress the entropy pool with:

    $ cat /dev/random >/dev/null

    The build froze in deadlock at the following command in the two
    cases:

    /build/reproducible-path/scilab-2024.1.0+dfsg/scilab/.libs/scilab-bin -nb -noatomsautoload -nouserstartup -quit -f ./modules/helptools/data/configuration/regen_list.sce -nw

    I have three reservations:

    0. if I trust Helmut Grohne, I cannot empty the entropy pool;
       maybe I can just make it more strained, so the behavior may
       be unreproducible to some extent: for instance I have not
       managed to reproduce the behavior straight on my laptop;
    1. the error occurs slightly earlier in the build log than your
       own records, so I may have triggered a different issue with
       my meddlings;
    2. I probably want to try this procedure a couple more times,
       to determine a more precise frequency of occurrence (I could
       have run into "coincidences" so far, a third iteration in a
       row would be "enemy in action").
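
    One way to check whether the stress command changes anything
    measurable is to sample the kernel's entropy estimate while it runs.
    The sketch below assumes the Linux file
    /proc/sys/kernel/random/entropy_avail; the helper names are
    illustrative. On recent kernels the reported value may stay constant,
    which would be consistent with the first reservation.

```python
from pathlib import Path
from typing import Dict, List

ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def parse_entropy_avail(text: str) -> int:
    """Parse the single integer (in bits) that the kernel reports."""
    return int(text.strip())


def read_entropy_avail(path: Path = ENTROPY_AVAIL) -> int:
    """Read the current entropy estimate from the running system (Linux only)."""
    return parse_entropy_avail(path.read_text())


def summarize(samples: List[int]) -> Dict[str, int]:
    """Reduce a series of samples to its minimum, maximum and count."""
    return {"min": min(samples), "max": max(samples), "count": len(samples)}
```

    Calling read_entropy_avail() periodically during a build and passing
    the collected values to summarize() would show whether the estimate
    ever drops.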

    I may continue after a good night of sleep, to at least
    determine whether this is a real clue or just a dead end. If
    the former, then this would point to either an implementation
    issue in scilab making it too fragile on the random source, or a
    problem in the kernel's random source itself, since the behavior
    has occurred more often with recent kernels, or a bit of both.

    Have a nice day, :)
    --
    .''`. Étienne Mollier <[email protected]>
    : :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
    `. `' sent from my alarm clock
    `-

  • From Santiago Vila@21:1/5 to All on Fri Jul 11 02:20:02 2025
    On Fri, Jul 11, 2025 at 12:09:48AM +0200, Étienne Mollier wrote:
    > 0. if I trust Helmut Grohne, I cannot empty the entropy pool;
    > maybe I can just make it more strained, so the behavior may
    > be unreproducible to some extent: for instance I have not
    > managed to reproduce the behavior straight on my laptop;

    I would be quite surprised if the entropy pool was the problem.

    I remember the time when several packages (mostly
    cryptography-related) failed their test suites because of this.

    I found haveged to be the perfect workaround for that, and I always
    installed it in all my autobuilders, but I stopped using it two years ago
    when the kernel had a new implementation of /dev/random making haveged
    not necessary anymore (maybe this is what Helmut might have
    told you about the entropy pool).

    Moreover, as a test, I've built scilab 20 times with haveged
    installed, and the failure rate is more or less the same.

    I think the entropy pool is unlikely to be the problem.

    (But of course I'm glad that you are considering all possible
    scenarios).

    I guess that at some point we should probably tell upstream about this.

    Thanks.

  • From Santiago Vila@21:1/5 to All on Fri Jul 11 02:30:01 2025
    For the record, I've put more recent failed build logs in the same place as before:

    https://people.debian.org/~sanvila/build-logs/202507/

    Based on that, this is what we know:

    - It fails the same when using 6.12.33+deb13-cloud-amd64.
    - It fails the same regardless of using unshare or the old schroot backend.

    Thanks.

  • From Santiago Vila@21:1/5 to Pierre Gruet on Mon Jul 14 02:50:01 2025
    On Sun, Jul 13, 2025 at 08:03:24AM +0200, Pierre Gruet wrote:
    > However, my feeling is that the bug is not RC, since it shows up only in
    > VMs: in practice it does not affect the autobuilders nor the developers
    > on their individual machines. What do you think?

    Those seem weak reasons to me.

    We don't really know if it shows up only in VMs. Just because
    it fails on the VMs I've provided does not mean that it
    fails *because* I'm using VMs.

    But in either case lowering support for VMs looks arbitrary to me.
    What will we tell the end user who wants to build this from source?
    Will we say "Sorry, this is not supported because you are using a VM?"
    What's wrong with using VMs? That would not fit very well with our
    idea that Debian is the universal operating system, valid for
    servers and desktops, and also valid for VMs and real hardware.

    We also don't even know if this will happen in the autobuilders, because
    this started to happen when I switched to using trixie (including
    the kernel of trixie) and there has not been any build of scilab
    in the official autobuilders using the kernel of trixie yet.

    In cases like this one, I suggest that we ask for a trixie-ignore tag
    while we wait for some reply from upstream. As a team member, I can file
    the release.debian.org bug if nobody else volunteers.

    Thanks.

  • From Étienne Mollier@21:1/5 to All on Mon Jul 14 09:10:02 2025
    Hi both,

    Pierre Gruet, on 2025-07-14:
    > On 14/07/2025 at 02:36, Santiago Vila wrote:
    > > In cases like this one, I suggest that we ask for a trixie-ignore tag
    > > while we wait for some reply from upstream. As a team member, I can
    > > file the release.debian.org bug if nobody else volunteers.
    >
    > ... but certainly, this looks like a good way to go. You may file the
    > bug to ask for trixie-ignore, thanks a lot for this!

    That sounds reasonable to me as well.

    Have a nice day, :)
    --
    .''`. Étienne Mollier <[email protected]>
    : :' : pgp: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da
    `. `' sent from my alarm clock
    `-
