• Bug#1108860: linux-image-6.1.0-34-amd64: Wireguard fragmentation fails

    From Charles Bordet@21:1/5 to All on Sun Jul 6 15:20:01 2025
    Package: linux-image-6.1.0-34-amd64
    Severity: important

    Dear Maintainer,

    What led up to the situation?
    We run a production environment using Debian 12 VMs, with a network
    topology involving VXLAN tunnels encapsulated inside Wireguard
    interfaces. This setup has worked reliably for over a year, with MTU set
    to 1500 on all interfaces except the Wireguard interface (set to 1420). Wireguard kernel fragmentation allowed this configuration to function
    without issues, even though the effective path MTU is lower than 1500.
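To make the size mismatch concrete, here is a minimal sketch of the encapsulation arithmetic. It is not part of the original report. It assumes VXLAN runs over IPv4 without IP options, and the function names are hypothetical; the header sizes are the standard protocol constants.

```python
# Header sizes in bytes. These are standard protocol constants, not values
# taken from the report.
INNER_ETHERNET = 14  # Ethernet header of the frame carried inside VXLAN
VXLAN_HEADER = 8
UDP_HEADER = 8
IPV4_HEADER = 20     # assumption: IPv4 without options


def vxlan_outer_packet_size(inner_ip_packet: int) -> int:
    """Size of the IPv4 packet that VXLAN hands to the underlying interface."""
    return inner_ip_packet + INNER_ETHERNET + VXLAN_HEADER + UDP_HEADER + IPV4_HEADER


def ipv4_fragment_sizes(packet_size: int, mtu: int) -> list:
    """Sizes (IP header included) of the fragments needed to fit the MTU."""
    if packet_size <= mtu:
        return [packet_size]
    payload = packet_size - IPV4_HEADER
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(chunk + IPV4_HEADER)
        payload -= chunk
    return sizes


outer = vxlan_outer_packet_size(1500)
print(outer)                             # 1550
print(ipv4_fragment_sizes(outer, 1420))  # [1420, 150]
```

Under these assumptions a full 1500-byte inner packet becomes a 1550-byte VXLAN packet, which cannot cross the 1420-byte Wireguard interface in one piece, so the setup depends on fragmentation (or a working PMTU signal) somewhere along the path.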

    What exactly did you do (or not do) that was effective (or ineffective)?
    We performed a routine system upgrade, updating all packages including the
    kernel. After the upgrade, we observed severe network issues (timeouts,
    very slow HTTP/HTTPS, and apt update failures) on all VMs behind the
    router. SSH and small-packet traffic continued to work.

    To diagnose, we:

    * Restored a backup (with the previous kernel): the problem disappeared.
    * Repeated the upgrade, confirming the issue reappeared.
    * Systematically tested each kernel version from 6.1.124-1 up to
    6.1.140-1. The problem first appears with kernel 6.1.135-1; all earlier versions work as expected.
    * The kernel version from backports (6.12.32-1) did not resolve the
    problem.

    What was the outcome of this action?

    * With kernel 6.1.135-1 or later, network timeouts occur for
    large-packet protocols (HTTP, apt, etc.), while SSH and small-packet
    protocols work.
    * With kernel 6.1.133-1 or earlier, everything works as expected.
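One way to quantify "small packets work, large packets fail" is to search for the largest ICMP payload that still gets through. The sketch below is an illustration only, not the procedure used in the report. It assumes the iputils `ping` (flags `-c`, `-W`, `-M do`, `-s`) and that success is monotonic in packet size; `icmp_probe` and `largest_working_payload` are hypothetical helper names.

```python
import subprocess
from typing import Callable, List, Optional


def ping_command(host: str, payload: int) -> List[str]:
    """One echo request with the Don't Fragment bit set (iputils ping)."""
    return ["ping", "-c", "1", "-W", "1", "-M", "do", "-s", str(payload), host]


def icmp_probe(host: str) -> Callable[[int], bool]:
    """Build a probe that reports whether a payload of the given size gets a reply."""
    def probe(payload: int) -> bool:
        result = subprocess.run(ping_command(host, payload), capture_output=True)
        return result.returncode == 0
    return probe


def largest_working_payload(probe: Callable[[int], bool],
                            low: int, high: int) -> Optional[int]:
    """Binary-search the largest payload in [low, high] for which probe() succeeds.

    Assumes that if a size works, every smaller size works as well.
    Returns None when even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

For example, `largest_working_payload(icmp_probe("192.0.2.10"), 0, 1472)` would probe a placeholder address from inside one of the affected VMs; comparing the result on a good and a bad kernel shows whether the usable packet size changed.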

    What outcome did you expect instead?
    We expected the network to function as before, with Wireguard handling fragmentation transparently and no application-level timeouts,
    regardless of the kernel version.

    -- System Information:
    Debian Release: 12.9
    APT prefers stable-updates
    APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
    Architecture: amd64 (x86_64)

    Kernel: Linux 6.1.0-29-amd64 (SMP w/1 CPU thread; PREEMPT)
    Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
    Shell: /bin/sh linked to /usr/bin/dash
    Init: systemd (via /run/systemd/system)

    Versions of packages linux-image-6.1.0-34-amd64 depends on:
    ii initramfs-tools [linux-initramfs-tool] 0.142+deb12u1
    ii kmod 30+20221128-1
    ii linux-base 4.9

    Versions of packages linux-image-6.1.0-34-amd64 recommends:
    pn apparmor <none>
    pn firmware-linux-free <none>

    Versions of packages linux-image-6.1.0-34-amd64 suggests:
    pn debian-kernel-handbook <none>
    ii grub-efi-amd64 2.06-13+deb12u1
    pn linux-doc-6.1 <none>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Salvatore Bonaccorso@21:1/5 to Charles Bordet on Sun Jul 6 15:40:01 2025
    Control: reassign -1 src:linux 6.1.135-1
    Control: tags -1 + upstream moreinfo
    Control: found -1 6.12.32-1

    Hi Charles,

    On Sun, Jul 06, 2025 at 12:57:41PM +0000, Charles Bordet wrote:
    > Package: linux-image-6.1.0-34-amd64
    > Severity: important
    >
    > Dear Maintainer,
    >
    > What led up to the situation?
    > We run a production environment using Debian 12 VMs, with a network
    > topology involving VXLAN tunnels encapsulated inside Wireguard
    > interfaces. This setup has worked reliably for over a year, with MTU set
    > to 1500 on all interfaces except the Wireguard interface (set to 1420).
    > Wireguard kernel fragmentation allowed this configuration to function
    > without issues, even though the effective path MTU is lower than 1500.
    >
    > What exactly did you do (or not do) that was effective (or ineffective)?
    > We performed a routine system upgrade, updating all packages including the
    > kernel. After the upgrade, we observed severe network issues (timeouts,
    > very slow HTTP/HTTPS, and apt update failures) on all VMs behind the
    > router. SSH and small-packet traffic continued to work.
    >
    > To diagnose, we:
    >
    > * Restored a backup (with the previous kernel): the problem disappeared.
    > * Repeated the upgrade, confirming the issue reappeared.
    > * Systematically tested each kernel version from 6.1.124-1 up to
    > 6.1.140-1. The problem first appears with kernel 6.1.135-1; all earlier
    > versions work as expected.
    > * The kernel version from backports (6.12.32-1) did not resolve the
    > problem.
    >
    > What was the outcome of this action?
    >
    > * With kernel 6.1.135-1 or later, network timeouts occur for
    > large-packet protocols (HTTP, apt, etc.), while SSH and small-packet
    > protocols work.
    > * With kernel 6.1.133-1 or earlier, everything works as expected.
    >
    > What outcome did you expect instead?
    > We expected the network to function as before, with Wireguard handling
    > fragmentation transparently and no application-level timeouts,
    > regardless of the kernel version.

    Thanks for the report and for narrowing down the version in which the
    issue was introduced on the Debian side.

    Since you seem to be able to reproduce the issue reliably, would it be
    possible for you to bisect the changes between upstream 6.1.133 and
    6.1.135, so that we can find the offending commit and report it upstream?
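For readers unfamiliar with bisecting, the search it performs can be sketched as follows. This is a simplified model over a linear, ordered list of candidates, not the actual `git bisect` implementation; `first_bad` and the commit labels in the usage note are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(candidates: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad entry of a list ordered from oldest to newest.

    Assumes candidates[0] is known good and candidates[-1] is known bad,
    and that every entry after the first bad one is bad as well.
    """
    good, bad = 0, len(candidates) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(candidates[mid]):  # e.g. build, boot, and run the reproducer
            bad = mid
        else:
            good = mid
    return candidates[bad]
```

With a history such as `["v6.1.133", "a", "b", "c", "d", "v6.1.135"]`, where `is_bad` boots each build and checks for the timeouts, the function needs only about log2(n) tests, which matters when each test means rebooting a production router.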

    Additionally, would it be possible for you to try the kernels 6.12.35-1
    from unstable and 6.15.4-1~exp1 from experimental directly as well, to
    determine whether the issue is still unresolved there?

    Regards,
    Salvatore

  • From Salvatore Bonaccorso@21:1/5 to [email protected] on Sun Jul 6 21:10:01 2025
    XPost: linux.debian.kernel

    Hi Charles,

    On Sun, Jul 06, 2025 at 07:43:35PM +0200, [email protected] wrote:
    > Hi,
    >
    > Thank you for the quick reply.
    >
    > We tried kernel versions 6.12.35-1 from unstable and 6.15.4-1 from
    > experimental and the issue still appears on both versions.

    Ack, thanks for confirming that, I just updated the bug metadata to
    reflect that.

    > We are currently bisecting the changes to identify the commit. This
    > will take several days as the server is used in production and we
    > need to minimize downtime during working hours. I will get back to
    > this issue as soon as the commit is identified.

    Yes, that is fully understandable. It would be ideal if the issue could
    be reproduced under lab conditions, but otherwise it simply takes the
    time it needs.

    Ping us back once you have identified the breaking commit.

    Thanks for your debugging.

    Regards,
    Salvatore

  • From Salvatore Bonaccorso@21:1/5 to Salvatore Bonaccorso on Thu Jul 10 21:00:02 2025
    XPost: linux.debian.kernel

    control: tags -1 + moreinfo

    Hi Charles,

    On Sun, Jul 06, 2025 at 08:59:46PM +0200, Salvatore Bonaccorso wrote:
    > Hi Charles,
    >
    > On Sun, Jul 06, 2025 at 07:43:35PM +0200, [email protected] wrote:
    > > Hi,
    > >
    > > Thank you for the quick reply.
    > >
    > > We tried kernel versions 6.12.35-1 from unstable and 6.15.4-1 from
    > > experimental and the issue still appears on both versions.
    >
    > Ack, thanks for confirming that, I just updated the bug metadata to
    > reflect that.
    >
    > > We are currently bisecting the changes to identify the commit. This
    > > will take several days as the server is used in production and we
    > > need to minimize downtime during working hours. I will get back to
    > > this issue as soon as the commit is identified.
    >
    > Yes, that is fully understandable. It would be ideal if the issue could
    > be reproduced under lab conditions, but otherwise it simply takes the
    > time it needs.
    >
    > Ping us back once you have identified the breaking commit.
    >
    > Thanks for your debugging.

    I'm not sure how far along you are with the bisect, but yesterday we
    talked about your issue in our weekly kernel-team meeting, and Ben
    pointed out that he had recently seen a PMTU-related change.

    And in fact there is 8930424777e4 ("tunnels: Accept PACKET_HOST in
    skb_tunnel_check_pmtu()."), which is from 6.15-rc1. It was backported
    to various stable series; of interest for your report is that it was
    backported to 6.1.134, which falls exactly in the range where you
    noticed the breakage.

    Thus: are you able to test, first of all, 6.1.y with a revert of the
    given commit on top, and see whether that fixes your issue?

    Regards,
    Salvatore

  • From Salvatore Bonaccorso@21:1/5 to [email protected] on Mon Jul 14 21:40:01 2025
    XPost: linux.debian.kernel

    Control: tags -1 - moreinfo
    Control: tags -1 + confirmed

    Hi Charles,

    On Sun, Jul 13, 2025 at 08:41:39AM +0200, [email protected] wrote:
    > Hi Salvatore,
    >
    > Thank you for your guidance and for pointing out the relevant commit.
    >
    > I have tested, checking out from tag 6.1.134:
    > - With the revert of commit b88786ea2c8f ("tunnels: Accept PACKET_HOST
    >   in skb_tunnel_check_pmtu()"): the issue does not appear and
    >   everything works as expected.
    > - With the commit included (no revert), the issue reappears exactly as
    >   before.
    >
    > This confirms that the regression is directly linked to this commit.
    >
    > Is there anything else I can do or provide to help with the resolution?

    Thanks a lot, that is great news: we have already isolated the
    regressing commit. I will try to assemble a regression report for
    upstream soon (after checking whether it is already known, hopefully
    without missing an existing report) and keep you in the loop.

    Regards,
    Salvatore
