linux-kernel - [BUG] Outgoing ESP traffic stops after several weeks (XFRM state correct, tunnels established, no ESP output)

Message-ID: <CAHVQh5-EGhmpHj=2DK_O4R8ar7_P+62UnYABuFMWFRGmo8bQTw@mail.gmail.com>
Date: Tue, 18 Nov 2025 13:39:48 +0330
From: Hamid Reza Hasani <hr.hasani@...il.com>
To: netdev@...r.kernel.org
Cc: Steffen Klassert <steffen.klassert@...unet.com>, Herbert Xu <herbert@...dor.apana.org.au>, 
	linux-kernel@...r.kernel.org
Subject: [BUG] Outgoing ESP traffic stops after several weeks (XFRM state
 correct, tunnels established, no ESP output)

[1.] One-line summary of the problem:

Outgoing ESP packets stop being transmitted after several weeks of uptime,
even though IPsec tunnels remain established, incoming ESP packets are
decrypted correctly, and XFRM states/policies remain valid.

[2.] Full description of the problem/report:

We maintain around 200 IPsec tunnels across approximately 100 remote sites
using StrongSwan (IKEv2). All remote nodes connect to a central site that
contains three HA clusters (each consisting of two HP servers configured
with Corosync + Pacemaker).
The servers have more than 100 CPU cores and 128 GB+ RAM.

Every 3–4 weeks, one of the cluster nodes stops sending ESP packets.
Incoming encrypted ESP packets continue to arrive and are successfully
decrypted. IKEv2 re-establishes the tunnels correctly, XFRM policies and
states remain intact, routing tables are correct, and nothing unusual
appears in dmesg.
However, **all outbound ESP traffic drops to zero**.

Firewall counters confirm:
- ESP input: normal
- ESP output: zero during the failure state
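
To narrow down where outbound packets are being dropped, the kernel's XFRM error counters could be sampled alongside the firewall counters. The following is a minimal sketch, assuming the kernel is built with CONFIG_XFRM_STATISTICS so that /proc/net/xfrm_stat exists; the function names are hypothetical and not part of any existing tool.

```python
from pathlib import Path

# Only present when the kernel is built with CONFIG_XFRM_STATISTICS (assumption).
XFRM_STAT = Path("/proc/net/xfrm_stat")


def parse_xfrm_stat(text):
    """Parse 'CounterName value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def changed_out_counters(before, after):
    """Return the XfrmOut* counters that increased between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in sorted(after)
        if name.startswith("XfrmOut") and after[name] > before.get(name, 0)
    }


def read_snapshot(path=XFRM_STAT):
    """Read the current counters, or return an empty dict if unavailable."""
    return parse_xfrm_stat(path.read_text()) if path.exists() else {}
```

Taking one snapshot with read_snapshot() during normal operation and another during the failure state, then passing both to changed_out_counters(), would show whether any output-side XFRM error counter is incrementing while ESP output is zero.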

Restarting the affected HA node triggers failover and temporarily resolves
the issue.

### Additional observation (IMPORTANT):
We capture traffic every 15 minutes on all interfaces. In the two most
recent incidents, immediately before the ESP output failure occurred,
tcpdump mis-reported the input/output interface.
Instead of the correct interface (ETH3), tcpdump reported usb0 (the iLO
interface); after I disabled usb0, it reported the in/out interface as unknown.
Interface counters confirm that usb0 carries almost no traffic, so the
tcpdump interface attribution appears incorrect.
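
The interface-counter cross-check described above can be sketched as follows. This is an illustrative example only, assuming the standard /proc/net/dev layout (two header lines, then 16 numeric fields per interface); the interface names in the usage note are placeholders for whatever names the system actually uses.

```python
def parse_proc_net_dev(text):
    """Return {iface: (rx_packets, tx_packets)} from /proc/net/dev content."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        # Field 1 is RX packets; field 9 is TX packets (fields 0 and 8 are bytes).
        stats[name.strip()] = (int(fields[1]), int(fields[9]))
    return stats


def packet_deltas(before, after):
    """Per-interface (rx, tx) packet deltas between two snapshots."""
    return {
        iface: (after[iface][0] - rx0, after[iface][1] - tx0)
        for iface, (rx0, tx0) in before.items()
        if iface in after
    }


def read_proc_net_dev(path="/proc/net/dev"):
    """Read and parse the live counters."""
    with open(path) as handle:
        return parse_proc_net_dev(handle.read())
```

Comparing two snapshots taken around a capture window shows which interface actually moved packets: a near-zero delta on usb0 together with a large RX delta on the uplink would support the view that the tcpdump attribution, not the traffic path, is wrong.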

This raises the possibility of:
- an XFRM output path regression,
- an skb device pointer corruption,
- a routing decision inconsistency,
- or a driver-layer issue affecting interface reporting and ESP output.

Upgrading from kernel **6.8.0-52 → 6.8.0-85** did not resolve the issue.

We would appreciate guidance on additional instrumentation or whether this
matches any known recent regressions.

[3.] Keywords:

IPsec, XFRM, ESP, StrongSwan, routing, skb, tcpdump, network stack, HA
cluster

[4.] Kernel information

[4.1.] Kernel version (/proc/version):
Linux version 6.8.0-85-generic (buildd@...02-amd64-024)
(x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0, GNU ld
(GNU Binutils for Ubuntu) 2.38) #85~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri
Sep 19 16:18:59 UTC 2

[4.2.] Kernel .config:
Attached (config-6.8.0-85-generic).

[5.] Most recent kernel version which did not have the bug:

Unknown.
The issue is present in both:
- 6.8.0-52
- 6.8.0-85

[6.] Output of Oops messages:

None. No crashes or warnings in dmesg.

[7.] Example program/script to reproduce:

No minimal reproducer.
Issue appears after several weeks while handling ~200 active IPsec tunnels.

Periodic tcpdump + XFRM/SA dumps available upon request.

[8.] Environment

[8.1.] Software (ver_linux output):
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.5 LTS
Release: 22.04
Codename: jammy

[8.2.] Processor information (/proc/cpuinfo):
Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz (104 cores)

[8.3.] Module information (/proc/modules):
...

[8.4.] Loaded driver and hardware information:
....

[8.5.] PCI information (lspci -vvv):
...

[8.6.] SCSI information (/proc/scsi/scsi):
....

[8.7.] Additional relevant information:

- ~200 IKEv2 tunnels via StrongSwan
- XFRM policies/states valid during failure
- Incoming ESP continues to decrypt
- Outgoing ESP stops completely
- tcpdump reports wrong interface (usb0 instead of ETH3) shortly before
failure
- NIC is HP server onboard interface
- HA failover restores functionality temporarily

[X.] Other notes, patches, workarounds:

Restarting the affected node forces HA failover and restores traffic
temporarily.
Kernel upgrade did not solve the issue.
StrongSwan logs show no IKE or CHILD_SA issues.

Download attachment "config-6.8.0-85-generic" of type "application/octet-stream" (287202 bytes)