Date: Tue, 12 Dec 2023 10:30:31 -0300
From: Lucas Pereira <lucasvp@...il.com>
To: netdev@...r.kernel.org
Cc: "webguy@...c.info" <webguy@...c.info>, Alex Brahm <alexbrahm@...il.com>
Subject: Stop traffic on an established ipsec tunnel
*Situation:*
I have a project related to a firewall that includes IPSEC VPN.
Intermittently, every x days (the shortest period of time identified was 11
days, and the longest was 133 days), I experience a VPN issue. The problem
detailed below occurs in a few deployments with a large number of VPN
tunnels.
*Problem:*
The issue is that the VPN traffic gets interrupted.
We establish two LAN-to-LAN tunnels and, after a certain period of time,
communication between the endpoints ceases. The system uses strongSwan for
tunnel establishment. StrongSwan successfully installs the Security
Associations (SAs) in the kernel, and everything works fine for several
days.
However, at some point, the following error occurs:
ping -I 10.165.112.248 10.10.55.1
PING 10.10.55.1 (10.10.55.1) from 10.165.112.248: 56(84) bytes of data.
ping: sendmsg: Device or resource busy
ping: sendmsg: Device or resource busy
ping: sendmsg: Device or resource busy
ping: sendmsg: Device or resource busy
When the VPN issue arises, the counter for the following parameter
increases incessantly:
watch -n1 'cat /proc/net/xfrm_stat'
XfrmOutStateProtoError (see attachment)
When the tunnel is functioning, the counter stops increasing.
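To put a number on how fast the counter grows, it can be sampled twice and the difference printed. This is a minimal sketch, assuming each line of /proc/net/xfrm_stat is a counter name followed by whitespace and a value; xfrm_counter and xfrm_counter_delta are hypothetical helper names, not part of any tool.

```shell
# Print the value of one counter from a file in /proc/net/xfrm_stat format.
# $1: path to the file, $2: counter name.
xfrm_counter() {
    awk -v name="$2" '$1 == name { print $2 }' "$1"
}

# Print how much a counter grew over an interval.
# $1: path, $2: counter name, $3: interval in seconds.
# A counter that is not found is treated as 0 (assumption for simplicity).
xfrm_counter_delta() {
    before=$(xfrm_counter "$1" "$2")
    sleep "$3"
    after=$(xfrm_counter "$1" "$2")
    echo $(( ${after:-0} - ${before:-0} ))
}
```

For example, `xfrm_counter_delta /proc/net/xfrm_stat XfrmOutStateProtoError 10` prints the growth over ten seconds; a healthy tunnel should give 0.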
*IPs:*
VPN Concentrator: 10.165.112.248
Branch: 10.10.55.1
*Original versions:*
Kernel: Linux 5.4.113-1.el7.elrepo.x86_64
*We also tried newer kernel versions plus these patches:*
Kernel: Linux 5.4.249-1.el7.elrepo.x86_64
Kernel: Linux 6.4.11.el7.elrepo.x86_64
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/net/xfrm?h=v6.5.9&id=de0bfd6026c85de3a0a0db2766ab740733d1631e
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/net/xfrm?h=v6.5.9&id=071bba39638f6532040aca3bdabba469186f631c
*StrongSwan:*
strongswan-sqlite-5.9.7-520.x86_64
strongswan-tnc-imcvs-5.9.7-520.x86_64
strongswan-charon-nm-5.9.7-520.x86_64
strongswan-libipsec-5.9.7-520.x86_64
strongswan-5.9.7-520.x86_64
*After update: *
strongswan-sqlite-5.9.11.x86_64
strongswan-tnc-imcvs-5.9.11.x86_64
strongswan-charon-nm-5.9.11.x86_64
strongswan-libipsec-5.9.11.x86_64
strongswan-5.9.11.x86_64
*Configuration changes tried:*
We are testing these configurations:
# Disable the aesni_intel module
lsmod |grep -i aesni_intel
aesni_intel 372736 4
vi /boot/grub2/grub.cfg
Added "module_blacklist=aesni_intel"
cat /boot/grub2/grub.cfg |grep -i aesni_intel
linux16 /vmlinuz-5.4.113-1.el7.elrepo.x86_64
root=UUID=222c8741-7ce5-4f57-95b6-f435fce5b9b9 ro
vconsole.font=latarcyrheb-sun16 vconsole.keymap=us
rd.luks.uuid=luks-83730611-a1fc-492a-af40-bf3555dae23f
rd.luks.key=/etc/._key rd.luks.options=allow-discards biosdevname=0
splash=silent maxcpus=2 possible_cpus=2 mem=5G quiet
qat_c3xxx.blacklist=yes qat_c62x.blacklist=yes rdblacklist=qat_c3xxx
rdblacklist=qat_c62x module_blacklist=qat_c3xxx
module_blacklist=qat_c62x module_blacklist=aesni_intel net.ifnames=0
elevator=noop rd.plymouth=0 plymouth.enable=0 console=tty0
console=ttyS0,115200
depmod
reboot
lsmod |grep -i aesni_intel
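The running kernel's command line can be checked as well, to confirm that the blacklist entry was actually passed at boot. This is a minimal sketch: blacklist_values and is_blacklisted are hypothetical helper names. The kernel documentation describes module_blacklist= as a comma-separated list; whether repeated module_blacklist= entries accumulate is not assumed here, so the helper simply lists every occurrence it finds.

```shell
# Print every module_blacklist= value found in a kernel command line string.
# $1: command line text, e.g. the contents of /proc/cmdline.
blacklist_values() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^module_blacklist=//p'
}

# Succeed if the named module appears in any module_blacklist= value.
# This only inspects the text; it does not prove the kernel honoured it.
# $1: command line text, $2: module name.
is_blacklisted() {
    blacklist_values "$1" | tr ',' '\n' | grep -qxF "$2"
}
```

Usage: `is_blacklisted "$(cat /proc/cmdline)" aesni_intel && echo "on command line"`.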
------
In /etc/strongswan/strongswan.d/charon/kernel-netlink.conf:
# Whether to perform concurrent Netlink XFRM queries on a single socket.
parallel_xfrm = yes
# Whether to always use XFRM_MSG_UPDPOLICY to install policies.
policy_update = yes
2 - /etc/strongswan/strongswan.d/charon.conf (change from 32 to zero)
replay_window = 0
3 - /etc/strongswan/swanctl/swanctl.conf
# IPsec replay window to configure for this CHILD_SA.
# replay_window = 32
replay_window = 0
systemctl restart strongswan
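After the restart, the replay window installed in the kernel can be compared against the configured value. This is a minimal sketch, assuming `ip xfrm state` prints a "replay-window N" pair for each SA; replay_windows is a hypothetical helper name.

```shell
# Read `ip xfrm state` output on stdin and print the value that follows
# each "replay-window" keyword, one line per occurrence.
replay_windows() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "replay-window") print $(i + 1) }'
}
```

Usage: `ip xfrm state | replay_windows | sort | uniq -c` summarizes how many SAs use each window size.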
*Analysis conducted:*
We have identified two workarounds:
*Changing the hash algorithm in the tunnels.*
When we change the hash algorithm, for example from MD5 to SHA1, traffic
resumes, but after x days the problem recurs. (It does not have to be MD5
to SHA1; any other algorithm works.) The switch must be to an algorithm
that is not currently in use, which means that switching back to MD5 later
also works. However, this fix is temporary. If we switch back to the
previous algorithm within a short period (about 1 hour), the problem
persists, but if we switch after several days, the problem is temporarily
resolved.
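Since this workaround depends on which hash algorithm the tunnels use, it may help to record which kernel crypto implementation backs each algorithm while the problem is occurring and again after a reboot. This is a minimal sketch, assuming the usual /proc/crypto layout of "key : value" lines with the driver line before the refcnt line in each record; crypto_entries is a hypothetical helper name.

```shell
# Print "driver refcnt" for every /proc/crypto record with the given name.
# $1: path to a /proc/crypto-format file, $2: algorithm name, e.g. "hmac(md5)".
crypto_entries() {
    awk -v want="$2" '
        $1 == "name"   { name = $3 }
        $1 == "driver" { driver = $3 }
        $1 == "refcnt" && name == want { print driver, $3 }
    ' "$1"
}
```

Usage: `crypto_entries /proc/crypto "hmac(md5)"`.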
*Server reboot.*
The problem is fully resolved only by rebooting the server; merely
restarting the service is not sufficient. Even after removing and reloading
the xfrm kernel modules, the problem persists.
The firewall rules have already been validated and are correct.
Even with the firewall fully open, that is, with no rules in place, the
problem persists.
The firewall policy and the xfrm state are okay.
*Logs and evidence are attached.*
Best regards,
Lucas
View attachment "XFRM_Concentrator.txt" of type "text/plain" (1033 bytes)
View attachment "TCPDump_Concentrator.txt" of type "text/plain" (4707 bytes)
View attachment "HW_info.txt" of type "text/plain" (18708 bytes)
View attachment "lsmod.txt" of type "text/plain" (6887 bytes)
View attachment "proc-crypto.txt" of type "text/plain" (70649 bytes)