Date: Tue, 12 Dec 2023 10:30:31 -0300
From: Lucas Pereira <lucasvp@...il.com>
To: netdev@...r.kernel.org
Cc: "webguy@...c.info" <webguy@...c.info>, Alex Brahm <alexbrahm@...il.com>
Subject: Stop traffic on an established ipsec tunnel

*Situation:*


I am working on a firewall project that includes IPsec VPN.
Intermittently, every x days (the shortest interval observed was 11 days,
and the longest was 133 days), the VPN fails. The problem described below
occurs in a few deployments with a large number of VPN tunnels.

*Problem:*

The issue is that the VPN traffic gets interrupted.



We establish two LAN-to-LAN tunnels and, after a certain period of time,
communication between the endpoints ceases. The system uses strongSwan for
tunnel establishment. strongSwan successfully installs the Security
Associations (SAs) in the kernel, and everything works fine for several
days.



However, at some point, the following error occurs:



ping -I 10.165.112.248 10.10.55.1
PING 10.10.55.1 (10.10.55.1) from 10.165.112.248: 56(84) bytes of data.
ping: sendmsg: Device or resource busy
ping: sendmsg: Device or resource busy
ping: sendmsg: Device or resource busy
ping: sendmsg: Device or resource busy

When the VPN issue arises, the following counter increases continuously:

watch -n1 'cat /proc/net/xfrm_stat'
XfrmOutStateProtoError (see attachment)

While the tunnel is working, the counter stays constant.
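
To put numbers on how fast the counter climbs, /proc/net/xfrm_stat can be sampled at a fixed interval and only the counters that grew printed. The following is a minimal sketch under stated assumptions, not something that was run for this report: the function names (parse_xfrm_stat, counter_deltas, watch) and the one-second interval are illustrative, and the parser assumes each line is a counter name followed by an integer.

```python
import time

XFRM_STAT = "/proc/net/xfrm_stat"


def parse_xfrm_stat(text):
    """Parse 'Name <whitespace> Value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(before, after):
    """Return {name: increase} for every counter that grew between samples."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value > before.get(name, 0)
    }


def read_counters(path=XFRM_STAT):
    """Read and parse the current counters from the proc file."""
    with open(path) as f:
        return parse_xfrm_stat(f.read())


def watch(interval=1.0):
    """Print a timestamped line for each counter that increased (Ctrl-C to stop)."""
    previous = read_counters()
    while True:
        time.sleep(interval)
        current = read_counters()
        for name, delta in sorted(counter_deltas(previous, current).items()):
            print(f"{time.strftime('%H:%M:%S')} {name} +{delta}")
        previous = current

# To run on the affected host, call: watch()
```

Logging the output over time would show whether XfrmOutStateProtoError starts rising at the moment traffic stops, and whether any other counter rises with it.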



*IPs:*

VPN Concentrator: 10.165.112.248

Branch: 10.10.55.1



*Original versions:*

Kernel: Linux 5.4.113-1.el7.elrepo.x86_64



*We also tried newer kernel versions plus these patches:*

Kernel: Linux 5.4.249-1.el7.elrepo.x86_64

Kernel: Linux 6.4.11.el7.elrepo.x86_64

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/net/xfrm?h=v6.5.9&id=de0bfd6026c85de3a0a0db2766ab740733d1631e

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/net/xfrm?h=v6.5.9&id=071bba39638f6532040aca3bdabba469186f631c



*StrongSwan:*

strongswan-sqlite-5.9.7-520.x86_64

strongswan-tnc-imcvs-5.9.7-520.x86_64

strongswan-charon-nm-5.9.7-520.x86_64

strongswan-libipsec-5.9.7-520.x86_64

strongswan-5.9.7-520.x86_64



*After update:*

strongswan-sqlite-5.9.11.x86_64

strongswan-tnc-imcvs-5.9.11.x86_64

strongswan-charon-nm-5.9.11.x86_64

strongswan-libipsec-5.9.11.x86_64

strongswan-5.9.11.x86_64



*Configuration changes tried:*


We are testing these configurations:

# Disable the aesni_intel module
lsmod |grep -i aesni_intel
aesni_intel           372736  4

vi /boot/grub2/grub.cfg
Added "module_blacklist=aesni_intel"

cat /boot/grub2/grub.cfg |grep -i aesni_intel
        linux16 /vmlinuz-5.4.113-1.el7.elrepo.x86_64
root=UUID=222c8741-7ce5-4f57-95b6-f435fce5b9b9 ro
vconsole.font=latarcyrheb-sun16 vconsole.keymap=us
rd.luks.uuid=luks-83730611-a1fc-492a-af40-bf3555dae23f
rd.luks.key=/etc/._key rd.luks.options=allow-discards biosdevname=0
splash=silent maxcpus=2 possible_cpus=2 mem=5G quiet
qat_c3xxx.blacklist=yes qat_c62x.blacklist=yes rdblacklist=qat_c3xxx
rdblacklist=qat_c62x module_blacklist=qat_c3xxx
module_blacklist=qat_c62x module_blacklist=aesni_intel net.ifnames=0
elevator=noop rd.plymouth=0 plymouth.enable=0 console=tty0
console=ttyS0,115200

depmod
reboot

lsmod |grep -i aesni_intel
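
The manual check above (grep the boot command line, reboot, then confirm lsmod shows nothing) can be scripted so that every module named on the command line is compared against what is actually loaded. This is a minimal sketch with hypothetical helper names (requested_blacklist, loaded_modules, still_loaded). It only lists every name that appears in a module_blacklist= token; whether the kernel honors repeated module_blacklist= entries or expects a single comma-separated list is worth confirming in the kernel-parameters documentation.

```python
def requested_blacklist(cmdline):
    """Collect module names from every module_blacklist= token on the command line."""
    names = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "module_blacklist" and sep:
            # Values may be comma-separated; normalize '-' to '_' as in /proc/modules.
            names.update(n.replace("-", "_") for n in value.split(",") if n)
    return names


def loaded_modules(proc_modules):
    """Return the set of module names from /proc/modules-style text."""
    return {line.split()[0] for line in proc_modules.splitlines() if line.strip()}


def still_loaded(cmdline, proc_modules):
    """Return the requested-blacklist modules that are nevertheless loaded."""
    return sorted(requested_blacklist(cmdline) & loaded_modules(proc_modules))

# On a live system (Linux only), something like:
#   with open("/proc/cmdline") as c, open("/proc/modules") as m:
#       print(still_loaded(c.read(), m.read()))
```

An empty result after reboot means none of the listed modules were loaded, which matches the empty lsmod output above for aesni_intel.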

------
1 - /etc/strongswan/strongswan.d/charon/kernel-netlink.conf

# Whether to perform concurrent Netlink XFRM queries on a single socket.
parallel_xfrm = yes

# Whether to always use XFRM_MSG_UPDPOLICY to install policies.
policy_update = yes

2 - /etc/strongswan/strongswan.d/charon.conf (change from 32 to zero)
replay_window = 0

3 - /etc/strongswan/swanctl/swanctl.conf

# IPsec replay window to configure for this CHILD_SA.
# replay_window = 32
replay_window = 0


systemctl restart strongswan



*Analyses Conducted:*



We have identified two workarounds:



*Changing the hash algorithm on the tunnels.*

When we change the integrity (hash) algorithm, for example from MD5 to
SHA1, traffic resumes, but after x days the problem recurs. The change need
not be from MD5 to SHA1; any algorithm that is not currently in use works,
so switching back to MD5 later also works. However, this fix is temporary.
If we switch back to the previous algorithm within a short period (about 1
hour), the problem persists; if we switch after several days, the problem
is temporarily resolved.



*Server reboot.*

Restarting the server resolves the problem; merely restarting the service
is not sufficient. Even after removing and reloading the xfrm kernel
modules, the problem persists.



The firewall rules have already been validated and are correct.

The problem persists even with the firewall fully open, that is, with no
rules in place.

The firewall policy and the xfrm state are okay.



*Logs and evidence are attached.*



Best regards,
Lucas

View attachment "XFRM_Concentrator.txt" of type "text/plain" (1033 bytes)

View attachment "TCPDump_Concentrator.txt" of type "text/plain" (4707 bytes)

View attachment "HW_info.txt" of type "text/plain" (18708 bytes)

View attachment "lsmod.txt" of type "text/plain" (6887 bytes)

View attachment "proc-crypto.txt" of type "text/plain" (70649 bytes)