netdev - kernel panics with Big TCP and Tx ZC with hugepages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <5a2eae5c-9ad7-aca5-4927-665db946cfb2@gmail.com>
Date:   Wed, 26 Apr 2023 13:07:06 -0600
From:   David Ahern <dsahern@...il.com>
To:     Eric Dumazet <eric.dumazet@...il.com>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Xin Long <lucien.xin@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Nikolay Aleksandrov <razor@...ckwall.org>
Subject: kernel panics with Big TCP and Tx ZC with hugepages

This has been on the back burner for too long now and with v6.3 released
we should get it resolved before reports start rolling in. I am throwing
this data dump out to the mailing list hoping someone else can provide
more insights.

Big TCP (both IPv6 and IPv4 versions are affected) can cause a variety
of panics when combined with the Tx ZC API and buffers backed by
hugepages. I have seen this with mlx5, a driver under development and
veth, so it seems to be a problem with the core stack.

A quick reproducer:

#!/bin/bash
#
# make sure ip is from top of tree iproute2

ip netns add peer
ip li add eth0 type veth peer eth1
ip li set eth0 mtu 3400 up
ip addr add dev eth0 172.16.253.1/24
ip addr add dev eth0 2001:db8:1::1/64

ip li set eth1 netns peer mtu 3400 up
ip -netns peer addr add dev eth1 172.16.253.2/24
ip -netns peer addr add dev eth1 2001:db8:1::2/64

ip netns exec peer iperf3 -s -D

ip li set dev eth0 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size
$((510*1024)) gso_max_size $((510*1024)) gro_max_size  $((510*1024))

ip -netns peer li set dev eth1 gso_ipv4_max_size $((510*1024))
gro_ipv4_max_size  $((510*1024)) gso_max_size $((510*1024)) gro_max_size
 $((510*1024))

sysctl -w vm.nr_hugepages=2

cat <<EOF
Run either:

    iperf3 -c 172.16.253.2 --zc_api
    iperf3 -c 2001:db8:1::2 --zc_api

where iperf3 is from https://github.com/dsahern/iperf mods-3.10
EOF

iperf3 in my tree has support for buffers using hugepages when using the
Tx ZC API (--zc_api arg above).

I have seen various backtraces based on platform and configuration, but
skb_release_data is typically in the path. This is a common one for the
veth reproducer above (saw it with both v4 and v6):

[   32.167294] general protection fault, probably for non-canonical
address 0xdd8672069ea377b2: 0000 [#1] PREEMPT SMP
[   32.167569] CPU: 5 PID: 635 Comm: iperf3 Not tainted 6.3.0+ #4
[   32.167742] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.13.0-1ubuntu1.1 04/01/2014
[   32.168039] RIP: 0010:skb_release_data+0xf4/0x180
[   32.168208] Code: 7e 57 48 89 d8 48 c1 e0 04 4d 8b 64 05 30 41 f6 c4
01 75 e1 41 80 7e 76 00 4d 89 e7 79 0c 4c 89 e7 e8 90 f
[   32.168869] RSP: 0018:ffffc900001a4eb0 EFLAGS: 00010202
[   32.169025] RAX: 00000000000001c0 RBX: 000000000000001c RCX:
0000000000000000
[   32.169265] RDX: 0000000000000102 RSI: 000000000000068f RDI:
00000000ffffffff
[   32.169475] RBP: ffffc900001a4ee0 R08: 0000000000000000 R09:
ffff88807fd77ec0
[   32.169708] R10: ffffea0000173430 R11: 0000000000000000 R12:
dd8672069ea377aa
[   32.169915] R13: ffff8880069cf100 R14: ffff888011910ae0 R15:
dd8672069ea377aa
[   32.170126] FS:  0000000001720880(0000) GS:ffff88807fd40000(0000)
knlGS:0000000000000000
[   32.170398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   32.170586] CR2: 00007f0f04400000 CR3: 0000000004caa000 CR4:
0000000000750ee0
[   32.170796] PKRU: 55555554
[   32.170888] Call Trace:
[   32.170975]  <IRQ>
[   32.171039]  skb_release_all+0x2e/0x40
[   32.171152]  napi_consume_skb+0x62/0xf0
[   32.171281]  net_rx_action+0xf6/0x250
[   32.171394]  __do_softirq+0xdf/0x2c0
[   32.171506]  do_softirq+0x81/0xa0
[   32.171608]  </IRQ>


Xin came up with this patch a couple of months ago that resolves the
panic but it has a big impact on performance:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0fbd5c85155f..6c2c8d09fd89 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1717,7 +1717,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
gfp_mask)
 {
        int num_frags = skb_shinfo(skb)->nr_frags;
        struct page *page, *head = NULL;
-       int i, new_frags;
+       int i, new_frags, pagelen;
        u32 d_off;

        if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
@@ -1733,7 +1733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
gfp_mask)
                return 0;
        }

-       new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+       pagelen = __skb_pagelen(skb);
+       if (pagelen > GSO_LEGACY_MAX_SIZE) {
+               /* without hugepages, skb frags can only hold 65536 data. */
+               if (!__pskb_pull_tail(skb, pagelen - GSO_LEGACY_MAX_SIZE))
+                       return -ENOMEM;
+               pagelen = GSO_LEGACY_MAX_SIZE;
+               num_frags = skb_shinfo(skb)->nr_frags;
+       }
+       new_frags = (pagelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
        for (i = 0; i < new_frags; i++) {
                page = alloc_page(gfp_mask);
                if (!page) {