netdev - Re: kernel panics with Big TCP and Tx ZC with hugepages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <71114ab1-fc2d-a3b3-cc15-68e848b6df46@gmail.com>
Date:   Thu, 27 Apr 2023 17:32:06 +0200
From:   Eric Dumazet <erdnetdev@...il.com>
To:     David Ahern <dsahern@...il.com>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        Xin Long <lucien.xin@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Nikolay Aleksandrov <razor@...ckwall.org>, edumazet@...gle.com
Subject: Re: kernel panics with Big TCP and Tx ZC with hugepages


On 4/27/23 00:24, David Ahern wrote:
> On 4/26/23 3:52 PM, Willem de Bruijn wrote:
>> David Ahern wrote:
>>> This has been on the back burner for too long now and with v6.3 released
>>> we should get it resolved before reports start rolling in. I am throwing
>>> this data dump out to the mailing list hoping someone else can provide
>>> more insights.
>>>
>>> Big TCP (both IPv6 and IPv4 versions are affected) can cause a variety
>>> of panics when combined with the Tx ZC API and buffers backed by
>>> hugepages. I have seen this with mlx5, a driver under development and
>>> veth, so it seems to be a problem with the core stack.
>>>
>>> A quick reproducer:
>>>
>>> #!/bin/bash
>>> #
>>> # make sure ip is from top of tree iproute2
>>>
>>> ip netns add peer
>>> ip li add eth0 type veth peer eth1
>>> ip li set eth0 mtu 3400 up
>>> ip addr add dev eth0 172.16.253.1/24
>>> ip addr add dev eth0 2001:db8:1::1/64
>>>
>>> ip li set eth1 netns peer mtu 3400 up
>>> ip -netns peer addr add dev eth1 172.16.253.2/24
>>> ip -netns peer addr add dev eth1 2001:db8:1::2/64
>>>
>>> ip netns exec peer iperf3 -s -D
>>>
>>> ip li set dev eth0 gso_ipv4_max_size $((510*1024)) gro_ipv4_max_size
>>> $((510*1024)) gso_max_size $((510*1024)) gro_max_size  $((510*1024))
>>>
>>> ip -netns peer li set dev eth1 gso_ipv4_max_size $((510*1024))
>>> gro_ipv4_max_size  $((510*1024)) gso_max_size $((510*1024)) gro_max_size
>>>   $((510*1024))
>>>
>>> sysctl -w vm.nr_hugepages=2
>>>
>>> cat <<EOF
>>> Run either:
>>>
>>>      iperf3 -c 172.16.253.2 --zc_api
>>>      iperf3 -c 2001:db8:1::2 --zc_api
>>>
>>> where iperf3 is from https://github.com/dsahern/iperf mods-3.10
>>> EOF
>>>
>>> iperf3 in my tree has support for buffers using hugepages when using the
>>> Tx ZC API (--zc_api arg above).
>>>
>>> I have seen various backtraces based on platform and configuration, but
>>> skb_release_data is typically in the path. This is a common one for the
>>> veth reproducer above (saw it with both v4 and v6):
>>>
>>> [   32.167294] general protection fault, probably for non-canonical
>>> address 0xdd8672069ea377b2: 0000 [#1] PREEMPT SMP
>>> [   32.167569] CPU: 5 PID: 635 Comm: iperf3 Not tainted 6.3.0+ #4
>>> [   32.167742] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
>>> 1.13.0-1ubuntu1.1 04/01/2014
>>> [   32.168039] RIP: 0010:skb_release_data+0xf4/0x180
>>> [   32.168208] Code: 7e 57 48 89 d8 48 c1 e0 04 4d 8b 64 05 30 41 f6 c4
>>> 01 75 e1 41 80 7e 76 00 4d 89 e7 79 0c 4c 89 e7 e8 90 f
>>> [   32.168869] RSP: 0018:ffffc900001a4eb0 EFLAGS: 00010202
>>> [   32.169025] RAX: 00000000000001c0 RBX: 000000000000001c RCX:
>>> 0000000000000000
>>> [   32.169265] RDX: 0000000000000102 RSI: 000000000000068f RDI:
>>> 00000000ffffffff
>>> [   32.169475] RBP: ffffc900001a4ee0 R08: 0000000000000000 R09:
>>> ffff88807fd77ec0
>>> [   32.169708] R10: ffffea0000173430 R11: 0000000000000000 R12:
>>> dd8672069ea377aa
>>> [   32.169915] R13: ffff8880069cf100 R14: ffff888011910ae0 R15:
>>> dd8672069ea377aa
>>> [   32.170126] FS:  0000000001720880(0000) GS:ffff88807fd40000(0000)
>>> knlGS:0000000000000000
>>> [   32.170398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   32.170586] CR2: 00007f0f04400000 CR3: 0000000004caa000 CR4:
>>> 0000000000750ee0
>>> [   32.170796] PKRU: 55555554
>>> [   32.170888] Call Trace:
>>> [   32.170975]  <IRQ>
>>> [   32.171039]  skb_release_all+0x2e/0x40
>>> [   32.171152]  napi_consume_skb+0x62/0xf0
>>> [   32.171281]  net_rx_action+0xf6/0x250
>>> [   32.171394]  __do_softirq+0xdf/0x2c0
>>> [   32.171506]  do_softirq+0x81/0xa0
>>> [   32.171608]  </IRQ>
>>>
>>>
>>> Xin came up with this patch a couple of months ago that resolves the
>>> panic but it has a big impact on performance:
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index 0fbd5c85155f..6c2c8d09fd89 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -1717,7 +1717,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
>>> gfp_mask)
>>>   {
>>>          int num_frags = skb_shinfo(skb)->nr_frags;
>>>          struct page *page, *head = NULL;
>>> -       int i, new_frags;
>>> +       int i, new_frags, pagelen;
>>>          u32 d_off;
>>>
>>>          if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
>>> @@ -1733,7 +1733,16 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t
>>> gfp_mask)
>>>                  return 0;
>>>          }
>>>
>>> -       new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
>>> +       pagelen = __skb_pagelen(skb);
>>> +       if (pagelen > GSO_LEGACY_MAX_SIZE) {
>>> +               /* without hugepages, skb frags can only hold 65536 data. */
>> This is with CONFIG_MAX_SKB_FRAGS 17 I suppose.
> correct, I did not enable that config so it defaults to 17.
>
>> So is the issue just that new_frags ends up indexing out of bounds
>> in frags[MAX_SKB_FRAGS]?
> yes, I see nr_frags at 32 which is clearly way out of bounds and it
> seems to be the skb_copy_ubufs function causing it.


Solution is to sense which page order is necessary in order to fit the 
skb length into MAX_SKB_FRAGS pages. alloc_page() -> alloc_pages(order); 
alloc_skb_with_frags() has a similar strategy (but different goals)

I can prepare a patch.

>> GSO_LEGACY_MAX_SIZE happens to match the value, but is not not the
>> right constant, as that is a max on the packet length, regardless
>> of whether in linear or frags.
>>
>>> +               if (!__pskb_pull_tail(skb, pagelen - GSO_LEGACY_MAX_SIZE))
>>> +                       return -ENOMEM;
>>> +               pagelen = GSO_LEGACY_MAX_SIZE;
>>> +               num_frags = skb_shinfo(skb)->nr_frags;
>>> +       }
>>> +       new_frags = (pagelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
>>> +
>>>          for (i = 0; i < new_frags; i++) {
>>>                  page = alloc_page(gfp_mask);
>>>                  if (!page) {
>>>
>>