linux-kernel - Re: [PATCH net] net: ensure all external references are released in deferred skbuffs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <68103640-acb0-b2f8-fcde-91ab3ed5beab@ovn.org>
Date:   Fri, 22 Sep 2023 23:09:02 +0200
From:   Ilya Maximets <i.maximets@....org>
To:     netdev@...r.kernel.org
Cc:     i.maximets@....org, "David S. Miller" <davem@...emloft.net>,
        dev@...nvswitch.org, linux-kernel@...r.kernel.org,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Florian Westphal <fw@...len.de>,
        Frode Nordahl <frode.nordahl@...onical.com>,
        Madhu Koriginja <madhu.koriginja@....com>
Subject: Re: [PATCH net] net: ensure all external references are released in
 deferred skbuffs

On 9/22/23 21:07, Ilya Maximets wrote:
> On 9/22/23 15:26, Ilya Maximets wrote:
>> On 6/19/22 02:39, Ilya Maximets wrote:
>>> Open vSwitch system test suite is broken due to inability to
>>> load/unload netfilter modules.  kworker thread is getting trapped
>>> in the infinite loop while running a net cleanup inside the
>>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
>>> holding nfct references and not being freed by their CPU cores.
>>>
>>> In general, the idea that we will have an rx interrupt on every
>>> CPU core at some point in a near future doesn't seem correct.
>>> Devices are getting created and destroyed, interrupts are getting
>>> re-scheduled, CPUs are going online and offline dynamically.
>>> Any of these events may leave packets stuck in defer list for a
>>> long time.  It might be OK, if they are just a piece of memory,
>>> but we can't afford them holding references to any other resources.
>>>
>>> In case of OVS, nfct reference keeps the kernel thread in busy loop
>>> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
>>> later modprobe request from user space:
>>>
>>>   # ps
>>>    299 root  R  99.3  200:25.89 kworker/u96:4+
>>>
>>>   # journalctl
>>>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>>>         Not tainted 5.19.0-rc2 #8
>>>   task:modprobe     state:D
>>>   Call Trace:
>>>    <TASK>
>>>    __schedule+0x8aa/0x21d0
>>>    schedule+0xcc/0x200
>>>    rwsem_down_write_slowpath+0x8e4/0x1580
>>>    down_write+0xfc/0x140
>>>    register_pernet_subsys+0x15/0x40
>>>    nf_nat_init+0xb6/0x1000 [nf_nat]
>>>    do_one_initcall+0xbb/0x410
>>>    do_init_module+0x1b4/0x640
>>>    load_module+0x4c1b/0x58d0
>>>    __do_sys_init_module+0x1d7/0x220
>>>    do_syscall_64+0x3a/0x80
>>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>
>>> At this point OVS testsuite is unresponsive and never recover,
>>> because these skbuffs are never freed.
>>>
>>> Solution is to make sure no external references attached to skb
>>> before pushing it to the defer list.  Using skb_release_head_state()
>>> for that purpose.  The function modified to be re-enterable, as it
>>> will be called again during the defer list flush.
>>>
>>> Another approach that can fix the OVS use-case, is to kick all
>>> cores while waiting for references to be released during the net
>>> cleanup.  But that sounds more like a workaround for a current
>>> issue rather than a proper solution and will not cover possible
>>> issues in other parts of the code.
>>>
>>> Additionally checking for skb_zcopy() while deferring.  This might
>>> not be necessary, as I'm not sure if we can actually have zero copy
>>> packets on this path, but seems worth having for completeness as we
>>> should never defer such packets regardless.
>>>
>>> CC: Eric Dumazet <edumazet@...gle.com>
>>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
>>> Signed-off-by: Ilya Maximets <i.maximets@....org>
>>> ---
>>
>> Happy Friday! :)
>>
>> Resurrecting this thread because I managed to reproduce the issue again
>> on a latest 6.6.0-rc1.
>>
>> (It doesn't mean we need to accept this particular patch, I just think
>> that it's an appropriate discussion thread.)
>>
>> It's a bit different testsuite this time.  Last year I had an issue with
>> OVS testsuite, today I have an issue with OVN testsuite.  Their structure
>> is very similar, but OVN tests are a fair bit more complex.
>>
>> The story is the same:  Each test loads a pack of kernel modules including
>> OVS and nf_conntrack, sends some traffic, verifies OVN functionality,
>> stops OVN/OVS and unloads all the previously loaded modules to clean up
>> the space for the next tests.
>>
>> Kernel hangs in the same way as before waiting for nf_conntrack module
>> to be unloaded:
>>
>>   13 root R  100.0  933:17.98 kworker/u80:1+netns
>>
>>   CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7
>>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014
>>   Workqueue: netns cleanup_net
>>   RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>>   Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8
>>         ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e
>>         2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e
>>         2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00
>>   RSP: 0018:ffffb23040073d98 EFLAGS: 00000202
>>   RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38
>>   RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920
>>   RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38
>>   R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080
>>   R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0
>>   FS:  0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000
>>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0
>>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>   Call Trace:
>>    <NMI>
>>    ? nmi_cpu_backtrace+0x82/0xf0
>>    ? nmi_cpu_backtrace_handler+0xd/0x20
>>    ? nmi_handle+0x5e/0x150
>>    ? default_do_nmi+0x40/0x100
>>    ? exc_nmi+0x112/0x190
>>    ? end_repeat_nmi+0x16/0x67
>>    ? __pfx_kill_all+0x10/0x10 [nf_conntrack]
>>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>>    </NMI>
>>    <TASK>
>>    nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack]
>>    cleanup_net+0x213/0x3b0
>>    process_one_work+0x177/0x340
>>    worker_thread+0x27e/0x390
>>    ? __pfx_worker_thread+0x10/0x10
>>    kthread+0xe2/0x110
>>    ? __pfx_kthread+0x10/0x10
>>    ret_from_fork+0x30/0x50
>>    ? __pfx_kthread+0x10/0x10
>>    ret_from_fork_asm+0x1b/0x30
>>    </TASK>
>>
>> Since this testsuite is different from what I had issues with before,
>> I don't really know when the issue first appeared.  It may have been
>> there even last year.
>>
>> I can reproduce the issue with ubuntu 6.2.0-32-generic kernel, but
>> I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9.
>> It doesn't necessarily mean that RHEL9 doesn't have the issue though,
>> the testsuite is not 100% reliable.
>>
>> I'll try to dig deeper and bisect the problem on the upstream kernel.
>>
>> For now it seems like the issue is likely in IPv6 code, because the
>> tests that trigger it are mostly IPv6-related.
> 
> So, I bisected this down to an unsurprising commit:
> 
> commit b0e214d212030fe497d4d150bb3474e50ad5d093
> Author: Madhu Koriginja <madhu.koriginja@....com>
> Date:   Tue Mar 21 21:28:44 2023 +0530
> 
>     netfilter: keep conntrack reference until IPsecv6 policy checks are done
>     
>     Keep the conntrack reference until policy checks have been performed for
>     IPsec V6 NAT support, just like ipv4.
>     
>     The reference needs to be dropped before a packet is
>     queued to avoid having the conntrack module unloadable.
>     
>     Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support")
>     Signed-off-by: Madhu Koriginja <madhu.koriginja@....com>
>     Signed-off-by: Florian Westphal <fw@...len.de>
> 
> This commit repeated the pattern from the old ipv4 commit b59c270104f0
> ("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done")
> and hence introduced the exact same issue, but for IPv6.  They both failed
> to deliver on the "avoid having the conntrack module unloadable" part.
> 
> IIUC, the fix should be exactly the same as Eric did for ipv4 in commit
> 6f0012e35160 ("tcp: add a missing nf_reset_ct() in 3WHS handling").
> 
> I can try that and send a fix.

Posted a fix here:
  https://lore.kernel.org/netdev/20230922210530.2045146-1-i.maximets@ovn.org/

It fixes the problem I have with OVN testsuite.

> 
> Would still really like to have some preventive mechanism for this kind of
> issues though.  Any ideas on that?
> 
> Best regards, Ilya Maximets.