linux-kernel - Re: [PATCH net] net: ensure all external references are released in deferred skbuffs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fb65f794-2ead-6518-863e-bf1643ce1b28@ovn.org>
Date:   Fri, 22 Sep 2023 21:07:27 +0200
From:   Ilya Maximets <i.maximets@....org>
To:     netdev@...r.kernel.org
Cc:     i.maximets@....org, "David S. Miller" <davem@...emloft.net>,
        dev@...nvswitch.org, linux-kernel@...r.kernel.org,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Florian Westphal <fw@...len.de>,
        Frode Nordahl <frode.nordahl@...onical.com>,
        Madhu Koriginja <madhu.koriginja@....com>
Subject: Re: [PATCH net] net: ensure all external references are released in
 deferred skbuffs

On 9/22/23 15:26, Ilya Maximets wrote:
> On 6/19/22 02:39, Ilya Maximets wrote:
>> Open vSwitch system test suite is broken due to inability to
>> load/unload netfilter modules.  kworker thread is getting trapped
>> in the infinite loop while running a net cleanup inside the
>> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
>> holding nfct references and not being freed by their CPU cores.
>>
>> In general, the idea that we will have an rx interrupt on every
>> CPU core at some point in a near future doesn't seem correct.
>> Devices are getting created and destroyed, interrupts are getting
>> re-scheduled, CPUs are going online and offline dynamically.
>> Any of these events may leave packets stuck in defer list for a
>> long time.  It might be OK, if they are just a piece of memory,
>> but we can't afford them holding references to any other resources.
>>
>> In case of OVS, nfct reference keeps the kernel thread in busy loop
>> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
>> later modprobe request from user space:
>>
>>   # ps
>>    299 root  R  99.3  200:25.89 kworker/u96:4+
>>
>>   # journalctl
>>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>>         Not tainted 5.19.0-rc2 #8
>>   task:modprobe     state:D
>>   Call Trace:
>>    <TASK>
>>    __schedule+0x8aa/0x21d0
>>    schedule+0xcc/0x200
>>    rwsem_down_write_slowpath+0x8e4/0x1580
>>    down_write+0xfc/0x140
>>    register_pernet_subsys+0x15/0x40
>>    nf_nat_init+0xb6/0x1000 [nf_nat]
>>    do_one_initcall+0xbb/0x410
>>    do_init_module+0x1b4/0x640
>>    load_module+0x4c1b/0x58d0
>>    __do_sys_init_module+0x1d7/0x220
>>    do_syscall_64+0x3a/0x80
>>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>
>> At this point OVS testsuite is unresponsive and never recover,
>> because these skbuffs are never freed.
>>
>> Solution is to make sure no external references attached to skb
>> before pushing it to the defer list.  Using skb_release_head_state()
>> for that purpose.  The function modified to be re-enterable, as it
>> will be called again during the defer list flush.
>>
>> Another approach that can fix the OVS use-case, is to kick all
>> cores while waiting for references to be released during the net
>> cleanup.  But that sounds more like a workaround for a current
>> issue rather than a proper solution and will not cover possible
>> issues in other parts of the code.
>>
>> Additionally checking for skb_zcopy() while deferring.  This might
>> not be necessary, as I'm not sure if we can actually have zero copy
>> packets on this path, but seems worth having for completeness as we
>> should never defer such packets regardless.
>>
>> CC: Eric Dumazet <edumazet@...gle.com>
>> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
>> Signed-off-by: Ilya Maximets <i.maximets@....org>
>> ---
> 
> Happy Friday! :)
> 
> Resurrecting this thread because I managed to reproduce the issue again
> on a latest 6.6.0-rc1.
> 
> (It doesn't mean we need to accept this particular patch, I just think
> that it's an appropriate discussion thread.)
> 
> It's a bit different testsuite this time.  Last year I had an issue with
> OVS testsuite, today I have an issue with OVN testsuite.  Their structure
> is very similar, but OVN tests are a fair bit more complex.
> 
> The story is the same:  Each test loads a pack of kernel modules including
> OVS and nf_conntrack, sends some traffic, verifies OVN functionality,
> stops OVN/OVS and unloads all the previously loaded modules to clean up
> the space for the next tests.
> 
> Kernel hangs in the same way as before waiting for nf_conntrack module
> to be unloaded:
> 
>   13 root R  100.0  933:17.98 kworker/u80:1+netns
> 
>   CPU: 12 PID: 13 Comm: kworker/u80:1 Not tainted 6.6.0-rc1+ #7
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc36 04/01/2014
>   Workqueue: netns cleanup_net
>   RIP: 0010:nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>   Code: 56 41 55 41 54 49 89 fc 55 31 ed 53 e8 a4 db a8
>         ed 48 c7 c7 20 e9 e2 c0 e8 48 f3 a8 ed 39 2d 7e
>         2d 01 00 77 14 e9 05 01 00 00 <83> c5 01 3b 2d 6e
>         2d 01 00 0f 83 f6 00 00 00 48 8b 15 75 2d 01 00
>   RSP: 0018:ffffb23040073d98 EFLAGS: 00000202
>   RAX: 000000000001ce57 RBX: ffff98dbfac73958 RCX: ffff98e31f732b38
>   RDX: ffff98dbfac00000 RSI: ffffb23040073dd0 RDI: ffffffffc0e2e920
>   RBP: 000000000000e72b R08: ffff98e31f732b38 R09: ffff98e31f732b38
>   R10: 000000000001d5ce R11: 0000000000000000 R12: ffffffffc0e1b080
>   R13: ffffb23040073e28 R14: ffff98dbc47b0600 R15: ffffb23040073dd0
>   FS:  0000000000000000(0000) GS:ffff98e31f700000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00007f98bd7bca80 CR3: 000000076f420004 CR4: 0000000000370ee0
>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   Call Trace:
>    <NMI>
>    ? nmi_cpu_backtrace+0x82/0xf0
>    ? nmi_cpu_backtrace_handler+0xd/0x20
>    ? nmi_handle+0x5e/0x150
>    ? default_do_nmi+0x40/0x100
>    ? exc_nmi+0x112/0x190
>    ? end_repeat_nmi+0x16/0x67
>    ? __pfx_kill_all+0x10/0x10 [nf_conntrack]
>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>    ? nf_ct_iterate_cleanup+0x35/0x1e0 [nf_conntrack]
>    </NMI>
>    <TASK>
>    nf_conntrack_cleanup_net_list+0xac/0x140 [nf_conntrack]
>    cleanup_net+0x213/0x3b0
>    process_one_work+0x177/0x340
>    worker_thread+0x27e/0x390
>    ? __pfx_worker_thread+0x10/0x10
>    kthread+0xe2/0x110
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork+0x30/0x50
>    ? __pfx_kthread+0x10/0x10
>    ret_from_fork_asm+0x1b/0x30
>    </TASK>
> 
> Since this testsuite is different from what I had issues with before,
> I don't really know when the issue first appeared.  It may have been
> there even last year.
> 
> I can reproduce the issue with ubuntu 6.2.0-32-generic kernel, but
> I don't see it with the latest 5.14.0-284.25.1.el9_2.x86_64 on RHEL9.
> It doesn't necessarily mean that RHEL9 doesn't have the issue though,
> the testsuite is not 100% reliable.
> 
> I'll try to dig deeper and bisect the problem on the upstream kernel.
> 
> For now it seems like the issue is likely in IPv6 code, because the
> tests that trigger it are mostly IPv6-related.

So, I bisected this down to an unsurprising commit:

commit b0e214d212030fe497d4d150bb3474e50ad5d093
Author: Madhu Koriginja <madhu.koriginja@....com>
Date:   Tue Mar 21 21:28:44 2023 +0530

    netfilter: keep conntrack reference until IPsecv6 policy checks are done
    
    Keep the conntrack reference until policy checks have been performed for
    IPsec V6 NAT support, just like ipv4.
    
    The reference needs to be dropped before a packet is
    queued to avoid having the conntrack module unloadable.
    
    Fixes: 58a317f1061c ("netfilter: ipv6: add IPv6 NAT support")
    Signed-off-by: Madhu Koriginja <madhu.koriginja@....com>
    Signed-off-by: Florian Westphal <fw@...len.de>

This commit repeated the pattern from the old ipv4 commit b59c270104f0
("[NETFILTER]: Keep conntrack reference until IPsec policy checks are done")
and hence introduced the exact same issue, but for IPv6.  They both failed
to deliver on the "avoid having the conntrack module unloadable" part.

IIUC, the fix should be exactly the same as Eric did for ipv4 in commit
6f0012e35160 ("tcp: add a missing nf_reset_ct() in 3WHS handling").

I can try that and send a fix.

Would still really like to have some preventive mechanism for this kind of
issues though.  Any ideas on that?

Best regards, Ilya Maximets.