linux-kernel - Re: [PATCH net] net: ensure all external references are released in deferred skbuffs

Message-ID: <CANn89iL_EmkEgPAVdhNW4tyzwQbARyji93mUQ9E2MRczWpNm7g@mail.gmail.com>
Date:   Wed, 22 Jun 2022 12:02:36 +0200
From:   Eric Dumazet <edumazet@...gle.com>
To:     Ilya Maximets <i.maximets@....org>
Cc:     netdev <netdev@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>, dev@...nvswitch.org,
        LKML <linux-kernel@...r.kernel.org>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Florian Westphal <fw@...len.de>
Subject: Re: [PATCH net] net: ensure all external references are released in
 deferred skbuffs

On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@....org> wrote:
>
> The Open vSwitch system test suite is broken due to an inability
> to load/unload netfilter modules.  A kworker thread gets trapped
> in an infinite loop while running a net cleanup inside
> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> holding nfct references and are not being freed by their CPU cores.
>
> In general, the idea that we will have an rx interrupt on every
> CPU core at some point in the near future doesn't seem correct.
> Devices are getting created and destroyed, interrupts are getting
> re-scheduled, CPUs are going online and offline dynamically.
> Any of these events may leave packets stuck in defer list for a
> long time.  It might be OK, if they are just a piece of memory,
> but we can't afford them holding references to any other resources.
>
> In the case of OVS, the nfct reference keeps the kernel thread in
> a busy loop while it holds the 'pernet_ops_rwsem' semaphore.  That
> blocks a later modprobe request from user space:
>
>   # ps
>    299 root  R  99.3  200:25.89 kworker/u96:4+
>
>   # journalctl
>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
>         Not tainted 5.19.0-rc2 #8
>   task:modprobe     state:D
>   Call Trace:
>    <TASK>
>    __schedule+0x8aa/0x21d0
>    schedule+0xcc/0x200
>    rwsem_down_write_slowpath+0x8e4/0x1580
>    down_write+0xfc/0x140
>    register_pernet_subsys+0x15/0x40
>    nf_nat_init+0xb6/0x1000 [nf_nat]
>    do_one_initcall+0xbb/0x410
>    do_init_module+0x1b4/0x640
>    load_module+0x4c1b/0x58d0
>    __do_sys_init_module+0x1d7/0x220
>    do_syscall_64+0x3a/0x80
>    entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> At this point the OVS testsuite is unresponsive and never recovers,
> because these skbuffs are never freed.
>
> The solution is to make sure no external references are attached
> to the skb before pushing it to the defer list, using
> skb_release_head_state() for that purpose.  The function is
> modified to be re-enterable, as it will be called again during the
> defer list flush.
>
> Another approach that can fix the OVS use-case is to kick all
> cores while waiting for references to be released during the net
> cleanup.  But that sounds more like a workaround for the current
> issue than a proper solution, and it will not cover possible
> issues in other parts of the code.
>
> Additionally, the patch checks for skb_zcopy() while deferring.
> This might not be necessary, as I'm not sure if we can actually
> have zero copy packets on this path, but it seems worth having for
> completeness as we should never defer such packets regardless.
>
> CC: Eric Dumazet <edumazet@...gle.com>
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Signed-off-by: Ilya Maximets <i.maximets@....org>
> ---
>  net/core/skbuff.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)

I do not think this patch is doing the right thing.

Packets sitting in TCP receive queues should not hold state that is
not relevant for TCP recvmsg().

This consumes extra memory for no good reason, and defers expensive
atomic operations.

For instance, we release the skb dst before the skb is queued; we
should do the same for conntrack state.

This would increase performance anyway, as we would free the ct state
while CPU caches are hot.
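
To illustrate the idea of dropping external references before a packet
is queued, here is a rough userspace sketch. It is not kernel code: all
toy_* names are hypothetical stand-ins (toy_ct for a refcounted
conntrack entry, toy_skb for an skb, toy_queue for a receive queue). In
the kernel the corresponding helpers would presumably be skb_dst_drop()
and nf_reset_ct(); that mapping is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted conntrack entry. */
struct toy_ct {
	int refcnt;
};

/* Hypothetical stand-in for an skb holding an external reference. */
struct toy_skb {
	struct toy_ct *ct;
	struct toy_skb *next;
};

/* Hypothetical stand-in for a socket receive queue. */
struct toy_queue {
	struct toy_skb *head;
	struct toy_skb *tail;
};

/* Drop the external reference.  Clearing the pointer makes a second
 * call a no-op, so a later free path can call it again safely.
 */
void toy_skb_drop_ct(struct toy_skb *skb)
{
	if (skb->ct) {
		skb->ct->refcnt--;
		skb->ct = NULL;
	}
}

/* Queue the packet only after it no longer pins the conntrack entry. */
void toy_queue_rcv(struct toy_queue *q, struct toy_skb *skb)
{
	toy_skb_drop_ct(skb);
	assert(skb->ct == NULL);

	skb->next = NULL;
	if (q->tail)
		q->tail->next = skb;
	else
		q->head = skb;
	q->tail = skb;
}

/* Returns the refcount left on the entry once the packet is queued. */
int toy_demo(void)
{
	struct toy_ct ct = { .refcnt = 1 };	/* reference owned by the table */
	struct toy_skb skb = { .ct = &ct, .next = NULL };
	struct toy_queue q = { NULL, NULL };

	ct.refcnt++;			/* the packet takes its own reference */
	toy_queue_rcv(&q, &skb);	/* reference is dropped here */
	toy_skb_drop_ct(&skb);		/* later free path: nothing left to drop */

	return ct.refcnt;
}
```

With this ordering, a packet that sits in the queue for a long time is
only memory; it no longer keeps the toy_ct entry alive, so a cleanup
waiting on that refcount would not depend on when the packet is freed.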