linux-kernel - Re: [PATCH net] net: ensure all external references are released in deferred skbuffs

Message-ID: <CANn89i+JdDukwEhZ=41FxY-w63eER6JVixkwL+s2eSOjo6aWEQ@mail.gmail.com>
Date:   Wed, 22 Jun 2022 12:15:31 +0200
From:   Eric Dumazet <edumazet@...gle.com>
To:     Ilya Maximets <i.maximets@....org>
Cc:     netdev <netdev@...r.kernel.org>,
        "David S. Miller" <davem@...emloft.net>, dev@...nvswitch.org,
        LKML <linux-kernel@...r.kernel.org>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Florian Westphal <fw@...len.de>
Subject: Re: [PATCH net] net: ensure all external references are released in
 deferred skbuffs

On Wed, Jun 22, 2022 at 12:02 PM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets <i.maximets@....org> wrote:
> >
> > Open vSwitch system test suite is broken due to inability to
> > load/unload netfilter modules.  kworker thread is getting trapped
> > in the infinite loop while running a net cleanup inside the
> > nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> > holding nfct references and not being freed by their CPU cores.
> >
> > In general, the assumption that we will have an rx interrupt on every
> > CPU core at some point in the near future doesn't seem correct.
> > Devices are getting created and destroyed, interrupts are getting
> > re-scheduled, CPUs are going online and offline dynamically.
> > Any of these events may leave packets stuck in defer list for a
> > long time.  It might be OK, if they are just a piece of memory,
> > but we can't afford them holding references to any other resources.
> >
> > In the case of OVS, the nfct reference keeps the kernel thread in a
> > busy loop while holding the 'pernet_ops_rwsem' semaphore.  That blocks
> > the later modprobe request from user space:
> >
> >   # ps
> >    299 root  R  99.3  200:25.89 kworker/u96:4+
> >
> >   # journalctl
> >   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> >         Not tainted 5.19.0-rc2 #8
> >   task:modprobe     state:D
> >   Call Trace:
> >    <TASK>
> >    __schedule+0x8aa/0x21d0
> >    schedule+0xcc/0x200
> >    rwsem_down_write_slowpath+0x8e4/0x1580
> >    down_write+0xfc/0x140
> >    register_pernet_subsys+0x15/0x40
> >    nf_nat_init+0xb6/0x1000 [nf_nat]
> >    do_one_initcall+0xbb/0x410
> >    do_init_module+0x1b4/0x640
> >    load_module+0x4c1b/0x58d0
> >    __do_sys_init_module+0x1d7/0x220
> >    do_syscall_64+0x3a/0x80
> >    entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >
> > At this point the OVS test suite is unresponsive and never recovers,
> > because these skbuffs are never freed.
> >
> > The solution is to make sure no external references are attached to
> > the skb before pushing it to the defer list, using
> > skb_release_head_state() for that purpose.  The function is modified
> > to be safe to call more than once, as it will be called again during
> > the defer list flush.
> >
> > Another approach that can fix the OVS use-case, is to kick all
> > cores while waiting for references to be released during the net
> > cleanup.  But that sounds more like a workaround for a current
> > issue rather than a proper solution and will not cover possible
> > issues in other parts of the code.
> >
> > Additionally, check for skb_zcopy() while deferring.  This might
> > not be necessary, as I'm not sure if we can actually have zero copy
> > packets on this path, but it seems worth having for completeness as
> > we should never defer such packets regardless.
> >
> > CC: Eric Dumazet <edumazet@...gle.com>
> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> > Signed-off-by: Ilya Maximets <i.maximets@....org>
> > ---
> >  net/core/skbuff.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
>
> I do not think this patch is doing the right thing.
>
> Packets sitting in TCP receive queues should not hold state that is
> not relevant for TCP recvmsg().
>
> This consumes extra memory for no good reason, and defers expensive
> atomic operations.
>
> For instance, we release the skb dst before the skb is queued; we
> should do the same for conntrack state.
>
> This would increase performance anyway, as we free ct state while cpu
> caches are hot.

I am thinking of the following instead.

A new helper can be added (and later be used in net/packet/af_packet.c
and probably elsewhere).

diff --git a/include/net/dst.h b/include/net/dst.h
index 6aa252c3fc55ccaee58faebf265510469e91d780..7c3316d9d6e73daea17223a5261f6a5c4f68eae3 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -276,6 +276,15 @@ static inline void skb_dst_drop(struct sk_buff *skb)
        }
 }

+/* Before queueing skb in a receive queue, get rid of
+ * potentially expensive components.
+ */
+static inline void skb_cleanup(struct sk_buff *skb)
+{
+       skb_dst_drop(skb);
+       nf_reset_ct(skb);
+}
+
 static inline void __skb_dst_copy(struct sk_buff *nskb, unsigned long refdst)
 {
        nskb->slow_gro |= !!refdst;
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index fdbcf2a6d08ef4a5164247b5a5b4b222289b191a..913c98e446d56ee067b54b2c704ac1195ef1a81e 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -177,7 +177,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
        if (!skb)
                return;

-       skb_dst_drop(skb);
+       skb_cleanup(skb);
        /* segs_in has been initialized to 1 in tcp_create_openreq_child().
         * Hence, reset segs_in to 0 before calling tcp_segs_in()
         * to avoid double counting.  Also, tcp_segs_in() expects
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2e2a9ece9af27372e6b653d685a89a2c71ba05d1..987981a16ee34e0601e7e722abef1bb098c307c5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5005,7 +5005,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
                __kfree_skb(skb);
                return;
        }
-       skb_dst_drop(skb);
+       skb_cleanup(skb);
        __skb_pull(skb, tcp_hdr(skb)->doff * 4);

        reason = SKB_DROP_REASON_NOT_SPECIFIED;
@@ -5931,7 +5931,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
                        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);

                        /* Bulk data transfer: receiver */
-                       skb_dst_drop(skb);
+                       skb_cleanup(skb);
                        __skb_pull(skb, tcp_header_len);
                        eaten = tcp_queue_rcv(sk, skb, &fragstolen);

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fe8f23b95d32ca4a35d05166d471327bc608fa91..d9acd906f28267ff07450d78d079e4e8eab74957 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1765,7 +1765,7 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb,
         */
        skb_condense(skb);

-       skb_dst_drop(skb);
+       skb_cleanup(skb);

        if (unlikely(tcp_checksum_complete(skb))) {
                bh_unlock_sock(sk);