netdev - Re: [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAKhg4tKo66gPNvoQuk8hEBEt+5J=DfJAgYs_8AncFmrDh1rZAQ@mail.gmail.com>
Date:   Mon, 10 Apr 2023 15:26:15 +0800
From:   Liang Chen <liangchen.linux@...il.com>
To:     Ilias Apalodimas <ilias.apalodimas@...aro.org>
Cc:     Alexander H Duyck <alexander.duyck@...il.com>, kuba@...nel.org,
        hawk@...nel.org, davem@...emloft.net, edumazet@...gle.com,
        pabeni@...hat.com, netdev@...r.kernel.org
Subject: Re: [PATCH v2] skbuff: Fix a race between coalescing and releasing SKBs

On Fri, Apr 7, 2023 at 4:03 PM Ilias Apalodimas
<ilias.apalodimas@...aro.org> wrote:
>
> Hi all,
>
> On Thu, 6 Apr 2023 at 18:25, Alexander H Duyck
> <alexander.duyck@...il.com> wrote:
> >
> > On Thu, 2023-04-06 at 19:48 +0800, Liang Chen wrote:
> > > Commit 1effe8ca4e34 ("skbuff: fix coalescing for page_pool fragment
> > > recycling") allowed coalescing to proceed with non page pool page and
> > > page pool page when @from is cloned, i.e.
> > >
> > > to->pp_recycle    --> false
> > > from->pp_recycle  --> true
> > > skb_cloned(from)  --> true
> > >
> > > However, it actually requires skb_cloned(@from) to hold true until
> > > coalescing finishes in this situation. If the other cloned SKB is
> > > released while the merging is in process, from_shinfo->nr_frags will be
> > > set to 0 towards the end of the function, causing the increment of frag
> > > page _refcount to be unexpectedly skipped resulting in inconsistent
> > > reference counts. Later when SKB(@to) is released, it frees the page
> > > directly even though the page pool page is still in use, leading to
> > > use-after-free or double-free errors.
> > >
> > > So it needs to be specially handled at where the ref count may get lost.
> > >
> > > The double-free error message below prompted us to investigate:
> > > BUG: Bad page state in process swapper/1  pfn:0e0d1
> > > page:00000000c6548b28 refcount:-1 mapcount:0 mapping:0000000000000000
> > > index:0x2 pfn:0xe0d1
> > > flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> > > raw: 000fffffc0000000 0000000000000000 ffffffff00000101 0000000000000000
> > > raw: 0000000000000002 0000000000000000 ffffffffffffffff 0000000000000000
> > > page dumped because: nonzero _refcount
> > >
> > > CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E      6.2.0+
> > > Call Trace:
> > >  <IRQ>
> > > dump_stack_lvl+0x32/0x50
> > > bad_page+0x69/0xf0
> > > free_pcp_prepare+0x260/0x2f0
> > > free_unref_page+0x20/0x1c0
> > > skb_release_data+0x10b/0x1a0
> > > napi_consume_skb+0x56/0x150
> > > net_rx_action+0xf0/0x350
> > > ? __napi_schedule+0x79/0x90
> > > __do_softirq+0xc8/0x2b1
> > > __irq_exit_rcu+0xb9/0xf0
> > > common_interrupt+0x82/0xa0
> > > </IRQ>
> > > <TASK>
> > > asm_common_interrupt+0x22/0x40
> > > RIP: 0010:default_idle+0xb/0x20
> > >
> > > Signed-off-by: Liang Chen <liangchen.linux@...il.com>
> > > ---
> > > Changes from v1:
> > > - deal with the ref count problem instead of return back to give more opportunities to coalesce skbs.
> > > ---
> > >  net/core/skbuff.c | 22 ++++++++++++++++++++--
> > >  1 file changed, 20 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 050a875d09c5..77da8ce74a1e 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -5643,7 +5643,19 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > >
> > >               skb_fill_page_desc(to, to_shinfo->nr_frags,
> > >                                  page, offset, skb_headlen(from));
> > > -             *fragstolen = true;
> > > +
> > > +             /* When @from is pp_recycle and @to isn't, coalescing is
> > > +              * allowed to proceed if @from is cloned. However if the
> > > +              * execution reaches this point, @from is already transitioned
> > > +              * into non-cloned because the other cloned skb is released
> > > +              * somewhere else concurrently. In this case, we need to make
> > > +              * sure the ref count is incremented, not directly stealing
> > > +              * from page pool.
> > > +              */
> > > +             if (to->pp_recycle != from->pp_recycle)
> > > +                     get_page(page);
> > > +             else
> > > +                     *fragstolen = true;
> > >       } else {
> > >               if (to_shinfo->nr_frags +
> > >                   from_shinfo->nr_frags > MAX_SKB_FRAGS)
> > > @@ -5659,7 +5671,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> > >              from_shinfo->nr_frags * sizeof(skb_frag_t));
> > >       to_shinfo->nr_frags += from_shinfo->nr_frags;
> > >
> > > -     if (!skb_cloned(from))
> > > +     /* Same situation as above where head data presents. When @from is
> > > +      * pp_recycle and @to isn't, coalescing is allowed to proceed if @from
> > > +      * is cloned. However @from can be transitioned into non-cloned
> > > +      * concurrently by this point. If it does happen, we need to make sure
> > > +      * the ref count is properly incremented.
> > > +      */
> > > +     if (to->pp_recycle == from->pp_recycle && !skb_cloned(from))
> > >               from_shinfo->nr_frags = 0;
> > >
> > >       /* if the skb is not cloned this does nothing
> >
> > So looking this over I believe this should resolve the issue you
> > pointed out while maintaining current functionality.
> >
> > Reviewed-by: Alexander Duyck <alexanderduyck@...com>
>
> Yes, this would work, but I am not sure we really want this here.
> Is coalescing page-pool-owned and slab-owned SKBs frequent to justify
> the additional overhead? If the answer is yes, the patch seems fine
>
> Thanks
> /Ilias

We did some analysis on the frequency of the case where the following
condition were met when entering the function,

to->pp_recycle    --> false
from->pp_recycle  --> true
skb_cloned(from)  --> true

with our networking setup (mtu 1500), there was no such occurrence. It
was only when we limited the packet size down to 256 bytes we started
to observe very few occurrences (less than 1%). And it was not
significant enough to show any difference in our performance test.

So it seems the performance gain does not justify the additional
overhead. Can you please consider the v1 patch instead?

Thanks,
Liang
> >
> > One follow-on that we may want to do with this would be to look at
> > consolidating the 3 spots where we are checking for our combination of
> > pp_recycle comparison and skb_cloned and maybe pass one boolean flag
> > indicating that we have to transfer everything by taking page
> > references.
> >
> > Also I think we can actually increase the number of cases where we
> > support coalescing if we were to take apart the skb_head_is_locked call
> > and move the skb_cloned check from it into your recycle check in the
> > portion where we are stealing from the header.