netdev - Re: [PATCH net v2] skbuff: disable coalescing for page

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YkRP7XwvdgFbvGsk@myrica>
Date:   Wed, 30 Mar 2022 13:41:17 +0100
From:   Jean-Philippe Brucker <jean-philippe@...aro.org>
To:     Alexander H Duyck <alexander.duyck@...il.com>
Cc:     ilias.apalodimas@...aro.org, hawk@...nel.org,
        alexanderduyck@...com, linyunsheng@...wei.com, davem@...emloft.net,
        kuba@...nel.org, pabeni@...hat.com, netdev@...r.kernel.org
Subject: Re: [PATCH net v2] skbuff: disable coalescing for page_pool fragment
 recycling

On Mon, Mar 28, 2022 at 08:03:46AM -0700, Alexander H Duyck wrote:
> >  (3b) Now while handling TCP, coalesce SKB3 with SKB1:
> > 
> >       tcp_v4_rcv(SKB3)
> >         tcp_try_coalesce(to=SKB1, from=SKB3)    // succeeds
> >         kfree_skb_partial(SKB3)
> >           skb_release_data(SKB3)                // drops one dataref
> > 
> >                       SKB1 _____ PAGE1
> >                            \____
> >                       SKB2 _____ PAGE2
> >                                  /
> >                 RX_BD3 _________/
> > 
> >     In tcp_try_coalesce(), __skb_frag_ref() takes a page reference to
> >     PAGE2, where it should instead have increased the page_pool frag
> >     reference, pp_frag_count. Without coalescing, when releasing both
> >     SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now
> >     when releasing SKB1 and SKB2, two references to PAGE2 will be
> >     dropped, resulting in underflow.
> > 
> >  (3c) Drop SKB2:
> > 
> >       af_packet_rcv(SKB2)
> >         consume_skb(SKB2)
> >           skb_release_data(SKB2)                // drops second dataref
> >             page_pool_return_skb_page(PAGE2)    // drops one pp_frag_count
> > 
> >                       SKB1 _____ PAGE1
> >                            \____
> >                                  PAGE2
> >                                  /
> >                 RX_BD3 _________/
> > 
> > (4) Userspace calls recvmsg()
> >     Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we
> >     release the SKB3 page as well:
> > 
> >     tcp_eat_recv_skb(SKB1)
> >       skb_release_data(SKB1)
> >         page_pool_return_skb_page(PAGE1)
> >         page_pool_return_skb_page(PAGE2)        // drops second pp_frag_count
> > 
> > (5) PAGE2 is freed, but the third RX descriptor was still using it!
> >     In our case this causes IOMMU faults, but it would silently corrupt
> >     memory if the IOMMU was disabled.
> > 
> > Prevent coalescing the SKB if it may hold shared page_pool fragment
> > references.
> > 
> > Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool")
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@...aro.org>
> > ---
> >  net/core/skbuff.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 10bde7c6db44..56b45b9f0b4d 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5276,6 +5276,13 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
> >  	if (skb_cloned(to))
> >  		return false;
> >  
> > +	/* We don't support taking page_pool frag references at the moment.
> > +	 * If the SKB is cloned and could have page_pool frag references, don't
> > +	 * coalesce it.
> > +	 */
> > +	if (skb_cloned(from) && from->pp_recycle)
> > +		return false;
> > +
> >  	/* The page pool signature of struct page will eventually figure out
> >  	 * which pages can be recycled or not but for now let's prohibit slab
> >  	 * allocated and page_pool allocated SKBs from being coalesced.
> 
> 
> This is close but not quite. Actually now that I think about it we can
> probably alter the block below rather than adding a new one.
> 
> The issue is we want only reference counted pages in standard skbs, and
> pp_frag_count pages in pp_recycle skbs. So we already had logic along
> the lines of:
> 	if (to->pp_recycle != from->pp_recycle)
> 		return false;
> 
> I would say we need to change that because from->pp_recycle is the
> piece that is probably too simplistic. Basically we will get a page
> pool page if from->pp_recycle && !skb_cloned(from). So we can probably
> just tweak the check below to be something along the lines of:
> 	if (to->pp_recycle != (from->pp_recycle && !skb_cloned(from)))
> 		return false;

Just to confirm this is fine: the behavior now changes for
to->pp_recycle == 0, from->pp_recycle == 1 and skb_cloned(from) == 1
In this case we now coalesce and take a page ref. So the page has two refs
and two pp_frag_count. (3c) drops one pp_frag_count. If there wasn't
another RX desc holding a pp_frag_count (ie. no step (5)), that would also
drop a page ref, but since 'to' SKB is holding a second page ref the page
is not recycled. That reference gets dropped at (4) and the page is freed
there.  With step (5), the page would get recycled into page_pool, but
without (5) the page is discarded.

I guess it works, just want to make sure that it's OK to mix page_pool
pp_frag_count and normal reference counting at the same time.

Thanks,
Jean

> 
> That should actually increase the number of cases where we can
> correctly coalesce frames while also addressing the issue you
> discovered as it will only merge cloned skbs into standard skbs instead
> of page pool ones.
>