[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <cde2dc9e07c49e642f16ab80c6f5b9f605ecbef4.camel@redhat.com>
Date: Thu, 21 Mar 2024 12:55:12 +0100
From: Paolo Abeni <pabeni@...hat.com>
To: Eric Dumazet <edumazet@...gle.com>, Florian Westphal <fw@...len.de>
Cc: netdev@...r.kernel.org, dsahern@...nel.org, kuba@...nel.org,
netfilter-devel@...r.kernel.org, xingwei lee <xrivendell7@...il.com>, yue
sun <samsun1006219@...il.com>,
syzbot+e5167d7144a62715044c@...kaller.appspotmail.com
Subject: Re: [PATCH net] inet: inet_defrag: prevent sk release while still
in use
Hi,
On Wed, 2024-03-20 at 15:14 +0100, Eric Dumazet wrote:
> On Tue, Mar 19, 2024 at 12:36 PM Florian Westphal <fw@...len.de> wrote:
> >
> > ip_local_out() and other functions can pass skb->sk as function argument.
> >
> > If the skb is a fragment and reassembly happens before such function call
> > returns, the sk must not be released.
> >
> > This affects skb fragments reassembled via netfilter or similar
> > modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline.
> >
> > Eric Dumazet made an initial analysis of this bug. Quoting Eric:
> > Calling ip_defrag() in output path is also implying skb_orphan(),
> > which is buggy because output path relies on sk not disappearing.
> >
> > A relevant old patch about the issue was :
> > 8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()")
> >
> > [..]
> >
> > net/ipv4/ip_output.c depends on skb->sk being set, and probably to an
> > inet socket, not an arbitrary one.
> >
> > If we orphan the packet in ipvlan, then downstream things like FQ
> > packet scheduler will not work properly.
> >
> > We need to change ip_defrag() to only use skb_orphan() when really
> > needed, ie whenever frag_list is going to be used.
> >
> > Eric suggested to stash sk in fragment queue and made an initial patch.
> > However there is a problem with this:
> >
> > If skb is refragmented again right after, ip_do_fragment() will copy
> > head->sk to the new fragments, and sets up destructor to sock_wfree.
> > IOW, we have no choice but to fix up sk_wmem accouting to reflect the
> > fully reassembled skb, else wmem will underflow.
> >
> > This change moves the orphan down into the core, to last possible moment.
> > As ip_defrag_offset is aliased with sk_buff->sk member, we must move the
> > offset into the FRAG_CB, else skb->sk gets clobbered.
> >
> > This allows to delay the orphaning long enough to learn if the skb has
> > to be queued or if the skb is completing the reasm queue.
> >
> > In the former case, things work as before, skb is orphaned. This is
> > safe because skb gets queued/stolen and won't continue past reasm engine.
> >
> > In the latter case, we will steal the skb->sk reference, reattach it to
> > the head skb, and fix up wmem accouting when inet_frag inflates truesize.
> >
> > Diagnosed-by: Eric Dumazet <edumazet@...gle.com>
> > Reported-by: xingwei lee <xrivendell7@...il.com>
> > Reported-by: yue sun <samsun1006219@...il.com>
> > Reported-by: syzbot+e5167d7144a62715044c@...kaller.appspotmail.com
Possibly:
Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
it's not very accurate but should be a reasonable oldest affected
version.
> > Signed-off-by: Florian Westphal <fw@...len.de>
> > ---
> > include/linux/skbuff.h | 7 +--
> > net/ipv4/inet_fragment.c | 71 ++++++++++++++++++++-----
> > net/ipv4/ip_fragment.c | 2 +-
> > net/ipv6/netfilter/nf_conntrack_reasm.c | 2 +-
> > 4 files changed, 61 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 7d56ce195120..6d08ff8a9357 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -753,8 +753,6 @@ typedef unsigned char *sk_buff_data_t;
> > * @list: queue head
> > * @ll_node: anchor in an llist (eg socket defer_list)
> > * @sk: Socket we are owned by
> > - * @ip_defrag_offset: (aka @sk) alternate use of @sk, used in
> > - * fragmentation management
> > * @dev: Device we arrived on/are leaving by
> > * @dev_scratch: (aka @dev) alternate use of @dev when @dev would be %NULL
> > * @cb: Control buffer. Free for use by every layer. Put private vars here
> > @@ -875,10 +873,7 @@ struct sk_buff {
> > struct llist_node ll_node;
> > };
> >
> > - union {
> > - struct sock *sk;
> > - int ip_defrag_offset;
> > - };
> > + struct sock *sk;
> >
> > union {
> > ktime_t tstamp;
> > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> > index 7072fc0783ef..7254b640ba06 100644
> > --- a/net/ipv4/inet_fragment.c
> > +++ b/net/ipv4/inet_fragment.c
> > @@ -24,6 +24,8 @@
> > #include <net/ip.h>
> > #include <net/ipv6.h>
> >
> > +#include "../core/sock_destructor.h"
> > +
> > /* Use skb->cb to track consecutive/adjacent fragments coming at
> > * the end of the queue. Nodes in the rb-tree queue will
> > * contain "runs" of one or more adjacent fragments.
> > @@ -39,6 +41,7 @@ struct ipfrag_skb_cb {
> > };
> > struct sk_buff *next_frag;
> > int frag_run_len;
> > + int ip_defrag_offset;
> > };
> >
> > #define FRAG_CB(skb) ((struct ipfrag_skb_cb *)((skb)->cb))
> > @@ -396,12 +399,12 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
> > */
> > if (!last)
> > fragrun_create(q, skb); /* First fragment. */
> > - else if (last->ip_defrag_offset + last->len < end) {
> > + else if (FRAG_CB(last)->ip_defrag_offset + last->len < end) {
> > /* This is the common case: skb goes to the end. */
> > /* Detect and discard overlaps. */
> > - if (offset < last->ip_defrag_offset + last->len)
> > + if (offset < FRAG_CB(last)->ip_defrag_offset + last->len)
> > return IPFRAG_OVERLAP;
> > - if (offset == last->ip_defrag_offset + last->len)
> > + if (offset == FRAG_CB(last)->ip_defrag_offset + last->len)
> > fragrun_append_to_last(q, skb);
> > else
> > fragrun_create(q, skb);
> > @@ -418,13 +421,13 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
> >
> > parent = *rbn;
> > curr = rb_to_skb(parent);
> > - curr_run_end = curr->ip_defrag_offset +
> > + curr_run_end = FRAG_CB(curr)->ip_defrag_offset +
> > FRAG_CB(curr)->frag_run_len;
> > - if (end <= curr->ip_defrag_offset)
> > + if (end <= FRAG_CB(curr)->ip_defrag_offset)
> > rbn = &parent->rb_left;
> > else if (offset >= curr_run_end)
> > rbn = &parent->rb_right;
> > - else if (offset >= curr->ip_defrag_offset &&
> > + else if (offset >= FRAG_CB(curr)->ip_defrag_offset &&
> > end <= curr_run_end)
> > return IPFRAG_DUP;
> > else
> > @@ -438,23 +441,39 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
> > rb_insert_color(&skb->rbnode, &q->rb_fragments);
> > }
> >
> > - skb->ip_defrag_offset = offset;
> > + FRAG_CB(skb)->ip_defrag_offset = offset;
> >
> > return IPFRAG_OK;
> > }
> > EXPORT_SYMBOL(inet_frag_queue_insert);
> >
> > +void tcp_wfree(struct sk_buff *skb);
>
> Thanks a lot Florian for looking at this !
>
> Since you had : #include "../core/sock_destructor.h", perhaps the line
> can be removed,
> because it includes <net/tcp.h>
I think Florian will not able to reply for a few days.
Since the issue looks ancient and we are early in the cycle, I guess
there are no problems with that.
Cheers,
Paolo
Powered by blists - more mailing lists