[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89i+S0EYPM4N-3RsN5-QDQts5wobJjBikF7=feMo6hHY3Lw@mail.gmail.com>
Date: Wed, 20 Mar 2024 15:14:40 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: Florian Westphal <fw@...len.de>
Cc: netdev@...r.kernel.org, dsahern@...nel.org, kuba@...nel.org,
pabeni@...hat.com, netfilter-devel@...r.kernel.org,
xingwei lee <xrivendell7@...il.com>, yue sun <samsun1006219@...il.com>,
syzbot+e5167d7144a62715044c@...kaller.appspotmail.com
Subject: Re: [PATCH net] inet: inet_defrag: prevent sk release while still in use
On Tue, Mar 19, 2024 at 12:36 PM Florian Westphal <fw@...len.de> wrote:
>
> ip_local_out() and other functions can pass skb->sk as function argument.
>
> If the skb is a fragment and reassembly happens before such function call
> returns, the sk must not be released.
>
> This affects skb fragments reassembled via netfilter or similar
> modules, e.g. openvswitch or ct_act.c, when run as part of tx pipeline.
>
> Eric Dumazet made an initial analysis of this bug. Quoting Eric:
> Calling ip_defrag() in output path is also implying skb_orphan(),
> which is buggy because output path relies on sk not disappearing.
>
> A relevant old patch about the issue was :
> 8282f27449bf ("inet: frag: Always orphan skbs inside ip_defrag()")
>
> [..]
>
> net/ipv4/ip_output.c depends on skb->sk being set, and probably to an
> inet socket, not an arbitrary one.
>
> If we orphan the packet in ipvlan, then downstream things like FQ
> packet scheduler will not work properly.
>
> We need to change ip_defrag() to only use skb_orphan() when really
> needed, ie whenever frag_list is going to be used.
>
> Eric suggested to stash sk in fragment queue and made an initial patch.
> However there is a problem with this:
>
> If skb is refragmented again right after, ip_do_fragment() will copy
> head->sk to the new fragments, and sets up destructor to sock_wfree.
> IOW, we have no choice but to fix up sk_wmem accouting to reflect the
> fully reassembled skb, else wmem will underflow.
>
> This change moves the orphan down into the core, to last possible moment.
> As ip_defrag_offset is aliased with sk_buff->sk member, we must move the
> offset into the FRAG_CB, else skb->sk gets clobbered.
>
> This allows to delay the orphaning long enough to learn if the skb has
> to be queued or if the skb is completing the reasm queue.
>
> In the former case, things work as before, skb is orphaned. This is
> safe because skb gets queued/stolen and won't continue past reasm engine.
>
> In the latter case, we will steal the skb->sk reference, reattach it to
> the head skb, and fix up wmem accouting when inet_frag inflates truesize.
>
> Diagnosed-by: Eric Dumazet <edumazet@...gle.com>
> Reported-by: xingwei lee <xrivendell7@...il.com>
> Reported-by: yue sun <samsun1006219@...il.com>
> Reported-by: syzbot+e5167d7144a62715044c@...kaller.appspotmail.com
> Signed-off-by: Florian Westphal <fw@...len.de>
> ---
> include/linux/skbuff.h | 7 +--
> net/ipv4/inet_fragment.c | 71 ++++++++++++++++++++-----
> net/ipv4/ip_fragment.c | 2 +-
> net/ipv6/netfilter/nf_conntrack_reasm.c | 2 +-
> 4 files changed, 61 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 7d56ce195120..6d08ff8a9357 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -753,8 +753,6 @@ typedef unsigned char *sk_buff_data_t;
> * @list: queue head
> * @ll_node: anchor in an llist (eg socket defer_list)
> * @sk: Socket we are owned by
> - * @ip_defrag_offset: (aka @sk) alternate use of @sk, used in
> - * fragmentation management
> * @dev: Device we arrived on/are leaving by
> * @dev_scratch: (aka @dev) alternate use of @dev when @dev would be %NULL
> * @cb: Control buffer. Free for use by every layer. Put private vars here
> @@ -875,10 +873,7 @@ struct sk_buff {
> struct llist_node ll_node;
> };
>
> - union {
> - struct sock *sk;
> - int ip_defrag_offset;
> - };
> + struct sock *sk;
>
> union {
> ktime_t tstamp;
> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 7072fc0783ef..7254b640ba06 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -24,6 +24,8 @@
> #include <net/ip.h>
> #include <net/ipv6.h>
>
> +#include "../core/sock_destructor.h"
> +
> /* Use skb->cb to track consecutive/adjacent fragments coming at
> * the end of the queue. Nodes in the rb-tree queue will
> * contain "runs" of one or more adjacent fragments.
> @@ -39,6 +41,7 @@ struct ipfrag_skb_cb {
> };
> struct sk_buff *next_frag;
> int frag_run_len;
> + int ip_defrag_offset;
> };
>
> #define FRAG_CB(skb) ((struct ipfrag_skb_cb *)((skb)->cb))
> @@ -396,12 +399,12 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
> */
> if (!last)
> fragrun_create(q, skb); /* First fragment. */
> - else if (last->ip_defrag_offset + last->len < end) {
> + else if (FRAG_CB(last)->ip_defrag_offset + last->len < end) {
> /* This is the common case: skb goes to the end. */
> /* Detect and discard overlaps. */
> - if (offset < last->ip_defrag_offset + last->len)
> + if (offset < FRAG_CB(last)->ip_defrag_offset + last->len)
> return IPFRAG_OVERLAP;
> - if (offset == last->ip_defrag_offset + last->len)
> + if (offset == FRAG_CB(last)->ip_defrag_offset + last->len)
> fragrun_append_to_last(q, skb);
> else
> fragrun_create(q, skb);
> @@ -418,13 +421,13 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
>
> parent = *rbn;
> curr = rb_to_skb(parent);
> - curr_run_end = curr->ip_defrag_offset +
> + curr_run_end = FRAG_CB(curr)->ip_defrag_offset +
> FRAG_CB(curr)->frag_run_len;
> - if (end <= curr->ip_defrag_offset)
> + if (end <= FRAG_CB(curr)->ip_defrag_offset)
> rbn = &parent->rb_left;
> else if (offset >= curr_run_end)
> rbn = &parent->rb_right;
> - else if (offset >= curr->ip_defrag_offset &&
> + else if (offset >= FRAG_CB(curr)->ip_defrag_offset &&
> end <= curr_run_end)
> return IPFRAG_DUP;
> else
> @@ -438,23 +441,39 @@ int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
> rb_insert_color(&skb->rbnode, &q->rb_fragments);
> }
>
> - skb->ip_defrag_offset = offset;
> + FRAG_CB(skb)->ip_defrag_offset = offset;
>
> return IPFRAG_OK;
> }
> EXPORT_SYMBOL(inet_frag_queue_insert);
>
> +void tcp_wfree(struct sk_buff *skb);
Thanks a lot Florian for looking at this !
Since you had : #include "../core/sock_destructor.h", perhaps the line
can be removed,
because it includes <net/tcp.h>
Powered by blists - more mailing lists