[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <675b8f8f65e28_ff0720890@john.notmuch>
Date: Thu, 12 Dec 2024 17:36:15 -0800
From: John Fastabend <john.fastabend@...il.com>
To: Jiayuan Chen <mrpre@....com>,
John Fastabend <john.fastabend@...il.com>
Cc: bpf@...r.kernel.org,
martin.lau@...ux.dev,
ast@...nel.org,
edumazet@...gle.com,
jakub@...udflare.com,
davem@...emloft.net,
dsahern@...nel.org,
kuba@...nel.org,
pabeni@...hat.com,
linux-kernel@...r.kernel.org,
song@...nel.org,
andrii@...nel.org,
mhal@...x.co,
yonghong.song@...ux.dev,
daniel@...earbox.net,
xiyou.wangcong@...il.com,
horms@...nel.org
Subject: Re: [PATCH bpf v2 1/2] bpf: fix wrong copied_seq calculation
Jiayuan Chen wrote:
> On Tue, Dec 10, 2024 at 06:11:26PM -0800, John Fastabend wrote:
> > Jiayuan Chen wrote:
> > > 'sk->copied_seq' was updated in the tcp_eat_skb() function when the
> > > action of a BPF program was SK_REDIRECT. For other actions, like SK_PASS,
> > > the update logic for 'sk->copied_seq' was moved to
> > > tcp_bpf_recvmsg_parser() to ensure the accuracy of the 'fionread' feature.
> > >
> > > It works for a single stream_verdict scenario, as it also modified
> > > 'sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb'
> > > to remove updating 'sk->copied_seq'.
> > >
> > > However, for programs where both stream_parser and stream_verdict are
> > > active(strparser purpose), tcp_read_sock() was used instead of
> > > tcp_read_skb() (sk_data_ready->strp_data_ready->tcp_read_sock)
> > > tcp_read_sock() now still update 'sk->copied_seq', leading to duplicated
> > > updates.
> > >
> > > In summary, for strparser + SK_PASS, copied_seq is redundantly calculated
> > > in both tcp_read_sock() and tcp_bpf_recvmsg_parser().
> > >
> > > The issue causes incorrect copied_seq calculations, which prevent
> > > correct data reads from the recv() interface in user-land.
> > >
> > > Modifying tcp_read_sock() or strparser implementation directly is
> > > unreasonable, as it is widely used in other modules.
> > >
> > > Here, we introduce a method tcp_bpf_read_sock() to replace
> > > 'sk->sk_socket->ops->read_sock' (like 'tls_build_proto()' does in
> > > tls_main.c). Such replacement action was also used in updating
> > > tcp_bpf_prots in tcp_bpf.c, so it's not weird.
> > > (Note that checkpatch.pl may complain missing 'const' qualifier when we
> > > define the bpf-specified 'proto_ops', but we have to do because we need
> > > update it).
> > >
> > > Also we remove strparser check in tcp_eat_skb() since we implement custom
> > > function tcp_bpf_read_sock() without copied_seq updating.
> > >
> > > Since strparser currently supports only TCP, it's sufficient for 'ops' to
> > > inherit inet_stream_ops.
> > >
> > > In strparser's implementation, regardless of partial or full reads,
> > > it completely clones the entire skb, allowing us to unconditionally
> > > free skb in tcp_bpf_read_sock().
> > >
> > > Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq")
> > > Signed-off-by: Jiayuan Chen <mrpre@....com>
> >
> > [...]
> >
> > > +/* The tcp_bpf_read_sock() is an alternative implementation
> > > + * of tcp_read_sock(), except that it does not update copied_seq.
> > > + */
> > > +static int tcp_bpf_read_sock(struct sock *sk, read_descriptor_t *desc,
> > > + sk_read_actor_t recv_actor)
> > > +{
> > > + struct sk_buff *skb;
> > > + int copied = 0;
> > > +
> > > + if (sk->sk_state == TCP_LISTEN)
> > > + return -ENOTCONN;
> > > +
> > > + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
> > > + u8 tcp_flags;
> > > + int used;
> > > +
> > > + WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
> > > + tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
> > > + used = recv_actor(desc, skb, 0, skb->len);
> >
> > Here the skb is still on the receive_queue how does this work with
> > tcp_try_coalesce()? So I believe you need to unlink before you
> > call the actor which creates a bit of trouble if recv_actor
> > doesn't want the entire skb.
> >
> > I think easier is to do similar logic to read_sock and track
> > offset and len? Did I miss something.
>
> Thanks to John Fastabend.
>
> Let me explain it.
> Now I only replace the read_sock handler when using strparser.
>
> My previous implementation added the replacement of read_sock in
> sk_psock_start_strp() to more explicitly require replacement when using
> strparser, for instance:
> '''skmsg.c
> void sk_psock_start_strp(struct sock *sk, struct sk_psock *psock)
> {
> ...
> sk->sk_data_ready = sk_psock_strp_data_ready;
> /* Replacement */
> sk->sk_socket->ops->read_sock = tcp_bpf_read_sock;
> }
> '''
>
> As you can see that it only works for strparser.
> (The current implementation of replacement in tcp_bpf_update_proto()
> achieves the same effect just not as obviously.)
>
> So the current implementation of recv_actor() can only be strp_recv(),
> with the call stack as follows:
> '''
> sk_psock_strp_data_ready
> -> strp_data_ready
> -> strp_read_sock
> -> strp_recv
> '''
>
> The implementation of strp_recv() will consume all input skb. Even if it
> reads part of the data, it will clone it, then place it into its own
> queue, expecting the caller to release the skb. I didn't find any
> logic like tcp_try_coalesce() (fix me if i miss something).
So here is what I believe the flow is,
sk_psock_strp_data_ready
-> str_data_ready
-> strp_read_sock
-> sock->ops->read_sock(.., strp_recv)
We both have the same idea up to here. But then the proposed data_ready()
call
+ while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
+ u8 tcp_flags;
+ int used;
+
+ WARN_ON_ONCE(!skb_set_owner_sk_safe(skb, sk));
+ tcp_flags = TCP_SKB_CB(skb)->tcp_flags;
+ used = recv_actor(desc, skb, 0, skb->len);
The recv_actor here is strp_recv() all good so far. But, because
that skb is still on the sk_receive_queue() the TCP stack may
at the same time do
tcp_data_queue
-> tcp_queue_rcv
-> tail = skb_peek_tail(&sk->sk_receive_queue);
tcp_try_coalesce(sk, tail, skb, fragstolen)
-> skb_try_coalesce()
... skb->len += len
So among other things you will have changed the skb->len and added some
data to it. If this happens while you are running the recv actor we will
eat the data by calling tcp_eat_recv_skb(). Any data added from the
try_coalesce will just be dropped and never handled? The clone() from
the strparser side doesn't help you the tcp_eat_recv_skb call will
unlik the skb from the sk_receive_queue.
I don't think you have any way to protect this at the moment.
>
> The record of the 'offset' is stored within its own context(strparser/_strp_msg).
> With all skbs and offset saved in strp context, the caller does not need to
> maintain it.
>
> I have also added various logic tests for this situation in the test
> case, and it works correctly.
> > > + /* strparser clone and consume all input skb
> > > + * even in waiting head or body status
> > > + */
> > > + tcp_eat_recv_skb(sk, skb);
> > > + if (used <= 0) {
> > > + if (!copied)
> > > + copied = used;
> > > + break;
> > > + }
> > > + copied += used;
> > > + if (!desc->count)
> > > + break;
> > > + if (tcp_flags & TCPHDR_FIN)
> > > + break;
> > > + }
> > > + return copied;
> > > +}
> > > +
> > > enum {
> > > TCP_BPF_IPV4,
> > > TCP_BPF_IPV6,
> > > @@ -595,6 +636,10 @@ enum {
>
>
Powered by blists - more mailing lists