[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL+tcoBuQYsbfGuL0PUGdvy7UHGKii3rXv8q+GjzbHvK3hKsQw@mail.gmail.com>
Date: Thu, 6 Feb 2025 08:33:51 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: Martin KaFai Lau <martin.lau@...ux.dev>, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com, dsahern@...nel.org, willemb@...gle.com,
ast@...nel.org, daniel@...earbox.net, andrii@...nel.org, eddyz87@...il.com,
song@...nel.org, yonghong.song@...ux.dev, john.fastabend@...il.com,
kpsingh@...nel.org, sdf@...ichev.me, haoluo@...gle.com, jolsa@...nel.org,
horms@...nel.org, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v7 11/13] net-timestamp: add a new callback in tcp_tx_timestamp()
On Thu, Feb 6, 2025 at 5:02 AM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> Jason Xing wrote:
> > On Wed, Feb 5, 2025 at 11:20 PM Willem de Bruijn
> > <willemdebruijn.kernel@...il.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > On Wed, Feb 5, 2025 at 2:09 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > >
> > > > > On Wed, Feb 5, 2025 at 1:08 AM Willem de Bruijn
> > > > > <willemdebruijn.kernel@...il.com> wrote:
> > > > > >
> > > > > > Jason Xing wrote:
> > > > > > > On Tue, Feb 4, 2025 at 9:16 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> > > > > > > >
> > > > > > > > On 1/28/25 12:46 AM, Jason Xing wrote:
> > > > > > > > > Introduce the callback to correlate tcp_sendmsg timestamp with other
> > > > > > > > > points, like SND/SW/ACK. We can let bpf trace the beginning of
> > > > > > > > > tcp_sendmsg_locked() and fetch the socket addr, so that in
> > > > > > > >
> > > > > > > > Instead of "fetch the socket addr...", should be "store the sendmsg timestamp at
> > > > > > > > the bpf_sk_storage ...".
> > > > > > >
> > > > > > > I will revise it. Thanks.
> > > > > > >
> > > > > > > >
> > > > > > > > > tcp_tx_timestamp() we can correlate the tskey with the socket addr.
> > > > > > > >
> > > > > > > >
> > > > > > > > > It is accurate since they are under the protect of socket lock.
> > > > > > > > > More details can be found in the selftest.
> > > > > > > >
> > > > > > > > The selftest uses the bpf_sk_storage to store the sendmsg timestamp at
> > > > > > > > fentry/tcp_sendmsg_locked and retrieves it back at tcp_tx_timestamp (i.e.
> > > > > > > > BPF_SOCK_OPS_TS_SND_CB added in this patch).
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Signed-off-by: Jason Xing <kerneljasonxing@...il.com>
> > > > > > > > > ---
> > > > > > > > > include/uapi/linux/bpf.h | 7 +++++++
> > > > > > > > > net/ipv4/tcp.c | 1 +
> > > > > > > > > tools/include/uapi/linux/bpf.h | 7 +++++++
> > > > > > > > > 3 files changed, 15 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > > > > > index 800122a8abe5..accb3b314fff 100644
> > > > > > > > > --- a/include/uapi/linux/bpf.h
> > > > > > > > > +++ b/include/uapi/linux/bpf.h
> > > > > > > > > @@ -7052,6 +7052,13 @@ enum {
> > > > > > > > > * when SK_BPF_CB_TX_TIMESTAMPING
> > > > > > > > > * feature is on.
> > > > > > > > > */
> > > > > > > > > + BPF_SOCK_OPS_TS_SND_CB, /* Called when every sendmsg syscall
> > > > > > > > > + * is triggered. For TCP, it stays
> > > > > > > > > + * in the last send process to
> > > > > > > > > + * correlate with tcp_sendmsg timestamp
> > > > > > > > > + * with other timestamping callbacks,
> > > > > > > > > + * like SND/SW/ACK.
> > > > > > > >
> > > > > > > > Do you have a chance to look at how this will work at UDP?
> > > > > > >
> > > > > > > Sure, I feel like it could not be useful for UDP. Well, things get
> > > > > > > strange because I did write a long paragraph about this thing which
> > > > > > > apparently disappeared...
> > > > > > >
> > > > > > > I manage to find what I wrote:
> > > > > > > For UDP type, BPF_SOCK_OPS_TS_SND_CB may be not suitable because
> > > > > > > there are two sending process, 1) lockless path, 2) lock path, which
> > > > > > > should be handled carefully later. For the former, even though it's
> > > > > > > unlikely multiple threads access the socket to call sendmsg at the
> > > > > > > same time, I think we'd better not correlate it like what we do to the
> > > > > > > TCP case because of the lack of sock lock protection. Considering SND_CB is
> > > > > > > uapi flag, I think we don't need to forcely add the 'TCP_' prefix in
> > > > > > > case we need to use it someday.
> > > > > > >
> > > > > > > And one more thing is I'd like to use the v5[1] method in the next round
> > > > > > > to introduce a new tskey_bpf which is good for UDP type. The new field
> > > > > > > will not conflict with the tskey in shared info which is generated
> > > > > > > by sk->sk_tskey in __ip_append_data(). It hardly works if both features
> > > > > > > (so_timestamping and its bpf extension) exists at the same time. Users
> > > > > > > could get confused because sometimes they fetch the tskey from skb,
> > > > > > > sometimes they don't, especially when we have cmsg feature to turn it on/
> > > > > > > off per sendmsg. A standalone tskey for bpf extension will be needed.
> > > > > > > With this tskey_bpf, we can easily correlate the timestamp in sendmsg
> > > > > > > syscall with other tx points(SND/SW/ACK...).
> > > > > > >
> > > > > > > [1]: https://lore.kernel.org/all/20250112113748.73504-14-kerneljasonxing@gmail.com/
> > > > > > >
> > > > > > > If possible, we can leave this question until the UDP support series
> > > > > > > shows up. I will figure out a better solution :)
> > > > > > >
> > > > > > > In conclusion, it probably won't be used by the UDP type. It's uAPI
> > > > > > > flag so I consider the compatibility reason.
> > > > > >
> > > > > > I don't think this is acceptable. We should aim for an API that can
> > > > > > easily be used across protocols, like SO_TIMESTAMPING.
> > > > >
> > > > > After I revisit the UDP SO_TIMESTAMPING again, my thoughts are
> > > > > adjusted like below:
> > > > >
> > > > > It's hard to provide an absolutely uniform interface or usage to users
> > > > > for TCP and UDP and even more protocols. Cases can be handled one by
> > > > > one.
> > >
> > > We should try hard. SO_TIMESTAMPING is uniform across protocols.
> > > An interface that is not is just hard to use.
> > >
> > > > > The main obstacle is how we can correlate the timestamp in
> > > > > sendmsg syscall with other sending timestamps. It's worth noticing
> > > > > that for SO_TIMESTAMPING the sendmsg timestamp is collected in the
> > > > > userspace. For instance, while skb enters the qdisc, we fail to know
> > > > > which skb belongs to which sendmsg.
> > > > >
> > > > > An idea coming up is to introduce BPF_SOCK_OPS_TS_SND_CB to correlate
> > > > > the sendmsg timestamp with tskey (in tcp_tx_timestamp()) under the
> > > > > protection of socket lock + syscall as the current patch does. But for
> > > > > UDP, it can be lockless. IIUC, there is a very special case where even
> > > > > SO_TIMESTAMPING may get lost: if multiple threads accessing the same
> > > > > socket send UDP packets in parallel, then users could be confused
> > > > > which tskey matches which sendmsg.
> > >
> > > This is a known issue for lockless datagram sockets.
> > >
> > > With SO_TIMESTAMPING, but the use of timestamping and of concurrent
> > > sendmsg calls is under control of the process, so it only shoots
> > > itself in the foot.
> > >
> > > With BPF timestamping, a process may confuse a third party admin, so
> > > the situation is slightly different.
> >
> > Agreed.
> >
> > >
> > > > > IIUC, I will not consider this
> > > > > unlikely case, then the UDP case is quite similar to the TCP case.
> > > > >
> > > > > The scenario for the UDP case is:
> > > > > 1) using fentry bpf to hook the udp_sendmsg() to get the timestamp
> > > > > like TCP does in this series.
> > > > > 2) insert BPF_SOCK_OPS_TS_SND_CB into __ip_append_data() near the
> > > > > SO_TIMESTAMPING code snippets to let bpf program correlate the tskey
> > > > > with timestamp.
> > > > > Note: tskey in UDP will be handled carefully in a different way
> > > > > because we should support both modes for socket timestamping at the
> > > > > same time.
> > > > > It's really similar to TCP regardless of handling tskey.
> > > > >
> > > >
> > > > To be more precise in case you don't have much time to read the above
> > > > long paragraph, BPF_SOCK_OPS_TS_SND_CB is mainly used to correlate
> > > > sendmsg timestamp with corresponding tskey.
> > > >
> > > > 1. For TCP, we can correlate it in tcp_tx_timestamp() like this patch does.
> > > >
> > > > 2. For UDP, we can correlate in __ip_append_data() along with those
> > > > tskey initialization, assuming there are no multiple threads calling
> > > > locklessly ip_make_skb(). Locked path
> > > > (udp_sendmsg()->ip_append_data()) works like TCP under the socket lock
> > > > protection, so it can be easily handled. Lockless path
> > > > (udp_sendmsg()->ip_make_skb()) can be visited by multiple threads at
> > > > the same time, which should be handled properly.
> > >
> > > Different hook points is fine, as UDP (and RAW) uses __ip_append_data
> >
> > Then this approach (introducing this new flag) is feasible. Sorry that
> > last night I wrote such a long paragraph which buried something
> > important. Because of that, I rephrase the whole idea about how to let
> > UDP work with this kind of new flag in [patch v8 11/12]. Link is
> > https://lore.kernel.org/all/CAL+tcoCmXcDot-855XYU7PKCiGvJL=O3CQBGuOTRAs2_=Ys=gg@mail.gmail.com/
> >
> > > or more importantly ip_send_skb, while TCP uses ip_queue_xmit.
> >
> > For TCP, we use tcp_tx_timestamp to finish the map between sendmsg
> > timestamp and tskey.
> >
> > >
> > > As long as the API is the same: the operation (BPF_SOCK_OPS_TS_SND_CB)
> > > and the behavior of that operation. Subject to the usual distinction
> > > between protocol behavior (bytestream vs datagram).
> >
> > I see your point.
> >
> > >
> > > > I prefer to implement
> > > > the bpf extension for IPCORK_TS_OPT_ID, which should be another topic,
> > > > I think. This might be the only one corner case, IIUC?
> > >
> > > This sounds like an entirely different topic? Not sure what this is.
> >
> > Not really a different topic. I mean let bpf prog take the whole
> > control of setting the tskey, then with this BPF_SOCK_OPS_TS_SND_CB
> > flag we can correlate the sendmsg timestamp with tskey. So It has
> > something to do with the usage of UDP. Please take a look at that link
> > to patch 11/12. For TCP, we don't need to care about the value of
> > tskey which has already been taken care of by SO_TIMESTAMPING. So it
> > is slightly different. I'm not sure if this kind of usage is
> > acceptable?
>
> Why can TCP rely on SO_TIMESTAMPING to set tskey, but UDP cannot?
Because for TCP the shared info tskey is calculated by seqno (in
tcp_tx_timestamp()), so it works for so_timestamping and its bpf
extension and they are the same. However, for UDP, the shared info
tskey can be different, depending on when to call __ip_append_data()
and what the sk->sk_tskey is. It can cause conflicts when two modes
work at the same time. More than that, lockless UDP case is a tough
one since we cannot correlate the sendmsg timestamp in udp_sendmsg()
with the tskey generated in __ip_append_data(), which is a long gap
without any lock. So reuse the IPCORK_TS_OPT_ID logic for bpf
extension here can work.
It's worth to highlight that 1) for TCP the time to generate a sendmsg
timestamp and generate a tskey are under the same lock, 2) for
non-lockless UDP, it works like TCP. But in order to deal with the
lockless part, I choose to implement the IPCORK_TS_OPT_ID idea. This
is somehow another topic, but the relevant part is that I tried to
prove BPF_SOCK_OPS_TS_SND_CB can be also used in UDP (but in a
different position compared to TCP protocol).
>
> BPF will need to set the key for both protocol if SO_TIMESTAMPING is
> not enabled, right?
For TCP, we will rely on 'shinfo->tskey = TCP_SKB_CB(skb)->seq +
skb->len - 1;' in tcp_tx_timestamp() regardless of the status of
SO_TIMESTAMPING, so don't bother to manage the tskey from bpf prog.
While for UDP, we have to manage the tskey. Of course, it's another
topic.
Thanks,
Jason
Powered by blists - more mailing lists