[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAL+tcoCeCVVe=Uiu9Onr8efA9udqL+1yXyckkWwS7PemyZbhUw@mail.gmail.com>
Date: Thu, 6 Feb 2025 12:03:01 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: Martin KaFai Lau <martin.lau@...ux.dev>, davem@...emloft.net, edumazet@...gle.com,
kuba@...nel.org, pabeni@...hat.com, dsahern@...nel.org, willemb@...gle.com,
ast@...nel.org, daniel@...earbox.net, andrii@...nel.org, eddyz87@...il.com,
song@...nel.org, yonghong.song@...ux.dev, john.fastabend@...il.com,
kpsingh@...nel.org, sdf@...ichev.me, haoluo@...gle.com, jolsa@...nel.org,
horms@...nel.org, bpf@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v7 11/13] net-timestamp: add a new callback in tcp_tx_timestamp()
On Thu, Feb 6, 2025 at 11:00 AM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> Jason Xing wrote:
> > On Thu, Feb 6, 2025 at 5:02 AM Willem de Bruijn
> > <willemdebruijn.kernel@...il.com> wrote:
> > >
> > > Jason Xing wrote:
> > > > On Wed, Feb 5, 2025 at 11:20 PM Willem de Bruijn
> > > > <willemdebruijn.kernel@...il.com> wrote:
> > > > >
> > > > > Jason Xing wrote:
> > > > > > On Wed, Feb 5, 2025 at 2:09 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > > > > > >
> > > > > > > On Wed, Feb 5, 2025 at 1:08 AM Willem de Bruijn
> > > > > > > <willemdebruijn.kernel@...il.com> wrote:
> > > > > > > >
> > > > > > > > Jason Xing wrote:
> > > > > > > > > On Tue, Feb 4, 2025 at 9:16 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> > > > > > > > > >
> > > > > > > > > > On 1/28/25 12:46 AM, Jason Xing wrote:
> > > > > > > > > > > Introduce the callback to correlate tcp_sendmsg timestamp with other
> > > > > > > > > > > points, like SND/SW/ACK. We can let bpf trace the beginning of
> > > > > > > > > > > tcp_sendmsg_locked() and fetch the socket addr, so that in
> > > > > > > > > >
> > > > > > > > > > Instead of "fetch the socket addr...", should be "store the sendmsg timestamp at
> > > > > > > > > > the bpf_sk_storage ...".
> > > > > > > > >
> > > > > > > > > I will revise it. Thanks.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > tcp_tx_timestamp() we can correlate the tskey with the socket addr.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > It is accurate since they are under the protect of socket lock.
> > > > > > > > > > > More details can be found in the selftest.
> > > > > > > > > >
> > > > > > > > > > The selftest uses the bpf_sk_storage to store the sendmsg timestamp at
> > > > > > > > > > fentry/tcp_sendmsg_locked and retrieves it back at tcp_tx_timestamp (i.e.
> > > > > > > > > > BPF_SOCK_OPS_TS_SND_CB added in this patch).
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Jason Xing <kerneljasonxing@...il.com>
> > > > > > > > > > > ---
> > > > > > > > > > > include/uapi/linux/bpf.h | 7 +++++++
> > > > > > > > > > > net/ipv4/tcp.c | 1 +
> > > > > > > > > > > tools/include/uapi/linux/bpf.h | 7 +++++++
> > > > > > > > > > > 3 files changed, 15 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > > > > > > > > index 800122a8abe5..accb3b314fff 100644
> > > > > > > > > > > --- a/include/uapi/linux/bpf.h
> > > > > > > > > > > +++ b/include/uapi/linux/bpf.h
> > > > > > > > > > > @@ -7052,6 +7052,13 @@ enum {
> > > > > > > > > > > * when SK_BPF_CB_TX_TIMESTAMPING
> > > > > > > > > > > * feature is on.
> > > > > > > > > > > */
> > > > > > > > > > > + BPF_SOCK_OPS_TS_SND_CB, /* Called when every sendmsg syscall
> > > > > > > > > > > + * is triggered. For TCP, it stays
> > > > > > > > > > > + * in the last send process to
> > > > > > > > > > > + * correlate with tcp_sendmsg timestamp
> > > > > > > > > > > + * with other timestamping callbacks,
> > > > > > > > > > > + * like SND/SW/ACK.
> > > > > > > > > >
> > > > > > > > > > Do you have a chance to look at how this will work at UDP?
> > > > > > > > >
> > > > > > > > > Sure, I feel like it could not be useful for UDP. Well, things get
> > > > > > > > > strange because I did write a long paragraph about this thing which
> > > > > > > > > apparently disappeared...
> > > > > > > > >
> > > > > > > > > I manage to find what I wrote:
> > > > > > > > > For UDP type, BPF_SOCK_OPS_TS_SND_CB may be not suitable because
> > > > > > > > > there are two sending process, 1) lockless path, 2) lock path, which
> > > > > > > > > should be handled carefully later. For the former, even though it's
> > > > > > > > > unlikely multiple threads access the socket to call sendmsg at the
> > > > > > > > > same time, I think we'd better not correlate it like what we do to the
> > > > > > > > > TCP case because of the lack of sock lock protection. Considering SND_CB is
> > > > > > > > > uapi flag, I think we don't need to forcely add the 'TCP_' prefix in
> > > > > > > > > case we need to use it someday.
> > > > > > > > >
> > > > > > > > > And one more thing is I'd like to use the v5[1] method in the next round
> > > > > > > > > to introduce a new tskey_bpf which is good for UDP type. The new field
> > > > > > > > > will not conflict with the tskey in shared info which is generated
> > > > > > > > > by sk->sk_tskey in __ip_append_data(). It hardly works if both features
> > > > > > > > > (so_timestamping and its bpf extension) exists at the same time. Users
> > > > > > > > > could get confused because sometimes they fetch the tskey from skb,
> > > > > > > > > sometimes they don't, especially when we have cmsg feature to turn it on/
> > > > > > > > > off per sendmsg. A standalone tskey for bpf extension will be needed.
> > > > > > > > > With this tskey_bpf, we can easily correlate the timestamp in sendmsg
> > > > > > > > > syscall with other tx points(SND/SW/ACK...).
> > > > > > > > >
> > > > > > > > > [1]: https://lore.kernel.org/all/20250112113748.73504-14-kerneljasonxing@gmail.com/
> > > > > > > > >
> > > > > > > > > If possible, we can leave this question until the UDP support series
> > > > > > > > > shows up. I will figure out a better solution :)
> > > > > > > > >
> > > > > > > > > In conclusion, it probably won't be used by the UDP type. It's uAPI
> > > > > > > > > flag so I consider the compatibility reason.
> > > > > > > >
> > > > > > > > I don't think this is acceptable. We should aim for an API that can
> > > > > > > > easily be used across protocols, like SO_TIMESTAMPING.
> > > > > > >
> > > > > > > After I revisit the UDP SO_TIMESTAMPING again, my thoughts are
> > > > > > > adjusted like below:
> > > > > > >
> > > > > > > It's hard to provide an absolutely uniform interface or usage to users
> > > > > > > for TCP and UDP and even more protocols. Cases can be handled one by
> > > > > > > one.
> > > > >
> > > > > We should try hard. SO_TIMESTAMPING is uniform across protocols.
> > > > > An interface that is not is just hard to use.
> > > > >
> > > > > > > The main obstacle is how we can correlate the timestamp in
> > > > > > > sendmsg syscall with other sending timestamps. It's worth noticing
> > > > > > > that for SO_TIMESTAMPING the sendmsg timestamp is collected in the
> > > > > > > userspace. For instance, while skb enters the qdisc, we fail to know
> > > > > > > which skb belongs to which sendmsg.
> > > > > > >
> > > > > > > An idea coming up is to introduce BPF_SOCK_OPS_TS_SND_CB to correlate
> > > > > > > the sendmsg timestamp with tskey (in tcp_tx_timestamp()) under the
> > > > > > > protection of socket lock + syscall as the current patch does. But for
> > > > > > > UDP, it can be lockless. IIUC, there is a very special case where even
> > > > > > > SO_TIMESTAMPING may get lost: if multiple threads accessing the same
> > > > > > > socket send UDP packets in parallel, then users could be confused
> > > > > > > which tskey matches which sendmsg.
> > > > >
> > > > > This is a known issue for lockless datagram sockets.
> > > > >
> > > > > With SO_TIMESTAMPING, but the use of timestamping and of concurrent
> > > > > sendmsg calls is under control of the process, so it only shoots
> > > > > itself in the foot.
> > > > >
> > > > > With BPF timestamping, a process may confuse a third party admin, so
> > > > > the situation is slightly different.
> > > >
> > > > Agreed.
> > > >
> > > > >
> > > > > > > IIUC, I will not consider this
> > > > > > > unlikely case, then the UDP case is quite similar to the TCP case.
> > > > > > >
> > > > > > > The scenario for the UDP case is:
> > > > > > > 1) using fentry bpf to hook the udp_sendmsg() to get the timestamp
> > > > > > > like TCP does in this series.
> > > > > > > 2) insert BPF_SOCK_OPS_TS_SND_CB into __ip_append_data() near the
> > > > > > > SO_TIMESTAMPING code snippets to let bpf program correlate the tskey
> > > > > > > with timestamp.
> > > > > > > Note: tskey in UDP will be handled carefully in a different way
> > > > > > > because we should support both modes for socket timestamping at the
> > > > > > > same time.
> > > > > > > It's really similar to TCP regardless of handling tskey.
> > > > > > >
> > > > > >
> > > > > > To be more precise in case you don't have much time to read the above
> > > > > > long paragraph, BPF_SOCK_OPS_TS_SND_CB is mainly used to correlate
> > > > > > sendmsg timestamp with corresponding tskey.
> > > > > >
> > > > > > 1. For TCP, we can correlate it in tcp_tx_timestamp() like this patch does.
> > > > > >
> > > > > > 2. For UDP, we can correlate in __ip_append_data() along with those
> > > > > > tskey initialization, assuming there are no multiple threads calling
> > > > > > locklessly ip_make_skb(). Locked path
> > > > > > (udp_sendmsg()->ip_append_data()) works like TCP under the socket lock
> > > > > > protection, so it can be easily handled. Lockless path
> > > > > > (udp_sendmsg()->ip_make_skb()) can be visited by multiple threads at
> > > > > > the same time, which should be handled properly.
> > > > >
> > > > > Different hook points is fine, as UDP (and RAW) uses __ip_append_data
> > > >
> > > > Then this approach (introducing this new flag) is feasible. Sorry that
> > > > last night I wrote such a long paragraph which buried something
> > > > important. Because of that, I rephrase the whole idea about how to let
> > > > UDP work with this kind of new flag in [patch v8 11/12]. Link is
> > > > https://lore.kernel.org/all/CAL+tcoCmXcDot-855XYU7PKCiGvJL=O3CQBGuOTRAs2_=Ys=gg@mail.gmail.com/
> > > >
> > > > > or more importantly ip_send_skb, while TCP uses ip_queue_xmit.
> > > >
> > > > For TCP, we use tcp_tx_timestamp to finish the map between sendmsg
> > > > timestamp and tskey.
> > > >
> > > > >
> > > > > As long as the API is the same: the operation (BPF_SOCK_OPS_TS_SND_CB)
> > > > > and the behavior of that operation. Subject to the usual distinction
> > > > > between protocol behavior (bytestream vs datagram).
> > > >
> > > > I see your point.
> > > >
> > > > >
> > > > > > I prefer to implement
> > > > > > the bpf extension for IPCORK_TS_OPT_ID, which should be another topic,
> > > > > > I think. This might be the only one corner case, IIUC?
> > > > >
> > > > > This sounds like an entirely different topic? Not sure what this is.
> > > >
> > > > Not really a different topic. I mean let bpf prog take the whole
> > > > control of setting the tskey, then with this BPF_SOCK_OPS_TS_SND_CB
> > > > flag we can correlate the sendmsg timestamp with tskey. So It has
> > > > something to do with the usage of UDP. Please take a look at that link
> > > > to patch 11/12. For TCP, we don't need to care about the value of
> > > > tskey which has already been taken care of by SO_TIMESTAMPING. So it
> > > > is slightly different. I'm not sure if this kind of usage is
> > > > acceptable?
> > >
> > > Why can TCP rely on SO_TIMESTAMPING to set tskey, but UDP cannot?
> >
> > Because for TCP the shared info tskey is calculated by seqno (in
> > tcp_tx_timestamp()), so it works for so_timestamping and its bpf
> > extension and they are the same. However, for UDP, the shared info
> > tskey can be different, depending on when to call __ip_append_data()
> > and what the sk->sk_tskey is. It can cause conflicts when two modes
> > work at the same time.
>
> lockless and locked cannot conflict. (if up->pending then the only
> option is to append to that.)
Sorry, I should have described more about this point. I was trying to
say that if two modes (bpf extension and application timestamping)
work at the same time, will the tskey get messed up? Because we have
to check if the application mode is turned on. If on, we fetch the
existing key generated by the application mode, or else we generate
one by modifying the sk->sk_tskey or the tskey of skb?
>
> > More than that, lockless UDP case is a tough
> > one since we cannot correlate the sendmsg timestamp in udp_sendmsg()
> > with the tskey generated in __ip_append_data(),
>
> With SO_TIMESTAMPING we do not have this distinction between TCP and
> UDP, so we don't need it here.
>
> It is true that multiple lockless sendmsg calls can race and in that
> case that correlation is ambiguous. That is also the case for
> SO_TIMESTAMPING and a known issue.
>
> This is unlikely in most workloads in practice.
Oh, I see.
> > which is a long gap
> > without any lock. So reuse the IPCORK_TS_OPT_ID logic for bpf
> > extension here can work.
>
> It is fine to add a solution to work around the ambiguity. But not
> to make it a precondition and so diverge the API for TCP and UDP.
There is no divergence in API. BPF always uses SND flag to finish the
correlation. The only difference is UDP needs to manage the tskey pool
(allocating which tskey to which sendmsg).
>
> The same argument to choose a key from BPF can be made for TCP to
> a certain extent.
Well...right, but we don't bother to do this for TCP. The TCP case is simpler.
It seems that you object to the idea that let bpf prog control the
allocating tskey for UDP _as default_.
I'm not sure if I follow you. Sorry for repeating in case that I miss something:
1) For UDP, do not allocate the tskey from the bpf program, unless
it's an alternative/workaround to handle the lockless case. So it's a
backup choice.
2) Use the exact same way like TCP to finish the correlation on the
basis of socket lock protection _as default_. Because the lockless
method is seldomly used then we may provide 1) method?
?
Thanks,
Jason
>
> > It's worth to highlight that 1) for TCP the time to generate a sendmsg
> > timestamp and generate a tskey are under the same lock, 2) for
> > non-lockless UDP, it works like TCP. But in order to deal with the
> > lockless part, I choose to implement the IPCORK_TS_OPT_ID idea. This
> > is somehow another topic, but the relevant part is that I tried to
> > prove BPF_SOCK_OPS_TS_SND_CB can be also used in UDP (but in a
> > different position compared to TCP protocol).
> >
> > >
> > > BPF will need to set the key for both protocol if SO_TIMESTAMPING is
> > > not enabled, right?
> >
> > For TCP, we will rely on 'shinfo->tskey = TCP_SKB_CB(skb)->seq +
> > skb->len - 1;' in tcp_tx_timestamp() regardless of the status of
> > SO_TIMESTAMPING, so don't bother to manage the tskey from bpf prog.
> > While for UDP, we have to manage the tskey. Of course, it's another
> > topic.
> >
> > Thanks,
> > Jason
>
>
Powered by blists - more mailing lists