[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <67a424d2aa9ea_19943029427@willemb.c.googlers.com.notmuch>
Date: Wed, 05 Feb 2025 21:56:18 -0500
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Jason Xing <kerneljasonxing@...il.com>,
Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Jakub Kicinski <kuba@...nel.org>,
davem@...emloft.net,
edumazet@...gle.com,
pabeni@...hat.com,
dsahern@...nel.org,
willemdebruijn.kernel@...il.com,
willemb@...gle.com,
ast@...nel.org,
daniel@...earbox.net,
andrii@...nel.org,
eddyz87@...il.com,
song@...nel.org,
yonghong.song@...ux.dev,
john.fastabend@...il.com,
kpsingh@...nel.org,
sdf@...ichev.me,
haoluo@...gle.com,
jolsa@...nel.org,
horms@...nel.org,
bpf@...r.kernel.org,
netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf
extension work
Jason Xing wrote:
> On Thu, Feb 6, 2025 at 9:05 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> >
> > On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> > >
> > > On 2/5/25 4:12 PM, Jason Xing wrote:
> > > > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> > > >>
> > > >> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> > > >>> On Wed, 5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> > > >>>> + if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> > > >>>> + SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > > >>>> + struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > >>>> + struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > > >>>> +
> > > >>>> + tcb->txstamp_ack_bpf = 1;
> > > >>>> + shinfo->tx_flags |= SKBTX_BPF;
> > > >>>> + shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > > >>>> + }
> > > >>>
> > > >>> If BPF program is attached we'll timestamp all skbs? Am I reading this
> > > >>> right?
> > > >>
> > > >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> > > >> bit of a sock, then all skbs of this sock will be tx timestamp-ed.
> > > >
> > > > Martin, I'm afraid it's not like what you expect. Only the last
> > > > portion of the sendmsg will enter the above function which means if
> > > > the size of sendmsg is large, only the last skb will be set SKBTX_BPF
> > > > and be timestamped.
> > >
> > > Sure. The last skb of a large msg and more skb of small msg (or MSG_EOR).
> > >
> > > My point is, only attaching a bpf alone is not enough. The
> > > SK_BPF_CB_TX_TIMESTAMPING still needs to be turned on.
> >
> > Right.
> >
> > >
> > > >
> > > >>
> > > >>>
> > > >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> > > >>> interested in tracing current packet all the way thru the stack?
> > > >>
> > > >> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> > > >> particular socket.
> > > >>
> > > >> The return value of BPF_SOCK_OPS_TS_SND_CB (or any cgroup BPF prog return value)
> > > >> already has another usage, which its return value is currently enforced by the
> > > >> verifier. It is better not to convolute it further.
> > > >>
> > > >> I don't prefer to add more use cases to skops->reply either, which is an union
> > > >> of args[4], such that later progs (in the cgrp prog array) may lose the args value.
> > > >>
> > > >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> > > >> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> > > >> SKBTX_BPF and txstamp_ack_bpf in some skb.
> > > >
> > > > Agreed because at netdev 0x19 I have an explicit plan to share the
> > > > experience from our company about how to trace all the skbs which were
> > > > completed through a kernel module. It's how we use in production
> > > > especially for debug or diagnose use.
> > >
> > > This is fine. The bpf prog can still do that by calling the kfunc. I don't see
> > > why move the bit setting into kfunc makes the whole set won't work.
> > >
> > > > I'm not knowledgeable enough about BPF, so I'd like to know if there
> > > > are some functions that I can take as good examples?
> > > >
> > > > I think it's a standalone and good feature, can I handle it after this series?
> > >
> > > Unfortunately, no. Once the default is on, this cannot be changed.
> > >
> > > I think Jakub's suggestion to allow bpf prog selectively choose skb to timestamp
> > > is useful, so I suggested a way to do it.
> >
> > Because, sorry, I don't want to postpone this series any longer (blame
> > on me for delaying almost 4 months), only wanting to focus on the
> > extension for SO_TIMESTAMPING so that we can quickly move on with
> > small changes per series.
> >
> > Selectively sampling the skbs or sampling all the skbs could be an
> > optional good choice/feature for users instead of mandatory?
> >
> > There are two kinds of monitoring in production: 1) daily monitoring,
> > 2) diagnostic monitoring which I'm not sure if I translate in the
> > right way. For the former that is obviously a light-weight feature, I
> > think we don't need to trace that many skbs, only the last skb is
> > enough which was done in Google because even the selective feature[1]
> > is a little bit heavy. I received some complaints from a few
> > latency-sensitive customers to ask us if we can reduce the monitoring
> > in the kernel because as I mentioned before many issues are caused by
> > the application itself instead of kernel.
> >
> > [1] selective feature consists of two parts, only selectively
> > collecting all the skbs in a certain period or selectively collecting
> > exactly like what SO_TIMESTAMPING does in a certain period. It might
> > need a full discussion, I reckon.
>
> I presume you might refer to the former. It works like the cmsg
> feature which can be a good selectively sampling example. It would be
> better to check the value of reply in the BPF_SOCK_OPS_TS_SND_CB
> callback which is nearly the very beginning of each sendmsg syscall
> because I have a hunch we will add more hook points before skb enters
> the qdisc.
>
> I think we can split the whole idea into two parts: for now, because
> of the current series implementing the same function as SO_TIMETAMPING
> does, I will implement the selective sample feature in the series.
> After someday we finish tracing all the skb, then we will add the
> corresponding selective sample feature.
Are you saying that you will include selective sampling now or want to
postpone it?
Jakub brought up a great point. Our continuous deployment would not be
feasible without sampling. Indeed implemented using cmsg.
I think it should be included from the initial patch series.
> But the default mode is the exact same as SO_TIMESTAMPING instead of
> asking bpf prog to enable the sample feature. Does it make sense to
> you?
>
> With that said, the patch looks like this:
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 1f528e63bc71..73909dad7ed4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -497,11 +497,14 @@ static void tcp_tx_timestamp(struct sock *sk,
> struct sockcm_cookie *sockc)
> SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> struct skb_shared_info *shinfo = skb_shinfo(skb);
> struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> + bool enable_sample = true;
>
> - tcb->txstamp_ack_bpf = 1;
> - shinfo->tx_flags |= SKBTX_BPF;
> - shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> - bpf_skops_tx_timestamping(sk, skb, BPF_SOCK_OPS_TS_SND_CB);
> + enable_sample = bpf_skops_tx_timestamping(sk, skb,
> BPF_SOCK_OPS_TS_SND_CB);
> + if (enable_sample) {
> + tcb->txstamp_ack_bpf = 1;
> + shinfo->tx_flags |= SKBTX_BPF;
> + shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> + }
> }
> }
>
> Thanks,
> Jason
Powered by blists - more mailing lists