Message-ID: <CAL+tcoCPGAjs=+Hnzr4RLkioUV7nzy=ZmKkTDPA7sBeVP=qzow@mail.gmail.com>
Date: Thu, 6 Feb 2025 11:09:49 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: Martin KaFai Lau <martin.lau@...ux.dev>, Jakub Kicinski <kuba@...nel.org>, davem@...emloft.net, 
	edumazet@...gle.com, pabeni@...hat.com, dsahern@...nel.org, 
	willemb@...gle.com, ast@...nel.org, daniel@...earbox.net, andrii@...nel.org, 
	eddyz87@...il.com, song@...nel.org, yonghong.song@...ux.dev, 
	john.fastabend@...il.com, kpsingh@...nel.org, sdf@...ichev.me, 
	haoluo@...gle.com, jolsa@...nel.org, horms@...nel.org, bpf@...r.kernel.org, 
	netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next v8 10/12] bpf: make TCP tx timestamp bpf
 extension work

On Thu, Feb 6, 2025 at 10:56 AM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> Jason Xing wrote:
> > On Thu, Feb 6, 2025 at 9:05 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > On Thu, Feb 6, 2025 at 8:47 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> > > >
> > > > On 2/5/25 4:12 PM, Jason Xing wrote:
> > > > > On Thu, Feb 6, 2025 at 5:57 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> > > > >>
> > > > >> On 2/4/25 5:57 PM, Jakub Kicinski wrote:
> > > > >>> On Wed,  5 Feb 2025 02:30:22 +0800 Jason Xing wrote:
> > > > >>>> +    if (cgroup_bpf_enabled(CGROUP_SOCK_OPS) &&
> > > > >>>> +        SK_BPF_CB_FLAG_TEST(sk, SK_BPF_CB_TX_TIMESTAMPING) && skb) {
> > > > >>>> +            struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > > >>>> +            struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
> > > > >>>> +
> > > > >>>> +            tcb->txstamp_ack_bpf = 1;
> > > > >>>> +            shinfo->tx_flags |= SKBTX_BPF;
> > > > >>>> +            shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > > > >>>> +    }
> > > > >>>
> > > > >>> If a BPF program is attached, we'll timestamp all skbs? Am I reading this
> > > > >>> right?
> > > > >>
> > > > >> If the attached bpf program explicitly turns on the SK_BPF_CB_TX_TIMESTAMPING
> > > > >> bit of a sock, then all skbs of this sock will be tx timestamped.
> > > > >
> > > > > Martin, I'm afraid it doesn't work the way you expect. Only the last
> > > > > portion of the sendmsg reaches the above function, which means that if
> > > > > the sendmsg is large, only its last skb will have SKBTX_BPF set and be
> > > > > timestamped.
> > > >
> > > > Sure. The last skb of a large msg, and more skbs for small msgs (or MSG_EOR).
> > > >
> > > > My point is, attaching a bpf prog alone is not enough. The
> > > > SK_BPF_CB_TX_TIMESTAMPING bit still needs to be turned on.
> > >
> > > Right.
> > >
> > > >
> > > > >
> > > > >>
> > > > >>>
> > > > >>> Wouldn't it be better to let BPF_SOCK_OPS_TS_SND_CB return whether it's
> > > > >>> interested in tracing the current packet all the way through the stack?
> > > > >>
> > > > >> I like this idea. It can give the BPF prog a chance to do skb sampling on a
> > > > >> particular socket.
> > > > >>
> > > > >> The return value of BPF_SOCK_OPS_TS_SND_CB (like any cgroup BPF prog's return
> > > > >> value) already has another usage and is currently enforced by the verifier.
> > > > >> It is better not to convolute it further.
> > > > >>
> > > > >> I don't want to add more use cases to skops->reply either, which is a union
> > > > >> with args[4], such that later progs (in the cgrp prog array) may lose the args value.
> > > > >>
> > > > >> Jason, instead of always setting SKBTX_BPF and txstamp_ack_bpf in the kernel, a
> > > > >> new BPF kfunc can be added so that the BPF prog can call it to selectively set
> > > > >> SKBTX_BPF and txstamp_ack_bpf on some skbs.
> > > > >
> > > > > Agreed. At netdev 0x19 I have an explicit plan to share our company's
> > > > > experience of tracing all the completed skbs through a kernel module.
> > > > > That is how we use it in production, especially for debugging and
> > > > > diagnosis.
> > > >
> > > > This is fine. The bpf prog can still do that by calling the kfunc. I don't see
> > > > why moving the bit setting into a kfunc would make the whole set stop working.
> > > >
> > > > > I'm not knowledgeable enough about BPF, so are there some functions
> > > > > that I can take as good examples?
> > > > >
> > > > > I think it's a good standalone feature; can I handle it after this series?
> > > >
> > > > Unfortunately, no. Once the default is on, this cannot be changed.
> > > >
> > > > I think Jakub's suggestion to allow the bpf prog to selectively choose which
> > > > skbs to timestamp is useful, so I suggested a way to do it.
> > >
> > > Sorry, but I don't want to postpone this series any longer (blame me
> > > for delaying it by almost 4 months); I only want to focus on the
> > > extension for SO_TIMESTAMPING so that we can quickly move on with
> > > small changes per series.
> > >
> > > Couldn't selectively sampling skbs, as opposed to sampling all of
> > > them, be an optional feature for users rather than a mandatory one?
> > >
> > > There are two kinds of monitoring in production: 1) daily monitoring,
> > > and 2) diagnostic monitoring (I'm not sure I'm translating that the
> > > right way). The former is obviously meant to be light-weight: we don't
> > > need to trace that many skbs, and the last skb alone is enough, which
> > > is what was done at Google, because even the selective feature[1] is a
> > > little heavy. I have received complaints from a few latency-sensitive
> > > customers asking whether we can reduce the monitoring in the kernel
> > > because, as I mentioned before, many issues are caused by the
> > > application itself rather than the kernel.
> > >
> > > [1] The selective feature consists of two parts: selectively
> > > collecting all the skbs in a certain period, or selectively collecting
> > > exactly what SO_TIMESTAMPING does in a certain period. It might need a
> > > full discussion, I reckon.
> >
> > I presume you are referring to the former. It works like the cmsg
> > feature, which can serve as a good example of selective sampling. It
> > would be better to check the value of reply in the BPF_SOCK_OPS_TS_SND_CB
> > callback, which runs near the very beginning of each sendmsg syscall,
> > because I have a hunch we will add more hook points before the skb
> > enters the qdisc.
> >
> > I think we can split the whole idea into two parts: for now, because
> > the current series implements the same function as SO_TIMESTAMPING
> > does, I will implement the selective sampling feature in this series.
> > Someday, after we finish tracing all the skbs, we can add the
> > corresponding selective sampling feature for that as well.
>
> Are you saying that you will include selective sampling now or want to
> postpone it?

A few months ago, I planned to do it after this series. Since you are
all asking, though, it's not complex to include it in this series :)

Selective sampling has two possible meanings, as I mentioned above. In
the next re-spin I will implement the cmsg-like feature for the bpf
extension in this series; I'm testing it right now. The other small
selective sampling feature can wait until tracing all the skbs is
implemented, if possible.
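
For reference, the per-call SO_TIMESTAMPING cmsg mechanism that this
mirrors looks roughly like the sketch below from userspace (a minimal
sketch with error handling omitted; the reporting flags, e.g.
SOF_TIMESTAMPING_SOFTWARE, still have to be set once via setsockopt(),
only the generation flags can be passed per call):

#include <sys/socket.h>
#include <linux/net_tstamp.h>

/* Request a tx timestamp for this one sendmsg() call only, via cmsg,
 * instead of enabling timestamping for every write on the socket. */
static ssize_t send_sampled(int fd, void *buf, size_t len)
{
	union {
		char buf[CMSG_SPACE(sizeof(__u32))];
		struct cmsghdr align;
	} control = {};
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = control.buf,
		.msg_controllen = sizeof(control.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SO_TIMESTAMPING;
	cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
	*(__u32 *)CMSG_DATA(cmsg) = SOF_TIMESTAMPING_TX_ACK;

	return sendmsg(fd, &msg, 0);
}

The bpf extension would make the same per-call decision from the BPF
prog instead of from the application.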

>
> Jakub brought up a great point. Our continuous deployment would not be
> feasible without sampling, which is indeed implemented using cmsg.

Right, right. I just realized that I had misunderstood what Jakub proposed.

>
> I think it should be included from the initial patch series.

I agree to include it in this series. As I wrote in the previous
thread, it should be simple :) It will be exercised in the selftests as
well.
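
For the selftest side, what I have in mind is roughly the sockops
sketch below (assuming the SK_BPF_CB_FLAGS setsockopt plumbing this
series adds ends up in the headers; the port check is only an
illustrative sampling policy, and the per-skb selection would live in
BPF_SOCK_OPS_TS_SND_CB once the kfunc Martin suggested exists):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define SOL_SOCKET 1 /* not in vmlinux.h; same define the selftests use */

SEC("sockops")
int tx_tstamp_sample(struct bpf_sock_ops *skops)
{
	int flags = SK_BPF_CB_TX_TIMESTAMPING;

	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
		/* Sampling policy (illustrative): only opt in sockets on
		 * the port under diagnosis, so daily monitoring stays light. */
		if (skops->local_port == 4242)
			bpf_setsockopt(skops, SOL_SOCKET, SK_BPF_CB_FLAGS,
				       &flags, sizeof(flags));
		break;
	case BPF_SOCK_OPS_TS_SND_CB:
		/* Per-skb selection would go here once the kfunc lands. */
		break;
	}
	return 1;
}

char _license[] SEC("license") = "GPL";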

Thanks,
Jason
