[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL+tcoBU1FSVvhiTmapX=Byhv8W0T1f8oFR4sAZ1g4xONgSPrg@mail.gmail.com>
Date: Wed, 16 Oct 2024 14:08:22 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
pabeni@...hat.com, dsahern@...nel.org, willemdebruijn.kernel@...il.com,
willemb@...gle.com, ast@...nel.org, daniel@...earbox.net, andrii@...nel.org,
eddyz87@...il.com, song@...nel.org, yonghong.song@...ux.dev,
john.fastabend@...il.com, kpsingh@...nel.org, sdf@...ichev.me,
haoluo@...gle.com, jolsa@...nel.org, bpf@...r.kernel.org,
netdev@...r.kernel.org, Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next v2 06/12] net-timestamp: introduce
TS_SCHED_OPT_CB to generate dev xmit timestamp
On Wed, Oct 16, 2024 at 1:35 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>
> On 10/15/24 6:24 PM, Jason Xing wrote:
> > On Wed, Oct 16, 2024 at 9:01 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> >>
> >> On 10/11/24 9:06 PM, Jason Xing wrote:
> >>> From: Jason Xing <kernelxing@...cent.com>
> >>>
> >>> Introduce BPF_SOCK_OPS_TS_SCHED_OPT_CB flag so that we can decide to
> >>> print timestamps when the skb just passes the dev layer.
> >>>
> >>> Signed-off-by: Jason Xing <kernelxing@...cent.com>
> >>> ---
> >>> include/uapi/linux/bpf.h | 5 +++++
> >>> net/core/skbuff.c | 17 +++++++++++++++--
> >>> tools/include/uapi/linux/bpf.h | 5 +++++
> >>> 3 files changed, 25 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >>> index 157e139ed6fc..3cf3c9c896c7 100644
> >>> --- a/include/uapi/linux/bpf.h
> >>> +++ b/include/uapi/linux/bpf.h
> >>> @@ -7019,6 +7019,11 @@ enum {
> >>> * by the kernel or the
> >>> * earlier bpf-progs.
> >>> */
> >>> + BPF_SOCK_OPS_TS_SCHED_OPT_CB, /* Called when skb is passing through
> >>> + * dev layer when SO_TIMESTAMPING
> >>> + * feature is on. It indicates the
> >>> + * recorded timestamp.
> >>> + */
> >>> };
> >>>
> >>> /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
> >>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>> index 3a4110d0f983..16e7bdc1eacb 100644
> >>> --- a/net/core/skbuff.c
> >>> +++ b/net/core/skbuff.c
> >>> @@ -5632,8 +5632,21 @@ static void bpf_skb_tstamp_tx_output(struct sock *sk, int tstype)
> >>> return;
> >>>
> >>> tp = tcp_sk(sk);
> >>> - if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_TX_TIMESTAMPING_OPT_CB_FLAG))
> >>> - return;
> >>> + if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_TX_TIMESTAMPING_OPT_CB_FLAG)) {
> >>> + struct timespec64 tstamp;
> >>> + u32 cb_flag;
> >>> +
> >>> + switch (tstype) {
> >>> + case SCM_TSTAMP_SCHED:
> >>> + cb_flag = BPF_SOCK_OPS_TS_SCHED_OPT_CB;
> >>> + break;
> >>> + default:
> >>> + return;
> >>> + }
> >>> +
> >>> + tstamp = ktime_to_timespec64(ktime_get_real());
> >>> + tcp_call_bpf_2arg(sk, cb_flag, tstamp.tv_sec, tstamp.tv_nsec);
> >>
> >> There is bpf_ktime_get_*() helper. The bpf prog can directly call the
> >> bpf_ktime_get_* helper and use whatever clock it sees fit instead of enforcing
> >> real clock here and doing an extra ktime_to_timespec64. Right now the
> >> bpf_ktime_get_*() does not have real clock which I think it can be added.
> >
> > In this way, there is no need to add tcp_call_bpf_*arg() to pass
> > timestamp to userspace, right? Let the bpf program implement it.
> >
> > Now I wonder what information I should pass? Sorry for the lack of BPF
> > related knowledge :(
>
> Just pass the cb_flag op in this case.
I see. I saw one example just passing a NULL parameter:
tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL);.
>
> A bpf selftest is missing in this series to show how it is going to be used.
Sorry, I didn't implement a standard selftest, but I wrote a full BPF
program in patch[0/12]. I planned to write a selftests after every
expert agrees the current approach.
> Yes, there are existing socket API tests on time stamping but I believe this
> discussion has already shown some subtle differences that warrant a closer to
> real world bpf prog example first.
>
> >
> >>
> >> I think overall the tstamp reporting interface does not necessarily have to
> >> follow the socket API. The bpf prog is running in the kernel. It could pass
> >> other information to the bpf prog if it sees fit. e.g. the bpf prog could also
> >> get the original transmitted tcp skb if it is useful.
> >
> > Good to know that! But how the BPF program parses the skb by using
> > tcp_call_bpf_2arg() which only passes u32 parameters.
>
> "struct skbuff *skb" has already been added to "struct bpf_sock_ops_kern". It is
> only assigned during the "BPF_SOCK_OPS_PARSE_*HDR_CB". It is not exposed
> directly to bpf prog but it could be. However, it may need to change some
> convert_ctx code in filter.c which I am not excited about. We haven't added
> convert_ctx changes for a while since it is the old way.
>
> Together with the "u32 bpf_sock_ops_cb_flags;" change in patch 9 which is only
> for tcp_sock and other _CB flags are also tcp specific only. For now, I am not
Right, the first move I made is to make TCP work.
> sure carrying this sockops to the future UDP support is desired.
I hope so. But it's not an urgent thing that needs to be done recently.
>
> Take a look at tcp_call_bpf(). It needs to initialize the whole "struct
> bpf_sock_ops_kern" regardless of what the bpf prog is needed before calling the
> bpf prog. The "u32 args[4]" is one of them. The is the older way of using bpf to
> extend kernel.
I see.
>
> bpf has struct_ops support now which can pass only what is needed and without
> the need of doing the convert_ctx in filter.c. The "struct tcp_congestion_ops"
> can already be implemented in bpf. Take a look at
> selftests/bpf/progs/bpf_cubic.c. All the BPF_SOCK_OPS_*_CB (e.g.
> BPF_SOCK_OPS_TS_SCHED_OPT_CB here) could just a "ops" in the struct_ops.
Interesting, but it seems this way is much more complex than the
current approach.
>
> That said, I think the first thing needs to figure out is how to enable bpf time
> stamping without having side effect on the user space.
In the next version, I will avoid affecting the cmsg case, so no more
side effects I think.
> Continue the sockops approach first
I'm a little hesitant to do so because it looks like we will introduce
more codes. Please let me investigate more :)
> and use it to create a selftest bpf prog example. Then we can decide.
I copy the BPF program from patch [0/12], please take a look and help
me review this:
---
Here is the test output:
1) receive path
iperf3-987305 [008] ...11 179955.200990: bpf_trace_printk: rx: port:
5201:55192, swtimestamp: 1728167973,670426346, hwtimestamp: 0,0
2) xmit path
iperf3-19765 [013] ...11 2021.329602: bpf_trace_printk: tx: port:
47528:5201, key: 1036, timestamp: 1728357067,436678584
iperf3-19765 [013] b..11 2021.329611: bpf_trace_printk: tx: port:
47528:5201, key: 1036, timestamp: 1728357067,436689976
iperf3-19765 [013] ...11 2021.329622: bpf_trace_printk: tx: port:
47528:5201, key: 1036, timestamp: 1728357067,436700739
Here is the full bpf program:
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#include <uapi/linux/net_tstamp.h>
int _version SEC("version") = 1;
char _license[] SEC("license") = "GPL";
# define SO_TIMESTAMPING 37
__section("sockops")
int set_initial_rto(struct bpf_sock_ops *skops)
{
int op = (int) skops->op;
u32 sport = 0, dport = 0;
int flags;
switch (op) {
//case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
case BPF_SOCK_OPS_TCP_CONNECT_CB:
case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
flags = SOF_TIMESTAMPING_RX_SOFTWARE |
SOF_TIMESTAMPING_TX_SCHED |
SOF_TIMESTAMPING_TX_ACK | SOF_TIMESTAMPING_TX_SOFTWARE |
SOF_TIMESTAMPING_OPT_ID | SOF_TIMESTAMPING_OPT_ID_TCP;
bpf_setsockopt(skops, SOL_SOCKET, SO_TIMESTAMPING,
&flags, sizeof(flags));
bpf_sock_ops_cb_flags_set(skops,
BPF_SOCK_OPS_TX_TIMESTAMPING_OPT_CB_FLAG|BPF_SOCK_OPS_RX_TIMESTAMPING_OPT_CB_FLAG);
break;
case BPF_SOCK_OPS_TS_SCHED_OPT_CB:
case BPF_SOCK_OPS_TS_SW_OPT_CB:
case BPF_SOCK_OPS_TS_ACK_OPT_CB:
dport = bpf_ntohl(skops->remote_port);
sport = skops->local_port;
bpf_printk("tx: port: %u:%u, key: %u, timestamp: %u,%u\n",
sport, dport, skops->args[0],
skops->args[1], skops->args[2]);
break;
case BPF_SOCK_OPS_TS_RX_OPT_CB:
dport = bpf_ntohl(skops->remote_port);
sport = skops->local_port;
bpf_printk("rx: port: %u:%u, swtimestamp: %u,%u,
hwtimestamp: %u,%u\n",
sport, dport, skops->args[0],
skops->args[1], skops->args[2], skops->args[3]);
break;
}
return 1;
}
---
What is your opinion on the above?
Thanks,
Jason
Powered by blists - more mailing lists