[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL+tcoD6fqrDhYDCFkuSuy-HgORo-qxLLwm+=WQqdQA1=C_S3w@mail.gmail.com>
Date: Fri, 1 Nov 2024 15:47:05 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Martin KaFai Lau <martin.lau@...ux.dev>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, willemb@...gle.com,
davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com,
dsahern@...nel.org, ast@...nel.org, daniel@...earbox.net, andrii@...nel.org,
eddyz87@...il.com, song@...nel.org, yonghong.song@...ux.dev,
john.fastabend@...il.com, kpsingh@...nel.org, sdf@...ichev.me,
haoluo@...gle.com, jolsa@...nel.org, shuah@...nel.org, ykolal@...com,
bpf@...r.kernel.org, netdev@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH net-next v3 02/14] net-timestamp: allow two features to
work parallelly
On Fri, Nov 1, 2024 at 7:26 AM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
>
> On 10/31/24 6:50 AM, Jason Xing wrote:
> > On Thu, Oct 31, 2024 at 8:30 PM Willem de Bruijn
> > <willemdebruijn.kernel@...il.com> wrote:
> >>
> >> Jason Xing wrote:
> >>> On Thu, Oct 31, 2024 at 2:27 PM Martin KaFai Lau <martin.lau@...ux.dev> wrote:
> >>>>
> >>>> On 10/30/24 5:13 PM, Jason Xing wrote:
> >>>>> I realized that we will have some new sock_opt flags like
> >>>>> TS_SCHED_OPT_CB in patch 4, so we can control whether to print or
> >>>>> not... For each sock_opt point, they will be called without caring if
> >>>>> related flags in skb are set. Well, it's meaningless to add more
> >>>>> control of skb tsflags at each TS_xx_OPT_CB point.
> >>>>>
> >>>>> Am I understanding in a correct way? Now, I'm totally fine with this great idea!
> >>>> Yes, I think so.
> >>>>
> >>>> The sockops prog can choose to ignore any BPF_SOCK_OPS_TS_*_CB. The are only 3:
> >>>> SCHED, SND, and ACK. If the hwtstamp is available from a NIC, I think it would
> >>>> be quite wasteful to throw it away. ACK can be controlled by the
> >>>> TCP_SKB_CB(skb)->bpf_txstamp_ack.
> >>>
> >>> Right, let me try this:)
> >>>
> >>>> Going back to my earlier bpf_setsockopt(SOL_SOCKET, BPF_TX_TIMESTAMPING)
> >>>> comment. I think it may as well go back to use the "u8
> >>>> bpf_sock_ops_cb_flags;" and use the bpf_sock_ops_cb_flags_set() helper to
> >>>> enable/disable the timestamp related callback hook. May be add one
> >>>> BPF_SOCK_OPS_TX_TIMESTAMPING_CB_FLAG.
> >>>
> >>> bpf_sock_ops_cb_flags this flag is only used in TCP condition, right?
> >>> If that is so, it cannot be suitable for UDP.
> >>>
> >>> I'm thinking of this solution:
> >>> 1) adding a new flag in SOF_TIMESTAMPING_OPT_BPF flag (in
> >>> include/uapi/linux/net_tstamp.h) which can be used by sk->sk_tsflags
>
> probably not in include/uapi/linux/net_tstamp.h. This flag can only be used by a
> bpf prog (meaning will not be used by user space syscall). More below.
>
> >>> 2) flags = SOF_TIMESTAMPING_OPT_BPF; bpf_setsockopt(skops,
> >>> SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));
> >>> 3) test if sk->sk_tsflags has this new flag in tcp_tx_timestamp() or
> >>> in udp_sendmsg()
> >>> ...
>
> Not sure how many churns/audits is needed to ensure the user space cannot
> set/clear the SOF_TIMESTAMPING_OPT_BPF bit in sk->sk_tsflags. Could be not much.
>
> May be it is cleaner to leave the sk->sk_tsflags for user space only and having
> a separate field in "struct sock" to track bpf specific needs. More like your
> current sk_tsflags_bpf approach but I was thinking to reuse the
> bpf_sock_ops_cb_flags instead. e.g. "BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)" is used to check if it needs to call a bpf
> prog to decide if it needs to add tcp header option. Here we want to test if it
> should call a bpf prog to make a decision on tx timestamp on a skb.
>
> The bpf_sock_ops_cb_flags can be moved from struct tcp_sock to struct sock. It
> is doable from the bpf side.
>
> All that said, but, yes, it will add some TCP specific enum flag (e.g.
> BPF_SOCK_OPS_RTO_CB_FLAG) to the struct sock which will not be used by
> UDP/raw/...etc, so may be keep your current sk_tsflags_bpf approach but rename
> it to sk_bpf_cb_flags in struct "sock" so that it can be reused for other non
> tstamp ops in the future? probably a u8 is enough.
Thanks so much for the details.
>
> This optname is used by the bpf prog only and not usable by user space syscall.
> If it prefers to stay with bpf_setsockopt (which is fine), it needs a bpf
> specific optname like the current TCP_BPF_SOCK_OPS_CB_FLAGS which currently sets
> the tp->bpf_sock_ops_cb_flags. May be a new SK_BPF_CB_FLAGS optname for setting
> the sk->sk_bpf_cb_flags, like bpf_setsockopt(skops_ctx, SOL_SOCKET,
> SK_BPF_CB_FLAGS, &val, sizeof(val)) and handle it in the sol_socket_sockopt()
> alone without calling into sk_{set,get}sockopt. Add a new enum for the optval
> for the sk_bpf_cb_flags:
>
> enum {
> SK_BPF_CB_TX_TIMESTAMPING = (1 << 0),
> SK_BPF_CB_RX_TIEMSTAMPING = (1 << 1),
> };
Then it will involve more strange modification in sol_socket_sockopt()
to retrieve the opt value like what I did in V2 (see
https://lore.kernel.org/all/20241012040651.95616-3-kerneljasonxing@gmail.com/).
It's the reason why I did set and get operation in
sk_{set,get}sockopt() in this series to keep align with other flags.
Handling it in sk_{set,get}sockopt() is not a bad idea and easy to
implement, I feel.
Overall the suggestion looks good to me. I can give it a try :)
I'm thinking of another approach to using bpf_sock_ops_cb_flags_set()
instead of bpf_setsockopt() when sockops like
BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB is triggered. I can modify the
bpf_sock_ops_cb_flags_set like this:
diff --git a/net/core/filter.c b/net/core/filter.c
index 58761263176c..001140067c1a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5770,14 +5770,25 @@ BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct
bpf_sock_ops_kern *, bpf_sock,
int, argval)
{
struct sock *sk = bpf_sock->sk;
- int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+ int val = argval;
- if (!IS_ENABLED(CONFIG_INET) || !sk_fullsock(sk))
+ if (!IS_ENABLED(CONFIG_INET))
return -EINVAL;
- tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+ if (sk_is_tcp(sk)) {
+ val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+ if (!sk_fullsock(sk))
+ return -EINVAL;
+
+ tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+ val = argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+ } else {
+ sk->bpf_sock_ops_cb_flags = val;
+ val = argval &
(~(SK_BPF_CB_TX_TIEMSTAMPING|SK_BPF_CB_RX_TIEMSTAMPING));
+ }
- return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+ return val;
}
The BPF program uses bpf_sock_ops_cb_flags_set(skops,
SK_BPF_CB_FLAGS); to set the flags. Then we can implement a similar
function like BPF_SOCK_OPS_TEST_FLAG() in tcp_tx_timestamp() to check
if we are allowed to set shinfo->tx_flags |= SKBTX_BPF.
One advantage of this approach is that the bpf_sock_ops_cb_flags_set()
could be extended for more than only TCP in the future. Admittedly,
this will involve more work.
Which way would you prefer?
>
>
> >>>
> >>>>
> >>>> For tx, one new hook should be at the sendmsg and should be around
> >>>> tcp_tx_timestamp (?) for tcp. Another hook is __skb_tstamp_tx() which should be
> >>>
> >>> I think there are two points we're supposed to record:
> >>> 1) the moment tcp/udp_sendmsg() is triggered. It represents the syscall time.
> >>> 2) another point in tcp_tx_timestamp(). It represents the timestamp of
> >>> the last skb in this sendmsg() call.
> >>> Users may happen to send a big packet.
>
> hmm... a big packet and sendmsg is blocked waiting for memory?
>
> >>
> >> Err on the side of fewer measurement points. It's always possible to
> >> add more later, but not possible to remove them (depending on whether
> >> BPF infra is ABI).
>
> I also think it is better to start with tcp_tx_timestamp() alone first to keep
> the patch set simple now. The selftest prog can use a bpf fentry prog to trace
> the tcp_sendmsg_locked(). This can be revisited later if the bpf fentry prog is
> not enough.
>
> >>
> >> Overall great suggestion. Thanks a lot for sharing your BPF expertise
> >> on this, Martin.
>
> Thanks!
>
> >>
> >> On using the raw seqno: this data is accessible to anyone root in
> >> namespace (ns_capable) using packet sockets, so as long as it does not
> >> open to more than that, it is logically equivalent to the current
> >> setting.
> >>
> >> With seqno the BPF program has to be careful that the same seqno can
> >> be retransmitted, so for instance seeing an ACK before a (second) SND
> >> must be anticipated. That is true for SO_TIMESTAMPING today too.
>
> Ah. It will be a very useful comment to add to the selftests bpf prog.
>
> >>
> >> For datagrams (UDP as well as RAW and many non IP protocols), an
> >> alternative still needs to be found.
>
> In udp/raw/..., I don't know how likely is the user space having "cork->tx_flags
> & SKBTX_ANY_TSTAMP" set but has neither "READ_ONCE(sk->sk_tsflags) &
> SOF_TIMESTAMPING_OPT_ID" nor "cork->flags & IPCORK_TS_OPT_ID" set. If it is
> unlikely, may be we can just disallow bpf prog from directly setting
> skb_shinfo(skb)->tskey for this particular skb.
>
> For all other cases, in __ip[6]_append_data, directly call a bpf prog and also
> pass the kernel decided tskey to the bpf prog.
I'm a bit confused here. IIUC, we need to support the tskey like what
we did in this series to handle non TCP cases?
I think I can keep those three patches related to tskey to support
both TCP and non-TCP cases. Then let the bpf program decide to use
tskey.
>
> The kernel passed tskey could be 0 (meaning the user space has not used it). The
> bpf prog can give one for the kernel to use. The bpf prog can store the
> sk_tskey_bpf in the bpf_sk_storage now. Meaning no need to add one to the struct
> sock. The bpf prog does not have to start from 0 (e.g. start from U32_MAX
> instead) if it helps.
>
> If the kernel passed tskey is not 0, the bpf prog can just use that one
> (assuming the user space is doing something sane, like the value in
> SCM_TS_OPT_ID won't be jumping back and front between 0 to U32_MAX). I hope this
> is very unlikely also (?) but the bpf prog can probably detect this and choose
> to ignore this sk.
>
> To solve the above unsupported corner cases, I think we can allow the bpf prog
> to store something in the shinfo->hwtstamps at the tx path. The bpf-only key
> could be one of the things to store there. Change __ip[6]_append_data to handle
> the shinfo->hwtstamps. I think allowing the bpf prog to write to the
> shinfo->hwtsatmps could be considered later when needed.
>
> [ I may be off tomorrow, so reply could be slower. ]
Thanks for your help!
Thanks,
Jason
Powered by blists - more mailing lists