[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+FuTSdR0yPwXAZZjziGOeujJ_Ac19fX1DsqfRXX3Dsn1uFPAQ@mail.gmail.com>
Date: Thu, 16 Dec 2021 10:32:23 -0500
From: Willem de Bruijn <willemb@...gle.com>
To: Martin KaFai Lau <kafai@...com>
Cc: netdev@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
David Miller <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, kernel-team@...com
Subject: Re: [RFC PATCH v2 net-next] net: Preserve skb delivery time during forward
h
On Wed, Dec 15, 2021 at 3:12 PM Martin KaFai Lau <kafai@...com> wrote:
>
> The skb->skb_mstamp_ns is used as the EDT (Earliest Department Time)
> in TCP. skb->skb_mstamp_ns is a union member of skb->tstamp.
>
> When the skb traveling veth and being forwarded like below, the skb->tstamp
> is reset to 0 at multiple points.
>
> (c: skb->tstamp = 0)
> vv
> tcp-sender => veth@...ns => veth@...tns(b: rx: skb->tstamp = real_clock) => fq@...0
> ^^
> (a: skb->tstamp = 0)
>
> (a) veth@...ns TX to veth@...tns:
> skb->tstamp (mono clock) is a EDT and it is in future time.
> Reset to 0 so that it won't skip the net_timestamp_check at the
> RX side in (b).
> (b) RX (netif_rx) in veth@...tns:
> net_timestamp_check puts a current time (real clock) in skb->tstamp.
> (c) veth@...tns forward to fq@...0:
> skb->tstamp is reset back to 0 again because fq is using
> mono clock.
>
> This leads to an unstable TCP throughput issue described by Daniel in [0].
>
> We also have a use case that a bpf runs at ingress@...h@...tns
> to set EDT in skb->tstamp to limit the bandwidth usage
> of a particular netns. This EDT currently also gets
> reset in step (c) as described above.
>
> Unlike RFC v1 trying to migrate rx tstamp to mono first,
> this patch is to preserve the EDT in skb->skb_mstamp_ns during forward.
Sucks we have to do this complex dance, because there is no room for
an skb->delivery_time.
> The idea is to temporarily store skb->skb_mstamp_ns during forward.
> skb_shinfo(skb)->hwtstamps is used as a temporary store and
> it is union-ed with the newly added "u64 tx_delivery_tstamp".
> hwtstamps should only be used when a packet is received or
> sent out of a hw device.
>
> During forward, skb->tstamp will be temporarily stored in
> skb_shinfo(skb)->tx_delivery_tstamp and a new bit
> (SKBTX_DELIVERY_TSTAMP) in skb_shinfo(skb)->tx_flags
> will also be set to tell tx_delivery_tstamp is in use.
> hwtstamps is accessed through the skb_hwtstamps() getter,
> so unlikely(tx_flags & SKBTX_DELIVERY_TSTAMP) can
> be tested in there and reset tx_delivery_tstamp to 0
> before hwtstamps is used.
>
> After moving the skb->tstamp to skb_shinfo(skb)->tx_delivery_tstamp,
> the skb->tstamp will still be reset to 0 during forward. Thus,
> on the RX side (__netif_receive_skb_core), all existing code paths
> will still get the received time in real clock and will work as-is.
>
> When this skb finally xmit-ing out in __dev_queue_xmit(),
> it will check the SKBTX_DELIVERY_TSTAMP bit in skb_shinfo(skb)->tx_flags
> and restore the skb->tstamp from skb_shinfo(skb)->tx_delivery_tstamp
> if needed. This bit test is done immediately after another existing
> bit test 'skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP'.
>
> Another bit SKBTX_DELIVERY_TSTAMP_ALLOW_FWD is added
> to skb_shinfo(skb)->tx_flags. It is used to specify
> the skb->tstamp is set as a delivery time and can be
> temporarily stored during forward. This bit is now set
> when EDT is stored in skb->skb_mstamp_ns in tcp_output.c
> This will avoid packet received from a NIC with real-clock
> in skb->tstamp being forwarded without reset.
>
> The change in af_packet.c is to avoid it calling skb_hwtstamps()
> which will reset the skb_shinfo(skb)->tx_delivery_tstamp.
> af_packet.c only wants to read the hwtstamps instead of
> storing a time in it, so a new read only getter skb_hwtstamps_ktime()
> is added. Otherwise, a tcpdump will trigger this code path
> and unnecessarily reset the EDT stored in tx_delivery_tstamp.
>
> [Note: not all skb->tstamp=0 reset has been changed in this RFC yet]
>
> [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
>
> Signed-off-by: Martin KaFai Lau <kafai@...com>
> ---
> include/linux/skbuff.h | 52 ++++++++++++++++++++++++++++++++++++++++-
> net/bridge/br_forward.c | 2 +-
> net/core/dev.c | 1 +
> net/core/filter.c | 6 ++---
> net/core/skbuff.c | 2 +-
> net/ipv4/ip_forward.c | 2 +-
> net/ipv4/tcp_output.c | 21 +++++++++++------
> net/ipv6/ip6_output.c | 2 +-
> net/packet/af_packet.c | 8 +++----
> 9 files changed, 77 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 6535294f6a48..9bf0a1e2a1bd 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -435,9 +435,17 @@ enum {
> /* device driver is going to provide hardware time stamp */
> SKBTX_IN_PROGRESS = 1 << 2,
>
> + /* shinfo stores a future tx_delivery_tstamp instead of hwtstamps */
> + SKBTX_DELIVERY_TSTAMP = 1 << 3,
> +
> /* generate wifi status information (where possible) */
> SKBTX_WIFI_STATUS = 1 << 4,
>
> + /* skb->tstamp stored a future delivery time which
> + * was set by a local sk and it can be fowarded.
> + */
> + SKBTX_DELIVERY_TSTAMP_ALLOW_FWD = 1 << 5,
> +
> /* generate software time stamp when entering packet scheduling */
> SKBTX_SCHED_TSTAMP = 1 << 6,
> };
> @@ -530,7 +538,14 @@ struct skb_shared_info {
> /* Warning: this field is not always filled in (UFO)! */
> unsigned short gso_segs;
> struct sk_buff *frag_list;
> - struct skb_shared_hwtstamps hwtstamps;
> + union {
> + /* If SKBTX_DELIVERY_TSTAMP is set in tx_flags,
> + * tx_delivery_tstamp is stored instead of
> + * hwtstamps.
> + */
Should we just encode the timebase and/or type { timestamp,
delivery_time } in th lower bits of the timestamp field? Its
resolution is higher than actual clock precision.
> + struct skb_shared_hwtstamps hwtstamps;
> + u64 tx_delivery_tstamp;
> + };
> unsigned int gso_type;
> u32 tskey;
>
> @@ -1463,9 +1478,44 @@ static inline unsigned int skb_end_offset(const struct sk_buff *skb)
>
> static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
> {
> + if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
> + skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
> + skb_shinfo(skb)->tx_delivery_tstamp = 0;
> + }
> return &skb_shinfo(skb)->hwtstamps;
> }
>
> +/* Caller only needs to read the hwtstamps as ktime.
> + * To update hwtstamps, HW device driver should call the writable
> + * version skb_hwtstamps() that returns a pointer.
> + */
> +static inline ktime_t skb_hwtstamps_ktime(const struct sk_buff *skb)
> +{
> + if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP))
> + return 0;
> + return skb_shinfo(skb)->hwtstamps.hwtstamp;
> +}
> +
> +static inline void skb_scrub_tstamp(struct sk_buff *skb)
skb_save_delivery_time?
is non-zero skb->tstamp test not sufficient, instead of
SKBTX_DELIVERY_TSTAMP_ALLOW_FWD.
It is if only called on the egress path. Is bpf on ingress the only
reason for this?
> +{
> + if (skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP_ALLOW_FWD) {
> + skb_shinfo(skb)->tx_delivery_tstamp = skb->tstamp;
> + skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP;
> + skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> + }
Is this only called when there are no clones/shares?
> + skb->tstamp = 0;
> +}
> +
> +static inline void skb_restore_delivery_time(struct sk_buff *skb)
> +{
> + if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_DELIVERY_TSTAMP)) {
> + skb->tstamp = skb_shinfo(skb)->tx_delivery_tstamp;
> + skb_shinfo(skb)->tx_delivery_tstamp = 0;
> + skb_shinfo(skb)->tx_flags &= ~SKBTX_DELIVERY_TSTAMP;
> + skb_shinfo(skb)->tx_flags |= SKBTX_DELIVERY_TSTAMP_ALLOW_FWD;
> + }
> +}
> +
> static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
> {
> bool is_zcopy = skb && skb_shinfo(skb)->flags & SKBFL_ZEROCOPY_ENABLE;
> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> index ec646656dbf1..a3ba6195f2e3 100644
> --- a/net/bridge/br_forward.c
> +++ b/net/bridge/br_forward.c
> @@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(br_dev_queue_push_xmit);
>
> int br_forward_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
> {
> - skb->tstamp = 0;
> + skb_scrub_tstamp(skb);
> return NF_HOOK(NFPROTO_BRIDGE, NF_BR_POST_ROUTING,
> net, sk, skb, NULL, skb->dev,
> br_dev_queue_push_xmit);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a855e41bbe39..e9e7de758cba 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
Powered by blists - more mailing lists