Message-ID: <CAO3-PbpmS8=gTSb84X8wV4NiCA8JNXrEXYONJKmc6RoM4QQwYg@mail.gmail.com>
Date: Mon, 24 Jun 2024 12:49:20 -0500
From: Yan Zhai <yan@...udflare.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
kernel-team <kernel-team@...udflare.com>
Subject: Re: [RFC net-next 1/9] skb: introduce gro_disabled bit
On Mon, Jun 24, 2024 at 8:30 AM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> On 6/23/24 10:23 AM, Willem de Bruijn wrote:
> > Yan Zhai wrote:
> >> On Fri, Jun 21, 2024 at 11:41 AM Daniel Borkmann <daniel@...earbox.net> wrote:
> >>> On 6/21/24 6:00 PM, Yan Zhai wrote:
> >>>> On Fri, Jun 21, 2024 at 8:13 AM Daniel Borkmann <daniel@...earbox.net> wrote:
> >>>>> On 6/21/24 2:15 PM, Willem de Bruijn wrote:
> >>>>>> Yan Zhai wrote:
> >>>>>>> Software GRO is currently controlled by a single switch, i.e.
> >>>>>>>
> >>>>>>> ethtool -K dev gro on|off
> >>>>>>>
> >>>>>>> However, this is not always desired. When GRO is enabled, even traffic
> >>>>>>> that the kernel cannot GRO still has to run through the GRO receive
> >>>>>>> handlers, with no benefit.
> >>>>>>>
> >>>>>>> There are also scenarios where turning off GRO is a requirement. For
> >>>>>>> example, in our production environment a TC egress hook may add
> >>>>>>> multiple encapsulation headers to forwarded skbs for load balancing
> >>>>>>> and isolation purposes. The encapsulation is implemented via BPF. But
> >>>>>>> this creates a problem: there is no way to properly offload a
> >>>>>>> double-encapsulated packet, since the skb only has network_header and
> >>>>>>> inner_network_header to track one layer of encapsulation, not two. On
> >>>>>>> the other hand, not all traffic through this device needs double
> >>>>>>> encapsulation, yet we have to turn off GRO completely on every ingress
> >>>>>>> device as a result.
> >>>>>>>
> >>>>>>> Introduce a bit on the skb so that the GRO engine can be told to skip
> >>>>>>> GRO for this particular skb, rather than GRO being all-or-nothing for
> >>>>>>> all traffic.
> >>>>>>>
> >>>>>>> Signed-off-by: Yan Zhai <yan@...udflare.com>
> >>>>>>> ---
> >>>>>>>   include/linux/netdevice.h |  9 +++++++--
> >>>>>>>   include/linux/skbuff.h    | 10 ++++++++++
> >>>>>>>   net/Kconfig               | 10 ++++++++++
> >>>>>>>   net/core/gro.c            |  2 +-
> >>>>>>>   net/core/gro_cells.c      |  2 +-
> >>>>>>>   net/core/skbuff.c         |  4 ++++
> >>>>>>>   6 files changed, 33 insertions(+), 4 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >>>>>>> index c83b390191d4..2ca0870b1221 100644
> >>>>>>> --- a/include/linux/netdevice.h
> >>>>>>> +++ b/include/linux/netdevice.h
> >>>>>>> @@ -2415,11 +2415,16 @@ struct net_device {
> >>>>>>> ((dev)->devlink_port = (port)); \
> >>>>>>> })
> >>>>>>>
> >>>>>>> -static inline bool netif_elide_gro(const struct net_device *dev)
> >>>>>>> +static inline bool netif_elide_gro(const struct sk_buff *skb)
> >>>>>>> {
> >>>>>>> - if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
> >>>>>>> + if (!(skb->dev->features & NETIF_F_GRO) || skb->dev->xdp_prog)
> >>>>>>> return true;
> >>>>>>> +
> >>>>>>> +#ifdef CONFIG_SKB_GRO_CONTROL
> >>>>>>> + return skb->gro_disabled;
> >>>>>>> +#else
> >>>>>>> return false;
> >>>>>>> +#endif
> >>>>>>
> >>>>>> Yet more branches in the hot path.
> >>>>>>
> >>>>>> Compile-time configurability does not help, as that will be
> >>>>>> enabled by distros.
> >>>>>>
> >>>>>> And all for a fairly niche use case, where GRO functionality already
> >>>>>> works. So this is just a performance gain for a very rare case, at
> >>>>>> the cost of a regression in the common case. A small regression
> >>>>>> perhaps, but death by a thousand cuts.
> >>>>>
> >>>>> Mentioning it here b/c it perhaps fits in this context: a while ago
> >>>>> there was the idea of having BPF operate as the GRO engine, which
> >>>>> might also help to reduce the attack surface by only having to handle
> >>>>> packets of interest for the concrete production use case. Perhaps here
> >>>>> the metadata buffer could be used to pass a notification from XDP to
> >>>>> exit early w/o aggregation.
> >>>>
> >>>> Metadata is in fact one of our interests as well. We discussed using
> >>>> metadata instead of an skb bit to carry this information internally.
> >>>> Since metadata is opaque atm, it seems the only option is to have a
> >>>> GRO control hook before napi_gro_receive and let BPF decide between
> >>>> netif_receive_skb and napi_gro_receive (echoing what Paolo said).
> >>>> With BPF it could indeed be more flexible, but the downside is that
> >>>> it could be even slower than taking a bit on the skb. I am actually
> >>>> open to either approach, as long as it gives us more control over
> >>>> when to enable GRO :)
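For concreteness, such a dispatch point might look roughly like this;
bpf_gro_control_run() and GRO_CTL_SKIP are hypothetical names for a hook
that does not exist today:

    /* Hypothetical per-skb GRO control hook in the rx path: a BPF
     * program decides whether this skb bypasses GRO entirely.
     */
    if (bpf_gro_control_run(skb) == GRO_CTL_SKIP)   /* hypothetical */
            netif_receive_skb(skb);                 /* skip GRO */
    else
            napi_gro_receive(napi, skb);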
> >>>
> >>> Oh wait, one thing that just came to mind.. have you tried a u64
> >>> per-CPU counter map in XDP? For packets which should not be
> >>> GRO-aggregated you write count++ into the metadata area, and this
> >>> forces GRO not to aggregate, since the metadata that needs to be
> >>> transported to the tc BPF layer mismatches (and the contract/intent is
> >>> therefore that tc BPF needs to see the different metadata passed to it).
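For concreteness, that suggestion might look roughly like the XDP program
below (an untested sketch; the map and program names are made up):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Per-CPU counter whose value is copied into the metadata area,
     * so that consecutive packets carry differing metadata.
     */
    struct {
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, __u64);
    } no_gro_cnt SEC(".maps");

    SEC("xdp")
    int xdp_no_gro(struct xdp_md *ctx)
    {
            __u32 key = 0;
            __u64 *cnt;
            void *data, *meta;

            /* ... decide here whether this packet should skip GRO ... */

            /* Reserve 8 bytes of metadata in front of the packet. */
            if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u64)))
                    return XDP_PASS;

            data = (void *)(long)ctx->data;
            meta = (void *)(long)ctx->data_meta;
            if (meta + sizeof(__u64) > data) /* bounds check for verifier */
                    return XDP_PASS;

            cnt = bpf_map_lookup_elem(&no_gro_cnt, &key);
            if (!cnt)
                    return XDP_PASS;

            /* A monotonically increasing value makes the metadata of
             * consecutive packets differ, with the intent of keeping
             * GRO from aggregating them.
             */
            *(__u64 *)meta = (*cnt)++;

            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";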
> >>
> >> We did this accidentally before (we put a timestamp for debugging
> >> purposes in the metadata), and it actually caused about 20% out-of-order
> >> TCP packets in production: all PSH packets were reordered. When GRO
> >> finds a metadata diff for a non-PSH packet, it does not deliver the
> >> packet to the upper layer; instead the packet is queued as a “new flow”
> >> on the GRO list and waits to be flushed. When a PSH packet arrives, its
> >> semantics are to be flushed immediately, so it overtakes earlier packets
> >> of the same flow.
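For reference, this behaviour comes from the flow-matching step in
net/core/gro.c; paraphrased and abridged:

    /* net/core/gro.c, gro_list_prepare(), paraphrased */
    list_for_each_entry(p, head, list) {
            unsigned long diffs;

            diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
            if (!diffs)
                    diffs = skb_metadata_differs(p, skb);
            /* A metadata mismatch only marks the entry as "not the
             * same flow": the new skb is then queued as its own entry
             * on the GRO list, and nothing flushes the existing one.
             */
            NAPI_GRO_CB(p)->same_flow = !diffs;
    }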
> >
> > Is that a bug in XDP metadata handling for GRO?
> >
> > Mismatching metadata should not be taken as separate flows, but as a
> > flush condition.
>
> Definitely a bug, as it should flush. If no one is faster, I can add it
> to my backlog to fix, but it will probably take a week before I get to it.
>
In theory we should flush if the same flow has different metadata.
However, in this case "same flow" is not fully confirmed until GRO
proceeds to the TCP layer. So to achieve these flush semantics, metadata
would have to be compared in the leaf protocol handlers instead of up
front inside dev_gro_receive. I am not quite sure whether that is
worthwhile, or whether we should just allow skipping GRO from XDP, since
it is the XDP program that sets this metadata.
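Roughly, in the spirit of tcp_gro_receive() once the flow match is
confirmed, the comparison could look like this (an illustration only,
not a tested patch):

    /* Hypothetical: at this point the flow is confirmed, so treat a
     * metadata mismatch as a flush condition for the matched entry
     * rather than starting a new GRO list entry.
     */
    if (skb_metadata_differs(p, skb))
            NAPI_GRO_CB(p)->flush = 1;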
Yan
> Thanks,
> Daniel