Message-ID: <CAADnVQJPgpo7J0qVTQJYYocZ=Jnw=O5GfN2=PyAQ55+WWG_DVg@mail.gmail.com>
Date: Mon, 31 Jul 2023 18:03:26 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Larysa Zaremba <larysa.zaremba@...el.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com>, bpf <bpf@...r.kernel.org>,
Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, Song Liu <song@...nel.org>,
Yonghong Song <yhs@...com>, John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...gle.com>, Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
David Ahern <dsahern@...il.com>, Jakub Kicinski <kuba@...nel.org>,
Willem de Bruijn <willemb@...gle.com>, Jesper Dangaard Brouer <brouer@...hat.com>,
Anatoly Burakov <anatoly.burakov@...el.com>, Alexander Lobakin <alexandr.lobakin@...el.com>,
Magnus Karlsson <magnus.karlsson@...il.com>, Maryam Tahhan <mtahhan@...hat.com>,
xdp-hints@...-project.net, Network Development <netdev@...r.kernel.org>,
Simon Horman <simon.horman@...igine.com>
Subject: Re: [PATCH bpf-next v4 12/21] xdp: Add checksum hint
On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@...el.com> wrote:
>
> On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote:
> > Alexei Starovoitov wrote:
> > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> > > <willemdebruijn.kernel@...il.com> wrote:
> > > >
> > > > Alexei Starovoitov wrote:
> > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > > > >
> > > > > > +union xdp_csum_info {
> > > > > > + /* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > > > + * valid, but was never calculated, TX device has to do this,
> > > > > > + * starting from csum_start packet byte.
> > > > > > + * Any preceding checksums are also considered valid.
> > > > > > + * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > > > + */
> > > > > > + struct {
> > > > > > + u16 csum_start;
> > > > > > + u16 csum_offset;
> > > > > > + };
> > > > > > +
> > > > >
> > > > > CHECKSUM_PARTIAL makes sense on TX, but this RX. I don't see in the above.
> > > >
> > > > It can be observed on RX when packets are looped.
> > > >
> > > > This may be observed even in XDP on veth.
> > >
> > > veth and XDP is a broken combination. GSO packets coming out of containers
> > > cannot be parsed properly by XDP.
> > > It was added mainly for testing. Just like "generic XDP".
> > > bpf progs at skb layer is much better fit for veth.
> >
> > Ok. Still, seems forward looking and little cost to define the
> > constant?
> >
>
> +1
> CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change
> anything from the perspective of the user that does not use it, so I think it is
> worth having.

"little cost to define the constant".
Not really. A constant in UAPI is a heavy burden.

> > > > > > + /* Checksum, calculated over the whole packet.
> > > > > > + * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > > > > + */
> > > > > > + u32 checksum;
> > > > >
> > > > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > > > > or XDP_CHECKSUM_UNNECESSARY.
> > > > >
> > > > > > +};
> > > > > > +
> > > > > > +enum xdp_csum_status {
> > > > > > + /* HW had parsed several transport headers and validated their
> > > > > > + * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > > > > + * 3 least significant bytes contain number of consecutive checksums,
> > > > > > + * starting with the outermost, reported by hardware as valid.
> > > > > > + * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > > > > + * for driver developers.
> > > > > > + */
> > > > > > + XDP_CHECKSUM_VALID_LVL0 = 1, /* 1 outermost checksum */
> > > > > > + XDP_CHECKSUM_VALID_LVL1 = 2, /* 2 outermost checksums */
> > > > > > + XDP_CHECKSUM_VALID_LVL2 = 3, /* 3 outermost checksums */
> > > > > > + XDP_CHECKSUM_VALID_LVL3 = 4, /* 4 outermost checksums */
> > > > > > + XDP_CHECKSUM_VALID_NUM_MASK = GENMASK(2, 0),
> > > > > > + XDP_CHECKSUM_VALID = XDP_CHECKSUM_VALID_NUM_MASK,
> > > > >
> > > > > I don't see what bpf prog suppose to do with these levels.
> > > > > The driver should pick between 3:
> > > > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> > > > >
> > > > > No levels and no anything partial. please.
> > > >
> > > > This levels business is an unfortunate side effect of
> > > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > > does the boolean actually mean? With these levels, at least that is
> > > > well defined: the first N checksum fields.
> > >
> > > If I understand this correctly this is intel specific feature that
> > > other NICs don't have. skb layer also doesn't have such concept.
>
> Please look into csum_level field in sk_buff. It is not the most used property
> in the kernel networking code, but it is certainly 1. used by networking stack
> 2. set to non-zero value by many vendors.
>
> So you do not need to search yourself, I'll copy-paste the docs for
> CHECKSUM_UNNECESSARY here:
>
> * %CHECKSUM_UNNECESSARY is applicable to following protocols:
> *
> * - TCP: IPv6 and IPv4.
> * - UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a
> * zero UDP checksum for either IPv4 or IPv6, the networking stack
> * may perform further validation in this case.
> * - GRE: only if the checksum is present in the header.
> * - SCTP: indicates the CRC in SCTP header has been validated.
> * - FCOE: indicates the CRC in FC frame has been validated.
> *
>
> Please, look at this:
>
> * &sk_buff.csum_level indicates the number of consecutive checksums found in
> * the packet minus one that have been verified as %CHECKSUM_UNNECESSARY.
> * For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet
> * and a device is able to verify the checksums for UDP (possibly zero),
> * GRE (checksum flag is set) and TCP, &sk_buff.csum_level would be set to
> * two. If the device were only able to verify the UDP checksum and not
> * GRE, either because it doesn't support GRE checksum or because GRE
> * checksum is bad, skb->csum_level would be set to zero (TCP checksum is
> * not considered in this case).
>
> From:
> https://elixir.bootlin.com/linux/v6.5-rc4/source/include/linux/skbuff.h#L115
>
> > > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > > or don't pretend that it checks the checksum and just say NONE.
> >
>
> Well, in such case, most of the NICs that use CHECKSUM_UNNECESSARY would have to
> return CHECKSUM_NONE instead, because based on my quick search, they mostly
> return checksum level of 0 (no tunneling detected) or 1 (tunneling detected),
> so they only parse headers up to a certain depth, meaning it's not possible
> to tell whether there isn't another CHECKSUM_UNNECESSARY-eligible header hiding
> in the payload, so those NIC cannot guarantee ALL the checksums present in the
> packet are correct. So, by your logic, we should make e.g. AF_XDP user re-check
> already verified checksums themselves, because HW "doesn't pretend that it
> checks the checksum and just says NONE".
>
> > I did not know how much this was used, but quick grep for non constant
> > csum_level shows devices from at least six vendors.
>
> Yes, there are several vendors that set the csum_level, including broadcom
> (bnxt) and mellanox (mlx4 and mlx5).
>
> Also, CHECKSUM_UNNECESSARY is found in 100+ drivers/net/ethernet files,
> while csum_level is in like 20, which means overwhelming majority of
> CHECKSUM_UNNECESSARY NICs actually stay with the default checksum level of '0'
> (they check only the outermost checksum - anything else needs to be verified by
> the networking stack).

No. What I'm saying is that XDP_CHECKSUM_UNNECESSARY should be
equivalent to skb's CHECKSUM_UNNECESSARY with csum_level = 0.
I'm well aware that some drivers are trying to be smart and set csum_level = 1.
There is no use case for it in XDP.
"But our HW supports it, so the XDP prog should read it" is the reason NOT
to expose it to bpf in a generic api.

Either we're doing per-driver kfuncs and no common infra, or a common kfunc
that covers 99% of the drivers, which is CHECKSUM_UNNECESSARY && csum_level = 0.
It's not acceptable to present a generic api to the xdp prog with a multi-level
csum that only works on specific HW. Next thing there will be new flags
and MAX_CSUM_LEVEL in XDP features.
Pretending to be generic while being HW specific is the worst interface.