lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <202202061826.50506BF8@keescook>
Date:   Sun, 6 Feb 2022 18:53:44 -0800
From:   Kees Cook <keescook@...omium.org>
To:     Alexander Lobakin <alexandr.lobakin@...el.com>
Cc:     Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org,
        Eric Dumazet <edumazet@...gle.com>,
        "Gustavo A. R. Silva" <gustavoars@...nel.org>
Subject: Re: [RFC PATCH 2/3] net: gro: minor optimization for
 dev_gro_receive()

On Wed, Feb 02, 2022 at 01:08:27PM +0100, Alexander Lobakin wrote:
> From: Paolo Abeni <pabeni@...hat.com>
> Date: Wed, 02 Feb 2022 11:09:41 +0100
> 
> > Hello,
> 
> Hi!
> 
> > 
> > On Tue, 2022-01-18 at 18:39 +0100, Alexander Lobakin wrote:
> > > From: Paolo Abeni <pabeni@...hat.com>
> > > Date: Tue, 18 Jan 2022 17:31:00 +0100
> > > 
> > > > On Tue, 2022-01-18 at 16:56 +0100, Alexander Lobakin wrote:
> > > > > From: Paolo Abeni <pabeni@...hat.com>
> > > > > Date: Tue, 18 Jan 2022 16:24:19 +0100
> > > > > 
> > > > > > While inspecting some perf report, I noticed that the compiler
> > > > > > emits suboptimal code for the napi CB initialization, fetching
> > > > > > and storing multiple times the memory for flags bitfield.
> > > > > > This is with gcc 10.3.1, but I observed the same with older compiler
> > > > > > versions.
> > > > > > 
> > > > > > We can help the compiler to do a nicer work e.g. initially setting
> > > > > > all the bitfield to 0 using an u16 alias. The generated code is quite
> > > > > > smaller, with the same number of conditional
> > > > > > 
> > > > > > Before:
> > > > > > objdump -t net/core/gro.o | grep " F .text"
> > > > > > 0000000000000bb0 l     F .text	0000000000000357 dev_gro_receive
> > > > > > 
> > > > > > After:
> > > > > > 0000000000000bb0 l     F .text	000000000000033c dev_gro_receive
> > > > > > 
> > > > > > Signed-off-by: Paolo Abeni <pabeni@...hat.com>
> > > > > > ---
> > > > > >  include/net/gro.h | 13 +++++++++----
> > > > > >  net/core/gro.c    | 16 +++++-----------
> > > > > >  2 files changed, 14 insertions(+), 15 deletions(-)
> > > > > > 
> > > > > > diff --git a/include/net/gro.h b/include/net/gro.h
> > > > > > index 8f75802d50fd..a068b27d341f 100644
> > > > > > --- a/include/net/gro.h
> > > > > > +++ b/include/net/gro.h
> > > > > > @@ -29,14 +29,17 @@ struct napi_gro_cb {
> > > > > >  	/* Number of segments aggregated. */
> > > > > >  	u16	count;
> > > > > >  
> > > > > > -	/* Start offset for remote checksum offload */
> > > > > > -	u16	gro_remcsum_start;
> > > > > > +	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > > > > +	u16	proto;
> > > > > >  
> > > > > >  	/* jiffies when first packet was created/queued */
> > > > > >  	unsigned long age;
> > > > > >  
> > > > > > -	/* Used in ipv6_gro_receive() and foo-over-udp */
> > > > > > -	u16	proto;
> > > > > > +	/* portion of the cb set to zero at every gro iteration */
> > > > > > +	u32	zeroed_start[0];
> > > > > > +
> > > > > > +	/* Start offset for remote checksum offload */
> > > > > > +	u16	gro_remcsum_start;
> > > > > >  
> > > > > >  	/* This is non-zero if the packet may be of the same flow. */
> > > > > >  	u8	same_flow:1;
> > > > > > @@ -70,6 +73,8 @@ struct napi_gro_cb {
> > > > > >  	/* GRO is done by frag_list pointer chaining. */
> > > > > >  	u8	is_flist:1;
> > > > > >  
> > > > > > +	u32	zeroed_end[0];
> > > > > 
> > > > > This should be wrapped in struct_group() I believe, or compilers
> > > > > will start complaining soon. See [0] for the details.
> > > > > Adding Kees to the CCs.

Hi! Sorry I missed the original email sent my way. :)

> > > > 
> > > > Thank you for the reference. That really slipped-off my mind.
> > > > 
> > > > This patch does not use memcpy() or similar, just a single direct
> > > > assignement. Would that still require struct_group()?
> > > 
> > > Oof, sorry, I saw start/end and overlooked that it's only for
> > > a single assignment.
> > > Then it shouldn't cause warnings, but maybe use an anonymous
> > > union instead?
> > > 
> > > 	union {
> > > 		u32 zeroed;
> > > 		struct {
> > > 			u16 gro_remcsum_start;
> > > 			...
> > > 		};
> > > 	};
> > > 	__wsum csum;
> > > 
> > > Use can still use a BUILD_BUG_ON() in this case, like
> > > sizeof(zeroed) != offsetof(csum) - offsetof(zeroed).
> > 
> > Please forgive me for the very long delay. I'm looking again at this
> > stuff for formal non-rfc submission.
> 
> Sure, not a problem at all (:
> 
> > 
> > I like the anonymous union less, because it will move around much more
> > code - making the patch less readable - and will be more fragile e.g.
> > some comment alike "please don't move around 'csum'" would be needed.
> 
> We still need comments around zeroed_{start,end}[0] for now.
> I used offsetof(csum) as offsetofend(is_flist) which I'd prefer here
> unfortunately expands to offsetof(is_flist) + sizeof(is_flist), and
> the latter causes an error of using sizeof() against a bitfield.
> 
> > 
> > No strong opinion anyway, so if you really prefer that way I can adapt.
> > Please let me know.
> 
> I don't really have a strong preference here, I just suspect that
> zero-length array will produce or already produce -Warray-bounds
> warnings, and empty-struct constructs like

Yes, -Warray-bounds would yell very loudly about an 0-element array
having its first element assigned. :)

> 
> 	struct { } zeroed_start;
> 	u16 gro_remcsum_start;
> 	...
> 	struct { } zeroed_end;
> 
> 	memset(NAPI_GRO_CB(skb)->zeroed_start, 0,
> 	       offsetofend(zeroed_end) - offsetsof(zeroed_start));
> 
> will trigger Fortify compile-time errors from Kees' KSPP tree.
> 
> I think we could use
> 
> 	__struct_group(/* no tag */, zeroed, /* no attrs */,

Since this isn't UAPI, you can just do:

	struct_group(zeroed,

> 		u16 gro_remcsum_start;
> 		...
> 		u8 is_flist:1;
> 	);
> 	__wsum csum;
> 
> 	BUILD_BUG_ON(sizeof_field(struct napi_gro_cb, zeroed) != sizeof(u32));
> 	BUILD_BUG_ON(!IS_ALIGNED(offsetof(struct napi_gro_cb, zeroed),
> 				 sizeof(u32))); /* Avoid slow unaligned acc */
> 
> 	*(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0;
> 
> This doesn't depend on `csum`, doesn't need `struct { }` or
> `struct zero[0]` markers and still uses a direct assignment.

Given the kernel's -O2 use, a constant-sized u32 memcpy() and a direct
assignment will resolve to the same machine code:

https://godbolt.org/z/b9qKexx6q

void zero_var(struct object *p)
{
    p->zero_var = 0;
}

void zero_group(struct object *p)
{
    memset(&p->zero_group, 0, sizeof(p->zero_group));
}

zero_var:
  movl $0, (%rdi)
  ret
zero_group:
  movl $0, (%rdi)
  ret

And assigning them individually results in careful "and"ing to avoid the
padding:

zero_each:
  andl $-134217728, (%rdi)
  ret

I would have recommended struct_group() + memset() here, just because
you don't need the casts. But since you're doing BUILD_BUG_ON() size
verification, it's fine either way.

If you want to avoid the cast and avoid the memset(), you could use the
earlier suggestion of anonymous union and anonymous struct. It's
basically an open-coded struct_group, but you get to pick the type of
the overlapping variable. :)

-Kees

> Also adding Gustavo, maybe he'd like to leave a comment here.
> 
> > 
> > Thanks!
> > 
> > Paolo
> 
> Thanks,
> Al

-- 
Kees Cook

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ