[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.LFD.2.01.0908010856080.3304@localhost.localdomain>
Date: Sat, 1 Aug 2009 09:13:51 -0700 (PDT)
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Jakub Jelinek <jakub@...hat.com>
cc: Ulrich Drepper <drepper@...hat.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Richard Guenther <rguenther@...e.de>
Subject: Re: [PATCH] information leak in sigaltstack
On Sat, 1 Aug 2009, Jakub Jelinek wrote:
>
> yes, it is because of -Os, rep stosq is longer than rep stosl. For -O2 it
> generates:
> movq $0, 8(%rdi)
> movq %rsi, (%rdi)
> movl $6, 8(%rdi)
> movq $2, 16(%rdi)
Ok, so -Os actually generates _larger_ code due to this silly one-byte
micro-optimization that then disables other optimizations.
I suspect that the "rep stosl" choice is _particularly_ bad, since doing a
32-bit write actually is much more likely to cause subsequent stalls on
the store buffer if any of the latter accesses are 64-bit reads. It tends
to be much better to create larger stores (and equal or smaller loads) if
there are any overlapping accesses.
I think a lot of micro-architectures will stall if they try to do a read
that crosses multiple stores in the store buffer. If the load hits in the
store buffer and is entirely contained within one store, you have a much
better chance of just bypassing the cache entirely.
> which still isn't perfect, but is much better.
Yes. It would have been nice if gcc optimized the overlapping accesses
too, but it's already much better.
> At -O2 when GCC decides to do the memset piecewise it is easier to kill
> dead stores from the memset
On 32-bit, we do memcpy() by hand because gcc is/was so bad at this (we do
it for memset too, but only for really small areas, so it wouldn't
trigger).
On x86-64, we've trusted that gcc is better. Sadly, it's clearly not very
good. I bet that for anything that is just a couple of stores (three, in
this case), you'd almost always be better off using regular stores. The
"rep stosl" may be small, but the register constraints and the inability
to combine memory accesses are a big downer.
> At -Os when GCC decides during the expand to use arch specific pattern
> for the memset it would be much harder to handle it at the RTL level.
> So the above should be ideally optimized already at the tree level.
Yes. Or perhaps by just saying that for -Os you'd still expand it for a
really _small_ number of accesses (just make it smaller than for -O2).
> I'd say the test you want to do is
> if (sizeof (oss.ss_sp) + sizeof (oss.ss_size) + sizeof (oss.ss_flags)
> != sizeof oss)
> memset (&oss, 0, sizeof oss);
> (i.e. check whether the struct has any padding in it or not).
Yeah, that sounds logical. I'd still like to figure out how to get gcc to
generate better code in this case, though.
Explicit padding bytes would do it, of course, but then we're talking
changs to every single 64-bit architecture. Grr.
Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists