Message-ID: <b8e8cea6ac884c04b8c9e7a479fd2208@AcuMS.aculab.com>
Date: Mon, 22 Mar 2021 15:41:56 +0000
From: David Laight <David.Laight@...LAB.COM>
To: 'Peter Zijlstra' <peterz@...radead.org>
CC: "x86@...nel.org" <x86@...nel.org>,
"jpoimboe@...hat.com" <jpoimboe@...hat.com>,
"jgross@...e.com" <jgross@...e.com>,
"mbenes@...e.com" <mbenes@...e.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2 03/14] x86/retpoline: Simplify retpolines
From: Peter Zijlstra
> Sent: 22 March 2021 09:33
>
> On Fri, Mar 19, 2021 at 05:18:14PM +0000, David Laight wrote:
> > From: Peter Zijlstra
> > > Sent: 18 March 2021 17:11
> > >
> > > Due to commit c9c324dc22aa ("objtool: Support stack layout changes
> > > in alternatives"), it is possible to simplify the retpolines.
> > >
> > ...
> > > Notice that since the longest alternative sequence is now:
> > >
> > > >    0:   e8 07 00 00 00          callq  c <.altinstr_replacement+0xc>
> > > >    5:   f3 90                   pause
> > > >    7:   0f ae e8                lfence
> > > >    a:   eb f9                   jmp    5 <.altinstr_replacement+0x5>
> > > >    c:   48 89 04 24             mov    %rax,(%rsp)
> > > >   10:   c3                      retq
> > >
> > > 17 bytes, we have 15 bytes NOP at the end of our 32 byte slot. (IOW,
> > > if we can shrink the retpoline by 1 byte we can pack it more dense)
> >
> > I'm intrigued about the lfence after the pause.
> > Clearly this is for very warped cpu behaviour.
> > To get to the pause you have to be speculating past an
> > unconditional call.
>
> Please read up on retpoline... That's the speculation trap. The warped
> CPU behaviour is called Spectre-v2.

There is 'warped' and 'very warped' :-)
Most of Spectre-v2 (don't search for Spectra v2 by mistake) is avoiding
the indirect branch prediction - which I knew.
I think the 'pause' is only executed if the cpu speculates through
the mov and retq; rather than speculating past the 'call' - which
some ARM cpus seem to do.

> For others, the obvious alternative is the below; and possibly we could
> then also remove the loop.
Another alternative is 'hlt' with the loop (rather than int3).
> The original retpoline, as per Paul's article has: 1: pause; jmp 1b;.
> That is, it lacks the LFENCE we have and would also fit 16 bytes.

Yes. Just 'jmps .' ought to be enough to block any side effects
of speculative execution.
Adding 'pause' is 'a good idea' for any spin loop.

There might be another lurking performance issue.
Skylake increased the execution time of pause from ~10 to ~140 clocks.
Reading between the lines I suspect this applies to speculatively
executed pause.
If that happens on a regular basis it might be noticeable.
So it may even be best to remove the pause!

As you say, the original retpoline lacked the lfence.
The only 'load' instruction in that sequence is the 'retq'.

It has to be said that given all (normal) loads are executed
in program order and all (normal) stores are also executed
in program order I've never actually seen what either
lfence or sfence actually do for you.
(mfence synchronises reads and writes - so may be useful.)

The (pre spectre) copies of the intel pdf's I have don't say
anything about lfence being any kind of a barrier against
speculative memory reads.

If the retpoline doesn't fit in 16 bytes it is almost certainly
(probably) worth putting the called label at offset 16.
This would mean that there is only one 16-byte code block
read from the call target.
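
To make the fetch-block point concrete, here is a rough sketch (not from
the patch; the helper name blocks_spanned and the 16-byte block size are
assumptions for illustration only). It counts how many 16-byte blocks the
'mov; retq' tail touches with the call target at offset 0xc, as in the
quoted disassembly, versus at offset 16.

```python
FETCH_BLOCK = 16  # assumed code-fetch block size in bytes (illustrative)


def blocks_spanned(start, length, block=FETCH_BLOCK):
    """Count the block-sized chunks touched by `length` bytes at `start`."""
    if length <= 0:
        return 0
    return (start + length - 1) // block - start // block + 1


# From the quoted disassembly: mov %rax,(%rsp) is 4 bytes, retq is 1 byte.
TAIL_LEN = 4 + 1

if __name__ == "__main__":
    # Target at 0xc: the mov ends at 0xf and the retq sits at 0x10.
    print("target at 0x0c:", blocks_spanned(0x0C, TAIL_LEN))
    # Target at 0x10: both instructions sit inside the second block.
    print("target at 0x10:", blocks_spanned(0x10, TAIL_LEN))
```

With the current layout the tail straddles the boundary at 0x10, while a
target at offset 16 keeps it inside a single block. Whether that matters
on real hardware would need measuring.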
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)