Date:	Wed, 17 Aug 2016 12:54:04 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Denys Vlasenko <dvlasenk@...hat.com>
Cc:	Andy Lutomirski <luto@...capital.net>,
	Sara Sharon <sara.sharon@...el.com>,
	Dan Williams <dan.j.williams@...el.com>,
	Christian König <christian.koenig@....com>,
	Vinod Koul <vinod.koul@...el.com>,
	Alex Deucher <alexander.deucher@....com>,
	Johannes Berg <johannes.berg@...el.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
	Andy Lutomirski <luto@...nel.org>,
	"the arch/x86 maintainers" <x86@...nel.org>,
	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Adrian Hunter <adrian.hunter@...el.com>
Subject: Re: RFC: Petition Intel/AMD to add POPF_IF insn

On Wed, Aug 17, 2016 at 12:35 PM, Denys Vlasenko <dvlasenk@...hat.com> wrote:
>
> Experimentally, POPF is stupidly slow _always_. 6 cycles
> even if none of the "scary" flags are changed.

6 cycles is nothing.

That's basically the overhead of "oops, I need to use the microcode sequencer".

One issue is that the Intel decoders (AMD too, for that matter) can
only generate a fairly small set of uops for any instruction. Some
instructions are really trivial to decode (popf definitely falls under
that heading), but are more than just a couple of uops, so you end up
having to use the uop sequencer logic.

According to Agner Fog's tables, there's one or two
micro-architectures that actually do the simple "popf" case with a
single cycle throughput, but that's the very unusual case.

You can't even fit the "pop a value, see if only the arithmetic flags
changed, trap to microcode otherwise" into the three or four uops that
the "complex decoder" can generate directly.

And that "fall back to the uop sequencer engine" tends to just always
cause several cycles regardless. So yes, microcode tends to be slow
even for what would otherwise be trivial operations. You'd think Intel
could do as well as they do for the L0 uop cache, but afaik they
don't.

Anyway, six cycles is fast. I'd *love* for popf to actually be just 6
cycles when IF changes.  It's much much worse iirc (although honestly,
I haven't timed it in years - it's much easier to time just the
arithmetic flag changes).

It used to be more like a hundred cycles on Prescott.

                  Linus
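
For readers who want to time the "easy" case mentioned above (a popf
that changes none of the interesting flags), here is a rough user-space
sketch. It is not code from this thread. The function names are made up
for illustration, and it assumes x86-64 with GCC or Clang inline asm. It
counts TSC ticks, which are not the same thing as core clock cycles on
most modern parts, and the baseline loop swaps the popf for an lea, so
the difference between the two loops is only an approximation of the
popf cost.

```c
#include <assert.h>
#include <stdint.h>

#if !defined(__x86_64__) || !defined(__GNUC__)
#error "this sketch assumes x86-64 and GCC/Clang-style inline asm"
#endif

/* Read the time-stamp counter, fenced so surrounding work does not
 * drift across the measurement point. */
static inline uint64_t tsc_now(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("lfence\n\trdtsc\n\tlfence"
                         : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Baseline (hypothetical helper): push the flags, then drop them from
 * the stack with lea instead of popf. */
static uint64_t time_pushf_only(unsigned long iters)
{
    uint64_t start;
    unsigned long i;

    assert(iters > 0);
    start = tsc_now();
    for (i = 0; i < iters; i++)
        __asm__ __volatile__(
            "lea -128(%%rsp), %%rsp\n\t"  /* step over the red zone */
            "pushfq\n\t"
            "lea 8(%%rsp), %%rsp\n\t"     /* discard the saved flags */
            "lea 128(%%rsp), %%rsp"
            : : : "memory", "cc");
    return tsc_now() - start;
}

/* Same loop, but restore the flags with popf. The value popped is the
 * one just pushed, so no flag actually changes. */
static uint64_t time_pushf_popf(unsigned long iters)
{
    uint64_t start;
    unsigned long i;

    assert(iters > 0);
    start = tsc_now();
    for (i = 0; i < iters; i++)
        __asm__ __volatile__(
            "lea -128(%%rsp), %%rsp\n\t"  /* step over the red zone */
            "pushfq\n\t"
            "popfq\n\t"
            "lea 128(%%rsp), %%rsp"
            : : : "memory", "cc");
    return tsc_now() - start;
}

/* Approximate extra TSC ticks per iteration attributable to popf.
 * Noise can make this small or even negative on a busy machine. */
static double popf_ticks_per_iter(unsigned long iters)
{
    double with_popf = (double)time_pushf_popf(iters);
    double without = (double)time_pushf_only(iters);

    return (with_popf - without) / (double)iters;
}
```

Calling popf_ticks_per_iter() with a few million iterations from a
small main() and printing the result should give a ballpark figure.
Pinning the process to one CPU and repeating the run a few times helps
with noise. This only covers the case where popf leaves IF alone;
timing an actual IF change needs kernel context and is not attempted
here.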
