linux-kernel - Re: RFC: Petition Intel/AMD to add POPF

Message-ID: <CA+55aFzJ6=xV19twjXbEahDeP7mHJzBG0BByWWyVLzYKzoU4KA@mail.gmail.com>
Date:	Wed, 17 Aug 2016 12:54:04 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Denys Vlasenko <dvlasenk@...hat.com>
Cc:	Andy Lutomirski <luto@...capital.net>,
	Sara Sharon <sara.sharon@...el.com>,
	Dan Williams <dan.j.williams@...el.com>,
	Christian König <christian.koenig@....com>,
	Vinod Koul <vinod.koul@...el.com>,
	Alex Deucher <alexander.deucher@....com>,
	Johannes Berg <johannes.berg@...el.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
	Andy Lutomirski <luto@...nel.org>,
	"the arch/x86 maintainers" <x86@...nel.org>,
	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Adrian Hunter <adrian.hunter@...el.com>
Subject: Re: RFC: Petition Intel/AMD to add POPF_IF insn

On Wed, Aug 17, 2016 at 12:35 PM, Denys Vlasenko <dvlasenk@...hat.com> wrote:
>
> Experimentally, POPF is stupidly slow _always_. 6 cycles
> even if none of the "scary" flags are changed.

6 cycles is nothing.

That's basically the overhead of "oops, I need to use the microcode sequencer".

One issue is that the Intel decoders (AMD too, for that matter) can
only generate a fairly small set of uops for any instruction. Some
instructions are really trivial to decode (popf definitely falls under
that heading), but are more than just a couple of uops, so you end up
having to use the uop sequencer logic.

According to Agner Fog's tables, there's one or two
micro-architectures that actually do the simple "popf" case with a
single cycle throughput, but that's the very unusual case.

You can't even fit the "pop a value, see if only the arithmetic flags
changed, trap to microcode otherwise" into the three or four uops that
the "complex decoder" can generate directly.

And that "fall back to the uop sequencer engine" tends to just always
cause several cycles regardless. So yes, microcode tends to be slow
even for what would otherwise be trivial operations. You'd think Intel
could do as well as they do for the L0 uop cache, but afaik they
don't.

Anyway, six cycles is fast. I'd *love* for popf to actually be just 6
cycles when IF changes.  It's much much worse iirc (although honestly,
I haven't timed it in years - it's much easier to time just the
arithmetic flag changes).

It used to be more like a hundred cycles on Prescott.

                  Linus
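
For anyone who wants to reproduce the "easy" measurement mentioned above
(popf when none of the system flags change), here is a rough user-space
sketch. It is not from the original thread. The function name
popf_pair_ref_cycles and the PAIRS_PER_BLOCK constant are made up for
illustration, and the code assumes x86-64 with GCC or Clang. It times
pushfq/popfq pairs rather than popfq alone, and the TSC counts reference
cycles, not core clock cycles, so treat the result as a ballpark figure.
User space cannot change IF this way (with CPL > IOPL, popf leaves IF
untouched), so the expensive case discussed above is not covered.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <x86intrin.h>          /* __rdtsc() */
#define HAVE_POPF_BENCH 1
#else
#define HAVE_POPF_BENCH 0
#endif

/* Must match the ".rept 16" count in the asm block below. */
#define PAIRS_PER_BLOCK 16

/*
 * Hypothetical helper: returns the average number of TSC ticks per
 * pushfq/popfq pair, or -1.0 if the measurement is not available
 * (non-x86-64 build, or blocks == 0).
 */
double popf_pair_ref_cycles(unsigned blocks)
{
#if HAVE_POPF_BENCH
    if (blocks == 0)
        return -1.0;

    uint64_t start = __rdtsc();
    for (unsigned i = 0; i < blocks; i++) {
        __asm__ volatile(
            /* Step over the 128-byte red zone before pushing;
             * lea does not modify the flags. */
            "lea -128(%%rsp), %%rsp\n\t"
            ".rept 16\n\t"
            "pushfq\n\t"        /* save RFLAGS on the stack        */
            "popfq\n\t"         /* restore the very same value     */
            ".endr\n\t"
            "lea 128(%%rsp), %%rsp"
            ::: "memory", "cc");
    }
    uint64_t end = __rdtsc();

    /* Includes loop and lea overhead; no serialization around rdtsc. */
    return (double)(end - start) / ((double)blocks * PAIRS_PER_BLOCK);
#else
    (void)blocks;
    return -1.0;
#endif
}
```

Calling popf_pair_ref_cycles(100000) from a small main() and printing
the result should be enough for a rough number. Subtracting the cost of
an otherwise identical loop without the popfq would isolate the popf
part somewhat better.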