linux-kernel - Re: [RFC] de-asmify the x86-64 system call slowpath

Date:	Wed, 5 Feb 2014 16:32:55 -0800
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Peter Anvin <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <peterz@...radead.org>
Cc:	"the arch/x86 maintainers" <x86@...nel.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [RFC] de-asmify the x86-64 system call slowpath

On Sun, Jan 26, 2014 at 2:28 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> Comments? This was obviously brought on by my frustration with the
> currently nasty do_notify_resume() always returning to iret for the
> task_work case, and PeterZ's patch that fixed that, but made the asm
> mess even *worse*.

Actually, I should have taken a closer look.

Yes, do_notify_resume() is a real issue, and my stupid open/close
test-case showed that part of the profile.

But the "iretq" that dominates on the kernel build is actually the
page fault one.

I noticed this when I compared "-e cycles:pp" with "-e cycles:p". The
single-p version shows largely the same profile for the kernel, except
that instead of showing "iretq" as the big cost, it shows the first
instruction in "page_fault".

In fact, even when *not* zoomed into the kernel DSO, "page_fault"
actually takes 5% of CPU time according to perf report. That's really
quite impressive.

I suspect the Haswell architecture has made everything else cheaper,
and the exception overhead hasn't kept up. I'm wondering if there is
anything we could do to speed this up - like doing gang lookup in the
page cache and pre-populating the page tables opportunistically.

We're using an interrupt gate for the page fault handling, and I don't
think we can avoid that. For all I know, a trap gate might be slightly
faster (but likely not really noticeable - the microcode is surely
expensive, but the pipeline unwinding is probably the biggest cost of
the page fault), but we have the issue of interrupts causing page
faults for vmalloc pages.. And obviously we can't avoid the iretq for
the return path.

So as far as I can see, there's no sane way to make the page fault
itself cheaper. Looking at opportunistically prepopulating page tables
when it's cheap and easy might be the best we can do..

                 Linus