linux-kernel - Re: [patch 1/2] x86_64 page fault NMI-safe


Message-ID: <AANLkTik-sJL_pAJg38Hzpl_jpn2v90t7tppYqOGmu6p-@mail.gmail.com>
Date:	Wed, 14 Jul 2010 15:56:43 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Frederic Weisbecker <fweisbec@...il.com>
Cc:	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	Andi Kleen <andi@...stfloor.org>, Ingo Molnar <mingo@...e.hu>,
	LKML <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Steven Rostedt <rostedt@...tedt.homelinux.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Christoph Hellwig <hch@....de>, Li Zefan <lizf@...fujitsu.com>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Johannes Berg <johannes.berg@...el.com>,
	Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Tom Zanussi <tzanussi@...il.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Jeremy Fitzhardinge <jeremy@...p.org>,
	"Frank Ch. Eigler" <fche@...hat.com>, Tejun Heo <htejun@...il.com>
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Wed, Jul 14, 2010 at 3:31 PM, Frederic Weisbecker <fweisbec@...il.com> wrote:
>
> Until now I didn't because I clearly misunderstand the vmalloc internals. I'm
> not even quite sure why memory allocated with vmalloc can sometimes not be
> mapped (and then fault once for this to sync). Some people have tried to explain
> it to me but the picture is still vague to me.

So the issue is that the system can have thousands and thousands of
page tables (one for each process), and what do you do when you add a
new kernel virtual mapping?

You can:

 - make sure that you only ever use _one_ single top-level entry for
all vmalloc mappings, and make sure that all processes are created
with that static entry filled in. This is optimal, but it just doesn't
work on all architectures (eg on 32-bit x86, it would limit the
vmalloc space to 4MB in non-PAE, whatever)

 - at vmalloc time, when adding a new page directory entry, walk all
the tens of thousands of existing page tables under a lock that
guarantees that we don't add any new ones (ie it will lock out fork())
and add the required pgd entry to them.

 - or just take the fault and do the "fill the page tables" on demand.
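
The 4MB figure in the first option follows from how much address space a
single top-level entry covers. A minimal sketch of that arithmetic, assuming
a plain radix page table; the helper name bytes_per_entry() is hypothetical
and not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of virtual address space covered by one page-table entry.
 *
 * page_shift:     log2 of the page size (12 for 4KB pages)
 * bits_per_level: log2 of the number of entries per table
 * levels_below:   number of table levels underneath this entry
 *                 (1 means the entry points at a table of leaf PTEs)
 */
static uint64_t bytes_per_entry(unsigned page_shift,
                                unsigned bits_per_level,
                                unsigned levels_below)
{
    unsigned shift = page_shift + bits_per_level * levels_below;

    assert(shift < 64);            /* keep the shift well defined */
    return (uint64_t)1 << shift;
}
```

For 32-bit non-PAE x86 (4KB pages, 1024 entries per table, two levels),
bytes_per_entry(12, 10, 1) is 4MB. For 4-level x86_64 paging (512 entries per
table), bytes_per_entry(12, 9, 3) is 512GB, which is why a single static
top-level entry is far less of a constraint there.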

Quite frankly, most of the time it's probably better to make that last
choice (unless your hardware makes it easy to make the first choice,
which is obviously simplest for everybody). It makes it _much_ cheaper
to do vmalloc. It also avoids that nasty latency issue. And it's just
simpler too, and has no interesting locking issues with how/when you
expose the page tables in fork() etc.
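
As a rough illustration of the "take the fault and fill on demand" choice,
here is a toy user-space model, not kernel code. All names (struct pgd,
init_pgd, toy_vmalloc_map() and so on) are made up for this sketch, and the
"page tables" are just small arrays: a vmalloc-style mapping only updates
one reference table, new processes copy that table when they are created,
and an older process picks up a missing entry the first time it touches it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PGD_ENTRIES 8                    /* toy size, not a real layout */

struct pgd {
    uintptr_t entry[PGD_ENTRIES];        /* 0 means "not present" */
};

static struct pgd init_pgd;              /* the one reference table */

/* "vmalloc time": touch only the reference table, no walk, no lock. */
static void toy_vmalloc_map(size_t idx, uintptr_t table)
{
    assert(idx < PGD_ENTRIES);
    init_pgd.entry[idx] = table;
}

/* "fork time": a new table starts with whatever is mapped right now. */
static void toy_new_process(struct pgd *p)
{
    *p = init_pgd;
}

/*
 * "fault time": if the reference table has the entry, copy it over.
 * Returns false when the access is a genuine bad access.
 */
static bool toy_vmalloc_fault(struct pgd *p, size_t idx)
{
    if (idx >= PGD_ENTRIES || init_pgd.entry[idx] == 0)
        return false;
    p->entry[idx] = init_pgd.entry[idx];
    return true;
}

/* Returns the number of faults taken (0 or 1), or -1 for a bad access. */
static int toy_access(struct pgd *p, size_t idx)
{
    if (idx < PGD_ENTRIES && p->entry[idx] != 0)
        return 0;
    return toy_vmalloc_fault(p, idx) ? 1 : -1;
}
```

In this model a process created before toy_vmalloc_map(3, ...) takes exactly
one fault on its first toy_access(&p, 3) and none afterwards, while a process
created later never faults on that entry. The second option in the list would
instead loop over every existing struct pgd inside toy_vmalloc_map(), which is
where the cost and the locking against fork() would come from.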

So the only downside is that you do end up taking a fault in the
(rare) case where you have a newly created task that didn't get an
even newer vmalloc entry. And that fault can sometimes be in an
interrupt or an NMI. Normally it's trivial to handle that fairly
simple nested fault. But NMI has that inconvenient "iret unblocks
NMIs, because there is no dedicated 'nmiret' instruction" problem on
x86.

                            Linus