Date:	Fri, 16 Jul 2010 12:47:26 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	Andi Kleen <andi@...stfloor.org>, Ingo Molnar <mingo@...e.hu>,
	LKML <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Steven Rostedt <rostedt@...tedt.homelinux.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Christoph Hellwig <hch@....de>, Li Zefan <lizf@...fujitsu.com>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Johannes Berg <johannes.berg@...el.com>,
	Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Tom Zanussi <tzanussi@...il.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Jeremy Fitzhardinge <jeremy@...p.org>,
	"Frank Ch. Eigler" <fche@...hat.com>, Tejun Heo <htejun@...il.com>
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 10:46:13AM -0400, Steven Rostedt wrote:
> On Thu, 2010-07-15 at 16:11 +0200, Frederic Weisbecker wrote:
> 
> > >  - make sure that you only ever use _one_ single top-level entry for
> > > all vmalloc issues, and can make sure that all processes are created
> > > with that static entry filled in. This is optimal, but it just doesn't
> > > work on all architectures (eg on 32-bit x86, it would limit the
> > > vmalloc space to 4MB in non-PAE, whatever)
> > 
> > 
> > But then, even if you ensure that, don't we also need to fill the lower-level
> > entries for the new mapping?
> 
> If I understand your question, you do not need to worry about the lower
> level entries because all the processes will share the same top level.
> 
> process 1's PGD ------,
>                       |
>                       +------> PMD --> ...
>                       |
> process 2's PGD ------'
> 
> Thus we have one top-level page-table entry shared by all processes. The issue
> happens when the vmalloc space crosses a PMD boundary and we need to update
> the PGDs of all processes to point to the new PMD that must be added to cover
> the spread of the vmalloc space.




Oh right. We point to that PMD, and the updates themselves have already been made
in the lower-level entries that the PMD points to. Indeed.
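
To make that concrete, here is a minimal sketch of the on-demand fix-up (loosely
shaped after the idea behind x86's vmalloc_fault(); the function below, its name
and its details are simplified illustrations of mine, not the real kernel code).
On a fault against a vmalloc address we just copy the missing top-level entry from
the reference kernel page tables (init_mm) into the faulting task's page tables:

#include <linux/mm.h>
#include <asm/pgtable.h>

/*
 * Sketch only: sync one missing top-level entry from the reference
 * kernel page tables into the current task's page tables.  The real
 * vmalloc_fault() also deals with PAE, 4-level paging, sanity checks
 * of the lower levels, etc.
 */
static int sketch_vmalloc_fault(unsigned long address)
{
	pgd_t *pgd, *pgd_ref;

	/* Only vmalloc addresses get this lazy treatment. */
	if (address < VMALLOC_START || address >= VMALLOC_END)
		return -1;

	pgd = pgd_offset(current->active_mm, address);	/* faulting task */
	pgd_ref = pgd_offset_k(address);		/* init_mm reference */

	if (pgd_none(*pgd_ref))
		return -1;		/* no such vmalloc mapping: real fault */

	if (pgd_none(*pgd))
		set_pgd(pgd, *pgd_ref);	/* now share init_mm's PMD */

	/* Lower levels are shared with init_mm; nothing else to copy. */
	return 0;
}

Once the task's PGD points at the shared PMD, any later update below that PMD is
visible to every process for free, which is exactly the property described above.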



> 
> > 
> > Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
> > risk adding a new memory mapping for memory newly allocated with kmalloc?
> 
> Because all of memory (well, 800-some megs on 32-bit) is mapped into the
> kernel address space of all processes. That is, kmalloc only uses this memory
> (as does get_free_page()). All processes have a PMD (or PUD, whatever) that
> maps this memory. The issue only arises when we use new virtual addresses,
> which vmalloc does. Vmalloc may map physical memory that is already mapped
> for all processes, but the address that vmalloc uses to access that memory
> is not yet mapped in every process's page tables.



Ok I see.
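
Put differently, and just as an illustration with made-up names and constants:
kmalloc() returns addresses inside the kernel's linear mapping, where virtual and
physical addresses differ by a constant offset, so the 800-some megs mentioned
above are mapped once and shared by every process's kernel page tables. No
page-table update can ever be needed for a kmalloc address:

/*
 * Sketch only: the kernel's linear ("direct") mapping.  kmalloc() and
 * get_free_page() hand out addresses in this region, so they are
 * already covered by every process's kernel page tables.
 */
#define SKETCH_PAGE_OFFSET	0xc0000000UL	/* classic 32-bit split, illustrative */

static inline unsigned long sketch_virt_to_phys(unsigned long vaddr)
{
	return vaddr - SKETCH_PAGE_OFFSET;	/* constant offset, no page walk */
}

static inline unsigned long sketch_phys_to_virt(unsigned long paddr)
{
	return paddr + SKETCH_PAGE_OFFSET;
}

vmalloc addresses, by contrast, live outside this region, which is why they are
the only ones that can need the lazy page-table fix-up.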




> 
> The usual reason the kernel uses vmalloc is to get a virtually contiguous
> range of memory. vmalloc can map several pages as one contiguous piece of
> memory that in reality is several different pages scattered around physical
> memory. kmalloc can only hand out pages that are contiguous in physical
> memory. That is, if kmalloc gets 8192 bytes on an arch with 4096-byte pages,
> it will allocate two consecutive pages in physical memory. If two contiguous
> pages are not available, even if thousands of single pages are, kmalloc will
> fail, whereas vmalloc will not.
> 
> An allocation of vmalloc can use two different pages and just set up the
> page tables to make them contiguous from the kernel's point of view. Note,
> this comes at a cost. One is that, when we do this, we may need to update a
> bunch of page tables. The other is that we must spend extra TLB entries to
> point to these separate pages. Kmalloc and get_free_page() use the big
> memory mappings. That is, if the TLB allows us to map large pages, we can do
> that for kernel memory, since we just want the memory contiguous as it is in
> physical memory.
> 
> Thus the kernel maps the physical memory with as few TLB entries as needed
> (large pages and large TLB entries). If we can map 64K pages, we do that.
> Then kmalloc just allocates within this range; it does not need to map any
> pages, because they are already mapped.
> 
> Does this make a bit more sense?



Totally! You've made it very clear to me.
Moreover, I did not know we could have such variable page sizes. I mean, I thought
we could have a variable page size, but that it would apply to every page.
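
For the physical-vs-virtual contiguity point above, here is a rough sketch of the
vmalloc idea (simplified; sketch_vmalloc() is an illustrative name of mine, and the
real vmalloc()/vmap() also track the area, add guard pages, handle accounting,
etc.): grab however many single pages happen to be free, then stitch them into one
contiguous virtual range.

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Sketch only: allocate 'nr_pages' individual pages (possibly scattered
 * all over physical memory) and map them into one virtually contiguous
 * kernel range.  kmalloc(), by contrast, must find the pages physically
 * contiguous, or fail.
 */
static void *sketch_vmalloc(unsigned int nr_pages)
{
	struct page **pages;
	void *vaddr;
	unsigned int i;

	pages = kmalloc(nr_pages * sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	for (i = 0; i < nr_pages; i++) {
		pages[i] = alloc_page(GFP_KERNEL);	/* order-0, any free page */
		if (!pages[i])
			goto fail;
	}

	/* Stitch the scattered pages into one contiguous virtual range. */
	vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
	if (!vaddr)
		goto fail;

	/* NB: a real implementation keeps 'pages' so it can free them later. */
	return vaddr;

fail:
	while (i--)
		__free_page(pages[i]);
	kfree(pages);
	return NULL;
}

This is also where the costs described above come from: each of those order-0
pages needs its own page-table entry and its own TLB entry, whereas the linear
mapping can be covered by a handful of large pages.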





> 
> > 
> > 
> > 
> > >  - at vmalloc time, when adding a new page directory entry, walk all
> > > the tens of thousands of existing page tables under a lock that
> > > guarantees that we don't add any new ones (ie it will lock out fork())
> > > and add the required pgd entry to them.
> > > 
> > >  - or just take the fault and do the "fill the page tables" on demand.
> > > 
> > > Quite frankly, most of the time it's probably better to make that last
> > > choice (unless your hardware makes it easy to make the first choice,
> > > which is obviously simplest for everybody). It makes it _much_ cheaper
> > > to do vmalloc. It also avoids that nasty latency issue. And it's just
> > > simpler too, and has no interesting locking issues with how/when you
> > > expose the page tables in fork() etc.
> > > 
> > > So the only downside is that you do end up taking a fault in the
> > > (rare) case where you have a newly created task that didn't get an
> > > even newer vmalloc entry.
> > 
> > 
> > But then how did the previous tasks get this new mapping? You said
> > we don't walk through every process's page tables for vmalloc.
> 
> Actually we don't even need to walk the page tables in the first task
> (although we might do that). When the kernel accesses that memory we take
> a page fault; the page fault handler will see that this address is vmalloc
> data and fill in the page tables for the task at that time.



Right.
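
And for completeness, the hook sits right at the top of the fault handler;
something with this shape (a simplified sketch of mine, not the literal x86
do_page_fault() code) runs before anything else, so a fault on a vmalloc address
is patched up silently and the faulting access is simply retried:

/*
 * Sketch only: kernel-address faults are checked first and, if they are
 * vmalloc lazy-sync faults, fixed up by copying entries from init_mm
 * (see the sketch_vmalloc_fault() sketch earlier) and then retried.
 * This path must not take sleeping locks or touch user state, because
 * it can trigger from interrupt -- or, as this thread discusses, NMI --
 * context.
 */
static void sketch_do_page_fault(struct pt_regs *regs, unsigned long address)
{
	if (address >= TASK_SIZE) {		/* kernel-space address */
		if (sketch_vmalloc_fault(address) == 0)
			return;			/* fixed up; retry the access */
		/* otherwise it is a genuine kernel fault (oops path) */
	}

	/* ... normal user-space fault handling would continue here ... */
}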




> > 
> > I would understand this race if we were to walk every process's page tables
> > and add the new mapping to them, but missed one new task that had just
> > forked, because we didn't lock (or just use RCU).
> > 
> > 
> > 
> > > And that fault can sometimes be in an
> > > interrupt or an NMI. Normally it's trivial to handle that fairly
> > > simple nested fault. But NMI has that inconvenient "iret unblocks
> > > NMI's, because there is no dedicated 'nmiret' instruction" problem on
> > > x86.
> > 
> > 
> > Yeah.
> > 
> > 
> > So the parts of the problem I don't understand are:
> > 
> > - why don't we have this problem with kmalloc() ?
> 
> I hope I explained that above.



Yeah :)

Thanks a lot for your explanations!

