Date:	Wed, 31 Mar 2010 18:33:48 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Matt Mackall <mpm@...enic.com>
cc:	San Mehat <san@...gle.com>, linux-kernel@...r.kernel.org,
	Brian Swetland <swetland@...gle.com>,
	Dave Hansen <haveblue@...ibm.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] proc: pagemap: Hold mmap_sem during page walk



On Wed, 31 Mar 2010, Matt Mackall wrote:
>
> Linus, I must say your charm has really worn thin. I've just stuck a
> post-it on my monitor saying "don't be Linus" to remind me not to be
> rude to my contributors. 

You didn't actually answer the problem, though.

I'm rude, because I think the code is buggy. I pointed out how, and why I 
think it's pretty fundamental. You quoted it, but you didn't answer it.

> If I recall correctly, the get_user_pages is there to force all the
> output pages to be guaranteed present before we later fill them so that
> the output needn't be double-buffered and the locking and recursion on
> the page tables is significantly simpler and faster. put_user is then
> actually easier than "writing to the physical pages through the array".

Umm. Why would get_user_pages() guarantee that the pages don't get swapped 
out in the meantime, and you end up with a page fault _anyway_?

Yes, it makes those page faults much rarer, but that just makes things 
much worse. Now you have a nasty subtle fundamental bug that is hard to 
trigger too.

NOTE! The fact that we mark the page dirty (and increment the page count) 
in get_user_pages() (not in the later loop, which does indeed look 
pointless) should mean that the page doesn't go away physically, and if it 
gets mapped out the same page will be mapped back in on access. That's how 
things like direct-IO work (modulo a concurrent fork(), when they don't 
work, but that's a separate issue).

But the problem with the deadlock is not that "same physical page" issue, 
it's simply the problem of the page fault code itself wanting to do a 
down_read() on the mmap_sem. So the fact that the physical page is 
reliable in the array doesn't actually solve the bug I was pointing out.

See? 

So the

	nr = get_user_pages(.. 1, 0, ..)

	...

	for_each_page()
		mark_it_dirty();

pattern is a valid pattern, but it is _only_ valid if you then do the 
write to the physical page you looked up. If you do the accesses through 
the virtual addresses, it makes zero sense.
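
To make that concrete, here is a minimal sketch of the valid form of the
pattern (2.6.33-era signatures; the helper name and arguments are made up
for illustration, the write is assumed to stay within one page, and error
handling is trimmed):

#include <linux/errno.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
#include <linux/string.h>

/* Sketch only: copy 'len' bytes into the user page at 'uaddr' by writing
 * through the pinned struct page, never through the user virtual address. */
static int fill_pinned_page(unsigned long uaddr, const void *src, size_t len)
{
	struct mm_struct *mm = current->mm;
	struct page *page;
	void *kaddr;
	int ret;

	down_read(&mm->mmap_sem);
	/* write=1: get_user_pages() itself marks the page dirty, so a
	 * swapped-out anonymous page becomes a swap-cache page whose
	 * refcount we can trust (the "early" dirtying). */
	ret = get_user_pages(current, mm, uaddr, 1, 1, 0, &page, NULL);
	up_read(&mm->mmap_sem);
	if (ret < 1)
		return -EFAULT;

	/* Do the access through the physical page, via a kernel mapping. */
	kaddr = kmap(page);
	memcpy(kaddr + (uaddr & ~PAGE_MASK), src, len);
	kunmap(page);

	/* The "late" dirtying: the page may have been cleaned between the
	 * pin and our write through the kernel mapping. */
	set_page_dirty_lock(page);
	page_cache_release(page);
	return 0;
}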

> You're right that the SetPageDirty() at cleanup is redundant (but
> harmless), and I probably copied that pattern from another user of
> get_user_pages.

Fine. So then you'd do get_user_pages() and do _nothing_ with the array? 
Please explain what the point of that is, again?

See above: there are valid reasons for things like direct-IO to do that 
exact "get_user_pages()" pattern - they mark the pages dirty both early 
 _and_ late, and both dirtying steps are required:

 - the first one (done by get_user_pages() itself) is to make sure that an
   anonymous page doesn't get switched around to another anonymous page 
   that just happens to have the same contents (ie if the page gets 
   swapped out, the swapout code will turn it into a swap-cache page).

   Once it's a swap-cache page, the page refcount will then guarantee that 
   it doesn't get replaced with anything else, so now you can trust the 
   physical page.

 - the second "mark it dirty afterwards" is required in case swapout did 
   happen, and the page got marked clean before the direct-IO actually 
   wrote to it. So you mark it dirty again - this time not to force any 
   swap cache things, but simply because there might have been a race 
   between cleaning the page and the thing that wrote to it through the 
   physical address.

So there is a very real reason for that pattern existing. It's just that 
that reason has nothing to do with locking the thing into the page tables. 
That is absolutely NOT guaranteed by the get_user_pages() physical page 
pinning (munmap() is an extreme example of this, but I think swapout will 
clear it too in try_to_unmap_one()).

In fact, even without swapping the page out, the accessed and dirty bit 
updates may end up being done by software, and that page fault can happen 
anyway.

What I'm trying to explain is simply that all these VM rules mean that you 
must never do a virtual user space access while holding mmap_sem. And 
doing get_user_pages() in no way changes that fundamental rule.

Now, I will _happily_ admit that the above is all very complicated, and 
very easy to get wrong. There is a real reason why I hate direct-IO. It's 
a total nightmare from a VM standpoint. We've had bugs here. So I can well 
see that people don't always understand all the rules.

Quite frankly, the only reason that /proc/self/pagemap code got merged is 
because it came through Andrew. Andrew knows the VM layer. Probably way 
better than I do by now. So when I get patches that touch things like this 
through the -mm tree, I tend to apply them.

And looking now at the code again, I really think it was a mistake.

> Earlier versions of the pagewalk code studiously avoided calling
> find_vma and only looked at the page tables (with either caller doing
> locking or accepting the non-atomicity) to avoid the sort of issue that
> San has run into, but it looks like I forgot about that and let it sneak
> back in when other folks added more hugepage support.

Ok, that would fix the problem. The page tables can be accessed even 
without holding mmap_sem. And once you don't need mmap_sem, all the 
deadlocks go away. The get_user_pages() pattern still doesn't make sense, 
but at that point it's a _harmless_ problem, and doesn't really hurt.

See what I'm saying?
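
For illustration, a minimal sketch of that kind of table-only lookup
(2.6.33-era page table macros, no hugepage handling, the helper name is
made up, and something else still has to keep the mm and its page tables
from being torn down underneath you):

#include <linux/errno.h>
#include <linux/mm.h>
#include <asm/pgtable.h>

/* Sketch only: read the pte for one user address straight out of the
 * page tables, with no find_vma() and no mmap_sem. */
static int lookup_pte(struct mm_struct *mm, unsigned long addr, pte_t *out)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;
	spinlock_t *ptl;

	pgd = pgd_offset(mm, addr);
	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return -ENOENT;
	pud = pud_offset(pgd, addr);
	if (pud_none(*pud) || pud_bad(*pud))
		return -ENOENT;
	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		return -ENOENT;

	/* The pte page is stabilized by its own lock, not by mmap_sem. */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	*out = *pte;
	pte_unmap_unlock(pte, ptl);
	return 0;
}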

> The deeper problem is that hugepages have various nasty layering
> violations associated with them like being tied to vmas so until some
> hugepage genius shows up and tells us how to do this cleanly, we'll
> probably have to deal with the associated mmap_sem.
>
> San, your patch is touching the wrong mm_sem, I think. The interesting
> races are going to happen on the target mm, not current's (generally not
> the same).

The thing is, you need to hold the mmap_sem over the whole loop in 
pagemap_pte_range (since you're using the vma inside the loop). And that 
means that you'd hold mmap_sem over the add_to_pagemap(), and the 
put_user() reference. Which has the deadlock bug I pointed out, and that you 
totally ignored.

The fact that it's a separate mm doesn't change the deadlock in _any_ 
shape or form. It _could_ be the same mm, but even if it was another VM 
entirely, it would just make the deadlock possible as an ABBA thing across 
two different VM's instead of through a single VM.
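
Schematically - this is just the shape of the locking, not the literal
fs/proc/task_mmu.c code, and the per-entry lookup is a stand-in:

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/* The buggy shape: 'mm' is the target task's mm, 'out' points into the
 * calling task's (current's) user buffer. */
static long walk_holding_mmap_sem(struct mm_struct *mm, unsigned long start,
				  unsigned long end, u64 __user *out)
{
	unsigned long addr;

	down_read(&mm->mmap_sem);		/* held across the whole walk */
	for (addr = start; addr < end; addr += PAGE_SIZE) {
		struct vm_area_struct *vma = find_vma(mm, addr);
		u64 entry = vma ? 1 : 0;	/* stand-in for the real lookup */

		/* put_user() can fault on the output page.  The fault path
		 * does down_read(&current->mm->mmap_sem): if mm is
		 * current->mm, that nests inside the read lock we already
		 * hold and deadlocks as soon as a writer is queued; if it
		 * is another mm, two walkers going in opposite directions
		 * give the same thing as an ABBA. */
		if (put_user(entry, out++)) {
			up_read(&mm->mmap_sem);
			return -EFAULT;
		}
	}
	up_read(&mm->mmap_sem);
	return 0;
}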

> Holding the mm_sem across the entire walk is also prohibitive, it 
> probably needs to be localized to individual find_vma calls.

You'd need to do it for each individual page, as far as I can tell, and 
move the find_vma() inside the loop itself in order to avoid the whole 
"hold mmap-sem while potentially doing a page fault" case.

At that point, it looks like it wouldn't be buggy any more, it would just 
suck horribly from a performance standpoint.
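
Roughly - again only a sketch, same headers and stand-in lookup as above:

static long walk_locking_per_page(struct mm_struct *mm, unsigned long start,
				  unsigned long end, u64 __user *out)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		struct vm_area_struct *vma;
		u64 entry;

		down_read(&mm->mmap_sem);	/* only around the vma lookup */
		vma = find_vma(mm, addr);
		entry = vma ? 1 : 0;		/* stand-in for the real lookup */
		up_read(&mm->mmap_sem);

		/* No mmap_sem held here, so a fault on the output buffer
		 * is harmless - but that's one lock round-trip per page. */
		if (put_user(entry, out++))
			return -EFAULT;
	}
	return 0;
}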

So Matt, please actually address the _bug_ I pointed out rather than talk 
about other things. And yes, getting rid of the vma accesses sounds like 
it would fix it best. If that means that it doesn't work for hugepages, so 
be it.

			Linus
