Date:	Wed, 16 Nov 2011 16:16:57 -0800 (PST)
From:	Hugh Dickins <hughd@...gle.com>
To:	Andrea Arcangeli <aarcange@...hat.com>
cc:	Mel Gorman <mgorman@...e.de>, Nai Xia <nai.xia@...il.com>,
	Pawel Sikora <pluto@...k.net>,
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	jpiszcz@...idpixels.com, arekm@...-linux.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mremap: enforce rmap src/dst vma ordering in case of
 vma_merge succeeding in copy_vma

On Wed, 16 Nov 2011, Andrea Arcangeli wrote:
> On Wed, Nov 09, 2011 at 02:25:42AM +0100, Andrea Arcangeli wrote:
> > Also note, if we find a way to enforce orderings in the prio tree (not
> > sure if it's possible, apparently it's already using list_add_tail
> > so..), then we could also remove the i_mmap_lock from mremap and fork.
> 
> I'm not optimistic we can enforce ordering there. Being a tree it's
> walked in range order.
> 
> I thought of another solution that would avoid having to reorder the
> list in mremap and avoid having to add the i_mmap_mutex to fork (and
> then we could remove it from mremap too). The solution is to rmap_walk
> twice: two loops over the same_anon_vma for those rmap walks that must
> be reliable (including two calls to unmap_mapping_range), for both the
> same_anon_vma list and the prio tree.
> 
> Reading truncate_pagecache I see two loops already, and a comment
> saying it's for fork(), to avoid leaking ptes in the child. So fork is
> probably ok already without taking the i_mmap_mutex. But then I wonder
> why the same double loop doesn't also fix mremap, and why the
> i_mmap_mutex is really needed in mremap at all, considering those two
> calls are already present in truncate_pagecache. I wonder if that was
> a "theoretical" fix that missed the fact that truncate already walks
> the prio tree twice, so it doesn't matter if the rmap_walk goes in the
> opposite direction of move_page_tables. That i_mmap_lock in mremap
> (now i_mmap_mutex) has been there since the start of git history; the
> double loop was introduced in d00806b183152af6d24f46f0c33f14162ca1262a.
> So it's quite possible that the i_mmap_mutex is now useless (after
> d00806b183152af6d24f46f0c33f14162ca1262a), that the fix for fork was
> already taking care of mremap too, and that the i_mmap_mutex can now
> be removed.

As you found, the mremap locking long predates truncation's double unmap.

That's an interesting point, and you may be right - though, what about
the *very* unlikely case where the first unmap_mapping_range looks at
the new vma while the pte is in the old, then at the old vma once the
pte is in the new; then move_page_tables runs out of memory and cannot
complete; then the second unmap_mapping_range looks at the old vma
while the pte is still in the new (I guess this needs some other
activity to have jumbled the prio_tree, and may just be impossible),
then at the new (to be abandoned) vma after the pte has moved back to
the old?

Probably not an everyday occurrence :)

But, setting that aside, I've always thought of that second call to
unmap_mapping_range() as a regrettable expedient that we should try
to eliminate, e.g. by checking for private mappings in the first pass
and skipping the second call if there were none.

But since nobody ever complained about that added overhead, I never
got around to bothering; and you may consider the i_mmap_mutex in
move_ptes a more serious unnecessary overhead.

By the way, you mention "a comment saying it's for fork()": I don't
find "fork" anywhere in mm/truncate.c; my understanding is in this
comment (probably mine) from truncate_pagecache():

	/*
	 * unmap_mapping_range is called twice, first simply for
	 * efficiency so that truncate_inode_pages does fewer
	 * single-page unmaps.  However after this first call, and
	 * before truncate_inode_pages finishes, it is possible for
	 * private pages to be COWed, which remain after
	 * truncate_inode_pages finishes, hence the second
	 * unmap_mapping_range call must be made for correctness.
	 */

The second call was not (I think) necessary when we relied upon
truncate_count, but became necessary once Nick relied upon page lock
(the page lock on the file page providing no guarantee for the COWed
page).
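For context, the shape of truncate_pagecache() around that comment is
roughly the following - a paraphrased sketch from memory of the kernels
of that era, not an exact copy of the source:

```c
void truncate_pagecache(struct inode *inode, loff_t old, loff_t new)
{
	struct address_space *mapping = inode->i_mapping;

	/* First call: bulk unmap, purely so truncate_inode_pages
	 * below has fewer single-page unmaps to do (efficiency). */
	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
	truncate_inode_pages(mapping, new);
	/* Second call: private pages may have been COWed between the
	 * first call and the end of truncate_inode_pages; unmap them
	 * too (correctness). */
	unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
}
```

The even_cows argument (the final 1) is what makes both calls reach the
COWed private copies as well as the file pages.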

Hugh
