linux-kernel - Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <52D519DC.9060402@suse.cz>
Date:	Tue, 14 Jan 2014 12:05:00 +0100
From:	Vlastimil Babka <vbabka@...e.cz>
To:	Motohiro Kosaki <Motohiro.Kosaki@...fujitsu.com>,
	Andrew Morton <akpm@...ux-foundation.org>
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Sasha Levin <sasha.levin@...cle.com>,
	Wanpeng Li <liwanp@...ux.vnet.ibm.com>,
	Michel Lespinasse <walken@...gle.com>,
	Bob Liu <bob.liu@...cle.com>, Nick Piggin <npiggin@...e.de>,
	Motohiro Kosaki JP <kosaki.motohiro@...fujitsu.com>,
	Rik van Riel <riel@...hat.com>,
	David Rientjes <rientjes@...gle.com>,
	Mel Gorman <mgorman@...e.de>, Minchan Kim <minchan@...nel.org>,
	Hugh Dickins <hughd@...gle.com>,
	Johannes Weiner <hannes@...xchg.org>,
	linux-mm <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs

On 01/13/2014 03:03 PM, Vlastimil Babka wrote:
> On 01/10/2014 06:48 PM, Motohiro Kosaki wrote:
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 068522d..b99c742 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long
>>> cursor, unsigned int *mapcount,
>>>    		BUG_ON(!page || PageAnon(page));
>>>
>>>    		if (locked_vma) {
>>> -			mlock_vma_page(page);   /* no-op if already
>>> mlocked */
>>> -			if (page == check_page)
>>> +			if (page == check_page) {
>>> +				/* we know we have check_page locked */
>>> +				mlock_vma_page(page);
>>>    				ret = SWAP_MLOCK;
>>> +			} else if (trylock_page(page)) {
>>> +				/*
>>> +				 * If we can lock the page, perform mlock.
>>> +				 * Otherwise leave the page alone, it will be
>>> +				 * eventually encountered again later.
>>> +				 */
>>> +				mlock_vma_page(page);
>>> +				unlock_page(page);
>>> +			}
>>>    			continue;	/* don't unmap */
>>>    		}
>>
>> I audited all related mm code. However I couldn't find any race that it can close.
>
> Well, I would say the lock here closes the race with page migration, no? (As discussed below)
>
>> First off,  current munlock code is crazy tricky.
>
> Oops, that's actually a result of my patches to speed up munlock by batching pages (since 3.12).
>
>> munlock
>> 	down_write(mmap_sem)
>> 	do_mlock()
>> 		mlock_fixup
>> 			munlock_vma_pages_range
>> 				__munlock_pagevec
>> 					spin_lock_irq(zone->lru_lock)
>> 					TestClearPageMlocked(page)
>> 					del_page_from_lru_list
>> 					spin_unlock_irq(zone->lru_lock)
>> 					lock_page
>> 					__munlock_isolated_page
>> 					unlock_page
>> 				
>> 	up_write(mmap_sem)
>>
>> Look, TestClearPageMlocked(page) is not protected by lock_page.
>
> Right :( That's my fault, when developing the patch series I didn't see the page
> migration race, and it seemed that lock is only needed to protect the rmap operations
> in __munlock_isolated_page()
>
>> But this is still
>> safe because Page_mocked mean one or more vma marked VM_LOCKED then we
>> only need care about turning down final VM_LOCKED. I.e. mmap_sem protect them.
>>
>> And,
>>
>> 					spin_lock_irq(zone->lru_lock)
>> 					del_page_from_lru_list
>> 					spin_unlock_irq(zone->lru_lock)
>>
>> This idiom ensures I or someone else isolate the page from lru and isolated pages
>> will be put back by putback_lru_page in anycase. So, the pages will move the right
>> lru eventually.
>>
>> And then, taking page-lock doesn't help to close vs munlock race.
>>
>> On the other hands, page migration has the following call stack.
>>
>> some-caller [isolate_lru_page]
>> 	unmap_and_move
>> 		__unmap_and_move
>> 		trylock_page
>> 		try_to_unmap
>> 		move_to_new_page
>> 			migrate_page
>> 				migrate_page_copy
>> 		unlock_page
>>
>> The critical part (i.e. migrate_page_copy) is protected by both page isolation and page lock.
>
> However, none of that protects against setting PG_mlocked in try_to_unmap_cluster() -> mlock_vma_page(),
> or clearing PG_mlocked in __munlock_pagevec().
>
> The race I have in mind for munlock is:
>
> CPU1 does page migration
>   some-caller [isolate_lru_page]
>   	unmap_and_move
>   		__unmap_and_move
>   		trylock_page
>   		try_to_unmap
>   		move_to_new_page
>   			migrate_page
>   				migrate_page_copy
> 					mlock_migrate_page - transfers PG_mlocked from old page to new page
>
> CPU2 does munlock:
>   munlock
>   	down_write(mmap_sem)
>   	do_mlock()
>   		mlock_fixup
>   			munlock_vma_pages_range
>   				__munlock_pagevec
>   					spin_lock_irq(zone->lru_lock)
>   					TestClearPageMlocked(page) - here it finds PG_mlocked already cleared
> 					so it stops, but meanwhile new page already has PG_mlocked set and will
> 					stay inevictable

As Mel pointed out to me, page lock in munlock alone would not help here
anyway, because munlock would just wait for migration to unlock the old 
page and then still fail to clear a flag that has been transferred by 
migration to the new page. So if there was a race, it would be older 
than __munlock_pagevec() removing TestClearPageMlocked(page) from the 
page_lock protected section.

But then I noticed that migration bails out for pages that have elevated 
pins, and this is done after making the page unreachable for 
mlock/munlock via pte migration entries. So this alone should prevent a 
race with m(un)lock, with three possible scenarios:
- m(un)lock sees the migration entry and waits for migration to
   complete (via follow_page_mask), operates on the new page
- m(un)lock gets the page reference before migration entry is set,
   finishes before migration checks the page pin count, migration
   transfers the new PG_mlocked state to the new page
- m(un)lock doesn't finish that quickly, migration bails out, m(un)lock
   changes the old page that is mapped back


>> Page fault path take lock page and doesn't use page isolation. This is correct.
>> try_to_unmap_cluster doesn't take page lock, but it ensure the page is isolated. This is correct too.
>
> I don't see where try_to_unmap_cluster() has guaranteed that pages other than check_page are isolated?
> (I might just be missing that). So in the race example above, CPU2 could be doing try_to_unmap_cluster()
> and set PG_mlocked on old page, with no effect on the new page. Not fatal for the design of lazy mlocking,
> but a distorted accounting anyway.
>
>> Plus, do_wp_page() has the following comment. But it is wrong. This lock is necessary to protect against
>> page migration, but not lru manipulation.
>>
>> 		if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) {
>> 			lock_page(old_page);	/* LRU manipulation */
>> 			munlock_vma_page(old_page);
>> 			unlock_page(old_page);
>> 		}
>>

So the protection is actually for rmap operations in try_to_munlock.
The comment could be removed from the caller here and appear just at the 
PageLocked assertion in munlock_vma_page().

>> But, you know, this is crazy ugly lock design. I'm ok to change the rule to that PG_mlocked must be protected
>> page lock. If so, I propose to add PageMlocked() like this
>>
>> 			} else if (!PageMlocked() && trylock_page(page)) {
>> 				/*
>> 				 * If we can lock the page, perform mlock.
>> 				 * Otherwise leave the page alone, it will be
>> 				 * eventually encountered again later.
>> 				 */
>> 				mlock_vma_page(page);
>> 				unlock_page(page);
>>
>> This is safe. Even if race is happen and vmscan failed to mark PG_mlocked, that's no problem. Next vmscan may
>> do the right job and will fix the mistake.
>>
>> Anyway,  I'm also sure this is not recent issue. It lived 5 years. It is no reason to rush.

If we agree that page migration is safe as explained above, then 
removing the PageLocked assertion from mlock_vma_page is again an 
option, instead of making sure that all callers lock?

> Yeah but there's the new issue with __munlock_pagevec() :( Perhaps a more serious one as it could leave pages inevictable
> when they shouldn't be. I'm not sure if the performance benefits of that could be preserved with full page locking.
> Another option would be that page migration would somehow deal with the race, or just leave the target pages without
> PG_mlocked and let them be dealt with later.
> But if we go with the rule that page lock must protect PG_mlocked, then there's also clear_page_mlock() to consider.
> And finally, we could then at least replace the atomic test and set with faster non-atomic variants.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@...ck.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@...ck.org"> email@...ck.org </a>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/