Date:	Wed, 8 Dec 2010 15:58:21 -0800
From:	Michel Lespinasse <walken@...gle.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	linux-mm@...ck.org, Hugh Dickins <hughd@...gle.com>,
	Rik van Riel <riel@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Nick Piggin <npiggin@...nel.dk>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	linux-kernel@...r.kernel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH 1/6] mlock: only hold mmap_sem in shared mode when
 faulting in pages

On Wed, Dec 8, 2010 at 3:27 PM, Andrew Morton <akpm@...ux-foundation.org> wrote:
>> Currently mlock() holds mmap_sem in exclusive mode while the pages get
>> faulted in. In the case of a large mlock, this can potentially take a
>> very long time, during which various commands such as 'ps auxw' will
>> block. This makes sysadmins unhappy:
>>
>> real    14m36.232s
>> user    0m0.003s
>> sys     0m0.015s
>> (output from 'time ps auxw' while a 20GB file was being mlocked without
>> being previously preloaded into page cache)
>
> The kernel holds down_write(mmap_sem) for 14m36s?

Yes...

[... patch snipped off ...]

> Am I correct in believing that we'll still hold down_read(mmap_sem) for
> a quarter hour?

Yes, patch 1/6 changes the long hold time to be in read mode instead
of write mode, which is only a band-aid. But it prepares for patch
5/6, which releases mmap_sem whenever there is contention on it or
when blocking on disk reads.

> We don't need to hold mmap_sem at all while faulting in those pages,
> do we?  We could just do
>
>        for (addr = start; addr < end; addr += PAGE_SIZE)
>                get_user(x, addr);
>
> and voila.  If the pages are in cache and the ptes are set up then that
> will be *vastly* faster than the proposed code.  If the get_user()
> takes a minor fault then it'll be slower.  If it's a major fault then
> the difference probably doesn't matter much.

get_user wouldn't suffice if the page is already mapped in, as we need
to mark it as PageMlocked. Also, we need to skip IO and PFNMAP
regions. I don't think you can make things much simpler than what I
ended up with.
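
For reference, the per-vma logic is shaped roughly like this. It's a
simplified sketch, not the exact code from the series; the accounting
calls mirror what mlock_vma_page() does in mm/mlock.c:

        if (vma->vm_flags & (VM_IO | VM_PFNMAP))
                return;         /* never fault these in by hand */

        /* ... fault the page in, or look up the one already mapped ... */

        if (!TestSetPageMlocked(page)) {        /* may already be mapped */
                inc_zone_page_state(page, NR_MLOCK);
                count_vm_event(UNEVICTABLE_PGMLOCKED);
        }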

> But whatever.  Is this patchset a half-fix, and should we rather be
> looking for a full-fix?

I think the series fully fixes the mlock() and mlockall() cases,
which have been the more pressing use cases for us.

Even then, there are still cases where we could observe long
mmap_sem hold times - fundamentally, every place that calls
get_user_pages() (or do_mmap(), in the mlockall MCL_FUTURE case) on a
large page range can create such problems. From the looks of it, most
of these places wouldn't actually care if the mmap_sem got dropped in
the middle of the operation, but a general fix will have to involve
looking at all the call sites to be sure.
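
If someone wanted to attack those call sites, the general shape would
presumably be a chunked loop like the one below. This is hypothetical,
not from the series: CHUNK is an arbitrary batch size, 'pages' is the
caller's page array, it uses the eight-argument get_user_pages() of
this kernel era, and error handling plus the put_page() of pinned
pages is elided:

        unsigned long addr, chunk_end;
        int ret;

        for (addr = start; addr < end; addr = chunk_end) {
                chunk_end = min(addr + CHUNK, end);

                down_read(&mm->mmap_sem);
                ret = get_user_pages(current, mm, addr,
                                     (chunk_end - addr) >> PAGE_SHIFT,
                                     1 /* write */, 0 /* force */,
                                     pages, NULL);
                up_read(&mm->mmap_sem);
                if (ret <= 0)
                        break;
                /*
                 * A real version would restart at addr + ret * PAGE_SIZE
                 * when get_user_pages() returns a short count.
                 */
        }

The cost is that each caller has to re-validate whatever it learned
about the address space between chunks, which is why a general fix
needs a per-site audit.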

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.