Message-ID: <20130104115748.GA8830@google.com>
Date:	Fri, 4 Jan 2013 03:57:48 -0800
From:	Michel Lespinasse <walken@...gle.com>
To:	Roman Dubtsov <dubtsov@...il.com>
Cc:	linux-kernel@...r.kernel.org,
	Andy Lutomirski <luto@...capital.net>,
	Rik van Riel <riel@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Hugh Dickins <hughd@...gle.com>
Subject: Re: mmap() scalability in the presence of the MAP_POPULATE flag

On Fri, Jan 04, 2013 at 12:09:37AM +0700, Roman Dubtsov wrote:
> On Wed, 2013-01-02 at 16:09 -0800, Michel Lespinasse wrote:
> > > Is there an interest in fixing this or concurrent mmaps() from the same
> > > process are too much of a corner case to worry about it?
> > 
> > Funny this comes up again. I actually have a patch series that is
> > supposed to do that:
> > [PATCH 0/9] Avoid populating unbounded num of ptes with mmap_sem held
> > 
> > However, the patches are still pending, didn't get much review
> > (probably not enough for Andrew to take them at this point), and I
> > think everyone forgot about them during the winter break.
> > 
> > Care to have a look at that thread and see if it works for you ?
> > 
> > (caveat: you will possibly also need "[PATCH 10/9] mm: make
> > do_mmap_pgoff return populate as a size in bytes, not as a bool" to
> > make the series actually work for you)
> 
> I applied the patches on top of 3.7.1. Here're the results for 4 threads
> concurrently mmap()-ing 10 64MB buffers in a loop without munmap()-s.
> The data is from a Nehalem i7-920 single-socket 4-core CPU. I've also
> added the older data I have for the 3.6.11 (patched and not) for
> reference.
> 
> 3.6.11 vanilla, do not populate: 0.001 seconds
> 3.6.11 vanilla, populate via a loop: 0.216 seconds
> 3.6.11 vanilla, populate via MAP_POPULATE: 0.358 seconds 
> 
> 3.6.11 + crude patch, do not populate: 0.002 seconds
> 3.6.11 + crude patch, populate via loop: 0.215 seconds
> 3.6.11 + crude patch, populate via MAP_POPULATE: 0.217 seconds
> 
> 3.7.1 vanilla, do not populate: 0.001 seconds
> 3.7.1 vanilla, populate via a loop: 0.216 seconds
> 3.7.1 vanilla, populate via MAP_POPULATE: 0.411 seconds
> 
> 3.7.1 + patch series, do not populate: 0.001 seconds
> 3.7.1 + patch series, populate via loop: 0.216 seconds
> 3.7.1 + patch series, populate via MAP_POPULATE: 0.273 seconds
> 
> So, the patch series mentioned above do improve performance but as far
> as I can read the benchmarking data there's still some performance left
> on the table.

Interesting. I expect you are using anon memory, so it's likely that
mm_populate() holds the mmap_sem read side for the entire duration of
the 64MB populate. While it does, mmap() calls from the other threads
have to wait for the write side.

Just curious, does the following help?

diff --git a/mm/memory.c b/mm/memory.c
index e4ab66b94bb8..f65a4b3b2141 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1627,6 +1627,12 @@ static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long add
 	       stack_guard_page_end(vma, addr+PAGE_SIZE);
 }
 
+/* not upstreamable as is, just for the sake of testing */
+static inline int rwsem_is_contended(struct rw_semaphore *sem)
+{
+	return (sem->count < 0);
+}
+
 /**
  * __get_user_pages() - pin user pages in memory
  * @tsk:	task_struct of target task
@@ -1854,6 +1860,11 @@ next_page:
 			i++;
 			start += PAGE_SIZE;
 			nr_pages--;
+			if (nonblocking && rwsem_is_contended(&mm->mmap_sem)) {
+				up_read(&mm->mmap_sem);
+				*nonblocking = 0;
+				return i;
+			}
 		} while (nr_pages && start < vma->vm_end);
 	} while (nr_pages);
 	return i;

Linus didn't like rwsem_is_contended() when I implemented the mlock
side of this a couple of years ago, but maybe we can change his mind now.

If this doesn't help, could you please send me your test case? I
think you described enough of it that I would be able to reproduce it
given some time, but it's just easier if you send me a short C file :)
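
Going only by the description, a reproduction might look roughly like the
sketch below. This is a guess at the workload, not the actual test: the
function and type names (run_populate_test, worker_args, populate_mode) are
made up here, and the three modes correspond to "do not populate",
"populate via a loop" and "populate via MAP_POPULATE". The mappings are
deliberately never unmapped, as in the description.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_POPULATE / MAP_ANONYMOUS */
#endif
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

enum populate_mode { POPULATE_NONE, POPULATE_LOOP, POPULATE_FLAG };

struct worker_args {
	enum populate_mode mode;
	int nbufs;
	size_t bufsize;
	int failed;
};

/* Each thread maps nbufs anonymous buffers and never unmaps them. */
static void *worker(void *p)
{
	struct worker_args *a = p;
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	int flags = MAP_PRIVATE | MAP_ANONYMOUS;

	if (a->mode == POPULATE_FLAG)
		flags |= MAP_POPULATE;

	for (int i = 0; i < a->nbufs; i++) {
		void *buf = mmap(NULL, a->bufsize, PROT_READ | PROT_WRITE,
				 flags, -1, 0);
		if (buf == MAP_FAILED) {
			a->failed = 1;
			break;
		}
		if (a->mode == POPULATE_LOOP) {
			/* touch one byte per page to fault it in */
			volatile char *c = buf;
			for (size_t off = 0; off < a->bufsize; off += page)
				c[off] = 1;
		}
	}
	return NULL;
}

/* Returns elapsed wall-clock seconds, or -1.0 on any failure. */
double run_populate_test(enum populate_mode mode, int nthreads,
			 int nbufs, size_t bufsize)
{
	pthread_t tids[64];
	struct worker_args args[64];
	struct timespec t0, t1;
	int started = 0, failed = 0;

	if (nthreads < 1 || nthreads > 64)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < nthreads; i++) {
		args[i] = (struct worker_args){ mode, nbufs, bufsize, 0 };
		if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0) {
			failed = 1;
			break;
		}
		started++;
	}
	for (int i = 0; i < started; i++) {
		pthread_join(tids[i], NULL);
		failed |= args[i].failed;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	if (failed)
		return -1.0;
	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling run_populate_test(POPULATE_FLAG, 4, 10, (size_t)64 << 20) from a
small main() would approximate the 4 threads x 10 x 64MB case; build with
-pthread. Numbers from it are not comparable to the ones quoted above
unless it matches what the real test does.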

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.