Message-ID: <20130715072432.GA28053@voom.fritz.box>
Date: Mon, 15 Jul 2013 17:24:32 +1000
From: David Gibson <david@...son.dropbear.id.au>
To: Davidlohr Bueso <davidlohr.bueso@...com>
Cc: Hugh Dickins <hughd@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Rik van Riel <riel@...hat.com>,
Michel Lespinasse <walken@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Konstantin Khlebnikov <khlebnikov@...nvz.org>,
Michal Hocko <mhocko@...e.cz>,
"AneeshKumarK.V" <aneesh.kumar@...ux.vnet.ibm.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Hillf Danton <dhillf@...il.com>, linux-mm@...ck.org,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm/hugetlb: per-vma instantiation mutexes
On Sun, Jul 14, 2013 at 08:16:44PM -0700, Davidlohr Bueso wrote:
> On Fri, 2013-07-12 at 17:54 -0700, Hugh Dickins wrote:
> > Adding the essential David Gibson to the Cc list.
> >
> > On Fri, 12 Jul 2013, Davidlohr Bueso wrote:
> >
> > > The hugetlb_instantiation_mutex serializes hugepage allocation and instantiation
> > > in the page directory entry. It was found that this mutex can become quite contended
> > > during the early phases of large databases that make use of huge pages - for instance
> > > startup and initial runs. One clear example is a 1.5 GB Oracle database, where lockstat
> > > reports that this mutex can be one of the top 5 most contended locks in the kernel during
> > > the first few minutes:
> > >
> > > hugetlb_instantiation_mutex: 10678 10678
> > > ---------------------------
> > > hugetlb_instantiation_mutex 10678 [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > > ---------------------------
> > > hugetlb_instantiation_mutex 10678 [<ffffffff8115e14e>] hugetlb_fault+0x9e/0x340
> > >
> > > contentions: 10678
> > > acquisitions: 99476
> > > waittime-total: 76888911.01 us
> > >
> > > Instead of serializing each hugetlb fault, we can deal with concurrent faults for pages
> > > in different vmas. The per-vma mutex is initialized when creating a new vma. So, back to
> > > the example above, we now get much less contention:
> > >
> > > &vma->hugetlb_instantiation_mutex: 1 1
> > > ---------------------------------
> > > &vma->hugetlb_instantiation_mutex 1 [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
> > > ---------------------------------
> > > &vma->hugetlb_instantiation_mutex 1 [<ffffffff8115e216>] hugetlb_fault+0xa6/0x350
> > >
> > > contentions: 1
> > > acquisitions: 108092
> > > waittime-total: 621.24 us
> > >
> > > Signed-off-by: Davidlohr Bueso <davidlohr.bueso@...com>
> >
> > I agree this is a problem worth solving,
> > but I doubt this patch is the right solution.
It's not.
> > > ---
> > > include/linux/mm_types.h | 3 +++
> > > mm/hugetlb.c | 12 +++++-------
> > > mm/mmap.c | 3 +++
> > > 3 files changed, 11 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index fb425aa..b45fd87 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -289,6 +289,9 @@ struct vm_area_struct {
> > > #ifdef CONFIG_NUMA
> > > struct mempolicy *vm_policy; /* NUMA policy for the VMA */
> > > #endif
> > > +#ifdef CONFIG_HUGETLB_PAGE
> > > + struct mutex hugetlb_instantiation_mutex;
> > > +#endif
> > > };
> >
> > Bloating every vm_area_struct with a rarely useful mutex:
> > I'm sure you can construct cases where per-vma mutex would win over
> > per-mm mutex, but they will have to be very common to justify the bloat.
> >
>
> I cannot disagree here, this was my main concern about this patch, and,
> as you mentioned, if we can just get rid of the need for the lock, much
> better.
So, by all means try to get rid of the mutex, but doing so is
surprisingly difficult - I made several failed attempts a long time
ago, before giving up and creating the mutex in the first place.
Note that there is no analogous handling in the normal page case,
because for normal pages we always assume we have a few pages of
"slush" and can temporarily overallocate without problems. Because
hugepages are a more precious resource, it's usual to want to allocate
and use every single one - spurious OOMs when racing to allocate the
very last hugepage are what the mutex protects against.
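The race the mutex guards against can be sketched in userspace. This is a hypothetical pthreads simulation, not kernel code: `pool` stands in for the free hugepage count, `pte` for the page table entry, and the mutex plays the role of hugetlb_instantiation_mutex. Without it, two faulters can both see the PTE empty and both draw from a one-page pool, so the loser reports a spurious OOM even though the winner's page would have served both:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_mutex_t instantiation_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_int pool;         /* free hugepages remaining */
static atomic_int pte;          /* 0 = not yet instantiated */
static atomic_int spurious_oom; /* faults that failed despite the page existing */
static int use_lock;

static void *fault(void *arg)
{
	(void)arg;
	if (use_lock)
		pthread_mutex_lock(&instantiation_mutex);
	if (atomic_load(&pte) == 0) {                  /* huge_pte_none() */
		if (atomic_fetch_sub(&pool, 1) <= 0)   /* pool exhausted: */
			atomic_fetch_add(&spurious_oom, 1); /* VM_FAULT_OOM */
		else
			atomic_store(&pte, 1);         /* instantiate the page */
	}
	if (use_lock)
		pthread_mutex_unlock(&instantiation_mutex);
	return NULL;
}

/* Run 8 concurrent faults on the same page with a 1-page pool;
 * returns how many of them hit a spurious OOM. */
int run_race(int serialize)
{
	pthread_t t[8];
	int i;

	use_lock = serialize;
	atomic_store(&pool, 1);
	atomic_store(&pte, 0);
	atomic_store(&spurious_oom, 0);
	for (i = 0; i < 8; i++)
		pthread_create(&t[i], NULL, fault, NULL);
	for (i = 0; i < 8; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&spurious_oom);
}
```

With serialization the first faulter instantiates the page and every later one sees the PTE filled, so no fault ever fails; without it the outcome depends on thread timing.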
> > > struct core_thread {
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 83aff0a..12e665b 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -137,12 +137,12 @@ static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
> > > * The region data structures are protected by a combination of the mmap_sem
> > > * and the hugetlb_instantion_mutex. To access or modify a region the caller
> > > * must either hold the mmap_sem for write, or the mmap_sem for read and
> > > - * the hugetlb_instantiation mutex:
> > > + * the vma's hugetlb_instantiation mutex:
> >
> > Reading the existing comment, this change looks very suspicious to me.
> > A per-vma mutex is just not going to provide the necessary exclusion, is
> > it? (But I recall next to nothing about these regions and
> > reservations.)
A per-VMA lock is definitely wrong. I think it handles one form of
the race, between threads sharing a VM on a MAP_PRIVATE mapping.
However, another form of the race can and does occur between different
MAP_SHARED VMAs in the same or different processes. I think there may
be edge cases involving mremap() and MAP_PRIVATE that will also be
missed by a per-VMA lock.
Note that the libhugetlbfs testsuite contains tests for both PRIVATE
and SHARED variants of the race.
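Why a per-VMA lock misses the MAP_SHARED case can be shown with a toy C model (the struct names here are illustrative stand-ins, not the kernel's): two processes mapping the same hugetlbfs file get distinct VMAs but share one address_space, so a lock living in the VMA never serializes their faults on the same file page:

```c
#include <stddef.h>

/* Stand-ins for struct address_space and vm_area_struct. */
struct mapping { int id; };
struct vma { struct mapping *file_mapping; };

/* Two processes mmap() the same hugetlbfs file: distinct VMAs,
 * one shared mapping. */
static struct mapping shared_file = { 1 };
static struct vma proc_a = { &shared_file };
static struct vma proc_b = { &shared_file };

/* Returns 1 if both faulting tasks would pick the same lock. */
int same_lock_per_vma(void)  { return &proc_a == &proc_b; }
int same_lock_per_file(void) { return proc_a.file_mapping == proc_b.file_mapping; }
```

Keying the lock on the VMA gives the two tasks different locks; keying it on the file mapping (as the hashed-mutex approach below does for MAP_SHARED) gives them the same one.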
> > > *
> > > * down_write(&mm->mmap_sem);
> > > * or
> > > * down_read(&mm->mmap_sem);
> > > - * mutex_lock(&hugetlb_instantiation_mutex);
> > > + * mutex_lock(&vma->hugetlb_instantiation_mutex);
> > > */
> > > struct file_region {
> > > struct list_head link;
> > > @@ -2547,7 +2547,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> > >
> > > /*
> > > * Hugetlb_cow() should be called with page lock of the original hugepage held.
> > > - * Called with hugetlb_instantiation_mutex held and pte_page locked so we
> > > + * Called with the vma's hugetlb_instantiation_mutex held and pte_page locked so we
> > > * cannot race with other handlers or page migration.
> > > * Keep the pte_same checks anyway to make transition from the mutex easier.
> > > */
> > > @@ -2847,7 +2847,6 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > int ret;
> > > struct page *page = NULL;
> > > struct page *pagecache_page = NULL;
> > > - static DEFINE_MUTEX(hugetlb_instantiation_mutex);
> > > struct hstate *h = hstate_vma(vma);
> > >
> > > address &= huge_page_mask(h);
> > > @@ -2872,7 +2871,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > * get spurious allocation failures if two CPUs race to instantiate
> > > * the same page in the page cache.
> > > */
> > > - mutex_lock(&hugetlb_instantiation_mutex);
> > > + mutex_lock(&vma->hugetlb_instantiation_mutex);
> > > entry = huge_ptep_get(ptep);
> > > if (huge_pte_none(entry)) {
> > > ret = hugetlb_no_page(mm, vma, address, ptep, flags);
> > > @@ -2943,8 +2942,7 @@ out_page_table_lock:
> > > put_page(page);
> > >
> > > out_mutex:
> > > - mutex_unlock(&hugetlb_instantiation_mutex);
> > > -
> > > + mutex_unlock(&vma->hugetlb_instantiation_mutex);
> > > return ret;
> > > }
> > >
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index fbad7b0..8f0b034 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1543,6 +1543,9 @@ munmap_back:
> > > vma->vm_page_prot = vm_get_page_prot(vm_flags);
> > > vma->vm_pgoff = pgoff;
> > > INIT_LIST_HEAD(&vma->anon_vma_chain);
> > > +#ifdef CONFIG_HUGETLB_PAGE
> > > + mutex_init(&vma->hugetlb_instantiation_mutex);
> > > +#endif
> > >
> > > error = -EINVAL; /* when rejecting VM_GROWSDOWN|VM_GROWSUP */
> > >
> >
> > The hugetlb_instantiation_mutex has always been rather an embarrassment:
> > it would be much more satisfying to remove the need for it, than to split
> > it in this way. (Maybe a technique like THP sometimes uses, marking an
> > entry as in transition while the new entry is prepared.)
>
> I didn't realize this was a known issue. Doing some googling I can see
> some additional alternatives to getting rid of the lock:
>
> - [PATCH] remove hugetlb_instantiation_mutex:
> https://lkml.org/lkml/2007/7/27/46
>
> - Commit 3935baa (hugepage: serialize hugepage allocation and
> instantiation): David mentioned a way to possibly avoid the need for
> this lock.
>
> >
> > But I suppose it would not have survived so long if that were easy,
> > and I think it may have grown some subtle dependants over the years -
> > as the region comment indicates.
>
> Indeed. I'm not very acquainted with the hugetlb code, so, assuming this
> patch's approach isn't valid and we can figure out some way of getting
> rid of the mutex, I'd like to get some mm folks' feedback on this.
I have previously proposed a correct method of improving scalability,
although it doesn't eliminate the lock: use a table of hashed mutexes
instead of a single one. It wasn't merged at the time, though I don't
recall why. The old, and probably bitrotted, patch is below for reference:
hugepage: Allow parallelization of the hugepage fault path
At present, the page fault path for hugepages is serialized by a
single mutex. This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, the race loser returning VM_FAULT_OOM). This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.
Unfortunately this serialization also means that clearing of hugepages
cannot be parallelized across multiple CPUs, which can lead to very
long process startup times when using large numbers of hugepages.
This patch improves the situation by replacing the single mutex with a
table of mutexes, selected based on a hash of the address_space and
file offset being faulted (or mm and virtual address for MAP_PRIVATE
mappings).
Signed-off-by: David Gibson <david@...son.dropbear.id.au>
Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c 2006-10-25 16:30:00.000000000 +1000
+++ working-2.6/mm/hugetlb.c 2006-10-26 14:38:08.000000000 +1000
@@ -32,6 +32,13 @@ static unsigned int free_huge_pages_node
*/
static DEFINE_SPINLOCK(hugetlb_lock);
+/*
+ * Serializes faults on the same logical page. This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table;
+
static void clear_huge_page(struct page *page, unsigned long addr)
{
int i;
@@ -160,6 +167,13 @@ static int __init hugetlb_init(void)
}
max_huge_pages = free_huge_pages = nr_huge_pages = i;
printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+
+ num_fault_mutexes = 2 * num_possible_cpus() - 1;
+ htlb_fault_mutex_table =
+ kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+ for (i = 0; i < num_fault_mutexes; i++)
+ mutex_init(&htlb_fault_mutex_table[i]);
+
return 0;
}
module_init(hugetlb_init);
@@ -458,19 +472,14 @@ static int hugetlb_cow(struct mm_struct
}
int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep, int write_access)
+ struct address_space *mapping, unsigned long idx,
+ unsigned long address, pte_t *ptep, int write_access)
{
int ret = VM_FAULT_SIGBUS;
- unsigned long idx;
unsigned long size;
struct page *page;
- struct address_space *mapping;
pte_t new_pte;
- mapping = vma->vm_file->f_mapping;
- idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
- + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-
/*
* Use page lock to guard against racing truncation
* before we get page_table_lock.
@@ -545,28 +554,50 @@ out:
return ret;
}
+static int fault_mutex_hash(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct address_space *mapping,
+ unsigned long pagenum, unsigned long address)
+{
+ void *p = mapping;
+
+	if (!(vma->vm_flags & VM_SHARED)) {
+		p = mm;
+		pagenum = address >> HPAGE_SHIFT;
+ }
+
+ return ((unsigned long)p + pagenum) % num_fault_mutexes;
+}
+
int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access)
{
pte_t *ptep;
pte_t entry;
int ret;
- static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+ struct address_space *mapping;
+ unsigned long idx;
+ int hash;
ptep = huge_pte_alloc(mm, address);
if (!ptep)
return VM_FAULT_OOM;
+ mapping = vma->vm_file->f_mapping;
+ idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+ + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
- mutex_lock(&hugetlb_instantiation_mutex);
+ hash = fault_mutex_hash(mm, vma, mapping, idx, address);
+ mutex_lock(&htlb_fault_mutex_table[hash]);
entry = *ptep;
if (pte_none(entry)) {
- ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
- mutex_unlock(&hugetlb_instantiation_mutex);
+ ret = hugetlb_no_page(mm, vma, mapping, idx,
+ address, ptep, write_access);
+ mutex_unlock(&htlb_fault_mutex_table[hash]);
return ret;
}
@@ -578,7 +609,7 @@ int hugetlb_fault(struct mm_struct *mm,
if (write_access && !pte_write(entry))
ret = hugetlb_cow(mm, vma, address, ptep, entry);
spin_unlock(&mm->page_table_lock);
- mutex_unlock(&hugetlb_instantiation_mutex);
+ mutex_unlock(&htlb_fault_mutex_table[hash]);
return ret;
}
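For clarity, the hash selection above can be condensed into a standalone userspace sketch (constants and pointer types here are illustrative; the kernel sizes the table from num_possible_cpus() and uses real mm/address_space pointers). MAP_SHARED faults hash on (mapping, file page index), so every mapping of the same file page contends on one mutex; MAP_PRIVATE faults hash on (mm, virtual hugepage number):

```c
#include <stdint.h>

#define NUM_FAULT_MUTEXES 16  /* the kernel derives this from num_possible_cpus() */
#define HPAGE_SHIFT 21        /* 2 MB hugepages, for illustration */

/* Simplified analogue of fault_mutex_hash(): pick which mutex in the
 * table serializes this fault. */
unsigned fault_mutex_hash(const void *mm, int shared,
			  const void *mapping, unsigned long pagenum,
			  unsigned long address)
{
	const void *p = mapping;

	if (!shared) {
		p = mm;
		pagenum = address >> HPAGE_SHIFT; /* virtual hugepage number */
	}
	return (unsigned)(((uintptr_t)p + pagenum) % NUM_FAULT_MUTEXES);
}
```

Two tasks faulting the same page of the same shared file land on the same mutex regardless of which mm they run in, while faults on unrelated pages usually spread across the table and can be cleared in parallel.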
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson