linux-kernel - [PATCH] futex: Reduce the scope of lock_page, aka lockless futex_get

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1452025435-15355-1-git-send-email-dave@stgolabs.net>
Date:	Tue,  5 Jan 2016 12:23:55 -0800
From:	Davidlohr Bueso <dave@...olabs.net>
To:	Mel Gorman <mgorman@...hsingularity.net>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>
Cc:	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Chris Mason <clm@...com>, Darren Hart <dvhart@...ux.intel.com>,
	dave@...olabs.net, linux-kernel@...r.kernel.org,
	Davidlohr Bueso <dbueso@...e.de>
Subject: [PATCH] futex: Reduce the scope of lock_page, aka lockless futex_get_key()

When dealing with key handling for shared futexes, we can drastically reduce
the usage/need of the page lock. 1) For anonymous pages, the associated futex
object is the mm_struct which does not require the page lock. 2) For inode
based, keys, we can check under RCU read lock if the page mapping is still
valid and take reference to the inode. This just leaves one rare race that
requires the page lock in the slow path when examining the swapcache.

Additionally realtime users currently have a problem with the page lock being
contended for unbounded periods of time during futex operations.

Task A
     get_futex_key()
     lock_page()
    ---> preempted

Now any other task trying to lock that page will have to wait until
task A gets scheduled back in, which is an unbound time.

With this patch, we pretty much have a lockless futex_get_key().

Signed-off-by: Davidlohr Bueso <dbueso@...e.de>
---

Hi,

This change by Mel was originally sent sometime ago, and I wish to revisit it.

https://lkml.org/lkml/2013/10/29/455

I ran into this patch while looking into a reported issue with the ihold < 2
warning reported a while back for futexes. It seems that there were no strong
objections to this optimization. Experiments show that this patch can boost
the hashing of shared futexes with the perf futex benchmarks (which is good
for measuring such changes related to in key handling) by up to 45% when
there are high (> 100) thread counts on a 60 core Westmere. Lower counts are
pretty much in the noise range or less than 10%, but mid range can be seen at
over 30% overall throughput (hash ops/sec). This makes anon-mem shared futexes
much closer to its private counterpart. Note that memory backed futexes are not
as tested as I would like.

Additionally I've done some small changes to Mel's original patch:
- minor changelog re-wording.
- replace the inode mapping corruption/reusage WARN_ON with BUG_ON.
- ensure that we continue to have a MB (B) for the case of inode based keys.

Based on top of today's linux-next, which has Kirill's THP refcounting rework
changes/simplifications to futexes incorporated. Nothing has crashed so far, but
again, needs more testing, specially with file mapped futexes.

Thanks,
Davidlohr

 kernel/futex.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 81 insertions(+), 8 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 0f01666..91925ba 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -520,7 +520,18 @@ again:
 	else
 		err = 0;
 
-	lock_page(page);
+	/*
+	 * The treatment of mapping from this point on is critical. The page
+	 * lock protects many things but in this context the page lock
+	 * stabilises mapping, prevents inode freeing in the shared
+	 * file-backed region case and guards against movement to swap cache.
+	 * Strictly speaking the page lock is not needed in all cases being
+	 * considered here and page lock forces unnecessarily serialisation.
+	 * From this point on, mapping will be reverified if necessary and
+	 * page lock will be acquired only if it is unavoiable.
+	 */
+	mapping = READ_ONCE(compound_head(page)->mapping);
+
 	/*
 	 * If page->mapping is NULL, then it cannot be a PageAnon
 	 * page; but it might be the ZERO_PAGE or in the gate area or
@@ -536,19 +547,34 @@ again:
 	 * shmem_writepage move it from filecache to swapcache beneath us:
 	 * an unlikely race, but we do need to retry for page->mapping.
 	 */
-	mapping = compound_head(page)->mapping;
-	if (!mapping) {
-		int shmem_swizzled = PageSwapCache(page);
+	if (unlikely(!mapping)) {
+		int shmem_swizzled;
+
+		/*
+		 * Page lock is required to identify which special case above
+		 * applies. If this is really a shmem page then the page lock
+		 * will prevent unexpected transitions.
+		 */
+		lock_page(page);
+		shmem_swizzled = PageSwapCache(page);
 		unlock_page(page);
 		put_page(page);
+		WARN_ON_ONCE(mapping);
+
 		if (shmem_swizzled)
 			goto again;
+
 		return -EFAULT;
 	}
 
 	/*
 	 * Private mappings are handled in a simple way.
 	 *
+	 * If the futex key is stored on an anonymous page then the associated
+	 * object is the mm which is implicitly pinned by the calling process.
+	 * Page lock is unnecessary to stabilise page->mapping in this case and
+	 * is not taken.
+	 *
 	 * NOTE: When userspace waits on a MAP_SHARED mapping, even if
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
@@ -566,16 +592,63 @@ again:
 		key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
+
+		get_futex_key_refs(key); /* implies MB (B) */
+
 	} else {
+		struct inode *inode;
+
+		/*
+		 * The associtated futex object in this case is the inode and
+		 * the page->mapping must be traversed. Ordinarily this should
+		 * be stabilised under page lock but it's not strictly
+		 * necessary in this case as we just want to pin the inode, not
+		 * update radix tree or anything like that.
+		 *
+		 * The RCU read lock is taken as the inode is finally freed
+		 * under RCU. If the mapping still matches expectations then the
+		 * mapping->host can be safely accessed as being a valid inode.
+		 */
+		rcu_read_lock();
+		if (READ_ONCE(page->mapping) != mapping || !mapping->host) {
+			rcu_read_unlock();
+			put_page(page);
+			goto again;
+		}
+		inode = mapping->host;
+
+		/*
+		 * Take a reference unless it is about to be freed. Previously
+		 * this reference was taken by ihold under the page lock
+		 * pinning the inode in place so i_lock was unnecessary. The
+		 * only way for this check to fail is if the inode was
+		 * truncated in parallel so warn for now if this happens.
+		 *
+		 * TODO: VFS and/or filesystem people should review this check
+		 * and see if there is a safer or more reliable way to do this.
+		 */
+		if (WARN_ON(!atomic_inc_not_zero(&inode->i_count))) {
+			rcu_read_unlock();
+			put_page(page);
+			goto again;
+		}
+
+		/*
+		 * get_futex_key() must imply MB (B) and we are not going to
+		 * call into get_futex_key_refs() at this point.
+		 */
+		smp_mb__after_atomic();
+
+		/* Should be impossible but lets be paranoid for now */
+		BUG_ON(inode->i_mapping != mapping);
+
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = mapping->host;
+		key->shared.inode = inode;
 		key->shared.pgoff = basepage_index(page);
+		rcu_read_unlock();
 	}
 
-	get_futex_key_refs(key); /* implies MB (B) */
-
 out:
-	unlock_page(page);
 	put_page(page);
 	return err;
 }
--
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/