Message-Id: <1488863010-13028-2-git-send-email-dave@stgolabs.net>
Date:   Mon,  6 Mar 2017 21:03:26 -0800
From:   Davidlohr Bueso <dave@...olabs.net>
To:     mingo@...nel.org, peterz@...radead.org, akpm@...ux-foundation.org
Cc:     jack@...e.cz, kirill.shutemov@...ux.intel.com, mhocko@...e.com,
        mgorman@...hsingularity.net, dave@...olabs.net,
        linux-kernel@...r.kernel.org, Davidlohr Bueso <dbueso@...e.de>
Subject: [PATCH 1/5] locking: Introduce range reader/writer lock

This implements a sleepable range rwlock, based on an interval tree, that
serializes conflicting/intersecting/overlapping ranges within the tree. The
largest possible range is [0, ~0 - 1] (inclusive). Unlike traditional locks,
range locking involves dealing with both the tree itself and the range to be
locked; the latter is normally stack allocated and must always be explicitly
prepared/initialized by the user as a sorted [a0, a1] interval (a0 <= a1)
before actually taking the lock.
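
As a rough illustration, here is a minimal usage sketch based on the
interface this patch adds; the tree, function and variable names are made up
for the example:

  #include <linux/range_rwlock.h>

  static DEFINE_RANGE_RWLOCK_TREE(mytree);  /* one tree per protected object */

  static void frob_range(unsigned long first, unsigned long last)
  {
          struct range_rwlock rlock;  /* normally stack allocated */

          /* ranges must be sorted: first <= last */
          range_rwlock_init(&rlock, first, last);

          /* blocks until no conflicting ranges remain in the tree */
          range_write_lock(&mytree, &rlock);
          /* ... exclusive access to [first, last] ... */
          range_write_unlock(&mytree, &rlock);
  }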

We allow exclusive locking of arbitrary ranges. We guarantee that each
range is locked only after all conflicting range locks requested previously
have been unlocked. Thus we achieve fairness and avoid livelocks.

When a new range lock is requested, we add its interval to the tree and store
the number of intervals intersecting it in 'blocking_ranges'. When a range is
unlocked, we again walk the intervals that intersect the unlocked one and
decrement their 'blocking_ranges'. We wake up the owner of any range lock
whose 'blocking_ranges' drops to 0.

For the shared case, 'blocking_ranges' is only incremented if the
intersecting range is not marked as a reader. In order to mitigate some of
the tree-walk overhead for non-intersecting ranges, the tree's min/max values
are maintained and consulted in O(1) in the fastpath.
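
A condensed sketch of this bookkeeping on the lock side, using a hypothetical
helper name, is shown here; the actual implementation lives in
kernel/locking/range_rwlock.c further down in this patch:

  /* count the already-queued ranges that conflict with 'lock' */
  static void count_conflicts(struct range_rwlock_tree *tree,
                              struct range_rwlock *lock, bool for_read)
  {
          struct interval_tree_node *node;

          for (node = interval_tree_iter_first(&tree->root, lock->node.start,
                                               lock->node.last);
               node;
               node = interval_tree_iter_next(node, lock->node.start,
                                              lock->node.last)) {
                  struct range_rwlock *other;

                  other = container_of(node, struct range_rwlock, node);
                  /* readers only wait for writers; writers wait for all */
                  if (!for_read || !other->reader)
                          lock->blocking_ranges++;
          }
          /* the owner then sleeps until blocking_ranges drops to 0 */
  }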

How much does it cost:
----------------------

The cost of locking and unlocking a range is O(log(R_all) + R_int), where
R_all is the total number of ranges in the tree and R_int is the number of
ranges intersecting the new range to be added.

Because of their shared nature, full range locks are best compared with
rw-semaphores; the comparison also serves from a mutex standpoint, since
writer-only usage of both is pretty similar nowadays.

The first difference is the memory footprint: an rwsem is larger than the
tree lock (40 vs 24 bytes), but the latter additionally requires 64 bytes of
stack for the range structure.

Secondly, because every range operation is serialized by tree->lock, any
lock() fastpath will at least incur an interval_tree_insert() plus a spinlock
lock/unlock, compared to a single atomic instruction in the case of rwsems.
The same obviously applies to the unlock() path.

The torture module was used to measure 1:1 differences in lock acquisition
with increasing core counts over a period of 10 minutes. Readers and writers
are interleaved, with a slight advantage to writers, as theirs is the first
kthread created. The following shows the average ops/minute with various
thread setups on boxes with small and large core counts.

** 4-core AMD Opteron **
(write-only)
rwsem-2thr: 4198.5, stddev: 7.77
range-2thr: 4199.1, stddev: 0.73

rwsem-4thr: 6036.8, stddev: 50.91
range-4thr: 6004.9, stddev: 126.57

rwsem-8thr: 6245.6, stddev: 59.39
range-8thr: 6229.3, stddev: 10.60

(read-only)
rwsem-2thr: 5930.7, stddev: 21.92
range-2thr: 5917.3, stddev: 25.45

rwsem-4thr: 9881.6, stddev: 0.70
range-4thr: 9540.2, stddev: 98.28

rwsem-8thr: 11633.2, stddev: 7.72
range-8thr: 11314.7, stddev: 62.22

For both the write-only and read-only cases there is very little difference
between the range lock and rwsems, with at most a 3% hit, which could very
well be considered within the noise range.

(read-write)
rwsem-write-1thr: 1744.8, stddev: 11.59
rwsem-read-1thr:  1043.1, stddev: 3.97
range-write-1thr: 1740.2, stddev: 5.99
range-read-1thr:  1022.5, stddev: 6.41

rwsem-write-2thr: 1662.5, stddev: 0.70
rwsem-read-2thr:  1278.0, stddev: 25.45
range-write-2thr: 1321.5, stddev: 51.61
range-read-2thr:  1243.5, stddev: 30.40

rwsem-write-4thr: 1761.0, stddev: 11.31
rwsem-read-4thr:  1426.0, stddev: 7.07
range-write-4thr: 1417.0, stddev: 29.69
range-read-4thr:  1398.0, stddev: 56.56

While a single reader thread and a single writer thread do not show much
difference, increasing core counts shows that in reader/writer workloads
writer threads can take a hit in raw performance of up to ~20%, while reader
throughput is quite similar for both locks.

** 240-core (ht) IvyBridge **
(write-only)
rwsem-120thr: 6844.5, stddev: 82.73
range-120thr: 6070.5, stddev: 85.55

rwsem-240thr: 6292.5, stddev: 146.3
range-240thr: 6099.0, stddev: 15.55

rwsem-480thr: 6164.8, stddev: 33.94
range-480thr: 6062.3, stddev: 19.79

(read-only)
rwsem-120thr: 136860.4, stddev: 2539.92
range-120thr: 138052.2, stddev: 327.39

rwsem-240thr: 235297.5, stddev: 2220.50
range-240thr: 232099.1, stddev: 3614.72

rwsem-480thr: 272683.0, stddev: 3924.32
range-480thr: 256539.2, stddev: 9541.69

Similar to the small box, the larger machine shows that range locks take only
a minor hit (up to ~6% for 480 threads) even in completely exclusive or
completely shared scenarios.

(read-write)
rwsem-write-60thr: 4658.1, stddev: 1303.19
rwsem-read-60thr:  1108.7, stddev: 718.42
range-write-60thr: 3203.6, stddev: 139.30
range-read-60thr:  1852.8, stddev: 147.5

rwsem-write-120thr: 3971.3, stddev: 1413.0
rwsem-read-120thr:  1038.8, stddev: 353.51
range-write-120thr: 2282.1, stddev: 207.18
range-read-120thr:  1856.5, stddev: 198.69

rwsem-write-240thr: 4112.7, stddev: 2448.1
rwsem-read-240thr:  1277.4, stddev: 430.30
range-write-240thr: 2353.1, stddev: 502.04
range-read-240thr:  1551.5, stddev: 361.33

When mixing readers and writers, writer throughput can take a hit of up to
~40%, similar to the 4-core machine; however, reader threads can increase
their number of acquisitions by up to ~80%. In any case, the overall
writer+reader throughput will always be higher for rwsems. A huge factor in
this behavior is that range locks do not have the writer spin-on-owner
feature.

On both machines, when actually testing threads acquiring different ranges,
the range lock will always outperform the rwsem in throughput, due to the
increased parallelism, which is no surprise either. As such, microbenchmarks
that merely pound on a lock will pretty much always suffer upon direct lock
conversions, but not enough to matter in the overall picture.

Signed-off-by: Davidlohr Bueso <dbueso@...e.de>
---
 drivers/gpu/drm/Kconfig       |   2 -
 drivers/gpu/drm/i915/Kconfig  |   1 -
 include/linux/range_rwlock.h  |  96 +++++++++
 kernel/locking/Makefile       |   2 +-
 kernel/locking/range_rwlock.c | 462 ++++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig                   |  14 --
 lib/Kconfig.debug             |   1 -
 lib/Makefile                  |   2 +-
 8 files changed, 560 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/range_rwlock.h
 create mode 100644 kernel/locking/range_rwlock.c

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 88e01e08e279..e4d9eadd2c47 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -154,7 +154,6 @@ config DRM_RADEON
 	select HWMON
 	select BACKLIGHT_CLASS_DEVICE
 	select BACKLIGHT_LCD_SUPPORT
-	select INTERVAL_TREE
 	help
 	  Choose this option if you have an ATI Radeon graphics card.  There
 	  are both PCI and AGP versions.  You don't need to choose this to
@@ -174,7 +173,6 @@ config DRM_AMDGPU
 	select HWMON
 	select BACKLIGHT_CLASS_DEVICE
 	select BACKLIGHT_LCD_SUPPORT
-	select INTERVAL_TREE
 	help
 	  Choose this option if you have a recent AMD Radeon graphics card.
 
diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 183f5dc1c3f2..8a9154550f46 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -3,7 +3,6 @@ config DRM_I915
 	depends on DRM
 	depends on X86 && PCI
 	select INTEL_GTT
-	select INTERVAL_TREE
 	# we need shmfs for the swappable backing store, and in particular
 	# the shmem_readpage() which depends upon tmpfs
 	select SHMEM
diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
new file mode 100644
index 000000000000..174e13668f67
--- /dev/null
+++ b/include/linux/range_rwlock.h
@@ -0,0 +1,96 @@
+/*
+ * Range locking
+ *
+ * We allow exclusive locking of arbitrary ranges. We guarantee that each
+ * range is locked only after all conflicting range locks requested previously
+ * have been unlocked. Thus we achieve fairness and avoid livelocks.
+ *
+ * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
+ * total number of ranges and R_int is the number of ranges intersecting the
+ * operated range.
+ */
+#ifndef _LINUX_RANGE_RWLOCK_H
+#define _LINUX_RANGE_RWLOCK_H
+
+#include <linux/rbtree.h>
+#include <linux/interval_tree.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+/*
+ * The largest range will span [0,RANGE_RWLOCK_INFINITY].
+ */
+#define RANGE_RWLOCK_INFINITY  (~0UL - 1)
+
+struct range_rwlock {
+	struct interval_tree_node node;
+	struct task_struct *task;
+	/* Number of ranges which are blocking acquisition of the lock */
+	unsigned int blocking_ranges;
+	bool reader;
+};
+
+struct range_rwlock_tree {
+	struct rb_root root;
+	spinlock_t lock;
+	struct interval_tree_node *leftmost; /* compute smallest 'start' */
+};
+
+#define __RANGE_RWLOCK_TREE_INITIALIZER(name)		\
+	{ .leftmost = NULL                              \
+	, .root = RB_ROOT				\
+	, .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
+
+#define DEFINE_RANGE_RWLOCK_TREE(name) \
+       struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
+
+#define __RANGE_RWLOCK_INITIALIZER(start, last) {	\
+		.node = {			\
+			.start = (start)	\
+			,.last = (last)		\
+		}				\
+	}
+
+#define DEFINE_RANGE_RWLOCK(name, start, last)				\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
+
+#define DEFINE_RANGE_RWLOCK_INF(name)		\
+	struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
+
+static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
+{
+	tree->root = RB_ROOT;
+	spin_lock_init(&tree->lock);
+	tree->leftmost = NULL;
+}
+
+void range_rwlock_init(struct range_rwlock *lock,
+		       unsigned long start, unsigned long last);
+void range_rwlock_init_inf(struct range_rwlock *lock);
+
+/*
+ * lock for reading
+ */
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock);
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock);
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+/*
+ * lock for writing
+ */
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock);
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock);
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
+
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock);
+
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 760158d9d98d..bf054d0af82b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -2,7 +2,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT		:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
new file mode 100644
index 000000000000..4d92baa9bd80
--- /dev/null
+++ b/kernel/locking/range_rwlock.c
@@ -0,0 +1,462 @@
+/*
+ * Implementation of read/write range locks.
+ *
+ * We keep interval tree of locked and to-be-locked ranges. When new range lock
+ * is requested, we add its interval to the tree and store number of intervals
+ * intersecting it to 'blocking_ranges'.
+ *
+ * When a range is unlocked, we again walk intervals that intersect with the
+ * unlocked one and decrement their 'blocking_ranges'. We wake up owner of any
+ * range lock whose 'blocking_ranges' drops to 0. For the shared case, the
+ * 'blocking_ranges' is only incremented if the intersecting range is not marked
+ * as a reader. In order to mitigate some of the tree walk overhead for
+ * non-intersecting ranges, the tree's min/max values are maintained and consulted
+ * in O(1) in the fastpath.
+ */
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/range_rwlock.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/wake_q.h>
+#include <linux/sched.h>
+#include <linux/export.h>
+
+
+#define range_entry(ptr, type, member) container_of(ptr, type, member)
+
+/*
+ * Range intersection/overlap between A: [a0, a1] and B: [b0, b1]
+ * is given by:
+ *
+ *        a0 <= b1 && b0 <= a1
+ *
+ * ... where A holds the lock range and B holds the smallest 'start' and
+ * largest 'last' in the tree. For the latter, we rely on the root node,
+ * which by augmented interval tree property, holds the largest value in
+ * its last-in-subtree.
+ */
+static inline bool
+__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *root;
+
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
+		return false;
+
+	root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
+
+	return lock->node.start <= root->__subtree_last &&
+		tree->leftmost->start <= lock->node.last;
+}
+
+static inline void
+__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (unlikely(RB_EMPTY_ROOT(&tree->root)))
+		tree->leftmost = &lock->node;
+	else if (lock->node.start < tree->leftmost->start)
+		tree->leftmost = &lock->node;
+
+	interval_tree_insert(&lock->node, &tree->root);
+}
+
+static inline void
+__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	if (tree->leftmost == &lock->node) {
+		struct rb_node *next = rb_next(&tree->leftmost->rb);
+		tree->leftmost = range_entry(next, struct interval_tree_node, rb);
+	}
+
+	interval_tree_remove(&lock->node, &tree->root);
+}
+
+static inline void
+__range_rwlock_init(struct range_rwlock *lock, unsigned long start,
+		  unsigned long last)
+{
+	WARN_ON(start > last);
+
+	lock->node.start = start;
+	lock->node.last = last;
+	RB_CLEAR_NODE(&lock->node.rb);
+	lock->blocking_ranges = 0;
+	lock->reader = false;
+}
+
+/**
+ * range_rwlock_init - initialize the range lock
+ * @lock: the range lock to be initialized
+ * @start: start of the interval (inclusive)
+ * @last: last location in the interval (inclusive)
+ *
+ * Initialize the range [start, last] such that it can
+ * later be locked. The user is expected to pass a sorted
+ * range, such that @start <= @last.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
+		     unsigned long last)
+{
+	__range_rwlock_init(lock, start, last);
+}
+EXPORT_SYMBOL(range_rwlock_init);
+
+/**
+ * range_rwlock_init_inf - initialize the range lock to infinity
+ * @lock: the range lock to be initialized
+ *
+ * Initialize the full range.
+ *
+ * It is not allowed to initialize an already locked range.
+ */
+void range_rwlock_init_inf(struct range_rwlock *lock)
+{
+	__range_rwlock_init(lock, 0, RANGE_RWLOCK_INFINITY);
+}
+EXPORT_SYMBOL(range_rwlock_init_inf);
+
+static inline void
+range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
+{
+	if (!--lock->blocking_ranges)
+		wake_q_add(wake_q, lock->task);
+}
+
+static inline int wait_for_ranges(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock, long state)
+{
+	int ret = 0;
+
+	while (true) {
+		set_current_state(state);
+
+		if (unlikely(signal_pending_state(state, current))) {
+			unsigned long flags;
+
+			ret = -EINTR;
+			/*
+			 * We're not taking the lock after all, cleanup
+			 * after ourselves.
+			 */
+			spin_lock_irqsave(&tree->lock, flags);
+			lock->reader = false;
+			__range_tree_remove(tree, lock);
+			spin_unlock_irqrestore(&tree->lock, flags);
+			break;
+		}
+
+		/* do we need to go to sleep? */
+		if (!lock->blocking_ranges)
+			break;
+
+		schedule();
+	}
+
+	__set_current_state(TASK_RUNNING);
+	return ret;
+}
+
+static __always_inline int
+__range_read_lock_common(struct range_rwlock_tree *tree,
+			 struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	lock->reader = true;
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					 lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!blocked_lock->reader)
+			lock->blocking_ranges++;
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->task = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+
+}
+EXPORT_SYMBOL(range_read_lock);
+
+int range_read_lock_interruptible(struct range_rwlock_tree *tree,
+				  struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+
+}
+EXPORT_SYMBOL(range_read_lock_interruptible);
+
+int range_read_lock_killable(struct range_rwlock_tree *tree,
+			     struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_read_lock_common(tree, lock, TASK_KILLABLE);
+
+}
+EXPORT_SYMBOL(range_read_lock_killable);
+
+/**
+ * range_read_trylock - Trylock for reading
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int ret = true;
+	unsigned long flags;
+	struct interval_tree_node *node;
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	/*
+	 * We have overlapping ranges in the tree, ensure that we can
+	 * in fact share the lock.
+	 */
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!blocked_lock->reader) {
+			ret = false;
+			goto unlock;
+		}
+
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+insert:
+	lock->reader = true;
+	__range_tree_insert(tree, lock);
+unlock:
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL(range_read_trylock);
+
+void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	lock->reader = false;
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (!blocked_lock->reader)
+			range_rwlock_unblock(blocked_lock, &wake_q);
+
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL(range_read_unlock);
+
+static inline int __sched
+__range_write_lock_common(struct range_rwlock_tree *tree,
+			  struct range_rwlock *lock, long state)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	lock->reader = false;
+
+	if (!__range_intersects_intree(tree, lock))
+		goto insert;
+
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		/*
+		 * As a writer, we always consider an existing node. We
+		 * need to block; either the intersecting node is another
+		 * writer or we have a reader that needs to finish.
+		 */
+		lock->blocking_ranges++;
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+insert:
+	__range_tree_insert(tree, lock);
+
+	lock->task = current;
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return wait_for_ranges(tree, lock, state);
+}
+
+void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	might_sleep();
+	__range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
+}
+EXPORT_SYMBOL(range_write_lock);
+
+int range_write_lock_interruptible(struct range_rwlock_tree *tree,
+				   struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL(range_write_lock_interruptible);
+
+int range_write_lock_killable(struct range_rwlock_tree *tree,
+			      struct range_rwlock *lock)
+{
+	might_sleep();
+	return __range_write_lock_common(tree, lock, TASK_KILLABLE);
+}
+EXPORT_SYMBOL(range_write_lock_killable);
+
+/**
+ * range_write_trylock - Trylock for writing
+ * @tree: interval tree
+ * @lock: the range lock to be trylocked
+ *
+ * The trylock is against the range itself, not the @tree->lock.
+ *
+ * Returns 1 if successful, 0 if contention (must block to acquire).
+ */
+int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	int intersects;
+	unsigned long flags;
+
+	spin_lock_irqsave(&tree->lock, flags);
+	intersects = __range_intersects_intree(tree, lock);
+
+	if (!intersects) {
+		lock->reader = false;
+		__range_tree_insert(tree, lock);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+
+	return !intersects;
+}
+EXPORT_SYMBOL(range_write_trylock);
+
+void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
+{
+	struct interval_tree_node *node;
+	unsigned long flags;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	lock->reader = false;
+	__range_tree_remove(tree, lock);
+
+	if (!__range_intersects_intree(tree, lock)) {
+		/* nobody to wakeup, we're done */
+		spin_unlock_irqrestore(&tree->lock, flags);
+		return;
+	}
+
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+				     lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		range_rwlock_unblock(blocked_lock, &wake_q);
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL(range_write_unlock);
+
+/**
+ * range_downgrade_write - Downgrade write range lock to read lock
+ * @tree: interval tree
+ * @lock: the range lock to be downgraded
+ *
+ * Wakes any blocked readers, when @lock is the only conflicting range.
+ *
+ * It is not allowed to downgrade an unacquired write lock.
+ */
+void range_downgrade_write(struct range_rwlock_tree *tree,
+			   struct range_rwlock *lock)
+{
+	unsigned long flags;
+	struct interval_tree_node *node;
+	DEFINE_WAKE_Q(wake_q);
+
+	spin_lock_irqsave(&tree->lock, flags);
+
+	WARN_ON(lock->reader);
+
+	/*
+	 * Unaccount for any blocked reader lock. Wakeup if possible.
+	 */
+	node = interval_tree_iter_first(&tree->root, lock->node.start,
+					lock->node.last);
+	while (node) {
+		struct range_rwlock *blocked_lock;
+		blocked_lock = range_entry(node, struct range_rwlock, node);
+
+		if (blocked_lock->reader)
+			range_rwlock_unblock(blocked_lock, &wake_q);
+
+		node = interval_tree_iter_next(node, lock->node.start,
+					       lock->node.last);
+	}
+
+	lock->reader = true;
+	spin_unlock_irqrestore(&tree->lock, flags);
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL(range_downgrade_write);
diff --git a/lib/Kconfig b/lib/Kconfig
index 0c8b78a9ae2e..52ca6668a16d 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
 config BTREE
 	bool
 
-config INTERVAL_TREE
-	bool
-	help
-	  Simple, embeddable, interval-tree. Can find the start of an
-	  overlapping range in log(n) time and then iterate over all
-	  overlapping nodes. The algorithm is implemented as an
-	  augmented rbtree.
-
-	  See:
-
-		Documentation/rbtree.txt
-
-	  for more information.
-
 config RADIX_TREE_MULTIORDER
 	bool
 
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 97d62c2da6c2..a8a64eb54d3a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1771,7 +1771,6 @@ config RBTREE_TEST
 config INTERVAL_TREE_TEST
 	tristate "Interval tree test"
 	depends on m && DEBUG_KERNEL
-	select INTERVAL_TREE
 	help
 	  A benchmark measuring the performance of the interval tree library
 
diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a8725..d4ff3057450a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -80,7 +80,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
 obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
 
 obj-$(CONFIG_BTREE) += btree.o
-obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
+obj-y += interval_tree.o
 obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
 obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
 obj-$(CONFIG_DEBUG_LIST) += list_debug.o
-- 
2.6.6
