lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130820111602.3cea203c@gandalf.local.home>
Date:	Tue, 20 Aug 2013 11:16:02 -0400
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Kent Overstreet <kmo@...erainc.com>,
	LKML <linux-kernel@...r.kernel.org>,
	linux-bcache@...r.kernel.org, dm-devel@...hat.com
Cc:	Christoph Hellwig <hch@...radead.org>,
	David Howells <dhowells@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: [PATCH] bcache: Remove use of down/up_read_non_owner()


The down/up_read_non_owner() is a nasty hack in the API of the rwsem
operations. It was once removed, but then resurrected for use with
bcache. Not only is the API an abomination to the rwsem API, it also
prevents bcache from ever being compiled with PREEMPT_RT, as the RT
kernel requires all rwsems to have an owner.

Instead of keeping the down/up_read_non_owner() around, it is better to
modify the one user of it and have it do something a bit differently.

>From reading the bcache code, the reason for the non_owner usage is
that a request is made, and writers must wait for that request to
finish before they can continue. But because the request is completed
by another task, the rwsem can not be held for read and then released
on completion.

Instead of abusing the rwsem code for this, I added a refcount
"nr_requests" to the cached_dev structure as well as a "write_waiters"
wait queue. When a request is to be made, the rwsem is still taken for
read, but this time with an owner. The refcount is incremented and
before exiting the function, the rwsem is released.

The writer will then take the rwsem for write, check the refcount, if
it is not zero, it will release the rwsem, add itself to a wait_event()
waiting for refcount to become zero, and then try again.

This should be a suitable replacement for the abuse of the rwsem
non_owner API.

Signed-off-by: Steven Rostedt <rostedt@...dmis.org>


Index: linux-trace.git/drivers/md/bcache/bcache.h
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/bcache.h
+++ linux-trace.git/drivers/md/bcache/bcache.h
@@ -486,6 +486,15 @@ struct cached_dev {
 	atomic_t		running;
 
 	/*
+	 * Requests in progress.
+	 */
+	atomic_t		nr_requests;
+	/*
+	 * Writers waiting for requests to finish.
+	 */
+	wait_queue_head_t	write_waiters;
+
+	/*
 	 * Writes take a shared lock from start to finish; scanning for dirty
 	 * data to refill the rb tree requires an exclusive lock.
 	 */
Index: linux-trace.git/drivers/md/bcache/request.c
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/request.c
+++ linux-trace.git/drivers/md/bcache/request.c
@@ -998,7 +998,9 @@ static void cached_dev_write_complete(st
 	struct search *s = container_of(cl, struct search, cl);
 	struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
 
-	up_read_non_owner(&dc->writeback_lock);
+	if (atomic_dec_return(&dc->nr_requests) == 0)
+		wake_up(&dc->write_waiters);
+	
 	cached_dev_bio_complete(cl);
 }
 
@@ -1024,7 +1026,8 @@ static void request_write(struct cached_
 	bch_keybuf_check_overlapping(&s->op.c->moving_gc_keys, &start, &end);
 
 	check_should_skip(dc, s);
-	down_read_non_owner(&dc->writeback_lock);
+	down_read(&dc->writeback_lock);
+	atomic_inc(&dc->nr_requests);
 
 	if (bch_keybuf_check_overlapping(&dc->writeback_keys, &start, &end)) {
 		s->op.skip	= false;
@@ -1053,6 +1056,7 @@ static void request_write(struct cached_
 	}
 out:
 	closure_call(&s->op.cl, bch_insert_data, NULL, cl);
+	up_read(&dc->writeback_lock);
 	continue_at(cl, cached_dev_write_complete, NULL);
 skip:
 	s->op.skip = true;
Index: linux-trace.git/drivers/md/bcache/super.c
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/super.c
+++ linux-trace.git/drivers/md/bcache/super.c
@@ -1081,6 +1081,8 @@ static void register_bdev(struct cache_s
 	dc->sb_bio.bi_io_vec[0].bv_page = sb_page;
 	get_page(sb_page);
 
+	init_waitqueue_head(&dc->write_waiters);
+
 	if (cached_dev_init(dc, sb->block_size << 9))
 		goto err;
 
Index: linux-trace.git/drivers/md/bcache/writeback.c
===================================================================
--- linux-trace.git.orig/drivers/md/bcache/writeback.c
+++ linux-trace.git/drivers/md/bcache/writeback.c
@@ -121,6 +121,20 @@ static void dirty_init(struct keybuf_key
 	bch_bio_map(bio, NULL);
 }
 
+/*
+ * If requests are in transit, we must wait for them to finish.
+ */
+static void down_writeback_lock(struct cached_dev *dc)
+{
+again:
+	down_write(&dc->writeback_lock);
+	if (atomic_read(&dc->nr_requests)) {
+		up_write(&dc->writeback_lock);
+		wait_event(dc->write_waiters, !atomic_read(&dc->nr_requests));
+		goto again;
+	}
+}
+
 static void refill_dirty(struct closure *cl)
 {
 	struct cached_dev *dc = container_of(cl, struct cached_dev,
@@ -134,7 +148,7 @@ static void refill_dirty(struct closure
 	    !dc->writeback_running)
 		closure_return(cl);
 
-	down_write(&dc->writeback_lock);
+	down_writeback_lock(dc);
 
 	if (!atomic_read(&dc->has_dirty)) {
 		SET_BDEV_STATE(&dc->sb, BDEV_STATE_CLEAN);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ