linux-kernel - Re: [RFC] block: fix blk_queue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACVXFVOtzcPpmg9Z_P6xN99dW=Sb_cCQNd-VcP0s5Nk0yHbwXQ@mail.gmail.com>
Date:	Fri, 24 Jun 2016 19:36:57 +0800
From:	Ming Lei <ming.lei@...onical.com>
To:	Lars Ellenberg <lars.ellenberg@...bit.com>
Cc:	linux-block@...r.kernel.org,
	Roland Kammerer <roland.kammerer@...bit.com>,
	Jens Axboe <axboe@...nel.dk>, NeilBrown <neilb@...e.com>,
	Kent Overstreet <kent.overstreet@...il.com>,
	Shaohua Li <shli@...nel.org>, Alasdair Kergon <agk@...hat.com>,
	Mike Snitzer <snitzer@...hat.com>,
	"open list:DEVICE-MAPPER (LVM)" <dm-devel@...hat.com>,
	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Takashi Iwai <tiwai@...e.de>, Jiri Kosina <jkosina@...e.cz>,
	Zheng Liu <gnehzuil.liu@...il.com>,
	Keith Busch <keith.busch@...el.com>,
	"Martin K. Petersen" <martin.petersen@...cle.com>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	"open list:BCACHE (BLOCK LAYER CACHE)" <linux-bcache@...r.kernel.org>,
	"open list:SOFTWARE RAID (Multiple Disks) SUPPORT" 
	<linux-raid@...r.kernel.org>
Subject: Re: [RFC] block: fix blk_queue_split() resource exhaustion

On Wed, Jun 22, 2016 at 4:22 PM, Lars Ellenberg
<lars.ellenberg@...bit.com> wrote:
> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
>
> This is convenient for stacking drivers,
> the top-most driver would take the originally submitted bio,
> and re-submit a re-mapped version of it, or one or more clones,
> or one or more new allocated bios to its backend(s). Which
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
>
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
>
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
>
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the fron-part.
>
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
>
> This can lead to resource starvation deadlock.

Could you explain it a bit what the resource is and how it is
exhausted?

> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
>
> Instead, I suggest to distinguish between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
>         struct recursion_to_iteration_bio_lists {
>                 struct bio_list recursion;
>                 struct bio_list remainder;
>         }
>
> To have all bios targeted to drivers lower in the stack processed before
> processing the next piece of a bio targeted at the higher levels,
> as long as queued bios resulting from recursion are available,
> they will continue to be processed in FIFO order.
> Pushed back bio-parts resulting from blk_queue_split() will be processed
> in LIFO order, one-by-one, whenever the recursion list becomes empty.
>
> Signed-off-by: Lars Ellenberg <lars.ellenberg@...bit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@...bit.com>
> ---
>
> This is not a theoretical problem.
> At least int DRBD, and an unfortunately high IO concurrency wrt. the
> "max-buffers" setting, without this patch we have a reproducible deadlock.

Is there any log about the deadlock? And is there any lockdep warning
if it is enabled?

Thanks,

>
> I'm unsure if it could cause problems in md raid in some corner cases
> or when a rebuild/scrub/... starts in a "bad" moment.
>
> It also may increase "stress" on any involved bio_set,
> because it may increase the number of bios that are allocated
> before the first of them is actually processed, causing more
> frequent calls of punt_bios_to_rescuer().
>
> Alternatively,
> a simple one-line change to generic_make_request() would also "fix" it,
> by processing all recursive bios in LIFO order.
> But that may change the order in which bios reach the "physical" layer,
> in case they are in fact split into many smaller pieces, for example in
> a striping setup, which may have other performance implications.
>
>  | diff --git a/block/blk-core.c b/block/blk-core.c
>  | index 2475b1c7..a5623f6 100644
>  | --- a/block/blk-core.c
>  | +++ b/block/blk-core.c
>  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>  |       * should be added at the tail
>  |       */
>  |      if (current->bio_list) {
>  | -            bio_list_add(current->bio_list, bio);
>  | +            bio_list_add_head(current->bio_list, bio);
>  |              goto out;
>  |      }
>
> Any fix would need to go to stable 4.3.x and up.
> This patch does apply as is to 4.5 and up,
> with a trivial fixup (or a cherry-pick of cda2264) to 4.4,
> and with some more (also trivial) fixup to 4.3 because of
> GFP_WAIT -> GFP_RECLAIM and comment rewrap.
>
> Any input?
> How should I proceed?
>
> ---
>  block/bio.c               | 14 +++++++-------
>  block/blk-core.c          | 32 +++++++++++++++++---------------
>  block/blk-merge.c         |  3 ++-
>  drivers/md/bcache/btree.c | 12 ++++++------
>  drivers/md/dm-bufio.c     |  2 +-
>  drivers/md/md.h           |  7 +++++++
>  drivers/md/raid1.c        |  5 ++---
>  drivers/md/raid10.c       |  5 ++---
>  include/linux/bio.h       | 11 +++++++++++
>  include/linux/sched.h     |  4 ++--
>  10 files changed, 57 insertions(+), 38 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 0e4aa42..2ffcea0 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -368,10 +368,10 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>         bio_list_init(&punt);
>         bio_list_init(&nopunt);
>
> -       while ((bio = bio_list_pop(current->bio_list)))
> +       while ((bio = bio_list_pop(&current->bio_lists->recursion)))
>                 bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
>
> -       *current->bio_list = nopunt;
> +       current->bio_lists->recursion = nopunt;
>
>         spin_lock(&bs->rescue_lock);
>         bio_list_merge(&bs->rescue_list, &punt);
> @@ -453,13 +453,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>                  *
>                  * We solve this, and guarantee forward progress, with a rescuer
>                  * workqueue per bio_set. If we go to allocate and there are
> -                * bios on current->bio_list, we first try the allocation
> -                * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> -                * bios we would be blocking to the rescuer workqueue before
> -                * we retry with the original gfp_flags.
> +                * bios on current->bio_lists->recursion, we first try the
> +                * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
> +                * punt those bios we would be blocking to the rescuer
> +                * workqueue before we retry with the original gfp_flags.
>                  */
>
> -               if (current->bio_list && !bio_list_empty(current->bio_list))
> +               if (current->bio_lists && !bio_list_empty(&current->bio_lists->recursion))
>                         gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>
>                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2475b1c7..f03ff4c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2031,7 +2031,7 @@ end_io:
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -       struct bio_list bio_list_on_stack;
> +       struct recursion_to_iteration_bio_lists bio_lists_on_stack;
>         blk_qc_t ret = BLK_QC_T_NONE;
>
>         if (!generic_make_request_checks(bio))
> @@ -2040,15 +2040,18 @@ blk_qc_t generic_make_request(struct bio *bio)
>         /*
>          * We only want one ->make_request_fn to be active at a time, else
>          * stack usage with stacked devices could be a problem.  So use
> -        * current->bio_list to keep a list of requests submited by a
> -        * make_request_fn function.  current->bio_list is also used as a
> +        * current->bio_lists to keep a list of requests submited by a
> +        * make_request_fn function.  current->bio_lists is also used as a
>          * flag to say if generic_make_request is currently active in this
>          * task or not.  If it is NULL, then no make_request is active.  If
>          * it is non-NULL, then a make_request is active, and new requests
> -        * should be added at the tail
> +        * should be added at the tail of current->bio_lists->recursion;
> +        * remainder bios resulting from a call to bio_queue_split() from
> +        * within ->make_request_fn() should be added to the head of
> +        * current->bio_lists->remainder.
>          */
> -       if (current->bio_list) {
> -               bio_list_add(current->bio_list, bio);
> +       if (current->bio_lists) {
> +               bio_list_add(&current->bio_lists->recursion, bio);
>                 goto out;
>         }
>
> @@ -2057,7 +2060,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>          * Before entering the loop, bio->bi_next is NULL (as all callers
>          * ensure that) so we have a list with a single bio.
>          * We pretend that we have just taken it off a longer list, so
> -        * we assign bio_list to a pointer to the bio_list_on_stack,
> +        * we assign bio_list to a pointer to the bio_lists_on_stack,
>          * thus initialising the bio_list of new bios to be
>          * added.  ->make_request() may indeed add some more bios
>          * through a recursive call to generic_make_request.  If it
> @@ -2067,8 +2070,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>          * bio_list, and call into ->make_request() again.
>          */
>         BUG_ON(bio->bi_next);
> -       bio_list_init(&bio_list_on_stack);
> -       current->bio_list = &bio_list_on_stack;
> +       bio_list_init(&bio_lists_on_stack.recursion);
> +       bio_list_init(&bio_lists_on_stack.remainder);
> +       current->bio_lists = &bio_lists_on_stack;
>         do {
>                 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> @@ -2076,16 +2080,14 @@ blk_qc_t generic_make_request(struct bio *bio)
>                         ret = q->make_request_fn(q, bio);
>
>                         blk_queue_exit(q);
> -
> -                       bio = bio_list_pop(current->bio_list);
>                 } else {
> -                       struct bio *bio_next = bio_list_pop(current->bio_list);
> -
>                         bio_io_error(bio);
> -                       bio = bio_next;
>                 }
> +               bio = bio_list_pop(&current->bio_lists->recursion);
> +               if (!bio)
> +                       bio = bio_list_pop(&current->bio_lists->remainder);
>         } while (bio);
> -       current->bio_list = NULL; /* deactivate */
> +       current->bio_lists = NULL; /* deactivate */
>
>  out:
>         return ret;
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..8da0c22 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>         struct bio *split, *res;
>         unsigned nsegs;
>
> +       BUG_ON(!current->bio_lists);
>         if ((*bio)->bi_rw & REQ_DISCARD)
>                 split = blk_bio_discard_split(q, *bio, bs, &nsegs);
>         else if ((*bio)->bi_rw & REQ_WRITE_SAME)
> @@ -190,7 +191,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>
>                 bio_chain(split, *bio);
>                 trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> -               generic_make_request(*bio);
> +               bio_list_add_head(&current->bio_lists->remainder, *bio);
>                 *bio = split;
>         }
>  }
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index eab505e..eb0b41a 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
>
>         trace_bcache_btree_write(b);
>
> -       BUG_ON(current->bio_list);
> +       BUG_ON(current->bio_lists);
>         BUG_ON(b->written >= btree_blocks(b));
>         BUG_ON(b->written && !i->keys);
>         BUG_ON(btree_bset_first(b)->seq != i->seq);
> @@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
>
>         /* Force write if set is too big */
>         if (set_bytes(i) > PAGE_SIZE - 48 &&
> -           !current->bio_list)
> +           !current->bio_lists)
>                 bch_btree_node_write(b, NULL);
>  }
>
> @@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
>  {
>         struct btree *b;
>
> -       BUG_ON(current->bio_list);
> +       BUG_ON(current->bio_lists);
>
>         lockdep_assert_held(&c->bucket_lock);
>
> @@ -976,7 +976,7 @@ retry:
>         b = mca_find(c, k);
>
>         if (!b) {
> -               if (current->bio_list)
> +               if (current->bio_lists)
>                         return ERR_PTR(-EAGAIN);
>
>                 mutex_lock(&c->bucket_lock);
> @@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
>
>         return 0;
>  split:
> -       if (current->bio_list) {
> +       if (current->bio_lists) {
>                 op->lock = b->c->root->level + 1;
>                 return -EAGAIN;
>         } else if (op->lock <= b->c->root->level) {
> @@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c, struct keylist *keys,
>         struct btree_insert_op op;
>         int ret = 0;
>
> -       BUG_ON(current->bio_list);
> +       BUG_ON(current->bio_lists);
>         BUG_ON(bch_keylist_empty(keys));
>
>         bch_btree_op_init(&op.op, 0);
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index cd77216..b64d4f0 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
>  #define DM_BUFIO_CACHE(c)      (dm_bufio_caches[dm_bufio_cache_index(c)])
>  #define DM_BUFIO_CACHE_NAME(c) (dm_bufio_cache_names[dm_bufio_cache_index(c)])
>
> -#define dm_bufio_in_request()  (!!current->bio_list)
> +#define dm_bufio_in_request()  (!!current->bio_lists)
>
>  static void dm_bufio_lock(struct dm_bufio_client *c)
>  {
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b5c4be7..7afc71c 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -663,6 +663,13 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
>         }
>  }
>
> +static inline bool current_has_pending_bios(void)
> +{
> +       return current->bio_lists && (
> +             !bio_list_empty(&current->bio_lists->recursion) ||
> +             !bio_list_empty(&current->bio_lists->remainder));
> +}
> +
>  extern struct md_cluster_operations *md_cluster_ops;
>  static inline int mddev_is_clustered(struct mddev *mddev)
>  {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index c7c8cde..811f2c0 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -876,8 +876,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>                                     (!conf->barrier ||
>                                      ((conf->start_next_window <
>                                        conf->next_resync + RESYNC_SECTORS) &&
> -                                     current->bio_list &&
> -                                     !bio_list_empty(current->bio_list))),
> +                                     current_has_pending_bios())),
>                                     conf->resync_lock);
>                 conf->nr_waiting--;
>         }
> @@ -1014,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>         struct r1conf *conf = mddev->private;
>         struct bio *bio;
>
> -       if (from_schedule || current->bio_list) {
> +       if (from_schedule || current->bio_lists) {
>                 spin_lock_irq(&conf->device_lock);
>                 bio_list_merge(&conf->pending_bio_list, &plug->pending);
>                 conf->pending_count += plug->pending_cnt;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index c7de2a5..2c70717 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -945,8 +945,7 @@ static void wait_barrier(struct r10conf *conf)
>                 wait_event_lock_irq(conf->wait_barrier,
>                                     !conf->barrier ||
>                                     (conf->nr_pending &&
> -                                    current->bio_list &&
> -                                    !bio_list_empty(current->bio_list)),
> +                                    current_has_pending_bios()),
>                                     conf->resync_lock);
>                 conf->nr_waiting--;
>         }
> @@ -1022,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
>         struct r10conf *conf = mddev->private;
>         struct bio *bio;
>
> -       if (from_schedule || current->bio_list) {
> +       if (from_schedule || current->bio_lists) {
>                 spin_lock_irq(&conf->device_lock);
>                 bio_list_merge(&conf->pending_bio_list, &plug->pending);
>                 conf->pending_count += plug->pending_cnt;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 9faebf7..57e45a0 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -598,6 +598,17 @@ struct bio_list {
>         struct bio *tail;
>  };
>
> +/* for generic_make_request() */
> +struct recursion_to_iteration_bio_lists {
> +       /* For stacking drivers submitting to their respective backend.
> +        * Added to tail, processed in FIFO order. */
> +       struct bio_list recursion;
> +       /* For pushing back the "remainder" part resulting from calling
> +        * blk_queue_split(). Added to head, processed in LIFO order,
> +        * once the "recursion" list has been drained. */
> +       struct bio_list remainder;
> +};
> +
>  static inline int bio_list_empty(const struct bio_list *bl)
>  {
>         return bl->head == NULL;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6e42ada..146eedc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -128,7 +128,7 @@ struct sched_attr {
>
>  struct futex_pi_state;
>  struct robust_list_head;
> -struct bio_list;
> +struct recursion_to_iteration_bio_lists;
>  struct fs_struct;
>  struct perf_event_context;
>  struct blk_plug;
> @@ -1727,7 +1727,7 @@ struct task_struct {
>         void *journal_info;
>
>  /* stacked block device info */
> -       struct bio_list *bio_list;
> +       struct recursion_to_iteration_bio_lists *bio_lists;
>
>  #ifdef CONFIG_BLOCK
>  /* stack plugging */
> --
> 1.9.1
>