linux-kernel - Re: [RFC] block: fix blk_queue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACVXFVMtMi9WXYQLa3_XZi7W8ZsJe1GnnTLX+7tRMhFAMUSHtA@mail.gmail.com>
Date:	Sat, 25 Jun 2016 17:30:29 +0800
From:	Ming Lei <ming.lei@...onical.com>
To:	Lars Ellenberg <lars.ellenberg@...bit.com>
Cc:	linux-block@...r.kernel.org,
	Roland Kammerer <roland.kammerer@...bit.com>,
	Jens Axboe <axboe@...nel.dk>, NeilBrown <neilb@...e.com>,
	Kent Overstreet <kent.overstreet@...il.com>,
	Shaohua Li <shli@...nel.org>, Alasdair Kergon <agk@...hat.com>,
	Mike Snitzer <snitzer@...hat.com>,
	"open list:DEVICE-MAPPER (LVM)" <dm-devel@...hat.com>,
	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Takashi Iwai <tiwai@...e.de>, Jiri Kosina <jkosina@...e.cz>,
	Zheng Liu <gnehzuil.liu@...il.com>,
	Keith Busch <keith.busch@...el.com>,
	"Martin K. Petersen" <martin.petersen@...cle.com>,
	"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	"open list:BCACHE (BLOCK LAYER CACHE)" <linux-bcache@...r.kernel.org>,
	"open list:SOFTWARE RAID (Multiple Disks) SUPPORT" 
	<linux-raid@...r.kernel.org>
Subject: Re: [RFC] block: fix blk_queue_split() resource exhaustion

On Fri, Jun 24, 2016 at 10:27 PM, Lars Ellenberg
<lars.ellenberg@...bit.com> wrote:
> On Fri, Jun 24, 2016 at 07:36:57PM +0800, Ming Lei wrote:
>> >
>> > This is not a theoretical problem.
>> > At least int DRBD, and an unfortunately high IO concurrency wrt. the
>> > "max-buffers" setting, without this patch we have a reproducible deadlock.
>>
>> Is there any log about the deadlock? And is there any lockdep warning
>> if it is enabled?
>
> In DRBD, to avoid potentially very long internal queues as we wait for
> our replication peer device and local backend, we limit the number of
> in-flight bios we accept, and block in our ->make_request_fn() if that
> number exceeds a configured watermark ("max-buffers").
>
> Works fine, as long as we could assume that once our make_request_fn()
> returns, any bios we "recursively" submitted against the local backend
> would be dispatched. Which used to be the case.
>
> commit 54efd50 block: make generic_make_request handle arbitrarily sized bios
> introduced blk_queue_split(), which will split any bio that is
> violating the queue limits into a smaller piece to be processed
> right away, and queue the "rest" on current->bio_list to be
> processed by the iteration in generic_make_request() once the
> current ->make_request_fn() returns.
>
> Before that, any user was supposed to go through bio_add_page(),
> which would call our merge bvec function, and that should already
> be sufficient to not violate queue limits.
>
> Previously, typically the next in line bio to be processed would
> be the cloned one we passed down to our backend device, which in
> case of further stacking devices (e.g. logical volume, raid1 or
> similar) may again push further bios on that list.
> All of which would simply be processed in turn.

Could you clarify if the issue is triggered in drbd without dm/md involved?
Or the issue is always triggered with dm/md over drbd?

As Mike mentioned, there is one known issue with dm-snapshot.

>
> Now, with blk_queue_split(), the next-in-line bio would be the
> next piece of the "too big" original bio, so we end up calling
> several times into our ->make_request_fn() without even
> dispatching one bio to the backend.

If your ->make_request_fn() is called several times, each time
at least you should dispatch one bio returnd from blk_queue_split(),
so I don't understand why even one bio isn't dispatched out.

>
> With highly concurrent IO and/or very small "max-buffers" setting,
> this can deadlock, where the current submit path would wait for
> the completion of the bios supposedly already passed to the
> backend, but which actually are not even dispatched yet, rotting
> away on current->bio_list.

Suppose your ->make_request_fn only handles the bio returned
from blk_queue_split(), I don't understand why your driver waits
for the 'rest' bios from the blk_queue_split().

Maybe drbd driver calls generic_make_request() in
->make_request_fn() and supposes the bio is always dispatched
out once generic_make_request() returns, is it right?

>
> I could code around that in various ways inside the DRBD make_request_fn,
> or always dispatch the backend bios to a different (thread or work_queue)
> context, but I'd rather have the distinction between "recursively
> submitted" bios and "pushed back part of originally passed in bio" as
> implemented in this patch.
>
> Thanks,
>
>     Lars
>
>> > I'm unsure if it could cause problems in md raid in some corner cases
>> > or when a rebuild/scrub/... starts in a "bad" moment.
>> >
>> > It also may increase "stress" on any involved bio_set,
>> > because it may increase the number of bios that are allocated
>> > before the first of them is actually processed, causing more
>> > frequent calls of punt_bios_to_rescuer().
>> >
>> > Alternatively,
>> > a simple one-line change to generic_make_request() would also "fix" it,
>> > by processing all recursive bios in LIFO order.
>> > But that may change the order in which bios reach the "physical" layer,
>> > in case they are in fact split into many smaller pieces, for example in
>> > a striping setup, which may have other performance implications.
>> >
>> >  | diff --git a/block/blk-core.c b/block/blk-core.c
>> >  | index 2475b1c7..a5623f6 100644
>> >  | --- a/block/blk-core.c
>> >  | +++ b/block/blk-core.c
>> >  | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>> >  |       * should be added at the tail
>> >  |       */
>> >  |      if (current->bio_list) {
>> >  | -            bio_list_add(current->bio_list, bio);
>> >  | +            bio_list_add_head(current->bio_list, bio);
>> >  |              goto out;
>> >  |      }
>
>> > diff --git a/block/blk-core.c b/block/blk-core.c
>> > index 2475b1c7..f03ff4c 100644
>> > --- a/block/blk-core.c
>> > +++ b/block/blk-core.c
>> > @@ -2031,7 +2031,7 @@ end_io:
>> >   */
>> >  blk_qc_t generic_make_request(struct bio *bio)
>> >  {
>> > -       struct bio_list bio_list_on_stack;
>> > +       struct recursion_to_iteration_bio_lists bio_lists_on_stack;
>> >         blk_qc_t ret = BLK_QC_T_NONE;
>> >
>> >         if (!generic_make_request_checks(bio))
>> > @@ -2040,15 +2040,18 @@ blk_qc_t generic_make_request(struct bio *bio)
>
>> > -       if (current->bio_list) {
>> > -               bio_list_add(current->bio_list, bio);
>> > +       if (current->bio_lists) {
>> > +               bio_list_add(&current->bio_lists->recursion, bio);
>> >                 goto out;
>> >         }
>> >
>
>> > @@ -2067,8 +2070,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>> >          * bio_list, and call into ->make_request() again.
>> >          */
>> >         BUG_ON(bio->bi_next);
>> > -       bio_list_init(&bio_list_on_stack);
>> > -       current->bio_list = &bio_list_on_stack;
>> > +       bio_list_init(&bio_lists_on_stack.recursion);
>> > +       bio_list_init(&bio_lists_on_stack.remainder);
>> > +       current->bio_lists = &bio_lists_on_stack;
>> >         do {
>> >                 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>> >
>> > @@ -2076,16 +2080,14 @@ blk_qc_t generic_make_request(struct bio *bio)
>> >                         ret = q->make_request_fn(q, bio);
>> >
>> >                         blk_queue_exit(q);
>> > -
>> > -                       bio = bio_list_pop(current->bio_list);
>> >                 } else {
>> > -                       struct bio *bio_next = bio_list_pop(current->bio_list);
>> > -
>> >                         bio_io_error(bio);
>> > -                       bio = bio_next;
>> >                 }
>> > +               bio = bio_list_pop(&current->bio_lists->recursion);
>> > +               if (!bio)
>> > +                       bio = bio_list_pop(&current->bio_lists->remainder);
>> >         } while (bio);
>> > -       current->bio_list = NULL; /* deactivate */
>> > +       current->bio_lists = NULL; /* deactivate */
>> >
>> >  out:
>> >         return ret;
>> > diff --git a/block/blk-merge.c b/block/blk-merge.c
>> > index 2613531..8da0c22 100644
>> > --- a/block/blk-merge.c
>> > +++ b/block/blk-merge.c
>> > @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>> >         struct bio *split, *res;
>> >         unsigned nsegs;
>> >
>> > +       BUG_ON(!current->bio_lists);
>> >         if ((*bio)->bi_rw & REQ_DISCARD)
>> >                 split = blk_bio_discard_split(q, *bio, bs, &nsegs);
>> >         else if ((*bio)->bi_rw & REQ_WRITE_SAME)
>> > @@ -190,7 +191,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>> >
>> >                 bio_chain(split, *bio);
>> >                 trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>> > -               generic_make_request(*bio);
>> > +               bio_list_add_head(&current->bio_lists->remainder, *bio);
>> >                 *bio = split;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-block" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html