linux-kernel - Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200405091446.GA3421@localhost.localdomain>
Date:   Sun, 5 Apr 2020 17:14:46 +0800
From:   Ming Lei <ming.lei@...hat.com>
To:     Doug Anderson <dianders@...omium.org>
Cc:     Jens Axboe <axboe@...nel.dk>,
        "James E.J. Bottomley" <jejb@...ux.ibm.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Paolo Valente <paolo.valente@...aro.org>,
        Salman Qazi <sqazi@...gle.com>,
        linux-block <linux-block@...r.kernel.org>,
        linux-scsi@...r.kernel.org, Guenter Roeck <groeck@...omium.org>,
        Ajay Joshi <ajay.joshi@....com>, Arnd Bergmann <arnd@...db.de>,
        Bart Van Assche <bvanassche@....org>,
        Chaitanya Kulkarni <chaitanya.kulkarni@....com>,
        Damien Le Moal <damien.lemoal@....com>,
        Hou Tao <houtao1@...wei.com>,
        Pavel Begunkov <asml.silence@...il.com>,
        Tejun Heo <tj@...nel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget
 contention

On Fri, Apr 03, 2020 at 08:49:54AM -0700, Doug Anderson wrote:
> Hi,
> 
> On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@...omium.org> wrote:
> >
> > Correct that it only happens with BFQ, but whether it's a BFQ bug or
> > not just depends on how you define the has_work() API.  If has_work()
> > is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug
> > depending on how you cut it.  If has_work() must be exact then it is
> > certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
> > not a BFQ bug.  I believe that a sane API could be defined either way.
> > Either has_work() can be defined as a lightweight hint to trigger
> > heavier code or it can be defined as something exact.  It's really up
> > to blk-mq to say how they define it.
> >
> > From his response on the SCSI patch [1], it sounded like Jens was OK
> > with has_work() being a lightweight hint as long as BFQ ensures that
> > the queues run later.  ...but, as my investigation found, I believe
> > that BFQ _does_ try to ensure that the queue is run at a later time by
> > calling blk_mq_run_hw_queues().  The irony is that due to the race
> > we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
> > be reliable if has_work() is inexact.  :(  One way to address this is
> > to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.
> >
> > ...so Jens: care to clarify how you'd like has_work() to be defined?
> 
> Sorry to reply so quickly after my prior reply, but I might have found
> an extreme corner case where we can still run into the same race even
> if has_work() is exact.  This is all theoretical from code analysis.
> Maybe you can poke a hole in my scenario or tell me it's so
> implausible that we don't care, but it seems like it's theoretically
> possible.  For this example I'll assume a budget of 1 (AKA only one
> thread can get budget for a given queue):
> 
> * Threads A and B both run has_work() at the same time with the same
>   "hctx".  has_work() is exact but there's no lock, so it's OK if
>   Thread A and B both get back true.
> 
> * Thread B gets interrupted for a long time right after it decides
>   that there is work.  Maybe its CPU gets an interrupt and the
>   interrupt handler is slow.
> 
> * Thread A runs, get budget, dispatches work.
> 
> * Thread A's work finishes and budget is released.
> 
> * Thread B finally runs again and gets budget.
> 
> * Since Thread A already took care of the work and no new work has
>   come in, Thread B will get NULL from dispatch_request().  I believe
>   this is specifically why dispatch_request() is allowed to return
>   NULL in the first place if has_work() must be exact.
> 
> * Thread B will now be holding the budget and is about to call
>   put_budget(), but hasn't called it yet.
> 
> * Thread B gets interrupted for a long time (again).  Dang interrupts.
> 
> * Now Thread C (with a different "hctx" but the same queue) comes
>   along and runs blk_mq_do_dispatch_sched().
> 
> * Thread C won't do anything because it can't get budget.
> 
> * Finally Thread B will run again and put the budget without kicking
>   any queues.
> 
> Now we have a potential I/O stall because nobody will ever kick the
> queues.
> 
> 
> I think the above example could happen even on non-BFQ systems and I
> think it would also be fixed by an approach like the one in my patch.

OK, looks it isn't specific on BFQ any more.

Follows another candidate approach for this issue, given it is so hard
to trigger, we can make it more reliable by rerun queue when has_work()
returns true after ops->dispath_request() returns NULL.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 74cedea56034..4408e5d4fcd8 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -80,6 +80,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
        blk_mq_run_hw_queue(hctx, true);
 }

+#define BLK_MQ_BUDGET_DELAY    3               /* ms units */
 /*
  * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
  * its queue by itself in its completion handler, so we don't need to
@@ -103,6 +104,9 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
                rq = e->type->ops.dispatch_request(hctx);
                if (!rq) {
                        blk_mq_put_dispatch_budget(hctx);
+
+                       if (e->type->ops.has_work && e->type->ops.has_work(hctx))
+                               blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
                        break;
                }


Thanks, 
Ming