linux-kernel - Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAD=FV=Ub6zhVvTj79SWPUv19RDvD0gt5EjJV-FZSbYxUy_T1OA@mail.gmail.com>
Date:   Fri, 3 Apr 2020 08:10:45 -0700
From:   Doug Anderson <dianders@...omium.org>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     Jens Axboe <axboe@...nel.dk>,
        "James E.J. Bottomley" <jejb@...ux.ibm.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Paolo Valente <paolo.valente@...aro.org>,
        Salman Qazi <sqazi@...gle.com>,
        linux-block <linux-block@...r.kernel.org>,
        linux-scsi@...r.kernel.org, Guenter Roeck <groeck@...omium.org>,
        Ajay Joshi <ajay.joshi@....com>, Arnd Bergmann <arnd@...db.de>,
        Bart Van Assche <bvanassche@....org>,
        Chaitanya Kulkarni <chaitanya.kulkarni@....com>,
        Damien Le Moal <damien.lemoal@....com>,
        Hou Tao <houtao1@...wei.com>,
        Pavel Begunkov <asml.silence@...il.com>,
        Tejun Heo <tj@...nel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention

Hi,

On Thu, Apr 2, 2020 at 6:34 PM Ming Lei <ming.lei@...hat.com> wrote:
>
> On Thu, Apr 02, 2020 at 08:51:30AM -0700, Douglas Anderson wrote:
> > It is possible for two threads to be running
> > blk_mq_do_dispatch_sched() at the same time with the same "hctx".
> > This is because there can be more than one caller to
> > __blk_mq_run_hw_queue() with the same "hctx" and hctx_lock() doesn't
> > prevent more than one thread from entering.
> >
> > If more than one thread is running blk_mq_do_dispatch_sched() at the
> > same time with the same "hctx", they may have contention acquiring
> > budget.  The blk_mq_get_dispatch_budget() can eventually translate
> > into scsi_mq_get_budget().  If the device's "queue_depth" is 1 (not
> > uncommon) then only one of the two threads will be the one to
> > increment "device_busy" to 1 and get the budget.
> >
> > The losing thread will break out of blk_mq_do_dispatch_sched() and
> > will stop dispatching requests.  The assumption is that when more
> > budget is available later (when existing transactions finish) the
> > queue will be kicked again, perhaps in scsi_end_request().
> >
> > The winning thread now has budget and can go on to call
> > dispatch_request().  If dispatch_request() returns NULL here then we
> > have a potential problem.  Specifically we'll now call
>
> As I mentioned before, it is a BFQ specific issue, it tells blk-mq
> that it has work to do, and now the budget is assigned to the only
> winning thread, however, dispatch_request() still returns NULL.

Correct that it only happens with BFQ, but whether it's a BFQ bug or
not just depends on how you define the has_work() API.  If has_work()
is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug
depending on how you cut it.  If has_work() must be exact then it is
certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
not a BFQ bug.  I believe that a sane API could be defined either way.
Either has_work() can be defined as a lightweight hint to trigger
heavier code or it can be defined as something exact.  It's really up
to blk-mq to say how they define it.

>From his response on the SCSI patch [1], it sounded like Jens was OK
with has_work() being a lightweight hint as long as BFQ ensures that
the queues run later.  ...but, as my investigation found, I believe
that BFQ _does_ try to ensure that the queue is run at a later time by
calling blk_mq_run_hw_queues().  The irony is that due to the race
we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
be reliable if has_work() is inexact.  :(  One way to address this is
to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.

...so Jens: care to clarify how you'd like has_work() to be defined?


> > - Because this problem only trips with inexact has_work() it's
> >   believed that blk_mq_do_dispatch_ctx() does not need a similar
> >   change.
>
> Right, I prefer to fix it in BFQ, given it isn't a generic issue,
> not worth of generic solution.

Just to confirm: it sounds like you are saying that BFQ is not a first
class citizen here because not everyone uses BFQ.  Is that really the
best policy?  Couldn't I say that SCSI shouldn't be a first class
citizen because not everyone uses SCSI?  In such a case I could argue
that we should speed up the blk-mq layer by removing all the budget
code since SCSI is the only user.  I'm not actually suggesting it,
only pointing out that sometimes we need to make allowances in the
code.


> > @@ -97,12 +97,34 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >               if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
> >                       break;
> >
> > -             if (!blk_mq_get_dispatch_budget(hctx))
> > -                     break;
> > +
> > +             if (!blk_mq_get_dispatch_budget(hctx)) {
> > +                     /*
> > +                      * We didn't get budget so set contention.  If
> > +                      * someone else had the budget but didn't dispatch
> > +                      * they'll kick everything.  NOTE: we check one
> > +                      * extra time _after_ setting contention to fully
> > +                      * close the race.  If we don't actually dispatch
> > +                      * we'll detext faux contention (with ourselves)
> > +                      * but that should be rare.
> > +                      */
> > +                     atomic_set(&q->budget_contention, 1);
> > +
> > +                     if (!blk_mq_get_dispatch_budget(hctx))
>
> scsi_mq_get_budget() implies a smp_mb(), so the barrier can order
> between blk_mq_get_dispatch_budget() and atomic_set(&q->budget_contention, 0|1).

I always struggle when working with memory barriers.  I think you are
saying that this section of code is OK, though, presumably because the
SCSI code does "atomic_inc_return" which implies this barrier.

Do you happen to know if it's documented that anyone who implements
get_budget() for "struct blk_mq_ops" will have a memory barrier here?
I know SCSI is the only existing user, but we'd want to keep it
generic.


> > +                             break;
> > +             }
> >
> >               rq = e->type->ops.dispatch_request(hctx);
> >               if (!rq) {
> >                       blk_mq_put_dispatch_budget(hctx);
> > +
> > +                     /*
> > +                      * We've released the budget but us holding it might
> > +                      * have prevented someone else from dispatching.
> > +                      * Detect that case and run all queues again.
> > +                      */
> > +                     if (atomic_read(&q->budget_contention))
>
> scsi_mq_put_budget() doesn't imply smp_mb(), so one smp_mb__after_atomic()
> is needed between the above two op.

OK, thanks.  I will add that we decide to move forward with this
patch.  Again I'd wonder if this type of thing should be documented.


> > +                             blk_mq_run_hw_queues(q, true);
> >                       break;
> >               }
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 2cd8d2b49ff4..6163c43ceca5 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1528,6 +1528,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> >       struct blk_mq_hw_ctx *hctx;
> >       int i;
> >
> > +     /* We're running the queues, so clear the contention detector */
> > +     atomic_set(&q->budget_contention, 0);
> > +
>
> You add extra cost in fast path.

Yes, but it is a fairly minor cost added.  It's called once in a place
where we're unlikely to be looping and where we're about to do a lot
heavier operations (perhaps in a loop).  This doesn't feel like a
dealbreaker.


> >       queue_for_each_hw_ctx(q, hctx, i) {
> >               if (blk_mq_hctx_stopped(hctx))
> >                       continue;
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index f629d40c645c..07f21e45d993 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -583,6 +583,8 @@ struct request_queue {
> >
> >  #define BLK_MAX_WRITE_HINTS  5
> >       u64                     write_hints[BLK_MAX_WRITE_HINTS];
> > +
> > +     atomic_t                budget_contention;
>
> It needn't to be a atomic variable, and simple 'unsigned'
> int should be fine, what matters is that the order between
> R/W this flag and get/put budget.

Apparently I'm the only one who feels atomic_t is more documenting
here.  I will cede to the will of the majority here and change to
'unsigned int' if we decide to move forward with this patch.

-Doug

[1] https://lore.kernel.org/r/d6af2344-11f7-5862-daed-e21cbd496d92@kernel.dk