linux-kernel - Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAD=FV=Vsk0SjkA+DbUwJxvO6NFcr0CO9=H1FD7okJ2PxMt5pYA@mail.gmail.com>
Date:   Fri, 3 Apr 2020 08:49:54 -0700
From:   Doug Anderson <dianders@...omium.org>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     Jens Axboe <axboe@...nel.dk>,
        "James E.J. Bottomley" <jejb@...ux.ibm.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Paolo Valente <paolo.valente@...aro.org>,
        Salman Qazi <sqazi@...gle.com>,
        linux-block <linux-block@...r.kernel.org>,
        linux-scsi@...r.kernel.org, Guenter Roeck <groeck@...omium.org>,
        Ajay Joshi <ajay.joshi@....com>, Arnd Bergmann <arnd@...db.de>,
        Bart Van Assche <bvanassche@....org>,
        Chaitanya Kulkarni <chaitanya.kulkarni@....com>,
        Damien Le Moal <damien.lemoal@....com>,
        Hou Tao <houtao1@...wei.com>,
        Pavel Begunkov <asml.silence@...il.com>,
        Tejun Heo <tj@...nel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 2/2] blk-mq: Rerun dispatching in the case of budget contention

Hi,

On Fri, Apr 3, 2020 at 8:10 AM Doug Anderson <dianders@...omium.org> wrote:
>
> Correct that it only happens with BFQ, but whether it's a BFQ bug or
> not just depends on how you define the has_work() API.  If has_work()
> is allowed to be in-exact then it's either a blk-mq bug or a SCSI bug
> depending on how you cut it.  If has_work() must be exact then it is
> certainly a BFQ bug.  If has_work() doesn't need to be exact then it's
> not a BFQ bug.  I believe that a sane API could be defined either way.
> Either has_work() can be defined as a lightweight hint to trigger
> heavier code or it can be defined as something exact.  It's really up
> to blk-mq to say how they define it.
>
> From his response on the SCSI patch [1], it sounded like Jens was OK
> with has_work() being a lightweight hint as long as BFQ ensures that
> the queues run later.  ...but, as my investigation found, I believe
> that BFQ _does_ try to ensure that the queue is run at a later time by
> calling blk_mq_run_hw_queues().  The irony is that due to the race
> we're talking about here, blk_mq_run_hw_queues() isn't guaranteed to
> be reliable if has_work() is inexact.  :(  One way to address this is
> to make blk_mq_run_hw_queues() reliable even if has_work() is inexact.
>
> ...so Jens: care to clarify how you'd like has_work() to be defined?

Sorry to reply so quickly after my prior reply, but I might have found
an extreme corner case where we can still run into the same race even
if has_work() is exact.  This is all theoretical from code analysis.
Maybe you can poke a hole in my scenario or tell me it's so
implausible that we don't care, but it seems like it's theoretically
possible.  For this example I'll assume a budget of 1 (AKA only one
thread can get budget for a given queue):

* Threads A and B both run has_work() at the same time with the same
  "hctx".  has_work() is exact but there's no lock, so it's OK if
  Thread A and B both get back true.

* Thread B gets interrupted for a long time right after it decides
  that there is work.  Maybe its CPU gets an interrupt and the
  interrupt handler is slow.

* Thread A runs, get budget, dispatches work.

* Thread A's work finishes and budget is released.

* Thread B finally runs again and gets budget.

* Since Thread A already took care of the work and no new work has
  come in, Thread B will get NULL from dispatch_request().  I believe
  this is specifically why dispatch_request() is allowed to return
  NULL in the first place if has_work() must be exact.

* Thread B will now be holding the budget and is about to call
  put_budget(), but hasn't called it yet.

* Thread B gets interrupted for a long time (again).  Dang interrupts.

* Now Thread C (with a different "hctx" but the same queue) comes
  along and runs blk_mq_do_dispatch_sched().

* Thread C won't do anything because it can't get budget.

* Finally Thread B will run again and put the budget without kicking
  any queues.

Now we have a potential I/O stall because nobody will ever kick the
queues.

I think the above example could happen even on non-BFQ systems and I
think it would also be fixed by an approach like the one in my patch.

-Doug