Message-ID: <CAD-J=zZsQ-JDzRyLtTarkPf8u9Vm9z-r00K=AA05LDNnp=G_2w@mail.gmail.com>
Date:   Fri, 23 Sep 2016 08:19:38 -0400
From:   Glauber Costa <glauber@...lladb.com>
To:     Paolo Valente <paolo.valente@...more.it>
Cc:     Linus Walleij <linus.walleij@...aro.org>,
        Jens Axboe <axboe@...nel.dk>, linux-kernel@...r.kernel.org,
        Mark Brown <broonie@...nel.org>, linux-block@...r.kernel.org,
        Ulf Hansson <ulf.hansson@...aro.org>
Subject: Re: [PATCH] cfq: fix starvation of asynchronous writes

On Fri, Sep 23, 2016 at 7:28 AM, Glauber Costa <glauber@...lladb.com> wrote:
> On Sep 23, 2016 6:22 AM, "Paolo Valente" <paolo.valente@...more.it> wrote:
>>
>>
>> > On 23 Sep 2016, at 02:59, Glauber Costa <glauber@...lladb.com> wrote:
>> >
>> > While debugging timeouts happening in my application workload (ScyllaDB),
>> > I have observed calls to open() taking a long time, ranging anywhere from
>> > 2 seconds - already enough to time out my application - to more than 30
>> > seconds.
>> >
>> > The problem seems to happen because XFS may block on pending metadata
>> > updates under certain circumstances, and that's confirmed by the following
>> > backtrace taken with the offcputime tool (iovisor/bcc):
>> >
>> >    ffffffffb90c57b1 finish_task_switch
>> >    ffffffffb97dffb5 schedule
>> >    ffffffffb97e310c schedule_timeout
>> >    ffffffffb97e1f12 __down
>> >    ffffffffb90ea821 down
>> >    ffffffffc046a9dc xfs_buf_lock
>> >    ffffffffc046abfb _xfs_buf_find
>> >    ffffffffc046ae4a xfs_buf_get_map
>> >    ffffffffc046babd xfs_buf_read_map
>> >    ffffffffc0499931 xfs_trans_read_buf_map
>> >    ffffffffc044a561 xfs_da_read_buf
>> >    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>> >    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>> >    ffffffffc0452e0f xfs_dir2_leaf_lookup
>> >    ffffffffc044d9d3 xfs_dir_lookup
>> >    ffffffffc047d1d9 xfs_lookup
>> >    ffffffffc0479e53 xfs_vn_lookup
>> >    ffffffffb925347a path_openat
>> >    ffffffffb9254a71 do_filp_open
>> >    ffffffffb9242a94 do_sys_open
>> >    ffffffffb9242b9e sys_open
>> >    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>> >    00007fb0698162ed [unknown]
>> >
>> > Inspecting my run with blktrace, I can see that the xfsaild kthread
>> > exhibits very high "Dispatch wait" times, in the tens-of-seconds range and
>> > consistent with the open() times I saw in that run.
>> >
>> > Still from the blktrace output, after searching a bit we can identify the
>> > request that wasn't dispatched:
>> >
>> >  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
>> >  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
>> >  8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
>> >  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
>> >  <==== 'I' means Inserted (into the IO scheduler) ===================================>
>> >  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
>> >  <==== Only 15s later the CFQ scheduler dispatches the request ======================>
>> >
>> > As we can see above, in this particular example CFQ took 15 seconds to
>> > dispatch this request. Going back to the full trace, we can see that the
>> > xfsaild queue had plenty of opportunity to run, and it was selected as the
>> > active queue many times. It would just always be preempted by something
>> > else (example):
>> >
>> >  8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
>> >  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
>> >  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
>> >  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
>> >  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
>> >  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
>> >  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
>> >
>> > where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the
>> > ScyllaDB IO dispatchers.
>> >
>> > The requests preempting the xfsaild queue are synchronous requests. That's
>> > a characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT
>> > requests. While it can be argued that preempting ASYNC requests in favor
>> > of SYNC ones is part of the CFQ logic, I don't believe that doing so for
>> > 15+ seconds is anyone's goal.
>> >
>> > Moreover, unless I am misunderstanding something, that breaks the
>> > expectation set by the "fifo_expire_async" tunable, which in my system is
>> > set to the default.
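>> >
>> > (For reference, the defaults live in block/cfq-iosched.c. Quoting from
>> > memory here - treat this as a sketch of the logic, not the verbatim
>> > source - they look something like:
>> >
>> >     /* max time before requests in the fifo expire: index 0 async, 1 sync */
>> >     static u64 cfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };
>> >
>> > i.e. an async request is supposed to expire roughly 250ms after being
>> > queued - nowhere near the 15+ seconds observed here.)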
>> >
>> > Looking at the code, it seems to me that the issue is that after we make
>> > an async queue active, there is no guarantee that it will execute any
>> > request.
>> >
>> > When the queue itself tests cfq_may_dispatch(), it can bail if it sees
>> > SYNC requests in flight. An incoming request from another queue can also
>> > preempt it in such a situation before we have the chance to execute
>> > anything (as seen in the trace above).
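>> >
>> > Concretely, the bail-out I am referring to is, roughly, this check in
>> > cfq_may_dispatch() (again paraphrased from memory - a sketch of the
>> > logic rather than a verbatim quote):
>> >
>> >     /*
>> >      * If this is an async queue and we have sync IO in flight,
>> >      * let it wait
>> >      */
>> >     if (cfqd->rq_in_flight[BLK_RW_SYNC] && !cfq_cfqq_sync(cfqq))
>> >             return false;
>> >
>> > Nothing on that path, before this patch, exempts a queue whose requests
>> > have already blown past their fifo deadline.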
>> >
>>
>> The BFQ scheduler should not suffer from this issue, because it does not
>> consider this condition at all for preemption. Even better, BFQ performs
>> preemption as a function of just one condition, which is formally proven
>> not to break any request-delay guarantee.
>>
>> Would you be willing to give it a try? It might solve, or at least
>> mitigate, the specific latency problem addressed by your patch, and at the
>> same time provide many other benefits in terms of low latency and high
>> throughput.
>
> We are definitely interested in trying different elevators that could
> benefit our workload, especially because right now we don't really have a
> default recommendation.
>
> But just to be clear, I saw very early on that the problem didn't happen
> with noop or deadline, so if this were just a characteristic of CFQ I
> would have stopped there.
>
> I do believe, however, as I said in my commit log message, that this is
> just a bug. And no bug shall remain unfixed in my presence.

Sorry this was sent in HTML - I just got mail from vger.kernel.org saying
the message was refused because of that. I haven't dealt with LKML in ages,
and my Gmail phone interface was sending HTML messages by default.

>
>>
>> If you are willing to try it, then in [1] you can find the latest BFQ
>> development branch, rebased against 4.8-rc7. These commits add BFQ as an
>> extra scheduler.
>>
>> Otherwise you can find BFQ for 4.7 here [2], and BFQ for 4.6 here [3].
>>
>> If you prefer patches, you can find them in [4] for 4.7, or in [5] for 4.6.
>>
>> Finally, if you need details about BFQ, have a look at [6], or just ask!
>>
>> Thanks,
>> Paolo
>>
>> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
>> [2] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.7
>> [3] https://github.com/linusw/linux-bfq/tree/bfq-v8-v4.6
>> [4] http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.7.0-v8r3/
>> [5] http://algogroup.unimore.it/people/paolo/disk_sched/patches/4.6.0-v8r3/
>> [6] http://algogroup.unimore.it/people/paolo/disk_sched/
>>
>> > This patch sets the must_dispatch flag if we notice that we have requests
>> > that are already fifo_expired. This flag is always cleared after
>> > cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't
>> > pin the queue for subsequent requests (unless they are themselves
>> > expired).
>> >
>> > Care is taken during preempt to still allow rt requests to preempt us
>> > regardless.
>> >
>> > Testing my workload with this patch applied produces much better results.
>> > From the application side I see no timeouts, and the open() latency
>> > histogram generated by systemtap looks much better, with the worst
>> > outlier at 131ms:
>> >
>>
>> > Latency histogram of xfs_buf_lock acquisition (microseconds):
>> >  value |-------------------------------------------------- count
>> >      0 |                                                     11
>> >      1 |@@@@                                                161
>> >      2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
>> >      4 |@                                                    54
>> >      8 |                                                     36
>> >     16 |                                                      7
>> >     32 |                                                      0
>> >     64 |                                                      0
>> >        ~
>> >   1024 |                                                      0
>> >   2048 |                                                      0
>> >   4096 |                                                      1
>> >   8192 |                                                      1
>> >  16384 |                                                      2
>> >  32768 |                                                      0
>> >  65536 |                                                      0
>> > 131072 |                                                      1
>> > 262144 |                                                      0
>> > 524288 |                                                      0
>> >
>> > Signed-off-by: Glauber Costa <glauber@...lladb.com>
>> > CC: Jens Axboe <axboe@...nel.dk>
>> > CC: linux-block@...r.kernel.org
>> > CC: linux-kernel@...r.kernel.org
>> > ---
>> > block/cfq-iosched.c | 13 ++++++++++---
>> > 1 file changed, 10 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> > index cc2f6db..5e24d88 100644
>> > --- a/block/cfq-iosched.c
>> > +++ b/block/cfq-iosched.c
>> > @@ -3042,7 +3042,6 @@ static struct request *cfq_check_fifo(struct cfq_queue *cfqq)
>> >       if (ktime_get_ns() < rq->fifo_time)
>> >               rq = NULL;
>> >
>> > -     cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
>> >       return rq;
>> > }
>> >
>> > @@ -3420,6 +3419,9 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> > {
>> >       unsigned int max_dispatch;
>> >
>> > +     if (cfq_cfqq_must_dispatch(cfqq))
>> > +             return true;
>> > +
>> >       /*
>> >        * Drain async requests before we start sync IO
>> >        */
>> > @@ -3511,15 +3513,20 @@ static bool cfq_dispatch_request(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> >
>> >       BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
>> >
>> > +     rq = cfq_check_fifo(cfqq);
>> > +     if (rq)
>> > +             cfq_mark_cfqq_must_dispatch(cfqq);
>> > +
>> >       if (!cfq_may_dispatch(cfqd, cfqq))
>> >               return false;
>> >
>> >       /*
>> >        * follow expired path, else get first next available
>> >        */
>> > -     rq = cfq_check_fifo(cfqq);
>> >       if (!rq)
>> >               rq = cfqq->next_rq;
>> > +     else
>> > +             cfq_log_cfqq(cfqq->cfqd, cfqq, "fifo=%p", rq);
>> >
>> >       /*
>> >        * insert request into driver dispatch list
>> > @@ -3989,7 +3996,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>> >        * if the new request is sync, but the currently running queue is
>> >        * not, let the sync request have priority.
>> >        */
>> > -     if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>> > +     if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq) && !cfq_cfqq_must_dispatch(cfqq))
>> >               return true;
>> >
>> >       /*
>> > --
>> > 2.5.5
>> >
>>
>>
>> --
>> Paolo Valente
>> Algogroup
>> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
>> Via Campi 213/B
>> 41125 Modena - Italy
>> http://algogroup.unimore.it/people/paolo/
