Date:   Thu, 22 Sep 2016 20:10:48 -0600
From:   Jens Axboe <axboe@...nel.dk>
To:     Glauber Costa <glauber@...lladb.com>, linux-block@...r.kernel.org
Cc:     linux-kernel@...r.kernel.org
Subject: Re: [PATCH] cfq: fix starvation of asynchronous writes

On 09/22/2016 06:59 PM, Glauber Costa wrote:
> While debugging timeouts in my application workload (ScyllaDB), I have
> observed calls to open() taking a long time, ranging from 2 seconds -
> already enough to time out my application - to more than 30 seconds.
>
> The problem seems to happen because XFS may block on pending metadata
> updates under certain circumstances, as confirmed by the following
> backtrace, taken with the offcputime tool (iovisor/bcc):
>
>     ffffffffb90c57b1 finish_task_switch
>     ffffffffb97dffb5 schedule
>     ffffffffb97e310c schedule_timeout
>     ffffffffb97e1f12 __down
>     ffffffffb90ea821 down
>     ffffffffc046a9dc xfs_buf_lock
>     ffffffffc046abfb _xfs_buf_find
>     ffffffffc046ae4a xfs_buf_get_map
>     ffffffffc046babd xfs_buf_read_map
>     ffffffffc0499931 xfs_trans_read_buf_map
>     ffffffffc044a561 xfs_da_read_buf
>     ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
>     ffffffffc0452b90 xfs_dir2_leaf_lookup_int
>     ffffffffc0452e0f xfs_dir2_leaf_lookup
>     ffffffffc044d9d3 xfs_dir_lookup
>     ffffffffc047d1d9 xfs_lookup
>     ffffffffc0479e53 xfs_vn_lookup
>     ffffffffb925347a path_openat
>     ffffffffb9254a71 do_filp_open
>     ffffffffb9242a94 do_sys_open
>     ffffffffb9242b9e sys_open
>     ffffffffb97e42b2 entry_SYSCALL_64_fastpath
>     00007fb0698162ed [unknown]
>
> Inspecting my run with blktrace, I can see that the xfsaild kthread
> exhibits very high "Dispatch wait" times, in the tens-of-seconds range
> and consistent with the open() times I saw in that run.
>
> Still in the blktrace output, after searching a bit we can identify
> the request that wasn't dispatched:
>
>   8,0   11      152    81.092472813   804  A  WM 141698288 + 8 <- (8,1) 141696240
>   8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
>   8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
>   8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
>   <==== 'I' means Inserted (into the IO scheduler) ===================================>
>   8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
>   <==== Only 15s later the CFQ scheduler dispatches the request ======================>
>
> As we can see above, in this particular example CFQ took 15 seconds to dispatch
> this request. Going back to the full trace, we can see that the xfsaild queue
> had plenty of opportunity to run, and it was selected as the active queue many
> times. It would just always be preempted by something else (example):
>
>   8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
>   8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
>   8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
>   8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
>   8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
>   8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
>   8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0
>
> where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
> IO dispatchers.
>
> The requests preempting the xfsaild queue are synchronous requests. That's a
> characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
> While it can be argued that preempting ASYNC requests in favor of SYNC is part
> of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
> goal.
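>
> That preemption is indeed baked into CFQ. Paraphrasing the relevant
> rule in cfq_should_preempt() from memory (a sketch, not the exact
> source):
>
>     /*
>      * If the new request is sync, but the currently running queue is
>      * not, let the sync request have priority.
>      */
>     if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>             return true;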
>
> Moreover, unless I am misunderstanding something, that breaks the expectation
> set by the "fifo_expire_async" tunable, which in my system is set to the
> default.
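>
> (fifo_expire_async lives in /sys/block/<dev>/queue/iosched/ and bounds
> how long a request may sit in a queue's FIFO before it is considered
> expired. Roughly, again paraphrasing rather than quoting the source,
> the expiry test in cfq_check_fifo() is:
>
>     /* the head of the FIFO is the oldest request in this queue */
>     rq = rq_entry_fifo(cfqq->fifo.next);
>     if (ktime_get_ns() < rq->fifo_time)
>             rq = NULL;      /* not past its fifo_expire deadline yet */
>
> so an expired request is supposed to be picked next for this queue.)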
>
> Looking at the code, it seems to me that the issue is that after we make
> an async queue active, there is no guarantee that it will execute any request.
>
> When the queue itself tests cfq_may_dispatch(), it can bail if it sees
> SYNC requests in flight. An incoming request from another queue can
> also preempt it in such a situation before we have had the chance to
> execute anything (as seen in the trace above).
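>
> For reference, the bail-out in question is the async-vs-sync check in
> cfq_may_dispatch(), which looks more or less like this (paraphrased):
>
>     /*
>      * If this is an async queue and we have sync IO in flight,
>      * let it wait: we give up the dispatch slot we were just given.
>      */
>     if (cfqd->rq_in_flight[BLK_RW_SYNC] && !cfq_cfqq_sync(cfqq))
>             return false;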
>
> This patch sets the must_dispatch flag if we notice that we have
> requests that are already fifo_expired. The flag is always cleared
> after cfq_dispatch_request() returns from cfq_dispatch_requests(), so
> it won't pin the queue for subsequent requests (unless they are
> themselves expired).
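>
> In code, the idea is a one-liner at dispatch time (a condensed sketch
> of the approach, not the literal diff):
>
>     /*
>      * cfq_dispatch_request(): if the request at the head of the FIFO
>      * has already expired, pin the queue so that neither
>      * cfq_may_dispatch() nor a preempting sync queue can starve it.
>      */
>     rq = cfq_check_fifo(cfqq);
>     if (rq)
>             cfq_mark_cfqq_must_dispatch(cfqq);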
>
> Care is taken during preempt to still allow rt requests to preempt us
> regardless.
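>
> Sketched (again not the literal diff), the ordering in
> cfq_should_preempt() becomes:
>
>     if (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
>             return true;    /* rt queues may still preempt us */
>
>     /* but a queue pinned by must_dispatch is otherwise left alone */
>     if (cfq_cfqq_must_dispatch(cfqq))
>             return false;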
>
> Testing my workload with this patch applied produces much better results.
> From the application side I see no timeouts, and the open() latency histogram
> generated by systemtap looks much better, with the worst outlier at 131ms.

Good analysis and the fix looks nice and clean. I'll take a closer look 
tomorrow. Thanks!

-- 
Jens Axboe
