linux-kernel - Re: [Fwd: [RFC] cfq: adapt slice to number of processes doing I/O (v2.1)]

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4e5e476b0909250105m18983420g20e39cb2b124af04@mail.gmail.com>
Date:	Fri, 25 Sep 2009 10:05:52 +0200
From:	Corrado Zoccolo <czoccolo@...il.com>
To:	Shan Wei <shanwei@...fujitsu.com>
Cc:	Jens Axboe <jens.axboe@...cle.com>, Jeff Moyer <jmoyer@...hat.com>,
	linux-kernel@...r.kernel.org, Tobias Oetiker <tobi@...iker.ch>
Subject: Re: [Fwd: [RFC] cfq: adapt slice to number of processes doing I/O 
	(v2.1)]

Hi Shan,
On Thu, Sep 24, 2009 at 11:21 AM, Shan Wei <shanwei@...fujitsu.com> wrote:
>> Subject:
>> [RFC] cfq: adapt slice to number of processes doing I/O (v2.1)
>>
>> When the number of processes performing I/O concurrently increases,
>> a fixed time slice per process will cause large latencies.
>>
>> This (v2.1) patch will scale the time slice assigned to each process,
>> according to a target latency (tunable from sysfs, default 300ms).
>>
>> In order to keep fairness among processes, we adopt two devices, w.r.t. v1.
>>
>> * The number of active processes is computed using a special form of
>> running average, that quickly follows sudden increases (to keep latency low),
>> and decrease slowly (to have fairness in spite of rapid decreases of this
>> value).
>>
>> * The idle time is computed using remaining slice as a maximum.
>>
>> To safeguard sequential bandwidth, we impose a minimum time slice
>> (computed using 2*cfq_slice_idle as base, adjusted according to priority
>> and async-ness).
>>
>> Signed-off-by: Corrado Zoccolo <czoccolo@...il.com>
>>
>
> I'm interested in the idea of dynamically tuning the time slice according to the number of
> processes. I have tested your patch using Jeff's tool on kernel-2.6.30-rc4.
> From the following test result, after applying your patch, the fairness is not good
> as original kernel, e.g. io priority of 4 vs 5 in be0-through-7.fio case.

In Jeff's test, the results are always in the same order, so you can
compare different runs, e.g. be4-x-8.fio with be0-through-7.fio.
If we compare the processes that caused the priority inversion, i.e.
5th and 6th process, in the equal priority case we have:

for original cfq:
(5th) be      4       64548   42388   -35
(6th) be      4       64548   73780   14

for patched cfq:
(5th) be      4       57384   55028   -5
(6th) be      4       57384   69620   21

It is clear that the 5th file has slower (between 30% and 50%) access
than the 6th (maybe caused by fragmentation, or bad disk placement),
so this can easily explain the 11% priority inversion.
(5th) be      4       56116   18420   -68
(6th) be      5       44893   19700   -57

BTW, with smaller slices, even a single seek while performing
sequential read may cause a bigger degradation, so the effect on the
patched cfq can be more evident.

>From your numbers, actually, I see much better fairness than what Jeff
experienced with the first version of my patch, so the improvements
have actually paid out.

> And the throughout(total data transferred) becomes lower.
That is expected. We trade some throughput for better latency. You can
increase the target_latency to have back your throughput (e.g. for
non-interactive servers). Decreasing target_latency too much will not
improve the latency indefinitely, since we have a lower bound for the
slice, though (and hardware limitations).

> Have you tested buffered write, multi-threads?
Yes. One interesting test was done by Tobias Oetiker
(http://oss.oetiker.ch/optools/wiki/fsopbench) on his Areca HW Raid6
with ext3. He put target_latency to 200ms, and used a lower
slice_idle, to match his faster disks.

Here the writers are creating many files (metadata & journal writes)
of varying sizes (the distribution is a realistic one, that simulates
an user home), writing to them (this will cause buffered writes for
big files), and then syncing them (also sync writes, both to data and
metadata), so this is a torture test for ext3.
There is just 1 reader, instead, that is trying to access files and
measuring the latencies in doing so.

On 2.6.31 with cfq and ordered data it gives the following results

 ./fsopbench.pl --writers=6 --continuous /tmp/mnt/home_a/fsopbench_tree/

In-Competition with 6 Writers - Mode  30s Interval
-----------------------------------------------------------------
A read dir        cnt    371    min    0.001 ms    max  348.297 ms
mean    2.499 ms    stdev  22.800
B stat file       cnt    351    min    0.007 ms    max 1130.490 ms
mean    4.881 ms    stdev  61.887
C open file       cnt    293    min    0.014 ms    max    0.195 ms
mean    0.022 ms    stdev   0.017
D rd 1st byte     cnt    293    min    0.178 ms    max 1117.346 ms
mean   57.902 ms    stdev 153.807
E read rate       1.123 MB/s

On 2.6.31 with my cfq patch

In-Competition with 6 Writers - Mode  31s Interval
-----------------------------------------------------------------
A read dir        cnt    388    min    0.001 ms    max  240.531 ms
mean    1.145 ms    stdev  12.878
B stat file       cnt    366    min    0.006 ms    max  199.893 ms
mean    0.920 ms    stdev  10.840
C open file       cnt    300    min    0.014 ms    max    0.340 ms
mean    0.025 ms    stdev   0.028
D rd 1st byte     cnt    300    min    0.184 ms    max 1443.066 ms
mean   69.599 ms    stdev 193.984
E read rate       1.023 MB/s

As you can see, my patch inproved both average and maximum latency for
readdir and stat, while reading the first byte needed more time, and
read throughput decreased 10%.
The total time in the worst case to reach a file and read data from it
decreased from 2.5s to about 1.8s, and also average case gained few
ms.

>
> Additionally i have a question about the minimum time slice, see the comment in your patch.
>
> *Original*(2.6.30-rc4 without patch):
> /cfq-regression-tests/2.6.30-rc4-log/be0-through-7.fio
> total priority: 880
> total data transferred: 535872
> class   prio    ideal   xferred %diff
> be      0       109610  149748  36
> be      1       97431   104436  7
> be      2       85252   91124   6
> be      3       73073   64244   -13
> be      4       60894   59028   -4
> be      5       48715   38132   -22
> be      6       36536   21492   -42
> be      7       24357   7668    -69
>
> /cfq-regression-tests/2.6.30-rc4-log/be0-vs-be1.fio
> total priority: 340
> total data transferred: 556008
> class   prio    ideal   xferred %diff
> be      0       294357  402164  36
> be      1       261650  153844  -42
>
> /cfq-regression-tests/2.6.30-rc4-log/be0-vs-be7.fio
> total priority: 220
> total data transferred: 537064
> class   prio    ideal   xferred %diff
> be      0       439416  466164  6
> be      7       97648   70900   -28
>
> /cfq-regression-tests/2.6.30-rc4-log/be4-x-3.fio
> total priority: 300
> total data transferred: 532964
> class   prio    ideal   xferred %diff
> be      4       177654  199260  12
> be      4       177654  149748  -16
> be      4       177654  183956  3
>
> /cfq-regression-tests/2.6.30-rc4-log/be4-x-8.fio
> total priority: 800
> total data transferred: 516384
> class   prio    ideal   xferred %diff
> be      4       64548   78580   21
> be      4       64548   76436   18
> be      4       64548   75764   17
> be      4       64548   70900   9
> be      4       64548   42388   -35
> be      4       64548   73780   14
> be      4       64548   30708   -53
> be      4       64548   67828   5
>
>
> *Applied patch*(2.6.30-rc4 with patch):
>
> /cfq-regression-tests/log-result/be0-through-7.fio
> total priority: 880
> total data transferred: 493824
> class   prio    ideal   xferred %diff
> be      0       101009  224852  122
> be      1       89786   106996  19
> be      2       78562   70388   -11
> be      3       67339   38900   -43
> be      4       56116   18420   -68
> be      5       44893   19700   -57
> be      6       33669   9972    -71
> be      7       22446   4596    -80
>
> /cfq-regression-tests/log-result/be0-vs-be1.fio
> total priority: 340
> total data transferred: 537064
> class   prio    ideal   xferred %diff
> be      0       284328  375540  32
> be      1       252736  161524  -37
>
> /cfq-regression-tests/log-result/be0-vs-be7.fio
> total priority: 220
> total data transferred: 551912
> class   prio    ideal   xferred %diff
> be      0       451564  499956  10
> be      7       100347  51956   -49
>
> /cfq-regression-tests/log-result/be4-x-3.fio
> total priority: 300
> total data transferred: 509404
> class   prio    ideal   xferred %diff
> be      4       169801  196596  15
> be      4       169801  198388  16
> be      4       169801  114420  -33
>
> /cfq-regression-tests/log-result/be4-x-8.fio
> total priority: 800
> total data transferred: 459072
> class   prio    ideal   xferred %diff
> be      4       57384   70644   23
> be      4       57384   52980   -8
> be      4       57384   62356   8
> be      4       57384   60660   5
> be      4       57384   55028   -5
> be      4       57384   69620   21
> be      4       57384   51956   -10
> be      4       57384   35828   -38
>
>
> Hardware infos.:
> CPU: GenuineIntel Intel(R) Xeon(TM) CPU 3.00GHz
>     (4 logic cpus with hyper-thread on)
> memory:2G
> HDD:scsi
>
>> ---
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 0e3814b..ca90d42 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -27,6 +27,8 @@ static const int cfq_slice_sync = HZ / 10;
>>  static int cfq_slice_async = HZ / 25;
>>  static const int cfq_slice_async_rq = 2;
>>  static int cfq_slice_idle = HZ / 125;
>> +static int cfq_target_latency = HZ * 3/10; /* 300 ms */
>> +static int cfq_hist_divisor = 4;
>>
>>  /*
>>   * offset from end of service tree
>> @@ -134,6 +136,9 @@ struct cfq_data {
>>       struct rb_root prio_trees[CFQ_PRIO_LISTS];
>>
>>       unsigned int busy_queues;
>> +     unsigned int busy_queues_avg;
>> +     unsigned int busy_rt_queues;
>> +     unsigned int busy_rt_queues_avg;
>>
>>       int rq_in_driver[2];
>>       int sync_flight;
>> @@ -173,6 +178,8 @@ struct cfq_data {
>>       unsigned int cfq_slice[2];
>>       unsigned int cfq_slice_async_rq;
>>       unsigned int cfq_slice_idle;
>> +     unsigned int cfq_target_latency;
>> +     unsigned int cfq_hist_divisor;
>>
>>       struct list_head cic_list;
>>
>> @@ -301,10 +308,40 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>       return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
>>  }
>>
>> +static inline unsigned
>> +cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
>> +     unsigned min_q, max_q;
>> +     unsigned mult = cfqd->cfq_hist_divisor - 1;
>> +     unsigned round = cfqd->cfq_hist_divisor / 2;
>> +     if (rt) {
>> +             min_q = min(cfqd->busy_rt_queues_avg, cfqd->busy_rt_queues);
>> +             max_q = max(cfqd->busy_rt_queues_avg, cfqd->busy_rt_queues);
>> +             cfqd->busy_rt_queues_avg = (mult * max_q + min_q + round) /
>> +                     cfqd->cfq_hist_divisor;
>> +             return cfqd->busy_rt_queues_avg;
>> +     } else {
>> +             min_q = min(cfqd->busy_queues_avg, cfqd->busy_queues);
>> +             max_q = max(cfqd->busy_queues_avg, cfqd->busy_queues);
>> +             cfqd->busy_queues_avg = (mult * max_q + min_q + round) /
>> +                     cfqd->cfq_hist_divisor;
>> +             return cfqd->busy_queues_avg;
>> +     }
>> +}
>> +
>>  static inline void
>>  cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>>  {
>> -     cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
>> +     unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1];
>> +     unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
>> +     unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
>> +
>> +     if (iq > process_thr) {
>> +             unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle
>> +                     / cfqd->cfq_slice[1];
>
> For sync queue, the minimum time slice is decided by slice_idle, base time slice and io priority.
> But for async queue, why is the minimum time slice also limited by base time slice of sync queue?

You should consider that slice, computed calling cfq_prio_to_slice,
will already be:
for sync:  cfqd->cfq_slice[1] altered to consider the priority
for async: cfqd->cfq_slice[0] altered to consider the priority
Now, I take that number, and multiply it by twice the slice_idle, and
divide it by cfqd->cfq_slice[1].
So it becomes (approximately):
for sync: 2 * slice altered to considet the priority
for async: 2 * slice * cfqd->cfq_slice[0]/cfqd->cfq_slice[1] altered
to considet the priority
Now if we ignore priorities for a while, we can see that the ratio
between sync slice and async slice in original cfq is:
cfqd->cfq_slice[1]/cfqd->cfq_slice[0]
and in my patch:
 2 * slice / (2 * slice * cfqd->cfq_slice[0]/cfqd->cfq_slice[1]) =
cfqd->cfq_slice[1]/cfqd->cfq_slice[0]
i.e. exactly the same.
So the intent is to preserve the ratio between sync and async slices.

Corrado

>
>
> Best Regards
> -----
> Shan Wei
>
>
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@...il.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/