linux-kernel - Re: [PATCH V3 00/11] block-throttle: add .high limit

Message-Id: <A63CAE03-EC73-41D6-AD71-F36136A15A58@unimore.it>
Date:   Wed, 5 Oct 2016 22:07:26 +0200
From:   Paolo Valente <paolo.valente@...more.it>
To:     Shaohua Li <shli@...com>
Cc:     Tejun Heo <tj@...nel.org>, Vivek Goyal <vgoyal@...hat.com>,
        linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
        Jens Axboe <axboe@...com>, Kernel-team@...com,
        jmoyer@...hat.com, Mark Brown <broonie@...nel.org>,
        Linus Walleij <linus.walleij@...aro.org>,
        Ulf Hansson <ulf.hansson@...aro.org>
Subject: Re: [PATCH V3 00/11] block-throttle: add .high limit


> On 05 Oct 2016, at 21:47, Paolo Valente <paolo.valente@...more.it> wrote:
> 
>> 
>> On 05 Oct 2016, at 20:30, Shaohua Li <shli@...com> wrote:
>> 
>> On Wed, Oct 05, 2016 at 10:49:46AM -0400, Tejun Heo wrote:
>>> Hello, Paolo.
>>> 
>>> On Wed, Oct 05, 2016 at 02:37:00PM +0200, Paolo Valente wrote:
>>>> In this respect, for your generic, unpredictable scenario to make
>>>> sense, there must exist at least one real system that meets the
>>>> requirements of such a scenario.  Or, if such a real system does not
>>>> yet exist, it must be possible to emulate it.  If it is impossible to
>>>> achieve this last goal either, then I fail to see the usefulness
>>>> of looking for solutions for such a scenario.
>>>> 
>>>> That said, let's define the instance(s) of the scenario that you find
>>>> most representative, and let's test BFQ on it/them.  Numbers will give
>>>> us the answers.  For example, what about all or part of the following
>>>> groups:
>>>> . one cyclically doing random I/O for a few seconds and then sequential I/O
>>>> for the next few seconds
>>>> . one doing, say, quasi-sequential I/O in ON/OFF cycles
>>>> . one starting an application cyclically
>>>> . one playing back or streaming a movie
>>>> 
>>>> For each group, we could then measure the time needed to complete each
>>>> phase of I/O in each cycle, plus the responsiveness in the group
>>>> starting an application, plus the frame drop in the group streaming
>>>> the movie.  In addition, we can measure the bandwidth/iops enjoyed by
>>>> each group, plus, of course, the aggregate throughput of the whole
>>>> system.  In particular we could compare results with throttling, BFQ,
>>>> and CFQ.
>>>> 
>>>> Then we could set the resulting numbers in stone, and stick to them
>>>> until something proves them wrong.
>>>> 
>>>> What do you (or others) think about it?
>>> 
>>> That sounds great and yeah it's lame that we didn't start with that.
>>> Shaohua, would it be difficult to compare how bfq performs against
>>> blk-throttle?
>> 
>> I ran a test of BFQ.
> 
> Thank you very much for testing BFQ!
> 
>> I'm using the BFQ found at
>> http://algogroup.unimore.it/people/paolo/disk_sched/sources.php. The version is
>> 4.7.0-v8r3.
> 
> That's the latest stable version.  The development version [1] already
> contains further improvements for fairness, latency and throughput.
> It is however still a release candidate.
> 
> [1] https://github.com/linusw/linux-bfq/tree/bfq-v8
> 
>> It's an LSI SSD, queue depth 32. I use the default settings. The fio
>> script is:
>> 
>> [global]
>> ioengine=libaio
>> direct=1
>> readwrite=randread
>> bs=4k
>> runtime=60
>> time_based=1
>> file_service_type=random:36
>> overwrite=1
>> thread=0
>> group_reporting=1
>> filename=/dev/sdb
>> iodepth=1
>> numjobs=8
>> 
>> [groupA]
>> prio=2
>> 
>> [groupB]
>> new_group
>> prio=6
>> 
>> I change iodepth, numjobs and prio across tests. Results are in MB/s.
>> 
>> iodepth=1 numjobs=1 prio 4:4
>> CFQ: 28:28 BFQ: 21:21 deadline: 29:29
>> 
>> iodepth=8 numjobs=1 prio 4:4
>> CFQ: 162:162 BFQ: 102:98 deadline: 205:205
>> 
>> iodepth=1 numjobs=8 prio 4:4
>> CFQ: 157:157 BFQ: 81:92 deadline: 196:197
>> 
>> iodepth=1 numjobs=1 prio 2:6
>> CFQ: 26.7:27.6 BFQ: 20:6 deadline: 29:29
>> 
>> iodepth=8 numjobs=1 prio 2:6
>> CFQ: 166:174 BFQ: 139:72  deadline: 202:202
>> 
>> iodepth=1 numjobs=8 prio 2:6
>> CFQ: 148:150 BFQ: 90:77 deadline: 198:197
>> 
>> CFQ isn't fair at all. BFQ is very good in this respect, but has poor
>> throughput even when prio is the default value.
>> 
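
For reference, the aggregate throughput and the per-group split implied
by the figures quoted above can be tabulated with a short script.  This
is only a sketch: the throughput numbers are copied from the quoted
results, while the RESULTS layout and the summarize() helper are names
made up for illustration.

```python
# Throughput figures (MB/s, groupA:groupB) copied from the quoted results.
# The data layout and the summarize() helper are illustrative only.
RESULTS = {
    ("iodepth=1 numjobs=1", "4:4"): {"CFQ": (28, 28), "BFQ": (21, 21), "deadline": (29, 29)},
    ("iodepth=8 numjobs=1", "4:4"): {"CFQ": (162, 162), "BFQ": (102, 98), "deadline": (205, 205)},
    ("iodepth=1 numjobs=8", "4:4"): {"CFQ": (157, 157), "BFQ": (81, 92), "deadline": (196, 197)},
    ("iodepth=1 numjobs=1", "2:6"): {"CFQ": (26.7, 27.6), "BFQ": (20, 6), "deadline": (29, 29)},
    ("iodepth=8 numjobs=1", "2:6"): {"CFQ": (166, 174), "BFQ": (139, 72), "deadline": (202, 202)},
    ("iodepth=1 numjobs=8", "2:6"): {"CFQ": (148, 150), "BFQ": (90, 77), "deadline": (198, 197)},
}


def summarize(a, b):
    """Return (aggregate MB/s, fraction of the aggregate that went to group A)."""
    total = a + b
    return total, a / total


if __name__ == "__main__":
    for (setup, prio), per_sched in RESULTS.items():
        print(f"{setup} prio {prio}")
        for sched, (a, b) in per_sched.items():
            total, share_a = summarize(a, b)
            print(f"  {sched:9s} total={total:6.1f} MB/s  groupA share={share_a:.0%}")
```

For the iodepth=1 numjobs=1 prio 2:6 case, for example, this gives
group A about 77% of a 26 MB/s aggregate under BFQ, against a roughly
even split of a larger aggregate under CFQ and deadline.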
> 
> Throughput is lower with BFQ for two reasons.
> 
> First, you certainly left low_latency in its default state, i.e.,
> on.  As explained, e.g., here [2], low_latency mode is totally geared
> towards maximum responsiveness and minimum latency for soft real-time
> applications (e.g., video players).  To achieve this goal, BFQ is
> willing to perform more idling, when necessary.  This lowers
> throughput (I'll get back to this at the end of the discussion of the
> second reason).
> 
> The second, and most important, reason is that a minimum of idling is
> the *only* way to achieve differentiated bandwidth distribution, as
> you requested by setting different ioprios.  I stress that this
> constraint is not a technological accident, but an intrinsic, logical
> necessity.  The proof is simple, and if the following explanation is
> too boring or confusing, I can show it to you with any trace of sync
> I/O.
> 
> First, to provide differentiated service, you need per-process
> scheduling, i.e., schedulers in which there is a separate queue
> associated with each process.  Now, let A be the process with higher
> weight (ioprio), and B the process with lower weight.  Both processes
> are sync, thus, by definition, they issue requests as follows: a few
> requests (probably two, or a little bit more with larger iodepth),
> then a little break to wait for request completion, then the next
> small batch and so on.  For each process, the queue associated with
> the process (in the scheduler) is necessarily empty during the break.
> As a consequence, if there is no idling, then every time A reaches its
> break, the scheduler's only option is to switch to B (which is
> extremely likely to have pending requests).
> 
> The service pattern of the processes then unavoidably becomes:
> 
> A B A B A B ...
> 
> where each letter represents a full small batch served for the
> process.  That is, 50% of the bw for each process, and complete loss
> of control over the desired bandwidth distribution.
> 
> So, to sum up, the reason why BFQ achieves a lower total bw is that it
> behaves in the only correct way to respect weights with sync I/O,
> i.e., it performs a little idling.  If low_latency is on, then BFQ
> increases idling further, and this may have caused further bw loss
> in your test (but this varies greatly with devices, so you can
> discover it only by trying).
> 
> The bottom line is that if you do want to achieve differentiation with
> sync I/O, you have to pay a price in terms of bw, because of idling.
> Actually, the recent preemption mechanism that I have introduced in
> BFQ is proving so effective in preserving differentiation, that I'm
> tempted to try a solution with almost no idling.  A little accuracy
> would, however, have to be sacrificed.  Anyway, this is still work in
> progress.
> 
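
The alternation argument quoted above can be reproduced with a toy
simulation.  This is a sketch under simplifying assumptions, not BFQ's
actual algorithm: each process has at most one batch outstanding,
service and think times are fixed, and the scheduler simply favours the
process with the least weighted service.  The simulate() function and
its parameters are hypothetical names chosen for this example.

```python
from fractions import Fraction


def simulate(weights, idling, service=4, think=1, horizon=10_000):
    """Toy model of sync processes sharing one device.

    Each process has at most one batch outstanding: after its batch is
    served, it needs `think` ticks before its next batch arrives.
    The scheduler favours the process with the least weighted service.
    Returns the number of batches served per process within `horizon` ticks.
    """
    names = list(weights)
    served = {n: 0 for n in names}
    ready_at = {n: 0 for n in names}  # arrival time of each process's next batch
    now = 0

    def vtime(n):
        # Service received so far, normalized by weight (exact arithmetic).
        return Fraction(served[n], weights[n])

    while now < horizon:
        favourite = min(names, key=vtime)
        if ready_at[favourite] <= now:
            chosen = favourite
        elif idling:
            # Keep the device idle until the favourite's next batch arrives.
            now = ready_at[favourite]
            chosen = favourite
        else:
            # Work-conserving: serve whoever has a batch pending right now.
            pending = [n for n in names if ready_at[n] <= now]
            if not pending:
                now = min(ready_at.values())
                continue
            chosen = min(pending, key=vtime)
        now += service
        served[chosen] += 1
        ready_at[chosen] = now + think
    return served


if __name__ == "__main__":
    weights = {"A": 3, "B": 1}
    for idling in (False, True):
        served = simulate(weights, idling)
        total = sum(served.values())
        print(f"idling={idling}: {served}, A share={served['A'] / total:.2f}")
```

With weights 3:1 in this model, the run without idling degenerates into
the A B A B pattern and an even split, while the run with idling
approaches a 3:1 split and serves fewer batches overall, because the
device sits idle while waiting for A.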

Just for completeness, if the weights of the processes or groups are
equal, then BFQ preserves service guarantees while achieving the same
bw as deadline.  And equal weights are the most common case, in my
limited experience.

Thanks,
Paolo

> Thank you very much,
> Paolo
> 
> [2] http://algogroup.unimore.it/people/paolo/disk_sched/description.php
> 
>> Thanks,
>> Shaohua
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-block" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> --
> Paolo Valente
> Algogroup
> Dipartimento di Scienze Fisiche, Informatiche e Matematiche
> Via Campi 213/B
> 41125 Modena - Italy
> http://algogroup.unimore.it/people/paolo/
