linux-kernel - Re: [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup

Message-Id: <F64D5CC8-E650-4AE6-8452-7FA0C1976271@linaro.org>
Date:   Sun, 21 Mar 2021 12:04:53 +0100
From:   Paolo Valente <paolo.valente@...aro.org>
To:     brookxu <brookxu.cn@...il.com>
Cc:     axboe@...nel.dk, tj@...nel.org, linux-block@...r.kernel.org,
        cgroups@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 00/11] bfq: introduce bfq.ioprio for cgroup



> On 12 Mar 2021, at 12:08, brookxu <brookxu.cn@...il.com> wrote:
> 
> From: Chunguang Xu <brookxu@...cent.com>
> 

Hi Chunguang,

> Tasks in the production environment can be roughly divided into
> three categories: emergency tasks, ordinary tasks and offline
> tasks. Emergency tasks need to be scheduled in real time, such
> as system agents. Offline tasks do not need to guarantee QoS,
> but can improve system resource utilization during system idle
> periods, such as background tasks. The above requirements need
> to achieve IO preemption. At present, we can use weights to
> simulate IO preemption, but since weights are more of a shared
> concept, they cannot be simulated well. For example, the weights
> of emergency tasks and ordinary tasks cannot be determined well,
> offline tasks (with the same weight) actually occupy different
> resources on disks with different performance, and the tail
> latency caused by offline tasks cannot be well controlled. Using
> ioprio's concept of preemption, we can solve the above problems
> very well. Since ioprio will eventually be converted to weight,
> using ioprio alone can also achieve weight isolation within the
> same class. But we can still use bfq.weight to control resources,
> achieving better IO QoS control.
> 
> However, currently the class of bfq_group is always the BE class, and
> the ioprio class of the task can only be reflected in a single
> cgroup. We cannot guarantee that real-time tasks in a cgroup are
> scheduled in time. Therefore, we introduce bfq.ioprio, which
> allows us to configure ioprio class for cgroup. In this way, we
> can ensure that the real-time tasks of a cgroup can be scheduled
> in time. Similarly, the processing of offline task groups can
> also be simpler.
> 

I find this contribution very interesting.  Anyway, given the
relevance of such a contribution, I'd like to hear from relevant
people (Jens, Tejun, ...?), before revising individual patches.

Yet I already have a general question.  How does this mechanism comply
with per-process ioprios and ioprio classes?  For example, what
happens if a process belongs to a BE-class group according to your
mechanism, but to the RT class according to its ioprio?  Does the
per-group class dominate the per-process class?  Is everything clean
and predictable?
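
To make the question concrete: the cover letter states that tasks in a
cgroup uniformly take the cgroup's ioprio when it is set, and keep
their own ioprio otherwise.  Read literally, that rule could be
sketched as below.  This is only my reading of the description, not
code from the patches; the struct and function names are hypothetical.

```c
/* Hypothetical types, for illustration only (not from the patch set). */
struct prio {
    int cls;    /* 1 = RT, 2 = BE, 3 = IDLE */
    int level;  /* 0 (highest) .. 7 (lowest) */
};

struct group_prio {
    int enabled;        /* nonzero if the cgroup has an ioprio configured */
    struct prio prio;
};

/*
 * Assumed rule: a configured group ioprio overrides the per-process
 * one; otherwise the per-process ioprio is left untouched.
 */
static struct prio effective_prio(struct prio task,
                                  const struct group_prio *grp)
{
    if (grp && grp->enabled)
        return grp->prio;
    return task;
}
```

Under this reading, an RT process placed in a BE group would be served
as BE, which is exactly the case I would like to see spelled out.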

> The bfq.ioprio interface now is available for cgroup v1 and cgroup
> v2. Users can configure the ioprio for cgroup through this interface,
> as shown below:
> 
> echo "1 2" > blkio.bfq.ioprio

Wouldn't it be nicer to have acronyms for classes (RT, BE, IDLE),
instead of numbers?
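
For instance, input such as "rt 2" could be parsed along the lines of
the userspace sketch below.  The class numbers (RT = 1, BE = 2,
IDLE = 3) and the 0..7 level range are the usual ioprio ones; the
function names and the exact input format are assumptions of mine,
not something the patches define.

```c
#include <stddef.h>
#include <stdio.h>
#include <strings.h>    /* strcasecmp (POSIX) */

/* ioprio class numbers, matching the usual kernel values. */
enum { CLASS_RT = 1, CLASS_BE = 2, CLASS_IDLE = 3 };

/* Hypothetical helper: map a class acronym to its number, -1 if unknown. */
static int ioprio_class_from_name(const char *name)
{
    static const struct { const char *name; int cls; } map[] = {
        { "rt", CLASS_RT }, { "be", CLASS_BE }, { "idle", CLASS_IDLE },
    };
    size_t i;

    if (!name)
        return -1;
    for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
        if (strcasecmp(name, map[i].name) == 0)
            return map[i].cls;
    return -1;
}

/*
 * Hypothetical parser for "<class acronym> <level>", e.g. "rt 2".
 * Returns 0 on success, -1 on malformed input or out-of-range level.
 */
static int parse_group_ioprio(const char *buf, int *cls, int *level)
{
    char name[8];
    int c, l;

    if (!buf || sscanf(buf, "%7s %d", name, &l) != 2)
        return -1;
    c = ioprio_class_from_name(name);
    if (c < 0 || l < 0 || l > 7)
        return -1;
    *cls = c;
    *level = l;
    return 0;
}
```

Matching case-insensitively would let both "RT 2" and "rt 2" work.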

Thank you very much for this improvement proposal,
Paolo

> 
> The above two values respectively represent the values of ioprio
> class and ioprio for cgroup. The ioprio of tasks within the cgroup
> is uniformly equal to the ioprio of the cgroup. If the ioprio of
> the cgroup is disabled, the ioprio of the task remains the same,
> usually from io_context.
> 
> When testing, using fio and fio_generate_plots we can clearly see
> that the IO delay of the task satisfies RT > BE > IDLE. When RT is
> running, BE and IDLE are guaranteed minimum bandwidth. When used
> with bfq.weight, we can also isolate the resource within the same
> class.
> 
> The test process is as follows:
> # prepare data disk
> mount /dev/sdb /data1
> 
> # create cgroup v1 hierarchy
> cd /sys/fs/cgroup/blkio
> mkdir rt be idle
> echo "1 0" > rt/blkio.bfq.ioprio
> echo "2 0" > be/blkio.bfq.ioprio
> echo "3 0" > idle/blkio.bfq.ioprio
> 
> # run fio test
> fio fio.ini
> 
> # generate svg graph
> fio_generate_plots res
> 
> The contents of fio.ini are as follows:
> [global]
> ioengine=libaio
> group_reporting=1
> log_avg_msec=500
> direct=1
> time_based=1
> iodepth=16
> size=100M
> rw=write
> bs=1M
> [rt]
> name=rt
> write_bw_log=rt
> write_lat_log=rt
> write_iops_log=rt
> filename=/data1/rt.bin
> cgroup=rt
> runtime=30s
> nice=-10
> [be]
> name=be
> new_group
> write_bw_log=be
> write_lat_log=be
> write_iops_log=be
> filename=/data1/be.bin
> cgroup=be
> runtime=60s
> [idle]
> name=idle
> new_group
> write_bw_log=idle
> write_lat_log=idle
> write_iops_log=idle
> filename=/data1/idle.bin
> cgroup=idle
> runtime=90s
> 
> V2:
> 1. Optimise bfq_select_next_class().
> 2. Introduce bfq_group [] to track the number of groups for each CLASS.
> 3. Optimise IO injection, EMQ and Idle mechanism for CLASS_RT.
> 
> Chunguang Xu (11):
>  bfq: introduce bfq_entity_to_bfqg helper method
>  bfq: limit the IO depth of idle_class to 1
>  bfq: keep the minimun bandwidth for be_class
>  bfq: expire other class if CLASS_RT is waiting
>  bfq: optimse IO injection for CLASS_RT
>  bfq: disallow idle if CLASS_RT waiting for service
>  bfq: disallow merge CLASS_RT with other class
>  bfq: introduce bfq.ioprio for cgroup
>  bfq: convert the type of bfq_group.bfqd to bfq_data*
>  bfq: remove unnecessary initialization logic
>  bfq: optimize the calculation of bfq_weight_to_ioprio()
> 
> block/bfq-cgroup.c  |  99 +++++++++++++++++++++++++++++++----
> block/bfq-iosched.c |  47 ++++++++++++++---
> block/bfq-iosched.h |  28 ++++++++--
> block/bfq-wf2q.c    | 124 +++++++++++++++++++++++++++++++++-----------
> 4 files changed, 244 insertions(+), 54 deletions(-)
> 
> -- 
> 2.30.0
>