Message-ID: <CAK1f24kx7W=aQEodg4eyii6R8a4e-f3MfCbB4ozAV3AV6urdTg@mail.gmail.com>
Date: Thu, 15 Aug 2024 09:59:18 +0800
From: Lance Yang <ioworker0@...il.com>
To: Yu Kuai <yukuai1@...weicloud.com>
Cc: Michal Koutný <mkoutny@...e.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, linux-block@...r.kernel.org, 
	cgroups@...r.kernel.org, josef@...icpanda.com, tj@...nel.org, 
	fujita.tomonori@....ntt.co.jp, boqun.feng@...il.com, a.hindborg@...sung.com, 
	paolo.valente@...more.it, axboe@...nel.dk, vbabka@...nel.org, 
	david@...hat.com, 21cnbao@...il.com, baolin.wang@...ux.alibaba.com, 
	libang.li@...group.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [BUG] cgroupv2/blk: inconsistent I/O behavior in Cgroup v2 with
 set device wbps and wiops

On Tue, Aug 13, 2024 at 3:19 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
>
> Hi,
>
> On 2024/08/13 14:39, Yu Kuai wrote:
> > Hi,
> >
> > On 2024/08/13 13:00, Lance Yang wrote:
> >> Hi Kuai,
> >>
> >> Thanks a lot for jumping in!
> >>
> >> On Tue, Aug 13, 2024 at 9:37 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On 2024/08/12 23:43, Michal Koutný wrote:
> >>>> +Cc Kuai
> >>>>
> >>>> On Mon, Aug 12, 2024 at 11:00:30PM GMT, Lance Yang
> >>>> <ioworker0@...il.com> wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> I've run into a problem with Cgroup v2 where it doesn't seem to
> >>>>> correctly limit
> >>>>> I/O operations when I set both wbps and wiops for a device.
> >>>>> However, if I only
> >>>>> set wbps, then everything works as expected.
> >>>>>
> >>>>> To reproduce the problem, we can follow these command-based steps:
> >>>>>
> >>>>> 1. **System Information:**
> >>>>>      - Kernel Version and OS Release:
> >>>>>        ```
> >>>>>        $ uname -r
> >>>>>        6.10.0-rc5+
> >>>>>
> >>>>>        $ cat /etc/os-release
> >>>>>        PRETTY_NAME="Ubuntu 24.04 LTS"
> >>>>>        NAME="Ubuntu"
> >>>>>        VERSION_ID="24.04"
> >>>>>        VERSION="24.04 LTS (Noble Numbat)"
> >>>>>        VERSION_CODENAME=noble
> >>>>>        ID=ubuntu
> >>>>>        ID_LIKE=debian
> >>>>>        HOME_URL="https://www.ubuntu.com/"
> >>>>>        SUPPORT_URL="https://help.ubuntu.com/"
> >>>>>        BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
> >>>>>        PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
> >>>>>        UBUNTU_CODENAME=noble
> >>>>>        LOGO=ubuntu-logo
> >>>>>        ```
> >>>>>
> >>>>> 2. **Device Information and Settings:**
> >>>>>      - List Block Devices and Scheduler:
> >>>>>        ```
> >>>>>        $ lsblk
> >>>>>        NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
> >>>>>        sda     8:0    0   4.4T  0 disk
> >>>>>        └─sda1  8:1    0   4.4T  0 part /data
> >>>>>        ...
> >>>>>
> >>>>>        $ cat /sys/block/sda/queue/scheduler
> >>>>>        none [mq-deadline] kyber bfq
> >>>>>
> >>>>>        $ cat /sys/block/sda/queue/rotational
> >>>>>        1
> >>>>>        ```
> >>>>>
> >>>>> 3. **Reproducing the problem:**
> >>>>>      - Navigate to the cgroup v2 filesystem and configure I/O
> >>>>> settings:
> >>>>>        ```
> >>>>>        $ cd /sys/fs/cgroup/
> >>>>>        $ stat -fc %T /sys/fs/cgroup
> >>>>>        cgroup2fs
> >>>>>        $ mkdir test
> >>>>>        $ echo "8:0 wbps=10485760 wiops=100000" > io.max
> >>>>>        ```
> >>>>>        In this setup:
> >>>>>        wbps=10485760 sets the write bytes per second limit to 10 MiB/s.
> >>>>>        wiops=100000 sets the write I/O operations per second limit to 100,000.
> >>>>>
> >>>>>      - Add process to the cgroup and verify:
> >>>>>        ```
> >>>>>        $ echo $$ > cgroup.procs
> >>>>>        $ cat cgroup.procs
> >>>>>        3826771
> >>>>>        3828513
> >>>>>        $ ps -ef|grep 3826771
> >>>>>        root     3826771 3826768  0 22:04 pts/1    00:00:00 -bash
> >>>>>        root     3828761 3826771  0 22:06 pts/1    00:00:00 ps -ef
> >>>>>        root     3828762 3826771  0 22:06 pts/1    00:00:00 grep --color=auto 3826771
> >>>>>        ```
> >>>>>
> >>>>>      - Observe I/O performance using `dd` commands and `iostat`:
> >>>>>        ```
> >>>>>        $ dd if=/dev/zero of=/data/file1 bs=512M count=1 &
> >>>>>        $ dd if=/dev/zero of=/data/file1 bs=512M count=1 &
> >>>
> >>> You're testing buffered IO here, and I don't see that cgroup writeback is
> >>> enabled. Is this test intentional? Why not test direct IO?
> >>
> >> Yes, I was testing buffered I/O and can confirm that
> >> CONFIG_CGROUP_WRITEBACK
> >> was enabled.
> >>
> >> $ cat /boot/config-6.10.0-rc5+ |grep CONFIG_CGROUP_WRITEBACK
> >> CONFIG_CGROUP_WRITEBACK=y
> >>
> >> We intend to configure both wbps (write bytes per second) and wiops
> >> (write I/O operations
> >> per second) for the containers. IIUC, this setup will effectively
> >> restrict both their block device
> >> I/Os and buffered I/Os.
> >>
> >>> Why not test direct IO?
> >>
> >> I was testing direct IO as well. However it did not work as expected with
> >> `echo "8:0 wbps=10485760 wiops=100000" > io.max`.
> >>
> >> $ time dd if=/dev/zero of=/data/file7 bs=512M count=1 oflag=direct
> >
> > So, you're issuing one huge IO, with 512M.
> >> 1+0 records in
> >> 1+0 records out
> >> 536870912 bytes (537 MB, 512 MiB) copied, 51.5962 s, 10.4 MB/s
> >
> > And this result looks correct. Please note that blk-throtl works before
> > the IO is submitted, while iostat reports IO that has completed. A huge IO
> > can be throttled for a long time.
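
As a quick sanity check of the numbers above, here is a minimal sketch of the arithmetic. It assumes the whole 512 MiB write is charged against the wbps limit from the io.max line; the helper name `expected_throttle_seconds` is made up for illustration and is not a kernel or cgroup API.

```python
# Assumption: the entire write is charged against wbps before it is submitted.
WBPS_LIMIT = 10485760  # bytes per second, taken from the io.max line in this thread


def expected_throttle_seconds(io_bytes: int, wbps: int = WBPS_LIMIT) -> float:
    """Return the minimum time needed to pass io_bytes through a wbps limit."""
    return io_bytes / wbps


dd_bytes = 536870912  # byte count reported by dd for bs=512M count=1
print(f"expected minimum duration: {expected_throttle_seconds(dd_bytes):.1f} s")
```

This gives 51.2 s, which is close to the 51.5962 s that dd reported, so the overall rate is consistent with the limit.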
> >>
> >> real 0m51.637s
> >> user 0m0.000s
> >> sys 0m0.313s
> >>
> >> $ iostat -d 1 -h -y -p sda
> >>   tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd Device
> >>  9.00         0.0k         1.3M         0.0k       0.0k       1.3M       0.0k sda
> >>  9.00         0.0k         1.3M         0.0k       0.0k       1.3M       0.0k sda1
> >
> > What I don't understand yet is why there is so little IO during the wait.
> > Can you test on a raw disk to bypass the filesystem?
>
> As an update, I added a debug patch for this:

Kuai, sorry for the delayed response ;(

I'll give this debug patch a try, run further tests on a raw disk to bypass
the file system as well, and get back to you ASAP.

Thanks a lot for reaching out!
Lance

>
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index dc6140fa3de0..3b2648c17079 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1119,8 +1119,10 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
>
>          if (!bio_list_empty(&bio_list_on_stack)) {
>                  blk_start_plug(&plug);
> -               while ((bio = bio_list_pop(&bio_list_on_stack)))
> +               while ((bio = bio_list_pop(&bio_list_on_stack))) {
> +                       printk("%s: bio done %lu %px\n", __func__, bio_sectors(bio), bio);
>                          submit_bio_noacct_nocheck(bio);
> +               }
>                  blk_finish_plug(&plug);
>          }
>   }
> @@ -1606,6 +1608,8 @@ bool __blk_throtl_bio(struct bio *bio)
>          bool throttled = false;
>          struct throtl_data *td = tg->td;
>
> +       printk("%s: bio start %lu %px\n", __func__, bio_sectors(bio), bio);
> +
>          rcu_read_lock();
>          spin_lock_irq(&q->queue_lock);
>          sq = &tg->service_queue;
> @@ -1649,6 +1653,7 @@ bool __blk_throtl_bio(struct bio *bio)
>                  tg = sq_to_tg(sq);
>                  if (!tg) {
>                          bio_set_flag(bio, BIO_BPS_THROTTLED);
> +                       printk("%s: bio done %lu %px\n", __func__, bio_sectors(bio), bio);
>                          goto out_unlock;
>                  }
>          }
>
> For direct IO on a raw disk:
>
> With or without wiops, the result is the same:
>
> [  469.736098] __blk_throtl_bio: bio start 2128 ffff8881014c08c0
> [  469.736903] __blk_throtl_bio: bio start 2144 ffff88817852ec80
> [  469.737585] __blk_throtl_bio: bio start 2096 ffff88817852f080
> [  469.738392] __blk_throtl_bio: bio start 2096 ffff88817852f480
> [  469.739358] __blk_throtl_bio: bio start 2064 ffff88817852e880
> [  469.740330] __blk_throtl_bio: bio start 2112 ffff88817852fa80
> [  469.741262] __blk_throtl_bio: bio start 2080 ffff88817852e280
> [  469.742280] __blk_throtl_bio: bio start 2096 ffff88817852e080
> [  469.743281] __blk_throtl_bio: bio start 2104 ffff88817852f880
> [  469.744309] __blk_throtl_bio: bio start 2240 ffff88817852e680
> [  469.745050] __blk_throtl_bio: bio start 2184 ffff88817852e480
> [  469.745857] __blk_throtl_bio: bio start 2120 ffff88817852f680
> [  469.746779] __blk_throtl_bio: bio start 2512 ffff88817852fe80
> [  469.747611] __blk_throtl_bio: bio start 2488 ffff88817852f280
> [  469.748242] __blk_throtl_bio: bio start 2120 ffff88817852ee80
> [  469.749159] __blk_throtl_bio: bio start 2256 ffff88817852fc80
> [  469.750087] __blk_throtl_bio: bio start 2576 ffff88817852ea80
> [  469.750802] __blk_throtl_bio: bio start 2112 ffff8881014a3a80
> [  469.751586] __blk_throtl_bio: bio start 2240 ffff8881014a2880
> [  469.752383] __blk_throtl_bio: bio start 2160 ffff8881014a2e80
> [  469.753289] __blk_throtl_bio: bio start 2248 ffff8881014a3c80
> [  469.754024] __blk_throtl_bio: bio start 2536 ffff8881014a2680
> [  469.754913] __blk_throtl_bio: bio start 2088 ffff8881014a3080
> [  469.766036] __blk_throtl_bio: bio start 211344 ffff8881014a3280
> [  469.842366] blk_throtl_dispatch_work_fn: bio done 2128 ffff8881014c08c0
> [  469.952627] blk_throtl_dispatch_work_fn: bio done 2144 ffff88817852ec80
> [  470.048729] blk_throtl_dispatch_work_fn: bio done 2096 ffff88817852f080
> [  470.152642] blk_throtl_dispatch_work_fn: bio done 2096 ffff88817852f480
> [  470.256661] blk_throtl_dispatch_work_fn: bio done 2064 ffff88817852e880
> [  470.360662] blk_throtl_dispatch_work_fn: bio done 2112 ffff88817852fa80
> [  470.464626] blk_throtl_dispatch_work_fn: bio done 2080 ffff88817852e280
> [  470.568652] blk_throtl_dispatch_work_fn: bio done 2096 ffff88817852e080
> [  470.672623] blk_throtl_dispatch_work_fn: bio done 2104 ffff88817852f880
> [  470.776620] blk_throtl_dispatch_work_fn: bio done 2240 ffff88817852e680
> [  470.889801] blk_throtl_dispatch_work_fn: bio done 2184 ffff88817852e480
> [  470.992686] blk_throtl_dispatch_work_fn: bio done 2120 ffff88817852f680
> [  471.112633] blk_throtl_dispatch_work_fn: bio done 2512 ffff88817852fe80
> [  471.232680] blk_throtl_dispatch_work_fn: bio done 2488 ffff88817852f280
> [  471.336695] blk_throtl_dispatch_work_fn: bio done 2120 ffff88817852ee80
> [  471.448645] blk_throtl_dispatch_work_fn: bio done 2256 ffff88817852fc80
> [  471.576632] blk_throtl_dispatch_work_fn: bio done 2576 ffff88817852ea80
> [  471.680709] blk_throtl_dispatch_work_fn: bio done 2112 ffff8881014a3a80
> [  471.792680] blk_throtl_dispatch_work_fn: bio done 2240 ffff8881014a2880
> [  471.896682] blk_throtl_dispatch_work_fn: bio done 2160 ffff8881014a2e80
> [  472.008698] blk_throtl_dispatch_work_fn: bio done 2248 ffff8881014a3c80
> [  472.136630] blk_throtl_dispatch_work_fn: bio done 2536 ffff8881014a2680
> [  472.240678] blk_throtl_dispatch_work_fn: bio done 2088 ffff8881014a3080
> [  482.560633] blk_throtl_dispatch_work_fn: bio done 211344 ffff8881014a3280
>
> Hence the upper layer issues some small IOs first, followed by one 100+MB IO,
> and the wait time looks correct.
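
To make the "wait time looks correct" point concrete, here is a small sketch that converts the sector counts printed by the debug patch into an expected wait at the wbps limit. It relies on bio_sectors() counting 512-byte sectors and assumes wbps is the only limit that matters here; `bio_wait_seconds` is an illustrative helper, not kernel code.

```python
SECTOR_BYTES = 512     # bio_sectors() counts 512-byte sectors
WBPS_LIMIT = 10485760  # bytes per second, from the io.max line in this thread


def bio_wait_seconds(sectors: int, wbps: int = WBPS_LIMIT) -> float:
    """Time a bio of the given size occupies a wbps budget (assumes wbps is the only limit)."""
    return sectors * SECTOR_BYTES / wbps


# Two sizes taken from the raw-disk log: a typical small bio and the large one.
for sectors in (2128, 211344):
    mib = sectors * SECTOR_BYTES / 2**20
    print(f"{sectors:>7} sectors = {mib:6.1f} MiB, about {bio_wait_seconds(sectors):.2f} s at the limit")
```

The small bios work out to roughly 0.1 s each and the 211344-sector bio to roughly 10.3 s, which lines up with the spacing of the "bio done" timestamps in the log.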
>
> Then I retested on xfs; the results are still the same with or without wiops:
>
> [ 1175.907019] __blk_throtl_bio: bio start 8192 ffff88816daf8480
> [ 1175.908224] __blk_throtl_bio: bio start 8192 ffff88816daf8e80
> [ 1175.910618] __blk_throtl_bio: bio start 8192 ffff88816daf9280
> [ 1175.911991] __blk_throtl_bio: bio start 8192 ffff88816daf8280
> [ 1175.913187] __blk_throtl_bio: bio start 8192 ffff88816daf9080
> [ 1175.914904] __blk_throtl_bio: bio start 8192 ffff88816daf9680
> [ 1175.916099] __blk_throtl_bio: bio start 8192 ffff88816daf8880
> [ 1175.917844] __blk_throtl_bio: bio start 8192 ffff88816daf8c80
> [ 1175.919025] __blk_throtl_bio: bio start 8192 ffff88816daf8a80
> [ 1175.920868] __blk_throtl_bio: bio start 8192 ffff888178a84080
> [ 1175.922068] __blk_throtl_bio: bio start 8192 ffff888178a84280
> [ 1175.923819] __blk_throtl_bio: bio start 8192 ffff888178a84480
> [ 1175.925017] __blk_throtl_bio: bio start 8192 ffff888178a84680
> [ 1175.926851] __blk_throtl_bio: bio start 8192 ffff888178a84880
> [ 1175.928025] __blk_throtl_bio: bio start 8192 ffff888178a84a80
> [ 1175.929806] __blk_throtl_bio: bio start 8192 ffff888178a84c80
> [ 1175.931007] __blk_throtl_bio: bio start 8192 ffff888178a84e80
> [ 1175.932852] __blk_throtl_bio: bio start 8192 ffff888178a85080
> [ 1175.934041] __blk_throtl_bio: bio start 8192 ffff888178a85280
> [ 1175.935892] __blk_throtl_bio: bio start 8192 ffff888178a85480
> [ 1175.937074] __blk_throtl_bio: bio start 8192 ffff888178a85680
> [ 1175.938860] __blk_throtl_bio: bio start 8192 ffff888178a85880
> [ 1175.940053] __blk_throtl_bio: bio start 8192 ffff888178a85a80
> [ 1175.941824] __blk_throtl_bio: bio start 8192 ffff888178a85c80
> [ 1175.943040] __blk_throtl_bio: bio start 8192 ffff888178a85e80
> [ 1175.944945] __blk_throtl_bio: bio start 8192 ffff88816b046080
> [ 1175.946156] __blk_throtl_bio: bio start 8192 ffff88816b046280
> [ 1175.948261] __blk_throtl_bio: bio start 8192 ffff88816b046480
> [ 1175.949521] __blk_throtl_bio: bio start 8192 ffff88816b046680
> [ 1175.950877] __blk_throtl_bio: bio start 8192 ffff88816b046880
> [ 1175.952051] __blk_throtl_bio: bio start 8192 ffff88816b046a80
> [ 1175.954313] __blk_throtl_bio: bio start 8192 ffff88816b046c80
> [ 1175.955530] __blk_throtl_bio: bio start 8192 ffff88816b046e80
> [ 1175.957370] __blk_throtl_bio: bio start 8192 ffff88816b047080
> [ 1175.958818] __blk_throtl_bio: bio start 8192 ffff88816b047280
> [ 1175.960093] __blk_throtl_bio: bio start 8192 ffff88816b047480
> [ 1175.961900] __blk_throtl_bio: bio start 8192 ffff88816b047680
> [ 1175.963070] __blk_throtl_bio: bio start 8192 ffff88816b047880
> [ 1175.965262] __blk_throtl_bio: bio start 8192 ffff88816b047a80
> [ 1175.966527] __blk_throtl_bio: bio start 8192 ffff88816b047c80
> [ 1175.967928] __blk_throtl_bio: bio start 8192 ffff88816b047e80
> [ 1175.969124] __blk_throtl_bio: bio start 8192 ffff888170e84080
> [ 1175.971369] __blk_throtl_bio: bio start 8192 ffff888170e84280
>
>
> Hence xfs always issues 4MB IOs, which is why a stable wbps can be
> observed with iostat. The main difference is that in the previous test a
> 100+MB IO was issued and throttled for 10+ seconds.
>
> Then for your case, you might want to confirm what kind of IO is
> submitted from the upper layer.
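
One way to check what the upper layer submits is to summarize the output of the debug patch. The following is a rough sketch that pairs the "bio start" and "bio done" lines by bio pointer and reports how long each bio was held. It assumes the exact printk format from the patch above; the function name `throttle_delays` and the inlined sample are illustrative only, and in practice the text would come from dmesg.

```python
import re

# Matches lines such as:
# [  469.736098] __blk_throtl_bio: bio start 2128 ffff8881014c08c0
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(\S+): bio (start|done) (\d+) ([0-9a-f]+)")


def throttle_delays(log_text):
    """Return (sectors, seconds_held) for each bio seen with both start and done."""
    starts = {}
    delays = []
    for match in LINE_RE.finditer(log_text):
        ts, _func, event, sectors, ptr = match.groups()
        if event == "start":
            starts[ptr] = float(ts)
        elif ptr in starts:
            delays.append((int(sectors), float(ts) - starts.pop(ptr)))
    return delays


# Four lines copied from the raw-disk log in this thread.
sample = """\
[  469.736098] __blk_throtl_bio: bio start 2128 ffff8881014c08c0
[  469.766036] __blk_throtl_bio: bio start 211344 ffff8881014a3280
[  469.842366] blk_throtl_dispatch_work_fn: bio done 2128 ffff8881014c08c0
[  482.560633] blk_throtl_dispatch_work_fn: bio done 211344 ffff8881014a3280
"""

for sectors, delay in throttle_delays(sample):
    print(f"{sectors * 512 / 2**20:8.1f} MiB held for {delay:6.2f} s")
```

Because the regex is applied with a search over the whole text, it also works on log lines that carry a mail quote prefix.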
>
> Thanks,
> Kuai
> >
> > Thanks,
> > Kuai
> >
> >
> > .
> >
>
