lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGWkznEGKFWdvSfa87BveT_=yfQQpzerOCFn=x+a7PJTDNJUYA@mail.gmail.com>
Date: Mon, 5 Feb 2024 17:09:33 +0800
From: Zhaoyang Huang <huangzhaoyang@...il.com>
To: "zhaoyang.huang" <zhaoyang.huang@...soc.com>
Cc: Jens Axboe <axboe@...nel.dk>, Matthew Wilcox <willy@...radead.org>, Yu Zhao <yuzhao@...gle.com>, 
	linux-block@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, steve.kang@...soc.com
Subject: Re: [PATCHv8 1/1] block: introduce content activity based ioprio

On Mon, Feb 5, 2024 at 5:06 PM zhaoyang.huang <zhaoyang.huang@...soc.com> wrote:
>
> From: Zhaoyang Huang <zhaoyang.huang@...soc.com>
>
> Currently, request's ioprio are set via task's schedule priority(when no
> blkcg configured), which has high priority tasks possess the privilege on
> both of CPU and IO scheduling. Furthermore, most of the write requestes
> are launched asynchronosly from kworker which can't know the submitter's
> priorities.
> This commit works as a hint of original policy by promoting the request
> ioprio based on the page/folio's activity. The original idea comes from
> LRU_GEN which provides more precised folio activity than before. This
> commit try to adjust the request's ioprio when certain part of its folios
> are hot, which indicate that this request carry important contents and
> need be scheduled ealier.
>
> This commit provide two sets of exclusive APIs.
>
> *counting activities by iterating the bio's pages
> The filesystem should call bio_set_active_ioprio() before submit_bio on the
> spot where they want(buffered read/write/sync etc).
>
> *counting activities during each call
> The filesystem should call bio_set_active_ioprio_page/folio() after
> calling bio_add_page/folio. Please be noted that this set of API can not
> handle bvec_try_merge_page cases.
>
> This commit is verified on a v6.6 6GB RAM android14 system via 4 test cases
> by calling bio_set_active_ioprio in erofs, ext4, f2fs and blkdev(raw
> partition of gendisk)
>
> Case 1:
> script[a] which get significant improved fault time as expected[b]*
> where dd's cost also shrink from 55s to 40s.
> (1). fault_latency.bin is an ebpf based test tool which measure all task's
>    iowait latency during page fault when scheduled out/in.
> (2). costmem generate page fault by mmaping a file and access the VA.
> (3). dd generate concurrent vfs io.
>
> [a]
> ./fault_latency.bin 1 5 > /data/dd_costmem &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> costmem -c0 -a2048000 -b128000 -o0 1>/dev/null &
> dd if=/dev/block/sda of=/data/ddtest bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest1 bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest2 bs=1024 count=2048000 &
> dd if=/dev/block/sda of=/data/ddtest3 bs=1024 count=2048000
> [b]
>                        mainline         commit
> io wait                736us            523us
>
> * provide correct result for test case 1 in v7 which was compared between
> EMMC and UFS wrongly.
>
> Case 2:
> fio -filename=/dev/block/by-name/userdata -rw=randread -direct=0 -bs=4k -size=2000M -numjobs=8 -group_reporting -name=mytest
> mainline: 513MiB/s
> READ: bw=531MiB/s (557MB/s), 531MiB/s-531MiB/s (557MB/s-557MB/s), io=15.6GiB (16.8GB), run=30137-30137msec
> READ: bw=543MiB/s (569MB/s), 543MiB/s-543MiB/s (569MB/s-569MB/s), io=15.6GiB (16.8GB), run=29469-29469msec
> READ: bw=474MiB/s (497MB/s), 474MiB/s-474MiB/s (497MB/s-497MB/s), io=15.6GiB (16.8GB), run=33724-33724msec
> READ: bw=535MiB/s (561MB/s), 535MiB/s-535MiB/s (561MB/s-561MB/s), io=15.6GiB (16.8GB), run=29928-29928msec
> READ: bw=523MiB/s (548MB/s), 523MiB/s-523MiB/s (548MB/s-548MB/s), io=15.6GiB (16.8GB), run=30617-30617msec
> READ: bw=492MiB/s (516MB/s), 492MiB/s-492MiB/s (516MB/s-516MB/s), io=15.6GiB (16.8GB), run=32518-32518msec
> READ: bw=533MiB/s (559MB/s), 533MiB/s-533MiB/s (559MB/s-559MB/s), io=15.6GiB (16.8GB), run=29993-29993msec
> READ: bw=524MiB/s (550MB/s), 524MiB/s-524MiB/s (550MB/s-550MB/s), io=15.6GiB (16.8GB), run=30526-30526msec
> READ: bw=529MiB/s (554MB/s), 529MiB/s-529MiB/s (554MB/s-554MB/s), io=15.6GiB (16.8GB), run=30269-30269msec
> READ: bw=449MiB/s (471MB/s), 449MiB/s-449MiB/s (471MB/s-471MB/s), io=15.6GiB (16.8GB), run=35629-35629msec
>
> commit: 633MiB/s
> READ: bw=668MiB/s (700MB/s), 668MiB/s-668MiB/s (700MB/s-700MB/s), io=15.6GiB (16.8GB), run=23952-23952msec
> READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=15.6GiB (16.8GB), run=27164-27164msec
> READ: bw=638MiB/s (669MB/s), 638MiB/s-638MiB/s (669MB/s-669MB/s), io=15.6GiB (16.8GB), run=25071-25071msec
> READ: bw=714MiB/s (749MB/s), 714MiB/s-714MiB/s (749MB/s-749MB/s), io=15.6GiB (16.8GB), run=22409-22409msec
> READ: bw=600MiB/s (629MB/s), 600MiB/s-600MiB/s (629MB/s-629MB/s), io=15.6GiB (16.8GB), run=26669-26669msec
> READ: bw=592MiB/s (621MB/s), 592MiB/s-592MiB/s (621MB/s-621MB/s), io=15.6GiB (16.8GB), run=27036-27036msec
> READ: bw=691MiB/s (725MB/s), 691MiB/s-691MiB/s (725MB/s-725MB/s), io=15.6GiB (16.8GB), run=23150-23150msec
> READ: bw=569MiB/s (596MB/s), 569MiB/s-569MiB/s (596MB/s-596MB/s), io=15.6GiB (16.8GB), run=28142-28142msec
> READ: bw=563MiB/s (590MB/s), 563MiB/s-563MiB/s (590MB/s-590MB/s), io=15.6GiB (16.8GB), run=28429-28429msec
> READ: bw=712MiB/s (746MB/s), 712MiB/s-712MiB/s (746MB/s-746MB/s), io=15.6GiB (16.8GB), run=22478-22478msec
>
> Case 3:
> This commit is also verified by the case of launching camera APP which is
> usually considered as heavy working load on both of memory and IO, which
> shows 12%-24% improvement.
>
>                 ttl = 0         ttl = 50        ttl = 100
> mainline        2267ms          2420ms          2316ms
> commit          1992ms          1806ms          1998ms
>
> case 4:
> androbench has no improvment as well as regression in RD/WR test item
> while make a 3% improvement in sqlite items.
>
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@...soc.com>
> ---
> change of v2: calculate page's activity via helper function
> change of v3: solve layer violation by move API into mm
> change of v4: keep block clean by removing the page related API
> change of v5: introduce the macros of bio_add_folio/page for read dir.
> change of v6: replace the macro of bio_add_xxx by submit_bio which
>                 iterating the bio_vec before launching bio to block layer
> change of v7: introduce the function bio_set_active_ioprio
>               provide updated test result
> change of v8: provide two sets of APIs for bio_set_active_ioprio_xxx
> ---
> ---
>  block/Kconfig             | 27 +++++++++++
>  block/bio.c               | 94 +++++++++++++++++++++++++++++++++++++++
>  include/linux/bio.h       |  3 ++
>  include/linux/blk_types.h |  7 ++-
>  4 files changed, 130 insertions(+), 1 deletion(-)
>
> diff --git a/block/Kconfig b/block/Kconfig
> index f1364d1c0d93..5e721678ea3d 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -228,6 +228,33 @@ config BLOCK_HOLDER_DEPRECATED
>  config BLK_MQ_STACKING
>         bool
>
> +config BLK_CONT_ACT_BASED_IOPRIO
> +       bool "Enable content activity based ioprio"
> +       depends on LRU_GEN
> +       default n
> +       help
> +         This item enable the feature of adjust bio's priority by
> +         calculating its content's activity.
> +         This feature works as a hint of original bio_set_ioprio
> +         which means rt task get no change of its bio->bi_ioprio
> +         while other tasks have the opportunity to raise the ioprio
> +         if the bio take certain numbers of active pages.
> +         The file system should use this by modifying their buffered
> +         read/write/sync function to raise the bio->bi_ioprio before
> +         calling submit_bio or after bio_add_page/folio
> +
> +config BLK_CONT_ACT_BASED_IOPRIO_ITER_BIO
> +       bool "Counting bio's activity by iterating bio's pages"
> +       depends on BLK_CONT_ACT_BASED_IOPRIO
> +       help
> +         The API under this config counts bio's activity by iterating the bio.
> +
> +config BLK_CONT_ACT_BASED_IOPRIO_ADD_PAGE
> +       bool "Counting bio's activity when adding page or folio"
> +       depends on BLK_CONT_ACT_BASED_IOPRIO && !BLK_CONT_ACT_BASED_IOPRIO_ITER_BIO
> +       help
> +         The API under this config count activity during each call buy can't
> +          handle bvec_try_merge_page cases, please be sure you are ok with that.
counting activities during each call can not handle
bvec_try_merge_page cases as it returns the valid value. So I provide
two sets of exclusive APIs by keeping the iteration one
int bio_add_page(struct bio *bio, struct page *page,
                 unsigned int len, unsigned int offset)
{
        bool same_page = false;

        if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
                return 0;
        if (bio->bi_iter.bi_size > UINT_MAX - len)
                return 0;

        if (bio->bi_vcnt > 0 &&
            bvec_try_merge_page(&bio->bi_io_vec[bio->bi_vcnt - 1],
                                page, len, offset, &same_page)) {
                bio->bi_iter.bi_size += len;
                return len;
        }

        if (bio->bi_vcnt >= bio->bi_max_vecs)
                return 0;
        __bio_add_page(bio, page, len, offset);
        return len;
}

>  source "block/Kconfig.iosched"
>
>  endif # BLOCK
> diff --git a/block/bio.c b/block/bio.c
> index 816d412c06e9..73916a6c319f 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1476,6 +1476,100 @@ void bio_set_pages_dirty(struct bio *bio)
>  }
>  EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
>
> +/*
> + * bio_set_active_ioprio() is helper function for fs to adjust the bio's ioprio via
> + * calculating the content's activity which measured from MGLRU.
> + * The file system should call this function before submit_bio for the buffered
> + * read/write/sync.
> + */
> +#ifdef CONFIG_BLK_CONT_ACT_BASED_IOPRIO
> +#ifdef CONFIG_BLK_CONT_ACT_BASED_IOPRIO_ITER_BIO
> +void bio_set_active_ioprio(struct bio *bio)
> +{
> +       struct bio_vec bv;
> +       struct bvec_iter iter;
> +       struct page *page;
> +       int class, level, hint;
> +       int activity = 0;
> +       int cnt = 0;
> +
> +       class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> +       level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> +       hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> +       /*apply legacy ioprio policy on RT task*/
> +       if (task_is_realtime(current)) {
> +               bio->bi_ioprio = IOPRIO_PRIO_VALUE_HINT(IOPRIO_CLASS_RT, level, hint);
> +               return;
> +       }
> +       bio_for_each_bvec(bv, bio, iter) {
> +               page = bv.bv_page;
> +               activity += PageWorkingset(page) ? 1 : 0;
> +               cnt++;
> +               if (activity > bio->bi_vcnt / 2) {
> +                       class = IOPRIO_CLASS_RT;
> +                       break;
> +               } else if (activity > bio->bi_vcnt / 4) {
> +                       /*
> +                        * all itered pages are all active so far
> +                        * then raise to RT directly
> +                        */
> +                       if (activity == cnt) {
> +                               class = IOPRIO_CLASS_RT;
> +                               break;
> +                       } else
> +                               class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()),
> +                                               IOPRIO_CLASS_BE);
> +               }
> +       }
> +       if (!class && activity > cnt / 2)
> +               class = IOPRIO_CLASS_RT;
> +       else if (!class && activity > cnt / 4)
> +               class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> +
> +       bio->bi_ioprio = IOPRIO_PRIO_VALUE_HINT(class, level, hint);
> +}
> +void bio_set_active_ioprio_folio(struct bio *bio, struct folio *folio) {}
> +void bio_set_active_ioprio_page(struct bio *bio, struct page *page) {}
> +#endif
> +#ifdef CONFIG_BLK_CONT_ACT_BASED_IOPRIO_ADD_PAGE
> +/*
> + * bio_set_active_ioprio_page/folio are helper functions for counting
> + * the bio's activity during each all. However, it can't handle the
> + * scenario of bvec_try_merge_page. The submitter can use them if there
> + * is no such case in the system(block size < page size)
> + */
> +void bio_set_active_ioprio_page(struct bio *bio, struct page *page)
> +{
> +       int class, level, hint;
> +
> +       class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
> +       level = IOPRIO_PRIO_LEVEL(bio->bi_ioprio);
> +       hint = IOPRIO_PRIO_HINT(bio->bi_ioprio);
> +       bio->bi_cont_act += PageWorkingset(page) ? 1 : 0;
> +
> +       if (bio->bi_cont_act > bio->bi_vcnt / 2)
> +               class = IOPRIO_CLASS_RT;
> +       else if (bio->bi_cont_act > bio->bi_vcnt / 4)
> +               class = max(IOPRIO_PRIO_CLASS(get_current_ioprio()), IOPRIO_CLASS_BE);
> +
> +       bio->bi_ioprio = IOPRIO_PRIO_VALUE_HINT(class, level, hint);
> +}
> +
> +void bio_set_active_ioprio_folio(struct bio *bio, struct folio *folio)
> +{
> +       bio_set_active_ioprio_page(bio, &folio->page);
> +}
> +void bio_set_active_ioprio(struct bio *bio) {}
> +#endif
> +#else
> +void bio_set_active_ioprio(struct bio *bio) {}
> +void bio_set_active_ioprio_page(struct bio *bio, struct page *page) {}
> +void bio_set_active_ioprio_folio(struct bio *bio, struct folio *folio) {}
> +#endif
> +EXPORT_SYMBOL_GPL(bio_set_active_ioprio);
> +EXPORT_SYMBOL_GPL(bio_set_active_ioprio_page);
> +EXPORT_SYMBOL_GPL(bio_set_active_ioprio_folio);
> +
>  /*
>   * bio_check_pages_dirty() will check that all the BIO's pages are still dirty.
>   * If they are, then fine.  If, however, some pages are clean then they must
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 41d417ee1349..35221ee3dd54 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -487,6 +487,9 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter);
>  void __bio_release_pages(struct bio *bio, bool mark_dirty);
>  extern void bio_set_pages_dirty(struct bio *bio);
>  extern void bio_check_pages_dirty(struct bio *bio);
> +extern void bio_set_active_ioprio(struct bio *bio);
> +extern void bio_set_active_ioprio_folio(struct bio *bio, struct folio *folio);
> +extern void bio_set_active_ioprio_page(struct bio *bio, struct page *page);
>
>  extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
>                                struct bio *src, struct bvec_iter *src_iter);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index d5c5e59ddbd2..a3a18b9a5168 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -314,7 +314,12 @@ struct bio {
>         struct bio_vec          *bi_io_vec;     /* the actual vec list */
>
>         struct bio_set          *bi_pool;
> -
> +#ifdef CONFIG_BLK_CONT_ACT_BASED_IOPRIO
> +       /*
> +        * bi_cont_act record total activities of bi_io_vec->pages
> +        */
> +       u64                     bi_cont_act;
> +#endif
>         /*
>          * We can inline a number of vecs at the end of the bio, to avoid
>          * double allocations for a small number of bio_vecs. This member
> --
> 2.25.1
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ