Message-ID: <ozjsdoiqa2uem65qqj4fjbrwm6toxlj5bzv7f5dg5xfiljv3zi@wcaamboo2r6h>
Date: Tue, 29 Jul 2025 12:16:50 +0200
From: Jan Kara <jack@...e.cz>
To: Yu Kuai <yukuai1@...weicloud.com>
Cc: jack@...e.cz, axboe@...nel.dk, akpm@...ux-foundation.org,
yang.yang@...o.com, dlemoal@...nel.org, ming.lei@...hat.com,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org, yukuai3@...wei.com,
yi.zhang@...wei.com, yangerkun@...wei.com, johnny.chenyi@...wei.com,
Omar Sandoval <osandov@...com>
Subject: Re: [PATCH v2 1/2] lib/sbitmap: convert shallow_depth from one word
to the whole sbitmap
On Tue 29-07-25 11:19:05, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@...wei.com>
>
> Currently elevators record an internal 'async_depth' to throttle
> asynchronous requests, and they calculate shallow_depth based on
> sb->shift, assuming that each word holds (1 << sb->shift) available
> tags.
>
> However, the last word may hold fewer than (1 << sb->shift) available
> tags, see __map_depth():
>
> if (index == sb->map_nr - 1)
> return sb->depth - (index << sb->shift);
>
> As a consequence, if the last word is used, more tags can be obtained
> than expected. For example, assume nr_requests=256 with four words; in
> the worst case, if the user sets nr_requests=32, the first word is also
> the last word and holds only 32 tags, yet async_depth is still
> calculated from the full bits-per-word value of 64, which is wrong.
>
> On the other hand, due to cgroup QoS, bfq may want to allow only one
> request to be allocated, but setting shallow_depth=1 still allows up to
> one request per word, i.e. as many requests as there are words.
>
> Fix these problems by applying shallow_depth to the whole sbitmap
> instead of to each word, and change kyber, mq-deadline and bfq to
> follow this.
>
> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
I agree with these problems but AFAIU the current implementation of shallow
depth was done this way for a reason. Omar can chime in here as the original
author, or perhaps Jens, but the idea of the current shallow depth
implementation is that each sbitmap user, regardless of the shallow depth
used, has a chance to allocate from each sbitmap word, which evenly
distributes pressure among the available sbitmap words. With the
implementation you've chosen there will be higher pressure (and thus
contention) on the words with low indices.
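To make that concrete (numbers for illustration only): with a 256-bit map
split into four 64-bit words and bfq's "no more than 50% of tags for async
I/O" limit, today an async allocator can take up to 32 bits from each word
and so still touches all four words; with the limit applied to the bitmap
as a whole, all async allocators end up competing for the first 128 bits,
i.e. for words 0 and 1 only.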
So I think it would be good to fix the issues with shallow depth for a small
number of sbitmap words (because that's where these buggy corner cases may
matter in practice) but I believe the logic which constrains the number of
used bits in each *word* when shallow_depth is specified should be kept. It
might make sense to change the API so that shallow_depth is indeed specified
relative to the total size of the bitmap, not to the size of a word (because
the current convention confuses practically everybody I've met and is a
constant source of bugs), if that can be made to perform well.
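Roughly what I have in mind (only a sketch, made-up name, essentially the
dd_to_word_depth() helper this patch removes, moved into sbitmap): convert
the bitmap-wide limit into a per-word limit once, and let sbitmap_find_bit()
keep clamping each word to it as it does today:

	/*
	 * Sketch only: scale a shallow_depth given relative to the whole
	 * bitmap down to a per-word limit, rounding up and never dropping
	 * below one usable bit per word.
	 */
	static unsigned int sb_shallow_word_depth(const struct sbitmap *sb,
						  unsigned int shallow_depth)
	{
		unsigned int bits_per_word = 1U << sb->shift;

		if (shallow_depth >= sb->depth)
			return bits_per_word;

		return max(1U, DIV_ROUND_UP(shallow_depth * bits_per_word,
					    sb->depth));
	}

That would keep the even distribution of pressure across words while letting
callers think in terms of the whole bitmap.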
Honza
> ---
> block/bfq-iosched.c | 35 ++++++++++++--------------
> block/bfq-iosched.h | 3 +--
> block/kyber-iosched.c | 9 ++-----
> block/mq-deadline.c | 16 +-----------
> include/linux/sbitmap.h | 6 ++---
> lib/sbitmap.c | 55 +++++++++++++++++++++--------------------
> 6 files changed, 51 insertions(+), 73 deletions(-)
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index f71ec0887733..a6a574a8eac9 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -694,17 +694,13 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
> {
> struct bfq_data *bfqd = data->q->elevator->elevator_data;
> struct bfq_io_cq *bic = bfq_bic_lookup(data->q);
> - int depth;
> - unsigned limit = data->q->nr_requests;
> - unsigned int act_idx;
> + unsigned int limit, act_idx;
>
> /* Sync reads have full depth available */
> - if (op_is_sync(opf) && !op_is_write(opf)) {
> - depth = 0;
> - } else {
> - depth = bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
> - limit = (limit * depth) >> bfqd->full_depth_shift;
> - }
> + if (op_is_sync(opf) && !op_is_write(opf))
> + limit = data->q->nr_requests;
> + else
> + limit = bfqd->async_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];
>
> for (act_idx = 0; bic && act_idx < bfqd->num_actuators; act_idx++) {
> /* Fast path to check if bfqq is already allocated. */
> @@ -718,14 +714,16 @@ static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
> * available requests and thus starve other entities.
> */
> if (bfqq_request_over_limit(bfqd, bic, opf, act_idx, limit)) {
> - depth = 1;
> + limit = 1;
> break;
> }
> }
> +
> bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
> - __func__, bfqd->wr_busy_queues, op_is_sync(opf), depth);
> - if (depth)
> - data->shallow_depth = depth;
> + __func__, bfqd->wr_busy_queues, op_is_sync(opf), limit);
> +
> + if (limit < data->q->nr_requests)
> + data->shallow_depth = limit;
> }
>
> static struct bfq_queue *
> @@ -7114,9 +7112,8 @@ void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
> */
> static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
> {
> - unsigned int depth = 1U << bt->sb.shift;
> + unsigned int nr_requests = bfqd->queue->nr_requests;
>
> - bfqd->full_depth_shift = bt->sb.shift;
> /*
> * In-word depths if no bfq_queue is being weight-raised:
> * leaving 25% of tags only for sync reads.
> @@ -7128,13 +7125,13 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
> * limit 'something'.
> */
> /* no more than 50% of tags for async I/O */
> - bfqd->word_depths[0][0] = max(depth >> 1, 1U);
> + bfqd->async_depths[0][0] = max(nr_requests >> 1, 1U);
> /*
> * no more than 75% of tags for sync writes (25% extra tags
> * w.r.t. async I/O, to prevent async I/O from starving sync
> * writes)
> */
> - bfqd->word_depths[0][1] = max((depth * 3) >> 2, 1U);
> + bfqd->async_depths[0][1] = max((nr_requests * 3) >> 2, 1U);
>
> /*
> * In-word depths in case some bfq_queue is being weight-
> @@ -7144,9 +7141,9 @@ static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)
> * shortage.
> */
> /* no more than ~18% of tags for async I/O */
> - bfqd->word_depths[1][0] = max((depth * 3) >> 4, 1U);
> + bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);
> /* no more than ~37% of tags for sync writes (~20% extra tags) */
> - bfqd->word_depths[1][1] = max((depth * 6) >> 4, 1U);
> + bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);
> }
>
> static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
> diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
> index 687a3a7ba784..31217f196f4f 100644
> --- a/block/bfq-iosched.h
> +++ b/block/bfq-iosched.h
> @@ -813,8 +813,7 @@ struct bfq_data {
> * Depth limits used in bfq_limit_depth (see comments on the
> * function)
> */
> - unsigned int word_depths[2][2];
> - unsigned int full_depth_shift;
> + unsigned int async_depths[2][2];
>
> /*
> * Number of independent actuators. This is equal to 1 in
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index 4dba8405bd01..bfd9a40bb33d 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -157,10 +157,7 @@ struct kyber_queue_data {
> */
> struct sbitmap_queue domain_tokens[KYBER_NUM_DOMAINS];
>
> - /*
> - * Async request percentage, converted to per-word depth for
> - * sbitmap_get_shallow().
> - */
> + /* Number of allowed async requests. */
> unsigned int async_depth;
>
> struct kyber_cpu_latency __percpu *cpu_latency;
> @@ -454,10 +451,8 @@ static void kyber_depth_updated(struct blk_mq_hw_ctx *hctx)
> {
> struct kyber_queue_data *kqd = hctx->queue->elevator->elevator_data;
> struct blk_mq_tags *tags = hctx->sched_tags;
> - unsigned int shift = tags->bitmap_tags.sb.shift;
> -
> - kqd->async_depth = (1U << shift) * KYBER_ASYNC_PERCENT / 100U;
>
> + kqd->async_depth = hctx->queue->nr_requests * KYBER_ASYNC_PERCENT / 100U;
> sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, kqd->async_depth);
> }
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 2edf1cac06d5..9ab6c6256695 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -487,20 +487,6 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> return rq;
> }
>
> -/*
> - * 'depth' is a number in the range 1..INT_MAX representing a number of
> - * requests. Scale it with a factor (1 << bt->sb.shift) / q->nr_requests since
> - * 1..(1 << bt->sb.shift) is the range expected by sbitmap_get_shallow().
> - * Values larger than q->nr_requests have the same effect as q->nr_requests.
> - */
> -static int dd_to_word_depth(struct blk_mq_hw_ctx *hctx, unsigned int qdepth)
> -{
> - struct sbitmap_queue *bt = &hctx->sched_tags->bitmap_tags;
> - const unsigned int nrr = hctx->queue->nr_requests;
> -
> - return ((qdepth << bt->sb.shift) + nrr - 1) / nrr;
> -}
> -
> /*
> * Called by __blk_mq_alloc_request(). The shallow_depth value set by this
> * function is used by __blk_mq_get_tag().
> @@ -517,7 +503,7 @@ static void dd_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
> * Throttle asynchronous requests and writes such that these requests
> * do not block the allocation of synchronous requests.
> */
> - data->shallow_depth = dd_to_word_depth(data->hctx, dd->async_depth);
> + data->shallow_depth = dd->async_depth;
> }
>
> /* Called by blk_mq_update_nr_requests(). */
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 189140bf11fc..4adf4b364fcd 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -213,12 +213,12 @@ int sbitmap_get(struct sbitmap *sb);
> * sbitmap_get_shallow() - Try to allocate a free bit from a &struct sbitmap,
> * limiting the depth used from each word.
> * @sb: Bitmap to allocate from.
> - * @shallow_depth: The maximum number of bits to allocate from a single word.
> + * @shallow_depth: The maximum number of bits to allocate from the bitmap.
> *
> * This rather specific operation allows for having multiple users with
> * different allocation limits. E.g., there can be a high-priority class that
> * uses sbitmap_get() and a low-priority class that uses sbitmap_get_shallow()
> - * with a @shallow_depth of (1 << (@sb->shift - 1)). Then, the low-priority
> + * with a @shallow_depth of (sb->depth >> 1). Then, the low-priority
> * class can only allocate half of the total bits in the bitmap, preventing it
> * from starving out the high-priority class.
> *
> @@ -478,7 +478,7 @@ unsigned long __sbitmap_queue_get_batch(struct sbitmap_queue *sbq, int nr_tags,
> * sbitmap_queue, limiting the depth used from each word, with preemption
> * already disabled.
> * @sbq: Bitmap queue to allocate from.
> - * @shallow_depth: The maximum number of bits to allocate from a single word.
> + * @shallow_depth: The maximum number of bits to allocate from the queue.
> * See sbitmap_get_shallow().
> *
> * If you call this, make sure to call sbitmap_queue_min_shallow_depth() after
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index d3412984170c..f2e90ac6b56e 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -208,8 +208,27 @@ static int sbitmap_find_bit_in_word(struct sbitmap_word *map,
> return nr;
> }
>
> +static unsigned int __map_depth_with_shallow(const struct sbitmap *sb,
> + int index,
> + unsigned int shallow_depth)
> +{
> + unsigned int lower_bound = 0;
> +
> + if (shallow_depth >= sb->depth)
> + return __map_depth(sb, index);
> +
> + if (index > 0)
> + lower_bound += index << sb->shift;
> +
> + if (shallow_depth <= lower_bound)
> + return 0;
> +
> + return min_t(unsigned int, __map_depth(sb, index),
> + shallow_depth - lower_bound);
> +}
> +
> static int sbitmap_find_bit(struct sbitmap *sb,
> - unsigned int depth,
> + unsigned int shallow_depth,
> unsigned int index,
> unsigned int alloc_hint,
> bool wrap)
> @@ -218,12 +237,12 @@ static int sbitmap_find_bit(struct sbitmap *sb,
> int nr = -1;
>
> for (i = 0; i < sb->map_nr; i++) {
> - nr = sbitmap_find_bit_in_word(&sb->map[index],
> - min_t(unsigned int,
> - __map_depth(sb, index),
> - depth),
> - alloc_hint, wrap);
> + unsigned int depth = __map_depth_with_shallow(sb, index,
> + shallow_depth);
>
> + if (depth)
> + nr = sbitmap_find_bit_in_word(&sb->map[index], depth,
> + alloc_hint, wrap);
> if (nr != -1) {
> nr += index << sb->shift;
> break;
> @@ -406,27 +425,9 @@ EXPORT_SYMBOL_GPL(sbitmap_bitmap_show);
> static unsigned int sbq_calc_wake_batch(struct sbitmap_queue *sbq,
> unsigned int depth)
> {
> - unsigned int wake_batch;
> - unsigned int shallow_depth;
> -
> - /*
> - * Each full word of the bitmap has bits_per_word bits, and there might
> - * be a partial word. There are depth / bits_per_word full words and
> - * depth % bits_per_word bits left over. In bitwise arithmetic:
> - *
> - * bits_per_word = 1 << shift
> - * depth / bits_per_word = depth >> shift
> - * depth % bits_per_word = depth & ((1 << shift) - 1)
> - *
> - * Each word can be limited to sbq->min_shallow_depth bits.
> - */
> - shallow_depth = min(1U << sbq->sb.shift, sbq->min_shallow_depth);
> - depth = ((depth >> sbq->sb.shift) * shallow_depth +
> - min(depth & ((1U << sbq->sb.shift) - 1), shallow_depth));
> - wake_batch = clamp_t(unsigned int, depth / SBQ_WAIT_QUEUES, 1,
> - SBQ_WAKE_BATCH);
> -
> - return wake_batch;
> + return clamp_t(unsigned int,
> + min(depth, sbq->min_shallow_depth) / SBQ_WAIT_QUEUES,
> + 1, SBQ_WAKE_BATCH);
> }
>
> int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
> --
> 2.39.2
>
--
Jan Kara <jack@...e.com>
SUSE Labs, CR