linux-kernel - Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1D08B61A9CF0974AA09887BE32D889DA0DA145@ULS-OP-MBXIP03.sdcorp.global.sandisk.com>
Date:   Mon, 6 Mar 2017 19:40:15 +0000
From:   Bart Van Assche <Bart.VanAssche@...disk.com>
To:     Paolo Valente <paolo.valente@...aro.org>,
        Jens Axboe <axboe@...nel.dk>, Tejun Heo <tj@...nel.org>
CC:     Fabio Checconi <fchecconi@...il.com>,
        Arianna Avanzini <avanzini.arianna@...il.com>,
        "linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "ulf.hansson@...aro.org" <ulf.hansson@...aro.org>,
        "linus.walleij@...aro.org" <linus.walleij@...aro.org>,
        "broonie@...nel.org" <broonie@...nel.org>
Subject: Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O
 scheduler as an extra scheduler

On 03/04/2017 08:01 AM, Paolo Valente wrote:
> BFQ is a proportional-share I/O scheduler, whose general structure,
> plus a lot of code, are borrowed from CFQ.
> [ ... ]

This description is very useful. However, since it is identical to the
description this patch adds to Documentation/block/bfq-iosched.txt I
propose to leave it out from the patch description.

What seems missing to me is an overview of the limitations of BFQ. Does
BFQ e.g. support multiple hardware queues?

> +3. What are BFQ's tunable?
> +==========================
> +[ ... ]

A thorough knowledge of BFQ is required to tune it properly. Users don't
want to tune I/O schedulers. Has it been considered to invent algorithms
to tune these parameters automatically?

> + * Licensed under GPL-2.

The COPYING file at the top of the tree mentions that GPL-v2 licensing
should be specified as follows close to the start of each source file:

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
    02110-1301 USA

> + * BFQ is a proportional-share I/O scheduler, with some extra
> + * low-latency capabilities. BFQ also supports full hierarchical
> + * scheduling through cgroups. Next paragraphs provide an introduction
> + * on BFQ inner workings. Details on BFQ benefits and usage can be
> + * found in Documentation/block/bfq-iosched.txt.

That reference should be sufficient - please do not duplicate
Documentation/block/bfq-iosched.txt in block/bfq-iosched.c.

> +/**
> + * struct bfq_service_tree - per ioprio_class service tree.
> + *
> + * Each service tree represents a B-WF2Q+ scheduler on its own.  Each
> + * ioprio_class has its own independent scheduler, and so its own
> + * bfq_service_tree.  All the fields are protected by the queue lock
> + * of the containing bfqd.
> + */
> +struct bfq_service_tree {
> +	/* tree for active entities (i.e., those backlogged) */
> +	struct rb_root active;
> +	/* tree for idle entities (i.e., not backlogged, with V <= F_i)*/
> +	struct rb_root idle;
> +
> +	struct bfq_entity *first_idle;	/* idle entity with minimum F_i */
> +	struct bfq_entity *last_idle;	/* idle entity with maximum F_i */
> +
> +	u64 vtime; /* scheduler virtual time */
> +	/* scheduler weight sum; active and idle entities contribute to it */
> +	unsigned long wsum;
> +};

Inline comments next to structure members are ugly and make the
structure definition hard to read. Please follow the instructions in
Documentation/kernel-doc-nano-HOWTO.txt for documenting structure members.

> +	u64 finish; /* B-WF2Q+ finish timestamp (aka F_i) */
> +	u64 start;  /* B-WF2Q+ start timestamp (aka S_i) */

For all times and timestamps, please document the time unit (e.g. s, ms,
us, ns, jiffies, ...).

> +enum bfq_device_speed {
> +	BFQ_BFQD_FAST,
> +	BFQ_BFQD_SLOW,
> +};

What is the meaning of "fast" and "slow" devices in this context?
Anyway, since the first patch that uses this enum is patch 6, please
defer introduction of this enum until patch 6.

> +
> +/**
> + * struct bfq_data - per-device data structure.
> + *
> + * All the fields are protected by @lock.
> + */
> +struct bfq_data {
> +	/* device request queue */
> +	struct request_queue *queue;
> +	[ ... ]
> +
> +	/* on-disk position of the last served request */
> +	sector_t last_position;

What is the relevance of last_position if there are multiple hardware
queues? Will the BFQ algorithm fail to realize its guarantees in that case?

What is the relevance of this structure member for block devices that
have multiple spindles, e.g. arrays of hard disks?

> +enum bfqq_state_flags {
> +	BFQ_BFQQ_FLAG_busy = 0,		/* has requests or is in service */
> +	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */
> +	BFQ_BFQQ_FLAG_non_blocking_wait_rq, /*
> +					     * waiting for a request
> +					     * without idling the device
> +					     */
> +	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
> +	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */
> +	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */
> +	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */
> +	BFQ_BFQQ_FLAG_IO_bound,		/*
> +					 * bfqq has timed-out at least once
> +					 * having consumed at most 2/10 of
> +					 * its budget
> +					 */
> +};

The "BFQ_BFQQ_FLAG_" prefix looks silly and too long to me. How about
e.g. using the prefix "BFQQF_" instead?

> +#define BFQ_BFQQ_FNS(name)						\
> +static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		\
> +{									\
> +	(bfqq)->flags |= (1 << BFQ_BFQQ_FLAG_##name);			\
> +}									\
> +static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)		\
> +{									\
> +	(bfqq)->flags &= ~(1 << BFQ_BFQQ_FLAG_##name);			\
> +}									\
> +static int bfq_bfqq_##name(const struct bfq_queue *bfqq)		\
> +{									\
> +	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) != 0;	\
> +}

Are the bodies of the above functions duplicates of __set_bit(),
__clear_bit() and test_bit()?

> +/* Expiration reasons. */
> +enum bfqq_expiration {
> +	BFQ_BFQQ_TOO_IDLE = 0,		/*
> +					 * queue has been idling for
> +					 * too long
> +					 */
> +	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */
> +	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */
> +	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */
> +	BFQ_BFQQ_PREEMPTED		/* preemption in progress */
> +};

The prefix of these constants refers twice to "BFQ" and does not make it
clear that these constants are about expiration. How about using the
"BFQQE_" prefix instead?

> +/* Maximum backwards seek, in KiB. */
> +static const int bfq_back_max = 16 * 1024;

Where does this constant come from? Should it depend on geometry data
like e.g. the number of sectors in a cylinder?

> +#define for_each_entity(entity)	\
> +	for (; entity ; entity = NULL)

Why has this confusing #define been introduced? Shouldn't all
occurrences of this macro be changed into the equivalent "if (entity)"?
We don't want silly macros like this in the Linux kernel.

> +#define for_each_entity_safe(entity, parent) \
> +	for (parent = NULL; entity ; entity = parent)

Same question here - why has this macro been introduced and how has its
name been chosen? Since this macro is used only once and since no value
is assigned to 'parent' in the code controlled by this construct, please
remove this macro and use something that is less confusing than a "for"
loop for something that is not a loop.

> +/**
> + * bfq_weight_to_ioprio - calc an ioprio from a weight.
> + * @weight: the weight value to convert.
> + *
> + * To preserve as much as possible the old only-ioprio user interface,
> + * 0 is used as an escape ioprio value for weights (numerically) equal or
> + * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF.
> + */
> +static unsigned short bfq_weight_to_ioprio(int weight)
> +{
> +	return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?
> +		0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;
> +}

Please consider using max() or max_t() to make this function less verbose.

> +
> +/**
> + * bfq_active_extract - remove an entity from the active tree.
> + * @st: the service_tree containing the tree.
> + * @entity: the entity being removed.
> + */
> +static void bfq_active_extract(struct bfq_service_tree *st,
> +			       struct bfq_entity *entity)
> +{
> +	struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
> +	struct rb_node *node;
> +
> +	node = bfq_find_deepest(&entity->rb_node);
> +	bfq_extract(&st->active, entity);
> +
> +	if (node)
> +		bfq_update_active_tree(node);
> +
> +	if (bfqq)
> +		list_del(&bfqq->bfqq_list);
> +}

Which locks protect the data structures manipulated by this and other
functions? Have you considered to use lockdep_assert_held() to document
these assumptions?

> +	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */

Please don't use parentheses if no confusion is possible. Additionally,
checkpatch should have requested you to insert a space before and after
the logical or operator.

> +static void __bfq_set_in_service_queue(struct bfq_data *bfqd,
> +				       struct bfq_queue *bfqq)
> +{
> +	if (bfqq) {
> +		bfq_mark_bfqq_budget_new(bfqq);
> +		bfq_clear_bfqq_fifo_expire(bfqq);
> +
> +		bfqd->budgets_assigned = (bfqd->budgets_assigned*7 + 256) / 8;

Checkpatch should have asked you to insert spaces around the
multiplication operator.

> +/*
> + * bfq_default_budget - return the default budget for @bfqq on @bfqd.
> + * @bfqd: the device descriptor.
> + * @bfqq: the queue to consider.
> + *
> + * We use 3/4 of the @bfqd maximum budget as the default value
> + * for the max_budget field of the queues.  This lets the feedback
> + * mechanism to start from some middle ground, then the behavior
> + * of the process will drive the heuristics towards high values, if
> + * it behaves as a greedy sequential reader, or towards small values
> + * if it shows a more intermittent behavior.
> + */
> +static unsigned long bfq_default_budget(struct bfq_data *bfqd,
> +					struct bfq_queue *bfqq)
> +{
> +	unsigned long budget;
> +
> +	/*
> +	 * When we need an estimate of the peak rate we need to avoid
> +	 * to give budgets that are too short due to previous measurements.
> +	 * So, in the first 10 assignments use a ``safe'' budget value.
> +	 */
> +	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget == 0)
> +		budget = bfq_default_max_budget;
> +	else
> +		budget = bfqd->bfq_max_budget;
> +
> +	return budget - budget / 4;
> +}

Where does the magic constant "194" come from?


> +	} else
> +		/*
> +		 * Async queues get always the maximum possible
> +		 * budget, as for them we do not care about latency
> +		 * (in addition, their ability to dispatch is limited
> +		 * by the charging factor).
> +		 */
> +		budget = bfqd->bfq_max_budget;
> +

Please balance braces. Checkpatch should have warned about the use of "}
else" instead of "} else {".

> +static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)
> +{
> +	unsigned long max_budget;
> +
> +	/*
> +	 * The max_budget calculated when autotuning is equal to the
> +	 * amount of sectors transferred in timeout at the
> +	 * estimated peak rate.
> +	 */
> +	max_budget = (unsigned long)(peak_rate * 1000 *
> +				     timeout >> BFQ_RATE_SHIFT);
> +
> +	return max_budget;
> +}

Where does the constant 1000 come from? What are the units of peak_rate
and timeout? What is the maximum value of peak_rate? Can the
multiplication overflow?

> +/*
> + * In addition to updating the peak rate, checks whether the process
> + * is "slow", and returns 1 if so. This slow flag is used, in addition
> + * to the budget timeout, to reduce the amount of service provided to
> + * seeky processes, and hence reduce their chances to lower the
> + * throughput. See the code for more details.
> + */
> +static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue *bfqq,
> +				 bool compensate)
> +{
> +	u64 bw, usecs, expected, timeout;
> +	ktime_t delta;
> +	int update = 0;
> +
> +	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))
> +		return false;
> +
> +	if (compensate)
> +		delta = bfqd->last_idling_start;
> +	else
> +		delta = ktime_get();
> +	delta = ktime_sub(delta, bfqd->last_budget_start);
> +	usecs = ktime_to_us(delta);
> +
> +	/* Don't trust short/unrealistic values. */
> +	if (usecs < 100 || usecs >= LONG_MAX)
> +		return false;

If usecs >= LONG_MAX that indicates a kernel bug. Please consider
triggering a kernel warning in that case.

> +/*
> + * Budget timeout is not implemented through a dedicated timer, but
> + * just checked on request arrivals and completions, as well as on
> + * idle timer expirations.
> + */
> +static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)
> +{
> +	if (bfq_bfqq_budget_new(bfqq) ||
> +	    time_before(jiffies, bfqq->budget_timeout))
> +		return false;
> +	return true;
> +}

Have you considered to use time_is_after_jiffies() instead of
time_before(jiffies, ...)?

> +static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
> +			  struct bfq_io_cq *bic, pid_t pid, int is_sync)
> +{
> +	RB_CLEAR_NODE(&bfqq->entity.rb_node);
> +	INIT_LIST_HEAD(&bfqq->fifo);
> +
> +	bfqq->ref = 0;
> +	bfqq->bfqd = bfqd;
> +
> +	if (bic)
> +		bfq_set_next_ioprio_data(bfqq, bic);
> +
> +	if (is_sync) {
> +		if (!bfq_class_idle(bfqq))
> +			bfq_mark_bfqq_idle_window(bfqq);
> +		bfq_mark_bfqq_sync(bfqq);
> +	} else
> +		bfq_clear_bfqq_sync(bfqq);
> +
> +	bfqq->ttime.last_end_request = ktime_get_ns() - (1ULL<<32);
> +
> +	bfq_mark_bfqq_IO_bound(bfqq);
> +
> +	bfqq->pid = pid;
> +
> +	/* Tentative initial value to trade off between thr and lat */
> +	bfqq->max_budget = bfq_default_budget(bfqd, bfqq);
> +	bfqq->budget_timeout = bfq_smallest_from_now();
> +	bfqq->pid = pid;
> +
> +	/* first request is almost certainly seeky */
> +	bfqq->seek_history = 1;
> +}

What is the meaning of the 1ULL << 32 constant?

> +static int __init bfq_init(void)
> +{
> +	int ret;
> +	char msg[50] = "BFQ I/O-scheduler: v0";

Please leave out "[50]" and use "static const char" instead of "char".

> diff --git a/block/elevator.c b/block/elevator.c
> index 01139f5..786fdcd 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -221,14 +221,20 @@ int elevator_init(struct request_queue *q, char *name)
>  
>  	if (!e) {
>  		/*
> -		 * For blk-mq devices, we default to using mq-deadline,
> -		 * if available, for single queue devices. If deadline
> -		 * isn't available OR we have multiple queues, default
> -		 * to "none".
> +		 * For blk-mq devices, we default to using bfq, if
> +		 * available, for single queue devices. If bfq isn't
> +		 * available, we try mq-deadline. If neither is
> +		 * available, OR we have multiple queues, default to
> +		 * "none".
>  		 */
>  		if (q->mq_ops) {
> +			if (q->nr_hw_queues == 1) {
> +				e = elevator_get("bfq", false);
> +				if (!e)
> +					e = elevator_get("mq-deadline", false);
> +			}
>  			if (q->nr_hw_queues == 1)
> -				e = elevator_get("mq-deadline", false);
> +				e = elevator_get("bfq", false);
>  			if (!e)
>  				return 0;
>  		} else
> 

As Jens wrote, it's way too early to make BFQ the default scheduler.

Bart.