linux-kernel - Re: [RFC 2/3] blk-throttling: weight based throttling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20160121210043.GA3696942@devbig084.prn1.facebook.com>
Date:	Thu, 21 Jan 2016 13:00:44 -0800
From:	Shaohua Li <shli@...com>
To:	Vivek Goyal <vgoyal@...hat.com>
CC:	<linux-kernel@...r.kernel.org>, <axboe@...nel.dk>, <tj@...nel.org>,
	<jmoyer@...hat.com>, <Kernel-team@...com>
Subject: Re: [RFC 2/3] blk-throttling: weight based throttling

On Thu, Jan 21, 2016 at 03:33:32PM -0500, Vivek Goyal wrote:
> On Wed, Jan 20, 2016 at 09:49:18AM -0800, Shaohua Li wrote:
> > We know total bandwidth of a disk and can calculate cgroup's bandwidth
> > percentage against disk bandwidth according to its weight. We can easily
> > calculate cgroup bandwidth.
> > 
> > Signed-off-by: Shaohua Li <shli@...com>
> > ---
> >  block/blk-throttle.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 134 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index 2149a1d..b3f847d 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -12,6 +12,9 @@
> >  #include <linux/blk-cgroup.h>
> >  #include "blk.h"
> >  
> > +#define MAX_WEIGHT (1000)
> > +#define WEIGHT_RATIO_SHIFT (12)
> > +#define WEIGHT_RATIO (1 << WEIGHT_RATIO_SHIFT)
> >  /* Max dispatch from a group in 1 round */
> >  static int throtl_grp_quantum = 8;
> >  
> > @@ -74,6 +77,10 @@ struct throtl_service_queue {
> >  	unsigned int		nr_pending;	/* # queued in the tree */
> >  	unsigned long		first_pending_disptime;	/* disptime of the first tg */
> >  	struct timer_list	pending_timer;	/* fires on first_pending_disptime */
> > +
> > +	unsigned int		weight;
> > +	unsigned int		children_weight;
> > +	unsigned int		ratio;
> 
> Will it be better to call it "share" instead of "ratio". It is basically
> a measure of % disk share of the group and share seems more intuitive.

Ok
> 
> [..]
> > +static void tg_update_bps(struct throtl_grp *tg)
> > +{
> > +	struct throtl_service_queue *sq, *parent_sq;
> > +
> > +	sq = &tg->service_queue;
> > +	parent_sq = sq->parent_sq;
> > +
> > +	if (!tg->td->weight_based || !parent_sq)
> > +		return;
> > +	sq->ratio = max_t(unsigned int,
> > +		parent_sq->ratio * sq->weight / parent_sq->children_weight,
> > +		1);
> > +
> 
> It might be good to decouple updation of "share/ratio" and updation of
> bps. Change of share can happen any time either weight is changed or
> an active group is queue/dequeued and we don't have to do it every time
> a bio is submitted.

Ok
> > +	tg->bps[READ] = max_t(uint64_t,
> > +		(queue_bandwidth(tg->td, READ) * sq->ratio) >>
> > +			WEIGHT_RATIO_SHIFT,
> > +		1024);
> > +	tg->bps[WRITE] = max_t(uint64_t,
> > +		(queue_bandwidth(tg->td, WRITE) * sq->ratio) >>
> > +			WEIGHT_RATIO_SHIFT,
> > +		1024);
> > +}
> > +
> > +static void tg_update_ratio(struct throtl_grp *tg)
> > +{
> > +	struct throtl_data *td = tg->td;
> > +	struct cgroup_subsys_state *pos_css;
> > +	struct blkcg_gq *blkg;
> > +
> > +	blkg_for_each_descendant_pre(blkg, pos_css, td->queue->root_blkg) {
> 
> Is it possible to traverse only the affected subtree instead of whole
> tree of groups. Because if weight is updated on a group, then we just
> need to traverse the subtree under that group's parent.

makes sense
> [..]
> > @@ -1415,6 +1546,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
> >  
> >  	sq = &tg->service_queue;
> >  
> > +	tg_update_bps(tg);
> 
> Updating bps for every bio submitted sounds like a lot. We probably could
> do it when first bio gets queued in the group and then refresh it at
> some regular interval. Say when next set of dispatch happens from group
> we could update bandwidth of group after dispatch.

That calculation isn't very heavy. I'll revisit this if it's a problem.

Thanks,
Shaohua