linux-kernel - Re: [PATCH 7/8] wbt: add general throttling mechanism

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Tue, 3 May 2016 11:34:10 +0200
From:	Jan Kara <jack@...e.cz>
To:	Jens Axboe <axboe@...com>
Cc:	Jan Kara <jack@...e.cz>, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, linux-block@...r.kernel.org,
	dchinner@...hat.com, sedat.dilek@...il.com
Subject: Re: [PATCH 7/8] wbt: add general throttling mechanism

On Thu 28-04-16 12:53:50, Jens Axboe wrote:
> >2) As far as I can see in patch 8/8, you have plugged the throttling above
> >    the IO scheduler. When there are e.g. multiple cgroups with different IO
> >    limits operating, this throttling can lead to strange results (like a
> >    cgroup with low limit using up all available background "slots" and thus
> >    effectively stopping background writeback for other cgroups)? So won't
> >    it make more sense to plug this below the IO scheduler? Now I understand
> >    there may be other problems with this but I think we should put more
> >    though to that and provide some justification in changelogs.
> 
> One complexity is that we have to do this early for blk-mq, since once you
> get a request, you're already sitting on the hw tag. CoDel should actually
> work fine at each hop, so hopefully this will as well.

OK, I see. But then this suggests that any IO scheduling and / or
cgroup-related throttling should happen before we get a request for blk-mq
as well? And then we can still do writeback throttling below that layer?

> But yes, fairness is something that we have to pay attention to. Right now
> the wait queue has no priority associated with it, that should probably be
> improved to be able to wakeup in a more appropriate order.
> Needs testing, but hopefully it works out since if you do run into
> starvation, then you'll go to the back of the queue for the next attempt.

Yeah, once I'll hunt down that regression with old disk, I can have a look
into how writeback throttling plays together with blkio-controller.

> >>+static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
> >>+{
> >>+	u64 thislat;
> >>+
> >>+	/*
> >>+	 * If our stored sync issue exceeds the window size, or it
> >>+	 * exceeds our min target AND we haven't logged any entries,
> >>+	 * flag the latency as exceeded.
> >>+	 */
> >>+	thislat = rwb_sync_issue_lat(rwb);
> >>+	if (thislat > rwb->cur_win_nsec ||
> >>+	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
> >>+		trace_wbt_lat(rwb->bdi, thislat);
> >>+		return LAT_EXCEEDED;
> >>+	}
> >
> >So I'm trying to wrap my head around this. If I read the code right,
> >rwb_sync_issue_lat() this returns time that has passed since issuing sync
> >request that is still running. We basically randomly pick which sync
> >request we track as we always start tracking a sync request when some is
> >issued and we are not tracking any at that moment. This is to detect the
> >case when latency of sync IO is very large compared to measurement window
> >so we would not get enough samples to make it valid?
> 
> Right, that's pretty close. Since wbt uses the completion latencies to make
> decisions, if an IO hasn't completed, we don't know about it. If the device
> is flooded with writes, and we then issue a read, maybe that read won't
> complete for multiple monitoring windows. During that time, we keep thinking
> everything is fine. But in reality, it's not completing because of the write
> load. So this logic attempts to track the single sync IO request case. If
> that exceeds a monitoring window of time and we saw no other sync IO in that
> window, then treat that case as if it had completed but exceeded the min
> latency. And then scale back.
> 
> We'll always treat a state sample with 1 read as valuable, but for this
> case, we don't have that sample until it completes.
> 
> Does that make more sense?

OK, makes sense. Thanks for explanation.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR