linux-kernel - Re: [PATCH 7/8] wbt: add general throttling mechanism

Message-ID: <57225C3E.7060504@fb.com>
Date:	Thu, 28 Apr 2016 12:53:50 -0600
From:	Jens Axboe <axboe@...com>
To:	Jan Kara <jack@...e.cz>
CC:	<linux-kernel@...r.kernel.org>, <linux-fsdevel@...r.kernel.org>,
	<linux-block@...r.kernel.org>, <dchinner@...hat.com>,
	<sedat.dilek@...il.com>
Subject: Re: [PATCH 7/8] wbt: add general throttling mechanism

On 04/28/2016 05:05 AM, Jan Kara wrote:
> I have some comments below...
>
>> +struct rq_wb {
>> +	/*
>> +	 * Settings that govern how we throttle
>> +	 */
>> +	unsigned int wb_background;		/* background writeback */
>> +	unsigned int wb_normal;			/* normal writeback */
>> +	unsigned int wb_max;			/* max throughput writeback */
>> +	unsigned int scale_step;
>> +
>> +	u64 win_nsec;				/* default window size */
>> +	u64 cur_win_nsec;			/* current window size */
>> +
>> +	unsigned int unknown_cnt;
>
> It would be useful to have a comment here explaining that 'unknown_cnt' is
> a number of consecutive periods in which we didn't have enough data to
> decide about queue scaling (at least this is what I understood from the
> code).

Agree, I'll add that comment.

>> +
>> +	struct timer_list window_timer;
>> +
>> +	s64 sync_issue;
>> +	void *sync_cookie;
>
> So I'm somewhat wondering: What is protecting consistency of this
> structure? The limits, scale_step, cur_win_nsec, unknown_cnt are updated only
> from timer so those should be safe. However sync_issue & sync_cookie are
> accessed from IO submission and completion path and there we need some
> protection to keep those two in sync. It seems q->queue_lock should mostly
> achieve those except for blk-mq submission path calling wbt_wait() which
> doesn't hold queue_lock.

Right, it's designed such that only the timer will be updating these 
values, and that part is serialized. For sync_issue and sync_cookie, the 
important part there is that we never dereference sync_cookie. That's 
why it's a void * now. So we just use it as a hint. And yes, if the IO 
happens to complete at just the time we are looking at it, we could get 
a false positive or false negative. That's going to be noise, and 
nothing we need to worry about. It's deliberate that I don't do any 
locking for that. The only reason we pass in the queue_lock is to be 
able to drop it for sleeping.
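
As a rough illustration of the hint-only pattern described above, here is a user-space sketch. It is not the code from the patch: the struct and the three helper names are hypothetical, and only the field names sync_issue and sync_cookie come from the quoted struct. The point it shows is that the cookie is only ever compared for identity, never dereferenced, so a racy read can at worst give a wrong answer for one window.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields under discussion. */
struct sync_track {
	int64_t sync_issue;	/* issue time in ns, 0 when nothing is tracked */
	void *sync_cookie;	/* identity of the tracked request, never dereferenced */
};

/* Issue path: start tracking only if no request is tracked right now. */
void track_issue(struct sync_track *t, void *rq, int64_t now_ns)
{
	assert(rq != NULL);
	if (!t->sync_cookie) {
		t->sync_issue = now_ns;
		t->sync_cookie = rq;
	}
}

/* Completion path: stop tracking only if this is the tracked request. */
void track_complete(struct sync_track *t, void *rq)
{
	if (t->sync_cookie == rq) {
		t->sync_issue = 0;
		t->sync_cookie = NULL;
	}
}

/* Timer path: time the tracked request has been in flight, 0 if none. */
int64_t tracked_latency(const struct sync_track *t, int64_t now_ns)
{
	int64_t issue = t->sync_issue;

	if (!issue || !t->sync_cookie)
		return 0;
	return now_ns - issue;
}
```

Because the sketch takes no lock, two racing issuers can leave only one request tracked, which is exactly the benign race discussed in this thread.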

> It seems you were aware of the possible races and the code handles them
> mostly fine (although I wouldn't bet too much there is not some weird
> corner case). However it would be good to comment on this somewhere and
> explain what the rules for these two fields are.

Agree, it does warrant a good code comment. If we look at the edge 
cases, one would be:

We look at sync_issue and decide that we're now too late, at the same 
time as the sync_cookie gets cleared. For this case, we'll count it as 
an exceed and scale down. In reality we were late, so it doesn't matter. 
Even if it was the exact time, it's still prudent to scale down as we're 
going to miss soon.

A more worrying case would be two issues that happen at the same time, 
and only one gets set. Let's assume the one that doesn't get set is the 
one that ends up taking a long time to complete. We'll miss scaling down 
in this case, we'll only notice when it completes and shows up in the 
stats. Not ideal, but it's still being handled in the fashion that was 
originally intended, at completion time.

>> diff --git a/lib/wbt.c b/lib/wbt.c
>> new file mode 100644
>> index 000000000000..650da911f24f
>> --- /dev/null
>> +++ b/lib/wbt.c
>> @@ -0,0 +1,524 @@
>> +/*
>> + * buffered writeback throttling. loosely based on CoDel. We can't drop
>> + * packets for IO scheduling, so the logic is something like this:
>> + *
>> + * - Monitor latencies in a defined window of time.
>> + * - If the minimum latency in the above window exceeds some target, increment
>> + *   scaling step and scale down queue depth by a factor of 2x. The monitoring
>> + *   window is then shrunk to 100 / sqrt(scaling step + 1).
>> + * - For any window where we don't have solid data on what the latencies
>> + *   look like, retain status quo.
>> + * - If latencies look good, decrement scaling step.
>
> I'm wondering about two things:
>
> 1) There is a logic somewhat in this direction in blk_queue_start_tag().
>     Probably it should be removed after your patches land?

You're referring to the read/write separation in the legacy tagging? Yes 
agree, we can kill that once this goes in.

> 2) As far as I can see in patch 8/8, you have plugged the throttling above
>     the IO scheduler. When there are e.g. multiple cgroups with different IO
>     limits operating, this throttling can lead to strange results (like a
>     cgroup with low limit using up all available background "slots" and thus
>     effectively stopping background writeback for other cgroups)? So won't
>     it make more sense to plug this below the IO scheduler? Now I understand
>     there may be other problems with this but I think we should put more
>     thought into that and provide some justification in changelogs.

One complexity is that we have to do this early for blk-mq, since once 
you get a request, you're already sitting on the hw tag. CoDel should 
actually work fine at each hop, so hopefully this will as well.

But yes, fairness is something that we have to pay attention to. Right 
now the wait queue has no priority associated with it. That should 
probably be improved so that we can wake up waiters in a more appropriate order.
Needs testing, but hopefully it works out since if you do run into 
starvation, then you'll go to the back of the queue for the next attempt.

>> +static void calc_wb_limits(struct rq_wb *rwb)
>> +{
>> +	unsigned int depth;
>> +
>> +	if (!rwb->min_lat_nsec) {
>> +		rwb->wb_max = rwb->wb_normal = rwb->wb_background = 0;
>> +		return;
>> +	}
>> +
>> +	depth = min_t(unsigned int, RWB_MAX_DEPTH, rwb->queue_depth);
>> +
>> +	/*
>> +	 * Reduce max depth by 50%, and re-calculate normal/bg based on that
>> +	 */
>
> The comment looks a bit out of place here since we don't reduce max depth
> here. We just use whatever is set in scale_step...

True, it does get called for both scaling up and down now. I'll update 
the comment.
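
For reference, the scaling rule from the header comment quoted earlier (halve the depth per scaling step, shrink the window to base / sqrt(step + 1)) can be sketched in user space as below. This is an assumption-laden illustration, not the patch's calc_wb_limits(): the function names, the integer square root helper, the 4-bit fixed-point trick, and the clamp to a minimum depth of 1 are all choices made for the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: floor(sqrt(x)) using only integer arithmetic. */
uint64_t isqrt_u64(uint64_t x)
{
	uint64_t r = 0, bit = (uint64_t)1 << 62;

	while (bit > x)
		bit >>= 2;
	while (bit) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/*
 * Window for a given scaling step: base / sqrt(step + 1).
 * Both sides are scaled by 16 (sqrt of 256) to keep some precision.
 */
uint64_t scaled_window_nsec(uint64_t base_win_nsec, unsigned int scale_step)
{
	assert(base_win_nsec <= (UINT64_MAX >> 4));
	return (base_win_nsec << 4) / isqrt_u64(((uint64_t)scale_step + 1) << 8);
}

/* Depth for a given scaling step: halve once per step, never below 1. */
unsigned int scaled_depth(unsigned int queue_depth, unsigned int scale_step)
{
	unsigned int depth = scale_step < 32 ? queue_depth >> scale_step : 0;

	return depth ? depth : 1;
}
```

With a 100 msec base window, step 0 keeps the full 100 msec and step 3 gives 50 msec. A queue depth of 64 at step 2 becomes 16. Since both values are derived from the step alone, the same calculation serves scaling up and scaling down.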

>> +static int __latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
>> +{
>> +	u64 thislat;
>> +
>> +	/*
>> +	 * If our stored sync issue exceeds the window size, or it
>> +	 * exceeds our min target AND we haven't logged any entries,
>> +	 * flag the latency as exceeded.
>> +	 */
>> +	thislat = rwb_sync_issue_lat(rwb);
>> +	if (thislat > rwb->cur_win_nsec ||
>> +	    (thislat > rwb->min_lat_nsec && !stat[0].nr_samples)) {
>> +		trace_wbt_lat(rwb->bdi, thislat);
>> +		return LAT_EXCEEDED;
>> +	}
>
> So I'm trying to wrap my head around this. If I read the code right,
> rwb_sync_issue_lat() returns the time that has passed since issuing a sync
> request that is still running. We basically randomly pick which sync
> request we track as we always start tracking a sync request when some is
> issued and we are not tracking any at that moment. This is to detect the
> case when latency of sync IO is very large compared to measurement window
> so we would not get enough samples to make it valid?

Right, that's pretty close. Since wbt uses the completion latencies to 
make decisions, if an IO hasn't completed, we don't know about it. If 
the device is flooded with writes, and we then issue a read, maybe that 
read won't complete for multiple monitoring windows. During that time, 
we keep thinking everything is fine. But in reality, it's not completing 
because of the write load. So this logic attempts to track the single 
sync IO request case. If that exceeds a monitoring window of time and we 
saw no other sync IO in that window, then treat that case as if it had 
completed but exceeded the min latency. And then scale back.

We'll always treat a stat sample with 1 read as valuable, but for this 
case, we don't have that sample until it completes.
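
The condition in the quoted __latency_exceeded() hunk can be walked through with concrete numbers. The sketch below is a user-space restatement, not the patch itself: the function names are hypothetical, and it assumes stat[0] holds the read samples for the current window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Should a still-running sync request be counted as a latency miss?
 * inflight_nsec:   time since the tracked request was issued (0 if none)
 * cur_win_nsec:    current monitoring window
 * min_lat_nsec:    latency target
 * nr_read_samples: completed reads seen in this window (assumed stat[0])
 */
bool inflight_sync_exceeded(uint64_t inflight_nsec, uint64_t cur_win_nsec,
			    uint64_t min_lat_nsec, unsigned int nr_read_samples)
{
	/* Stuck for longer than a whole window: always a miss. */
	if (inflight_nsec > cur_win_nsec)
		return true;
	/* Over target, and no completed read exists to tell us otherwise. */
	return inflight_nsec > min_lat_nsec && nr_read_samples == 0;
}

/* Worked example: 100 msec window, 2 msec latency target. */
void inflight_sync_example(void)
{
	const uint64_t win = 100000000ULL, target = 2000000ULL;

	/* A read stuck for 250 msec behind writes is a miss regardless. */
	assert(inflight_sync_exceeded(250000000ULL, win, target, 5));
	/* 5 msec in flight with no completed reads: treated as a miss. */
	assert(inflight_sync_exceeded(5000000ULL, win, target, 0));
	/* 5 msec in flight, but completed reads exist: defer to the stats. */
	assert(!inflight_sync_exceeded(5000000ULL, win, target, 3));
	/* Nothing tracked: no opinion from this check. */
	assert(!inflight_sync_exceeded(0, win, target, 0));
}
```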

Does that make more sense?

> Probably the comment could explain more of "why we do this?" than pure
> "what we do".

Agree, if you find it confusing, then it needs updating. I'll update the 
comment.


-- 
Jens Axboe