linux-kernel - Re: [RFC] sched/fair: hard lockup in sched_cfs_period

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <xm26ef7ay8ie.fsf@bsegall-linux.svl.corp.google.com>
Date:   Wed, 13 Mar 2019 10:44:09 -0700
From:   bsegall@...gle.com
To:     Phil Auld <pauld@...hat.com>
Cc:     mingo@...hat.com, peterz@...radead.org,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC]  sched/fair: hard lockup in sched_cfs_period_timer

Phil Auld <pauld@...hat.com> writes:

> On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
>> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@...gle.com wrote:
>> > Letting it spin for 100ms and then only increasing by 6% seems extremely
>> > generous. If we went this route I'd probably say "after looping N
>> > times, set the period to time taken / N + X%" where N is like 8 or
>> > something. I think I'd probably perfer something like this to the
>> > previous "just abort and let it happen again next interrupt" one.
>> 
>> Okay. I'll try to spin something up that does this. It may be a little 
>> trickier to keep the quota proportional to the new period. I think that's 
>> important since we'll be changing the user's setting.
>> 
>> Do you mean to have it break when it hits N and recalculates the period or 
>> reset the counter and keep going?
>> 
>
> Let me know what you think of the below. It's working nicely. I like your 
> suggestion to limit it quickly based on number of loops and use that to 
> scale up. I think it is best to break out and let it fire again if needed. 
> The warning fires once, very occasionally twice, and then things are quiet.
>
> If that looks reasonable I'll do some more testing and spin it up as a real 
> patch submission. 

Yeah, this looks reasonable. I should probably see how unreasonable the
other thing would be, but if your previous periods were kinda small (and
it's just that the machine crashing isn't an ok failure mode) I suppose
it's not a big deal.

>
> Cheers,
> Phil
> ---
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 310d0637fe4b..54b30adfc89e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>  	return HRTIMER_NORESTART;
>  }
>  
> +extern const u64 max_cfs_quota_period;
> +int cfs_period_autotune_loop_limit   = 8;
> +int cfs_period_autotune_cushion_pct  = 15; /* percentage added to period recalculation */
> +
>  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>  {
>  	struct cfs_bandwidth *cfs_b =
>  		container_of(timer, struct cfs_bandwidth, period_timer);
> +	s64 nsstart, nsnow, new_period;
>  	int overrun;
>  	int idle = 0;
> +	int count = 0;
>  
>  	raw_spin_lock(&cfs_b->lock);
> +	nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
>  	for (;;) {
>  		overrun = hrtimer_forward_now(timer, cfs_b->period);
>  		if (!overrun)
>  			break;
>  
> +		if (++count > cfs_period_autotune_loop_limit) {
> +			ktime_t old_period = ktime_to_ns(cfs_b->period);
> +
> +			nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
> +			new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
> +
> +			/*  Make sure new period will be larger than old. */
> +			if (new_period < old_period) {
> +				new_period = old_period;
> +			}
> +			new_period += (new_period *  cfs_period_autotune_cushion_pct) / 100;

This ordering means that it will always increase by at least 15%. This
is a bit odd but probably a good thing; I'd just change the comment to
make it clear this is deliberate.

> +
> +			if (new_period >  max_cfs_quota_period)
> +				new_period = max_cfs_quota_period;
> +
> +			cfs_b->period = ns_to_ktime(new_period);
> +			cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;

In general it makes sense to do fixed point via 1024 or something that
can be optimized into shifts (and a larger number is better in general
for better precision).

> +			pr_warn_ratelimited(
> +				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> +				smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> +
> +			idle = 0;
> +			break;
> +		}
> +
>  		idle = do_sched_cfs_period_timer(cfs_b, overrun);
>  	}
>  	if (idle)