Date:	Fri, 16 May 2014 10:57:59 -0700
From:	bsegall@...gle.com
To:	Roman Gushchin <klamm@...dex-team.ru>
Cc:	linux-kernel@...r.kernel.org, peterz@...radead.org, pjt@...gle.com,
	chris.j.arges@...onical.com, gregkh@...uxfoundation.org
Subject: Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock

Roman Gushchin <klamm@...dex-team.ru> writes:

> At Thu, 15 May 2014 10:43:14 -0700,
> bsegall@...gle.com wrote:
>> 
>> Roman Gushchin <klamm@...dex-team.ru> writes:
>> 
>> > tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
>> > force the period timer restart. This is not safe, because it
>> > can lead to the deadlock described in commit 927b54fccbf0:
>> > "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
>> > waiting for the hrtimer to finish. However, if sched_cfs_period_timer
>> > runs for another loop iteration, the hrtimer can attempt to take
>> > rq->lock, resulting in deadlock."
>> > If tg_set_cfs_bandwidth() resets cfs_b->timer_active concurrently
>> > with the second iteration of sched_cfs_period_timer, a deadlock occurs.
>> 
>> I do not see this deadlock. cfs_b->timer_active is protected by
>> cfs_b->lock on both read and write, and I don't think it would even
>> directly cause a deadlock if it weren't. In particular, this patch makes
>> us run for strictly longer outside of the interrupt (although I think
>> the patched version is also correct). The old issue was calling
>> hrtimer_cancel, which meant we held cfs_b->lock while waiting for the
>> interrupt to complete, while the interrupt was waiting to take
>> cfs_b->lock.
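(As a side note, a hypothetical userspace analogue of that old pattern
deadlocks the same way: holding the lock while waiting for a handler that
needs it is already enough. The names and the pthread framing below are
illustrative stand-ins for cfs_b->lock and the hrtimer, not kernel code;
the program hangs by design when run.)

/* Hypothetical userspace analogue (NOT kernel code) of the old
 * hrtimer_cancel deadlock: the canceller holds the lock the handler
 * needs, then blocks waiting for the handler to finish.
 * Build: cc -pthread deadlock-demo.c; running it hangs by design. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cfs_b_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for the period timer callback: first thing it wants is the lock. */
static void *timer_handler(void *arg)
{
	pthread_mutex_lock(&cfs_b_lock);	/* blocks: main holds it */
	pthread_mutex_unlock(&cfs_b_lock);
	return arg;
}

int main(void)
{
	pthread_t handler;

	pthread_mutex_lock(&cfs_b_lock);	/* hold "cfs_b->lock"... */
	pthread_create(&handler, NULL, timer_handler, NULL);
	puts("waiting for the handler while holding the lock...");
	pthread_join(handler, NULL);		/* ...like hrtimer_cancel(): wait
						 * for a handler that waits for us */
	pthread_mutex_unlock(&cfs_b_lock);
	return 0;
}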
>
> I still think there is a deadlock. I'll try to explain.
> Three CPUs must be involved:
> CPU0                            CPU1                            CPU2
> take rq->lock                   period timer fired
> ...                             take cfs_b->lock
> ...                             ...                             tg_set_cfs_bandwidth()
> throttle_cfs_rq()               release cfs_b->lock             take cfs_b->lock
> ...                             distribute_cfs_runtime()        timer_active = 0
> take cfs_b->lock                wait for rq->lock               ...
> __start_cfs_bandwidth()
> {wait for timer callback
>  break if timer_active == 1}
>
> So, CPU0 and CPU1 are deadlocked.

Ahhh, yes, there is a three-CPU race here. Could you clarify in the
commit message that this is a race involving all three of
tg_set_cfs_bandwidth(), the period timer, and a second
__start_cfs_bandwidth() call?

The code looks good, however.
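For anyone following along, this is roughly the pre-patch wait loop in
__start_cfs_bandwidth() that CPU0 is stuck in (condensed and paraphrased
from kernel/sched/fair.c of that era, not verbatim). CPU0 enters it
holding rq->lock, and the only exits are the callback finishing or
timer_active becoming 1:

	/* Condensed paraphrase of the pre-patch __start_cfs_bandwidth();
	 * CPU0 arrives here from throttle_cfs_rq() holding rq->lock. */
	while (hrtimer_active(&cfs_b->period_timer)) {	/* CPU1 is in the callback */
		raw_spin_unlock(&cfs_b->lock);
		cpu_relax();
		raw_spin_lock(&cfs_b->lock);
		/* if someone else restarted the timer then we're done */
		if (cfs_b->timer_active)	/* never true: CPU2 cleared it, and
						 * nobody can set it back to 1 until
						 * the callback on CPU1 finishes */
			return;
	}
	/* never reached: CPU1's callback needs rq->lock, which CPU0 holds */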

>> I just did notice, however, that sched_cfs_period_timer reads
>> cfs_b->period without any locks, which can in fact race with an update
>> to period (and isn't fixed by this patch). If this write doesn't happen
>> to be atomic (which is certainly not guaranteed, and is entirely
>> plausible on, say, 32-bit x86, much less other archs), we could read a
>> partially written value and move the timer incorrectly. Pulling the
>> lock/unlock out of do_sched_cfs_period_timer should fix this easily
>> enough.
>
> Looks like another issue.

Yeah.
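For the record, a quick userspace sketch of the torn-read hazard. The
names, values, and pthread framing are illustrative, not the kernel code,
and formally the race is undefined behaviour in C, which is rather the
point. On a 32-bit build the reader can observe a value that is half old,
half new:

/* Hypothetical demo of a torn 64-bit read on a 32-bit target: one thread
 * flips a u64 between two values whose halves differ; an unlocked reader
 * can see a mix of the two. Build (32-bit): cc -O2 -pthread -m32 torn.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define A 0x0000000100000000ULL		/* halves: 1, 0 */
#define B 0x00000000ffffffffULL		/* halves: 0, ~0 */

static volatile uint64_t period_ns = A;	/* like cfs_b->period, written unlocked */

static void *writer(void *arg)
{
	for (;;) {
		period_ns = A;
		period_ns = B;
	}
	return arg;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, writer, NULL);
	for (;;) {
		uint64_t v = period_ns;	/* unlocked read, like the timer's */
		if (v != A && v != B) {	/* halves came from different writes */
			printf("torn read: %#llx\n", (unsigned long long)v);
			return 1;
		}
	}
}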
>
> Thanks,
> Roman
>
>> 
>> unthrottle_offline_cfs_rqs could cause similar issues with its unlocked
>> read of quota, and can also be easily fixed.
>> 
>> 
>> >
>> > Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
>> > wait for period timer callbacks (ignoring cfs_b->timer_active) and
>> > restart the timer explicitly.
>> >
>> > Signed-off-by: Roman Gushchin <klamm@...dex-team.ru>
>> > Cc: Ben Segall <bsegall@...gle.com>
>> > Cc: Peter Zijlstra <peterz@...radead.org>
>> > Cc: pjt@...gle.com
>> > Cc: Chris J Arges <chris.j.arges@...onical.com>
>> > Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
>> > ---
>> >  kernel/sched/core.c  |    3 +--
>> >  kernel/sched/fair.c  |    8 ++++----
>> >  kernel/sched/sched.h |    2 +-
>> >  3 files changed, 6 insertions(+), 7 deletions(-)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index d9d8ece..e9b9c66 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -7717,8 +7717,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> >  	/* restart the period timer (if active) to handle new period expiry */
>> >  	if (runtime_enabled && cfs_b->timer_active) {
>> >  		/* force a reprogram */
>> > -		cfs_b->timer_active = 0;
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, true);
>> >  	}
>> >  	raw_spin_unlock_irq(&cfs_b->lock);
>> >  
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 7570dd9..55a0a5b 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -3129,7 +3129,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> >  		 */
>> >  		if (!cfs_b->timer_active) {
>> >  			__refill_cfs_bandwidth_runtime(cfs_b);
>> > -			__start_cfs_bandwidth(cfs_b);
>> > +			__start_cfs_bandwidth(cfs_b, false);
>> >  		}
>> >  
>> >  		if (cfs_b->runtime > 0) {
>> > @@ -3308,7 +3308,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> >  	raw_spin_lock(&cfs_b->lock);
>> >  	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
>> >  	if (!cfs_b->timer_active)
>> > -		__start_cfs_bandwidth(cfs_b);
>> > +		__start_cfs_bandwidth(cfs_b, false);
>> >  	raw_spin_unlock(&cfs_b->lock);
>> >  }
>> >  
>> > @@ -3690,7 +3690,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>> >  }
>> >  
>> >  /* requires cfs_b->lock, may release to reprogram timer */
>> > -void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> > +void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
>> >  {
>> >  	/*
>> >  	 * The timer may be active because we're trying to set a new bandwidth
>> > @@ -3705,7 +3705,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> >  		cpu_relax();
>> >  		raw_spin_lock(&cfs_b->lock);
>> >  		/* if someone else restarted the timer then we're done */
>> > -		if (cfs_b->timer_active)
>> > +		if (!force && cfs_b->timer_active)
>> >  			return;
>> >  	}
>> >  
>> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> > index 456e492..369b4d6 100644
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -278,7 +278,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>> >  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
>> >  
>> >  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>> > -extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
>> > +extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
>> >  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>> >  
>> >  extern void free_rt_sched_group(struct task_group *tg);