Message-ID: <20210204153040.qqkoa5sjztqeskoc@e107158-lin>
Date:   Thu, 4 Feb 2021 15:30:40 +0000
From:   Qais Yousef <qais.yousef@....com>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     linux-kernel@...r.kernel.org, tglx@...utronix.de, mingo@...nel.org,
        bigeasy@...utronix.de, swood@...hat.com, peterz@...radead.org,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, vincent.donnefort@....com,
        tj@...nel.org
Subject: Re: [RFC PATCH] sched/core: Fix premature p->migration_pending
 completion

On 02/03/21 18:59, Valentin Schneider wrote:
> On 03/02/21 17:23, Qais Yousef wrote:
> > On 01/27/21 19:30, Valentin Schneider wrote:
> >> Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
> >> unearthed one more outstanding issue. This doesn't even involve
> >> migrate_disable(), but rather affinity changes and execution of the stopper
> >> racing with each other.
> >> 
> >> My own interpretation of the (lengthy) TLA+ splat (note the potential for
> >> errors at each level) is:
> >> 
> >>   Initial conditions:
> >>     victim.cpus_mask = {CPU0, CPU1}
> >> 
> >>   CPU0                             CPU1                             CPU<don't care>
> >> 
> >>   switch_to(victim)
> >> 								    set_cpus_allowed(victim, {CPU1})
> >> 								      kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
> >>   switch_to(stopper/0)
> >> 								    // e.g. CFS load balance
> >> 								    move_queued_task(CPU0, victim, CPU1);
> >> 				   switch_to(victim)
> >> 								    set_cpus_allowed(victim, {CPU0});
> >> 								      task_rq_unlock();
> >>   migration_cpu_stop(dest_cpu=CPU1)
> >
> > This migration stop is due to set_cpus_allowed(victim, {CPU1}), right?
> >
> 
> Right
> 
> >>     task_rq(p) != rq && pending
> >>       kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})
> >> 
> >> 				   switch_to(stopper/1)
> >> 				   migration_cpu_stop(dest_cpu=CPU1)
> >
> > And this migration stop is due to set_cpus_allowed(victim, {CPU0}), right?
> >
> 
> No! This is a retriggering of the "current" stopper (the one triggered by
> set_cpus_allowed(victim, {CPU1})). See the tail of the
> 
>   else if (dest_cpu < 0 || pending)
> 
> branch in migration_cpu_stop(); that is what I'm trying to hint at with
> 
>   task_rq(p) != rq && pending

Okay, I see. But AFAIU the stopper work is queued in order, so we should
handle the set_cpus_allowed_ptr(victim, {CPU0}) stop first, before the
retrigger, no?

So I see migration_cpu_stop() running 3 times:

	1. because of set_cpus_allowed(victim, {CPU1}) on CPU0
	2. because of set_cpus_allowed(victim, {CPU0}) on CPU1
	3. because of retrigger of '1' on CPU0
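
For reference, this is roughly the shape of the retrigger tail I'm reading,
heavily simplified and paraphrased (not the actual kernel/sched/core.c code;
locking details, error handling and the migrate_enable() path are elided):

	static int migration_cpu_stop(void *data)
	{
		struct migration_arg *arg = data;
		struct task_struct *p = arg->task;
		struct set_affinity_pending *pending;
		struct rq *rq = this_rq();

		/* ... take p->pi_lock and rq->lock ... */
		pending = p->migration_pending;

		if (task_rq(p) == rq) {
			/* p is still here: move it towards dest_cpu (or set
			 * p->wake_cpu) and complete @pending. */
		} else if (arg->dest_cpu < 0 || pending) {
			/*
			 * p moved away before this stopper ran: re-queue the
			 * stopper on p's current CPU, i.e. the "retrigger"
			 * in 3 above.
			 */
			stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop,
					    &pending->arg, &pending->stop_work);
			return 0;
		}

		/* ... unlock, complete ... */
		return 0;
	}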

Thanks

--
Qais Yousef
