Message-ID: <20210204153040.qqkoa5sjztqeskoc@e107158-lin>
Date:   Thu, 4 Feb 2021 15:30:40 +0000
From:   Qais Yousef <qais.yousef@....com>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     linux-kernel@...r.kernel.org, tglx@...utronix.de, mingo@...nel.org,
        bigeasy@...utronix.de, swood@...hat.com, peterz@...radead.org,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, vincent.donnefort@....com,
        tj@...nel.org
Subject: Re: [RFC PATCH] sched/core: Fix premature p->migration_pending
 completion

On 02/03/21 18:59, Valentin Schneider wrote:
> On 03/02/21 17:23, Qais Yousef wrote:
> > On 01/27/21 19:30, Valentin Schneider wrote:
> >> Fiddling some more with a TLA+ model of set_cpus_allowed_ptr() & friends
> >> unearthed one more outstanding issue. This doesn't even involve
> >> migrate_disable(), but rather affinity changes and execution of the stopper
> >> racing with each other.
> >> 
> >> My own interpretation of the (lengthy) TLA+ splat (note the potential for
> >> errors at each level) is:
> >> 
> >>   Initial conditions:
> >>     victim.cpus_mask = {CPU0, CPU1}
> >> 
> >>   CPU0                             CPU1                             CPU<don't care>
> >> 
> >>   switch_to(victim)
> >> 								    set_cpus_allowed(victim, {CPU1})
> >> 								      kick CPU0 migration_cpu_stop({.dest_cpu = CPU1})
> >>   switch_to(stopper/0)
> >> 								    // e.g. CFS load balance
> >> 								    move_queued_task(CPU0, victim, CPU1);
> >> 				   switch_to(victim)
> >> 								    set_cpus_allowed(victim, {CPU0});
> >> 								      task_rq_unlock();
> >>   migration_cpu_stop(dest_cpu=CPU1)
> >
> > This migration stop is due to set_cpus_allowed(victim, {CPU1}), right?
> >
> 
> Right
> 
> >>     task_rq(p) != rq && pending
> >>       kick CPU1 migration_cpu_stop({.dest_cpu = CPU1})
> >> 
> >> 				   switch_to(stopper/1)
> >> 				   migration_cpu_stop(dest_cpu=CPU1)
> >
> > And this migration stop is due to set_cpus_allowed(victim, {CPU0}), right?
> >
> 
> No! This is a retriggering of the "current" stopper (the one triggered by
> set_cpus_allowed(victim, {CPU1})). See the tail of the
> 
>   else if (dest_cpu < 0 || pending)
> 
> branch in migration_cpu_stop(); that is what I'm trying to hint at with
> 
>   task_rq(p) != rq && pending

Okay, I see. But AFAIU the stopper work is queued in order, so we should
handle the set_cpus_allowed_ptr(victim, {CPU0}) stop first, before the
retrigger, no?

So I see migration_cpu_stop() running 3 times:

	1. because of set_cpus_allowed(victim, {CPU1}) on CPU0
	2. because of set_cpus_allowed(victim, {CPU0}) on CPU1
	3. because of retrigger of '1' on CPU0
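
For reference, this is roughly the shape of the retrigger tail I'm reading,
heavily simplified and paraphrased (not the actual kernel/sched/core.c code;
locking details, error handling and the migrate_enable() path are elided):

	static int migration_cpu_stop(void *data)
	{
		struct migration_arg *arg = data;
		struct task_struct *p = arg->task;
		struct set_affinity_pending *pending;
		struct rq *rq = this_rq();

		/* ... take p->pi_lock and rq->lock ... */
		pending = p->migration_pending;

		if (task_rq(p) == rq) {
			/* p is still here: move it towards dest_cpu (or set
			 * p->wake_cpu) and complete @pending. */
		} else if (arg->dest_cpu < 0 || pending) {
			/*
			 * p moved away before this stopper ran: re-queue the
			 * stopper on p's current CPU, i.e. the "retrigger"
			 * in 3 above.
			 */
			stop_one_cpu_nowait(task_cpu(p), migration_cpu_stop,
					    &pending->arg, &pending->stop_work);
			return 0;
		}

		/* ... unlock, complete ... */
		return 0;
	}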

Thanks

--
Qais Yousef
