lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <xhsmh7cvgnbdf.mognet@vschneid.remote.csb>
Date:   Thu, 16 Mar 2023 12:12:28 +0000
From:   Valentin Schneider <vschneid@...hat.com>
To:     Yicong Yang <yangyicong@...wei.com>, mingo@...hat.com,
        peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, linux-kernel@...r.kernel.org
Cc:     yangyicong@...ilicon.com, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, linuxarm@...wei.com, prime.zeng@...wei.com,
        wangjie125@...wei.com
Subject: Re: [PATCH] sched/fair: Don't balance migration disabled tasks

On 16/03/23 17:13, Yicong Yang wrote:
> Hi Valentin,
>
> On 2023/3/15 23:34, Valentin Schneider wrote:
>> That cpumask check should cover migration_disabled tasks, unless they
>> haven't gone through migrate_disable_switch() yet
>> (p->migration_disabled == 1, but the cpus_ptr hasn't been touched yet).
>>
>> Now, if that's the case, the task has to be src_rq's current (since it
>> hasn't switched out), which means can_migrate_task() should exit via:
>>
>>         if (task_on_cpu(env->src_rq, p)) {
>>                 schedstat_inc(p->stats.nr_failed_migrations_running);
>>                 return 0;
>>         }
>>
>> and thus not try to detach_task(). With that in mind, I don't get how your
>> splat can happen, nor how the change change can help (a remote task p could
>> execute migrate_disable() concurrently with can_migrate_task(p)).
>>
>
> I see, for migrate disabled tasks, if !p->on_cpu the migration can be avoid by
> the cpus_ptr check and if p->on_cpu migration can be avoid by the task_on_cpu()
> check. So this patch won't help.
>

Right.

>> I'm a bit confused here, detach_tasks() happens entirely with src_rq
>> rq_lock held, so there shouldn't be any surprises.
>>
>
> Since it's a arm64 machine, could the WARN_ON_ONCE() test be false positive?
> I mean the update of p->migration_disabled is not observed by the balance
> CPU and trigger this warning incorrectly.
>

Since the balance CPU holds the src_rq's rq_lock for the entire duration of
detach_tasks(), the actual value of p->migration_disabled shouldn't matter.

Either p has been scheduled out while being migration_disabled, in which
case its ->cpus_ptr has been updated, or p is still running, in which case
can_migrate_task() should return false (p->on_cpu). But from your report,
we're somehow not seeing one of these.

>> Can you share any extra context? E.g. exact HEAD of your tree, maybe the
>> migrate_disable task in question if you have that info.
>>
>
> We met this on our internal version based on 6.1-rc4, the scheduler is not
> patched. We also saw this in previous version like 6.0. This patch is applied
> since 6.2 so we haven't seen this recently, but as you point out this patch doesn't
> solve the problem.

Could you try to reproduce this on an unpatched upstream kernel?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ