linux-kernel - Re: BUG: HANG_DETECT waiting for migration_cpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <e6153b89-1f41-3fff-241b-a767e41a1e7e@quicinc.com>
Date:   Fri, 23 Sep 2022 19:50:04 +0530
From:   Mukesh Ojha <quic_mojha@...cinc.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Waiman Long <longman@...hat.com>
CC:     Tejun Heo <tj@...nel.org>,
        Jing-Ting Wu <jing-ting.wu@...iatek.com>,
        Valentin Schneider <vschneid@...hat.com>,
        <wsd_upstream@...iatek.com>, <linux-kernel@...r.kernel.org>,
        <linux-arm-kernel@...ts.infradead.org>,
        <linux-mediatek@...ts.infradead.org>,
        <Jonathan.JMChen@...iatek.com>,
        "chris.redpath@....com" <chris.redpath@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Vincent Donnefort <vdonnefort@...il.com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Christian Brauner <brauner@...nel.org>,
        <cgroups@...r.kernel.org>, <lixiong.liu@...iatek.com>,
        <wenju.xu@...iatek.com>
Subject: Re: BUG: HANG_DETECT waiting for migration_cpu_stop() complete

Hi Peter,


On 9/7/2022 2:20 AM, Peter Zijlstra wrote:
> On Tue, Sep 06, 2022 at 04:40:03PM -0400, Waiman Long wrote:
> 
> I've not followed the earlier stuff due to being unreadable; just
> reacting to this..

We are able to reproduce this issue explained at this link

https://lore.kernel.org/lkml/88b2910181bda955ac46011b695c53f7da39ac47.camel@mediatek.com/


> 
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 838623b68031..5d9ea1553ec0 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2794,9 +2794,9 @@ static int __set_cpus_allowed_ptr_locked(struct
>> task_struct *p,
>>                  if (cpumask_equal(&p->cpus_mask, new_mask))
>>                          goto out;
>>
>> -               if (WARN_ON_ONCE(p == current &&
>> -                                is_migration_disabled(p) &&
>> -                                !cpumask_test_cpu(task_cpu(p), new_mask)))
>> {
>> +               if (is_migration_disabled(p) &&
>> +                   !cpumask_test_cpu(task_cpu(p), new_mask)) {
>> +                       WARN_ON_ONCE(p == current);
>>                          ret = -EBUSY;
>>                          goto out;
>>                  }
>> @@ -2818,7 +2818,11 @@ static int __set_cpus_allowed_ptr_locked(struct
>> task_struct *p,
>>          if (flags & SCA_USER)
>>                  user_mask = clear_user_cpus_ptr(p);
>>
>> -       ret = affine_move_task(rq, p, rf, dest_cpu, flags);
>> +       if (!is_migration_disabled(p) || (flags & SCA_MIGRATE_ENABLE)) {
>> +               ret = affine_move_task(rq, p, rf, dest_cpu, flags);
>> +       } else {
>> +               task_rq_unlock(rq, p, rf);
>> +       }
> 
> This cannot be right. There might be previous set_cpus_allowed_ptr()
> callers that are blocked and waiting for the task to land on a valid
> CPU.
> 

Was thinking if just skipping as below will help here, well i am not sure .

But thinking what if we keep the task as it is on the same cpu and let's 
wait for migration to be enabled for the task to take care of it later.

------------------->O------------------------------------------

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d90d37c..7717733 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2390,8 +2390,10 @@ static int migration_cpu_stop(void *data)
          * we're holding p->pi_lock.
          */
         if (task_rq(p) == rq) {
-               if (is_migration_disabled(p))
+               if (is_migration_disabled(p)) {
+                       complete = true;
                         goto out;
+               }

                 if (pending) {


-Mukesh