lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 25 Sep 2020 18:49:26 +0100
From:   Valentin Schneider <valentin.schneider@....com>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     Peter Zijlstra <peterz@...radead.org>, tglx@...utronix.de,
        mingo@...nel.org, linux-kernel@...r.kernel.org,
        bigeasy@...utronix.de, qais.yousef@....com, swood@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vincent.donnefort@....com
Subject: Re: [PATCH 0/9] sched: Migrate disable support


On 25/09/20 13:19, Valentin Schneider wrote:
> On 25/09/20 12:58, Dietmar Eggemann wrote:
>> With Valentin's print_rq() inspired test snippet I always see one of the
>> RT user tasks as the second guy? BTW, it has to be RT tasks, never
>> triggered with CFS tasks.
>>
>> [   57.849268] CPU2 nr_running=2
>> [   57.852241]  p=migration/2
>> [   57.854967]  p=task0-0
>
> I can also trigger the BUG_ON() using the built-in locktorture module
> (+enabling hotplug torture), and it happens very early on. I can't trigger
> it under qemu sadly :/ Also, in my case it's always a kworker:
>
> [    0.830462] CPU3 nr_running=2
> [    0.833443]  p=migration/3
> [    0.836150]  p=kworker/3:0
>
> I'm looking into what workqueue.c is doing about hotplug...

So with
- The pending migration fixup (20200925095615.GA2651@...ez.programming.kicks-ass.net)
- The workqueue set_cpus_allowed_ptr() change (from IRC)
- The set_rq_offline() move + DL/RT pull && rq->online (also from IRC)

my Juno survives rtmutex + hotplug locktorture, where it would previously
explode < 1s after boot (mostly due to the workqueue thing).

I stared a bit more at the rq_offline() + DL/RT bits and they look fine to
me.

The one thing I'm not entirely sure about is while you plugged the
class->balance() hole, AIUI we might still get RT (DL?) pull callbacks
enqueued - say if we just unthrottled an RT RQ and something changes the
priority of one of the freshly-released tasks (user or rtmutex
interaction), I don't see any stopgap preventing a pull from happening.

I slapped the following on top of my kernel and it didn't die, although I'm
not sure I'm correctly stressing this path. Perhaps we could limit that to
the pull paths, since technically we're okay with pushing out of an !online
RQ.

---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50aac5b6db26..00d1a7b85e97 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1403,7 +1403,7 @@ queue_balance_callback(struct rq *rq,
 {
        lockdep_assert_held(&rq->lock);

-	if (unlikely(head->next))
+	if (unlikely(head->next || !rq->online))
                return;

        head->func = (void (*)(struct callback_head *))func;
---

Powered by blists - more mailing lists