linux-kernel - Re: [PATCH RFC 1/3] sched/pe: Exclude balance callback queuing during proxy()'s migrate

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEXW_YQdTgdoPbPWsf3zwvSdk=KPyPPEGG2m5aEWS4e5z9zo+Q@mail.gmail.com>
Date:   Thu, 15 Dec 2022 18:12:19 -0500
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     linux-kernel@...r.kernel.org, Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Ingo Molnar <mingo@...hat.com>,
        Josh Triplett <josh@...htriplett.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mel Gorman <mgorman@...e.de>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Valentin Schneider <vschneid@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        kernel-team@...roid.com, John Stultz <jstultz@...gle.com>,
        Joel Fernandes <joelaf@...gle.com>,
        Qais Yousef <qais.yousef@....com>,
        Will Deacon <will@...nel.org>,
        Waiman Long <longman@...hat.com>,
        Boqun Feng <boqun.feng@...il.com>,
        "Connor O'Brien" <connoro@...gle.com>
Subject: Re: [PATCH RFC 1/3] sched/pe: Exclude balance callback queuing during
 proxy()'s migrate

On Mon, Dec 12, 2022 at 9:39 AM Dietmar Eggemann
<dietmar.eggemann@....com> wrote:
>
> Hi Joel,
>
> On 09/12/2022 17:52, Joel Fernandes wrote:
> > Hi Dietmar!
> >
> > On Fri, Dec 9, 2022 at 3:07 PM Dietmar Eggemann
> > <dietmar.eggemann@....com> wrote:
> >>
> >> On 23/11/2022 02:21, Joel Fernandes (Google) wrote:
>
> [...]
>
> >> You consider rt_mutex_setprio() and __sched_setscheduler() versus
> >> proxy() but what about all the other places like load_balance(),
> >> update_blocked_averages(),  __set_cpus_allowed_ptr() and many
> >> more in which we take the rq lock (via task_rq_lock() or
> >> rq_lock{_xxx}())?
> >
> > You are right. Those paths potentially need updates as well. Any
>
> IMHO, you would still have to lock the entire `queue->execute` (1)->(2)
> thing, like keeping the rq lock currently.
>
> __schedule()
>
>   pick_next_task()
>     pick_next_task_{rt/dl}()
>       set_next_task_{rt/dl}()
>        {rt/deadline}_queue_push_tasks()
>          queue_balance_callback() -->    (1)
>
>   proxy()                         ---    !!!
>
>   finish_lock_switch()
>     __balance_callbacks()         <--    (2)
>
>   __balance_callbacks(rq)         <--    (2)

Instead of locking it throughout, I think we can keep my initial patch
and just execute the balance callbacks in proxy() before dropping the
rq lock. I *think* that will make it work properly, but I could be
missing something.

Anyway I see the issue you bring up, I took care of balance CBs queued
from *other CPUs* while the rq lock is dropped, but the current CPU
executing proxy() could itself have queued balance callbacks. Dropping
the rq lock then causes other CPUs to see the proxy() CPU's balance
CBs in the callback list.

Anyway I will try the above and get back to you. Thanks so much again
for the insights. Will test as you suggested below.

Thanks,

 - Joel


> Otherwise. something like this could happen:
>
> With `x-x` : smp_processor_id()-cpu_of(rq)
>
> lock_torture_wr-1745 [003] 338.270963: queue_balance_callback: 3-3->
> lock_torture_wr-1745 [003] 338.270965: queue_balance_callback: 3-3<-
>             cat-1726 [001] 338.270969: __schedule: proxy() 1-1->
> lock_torture_wr-1745 [003] 338.270971: __schedule: proxy() 3-3->
>             cat-1726 [001] 338.270972: __schedule: proxy() 1-1<-
> lock_torture_wr-1745 [003] 338.270979: __schedule: proxy() 3-3<-
>          <idle>-0    [002] 338.270981: __schedule: proxy() 2-2->
>          <idle>-0    [002] 338.270984: __schedule: proxy() 2-2<-
> lock_torture_wr-1745 [003] 338.270985: __schedule: proxy() 3-3->
>     migration/4-34   [004] 338.270989: active_load_balance_cpu_stop: rq_pin_lock() 4-3 <-- ! cb on CPU3 still enqueued
>
> > chance you could post stack traces or logs of those issues, just in
> > case they have new nuggets of information? If you don't have them,
> > don't bother, I'll reproduce it.
>
> insmod /lib/modules/torture.ko random_shuffle=1 lock_torture_writer_fifo=1
> insmod /lib/modules/locktorture.ko torture_type=mutex_lock nlocks=3
>
> [  447.046916] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [  447.046926] WARNING: CPU: 1 PID: 1648 at kernel/sched/sched.h:1583 task_rq_lock+0x148/0x170
> [  447.062914] Modules linked in: locktorture(O) torture(O)
> [  447.068254] CPU: 1 PID: 1648 Comm: torture_shuffle Tainted: G W  O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
> [  447.079168] Hardware name: ARM Juno development board (r0) (DT)
> [  447.085106] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  447.092095] pc : task_rq_lock+0x148/0x170
> [  447.096121] lr : task_rq_lock+0x148/0x170
> [  447.100146] sp : ffff80000b85bd30
> ...
> [  447.175138] Call trace:
> [  447.177589]  task_rq_lock+0x148/0x170
> [  447.181267]  __set_cpus_allowed_ptr+0x34/0xc0
> [  447.185643]  set_cpus_allowed_ptr+0x30/0x60
> [  447.189843]  torture_shuffle+0x158/0x224 [torture]
> [  447.194666]  kthread+0x10c/0x110
> [  447.197906]  ret_from_fork+0x10/0x20
> ...
> [  447.233560] ---[ end trace 0000000000000000 ]---
>
>
> [  446.542532] ------------[ cut here ]------------
> [  446.553224] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [  446.553243] WARNING: CPU: 3 PID: 0 at kernel/sched/sched.h:1583 update_blocked_averages+0x784/0x78c
> [  446.576089] Modules linked in: locktorture(O+) torture(O)
> [  446.581551] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
> [  446.591723] Hardware name: ARM Juno development board (r0) (DT)
> [  446.597691] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  446.604712] pc : update_blocked_averages+0x784/0x78c
> [  446.609723] lr : update_blocked_averages+0x784/0x78c
> [  446.614733] sp : ffff80000b403b70
> ...
> [  446.690142] Call trace:
> [  446.692609]  update_blocked_averages+0x784/0x78c
> [  446.697270]  newidle_balance+0x184/0x5f0
> [  446.701232]  pick_next_task_fair+0x2c/0x500
> [  446.705456]  __schedule+0x1d4/0x1084
> [  446.709070]  schedule_idle+0x28/0x4c
> [  446.712682]  do_idle+0x1d4/0x2d0
> [  446.715946]  cpu_startup_entry+0x28/0x30
> [  446.719909]  secondary_start_kernel+0x138/0x15c
> [  446.724486]  __secondary_switched+0xb0/0xb4
> ...
> [  446.765848] ---[ end trace 0000000000000000 ]---
>
>
> [   95.091675] ------------[ cut here ]------------
> [   95.096301] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [   95.096322] WARNING: CPU: 5 PID: 39 at kernel/sched/sched.h:1583 load_balance+0x644/0xdc0
> [   95.103135] mutex_lock-torture: Creating lock_torture_writer task
> [   95.103238] mutex_lock-torture: lock_torture_writer task started
> [   95.110692] Modules linked in: locktorture(O+) torture(O)
> [   95.136699] CPU: 5 PID: 39 Comm: migration/5 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
> [   95.147137] Hardware name: ARM Juno development board (r0) (DT)
> [   95.153105] Stopper: 0x0 <- 0x0
> [   95.156282] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   95.163306] pc : load_balance+0x644/0xdc0
> [   95.167356] lr : load_balance+0x644/0xdc0
> [   95.171405] sp : ffff80000b4cbaf0
> ...
> [   95.246833] Call trace:
> [   95.249300]  load_balance+0x644/0xdc0
> [   95.253000]  newidle_balance+0x290/0x6f0
> [   95.256963]  pick_next_task_fair+0x2c/0x510
> [   95.261188]  __schedule+0x1d4/0x1084
> [   95.264801]  schedule+0x64/0x11c
> [   95.268063]  smpboot_thread_fn+0xa4/0x270
> [   95.272115]  kthread+0x10c/0x110
> [   95.275375]  ret_from_fork+0x10/0x20
> ...
> [   95.316477] ---[ end trace 0000000000000000 ]---
>
>
> [  134.893379] ------------[ cut here ]------------
> [  134.898066] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [  134.898088] WARNING: CPU: 4 PID: 0 at kernel/sched/sched.h:1583 sched_rt_period_timer+0x1dc/0x3f0
> [  134.914683] Modules linked in: locktorture(O) torture(O)
> [  134.920059] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G W  O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
> [  134.930235] Hardware name: ARM Juno development board (r0) (DT)
> [  134.936205] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  134.943229] pc : sched_rt_period_timer+0x1dc/0x3f0
> [  134.948069] lr : sched_rt_period_timer+0x1dc/0x3f0
> [  134.952908] sp : ffff80000b2cbde0
> ...
> [  135.028342] Call trace:
> [  135.030810]  sched_rt_period_timer+0x1dc/0x3f0
> [  135.035300]  __hrtimer_run_queues+0x184/0x504
> [  135.039700]  hrtimer_interrupt+0xe8/0x244
> [  135.043749]  arch_timer_handler_phys+0x2c/0x44
> [  135.048239]  handle_percpu_devid_irq+0x8c/0x150
> [  135.052815]  generic_handle_domain_irq+0x2c/0x44
> [  135.057480]  gic_handle_irq+0x44/0xc4
> [  135.061180]  call_on_irq_stack+0x2c/0x5c
> [  135.065143]  do_interrupt_handler+0x80/0x84
> ...
> [  135.141122] ---[ end trace 0000000000000000 ]---
>
> >> With your changes to locktorture in {2-3}/3 you still run CFS
> >> lock_torture_writers but you should see the issue popping up
> >> in __set_cpus_allowed_ptr() (from torture_shuffle()) for example.
> >>
> >> Tried with:
> >>
> >> insmod /lib/modules/torture.ko
> >> insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=1 nlocks=3
> >>                                                                       ^^^^^^^^^^^^^^^^^
> >>
> >> When changing all lock_torture_writer's to FIFO it becomes even
> >> more visible.
> >
> > Ok, thank you, I will make it more aggressively set to RT. The
> > previous locktorture was setting RT once every minute or so, at least
> > now that is down to 10 seconds ;-). But as you highlight with the
> > locktorture diff below, it needs to go further.
>
> Yeah, the test env is more aggressive this way and you spot potential
> issues quicker.
>
> > Anyway, this is going to be a nice holiday/Christmas project for me,
> > so thank you in advance for the holiday gift  :-)  ^_^
>
> Enjoy ;-)