Message-Id: <51368AF7-A80D-45B5-8D0F-5956D4E002C4@joelfernandes.org>
Date: Thu, 15 Dec 2022 18:31:12 -0500
From: Joel Fernandes <joel@...lfernandes.org>
To: Dietmar Eggemann <dietmar.eggemann@....com>
Cc: linux-kernel@...r.kernel.org, Ben Segall <bsegall@...gle.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Davidlohr Bueso <dave@...olabs.net>,
Ingo Molnar <mingo@...hat.com>,
Josh Triplett <josh@...htriplett.org>,
Juri Lelli <juri.lelli@...hat.com>,
Mel Gorman <mgorman@...e.de>,
"Paul E. McKenney" <paulmck@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>,
Valentin Schneider <vschneid@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
kernel-team@...roid.com, John Stultz <jstultz@...gle.com>,
Joel Fernandes <joelaf@...gle.com>,
Qais Yousef <qais.yousef@....com>,
Will Deacon <will@...nel.org>,
Waiman Long <longman@...hat.com>,
Boqun Feng <boqun.feng@...il.com>,
Connor O'Brien <connoro@...gle.com>
Subject: Re: [PATCH RFC 1/3] sched/pe: Exclude balance callback queuing during proxy()'s migrate
> On Dec 15, 2022, at 6:12 PM, Joel Fernandes <joel@...lfernandes.org> wrote:
>
> On Mon, Dec 12, 2022 at 9:39 AM Dietmar Eggemann
> <dietmar.eggemann@....com> wrote:
>>
>> Hi Joel,
>>
>>> On 09/12/2022 17:52, Joel Fernandes wrote:
>>> Hi Dietmar!
>>>
>>> On Fri, Dec 9, 2022 at 3:07 PM Dietmar Eggemann
>>> <dietmar.eggemann@....com> wrote:
>>>>
>>>> On 23/11/2022 02:21, Joel Fernandes (Google) wrote:
>>
>> [...]
>>
>>>> You consider rt_mutex_setprio() and __sched_setscheduler() versus
>>>> proxy() but what about all the other places like load_balance(),
>>>> update_blocked_averages(), __set_cpus_allowed_ptr() and many
>>>> more in which we take the rq lock (via task_rq_lock() or
>>>> rq_lock{_xxx}())?
>>>
>>> You are right. Those paths potentially need updates as well. Any
>>
>> IMHO, you would still have to lock the entire `queue->execute` (1)->(2)
>> window, i.e. keep holding the rq lock across it, as is done currently.
>>
>> __schedule()
>>
>> pick_next_task()
>> pick_next_task_{rt/dl}()
>> set_next_task_{rt/dl}()
>> {rt/deadline}_queue_push_tasks()
>> queue_balance_callback() --> (1)
>>
>> proxy() --- !!!
>>
>> finish_lock_switch()
>> __balance_callbacks() <-- (2)
>>
>> __balance_callbacks(rq) <-- (2)
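
For reference, I believe the assert all the splats below are hitting is
the SCHED_WARN_ON() in rq_pin_lock() in kernel/sched/sched.h, which (if
I am reading the 6.1 source correctly) looks roughly like:

	static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
	{
		rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
	#ifdef CONFIG_SCHED_DEBUG
		rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
		rf->clock_update_flags = 0;
	#ifdef CONFIG_SMP
		/* Callbacks queued at (1) must not still be pending here. */
		SCHED_WARN_ON(rq->balance_callback &&
			      rq->balance_callback != &balance_push_callback);
	#endif
	#endif
	}

So any CPU that locks and pins this rq in the window where proxy() has
dropped the lock between (1) and (2) will see the stale callbacks and
warn, which matches your traces.
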
>
> Instead of holding the rq lock across the whole window, I think we can
> keep my initial patch and just execute the balance callbacks in proxy()
> before dropping the rq lock. I *think* that will make it work properly,
> but I could be missing something.
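
Concretely, for option 1 I have something like the below in mind. This
is only an untested sketch (the proxy() body is heavily trimmed, and I
am assuming the existing helper names from kernel/sched/core.c), but it
shows the ordering I mean:

	static struct task_struct *
	proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
	{
		...
		/*
		 * We are about to drop rq->lock to go chase/migrate the
		 * mutex owner. First run the balance callbacks this CPU
		 * queued during pick_next_task(), so nothing is left on
		 * rq->balance_callback while the lock is not held (that
		 * is what trips the rq_pin_lock() assert on other CPUs).
		 */
		rq_unpin_lock(rq, rf);
		__balance_callbacks(rq);	/* wants rq->lock held */
		raw_spin_rq_unlock(rq);

		/* ... migrate @next / walk the blocked_on chain ... */

		raw_spin_rq_lock(rq);
		rq_repin_lock(rq, rf);
		...
	}
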
>
> Anyway, I see the issue you bring up: I took care of balance CBs queued
> by *other CPUs* while the rq lock is dropped, but the CPU currently
> executing proxy() could itself have queued balance callbacks. Dropping
> the rq lock then lets other CPUs see the proxy() CPU's balance CBs
> still sitting in the callback list.
>
> Anyway I will try the above and get back to you. Thanks so much again
> for the insights. Will test as you suggested below.

Another option is to dequeue them before dropping the rq lock and then
requeue them later. I am not sure whether that opens another can of
worms, but the first option appears safer to me in theory. Anyway, it
looks like we have a couple of options on the table here for me to try.
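
For option 2, I would imagine it ends up looking a lot like what
__sched_setscheduler() and rt_mutex_setprio() already do, i.e. splice
the list off the rq before unlocking and deal with it afterwards.
Rough, untested sketch (assuming the splice_balance_callbacks() /
balance_callbacks() helpers in kernel/sched/core.c, and whatever the
callback struct is called on this tree):

	struct balance_callback *head;

	/* Detach this CPU's queued callbacks while rq->lock is held. */
	head = splice_balance_callbacks(rq);
	raw_spin_rq_unlock(rq);

	/* ... proxy migration with rq->lock dropped ... */

	/*
	 * balance_callbacks() takes and releases rq->lock itself, so
	 * run (or requeue) the spliced callbacks before re-taking the
	 * lock for the rest of __schedule().
	 */
	balance_callbacks(rq, head);

	raw_spin_rq_lock(rq);
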
- Joel
>
> Thanks,
>
> - Joel
>
>
>> Otherwise, something like this could happen:
>>
>> With `x-y`: x = smp_processor_id(), y = cpu_of(rq)
>>
>> lock_torture_wr-1745 [003] 338.270963: queue_balance_callback: 3-3->
>> lock_torture_wr-1745 [003] 338.270965: queue_balance_callback: 3-3<-
>> cat-1726 [001] 338.270969: __schedule: proxy() 1-1->
>> lock_torture_wr-1745 [003] 338.270971: __schedule: proxy() 3-3->
>> cat-1726 [001] 338.270972: __schedule: proxy() 1-1<-
>> lock_torture_wr-1745 [003] 338.270979: __schedule: proxy() 3-3<-
>> <idle>-0 [002] 338.270981: __schedule: proxy() 2-2->
>> <idle>-0 [002] 338.270984: __schedule: proxy() 2-2<-
>> lock_torture_wr-1745 [003] 338.270985: __schedule: proxy() 3-3->
>> migration/4-34 [004] 338.270989: active_load_balance_cpu_stop: rq_pin_lock() 4-3 <-- ! cb on CPU3 still enqueued
>>
>>> chance you could post stack traces or logs of those issues, just in
>>> case they have new nuggets of information? If you don't have them,
>>> don't bother, I'll reproduce it.
>>
>> insmod /lib/modules/torture.ko random_shuffle=1 lock_torture_writer_fifo=1
>> insmod /lib/modules/locktorture.ko torture_type=mutex_lock nlocks=3
>>
>> [ 447.046916] rq->balance_callback && rq->balance_callback != &balance_push_callback
>> [ 447.046926] WARNING: CPU: 1 PID: 1648 at kernel/sched/sched.h:1583 task_rq_lock+0x148/0x170
>> [ 447.062914] Modules linked in: locktorture(O) torture(O)
>> [ 447.068254] CPU: 1 PID: 1648 Comm: torture_shuffle Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
>> [ 447.079168] Hardware name: ARM Juno development board (r0) (DT)
>> [ 447.085106] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 447.092095] pc : task_rq_lock+0x148/0x170
>> [ 447.096121] lr : task_rq_lock+0x148/0x170
>> [ 447.100146] sp : ffff80000b85bd30
>> ...
>> [ 447.175138] Call trace:
>> [ 447.177589] task_rq_lock+0x148/0x170
>> [ 447.181267] __set_cpus_allowed_ptr+0x34/0xc0
>> [ 447.185643] set_cpus_allowed_ptr+0x30/0x60
>> [ 447.189843] torture_shuffle+0x158/0x224 [torture]
>> [ 447.194666] kthread+0x10c/0x110
>> [ 447.197906] ret_from_fork+0x10/0x20
>> ...
>> [ 447.233560] ---[ end trace 0000000000000000 ]---
>>
>>
>> [ 446.542532] ------------[ cut here ]------------
>> [ 446.553224] rq->balance_callback && rq->balance_callback != &balance_push_callback
>> [ 446.553243] WARNING: CPU: 3 PID: 0 at kernel/sched/sched.h:1583 update_blocked_averages+0x784/0x78c
>> [ 446.576089] Modules linked in: locktorture(O+) torture(O)
>> [ 446.581551] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
>> [ 446.591723] Hardware name: ARM Juno development board (r0) (DT)
>> [ 446.597691] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 446.604712] pc : update_blocked_averages+0x784/0x78c
>> [ 446.609723] lr : update_blocked_averages+0x784/0x78c
>> [ 446.614733] sp : ffff80000b403b70
>> ...
>> [ 446.690142] Call trace:
>> [ 446.692609] update_blocked_averages+0x784/0x78c
>> [ 446.697270] newidle_balance+0x184/0x5f0
>> [ 446.701232] pick_next_task_fair+0x2c/0x500
>> [ 446.705456] __schedule+0x1d4/0x1084
>> [ 446.709070] schedule_idle+0x28/0x4c
>> [ 446.712682] do_idle+0x1d4/0x2d0
>> [ 446.715946] cpu_startup_entry+0x28/0x30
>> [ 446.719909] secondary_start_kernel+0x138/0x15c
>> [ 446.724486] __secondary_switched+0xb0/0xb4
>> ...
>> [ 446.765848] ---[ end trace 0000000000000000 ]---
>>
>>
>> [ 95.091675] ------------[ cut here ]------------
>> [ 95.096301] rq->balance_callback && rq->balance_callback != &balance_push_callback
>> [ 95.096322] WARNING: CPU: 5 PID: 39 at kernel/sched/sched.h:1583 load_balance+0x644/0xdc0
>> [ 95.103135] mutex_lock-torture: Creating lock_torture_writer task
>> [ 95.103238] mutex_lock-torture: lock_torture_writer task started
>> [ 95.110692] Modules linked in: locktorture(O+) torture(O)
>> [ 95.136699] CPU: 5 PID: 39 Comm: migration/5 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
>> [ 95.147137] Hardware name: ARM Juno development board (r0) (DT)
>> [ 95.153105] Stopper: 0x0 <- 0x0
>> [ 95.156282] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 95.163306] pc : load_balance+0x644/0xdc0
>> [ 95.167356] lr : load_balance+0x644/0xdc0
>> [ 95.171405] sp : ffff80000b4cbaf0
>> ...
>> [ 95.246833] Call trace:
>> [ 95.249300] load_balance+0x644/0xdc0
>> [ 95.253000] newidle_balance+0x290/0x6f0
>> [ 95.256963] pick_next_task_fair+0x2c/0x510
>> [ 95.261188] __schedule+0x1d4/0x1084
>> [ 95.264801] schedule+0x64/0x11c
>> [ 95.268063] smpboot_thread_fn+0xa4/0x270
>> [ 95.272115] kthread+0x10c/0x110
>> [ 95.275375] ret_from_fork+0x10/0x20
>> ...
>> [ 95.316477] ---[ end trace 0000000000000000 ]---
>>
>>
>> [ 134.893379] ------------[ cut here ]------------
>> [ 134.898066] rq->balance_callback && rq->balance_callback != &balance_push_callback
>> [ 134.898088] WARNING: CPU: 4 PID: 0 at kernel/sched/sched.h:1583 sched_rt_period_timer+0x1dc/0x3f0
>> [ 134.914683] Modules linked in: locktorture(O) torture(O)
>> [ 134.920059] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
>> [ 134.930235] Hardware name: ARM Juno development board (r0) (DT)
>> [ 134.936205] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [ 134.943229] pc : sched_rt_period_timer+0x1dc/0x3f0
>> [ 134.948069] lr : sched_rt_period_timer+0x1dc/0x3f0
>> [ 134.952908] sp : ffff80000b2cbde0
>> ...
>> [ 135.028342] Call trace:
>> [ 135.030810] sched_rt_period_timer+0x1dc/0x3f0
>> [ 135.035300] __hrtimer_run_queues+0x184/0x504
>> [ 135.039700] hrtimer_interrupt+0xe8/0x244
>> [ 135.043749] arch_timer_handler_phys+0x2c/0x44
>> [ 135.048239] handle_percpu_devid_irq+0x8c/0x150
>> [ 135.052815] generic_handle_domain_irq+0x2c/0x44
>> [ 135.057480] gic_handle_irq+0x44/0xc4
>> [ 135.061180] call_on_irq_stack+0x2c/0x5c
>> [ 135.065143] do_interrupt_handler+0x80/0x84
>> ...
>> [ 135.141122] ---[ end trace 0000000000000000 ]---
>>
>>>> With your changes to locktorture in {2-3}/3 you still run CFS
>>>> lock_torture_writers but you should see the issue popping up
>>>> in __set_cpus_allowed_ptr() (from torture_shuffle()) for example.
>>>>
>>>> Tried with:
>>>>
>>>> insmod /lib/modules/torture.ko
>>>> insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=1 nlocks=3
>>>> ^^^^^^^^^^^^^^^^^
>>>>
>>>> When changing all lock_torture_writers to FIFO it becomes even
>>>> more visible.
>>>
>>> Ok, thank you, I will make it set RT more aggressively. The previous
>>> locktorture was setting RT only once every minute or so; at least now
>>> that is down to every 10 seconds ;-). But as you highlight with the
>>> locktorture diff below, it needs to go further.
>>
>> Yeah, the test env is more aggressive this way and you spot potential
>> issues quicker.
>>
>>> Anyway, this is going to be a nice holiday/Christmas project for me,
>>> so thank you in advance for the holiday gift :-) ^_^
>>
>> Enjoy ;-)