Message-ID: <5fce12a8.c23.190733d53e9.Coremail.yangyingliang@huaweicloud.com>
Date: Tue, 2 Jul 2024 19:38:37 +0800 (GMT+08:00)
From: yangyingliang@...weicloud.com
To: "Peter Zijlstra" <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com, tglx@...utronix.de,
yu.c.chen@...el.com, tim.c.chen@...ux.intel.com,
yangyingliang@...wei.com, liwei391@...wei.com
Subject: Re: [PATCH resend] sched/smt: fix unbalance sched_smt_present dec/inc
> -----Original Messages-----
> From: "Peter Zijlstra" <peterz@...radead.org>
> Sent Time: 2024-07-02 16:44:18 (Tuesday)
> To: "Yang Yingliang" <yangyingliang@...weicloud.com>
> Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com, tglx@...utronix.de, yu.c.chen@...el.com, tim.c.chen@...ux.intel.com, yangyingliang@...wei.com, liwei391@...wei.com
> Subject: Re: [PATCH resend] sched/smt: fix unbalance sched_smt_present dec/inc
>
> On Tue, Jul 02, 2024 at 04:11:28PM +0800, Yang Yingliang wrote:
> > From: Yang Yingliang <yangyingliang@...wei.com>
> >
> > I got the following warning while running a stress test:
> >
> > jump label: negative count!
> > WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
> > Call Trace:
> > <TASK>
> > __static_key_slow_dec_cpuslocked+0x16/0x70
> > sched_cpu_deactivate+0x26e/0x2a0
> > cpuhp_invoke_callback+0x3ad/0x10d0
> > cpuhp_thread_fun+0x3f5/0x680
> > smpboot_thread_fn+0x56d/0x8d0
> > kthread+0x309/0x400
> > ret_from_fork+0x41/0x70
> > ret_from_fork_asm+0x1b/0x30
> > </TASK>
> >
> > When cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the CPU
> > offline operation fails, but sched_smt_present has already been
> > decremented before cpuset_cpu_inactive() is called. This leaves the
> > dec/inc unbalanced, so fix it by incrementing sched_smt_present in
> > the error path.
> >
> > Fixes: c5511d03ec09 ("sched/smt: Make sched_smt_present track topology")
> > Reviewed-by: Chen Yu <yu.c.chen@...el.com>
> > Reviewed-by: Tim Chen <tim.c.chen@...ux.intel.com>
> > Signed-off-by: Yang Yingliang <yangyingliang@...wei.com>
> > ---
> > kernel/sched/core.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bcf2c4cc0522..5ab6717b57e0 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9756,6 +9756,10 @@ int sched_cpu_deactivate(unsigned int cpu)
> >  	sched_update_numa(cpu, false);
> >  	ret = cpuset_cpu_inactive(cpu);
> >  	if (ret) {
> > +#ifdef CONFIG_SCHED_SMT
> > +		if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
> > +			static_branch_inc_cpuslocked(&sched_smt_present);
> > +#endif
> >  		balance_push_set(cpu, false);
> >  		set_cpu_active(cpu, true);
> >  		sched_update_numa(cpu, true);
>
> Yes, does indeed appear needed, however!, when I look at
> what else goes before this failure, should we not also call
> set_rq_online() and things like that?

Yes, set_rq_online() is needed in the error path. I will send a new patch to add this.

>
> That is, can we rework things to be less fragile by sharing code between
> this error path and sched_cpu_activate() ?