[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bb174ad3-e5f6-8643-21f4-40604323c00b@amd.com>
Date: Thu, 27 Jun 2024 19:43:23 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Julia Lawall <julia.lawall@...ia.fr>, Vincent Guittot
<vincent.guittot@...aro.org>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>, Mel Gorman <mgorman@...e.de>,
linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: softirq
Hello Julia,
Some data from my 3rd Generation EPYC machine.
On 6/27/2024 2:37 AM, Julia Lawall wrote:
>
>
> On Wed, 26 Jun 2024, Vincent Guittot wrote:
>
>> On Wed, 26 Jun 2024 at 07:37, Julia Lawall <julia.lawall@...ia.fr> wrote:
>>>
>>> Hello,
>>>
>>> I'm not sure to understand how soft irqs work. I see the code:
>>>
>>> open_softirq(SCHED_SOFTIRQ, sched_balance_softirq);
>>>
>>> Intuitively, I would expect that sched_balance_softirq would be run by
>>> ksoftirqd. That is, I would expect ksoftirqd to be scheduled
>>
>> By default, sched_softirq and others run in interrupt context.
>> ksoftirqd is woken up only in some cases like when we spent too much
>> time processing softirq in interrupt context or the softirq is raised
>> outside interrupt context
>
> nohz_csd_func calls raise_softirq_irqoff, which does:
>
> inline void raise_softirq_irqoff(unsigned int nr)
> {
> __raise_softirq_irqoff(nr);
>
> /*
> * If we're in an interrupt or softirq, we're done
> * (this also catches softirq-disabled code). We will
> * actually run the softirq once we return from
> * the irq or softirq.
> *
> * Otherwise we wake up ksoftirqd to make sure we
> * schedule the softirq soon.
> */
> if (!in_interrupt() && should_wake_ksoftirqd())
I think it is the !in_interrupt() check that fails. When I disable C2
(which is I/O Port based C-state on AMD) and only leave C0 (Poll loop)
and C1 (MWAIT based C-state), both of which set TIF_POLLING_NRFLAG while
idling, and I add the following log line:
trace_printk("raise_softirq_irqoff %d %lu %lu\n",
preempt_count(),
in_interrupt());
just above the "if" condition on previous line, I see:
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<idle>-0 [000] d..1. 364.875516: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 364.879504: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 365.299507: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 365.963524: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 367.291500: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 370.339504: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 371.875481: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 374.875462: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
...
(Note, this is only for SCHED_SOFTIRQ being raised from nohz_csd_func())
Since for !CONFIG_PREEMPT_RT "should_wake_ksoftirqd()" always returns
true, we end up waking softirqd for idle load balancing. Note that
"hardirq/softirq" column is always a "." since nohz_csd_func() is
executed from "flush_smp_call_function_queue()" on the way out of
do_idle().
With C2 enabled, which is an I/O Port based C-state on AMD and does
not set TIF_POLLING_NRFLAG, I see:
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<idle>-0 [000] d.h1. 2880.497140: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.H2. 2882.193270: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65794 65792
<idle>-0 [000] d.h1. 2884.857103: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2886.769577: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2886.989832: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2887.281561: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2887.825556: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2888.817564: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
Raising of softirq here happens in "hardirq" context which, I believe,
will lead to SCHED_SOFTIRQ being serviced on the way out. When enabling
soft_irq_enter and soft_irq_exit tracepoints, I see:
<idle>-0 [000] d.h1. 3309.994942: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536 65536
<idle>-0 [000] ..s1. 3309.994943: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [000] ..s1. 3309.995026: softirq_exit: vec=7 [action=SCHED]
With the former, I do see nr_running > 1 whenever "ksoftirqd" is running
"sched_balance_domains":
ksoftirqd/0-16 [000] ..s.. 10153.805434: sched_balance_domains: nr_running: 1
ksoftirqd/168-1038 [168] ..s.. 10163.765221: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10166.761349: sched_balance_domains: nr_running: 1
ksoftirqd/120-747 [120] ..s.. 10166.809204: sched_balance_domains: nr_running: 2
ksoftirqd/132-820 [132] ..s.. 10166.813203: sched_balance_domains: nr_running: 1
ksoftirqd/246-1511 [246] ..s.. 10166.845532: sched_balance_domains: nr_running: 1
ksoftirqd/107-668 [107] ..s.. 10166.853201: sched_balance_domains: nr_running: 2
ksoftirqd/120-747 [120] ..s.. 10166.865359: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10191.273328: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10193.137307: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10235.057105: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10320.172832: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10323.708863: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10338.912787: sched_balance_domains: nr_running: 1
--
Thanks and Regards,
Prateek
> wakeup_softirqd();
> }
>
> My impression was that wakeup_softirqd was getting called.
>
> But it is true that if the code is being executed by idle, then
> in_interrupt() should be true. So perhaps it is someone else who is
> waking up ksoftirqd. When I switched to __raise_softirq_irqoff, the
> behavior seemed to change, but I may not have fully understood why that
> happened.
>
>>
>>> (sched_switch event), then the various actions of sched_balance_softirq to
>>> be executed, and the ksoftirqd to be unscheduled (another ksoftirqd)
>>> event.
>>>
>>> But in practice, I see the code of sched_balance_softirq being executed
>>> by the idle task, before the ksoftirqd is scheduled (see core 40):
>>
>> What wakes up ksoftirqd ? and which softirq finally runs in ksoftirqd ?
>>
>>>
>>> <idle>-0 [040] 3611.432554: softirq_entry: vec=7 [action=SCHED]
>>> <idle>-0 [040] 3611.432554: bputs: sched_balance_softirq: starting nohz
>>> <idle>-0 [040] 3611.432554: bputs: sched_balance_softirq: starting _nohz_idle_balance
>>> bt.B.x-12022 [047] 3611.432554: softirq_entry: vec=1 [action=TIMER]
>>> <idle>-0 [040] 3611.432554: bputs: _nohz_idle_balance.isra.0: searching for a cpu
>>> bt.B.x-12033 [003] 3611.432554: softirq_entry: vec=7 [action=SCHED]
>>> <idle>-0 [040] 3611.432554: bputs: sched_balance_softirq: ending _nohz_idle_balance
>>> bt.B.x-12052 [011] 3611.432554: softirq_entry: vec=7 [action=SCHED]
>>> <idle>-0 [040] 3611.432554: bputs: sched_balance_softirq: nohz returns true ending soft irq
>>> <idle>-0 [040] 3611.432554: softirq_exit: vec=7 [action=SCHED]
>>>
>>> For example, idle seems to be running the code in _nohz_idle_balance.
>>>
>>> I updated the code of _nohz_idle_balance as follows:
>>>
>>> trace_printk("searching for a cpu\n");
>>> for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1) {
>>> if (!idle_cpu(balance_cpu))
>>> continue;
>>> trace_printk("found an idle cpu\n");
>>>
>>> It prints searching for a cpu, but not found an idle cpu, because the
>>> ksoftirqd on the core's runqueue makes the core not idle. This makes the
>>> whole softirq seem fairly useless when the only idle core is the one
>>> raising the soft irq.
>>
>> The typical behavior is:
>>
>> CPUA CPUB
>> do_idle
>> while (!need_resched()) {
>> ...
>>
>> kick_ilb
>> smp_call_function_single_async(CPUB)
>> send_call_function_single_ipi
>> raise_ipi ---------------------> cpuidle exit event
>> irq_handler_entry
>> ipi_handler
>> raise sched_softirq
>> irq_handler_exit
>> sorftirq_entry
>> sched_balance_softirq
>> __nohe_idle_balance
>> softirq_exit
>> cpuidle_enter event
>>
>> softirq is done in the interrupt context after the irq handler and
>> CPUB never leaves the while (!need_resched()) loop
>>
>> In your case, I suspect that you have a racing with the polling mode
>> and the fact that you leave the while (!need_resched()) loop and call
>> flush_smp_call_function_queue()
>>
>> We don't use polling on arm64 so I can't even try to reproduce your case
>
> This is with Prateek's patch. So need_resched is not true any more.
>
> thanks,
> julia
>
>>>
>>> This is all for the same scenario that I have discussed previously, where
>>> there are two sockets and an overload of on thread on one and an underload
>>> of on thread on the other, and all the thread have been marked by numa
>>> balancing as preferring to be where they are. Now I am trying Prateek's
>>> patch series.
>>>
>>> thanks,
>>> julia
>>
Powered by blists - more mailing lists