Date: Fri, 28 Jun 2024 08:30:20 +1000 (AEST)
From: Julia Lawall <julia.lawall@...ia.fr>
To: Vincent Guittot <vincent.guittot@...aro.org>
cc: K Prateek Nayak <kprateek.nayak@....com>, 
    Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
    Dietmar Eggemann <dietmar.eggemann@....com>, Mel Gorman <mgorman@...e.de>, 
    linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: softirq



On Thu, 27 Jun 2024, Vincent Guittot wrote:

> On Thu, 27 Jun 2024 at 16:13, K Prateek Nayak <kprateek.nayak@....com> wrote:
> >
> > Hello Julia,
> >
> > Some data from my 3rd Generation EPYC machine.
> >
> > On 6/27/2024 2:37 AM, Julia Lawall wrote:
> > >
> > >
> > > On Wed, 26 Jun 2024, Vincent Guittot wrote:
> > >
> > >> On Wed, 26 Jun 2024 at 07:37, Julia Lawall <julia.lawall@...ia.fr> wrote:
> > >>>
> > >>> Hello,
> > >>>
> > >>> I'm not sure I understand how soft irqs work.  I see the code:
> > >>>
> > >>> open_softirq(SCHED_SOFTIRQ, sched_balance_softirq);
> > >>>
> > >>> Intuitively, I would expect that sched_balance_softirq would be run by
> > >>> ksoftirqd.  That is, I would expect ksoftirqd to be scheduled
> > >>
> > >> By default, sched_softirq and the others run in interrupt context.
> > >> ksoftirqd is woken up only in some cases, like when we spend too much
> > >> time processing softirqs in interrupt context or when the softirq is
> > >> raised outside interrupt context
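> > >>
> > >> (For reference, a paraphrased sketch of the tail of __do_softirq() in
> > >> kernel/softirq.c -- not the exact upstream code:)
> > >>
> > >>         pending = local_softirq_pending();
> > >>         if (pending) {
> > >>                 /* still within the ~2ms / 10-restart budget and no
> > >>                  * resched pending: keep processing in this context */
> > >>                 if (time_before(jiffies, end) && !need_resched() &&
> > >>                     --max_restart)
> > >>                         goto restart;
> > >>
> > >>                 /* budget exhausted: defer the rest to ksoftirqd */
> > >>                 wakeup_softirqd();
> > >>         }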
> > >
> > > nohz_csd_func calls raise_softirq_irqoff, which does:
> > >
> > > inline void raise_softirq_irqoff(unsigned int nr)
> > > {
> > >          __raise_softirq_irqoff(nr);
> > >
> > >          /*
> > >           * If we're in an interrupt or softirq, we're done
> > >           * (this also catches softirq-disabled code). We will
> > >           * actually run the softirq once we return from
> > >           * the irq or softirq.
> > >           *
> > >           * Otherwise we wake up ksoftirqd to make sure we
> > >           * schedule the softirq soon.
> > >           */
> > >          if (!in_interrupt() && should_wake_ksoftirqd())
> >
> > I think it is the !in_interrupt() check that fails. When I disable C2
> > (which is an I/O Port based C-state on AMD) and only leave C0 (Poll loop)
> > and C1 (an MWAIT based C-state), both of which set TIF_POLLING_NRFLAG while
> > idling, and I add the following log line:
> >
> >         trace_printk("raise_softirq_irqoff %d %lu\n",
> >                      preempt_count(),
> >                      in_interrupt());
> >
> > just above the "if" condition quoted above, I see:
> >
> > #                                _-----=> irqs-off/BH-disabled
> > #                               / _----=> need-resched
> > #                              | / _---=> hardirq/softirq
> > #                              || / _--=> preempt-depth
> > #                              ||| / _-=> migrate-disable
> > #                              |||| /     delay
> > #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
> > #              | |         |   |||||     |         |
> >            <idle>-0       [000] d..1.   364.875516: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
> >            <idle>-0       [000] d..1.   364.879504: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
> >            <idle>-0       [000] d..1.   365.299507: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
> >            <idle>-0       [000] d..1.   365.963524: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
> >            <idle>-0       [000] d..1.   367.291500: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
> >            <idle>-0       [000] d..1.   370.339504: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
> >            <idle>-0       [000] d..1.   371.875481: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
> >            <idle>-0       [000] d..1.   374.875462: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
> >           ...
> >
> > (Note, this is only for SCHED_SOFTIRQ being raised from nohz_csd_func())
> >
> > Since for !CONFIG_PREEMPT_RT "should_wake_ksoftirqd()" always returns
> > true, we end up waking ksoftirqd for idle load balancing. Note that
> > the "hardirq/softirq" column is always a "." since nohz_csd_func() is
> > executed from "flush_smp_call_function_queue()" on the way out of
> > do_idle().
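> >
> > (For reference, a paraphrased sketch of should_wake_ksoftirqd() from
> > kernel/softirq.c -- not the exact upstream code:)
> >
> >         /* !CONFIG_PREEMPT_RT */
> >         static inline bool should_wake_ksoftirqd(void)
> >         {
> >                 return true;
> >         }
> >
> >         /* CONFIG_PREEMPT_RT: skip the wakeup while local softirq
> >          * processing is already in progress */
> >         static inline bool should_wake_ksoftirqd(void)
> >         {
> >                 return !this_cpu_read(softirq_ctrl.cnt);
> >         }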
> >
> > With C2 enabled, which is an I/O Port based C-state on AMD and does
> > not set TIF_POLLING_NRFLAG, I see:
> >
> > #                                _-----=> irqs-off/BH-disabled
> > #                               / _----=> need-resched
> > #                              | / _---=> hardirq/softirq
> > #                              || / _--=> preempt-depth
> > #                              ||| / _-=> migrate-disable
> > #                              |||| /     delay
> > #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
> > #              | |         |   |||||     |         |
> >            <idle>-0       [000] d.h1.  2880.497140: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
> >            <idle>-0       [000] d.H2.  2882.193270: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65794 65792
> >            <idle>-0       [000] d.h1.  2884.857103: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
> >            <idle>-0       [000] d.h1.  2886.769577: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
> >            <idle>-0       [000] d.h1.  2886.989832: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
> >            <idle>-0       [000] d.h1.  2887.281561: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
> >            <idle>-0       [000] d.h1.  2887.825556: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
> >            <idle>-0       [000] d.h1.  2888.817564: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
> >
> > Raising the softirq here happens in "hardirq" context which, I believe,
> > will lead to SCHED_SOFTIRQ being serviced on the way out. When enabling
> > the softirq_entry and softirq_exit tracepoints, I see:
> >
> >            <idle>-0       [000] d.h1.  3309.994942: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536 65536
> >            <idle>-0       [000] ..s1.  3309.994943: softirq_entry: vec=7 [action=SCHED]
> >            <idle>-0       [000] ..s1.  3309.995026: softirq_exit: vec=7 [action=SCHED]
> >
> > With the former, I do see nr_running >= 1 whenever "ksoftirqd" is running
> > "sched_balance_domains":
> >
> >       ksoftirqd/0-16      [000] ..s.. 10153.805434: sched_balance_domains: nr_running: 1
> >     ksoftirqd/168-1038    [168] ..s.. 10163.765221: sched_balance_domains: nr_running: 1
> >       ksoftirqd/0-16      [000] ..s.. 10166.761349: sched_balance_domains: nr_running: 1
> >     ksoftirqd/120-747     [120] ..s.. 10166.809204: sched_balance_domains: nr_running: 2
> >     ksoftirqd/132-820     [132] ..s.. 10166.813203: sched_balance_domains: nr_running: 1
> >     ksoftirqd/246-1511    [246] ..s.. 10166.845532: sched_balance_domains: nr_running: 1
> >     ksoftirqd/107-668     [107] ..s.. 10166.853201: sched_balance_domains: nr_running: 2
> >     ksoftirqd/120-747     [120] ..s.. 10166.865359: sched_balance_domains: nr_running: 1
> >       ksoftirqd/0-16      [000] ..s.. 10191.273328: sched_balance_domains: nr_running: 1
> >       ksoftirqd/0-16      [000] ..s.. 10193.137307: sched_balance_domains: nr_running: 1
> >       ksoftirqd/0-16      [000] ..s.. 10235.057105: sched_balance_domains: nr_running: 1
> >       ksoftirqd/0-16      [000] ..s.. 10320.172832: sched_balance_domains: nr_running: 1
> >       ksoftirqd/0-16      [000] ..s.. 10323.708863: sched_balance_domains: nr_running: 1
> >       ksoftirqd/0-16      [000] ..s.. 10338.912787: sched_balance_domains: nr_running: 1
> >
>
> This is probably linked to flush_smp_call_function_queue()
>
> flush_smp_call_function_queue();
>   local_irq_save(flags);
>   __flush_smp_call_function_queue(true);
>     nohz_csd_func
>       wakeup ksoftirqd because not in interrupt context
>   if (local_softirq_pending())
>     do_softirq_post_smp_call_flush(was_pending);
>       do_softirq()
>         sched_balance_softirq
>   local_irq_restore(flags);
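>
> (Roughly, paraphrasing flush_smp_call_function_queue() in kernel/smp.c:)
>
>   void flush_smp_call_function_queue(void)
>   {
>           unsigned int was_pending;
>           unsigned long flags;
>
>           if (llist_empty(this_cpu_ptr(&call_single_queue)))
>                   return;
>
>           local_irq_save(flags);
>           was_pending = local_softirq_pending();
>           __flush_smp_call_function_queue(true);
>           /* any softirq raised by the flushed callbacks (e.g. by
>            * nohz_csd_func()) is run right here */
>           if (local_softirq_pending())
>                   do_softirq_post_smp_call_flush(was_pending);
>           local_irq_restore(flags);
>   }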

Thanks for the experiment, Prateek.

I agree that the issue is that when polling, we are not in interrupt
context, and thus ksoftirqd gets woken up.  The problem is that when
there is only one idle core, waking up ksoftirqd is counterproductive:
that core is then seen as not being idle, which prevents load balancing.
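
For reference, idle_cpu() (paraphrased from kernel/sched/core.c, not the
exact upstream code) bails out as soon as anything is on the runqueue, so
the woken ksoftirqd alone is enough to make the core look busy:

	int idle_cpu(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		if (rq->curr != rq->idle)
			return 0;
		if (rq->nr_running)
			return 0;
		if (rq->ttwu_pending)	/* under CONFIG_SMP */
			return 0;
		return 1;
	}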

julia

>
>
>
> > --
> > Thanks and Regards,
> > Prateek
> >
> > >               wakeup_softirqd();
> > > }
> > >
> > > My impression was that wakeup_softirqd was getting called.
> > >
> > > But it is true that if the code is being executed by idle, then
> > > in_interrupt() should be true.  So perhaps it is someone else who is
> > > waking up ksoftirqd.  When I switched to __raise_softirq_irqoff, the
> > > behavior seemed to change, but I may not have fully understood why that
> > > happened.
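> > >
> > > (For reference, __raise_softirq_irqoff() only sets the pending bit and
> > > never wakes ksoftirqd -- a paraphrased sketch from kernel/softirq.c:)
> > >
> > >         void __raise_softirq_irqoff(unsigned int nr)
> > >         {
> > >                 lockdep_assert_irqs_disabled();
> > >                 trace_softirq_raise(nr);
> > >                 or_softirq_pending(1UL << nr);
> > >         }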
> > >
> > >>
> > >>> (sched_switch event), then the various actions of sched_balance_softirq to
> > >>> be executed, and the ksoftirqd to be unscheduled (another sched_switch
> > >>> event).
> > >>>
> > >>> But in practice, I see the code of sched_balance_softirq being executed
> > >>> by the idle task, before the ksoftirqd is scheduled (see core 40):
> > >>
> > >> What wakes up ksoftirqd ? and which softirq finally runs in ksoftirqd ?
> > >>
> > >>>
> > >>>            <idle>-0     [040]  3611.432554: softirq_entry:        vec=7 [action=SCHED]
> > >>>            <idle>-0     [040]  3611.432554: bputs:                sched_balance_softirq: starting nohz
> > >>>            <idle>-0     [040]  3611.432554: bputs:                sched_balance_softirq: starting _nohz_idle_balance
> > >>>            bt.B.x-12022 [047]  3611.432554: softirq_entry:        vec=1 [action=TIMER]
> > >>>            <idle>-0     [040]  3611.432554: bputs:                _nohz_idle_balance.isra.0: searching for a cpu
> > >>>            bt.B.x-12033 [003]  3611.432554: softirq_entry:        vec=7 [action=SCHED]
> > >>>            <idle>-0     [040]  3611.432554: bputs:                sched_balance_softirq: ending _nohz_idle_balance
> > >>>            bt.B.x-12052 [011]  3611.432554: softirq_entry:        vec=7 [action=SCHED]
> > >>>            <idle>-0     [040]  3611.432554: bputs:                sched_balance_softirq: nohz returns true ending soft irq
> > >>>            <idle>-0     [040]  3611.432554: softirq_exit:         vec=7 [action=SCHED]
> > >>>
> > >>> For example, idle seems to be running the code in _nohz_idle_balance.
> > >>>
> > >>> I updated the code of _nohz_idle_balance as follows:
> > >>>
> > >>> trace_printk("searching for a cpu\n");
> > >>>          for_each_cpu_wrap(balance_cpu,  nohz.idle_cpus_mask, this_cpu+1) {
> > >>>                  if (!idle_cpu(balance_cpu))
> > >>>                          continue;
> > >>> trace_printk("found an idle cpu\n");
> > >>>
> > >>> It prints "searching for a cpu", but not "found an idle cpu", because the
> > >>> ksoftirqd on the core's runqueue makes the core not idle.  This makes the
> > >>> whole softirq seem fairly useless when the only idle core is the one
> > >>> raising the soft irq.
> > >>
> > >> The typical behavior is:
> > >>
> > >> CPUA                                   CPUB
> > >>                                         do_idle
> > >>                                           while (!need_resched()) {
> > >>                                           ...
> > >>
> > >> kick_ilb
> > >>    smp_call_function_single_async(CPUB)
> > >>      send_call_function_single_ipi
> > >>        raise_ipi  --------------------->    cpuidle exit event
> > >>                                             irq_handler_entry
> > >>                                               ipi_handler
> > >>                                                 raise sched_softirq
> > >>                                             irq_handler_exit
> > >>                                             softirq_entry
> > >>                                               sched_balance_softirq
> > >>                                                 _nohz_idle_balance
> > >>                                             softirq_exit
> > >>                                             cpuidle_enter event
> > >>
> > >> softirq is done in the interrupt context after the irq handler and
> > >> CPUB never leaves the while (!need_resched())  loop
> > >>
> > >> In your case, I suspect that you have a race with the polling mode
> > >> and the fact that you leave the while (!need_resched()) loop and call
> > >> flush_smp_call_function_queue()
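> > >>
> > >> (Roughly the shape of do_idle() in kernel/sched/idle.c, paraphrased:)
> > >>
> > >>         while (!need_resched()) {
> > >>                 ...
> > >>                 cpuidle_idle_call();    /* or cpu_idle_poll() */
> > >>         }
> > >>
> > >>         /* loop exited: pending IPI callbacks are flushed here,
> > >>          * outside of interrupt context */
> > >>         flush_smp_call_function_queue();
> > >>         schedule_idle();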
> > >>
> > >> We don't use polling on arm64 so I can't even try to reproduce your case
> > >
> > > This is with Prateek's patch.  So need_resched is not true any more.
> > >
> > > thanks,
> > > julia
> > >
> > >>>
> > >>> This is all for the same scenario that I have discussed previously, where
> > >>> there are two sockets with an overload of one thread on one and an underload
> > >>> of one thread on the other, and all the threads have been marked by NUMA
> > >>> balancing as preferring to be where they are.  Now I am trying Prateek's
> > >>> patch series.
> > >>>
> > >>> thanks,
> > >>> julia
> > >>
> >
>
