linux-kernel - Re: softirq

Message-ID: <cf7716fd-a148-aae2-45d-94862549e20@inria.fr>
Date: Thu, 27 Jun 2024 07:07:52 +1000 (AEST)
From: Julia Lawall <julia.lawall@...ia.fr>
To: Vincent Guittot <vincent.guittot@...aro.org>
cc: Julia Lawall <julia.lawall@...ia.fr>, 
    Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
    Dietmar Eggemann <dietmar.eggemann@....com>, Mel Gorman <mgorman@...e.de>, 
    K Prateek Nayak <kprateek.nayak@....com>, 
    linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: softirq



On Wed, 26 Jun 2024, Vincent Guittot wrote:

> On Wed, 26 Jun 2024 at 07:37, Julia Lawall <julia.lawall@...ia.fr> wrote:
> >
> > Hello,
> >
> > I'm not sure I understand how soft irqs work.  I see the code:
> >
> > open_softirq(SCHED_SOFTIRQ, sched_balance_softirq);
> >
> > Intuitively, I would expect that sched_balance_softirq would be run by
> > ksoftirqd.  That is, I would expect ksoftirqd to be scheduled
>
> By default, sched_softirq and others run in interrupt context.
> ksoftirqd is woken up only in some cases like when we spent too much
> time processing softirq in interrupt context or the softirq is raised
> outside interrupt context

nohz_csd_func calls raise_softirq_irqoff, which does:

inline void raise_softirq_irqoff(unsigned int nr)
{
        __raise_softirq_irqoff(nr);

        /*
         * If we're in an interrupt or softirq, we're done
         * (this also catches softirq-disabled code). We will
         * actually run the softirq once we return from
         * the irq or softirq.
         *
         * Otherwise we wake up ksoftirqd to make sure we
         * schedule the softirq soon.
         */
        if (!in_interrupt() && should_wake_ksoftirqd())
                wakeup_softirqd();
}

My impression was that wakeup_softirqd was getting called.

But it is true that if the code is being executed by idle, then
in_interrupt() should be true.  So perhaps it is someone else who is
waking up ksoftirqd.  When I switched to __raise_softirq_irqoff, the
behavior seemed to change, but I may not have fully understood why that
happened.

>
> > (sched_switch event), then the various actions of sched_balance_softirq to
> > be executed, and the ksoftirqd to be unscheduled (another sched_switch
> > event).
> >
> > But in practice, I see the code of sched_balance_softirq being executed
> > by the idle task, before the ksoftirqd is scheduled (see core 40):
>
> What wakes up ksoftirqd ? and which softirq finally runs in ksoftirqd ?
>
> >
> >           <idle>-0     [040]  3611.432554: softirq_entry:        vec=7 [action=SCHED]
> >           <idle>-0     [040]  3611.432554: bputs:                sched_balance_softirq: starting nohz
> >           <idle>-0     [040]  3611.432554: bputs:                sched_balance_softirq: starting _nohz_idle_balance
> >           bt.B.x-12022 [047]  3611.432554: softirq_entry:        vec=1 [action=TIMER]
> >           <idle>-0     [040]  3611.432554: bputs:                _nohz_idle_balance.isra.0: searching for a cpu
> >           bt.B.x-12033 [003]  3611.432554: softirq_entry:        vec=7 [action=SCHED]
> >           <idle>-0     [040]  3611.432554: bputs:                sched_balance_softirq: ending _nohz_idle_balance
> >           bt.B.x-12052 [011]  3611.432554: softirq_entry:        vec=7 [action=SCHED]
> >           <idle>-0     [040]  3611.432554: bputs:                sched_balance_softirq: nohz returns true ending soft irq
> >           <idle>-0     [040]  3611.432554: softirq_exit:         vec=7 [action=SCHED]
> >
> > For example, idle seems to be running the code in _nohz_idle_balance.
> >
> > I updated the code of _nohz_idle_balance as follows:
> >
> > trace_printk("searching for a cpu\n");
> >         for_each_cpu_wrap(balance_cpu,  nohz.idle_cpus_mask, this_cpu+1) {
> >                 if (!idle_cpu(balance_cpu))
> >                         continue;
> > trace_printk("found an idle cpu\n");
> >
> > It prints searching for a cpu, but not found an idle cpu, because the
> > ksoftirqd on the core's runqueue makes the core not idle.  This makes the
> > whole softirq seem fairly useless when the only idle core is the one
> > raising the soft irq.
>
> The typical behavior is:
>
> CPUA                                   CPUB
>                                        do_idle
>                                          while (!need_resched()) {
>                                          ...
>
> kick_ilb
>   smp_call_function_single_async(CPUB)
>     send_call_function_single_ipi
>       raise_ipi  --------------------->    cpuidle exit event
>                                            irq_handler_entry
>                                              ipi_handler
>                                                raise sched_softirq
>                                            irq_handler_exit
>                                            softirq_entry
>                                              sched_balance_softirq
>                                                _nohz_idle_balance
>                                            softirq_exit
>                                            cpuidle_enter event
>
> softirq is done in the interrupt context after the irq handler and
> CPUB never leaves the while (!need_resched())  loop
>
> In your case, I suspect that you have a race with the polling mode
> and the fact that you leave the while (!need_resched()) loop and call
> flush_smp_call_function_queue()
>
> We don't use polling on arm64 so I can't even try to reproduce your case

This is with Prateek's patch.  So need_resched is not true any more.

thanks,
julia

> >
> > This is all for the same scenario that I have discussed previously, where
> > there are two sockets and an overload of one thread on one and an underload
> > of one thread on the other, and all the threads have been marked by numa
> > balancing as preferring to be where they are.  Now I am trying Prateek's
> > patch series.
> >
> > thanks,
> > julia
>