Message-ID: <CANDhNCr5z+seg0Y=BVkJE0+xQLnjWfJuyFdv4i7yNdf57-yeSg@mail.gmail.com>
Date:   Tue, 15 Nov 2022 13:36:31 -0800
From:   John Stultz <jstultz@...gle.com>
To:     Joel Fernandes <joel@...lfernandes.org>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        "Connor O'Brien" <connoro@...gle.com>,
        John Dias <joaodias@...gle.com>, Rick Yiu <rickyiu@...gle.com>,
        John Kacur <jkacur@...hat.com>,
        Qais Yousef <qais.yousef@....com>,
        Chris Redpath <chris.redpath@....com>,
        Abhijeet Dharmapurikar <adharmap@...cinc.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Thomas Gleixner <tglx@...utronix.de>, kernel-team@...roid.com,
        "J . Avila" <elavila@...gle.com>
Subject: Re: [RFC PATCH v4 2/3] sched: Avoid placing RT threads on cores
 handling long softirqs

On Sat, Oct 22, 2022 at 11:28 AM Joel Fernandes <joel@...lfernandes.org> wrote:
> On Mon, Oct 03, 2022 at 11:20:32PM +0000, John Stultz wrote:
> > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > index a749a8663841..e3a4add67e8c 100644
> > --- a/include/linux/interrupt.h
> > +++ b/include/linux/interrupt.h
> > @@ -582,6 +582,11 @@ enum
> >   * _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
> >   */
> >  #define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
> > +/* Softirq's where the handling might be long: */
> > +#define LONG_SOFTIRQ_MASK (BIT(NET_TX_SOFTIRQ)    | \
> > +                        BIT(NET_RX_SOFTIRQ)    | \
> > +                        BIT(BLOCK_SOFTIRQ)     | \
> > +                        BIT(IRQ_POLL_SOFTIRQ))
>
> Instead of a mask, I wonder if a better approach is to have a heuristic based
> on measurement of softirq load. There is already code to measure irq load
> (see update_irq_load_avg). Perhaps, Vincent would be willing to add the
> softirq load in it as well if asked nicely ;)
>
> That's much better because:
> 1. Any new softirqs added will also trigger this optimization.
> 2. Systems where these softirqs are quiet will not unnecessarily trigger it.
> 3. You can also include irq load so that things like IRQ storms also don't
> cause RT issues (when there are better "uncompromised" idle CPU candidates).
> 4. May not need a CONFIG option at all if the optimization makes sense
> for everything. Think of all the systems not running Android.

Hey Joel,
  Thanks again for the feedback, and sorry to take a while to get back to you.

I'll have to think about this some more, but I'm hesitant about
switching to a load-based heuristic approach, mostly because
responding to "load" feels a bit fuzzy to me.
If we suddenly get a burst of softirqs preempting scheduling, we don't
want to wait for the load tracking to recognize this, as the effect of
the blocked audio processing will be felt immediately.

That's why this optimization just avoids scheduling rt tasks on cpus
that are running potentially long-running softirqs, as we can't be
sure in that instance we'll be able to run soon.
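
Roughly, the idea in code form (an illustrative sketch, not the exact
hunk from this patch):

	/*
	 * Sketch: when picking a CPU for an RT task, drop candidates
	 * that are currently busy with long softirqs, so the decision
	 * reacts to the CPU's instantaneous state rather than a
	 * decayed load signal.  (The real code would need a fallback
	 * if this empties the candidate mask.)
	 */
	for_each_cpu(cpu, lowest_mask) {
		if (cpu_busy_with_softirqs(cpu))
			cpumask_clear_cpu(cpu, lowest_mask);
	}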


> >  /* map softirq index to softirq name. update 'softirq_to_name' in
> >   * kernel/softirq.c when adding a new softirq.
> > @@ -617,6 +622,7 @@ extern void raise_softirq_irqoff(unsigned int nr);
> >  extern void raise_softirq(unsigned int nr);
> >
> >  DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
> > +DECLARE_PER_CPU(u32, active_softirqs);
> >
> >  static inline struct task_struct *this_cpu_ksoftirqd(void)
> >  {
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 532362fcfe31..3d1de6edcfa1 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1284,6 +1284,16 @@ config SCHED_AUTOGROUP
> >         desktop applications.  Task group autogeneration is currently based
> >         upon task session.
> >
> > +config RT_SOFTIRQ_OPTIMIZATION
>
> I much dislike this config option name because it does not self-explain what
> the option is and it is too long. More so, any future such optimizations,
> even if rare, will have to come up with another name causing more confusion.
> Instead, CONFIG_RT_AVOID_SOFTIRQS seems cleaner, but I'll defer to scheduler
> maintainers since they have to take the patch ultimately. Though I'll give my
> tag for a rename / better name. More below:

That's a fair point.  RT_SOFTIRQ_AWARE_SCHED maybe?
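
Roughly (just a sketch, with placeholder help text):

	config RT_SOFTIRQ_AWARE_SCHED
		bool "Avoid placing RT tasks on CPUs handling long softirqs"
		depends on SMP
		help
		  Skip CPUs that are busy handling potentially long-running
		  softirqs when selecting a CPU for a waking RT task.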


> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 55f39c8f4203..3c628db807c8 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1599,6 +1599,44 @@ static void yield_task_rt(struct rq *rq)
> >  #ifdef CONFIG_SMP
> >  static int find_lowest_rq(struct task_struct *task);
> >
> > +#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
> > +#define __use_softirq_opt 1
>
> Why not just use IS_ENABLED(CONFIG_RT_SOFTIRQ_OPT..) instead of defining new
> macros?

Mostly for readability. The lines where it's added are already fairly long, e.g.:
  if (static_branch_unlikely(&sched_asym_cpucapacity)) {

So it seemed nicer to have the shorter macro.  But I'm not strongly
opinionated on this.
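
For comparison, at a call site like the one in this patch (just an
illustration of the two spellings, not the exact code):

	/* With the helper macro: */
	if (__use_softirq_opt && cpu_busy_with_softirqs(cpu))
		continue;

	/* With IS_ENABLED() spelled out inline: */
	if (IS_ENABLED(CONFIG_RT_SOFTIRQ_OPTIMIZATION) &&
	    cpu_busy_with_softirqs(cpu))
		continue;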

> > + * Return whether the given cpu is currently non-preemptible
> > + * while handling a potentially long softirq, or if the current
> > + * task is likely to block preemptions soon because it is a
> > + * ksoftirq thread that is handling slow softirq.
> > + */
> > +static bool cpu_busy_with_softirqs(int cpu)
> > +{
> > +     u32 softirqs = per_cpu(active_softirqs, cpu) |
> > +                    __cpu_softirq_pending(cpu);
> > +     struct task_struct *cpu_ksoftirqd = per_cpu(ksoftirqd, cpu);
> > +     struct task_struct *curr;
> > +     struct rq *rq = cpu_rq(cpu);
> > +     int ret;
> > +
> > +     rcu_read_lock();
>
> Could you clarify what is the RCU read-side critical section protecting?
>
> > +     curr = READ_ONCE(rq->curr); /* unlocked access */
> > +     ret = (softirqs & LONG_SOFTIRQ_MASK) &&
> > +              (curr == cpu_ksoftirqd ||
> > +               preempt_count() & SOFTIRQ_MASK);
>
> If the goal is to avoid ksoftirqd when it is running an actual softirq,
> doesn't ksoftirqd already run softirqs under local_bh_disable()? If so, then
> SOFTIRQ_MASK should already be set. If ksoftirqd is in between softirqs (and
> bh is enabled there), then scheduling the RT task on the CPU may preempt
> ksoftirqd and give the RT task a chance to run anyway, right?

Yeah. I'm refactoring/simplifying this for v5, so hopefully it will be
more straightforward.
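
Just to give a sense of the direction (a sketch only, not necessarily
what v5 will end up as): the per-cpu test may boil down to just the
long-softirq bits, given, as you note, that softirq execution already
shows up in the SOFTIRQ bits of preempt_count:

	/* Sketch only -- not necessarily the v5 code: */
	static bool cpu_busy_with_softirqs(int cpu)
	{
		u32 softirqs = per_cpu(active_softirqs, cpu) |
			       __cpu_softirq_pending(cpu);

		return softirqs & LONG_SOFTIRQ_MASK;
	}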

Thanks again for the feedback!
-john
