Message-ID: <CANDhNCoTGNANDnOQ6touhreG_UEn1-N4T4BktWFxpLPSWVfrdA@mail.gmail.com>
Date: Mon, 17 Oct 2022 20:42:53 -0700
From: John Stultz <jstultz@...gle.com>
To: Alexander Gordeev <agordeev@...ux.ibm.com>
Cc: LKML <linux-kernel@...r.kernel.org>,
"Connor O'Brien" <connoro@...gle.com>,
John Dias <joaodias@...gle.com>, Rick Yiu <rickyiu@...gle.com>,
John Kacur <jkacur@...hat.com>,
Qais Yousef <qais.yousef@....com>,
Chris Redpath <chris.redpath@....com>,
Abhijeet Dharmapurikar <adharmap@...cinc.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Thomas Gleixner <tglx@...utronix.de>, kernel-team@...roid.com,
"J . Avila" <elavila@...gle.com>
Subject: Re: [PATCH RFC v4 2/3] sched: Avoid placing RT threads on cores
	handling long softirqs

On Mon, Oct 17, 2022 at 5:40 AM Alexander Gordeev
<agordeev@...ux.ibm.com> wrote:
>
> On Mon, Oct 03, 2022 at 11:20:32PM +0000, John Stultz wrote:
> > From: Connor O'Brien <connoro@...gle.com>
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 55f39c8f4203..3c628db807c8 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -1599,6 +1599,44 @@ static void yield_task_rt(struct rq *rq)
> > #ifdef CONFIG_SMP
> > static int find_lowest_rq(struct task_struct *task);
> >
> > +#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
> > +#define __use_softirq_opt 1
> > +/*
> > + * Return whether the given cpu is currently non-preemptible
> > + * while handling a potentially long softirq, or if the current
> > + * task is likely to block preemptions soon because it is a
> > + * ksoftirq thread that is handling slow softirq.
>
> What are slow softirqs in this context compared to long ones?
So the long softirqs are any of the softirqs in LONG_SOFTIRQ_MASK
(net, block, irqpoll).
The "slow" softirq just refers to softirq processing that has already
been pushed out to ksoftirqd on the target cpu.
By default, ksoftirqd's priority is likely not high enough to prevent
the RT task from running; however, ksoftirqd does disable preemption
while it is running softirqs, so it could effectively block the RT
task from running for a while.
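For reference, the mask is roughly a bitmask of those vectors, i.e.
something along the lines of (see the patch for the exact definition):

  #define LONG_SOFTIRQ_MASK ((1 << NET_TX_SOFTIRQ)  | \
                             (1 << NET_RX_SOFTIRQ)  | \
                             (1 << BLOCK_SOFTIRQ)   | \
                             (1 << IRQ_POLL_SOFTIRQ))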
> > + */
> > +static bool cpu_busy_with_softirqs(int cpu)
> > +{
> > +        u32 softirqs = per_cpu(active_softirqs, cpu) |
> > +                       __cpu_softirq_pending(cpu);
> > +        struct task_struct *cpu_ksoftirqd = per_cpu(ksoftirqd, cpu);
> > +        struct task_struct *curr;
> > +        struct rq *rq = cpu_rq(cpu);
> > +        int ret;
> > +
> > +        rcu_read_lock();
> > +        curr = READ_ONCE(rq->curr); /* unlocked access */
>
> select_task_rq_rt() takes the lock and reads curr already,
> before calling this function. I think there is a way to
> decompose it in a better way.
Hrm. Suggestions? select_task_rq_rt() is only one of the callers.
Trying to pass curr into cpu_busy_with_softirqs() would mean
cpupri_find_fitness() would also need to read cpu_rq(cpu)->curr for
the specified cpu and pass that in.
The original patch actually was
> > +        ret = (softirqs & LONG_SOFTIRQ_MASK) &&
> > +              (curr == cpu_ksoftirqd ||
>
> EOL is extra.
Removed - thanks!
>
> > +               preempt_count() & SOFTIRQ_MASK);
>
> Could you please clarify this whole check in more detail?
>
> What is the point in checking if a remote CPU is handling
> a "long" softirq while the local one is handling any softirq?
Good question! This looks like an error from my rework of the earlier
version of the patch.
In v1 it was:
  task_thread_info(curr)->preempt_count & SOFTIRQ_MASK));
It looks like I erroneously condensed that to
  preempt_count() & SOFTIRQ_MASK
treating curr (from the target cpu's rq) as if it were current. :P
I'll fix this. Thank you for catching it!
Just to expand what it should be in detail:
1:   (softirqs & LONG_SOFTIRQ_MASK) &&
2:   (curr == cpu_ksoftirqd ||
3:    task_thread_info(curr)->preempt_count & SOFTIRQ_MASK)
Where we're checking:
1) that the combined active_softirqs and __cpu_softirq_pending() mask
   for the target cpu includes a long softirq,
AND (
2) the current task on the target cpu is ksoftirqd,
OR
3) the preempt_count of the current task on the target cpu has SOFTIRQ
   entries
)
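So the corrected check should read something like:

  ret = (softirqs & LONG_SOFTIRQ_MASK) &&
        (curr == cpu_ksoftirqd ||
         task_thread_info(curr)->preempt_count & SOFTIRQ_MASK);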
> > +        rcu_read_unlock();
>
> Why does ret need to be calculated under the lock?
I believe it is for the curr == cpu_ksoftirqd check (and the
erroneously dropped task_thread_info(curr) access)?
Part of me wonders if it should use rcu_dereference(rq->curr) as well.
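That is, something like the following instead of the plain READ_ONCE():

  curr = rcu_dereference(rq->curr);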
> > +        return ret;
> > +}
> > +#else
> > +#define __use_softirq_opt 0
> > +static bool cpu_busy_with_softirqs(int cpu)
> > +{
> > +        return false;
> > +}
> > +#endif /* CONFIG_RT_SOFTIRQ_OPTIMIZATION */
> > +
> > +static bool rt_task_fits_cpu(struct task_struct *p, int cpu)
>
> To me, the new name is unfortunate, since it strips the notion
> of the reason. Instead of "CPU un/fits, because of capacity" it
> reads as "CPU un/fits, because of ..." what?
As with all complicated questions: "It depends" :) On the kernel
config, specifically.
The earlier version of the patch had
rt_task_fits_capacity_and_may_preempt(), but Qais suggested it be
switched to the generic "fits_cpu", since which aspect of "fit" is
being considered depends on the kernel config.
I guess it could be rt_task_fits_capacity_and_softirq_free()?
> > +{
> > +        return !cpu_busy_with_softirqs(cpu) && rt_task_fits_capacity(p, cpu);
>
> I guess the order needs to be swapped, as rt_task_fits_capacity()
> is rather "quick" while cpu_busy_with_softirqs() is rather "slow".
Fair point. I've gone ahead and tweaked that. Thanks!
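With the order swapped, the check becomes:

  return rt_task_fits_capacity(p, cpu) && !cpu_busy_with_softirqs(cpu);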
> > @@ -1894,14 +1934,17 @@ static int find_lowest_rq(struct task_struct *task)
> >         return -1; /* No other targets possible */
> >
> > /*
> > - * If we're on asym system ensure we consider the different capacities
> > - * of the CPUs when searching for the lowest_mask.
> > + * If we're using the softirq optimization or if we are
> > + * on asym system, ensure we consider the softirq processing
> > + * or different capacities of the CPUs when searching for the
> > + * lowest_mask.
> > */
> > -        if (static_branch_unlikely(&sched_asym_cpucapacity)) {
> > +        if (__use_softirq_opt ||
>
> Why use __use_softirq_opt and not IS_ENABLED(CONFIG_RT_SOFTIRQ_OPTIMIZATION)?
Because IS_ENABLED(CONFIG_RT_SOFTIRQ_OPTIMIZATION) seemed a bit long
and less easy to read?
I don't feel strongly about it: if folks would rather have the CONFIG
switch be explicit here, that works too, but to me the flag variable
reads more nicely.
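For comparison, the two spellings of the same condition would be:

  if (__use_softirq_opt ||
      static_branch_unlikely(&sched_asym_cpucapacity)) {

versus:

  if (IS_ENABLED(CONFIG_RT_SOFTIRQ_OPTIMIZATION) ||
      static_branch_unlikely(&sched_asym_cpucapacity)) {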
> > @@ -60,6 +60,13 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp
> >
> > DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
> >
> > +/*
> > + * active_softirqs -- per cpu, a mask of softirqs that are being handled,
> > + * with the expectation that approximate answers are acceptable and therefore
> > + * no synchronization.
> > + */
> > +DEFINE_PER_CPU(u32, active_softirqs);
>
> I guess all active_softirqs uses need to be coupled with an
> IS_ENABLED(CONFIG_RT_SOFTIRQ_OPTIMIZATION) check.
I guess it could be wrapped. I didn't see the per-cpu accounting as
troublesome enough to make conditional, but the extra per-cpu data
isn't great if no one uses it, so I'll do that.
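A rough sketch of how it might be wrapped (set_active_softirqs() is a
made-up helper name, just for illustration):

  #ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
  DEFINE_PER_CPU(u32, active_softirqs);
  /* record the mask of softirqs currently being handled on this cpu */
  static inline void set_active_softirqs(u32 pending)
  {
          __this_cpu_write(active_softirqs, pending);
  }
  #else
  static inline void set_active_softirqs(u32 pending) { }
  #endif

so the accounting in __do_softirq() compiles away when the option is
off.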
Thanks so much for the review and feedback! I really appreciate it.
-john