Date: Thu, 13 Jun 2024 11:59:31 -0700
From: John Stultz <jstultz@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Thomas Gleixner <tglx@...utronix.de>, 
	Frederic Weisbecker <frederic@...nel.org>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Valentin Schneider <vschneid@...hat.com>, 
	Qais Yousef <qyousef@...alina.io>, Joel Fernandes <joel@...lfernandes.org>, kernel-team@...roid.com
Subject: Re: [PATCH] RFC: sched: Rework task_sched_runtime to avoid calling update_rq_clock

On Thu, Jun 13, 2024 at 3:04 AM Peter Zijlstra <peterz@...radead.org> wrote:
> On Wed, Jun 12, 2024 at 06:58:26PM -0700, John Stultz wrote:
> > I recently got a bug report that
> > clock_gettime(CLOCK_THREAD_CPUTIME_ID,...) had regressed between
> > 5.10 and 6.1. It's not a huge regression in absolute time
> > (~30-40ns), but it is a >10% change.
> >
> > I narrowed the cause down to the addition of
> > psi_account_irqtime() in update_rq_clock_task(), in commit
> > 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
> > pressure")
> >
> > So that explains the behavior change,
>
> It doesn't really... that patch just feeds the irq_time we *already*
> subtracted prior to it, to PSI, such that PSI can also enjoy the 'view'.
>
> The only explanation I have is that interrupts that end up in the
> scheduler (wakeups, tick, etc..) now get to do that PSI cgroup iteration
> and that cost adds to the IRQ time itself, and as such the task time
> slows down accordingly.
>
> Are you using silly deep cgroup hierarchies?

Not deeply nested. We have a handful of cgroups, but due to
priority-inversion issues we're not putting any tasks in them right
now.

Again, this is only a ~30ns delta, so it's not a huge amount of time
we're talking about. The cpu_clock() call in psi_account_irqtime is
probably a reasonable chunk of that (as just the vdso clock_gettime()
takes a similar order of time).
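
(For reference on the scale here: the measurement is just a tight loop
around the syscall. This isn't my actual reproducer, just a rough
userspace sketch of how the per-call cost gets averaged out:)

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define ITERS 1000000ULL

/* Rough sketch, not the actual reproducer: average the per-call cost
 * of reading the thread cputime clock.
 */
static uint64_t ts_ns(const struct timespec *ts)
{
	return (uint64_t)ts->tv_sec * 1000000000ULL + (uint64_t)ts->tv_nsec;
}

int main(void)
{
	struct timespec start, end, dummy;
	uint64_t i;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < ITERS; i++)
		clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("avg: %llu ns/call\n",
	       (unsigned long long)((ts_ns(&end) - ts_ns(&start)) / ITERS));
	return 0;
}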

> > but it also seems odd that
> > we're doing psi irq accounting from a syscall that is just
> > trying to read the thread's cputime.
>
> In order to avoid doing all the accounting in the IRQ entry/exit paths,
> those paths only do the bare minimum of tracking how much IRQ time there
> is.
>
> update_rq_clock_task() then looks at the increase of IRQ time and
> subtracts this from the task time -- after all, all time spent in the
> IRQ wasn't spent on the task itself.
>
> It did that prior to the PSI patch, and it does so after. The only
> change is it now feeds this delta into the PSI thing.

Yeah. I don't see the PSI logic as particularly bad, it's just that I
was surprised we were getting into the accounting paths during a read.

Coming from the timekeeping subsystem, we usually do the time
accumulation/accounting (write side) periodically, and on the read
fast-path we avoid anything extra. So it was just a bit unintuitive to
me to find we were doing the accounting/accumulation to the task clock
on each read.
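
(To sketch the kind of split I mean in miniature -- purely
illustrative with made-up names, not a proposal -- the write side does
the accumulation periodically and the read side is just a load:)

#include <stdatomic.h>
#include <stdint.h>

/* Toy illustration only: accounting happens on the periodic write
 * side, the read fast-path does no accounting work at all.
 */
static _Atomic uint64_t thread_ns;	/* accumulated task time */

/* write side: called from periodic accounting (e.g. the tick) */
static inline void account_delta(uint64_t delta_ns)
{
	atomic_fetch_add_explicit(&thread_ns, delta_ns, memory_order_relaxed);
}

/* read side: the clock_gettime()-style fast path */
static inline uint64_t read_thread_ns(void)
{
	return atomic_load_explicit(&thread_ns, memory_order_relaxed);
}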

> > NOTE: I'm not 100% sure this is correct yet. There may be some
> > edge cases I've overlooked, so I'd greatly appreciate any
> > review or feedback.
>
> Urgh, you're sprinkling the details of what is clock_task over multiple
> places.

Yeah. Agreed, the base-irq-steal logic I added should have been
wrapped up better.


> Does something like so work?

Yeah, it takes psi_account_irqtime() out of the hot path, so with it
we're back to ~5.10 numbers.

Qais' correctness concerns seem reasonable, and I worry that adding a
second psi_irq_time accounting base creates multiple irq_time
timelines to manage (and eventually fumble up). So I still think the
idea of separate read/write paths might be an overall improvement.
But your patch works properly and mine doesn't, so I can't object too
strongly here. :)  I may still take another pass at my patch to see
if we can make the clock_gettime() read path avoid the accounting
logic, but it can be a throwaway if you really don't like it.

> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..d4b87539d72a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -724,7 +724,6 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>
>         rq->prev_irq_time += irq_delta;
>         delta -= irq_delta;
> -       psi_account_irqtime(rq->curr, irq_delta);
>         delayacct_irq(rq->curr, irq_delta);
>  #endif
>  #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> @@ -5459,6 +5458,8 @@ void sched_tick(void)
>
>         sched_clock_tick();
>
> +       psi_account_irqtime(curr, &rq->psi_irq_time);
> +
>         rq_lock(rq, &rf);

Should the psi_account_irqtime() call (and its use of
rq->psi_irq_time) go under the rq_lock here?  This is the only user,
so there isn't any practical parallel access, but the locking rules
are subtle enough already.
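
(i.e. something like the below, just to illustrate where I mean, with
the rest of sched_tick() elided:)

	sched_clock_tick();

	rq_lock(rq, &rf);

	psi_account_irqtime(curr, &rq->psi_irq_time);

	/* ... rest of the tick under the lock ... */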

thanks
-john
