Message-ID: <20251029145513.GO3245006@noisy.programming.kicks-ass.net>
Date: Wed, 29 Oct 2025 15:55:13 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Dmitry Ilvokhin <d@...okhin.com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND] sched/stats: Optimize /proc/schedstat printing
On Wed, Oct 29, 2025 at 02:46:33PM +0000, Dmitry Ilvokhin wrote:
> On Wed, Oct 29, 2025 at 03:07:55PM +0100, Peter Zijlstra wrote:
> > On Wed, Oct 29, 2025 at 01:07:15PM +0000, Dmitry Ilvokhin wrote:
> > > The seq_printf() function supports rich format strings for printing
> > > decimals, but /proc/schedstat has no need for that, since the
> > > majority of the data is space-separated decimals. Use
> > > seq_put_decimal_ull() instead as a faster alternative.
> > >
> > > Performance counter stats (truncated) for sh -c 'cat /proc/schedstat >
> > > /dev/null' before and after applying the patch from machine with 72 CPUs
> > > are below.
> > >
> > > Before:
> > >
> > > 2.94 msec task-clock # 0.820 CPUs utilized
> > > 1 context-switches # 340.551 /sec
> > > 0 cpu-migrations # 0.000 /sec
> > > 340 page-faults # 115.787 K/sec
> > > 10,327,200 instructions # 1.89 insn per cycle
> > > # 0.10 stalled cycles per insn
> > > 5,458,307 cycles # 1.859 GHz
> > > 1,052,733 stalled-cycles-frontend # 19.29% frontend cycles idle
> > > 2,066,321 branches # 703.687 M/sec
> > > 25,621 branch-misses # 1.24% of all branches
> > >
> > > 0.00357974 +- 0.00000209 seconds time elapsed ( +- 0.06% )
> > >
> > > After:
> > >
> > > 2.50 msec task-clock # 0.785 CPUs utilized
> > > 1 context-switches # 399.780 /sec
> > > 0 cpu-migrations # 0.000 /sec
> > > 340 page-faults # 135.925 K/sec
> > > 7,371,867 instructions # 1.59 insn per cycle
> > > # 0.13 stalled cycles per insn
> > > 4,647,053 cycles # 1.858 GHz
> > > 986,487 stalled-cycles-frontend # 21.23% frontend cycles idle
> > > 1,591,374 branches # 636.199 M/sec
> > > 28,973 branch-misses # 1.82% of all branches
> > >
> > > 0.00318461 +- 0.00000295 seconds time elapsed ( +- 0.09% )
> > >
> > > This is a ~11% (relative) improvement in elapsed time.
> >
> > Yeah, but who cares? Why do we want less obvious code for a silly stats
> > file?
>
> Thanks for the feedback, Peter.
>
> Fair point that /proc/schedstat isn’t a hot path in the kernel itself,
> but it is a hot path for monitoring software (Prometheus for example).
Aliens! I like Xenomorphs :-) But I doubt that's what you're talking
about.
> In large fleets, these files are polled periodically (often every few
> seconds) on every machine. The cumulative overhead adds up quickly
> across thousands of nodes, so reducing the cost of generating these
> stats does have a measurable operational impact. With the ongoing trend
> toward higher core counts per machine, this cost becomes even more
> noticeable over time.
>
> I've tried to keep the code as readable as possible, but I understand if
> you think an ~11% improvement isn't worth the added complexity. If you
> have suggestions for making the code cleaner or the intent clearer, I’d
> be happy to rework it.
What are they doing this for? I would much rather rework all this such
that all the schedstat crap becomes tracepoints and all the existing
cruft optional consumers of that.
Like I argued here:
https://lkml.kernel.org/r/20250703141800.GX1613200@noisy.programming.kicks-ass.net
Then people can consume them however makes most sense, ideally with a
binary interface if it is high bandwidth.