[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200421170836.4aafcbc2@gandalf.local.home>
Date: Tue, 21 Apr 2020 17:08:36 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Alexey Dobriyan <adobriyan@...il.com>
Cc: Andy Shevchenko <andriy.shevchenko@...ux.intel.com>,
akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, pmladek@...e.com,
sergey.senozhatsky@...il.com, linux@...musvillemoes.dk
Subject: Re: [PATCH 03/15] print_integer: new and improved way of printing
integers
On Tue, 21 Apr 2020 19:49:24 +0300
Alexey Dobriyan <adobriyan@...il.com> wrote:
> > Exactly. The code in _print_integer_u32() doesn't look as fast as the
> > code in vsprintf() that happens to use lookup tables and converts
> > without any loops.
> >
> > Hint, loops are bad, they cause the CPU to slow down.
>
> Oh, come on! Loops make code fit into icache and μop decode cache.
Depends on the architecture.
>
> > Anyway, this patch series would require a pretty good improvement, as
> > the code replacing the sprintf() usages is pretty ugly compared to a
> > simple sprintf() call.
>
> No! Fast code must look ugly. Or in other words if you try to optimise
> integer printing to death you'll probably end with something like
> _print_integer().
As I stated, it will require a "pretty good improvement". There's always a
trade off. If the improvement is noticeable for real life cases, then ugly
code is worth it. If we are making ugly code for a benefit of something
that never shows outside of noise, then the net cost (less maintainable
code), is not worth it.
>
> When the very first patch changed /proc/stat to seq_put_decimal_ull()
> the speed up was 66% (or 33%). That's how slow printing was back then.
> It can be made slightly faster even now.
I'd like to see the tests that your ran (to reproduce them myself).
The first patch was making a 64bit number into 32bit number, thus
shortening the work by half.
>
> > Randomly picking patch 6:
> >
> > static int loadavg_proc_show(struct seq_file *m, void *v)
> > {
> > unsigned long avnrun[3];
> >
> > get_avenrun(avnrun, FIXED_1/200, 0);
> >
> > seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n",
> > LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
> > LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
> > LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
> > nr_running(), nr_threads,
> > idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> > return 0;
> > }
> >
> > *vs*
> >
> > static int loadavg_proc_show(struct seq_file *m, void *v)
> > {
> > unsigned long avnrun[3];
> > char buf[3 * (LEN_UL + 1 + 2 + 1) + 10 + 1 + 10 + 1 + 10 + 1];
> > char *p = buf + sizeof(buf);
> > int i;
> >
> > *--p = '\n';
> > p = _print_integer_u32(p, idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> > *--p = ' ';
> > p = _print_integer_u32(p, nr_threads);
> > *--p = '/';
> > p = _print_integer_u32(p, nr_running());
> >
> > get_avenrun(avnrun, FIXED_1/200, 0);
> > for (i = 2; i >= 0; i--) {
> > *--p = ' ';
> > --p; /* overwritten */
> > *--p = '0'; /* conditionally overwritten */
> > (void)_print_integer_u32(p + 2, LOAD_FRAC(avnrun[i]));
> > *--p = '.';
> > p = _print_integer_ul(p, LOAD_INT(avnrun[i]));
> > }
> >
> > seq_write(m, p, buf + sizeof(buf) - p);
> > return 0;
> > }
> >
> >
> > I much rather keep the first version.
>
> I did the benchmarks (without stack protector though), everything except
> /proc/cpuinfo and /proc/meminfo became faster. This requires investigation
> and I can drop vsprintf() changes until then.
>
> Now given that /proc/uptime format cast in stone, code may look a bit ugly
> and unusual but it won't require maintainance
Please share what you did as your benchmarks. If this is as good of a
performance as you claim, then these changes would be worth looking into.
So I applied your entire series, added the following patch:
diff --git a/include/linux/spinlock_api_smp.h
b/include/linux/spinlock_api_smp.h index 19a9be9d97ee..17c582d77ab7 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -152,9 +152,16 @@ static inline void __raw_spin_unlock(raw_spinlock_t
*lock) preempt_enable();
}
+extern u64 sched_clock(void);
static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock,
unsigned long flags)
{
+ char buf[32];
+ u64 start, stop;
+ start = sched_clock();
+ sprintf(buf,"%lld", (unsigned long)lock);
+ stop = sched_clock();
+ trace_printk("time: %lld '%s'\n", stop - start, buf);
spin_release(&lock->dep_map, _RET_IP_);
do_raw_spin_unlock(lock);
local_irq_restore(flags);
Then after boot up, I did the following:
# trace-cmd stop
# trace-cmd extract
Which captured the traces:
<idle>-0 [003] 5.405208: bprint: _raw_spin_unlock_irqrestore: time: 799 '-110308271193088'
<idle>-0 [003] 5.405210: bprint: _raw_spin_unlock_irqrestore: time: 273 '-110308271193088'
<idle>-0 [003] 5.412235: bprint: _raw_spin_unlock_irqrestore: time: 1138 '-110308271193088'
<idle>-0 [003] 5.412236: bprint: _raw_spin_unlock_irqrestore: time: 213 '-110308271193088'
<idle>-0 [003] 5.414241: bprint: _raw_spin_unlock_irqrestore: time: 1094 '-110308271193088'
<idle>-0 [003] 5.414243: bprint: _raw_spin_unlock_irqrestore: time: 182 '-110308271193088'
<idle>-0 [003] 5.418241: bprint: _raw_spin_unlock_irqrestore: time: 1113 '-110308271193088'
Where "time: X", X is the delta in nanoseconds around the sprintf() call.
The I ran the attached perl program on the output, and got the following
results:
Before your patches:
# trace-cmd report | ./report.pl
full_total = 52844823
average: 255.902176229032
std: 439.269729814847
And with your patches:
# trace-cmd report | ./report.pl
full_total = 84725476
average: 407.873274762306
std: 555.755670463724
As the standard deviation is bigger than the average, it appears to be all
in the noise.
Then I decided to see if it affects "ps -eux"
Original:
# perf stat -r 100 ps -eux > /dev/null
Performance counter stats for 'ps -eux' (100 runs):
8.92 msec task-clock # 0.937 CPUs utilized ( +- 0.90% )
5 context-switches # 0.545 K/sec ( +- 1.24% )
0 cpu-migrations # 0.000 K/sec
259 page-faults # 0.029 M/sec ( +- 0.07% )
32,973,751 cycles # 3.698 GHz ( +- 0.09% )
17,254,307 stalled-cycles-frontend # 52.33% frontend cycles idle ( +- 0.17% )
38,707,960 instructions # 1.17 insn per cycle
# 0.45 stalled cycles per insn ( +- 0.01% )
8,153,197 branches # 914.274 M/sec ( +- 0.01% )
114,992 branch-misses # 1.41% of all branches ( +- 0.12% )
0.0095170 +- 0.0000829 seconds time elapsed ( +- 0.87% )
With your patches:
# perf stat -r 100 ps -eux > /dev/null
Performance counter stats for 'ps -eux' (100 runs):
8.86 msec task-clock # 0.918 CPUs utilized ( +- 1.06% )
5 context-switches # 0.527 K/sec ( +- 1.22% )
0 cpu-migrations # 0.001 K/sec ( +-100.00% )
259 page-faults # 0.029 M/sec ( +- 0.08% )
32,699,168 cycles # 3.692 GHz ( +- 0.12% )
16,995,861 stalled-cycles-frontend # 51.98% frontend cycles idle ( +- 0.21% )
38,114,396 instructions # 1.17 insn per cycle
# 0.45 stalled cycles per insn ( +- 0.03% )
7,985,526 branches # 901.625 M/sec ( +- 0.03% )
112,852 branch-misses # 1.41% of all branches ( +- 0.17% )
0.009652 +- 0.000276 seconds time elapsed ( +- 2.86% )
Not really much difference.
Then what about just catting /proc/$$/stat, and do a 1000 runs!
Original:
# perf stat -r 1000 cat /proc/$$/stat > /dev/null
Performance counter stats for 'cat /proc/1622/stat' (1000 runs):
0.34 msec task-clock # 0.680 CPUs utilized ( +- 0.21% )
0 context-switches # 0.071 K/sec ( +- 20.18% )
0 cpu-migrations # 0.000 K/sec
65 page-faults # 0.192 M/sec ( +- 0.07% )
993,486 cycles # 2.934 GHz ( +- 0.07% )
577,903 stalled-cycles-frontend # 58.17% frontend cycles idle ( +- 0.09% )
936,489 instructions # 0.94 insn per cycle
# 0.62 stalled cycles per insn ( +- 0.07% )
202,912 branches # 599.213 M/sec ( +- 0.07% )
6,976 branch-misses # 3.44% of all branches ( +- 0.11% )
0.00049797 +- 0.00000111 seconds time elapsed ( +- 0.22% )
With your patches:
# perf stat -r 1000 cat /proc/$$/stat > /dev/null
Performance counter stats for 'cat /proc/1624/stat' (1000 runs):
0.34 msec task-clock # 0.664 CPUs utilized ( +- 0.23% )
0 context-switches # 0.018 K/sec ( +- 40.72% )
0 cpu-migrations # 0.000 K/sec
65 page-faults # 0.190 M/sec ( +- 0.07% )
988,430 cycles # 2.892 GHz ( +- 0.07% )
574,841 stalled-cycles-frontend # 58.16% frontend cycles idle ( +- 0.09% )
933,852 instructions # 0.94 insn per cycle
# 0.62 stalled cycles per insn ( +- 0.07% )
202,096 branches # 591.297 M/sec ( +- 0.07% )
6,836 branch-misses # 3.38% of all branches ( +- 0.09% )
0.00051476 +- 0.00000557 seconds time elapsed ( +- 1.08% )
They are pretty much identical.
What's the purpose of all these changes again? There was no cover letter.
-- Steve
Download attachment "report.pl" of type "application/x-perl" (474 bytes)
Powered by blists - more mailing lists