Message-ID: <e6de0ec1-c59c-fea4-0335-4c5609e21656@prevas.dk>
Date: Wed, 24 Jan 2018 09:54:09 +0100
From: Rasmus Villemoes <rasmus.villemoes@...vas.dk>
To: Andrey Ryabinin <aryabinin@...tuozzo.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Linus Torvalds <torvalds@...ux-foundation.org>
CC: <linux-kernel@...r.kernel.org>, Kees Cook <keescook@...omium.org>,
Eryu Guan <eguan@...hat.com>,
Alexander Potapenko <glider@...gle.com>,
Chris Metcalf <metcalf@...m.mit.edu>,
David Laight <David.Laight@...LAB.COM>,
Dmitry Vyukov <dvyukov@...gle.com>, <stable@...r.kernel.org>
Subject: Re: [PATCH] lib/strscpy: remove word-at-a-time optimization.
On 2018-01-09 17:47, Andrey Ryabinin wrote:
> Attached user space program I used to see the difference.
> Usage:
> gcc -O2 -o strscpy strscpy_test.c
> ./strscpy {b|w} src_str_len count
>
> src_str_len - length of the source string, between 1 and 4096
> count - how many strscpy() to execute.
>
> Also, I've noticed something strange. I'm not sure why, but certain
> src_len values (e.g. 30) drive the branch predictor crazy, causing
> worse-than-usual results for byte-at-a-time copy:
I see something similar, but at the 30->31 transition, and the
branch-misses remain at 1-3% for higher values, until 42 where it drops
back to 0%. Anyway, I highly doubt we do a lot of string copies of
strings longer than 32.
$ perf stat ./strscpy_test b 30 10000000

 Performance counter stats for './strscpy_test b 30 10000000':

        156,777082      task-clock (msec)         #    0,999 CPUs utilized
                 0      context-switches          #    0,000 K/sec
                 0      cpu-migrations            #    0,000 K/sec
                48      page-faults               #    0,306 K/sec
       584.646.177      cycles                    #    3,729 GHz
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
     2.580.599.614      instructions              #    4,41  insns per cycle
       660.114.283      branches                  # 4210,528 M/sec
             4.891      branch-misses             #    0,00% of all branches

       0,156970910 seconds time elapsed

$ perf stat ./strscpy_test b 31 10000000

 Performance counter stats for './strscpy_test b 31 10000000':

        258,533250      task-clock (msec)         #    0,999 CPUs utilized
                 0      context-switches          #    0,000 K/sec
                 0      cpu-migrations            #    0,000 K/sec
                50      page-faults               #    0,193 K/sec
       965.505.138      cycles                    #    3,735 GHz
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
     2.660.773.463      instructions              #    2,76  insns per cycle
       680.141.051      branches                  # 2630,768 M/sec
        19.150.367      branch-misses             #    2,82% of all branches

       0,258725192 seconds time elapsed

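For readers without the attachment, the byte-at-a-time variant measured above boils down to a loop like the one below. This is only a minimal user-space sketch, not the code from strscpy_test.c: the name strscpy_bytewise is hypothetical, and the return convention (length copied, or -E2BIG on truncation or zero count) is assumed to follow the kernel's strscpy() contract.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper, modeled on the kernel's strscpy() contract:
 * copy at most count bytes including the terminating NUL, return the
 * number of characters copied (excluding the NUL), or -E2BIG if count
 * is zero or the source had to be truncated.
 */
static ssize_t strscpy_bytewise(char *dest, const char *src, size_t count)
{
	size_t i;

	assert(dest != NULL && src != NULL);

	if (count == 0)
		return -E2BIG;

	/* One load, one store and one data-dependent branch per byte. */
	for (i = 0; i < count; i++) {
		dest[i] = src[i];
		if (src[i] == '\0')
			return (ssize_t)i;
	}

	/* Ran out of room: truncate and keep dest NUL-terminated. */
	dest[count - 1] = '\0';
	return -E2BIG;
}
```

The loop exit depends on the data, so for a fixed source length the predictor has to learn a pattern whose period is roughly that length. That may be related to the jump in branch-misses seen between 30 and 31, though the sketch itself does not demonstrate it.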
Rasmus