Message-ID: <20190623052647.GA9838@gmail.com>
Date: Sat, 22 Jun 2019 22:26:48 -0700
From: Andrei Vagin <avagin@...il.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Dmitry Safonov <dima@...sta.com>, linux-kernel@...r.kernel.org,
Adrian Reber <adrian@...as.de>,
Andrei Vagin <avagin@...nvz.org>,
Andy Lutomirski <luto@...nel.org>,
Arnd Bergmann <arnd@...db.de>,
Christian Brauner <christian.brauner@...ntu.com>,
Cyrill Gorcunov <gorcunov@...nvz.org>,
Dmitry Safonov <0x7f454c46@...il.com>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
Jann Horn <jannh@...gle.com>, Jeff Dike <jdike@...toit.com>,
Oleg Nesterov <oleg@...hat.com>,
Pavel Emelyanov <xemul@...tuozzo.com>,
Shuah Khan <shuah@...nel.org>,
Vincenzo Frascino <vincenzo.frascino@....com>,
containers@...ts.linux-foundation.org, criu@...nvz.org,
linux-api@...r.kernel.org, x86@...nel.org
Subject: Re: [PATCHv4 26/28] x86/vdso: Align VDSO functions by CPU L1 cache line
On Fri, Jun 14, 2019 at 04:13:31PM +0200, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Dmitry Safonov wrote:
>
> > From: Andrei Vagin <avagin@...il.com>
> >
> > After performance testing the VDSO patches, a noticeable 20% regression was
> > found in the gettime_perf selftest with a cold cache.
> > As it turns out, before the time namespaces introduction the VDSO functions
> > were nicely aligned to cache lines, but adding new code to adjust the
> > timens offset inside a namespace created a small shift, and the vdso
> > functions became unaligned on cache lines.
> >
> > Align the vdso functions with a gcc option to fix the performance drop.
> >
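(For reference, the gcc option in question is function alignment, i.e.
-falign-functions= with the L1 line size, 64 bytes on x86, applied to the
vdso objects rather than by annotating the C sources. A rough per-function
equivalent, shown only to illustrate what the alignment means; the function
name below is made up, not a real vdso symbol:

	/*
	 * -falign-functions=64 asks gcc to start every function on a
	 * 64-byte (L1 cache line) boundary.  The attribute below has the
	 * same effect for a single function.
	 */
	int __attribute__((aligned(64))) my_vdso_func(void)
	{
		return 0;
	}

In the patch itself the flag is applied to the vdso build flags, so every
function gets the alignment without touching each one.)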
> > Copying the resulting numbers from the cover letter:
> >
> > Hot CPU cache (more gettime_perf.c cycles is better):
> >          |   before   | CONFIG_TIME_NS=n |    host    | inside timens
> > ---------|------------|------------------|------------|--------------
> > cycles   | 139887013  |    139453003     | 139899785  |   128792458
> > diff (%) |    100     |      99.7        |    100     |      92
>
> Why is CONFIG_TIME_NS=n behaving worse than current mainline and
> worse than 'host' mode?
We should have specified the precision of these numbers: it is larger than
this 0.3% difference, so at the time I decided there was nothing to worry
about. I did those measurements a few months ago for the second version of
this series. I have repeated the measurements for this set of patches:
        |   before   | CONFIG_TIME_NS=n |    host    | inside timens
--------|------------|------------------|------------|--------------
        | 144645498  |    142916801     | 140364862  |   132378440
        | 143440633  |    141545739     | 140540053  |   132714190
        | 144876395  |    144650599     | 140026814  |   131843318
        | 143984551  |    144595770     | 140359260  |   131683544
        | 144875682  |    143799788     | 140692618  |   131300332
--------|------------|------------------|------------|--------------
avg     | 144364551  |    143501739     | 140396721  |   131983964
diff %  |    100     |      99.4        |    97.2    |      91.4
--------|------------|------------------|------------|--------------
stdev % |    0.4     |      0.9         |    0.1     |      0.4
>
> > Cold cache (fewer tsc ticks per gettime_perf_cold.c iteration is better):
> >          |   before   | CONFIG_TIME_NS=n |    host    | inside timens
> > ---------|------------|------------------|------------|--------------
> > tsc      |    6748    |       6718       |    6862    |     12682
> > diff (%) |    100     |       99.6       |   101.7    |      188
>
> Weird, now CONFIG_TIME_NS=n is better than current mainline and 'host' mode
> drops.
The precision of these numbers is much lower than that of the previous set.
These numbers are for the second version of this series, so I decided to
repeat the measurements for this version. When I ran the test, I found
some degradation compared with v5.0. I bisected it and found that the
problem is in 2b539aefe9e4 ("mm/resource: Let walk_system_ram_range()
search child resources"). At this point, I realized that my test isn't
quite right. On each iteration, the test starts a new process, then does
start=rdtsc(); clock_gettime(); end=rdtsc() and prints (end-start). The
problem here is that when clock_gettime() is called for the first time,
the vdso pages are not yet mapped into the process address space, so the
test actually measures how fast the vdso pages are mapped into the process
address space. I modified the test; it now uses the clflush instruction to
drop CPU caches. Here are the results:
           |  before  | CONFIG_TIME_NS=n |   host   | inside timens
-----------|----------|------------------|----------|--------------
tsc        |   434    |       433        |   437    |      477
stdev(tsc) |    5     |        5         |    5     |       3
diff (%)   |   100    |       99.8       |  100.7   |     109.9
Here is the source code for the modified test:
https://github.com/avagin/linux-task-diag/blob/wip/timens-rfc-v4/tools/testing/selftests/timens/gettime_perf_cold.c
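A minimal sketch of the shape of the modified test (illustrative only; the
flush target and names are simplified, the real code is in the link above):

	#include <stdio.h>
	#include <time.h>
	#include <x86intrin.h>	/* __rdtsc(), _mm_clflush(), _mm_mfence() */

	#define ITERS 10000

	int main(void)
	{
		struct timespec ts;

		for (int i = 0; i < ITERS; i++) {
			/*
			 * Evict a cache line before the call.  The real
			 * test decides what to flush; flushing &ts here is
			 * only a placeholder for the idea.
			 */
			_mm_clflush(&ts);
			_mm_mfence();

			unsigned long long start = __rdtsc();
			clock_gettime(CLOCK_MONOTONIC, &ts);
			unsigned long long end = __rdtsc();

			printf("%llu\n", end - start);
		}
		return 0;
	}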
This test does 10K iterations. At first glance, the numbers look noisy, so
I sorted them and took only the 8K numbers in the middle:
$ ./gettime_perf_cold > raw
$ cat raw | sort -n | tail -n 9000 | head -n 8000 > results
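The mean and stdev(tsc) values in the table above are then computed from the
trimmed "results" file; a throwaway helper along these lines (not part of the
selftest) is enough:

	#include <math.h>
	#include <stdio.h>

	/* Read one number per line from stdin, print mean and stddev. */
	int main(void)
	{
		double x, sum = 0, sumsq = 0;
		long n = 0;

		while (scanf("%lf", &x) == 1) {
			sum += x;
			sumsq += x * x;
			n++;
		}
		if (n == 0)
			return 1;

		double avg = sum / n;
		printf("avg %.1f  stdev %.1f\n", avg, sqrt(sumsq / n - avg * avg));
		return 0;
	}

Compile with "gcc -O2 -o stats stats.c -lm" and run "./stats < results".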
>
> Either I'm misreading the numbers or missing something or I'm just confused
> as usual :)
>
> Thanks,
> 	tglx