[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.21.1906231008170.32342@nanos.tec.linutronix.de>
Date: Sun, 23 Jun 2019 12:18:26 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: "Sverdlin, Alexander (Nokia - DE/Ulm)" <alexander.sverdlin@...ia.com>
cc: "x86@...nel.org" <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
"H. Peter Anvin" <hpa@...or.com>,
Jason Vas Dias <jason.vas.dias@...il.com>,
Andy Lutomirski <luto@...nel.org>
Subject: Re: [PATCH v3 1/2] x86/vdso: Move mult and shift into struct
vgtod_ts
Alexander,
On Wed, 5 Jun 2019, Sverdlin, Alexander (Nokia - DE/Ulm) wrote:
> From: Alexander Sverdlin <alexander.sverdlin@...ia.com>
>
> This is a preparation for CLOCK_MONOTONIC_RAW vDSO implementation.
> Coincidentally it had a slight performance improvement as well:
>
> ---- Test code ----
> #include <errno.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <time.h>
> #include <unistd.h>
>
> #define CLOCK_TYPE CLOCK_MONOTONIC_RAW
> #define DURATION_SEC 10
>
> int main(int argc, char **argv)
> {
> struct timespec t, end;
> unsigned long long cnt = 0;
>
> clock_gettime(CLOCK_TYPE, &end);
> end.tv_sec += DURATION_SEC;
>
> do {
> clock_gettime(CLOCK_TYPE, &t);
> ++cnt;
> } while (t.tv_sec < end.tv_sec || t.tv_nsec < end.tv_nsec);
>
> dprintf(STDOUT_FILENO, "%llu", cnt);
>
> return EXIT_SUCCESS;
> }
> -------------------
>
> The results from the above test program:
>
> Clock Before After Diff
> ----- ------ ----- ----
> CLOCK_MONOTONIC 355.5M 359.6M +1%
> CLOCK_REALTIME 355.5M 359.6M +1%
Been there done that. But the results of the micro benchmark above have to
be taken with a grain of salt.
Let's look at the cache side effects of this change:
Before:
struct vsyscall_gtod_data {
unsigned int seq; /* 0 4 */
int vclock_mode; /* 4 4 */
u64 cycle_last; /* 8 8 */
u64 mask; /* 16 8 */
u32 mult; /* 24 4 */
u32 shift; /* 28 4 */
struct vgtod_ts basetime[12]; /* 32 192 */
/* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
int tz_minuteswest; /* 224 4 */
int tz_dsttime; /* 228 4 */
/* size: 232, cachelines: 4, members: 9 */
/* last cacheline: 40 bytes */
};
After:
struct vsyscall_gtod_data {
unsigned int seq; /* 0 4 */
int vclock_mode; /* 4 4 */
u64 cycle_last; /* 8 8 */
u64 mask; /* 16 8 */
struct vgtod_ts basetime[12]; /* 24 288 */
/* --- cacheline 4 boundary (256 bytes) was 56 bytes ago --- */
int tz_minuteswest; /* 312 4 */
int tz_dsttime; /* 316 4 */
/* size: 320, cachelines: 5, members: 7 */
};
So the interesting change is here:
struct vgtod_ts basetime[12]; /* 32 192 */
vs.
struct vgtod_ts basetime[12]; /* 24 288 */
In the original version struct vgtod_ts occupies 16 bytes in the modified
version it has 24 bytes. As a consequence:
CLOCK Before After
data->basetime[CLOCK_REALTIME] 32 .. 47 24 .. 47
data->basetime[CLOCK_MONOTONIC] 48 .. 63 48 .. 71
That means that the CLOCK_MONOTONIC storage now overlaps into a second
cache line. CLOCK_MONOTONIC is widely used.
Of course on a micro benchmark this does not matter at all, but in real
world use cases which care about cache pressure it does.
Another thing is that just adding some small computation to the benchmark
loop makes the advantage of the modified version go away and it becomes
slightly slower.
On a real world application which is a cache heavy and uses CLOCK MONOTONIC
extensivly I can observe a significant L1 miss increase (%2) with the
modified version.
Now there is something else which is interesting:
On a guest (pvclock) I can observe with your benchmark:
Before:
MONO cnt: 576.7
REAL cnt: 576.6
After:
MONO cnt: 577.1
REAL cnt: 577.0
But on a (different) host:
Before:
MONO cnt: 353.9
REAL cnt: 354.4
After:
MONO cnt: 347.9
REAL cnt: 347.1
Yuck!
So this is really sensitive and I'm inclined not to do that due to the
cache line side effect.
The alternative solution for this is what Vincenzo has in his unified VDSO
patch series:
https://lkml.kernel.org/r/20190621095252.32307-1-vincenzo.frascino@arm.com
It leaves the data struct unmodified and has a separate array for the raw
clock. That does not have the side effects at all.
I'm in the process of merging that series and I actually adapted your
scheme to the new unified infrastructure where it has exactly the same
effects as with your original patches against the x86 version.
Thanks,
tglx
Powered by blists - more mailing lists