Message-ID: <2267fbe6-37e8-7063-d48f-1879f31d3258@kernel.org>
Date: Mon, 23 Jul 2018 19:11:48 -0700
From: Andy Lutomirski <luto@...nel.org>
To: Fenghua Yu <fenghua.yu@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, H Peter Anvin <hpa@...or.com>
Cc: Ashok Raj <ashok.raj@...el.com>, Alan Cox <alan@...ux.intel.com>,
Ravi V Shankar <ravi.v.shankar@...el.com>,
linux-kernel <linux-kernel@...r.kernel.org>, x86 <x86@...nel.org>
Subject: Re: [PATCH 6/7] x86/vdso: Add vDSO functions for user wait
instructions
On 07/23/2018 05:55 AM, Fenghua Yu wrote:
> Users want to query whether the user wait instructions (umonitor,
> umwait, and tpause) are supported, and to use them. The vDSO
> functions provide a fast interface for user space to check for and
> use the instructions.
>
> waitpkg_supported and its alias __vdso_waitpkg_supported check
> whether the user wait instructions (a.k.a. the wait package feature)
> are supported.
>
> umonitor and its alias __vdso_umonitor provide a user API for the
> umonitor instruction.
>
> umwait and its alias __vdso_umwait provide a user API for the
> umwait instruction.
>
> tpause and its alias __vdso_tpause provide a user API for the
> tpause instruction.
>
> nsec_to_tsc and its alias __vdso_nsec_to_tsc convert nanoseconds to
> TSC counts if the TSC frequency is known. They fail if the TSC
> frequency is unknown.
>
> The instructions may be provided as intrinsic functions in a future
> GCC. But the vDSO interfaces are available to user space without
> intrinsic support in GCC, and the APIs waitpkg_supported and
> nsec_to_tsc cannot be implemented as GCC intrinsics anyway.
>
> Signed-off-by: Fenghua Yu <fenghua.yu@...el.com>
> ---
> arch/x86/entry/vdso/Makefile | 2 +-
> arch/x86/entry/vdso/vdso.lds.S | 10 ++
> arch/x86/entry/vdso/vma.c | 9 ++
> arch/x86/entry/vdso/vuserwait.c | 233 +++++++++++++++++++++++++++++++++
> arch/x86/include/asm/vdso_funcs_data.h | 3 +
> 5 files changed, 256 insertions(+), 1 deletion(-)
> create mode 100644 arch/x86/entry/vdso/vuserwait.c
>
> diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
> index af4fcae5de83..fb0062b09b3c 100644
> --- a/arch/x86/entry/vdso/Makefile
> +++ b/arch/x86/entry/vdso/Makefile
> @@ -17,7 +17,7 @@ VDSO32-$(CONFIG_X86_32) := y
> VDSO32-$(CONFIG_IA32_EMULATION) := y
>
> # files to link into the vdso
> -vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o
> +vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o vdirectstore.o vuserwait.o
>
> # files to link into kernel
> obj-y += vma.o
> diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S
> index 097cdcda43a5..0942710608bf 100644
> --- a/arch/x86/entry/vdso/vdso.lds.S
> +++ b/arch/x86/entry/vdso/vdso.lds.S
> @@ -35,6 +35,16 @@ VERSION {
> __vdso_movdir64b_supported;
> movdir64b;
> __vdso_movdir64b;
> + waitpkg_supported;
> + __vdso_waitpkg_supported;
> + umonitor;
> + __vdso_umonitor;
> + umwait;
> + __vdso_umwait;
> + tpause;
> + __vdso_tpause;
> + nsec_to_tsc;
> + __vdso_nsec_to_tsc;
> local: *;
> };
> }
> diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
> index edbe5e63e5c2..006dfb5e5003 100644
> --- a/arch/x86/entry/vdso/vma.c
> +++ b/arch/x86/entry/vdso/vma.c
> @@ -372,10 +372,19 @@ static int vgetcpu_online(unsigned int cpu)
>
> static void __init init_vdso_funcs_data(void)
> {
> + struct system_counterval_t sys_counterval;
> +
> if (static_cpu_has(X86_FEATURE_MOVDIRI))
> vdso_funcs_data.movdiri_supported = true;
> if (static_cpu_has(X86_FEATURE_MOVDIR64B))
> vdso_funcs_data.movdir64b_supported = true;
> + if (static_cpu_has(X86_FEATURE_WAITPKG))
> + vdso_funcs_data.waitpkg_supported = true;
> + if (static_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
> + vdso_funcs_data.tsc_known_freq = true;
> + sys_counterval = convert_art_ns_to_tsc(1);
> + vdso_funcs_data.tsc_per_nsec = sys_counterval.cycles;
> + }
You're losing a ton of precision here. You might even be losing *all*
of the precision and malfunctioning rather badly.
The correct way to do this is:
	tsc_counts = ns * mul >> shift;
and the vclock code illustrates it. convert_art_ns_to_tsc() is a bad
example because it uses an expensive division operation for no good
reason except that no one bothered to optimize it.
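Something like this (a rough sketch only; 'mult' and 'shift' are
placeholder names for values the kernel would precompute from the TSC
frequency at boot, in the style of the vclock code):

	#include <stdint.h>

	/*
	 * ns -> TSC conversion without a division, vclock-style.
	 * 'mult' and 'shift' are assumed precomputed such that
	 * tsc = (ns * mult) >> shift; the 128-bit intermediate keeps
	 * the multiply from overflowing for large timeouts.
	 */
	static inline uint64_t ns_to_tsc(uint64_t ns, uint32_t mult,
					 uint32_t shift)
	{
		return (uint64_t)(((unsigned __int128)ns * mult) >> shift);
	}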
> +notrace int __vdso_nsec_to_tsc(unsigned long nsec, unsigned long *tsc)
> +{
> + if (!_vdso_funcs_data->tsc_known_freq)
> + return -ENODEV;
> +
> + *tsc = _vdso_funcs_data->tsc_per_nsec * nsec;
> +
> + return 0;
> +}
Please don't expose this one at all. It would be nice for programs that
use waitpkg to be migratable using CRIU-like tools, and this export
actively harms any such effort. If you omit this function, then the
kernel could learn to abort an in-progress __vdso_umwait if preempted
(rseq-style) and CRIU would just work. It would be a bit of a hack, but
it solves a real problem.
> +notrace int __vdso_umwait(int state, unsigned long nsec)
__vdso_umwait_relative(), please. Because some day (possibly soon)
someone will want __vdso_umwait_absolute() and its friend
__vdso_read_art_ns() so they can do:
	u64 start = __vdso_read_art_ns();
	__vdso_umonitor(...);
	... do something potentially slow or that might fault ...
	__vdso_umwait_absolute(start + timeout);
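Spelled out a bit more (every prototype here is hypothetical; none of
these entry points exist in this patch), the pattern would be:

	/* Hypothetical prototypes -- nothing below exists yet. */
	#include <stdint.h>

	extern uint64_t __vdso_read_art_ns(void);
	extern int __vdso_umonitor(void *addr);
	extern int __vdso_umwait_absolute(int state, uint64_t deadline_ns);

	int monitored_wait(void *addr, uint64_t timeout_ns)
	{
		uint64_t start = __vdso_read_art_ns();

		__vdso_umonitor(addr);
		/*
		 * ... do something potentially slow or that might fault;
		 * with an absolute deadline, time burned here shortens
		 * the wait instead of extending it ...
		 */
		return __vdso_umwait_absolute(0, start + timeout_ns);
	}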
Also, this patch appears to have a subtle but show-stopping race. Consider:
1. Task A does UMONITOR on CPU 1
2. Task A is preempted.
3. Task B does UMONITOR on CPU 1 at a different address
4. Task A resumes
5. Task A does UMWAIT
Now task A hangs, at least until the next external wakeup happens.
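To make the failure mode concrete, here is a sketch using GCC's
waitpkg intrinsics (which postdate this patch; the vDSO wrappers
behave the same way):

	#include <stdint.h>
	#include <immintrin.h>	/* _umonitor()/_umwait(), -mwaitpkg */

	/* Task A, running on CPU 1: */
	static int wait_for_flag(volatile uint32_t *flag,
				 uint64_t tsc_deadline)
	{
		_umonitor((void *)flag);	/* arm monitor for *flag */

		/*
		 * If task A is preempted here and task B runs
		 * _umonitor() on CPU 1 with a different address, the
		 * monitor is silently re-armed for B's address.
		 */

		if (*flag)			/* recheck before waiting */
			return 0;

		/*
		 * UMWAIT now waits on B's address: a store to *flag no
		 * longer triggers a wakeup, so task A sleeps until the
		 * TSC deadline or the next interrupt.
		 */
		return _umwait(0, tsc_deadline);  /* returns UMWAIT's CF */
	}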
It's not entirely clear to me how you're supposed to fix this without
some abomination that's so bad that it torpedoes the entire feature.
And there is no chicken bit to turn this thing off, either. Sigh.