Message-ID: <20241113000126.967713-1-prakash.sangappa@oracle.com>
Date: Wed, 13 Nov 2024 00:01:22 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: linux-kernel@...r.kernel.org
Cc: rostedt@...dmis.org, peterz@...radead.org, tglx@...utronix.de,
        daniel.m.jordan@...cle.com, prakash.sangappa@...cle.com
Subject: [RFC PATCH 0/4] Scheduler time slice extension

A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, which can have an undesirable
effect on performance. Having a way for the thread to request additional
execution time on the cpu, so that it can complete the critical section,
would be useful in such a scenario. The request can be made by setting a bit
in memory that is mapped into both user space and the kernel, so that the
kernel can check it and grant extra execution time on the cpu.

There have been a couple of proposals[1][2] for such a feature, which attempt
to address the above scenario by granting one extra tick of execution time.
The patch thread [1], posted by Steven Rostedt, has ample discussion about
the need for this feature.

However, the concern has been that this can lead to abuse. One extra tick can
be a long time (about a millisecond or more). In response, Peter Zijlstra
posted a prototype solution[3], which grants an execution time extension of
only 50us. This is achieved with the help of a timer started on that cpu at
the time the extra execution time is granted. When the timer fires, the
thread is preempted if it is still running.

This patch set implements the above-mentioned 50us extension time as posted
by Peter. But instead of using restartable sequences as the API to set the
flag that requests the extension, this patch set proposes a new API based on
a per-thread shared structure, described below. This shared structure is
accessible in both user space and the kernel. The user thread sets a flag in
this shared structure to request an execution time extension.

We tried this change with a database workload. Here are the results, comparing
runs with and without use of the 50us scheduler time slice extension. They
show a clear benefit.

Test results:
=============
	Test system 2 socket AMD Genoa

	Lock table test: a simple database test that grabs a table lock (spin lock).
	   Simulates SQL query executions.
	   300 clients + 400 cpu hog tasks to generate load.

		Without extension : 182K SQL exec/sec
		With extension    : 262K SQL exec/sec
	   44% improvement.

	Swingbench - standard database benchmark 
	   Cached(database files on tmpfs) run, with 1000 clients.

		Without extension : 99K SQL exec/sec
		With extension    : 153K SQL exec/sec
	   55% improvement in throughput.


Shared structure mechanism:
==========================

A per-thread structure is allocated from a page that is mapped shared between
user space and the kernel. This is useful for sharing thread-specific
information between user space and the kernel without the need for system
calls in latency-sensitive code paths.

Implementation:

A new system call is added to request use of a shared structure by a user
thread. The kernel will allocate page(s), mapped shared with user space, in
which per-thread shared structures will be allocated. These structures
are padded to 128 bytes. Multiple such shared structures will be allocated
from that page (up to 32 per 4k page) to accommodate requests from multiple
threads in a process, thus avoiding the need to allocate one page per thread.
Additional pages are allocated as needed to accommodate more threads
requesting the shared structure. The number of pages required will depend
on the number of threads of the process requesting/using the shared structure.
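
The slot arithmetic above can be checked with a short sketch. The macro and
function names are illustrative only and are not taken from the patches; the
128-byte slot size and the 4k page size are the values quoted above.

```c
#include <stddef.h>

/* Values quoted in the cover letter; the names are illustrative only. */
#define SHARED_SLOT_SIZE  128u   /* each per-thread structure is padded to 128 bytes */
#define SHARED_PAGE_SIZE  4096u  /* assumes 4k pages */
#define SLOTS_PER_PAGE    (SHARED_PAGE_SIZE / SHARED_SLOT_SIZE)

/* Number of shared pages needed if nthreads threads each request a slot. */
static size_t shared_pages_needed(size_t nthreads)
{
	return (nthreads + SLOTS_PER_PAGE - 1) / SLOTS_PER_PAGE;
}

/* Byte offset of the idx-th slot within the page that holds it. */
static size_t shared_slot_offset(size_t idx)
{
	return (idx % SLOTS_PER_PAGE) * SHARED_SLOT_SIZE;
}
```

Under these assumptions, a process in which 300 threads request the shared
structure would need 10 pages rather than 300.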

These pages are pinned, so the kernel can access/update the shared
structure through the kernel address, without the need for
copy_from_user()/copy_to_user() calls.

The system call will return a pointer (user address) to the per-thread shared
structure. Application threads can save this per-thread pointer in a TLS
variable and reference it.

The request for a scheduler time extension described above will be one use
case of this shared structure API. The user thread will request an execution
time extension by setting a flag in the shared structure.

Additional members can be appended to the shared structure to implement
new features that address other use cases. One example is sharing a thread's
time spent off cpu and on the run queue (described in [4]). This would help
the user thread measure cpu time consumption across some operation
without having to call the getrusage() system call or read
/proc/pid/schedstat frequently.

Another use case is to share the user thread's 'on' or 'off' cpu state through
the shared structure, which can be useful in implementing adaptive waits in
user space (as discussed in [1]). The waiter thread checks whether the owner
of the resource (lock) is 'on' cpu to decide whether to continue spinning.
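
A rough sketch of such an adaptive wait follows. It is not part of this
patch set: the 'on_cpu' member, struct owner_state and both function names are
hypothetical, and sched_yield() only stands in for whatever blocking
primitive (e.g. a futex wait) a real lock would use.

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical: not a member of the posted struct task_sharedinfo. */
struct owner_state {
	volatile unsigned char on_cpu;	/* 1 while the lock owner is running */
};

/* Spin while the owner is on cpu; otherwise give up the cpu between tries. */
static void adaptive_lock(atomic_flag *lock, const struct owner_state *owner)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
		if (!owner->on_cpu)
			sched_yield();	/* owner is off cpu: spinning cannot help */
	}
}

static void adaptive_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}
```

The point of the design is that the waiter reads the owner's state from
shared memory, so the spin-or-wait decision costs no system call.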

API:
===
	The system call:
	int task_getshared(int option, int flags, void __user *uaddr)

	Only the TASK_SHAREDINFO option is supported for now; 'flags' is not used.

	/* option */
	#define TASK_SHAREDINFO 1

	struct task_sharedinfo {
		volatile unsigned short sched_delay;    
	};

	#define TASK_PREEMPT_DELAY_REQ     1
	#define TASK_PREEMPT_DELAY_GRANTED 2
	#define TASK_PREEMPT_DELAY_DENIED  3


	Example call, storing the returned user address in a TLS variable:
	__thread struct task_sharedinfo *ts;
	task_getshared(TASK_SHAREDINFO, 0, &ts);

	The user task sets the 'sched_delay' member to TASK_PREEMPT_DELAY_REQ to
	request a scheduler time extension. The kernel sets 'sched_delay' to
	TASK_PREEMPT_DELAY_GRANTED if the request for an execution time
	extension was granted, or to TASK_PREEMPT_DELAY_DENIED if it was denied.
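
	A minimal user-space sketch of this request/grant handshake is below. The
	struct and the three constants are the ones listed above; the helper names
	are illustrative. Two things are assumptions not stated in this cover
	letter: that writing 0 clears a request, and that a thread whose request
	was granted should yield once it leaves the critical section. The plain
	read-then-clear is also not atomic with respect to the kernel, which a real
	implementation would have to handle.

```c
#include <sched.h>

#define TASK_PREEMPT_DELAY_REQ     1
#define TASK_PREEMPT_DELAY_GRANTED 2
#define TASK_PREEMPT_DELAY_DENIED  3

struct task_sharedinfo {
	volatile unsigned short sched_delay;
};

/* Call just before entering the critical section. */
static void extension_request(struct task_sharedinfo *ts)
{
	ts->sched_delay = TASK_PREEMPT_DELAY_REQ;
}

/*
 * Call just after leaving the critical section.
 * Returns 1 if the kernel reported that an extension was granted.
 */
static int extension_done(struct task_sharedinfo *ts)
{
	unsigned short state = ts->sched_delay;

	ts->sched_delay = 0;	/* assumption: 0 means no request pending */
	if (state == TASK_PREEMPT_DELAY_GRANTED) {
		sched_yield();	/* assumption: hand the cpu back promptly */
		return 1;
	}
	return 0;
}
```

	In an application, 'ts' would be the pointer filled in by task_getshared();
	the helpers only touch memory, so the fast path makes no system call unless
	an extension was actually granted.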

[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[4] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/

Prakash Sangappa (4):
  Introduce per thread user-kernel shared structure
  Scheduler time extention
  Indicate if schedular preemption delay request is granted
  Add scheduler preemption delay granted stats

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/entry-common.h           |  10 +-
 include/linux/mm_types.h               |   4 +
 include/linux/sched.h                  |  30 ++
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  63 +++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/task_shared.h       |  29 ++
 init/Kconfig                           |  10 +
 kernel/entry/common.c                  |  15 +-
 kernel/fork.c                          |  12 +
 kernel/sched/core.c                    |  28 ++
 kernel/sched/debug.c                   |   4 +
 kernel/sched/syscalls.c                |   7 +
 kernel/sys_ni.c                        |   2 +
 mm/Makefile                            |   1 +
 mm/mmap.c                              |  13 +
 mm/task_shared.c                       | 366 +++++++++++++++++++++++++
 19 files changed, 593 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 include/uapi/linux/task_shared.h
 create mode 100644 mm/task_shared.c

-- 
2.43.5