Message-ID: <84406285-6A4C-493B-B010-1FAD512EFAD8@oracle.com>
Date: Mon, 16 Dec 2024 18:59:22 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
CC: Peter Zijlstra <peterz@...radead.org>,
"linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>,
"rostedt@...dmis.org" <rostedt@...dmis.org>,
"tglx@...utronix.de" <tglx@...utronix.de>,
Daniel Jordan
<daniel.m.jordan@...cle.com>
Subject: Re: [RFC PATCH 0/4] Scheduler time slice extension
> On Dec 9, 2024, at 1:17 PM, Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>
> On 2024-12-09 15:36, Prakash Sangappa wrote:
>>> On Nov 14, 2024, at 11:41 AM, Prakash Sangappa <prakash.sangappa@...cle.com> wrote:
>>>
>>>
>>>
>>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@...radead.org> wrote:
>>>>
>>>> On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>>>>>
>>>>>
>>>>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>>>>>>
>>>>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>>>>> thread shared structure implementation described below. This shared structure
>>>>>>>> is accessible in both user space and the kernel. The user thread will set the
>>>>>>>> flag in this shared structure to request execution time extension.
>>>>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>>>>> another thing?
>>>>>>
>>>>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>>>>> explanation that justifies adding yet another per-thread memory area
>>>>>> shared between kernel and userspace when we have extensible rseq
>>>>>> already.
>>>>>
>>>>> It mainly provides pinned memory, which can be useful for future use cases
>>>>> where updating user memory in kernel context can be fast or needs to
>>>>> avoid pagefaults.
>>>>
>>>> 'might be useful' is not a good enough justification. Also, I don't
>>>> think you actually need this.
>>>
>>> Will get back with database benchmark results using rseq API for scheduler time extension.
>> Sorry about the delay in response.
>> Here are the database swingbench numbers, including results with use of the rseq API.
>> Test results:
>> =========
>> Test system: 2-socket AMD Genoa
>> Swingbench - standard database benchmark
>> Cached (database files on tmpfs) run, with 1000 clients.
>> Baseline (without sched time extension): 99K SQL exec/sec
>> With Sched time extension:
>> Shared structure API use: 153K SQL exec/sec (Previously reported)
>> 55% improvement in throughput.
>> Restartable sequences API use: 147K SQL exec/sec
>> 48% improvement in throughput
>> While both show a good performance benefit with the scheduler time extension,
>> there is a 7-percentage-point gap in improvement between the Shared structure
>> and Restartable sequences APIs; use of the shared structure is faster.
>
> Can you share the code for both test cases ? And do you have relevant
> perf profile showing where time is spent ?
>
> Thanks,
>
> Mathieu
>
The changes are in the database (Oracle DB).
The test is swingbench: https://www.dominicgiles.com/downloads/
Our database team is running the benchmark. I have requested them to repeat the test and
capture a perf profile.
With the restartable sequences API, once a thread registers its 'struct rseq',
the kernel needs a 'copy_from_user' in exit_to_user_mode_loop() to check whether
the thread is requesting an extension, which potentially adds overhead. Whereas
with the shared structure, it is just a direct memory access.
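Roughly, the check in the exit path would differ along these lines (an
illustrative sketch only, not the actual patches; grant_extension() and the
t->task_sharedinfo field are hypothetical stand-ins for the grant logic):

	/* Shared structure API: the page is pinned, a plain load suffices */
	if (t->task_sharedinfo->sched_delay == 1)
		grant_extension(t);

	/* rseq API: the rseq area is ordinary user memory and must be
	 * fetched with copy_from_user(), which can fault:
	 */
	u32 flags;
	if (!copy_from_user(&flags, &t->rseq->flags, sizeof(flags)) &&
	    (flags & RSEQ_SCHED_DELAY))
		grant_extension(t);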
I was trying to reproduce the performance difference using a microbenchmark.
I used a modified version of the test (extend-sched.c) that Steven Rostedt posted here:
https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
The test was modified to use either the Shared structure API or the Restartable
sequences API to request the scheduler time extension, to increase the number of
data objects that the threads contend on in grab_lock(), and to add some delay
inside the critical section to increase lock hold time. The test runs for 180 secs.
(Modified test included below.)
The test spawns as many threads as there are CPUs; they update shared data objects.
A CPU-hog program with the same number of threads runs simultaneously.
The test takes an argument:
0 - No scheduler time extension
1 - Use the Shared structure API to request the extension
2 - Use the Restartable sequences API to request the extension
Options 1 & 2 require a kernel with the corresponding support.
Each thread increments a count in its shared data object indicating the number of
times it completed the critical section; higher is better.
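For reference, I build and run the test along these lines (the compiler flags
are just what I happened to use):

	$ gcc -O2 -pthread -o extend-sched2 extend-sched2.c
	$ ./extend-sched2 1 | egrep -e "^Ran|^Total|^avg"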
Running this test on a 50-core VM shows a lot of variance in the results. Generally we see
a performance improvement of 4 to 6% with use of either API in this test.
Unfortunately, it does not show a consistent difference between the two APIs.
Approximate numbers:
No extension:
#./extend-sched2 0 |egrep -e "^Ran|^Total|^avg"
Ran for 402660 times <---
Total wait time: 16132.358985
avg # grants 0 avg reqs 4026
avg # blocks 14440
With extension: using shared structure API
#./extend-sched2 1 |egrep -e "^Ran|^Total|^avg"
Ran for 425210 times <---
Total wait time: 16043.160130
avg # grants 2582 avg reqs 4252
avg # blocks 14920
About 5.6% improvement.
With extension: using restartable sequences API
#./extend-sched2 2 |egrep -e "^Ran|^Total|^avg"
Ran for 423765 times <---
Total wait time: 16045.406387
avg # grants 2580 avg reqs 4237
avg # blocks 14851
About 5.3 % improvement.
The perf profiles of the test are similar between the two APIs. With restartable
sequences, we see '_copy_from_user' appear.
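For reference, a profile like the one below can be captured along these lines
(the exact perf options may have differed):

	# perf record -e cycles:P -- ./extend-sched2 2
	# perf report --stdio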
# Total Lost Samples: 0
#
# Samples: 890K of event 'cycles:P'
# Event count (approx.): 32581540690314
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ............ .................. .............................................
#
98.67% 877109 lckshrsq2scl lckshrsq2scl [.] grab_lock
0.84% 7496 lckshrsq2scl [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt
0.03% 285 lckshrsq2scl [kernel.kallsyms] [k] native_write_msr
0.02% 200 lckshrsq2scl [kernel.kallsyms] [k] native_read_msr
0.02% 175 lckshrsq2scl [kernel.kallsyms] [k] pvclock_clocksource_read_nowd
0.01% 128 lckshrsq2scl [kernel.kallsyms] [k] sync_regs
0.01% 120 lckshrsq2scl [kernel.kallsyms] [k] get_jiffies_update
0.01% 116 lckshrsq2scl [kernel.kallsyms] [k] __update_load_avg_cfs_rq
0.01% 113 lckshrsq2scl [kernel.kallsyms] [k] update_curr
0.01% 137 lckshrsq2scl [kernel.kallsyms] [k] srso_alias_safe_ret
0.01% 96 lckshrsq2scl [kernel.kallsyms] [k] __update_load_avg_se
0.01% 85 lckshrsq2scl [kernel.kallsyms] [k] update_process_times
0.01% 76 lckshrsq2scl [kernel.kallsyms] [k] update_rq_clock_task
0.01% 101 lckshrsq2scl [kernel.kallsyms] [k] __raw_callee_save___pv_queued_spin_unlock
0.01% 73 lckshrsq2scl [kernel.kallsyms] [k] update_irq_load_avg
0.01% 72 lckshrsq2scl lckshrsq2scl [.] run_thread
0.01% 66 lckshrsq2scl [kernel.kallsyms] [k] rcu_pending
0.01% 65 lckshrsq2scl [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
0.01% 62 lckshrsq2scl [kernel.kallsyms] [k] sched_tick
0.01% 62 lckshrsq2scl [kernel.kallsyms] [k] update_load_avg
0.01% 62 lckshrsq2scl [kernel.kallsyms] [k] hrtimer_interrupt
..
0.00% 14 lckshrsq2scl [kernel.kallsyms] [k] _copy_from_user
..
0.00% 1 lckshrsq2scl [kernel.kallsyms] [k] _copy_to_user
Test program:
/*--- extend-sched2.c ---*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>
#include <sched.h>	/* sched_yield() */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/time.h>
#include <linux/rseq.h>
#include <sys/syscall.h>
#define barrier() asm volatile ("" ::: "memory")
#define rmb() asm volatile ("lfence" ::: "memory")
#define wmb() asm volatile ("sfence" ::: "memory")
/*----- shared struct ------ */
/*
 * Shared page - shared structure API/ABI
 */
#define __NR_task_getshared 463
#define TASK_SHAREDINFO 1

/* a per thread shared structure */
struct task_sharedinfo {
	volatile unsigned short sched_delay;
};
/*
 * Returns the address of the task_sharedinfo structure in the shared page
 * mapped between userspace and the kernel, for the calling thread.
 */
struct task_sharedinfo *task_getshared(void)
{
	struct task_sharedinfo *ts;

	if (!syscall(__NR_task_getshared, TASK_SHAREDINFO, 0, &ts))
		return ts;
	return NULL;
}
/*------- restartable seq ------- */

/* Proposed flag bit in rseq->flags requesting a scheduler time extension */
#define RSEQ_CS_SCHED_DELAY 3
#define RSEQ_SCHED_DELAY (1U << RSEQ_CS_SCHED_DELAY)

static __thread struct rseq myrseq;
static __thread struct rseq *extend_map_rseq;

static int init_rseql(void)
{
	/* Register the rseq area: size 32 (original ABI), flags 0, signature */
	if (syscall(__NR_rseq, &myrseq, 32, 0, 0x31456))
		return -1;
	extend_map_rseq = &myrseq;
	return 0;
}
/*--------------------------------*/
static pthread_barrier_t pbarrier;
static __thread struct task_sharedinfo *extend_map;
int enable_extend;

static void init_extend_map(void)
{
	if (!enable_extend)
		return;
	if (enable_extend == 1) {
		extend_map = task_getshared();
		if (extend_map == NULL)
			printf("Failed to allocate shared struct\n");
	}
	if (enable_extend == 2) {
		init_rseql();
		if (extend_map_rseq == NULL)
			printf("Failed to init restartable seq\n");
	}
}
/* called from main only */
static void test_extend(void)
{
	int x = 0;

	init_extend_map();
	if (extend_map == NULL && extend_map_rseq == NULL)
		return;
	printf("Spinning...");
	if (enable_extend == 1) {
		/* Request an extension, spin until the kernel grants it (sets 2) */
		extend_map->sched_delay = 1;
		while (extend_map->sched_delay != 2) {
			x++;
			barrier();
		}
	}
	if (enable_extend == 2) {
		/* Request an extension, spin until the kernel clears the flag */
		extend_map_rseq->flags = RSEQ_SCHED_DELAY;
		while (extend_map_rseq->flags & RSEQ_SCHED_DELAY) {
			x++;
			barrier();
		}
	}
	printf("Done\n");
}
struct thread_data {
	unsigned long long start_wait;
	unsigned long long x_count;
	unsigned long long total;
	unsigned long long max;
	unsigned long long min;
	unsigned long long total_wait;
	unsigned long long max_wait;
	unsigned long long min_wait;
	unsigned long long blocks;
	unsigned long long dlygrant;
	struct data *data;
};

struct data {
	unsigned long long x;
	unsigned long lock;
	bool done;
};

int nobj;
struct data *data;
struct thread_data *thrdata;
/* Byte-wide compare-and-swap; the lock values used here fit in the low byte */
static inline unsigned long
cmpxchg(volatile unsigned long *ptr, unsigned long old, unsigned long new)
{
	unsigned long prev;

	asm volatile("lock; cmpxchg %b1,%2"
		     : "=a"(prev)
		     : "q"(new), "m"(*(ptr)), "0"(old)
		     : "memory");
	return prev;
}

static inline unsigned long
xchg(volatile unsigned long *ptr, unsigned long new)
{
	unsigned long ret = new;

	asm volatile("xchg %b0,%1"
		     : "+r"(ret), "+m"(*(ptr))
		     : : "memory");
	return ret;
}
static void extend(struct thread_data *tdata)
{
	if (extend_map == NULL && extend_map_rseq == NULL)
		return;
	if (enable_extend == 1)
		extend_map->sched_delay = 1;
	if (enable_extend == 2)
		extend_map_rseq->flags = RSEQ_SCHED_DELAY;
}

static void unextend(struct thread_data *tdata)
{
	unsigned long prev;

	if (extend_map == NULL && extend_map_rseq == NULL)
		return;
	if (enable_extend == 1) {
		prev = extend_map->sched_delay;
		extend_map->sched_delay = 0;
		// prev = xchg(&extend_map->sched_delay, 0);
		if (prev == 2) {
			/* The kernel granted an extension: yield it back */
			tdata->dlygrant++;
			//printf("Yield!\n");
			sched_yield();
		}
	}
	if (enable_extend == 2) {
		prev = extend_map_rseq->flags;
		extend_map_rseq->flags = 0;
		/* The kernel clears the flag when it grants an extension */
		if (!(prev & RSEQ_SCHED_DELAY)) {
			tdata->dlygrant++;
			//printf("Yield!\n");
			sched_yield();
		}
	}
}
#define sec2usec(sec)	((sec) * 1000000ULL)
#define usec2sec(usec)	((usec) / 1000000ULL)

static unsigned long long get_time(void)
{
	struct timeval tv;
	unsigned long long time;

	gettimeofday(&tv, NULL);
	time = sec2usec(tv.tv_sec);
	time += tv.tv_usec;
	return time;
}
static void grab_lock(struct thread_data *tdata, struct data *data)
{
	unsigned long long start, end, delta;
	unsigned long long end_wait;
	unsigned long long last;
	unsigned long prev;

	if (!tdata->start_wait)
		tdata->start_wait = get_time();
	/* Wait until the lock looks free */
	while (data->lock && !data->done)
		rmb();
	/* Request the time slice extension before taking the lock */
	extend(tdata);
	start = get_time();
	prev = cmpxchg(&data->lock, 0, 1);
	if (prev) {
		/* Lost the race: drop the extension request and retry */
		unextend(tdata);
		tdata->blocks++;
		return;
	}
	end_wait = get_time();
	//printf("Have lock!\n");
	delta = end_wait - tdata->start_wait;
	tdata->start_wait = 0;
	if (!tdata->total_wait || tdata->max_wait < delta)
		tdata->max_wait = delta;
	if (!tdata->total_wait || tdata->min_wait > delta)
		tdata->min_wait = delta;
	tdata->total_wait += delta;
	data->x++;
	last = data->x;
	if (data->lock != 1) {
		printf("Failed locking\n");
		exit(-1);
	}
	/* Extend hold time (barrier() keeps the loop from being optimized out) */
	for (int i = 0; i < 3000000; i++)
		barrier();
	prev = cmpxchg(&data->lock, 1, 0);
	end = get_time();
	if (prev != 1) {
		printf("Failed unlocking\n");
		exit(-1);
	}
	//printf("released lock!\n");
	unextend(tdata);
	delta = end - start;
	if (!tdata->total || tdata->max < delta)
		tdata->max = delta;
	if (!tdata->total || tdata->min > delta)
		tdata->min = delta;
	tdata->total += delta;
	tdata->x_count++;
	/* Let someone else have a turn */
	while (data->x == last && !data->done)
		rmb();
}
static void *run_thread(void *d)
{
	struct thread_data *tdata = d;
	struct data *data = tdata->data;

	init_extend_map();
	pthread_barrier_wait(&pbarrier);
	while (!data->done)
		grab_lock(tdata, data);
	return NULL;
}
/* arg1 == 1 use shared struct, arg1 == 2 use rseq */
int main(int argc, char **argv)
{
	unsigned long long total_wait = 0;
	unsigned long long total_blocks = 0;
	unsigned long long total_grants = 0;
	unsigned long long total_runs = 0;
	unsigned long long secs;
	pthread_t *threads;
	int cpus;

	enable_extend = 0;
	if (argc < 2) {
		printf("Usage:%s <0 - no extend, 1 = shared page, 2 = rseq>\n",
		       argv[0]);
		exit(1);
	}
	enable_extend = atoi(argv[1]);
	printf("enable extend %d\n", enable_extend);
	test_extend();

	cpus = sysconf(_SC_NPROCESSORS_CONF);
	nobj = cpus / 10;
	if (!nobj)	/* avoid modulo by zero on systems with < 10 CPUs */
		nobj = 1;
	threads = calloc(cpus + 1, sizeof(*threads));
	if (!threads) {
		perror("threads");
		exit(-1);
	}
	thrdata = calloc(cpus + 1, sizeof(struct thread_data));
	if (!thrdata) {
		perror("Allocating thrdata");
		exit(-1);
	}
	data = calloc(nobj + 1, sizeof(struct data));
	if (!data) {
		perror("Allocating data");
		exit(-1);
	}

	pthread_barrier_init(&pbarrier, NULL, cpus + 1);
	for (int i = 0; i < cpus; i++) {
		int ret;

		thrdata[i].data = &data[i % nobj];
		ret = pthread_create(&threads[i], NULL, run_thread, &thrdata[i]);
		if (ret) {
			perror("creating threads");
			exit(-1);
		}
	}
	pthread_barrier_wait(&pbarrier);

	sleep(180);
	printf("Finish up\n");
	for (int i = 0; i < nobj; i++)
		data[i].done = true;
	wmb();

	for (int i = 0; i < cpus; i++) {
		pthread_join(threads[i], NULL);
		printf("thread %i:\n", i);
		printf(" count:\t%lld\n", thrdata[i].x_count);
		printf(" total:\t%lld\n", thrdata[i].total);
		printf(" max:\t%lld\n", thrdata[i].max);
		printf(" min:\t%lld\n", thrdata[i].min);
		printf(" total wait:\t%lld\n", thrdata[i].total_wait);
		printf(" max wait:\t%lld\n", thrdata[i].max_wait);
		printf(" min wait:\t%lld\n", thrdata[i].min_wait);
		printf(" #blocks :\t%lld\n", thrdata[i].blocks);
		printf(" #dlygrnt:\t%lld\n", thrdata[i].dlygrant);
		total_wait += thrdata[i].total_wait;
		total_blocks += thrdata[i].blocks;
		total_grants += thrdata[i].dlygrant;
	}
	for (int i = 0; i < nobj; i++)
		total_runs += data[i].x;

	secs = usec2sec(total_wait);
	printf("Ran for %lld times\n", total_runs);
	printf("Total wait time: %lld.%06lld\n", secs, total_wait - sec2usec(secs));
	printf("avg # grants %lld avg reqs %lld \n", total_grants / cpus, total_runs / cpus);
	printf("avg # blocks %lld\n", total_blocks / cpus);
	return 0;
}
/*--------------------------------*/