Message-ID: <84406285-6A4C-493B-B010-1FAD512EFAD8@oracle.com>
Date: Mon, 16 Dec 2024 18:59:22 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
CC: Peter Zijlstra <peterz@...radead.org>,
"linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>,
"rostedt@...dmis.org" <rostedt@...dmis.org>,
"tglx@...utronix.de" <tglx@...utronix.de>,
Daniel Jordan
<daniel.m.jordan@...cle.com>
Subject: Re: [RFC PATCH 0/4] Scheduler time slice extension
> On Dec 9, 2024, at 1:17 PM, Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>
> On 2024-12-09 15:36, Prakash Sangappa wrote:
>>> On Nov 14, 2024, at 11:41 AM, Prakash Sangappa <prakash.sangappa@...cle.com> wrote:
>>>
>>>
>>>
>>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@...radead.org> wrote:
>>>>
>>>> On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>>>>>
>>>>>
>>>>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>>>>>>
>>>>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>>>>> thread shared structure implementation described below. This shared structure
>>>>>>>> is accessible in both user space and the kernel. The user thread will set the
>>>>>>>> flag in this shared structure to request execution time extension.
>>>>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>>>>> another thing?
>>>>>>
>>>>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>>>>> explanation that justifies adding yet another per-thread memory area
>>>>>> shared between kernel and userspace when we have extensible rseq
>>>>>> already.
>>>>>
>>>>> It mainly provides pinned memory, which can be useful for future use cases
>>>>> where updating user memory in kernel context can be fast or needs to
>>>>> avoid pagefaults.
>>>>
>>>> 'might be useful' is not a good enough justification. Also, I don't
>>>> think you actually need this.
>>>
>>> Will get back with database benchmark results using rseq API for scheduler time extension.
>> Sorry about the delay in response.
>> Here are the database swingbench numbers, including results with use of the rseq API.
>> Test results:
>> =========
>> Test system: 2-socket AMD Genoa
>> Swingbench - standard database benchmark
>> Cached (database files on tmpfs) run, with 1000 clients.
>> Baseline (without sched time extension): 99K SQL exec/sec
>> With Sched time extension:
>> Shared structure API use: 153K SQL exec/sec (Previously reported)
>> 55% improvement in throughput.
>> Restartable sequences API use: 147K SQL exec/sec
>> 48% improvement in throughput
>> While both show a good performance benefit with the scheduler time extension,
>> there is a 7-percentage-point gap in improvement between the Shared structure
>> and Restartable sequences APIs; use of the shared structure is faster.
>
> Can you share the code for both test cases ? And do you have relevant
> perf profile showing where time is spent ?
>
> Thanks,
>
> Mathieu
>
The changes are in the database (Oracle DB).
The test is swingbench: https://www.dominicgiles.com/downloads/
Our database team is running the benchmark. I have requested them to repeat the test and
capture a perf profile.
With the restartable sequences API, once a thread registers its 'struct rseq',
the kernel needs a 'copy_from_user' in exit_to_user_mode_loop() to check whether
the thread is requesting an extension, which potentially adds overhead. Whereas
with the shared structure, it is just a direct memory access.
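Roughly, the check in the exit path would differ along these lines (an
illustrative sketch only, not the actual patches; grant_extension() and the
t->task_sharedinfo field are hypothetical stand-ins for the grant logic):

	/* Shared structure API: the page is pinned, a plain load suffices */
	if (t->task_sharedinfo->sched_delay == 1)
		grant_extension(t);

	/* rseq API: the rseq area is ordinary user memory and must be
	 * fetched with copy_from_user(), which can fault:
	 */
	u32 flags;
	if (!copy_from_user(&flags, &t->rseq->flags, sizeof(flags)) &&
	    (flags & RSEQ_SCHED_DELAY))
		grant_extension(t);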
I was trying to reproduce the performance difference using a microbenchmark.
I used a modified version of the test (extend-sched.c) that Steven Rostedt posted here:
https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
The test was modified to use either the Shared structure API or the Restartable
sequences API to request the scheduler time extension, to increase the number of
data objects that the threads contend on in grab_lock(), and to add some delay
inside the critical section to increase lock hold time. The test runs for 180 secs.
(Modified test included below.)
The test spawns as many threads as there are CPUs; they update shared data objects.
A CPU-hog program with the same number of threads runs simultaneously.
The test takes an argument:
0 - No scheduler time extension
1 - Use the Shared structure API to request the extension
2 - Use the Restartable sequences API to request the extension
Options 1 & 2 require a kernel with the corresponding support.
Each thread increments a count in its shared data object indicating the number of
times it completed the critical section; higher is better.
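For reference, I build and run the test along these lines (the compiler flags
are just what I happened to use):

	$ gcc -O2 -pthread -o extend-sched2 extend-sched2.c
	$ ./extend-sched2 1 | egrep -e "^Ran|^Total|^avg"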
Running this test on a 50-core VM shows a lot of variance in the results. Generally we see
a performance improvement of 4 to 6% with use of either API in this test.
Unfortunately, it does not show a consistent difference between the two APIs.
Approximate numbers:
No extension:
#./extend-sched2 0 |egrep -e "^Ran|^Total|^avg"
Ran for 402660 times <---
Total wait time: 16132.358985
avg # grants 0 avg reqs 4026
avg # blocks 14440
With extension: using shared structure API
#./extend-sched2 1 |egrep -e "^Ran|^Total|^avg"
Ran for 425210 times <---
Total wait time: 16043.160130
avg # grants 2582 avg reqs 4252
avg # blocks 14920
About 5.6% improvement.
With extension: using restartable sequences API
#./extend-sched2 2 |egrep -e "^Ran|^Total|^avg"
Ran for 423765 times <---
Total wait time: 16045.406387
avg # grants 2580 avg reqs 4237
avg # blocks 14851
About 5.3 % improvement.
The perf profiles of the test are similar between the two APIs. With restartable
sequences, we see '_copy_from_user' appear.
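For reference, a profile like the one below can be captured along these lines
(the exact perf options may have differed):

	# perf record -e cycles:P -- ./extend-sched2 2
	# perf report --stdio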
# Total Lost Samples: 0
#
# Samples: 890K of event 'cycles:P'
# Event count (approx.): 32581540690314
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ............ .................. .............................................
#
98.67% 877109 lckshrsq2scl lckshrsq2scl [.] grab_lock
0.84% 7496 lckshrsq2scl [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt
0.03% 285 lckshrsq2scl [kernel.kallsyms] [k] native_write_msr
0.02% 200 lckshrsq2scl [kernel.kallsyms] [k] native_read_msr
0.02% 175 lckshrsq2scl [kernel.kallsyms] [k] pvclock_clocksource_read_nowd
0.01% 128 lckshrsq2scl [kernel.kallsyms] [k] sync_regs
0.01% 120 lckshrsq2scl [kernel.kallsyms] [k] get_jiffies_update
0.01% 116 lckshrsq2scl [kernel.kallsyms] [k] __update_load_avg_cfs_rq
0.01% 113 lckshrsq2scl [kernel.kallsyms] [k] update_curr
0.01% 137 lckshrsq2scl [kernel.kallsyms] [k] srso_alias_safe_ret
0.01% 96 lckshrsq2scl [kernel.kallsyms] [k] __update_load_avg_se
0.01% 85 lckshrsq2scl [kernel.kallsyms] [k] update_process_times
0.01% 76 lckshrsq2scl [kernel.kallsyms] [k] update_rq_clock_task
0.01% 101 lckshrsq2scl [kernel.kallsyms] [k] __raw_callee_save___pv_queued_spin_unlock
0.01% 73 lckshrsq2scl [kernel.kallsyms] [k] update_irq_load_avg
0.01% 72 lckshrsq2scl lckshrsq2scl [.] run_thread
0.01% 66 lckshrsq2scl [kernel.kallsyms] [k] rcu_pending
0.01% 65 lckshrsq2scl [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
0.01% 62 lckshrsq2scl [kernel.kallsyms] [k] sched_tick
0.01% 62 lckshrsq2scl [kernel.kallsyms] [k] update_load_avg
0.01% 62 lckshrsq2scl [kernel.kallsyms] [k] hrtimer_interrupt
..
0.00% 14 lckshrsq2scl [kernel.kallsyms] [k] _copy_from_user
..
0.00% 1 lckshrsq2scl [kernel.kallsyms] [k] _copy_to_user
Test program:
/*--- extend-sched2.c ---*/
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdbool.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>
#include <sched.h>	/* sched_yield() */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/time.h>
#include <linux/rseq.h>
#include <sys/syscall.h>
#define barrier() asm volatile ("" ::: "memory")
#define rmb() asm volatile ("lfence" ::: "memory")
#define wmb() asm volatile ("sfence" ::: "memory")
/*----- shared struct ------ */
/*
 * Shared page - shared structure API/ABI
 */
#define __NR_task_getshared 463
#define TASK_SHAREDINFO 1

/* a per thread shared structure */
struct task_sharedinfo {
	volatile unsigned short sched_delay;
};
/*
 * Returns the address of the task_sharedinfo structure in the shared page
 * mapped between userspace and the kernel, for the calling thread.
 */
struct task_sharedinfo *task_getshared(void)
{
	struct task_sharedinfo *ts;

	if (!syscall(__NR_task_getshared, TASK_SHAREDINFO, 0, &ts))
		return ts;
	return NULL;
}
/*------- restartable seq ------- */

/* Proposed flag bit in rseq->flags requesting a scheduler time extension */
#define RSEQ_CS_SCHED_DELAY 3
#define RSEQ_SCHED_DELAY (1U << RSEQ_CS_SCHED_DELAY)

static __thread struct rseq myrseq;
static __thread struct rseq *extend_map_rseq;

static int init_rseql(void)
{
	/* Register the rseq area: size 32 (original ABI), flags 0, signature */
	if (syscall(__NR_rseq, &myrseq, 32, 0, 0x31456))
		return -1;
	extend_map_rseq = &myrseq;
	return 0;
}
/*--------------------------------*/
static pthread_barrier_t pbarrier;
static __thread struct task_sharedinfo *extend_map;
int enable_extend;

static void init_extend_map(void)
{
	if (!enable_extend)
		return;
	if (enable_extend == 1) {
		extend_map = task_getshared();
		if (extend_map == NULL)
			printf("Failed to allocate shared struct\n");
	}
	if (enable_extend == 2) {
		init_rseql();
		if (extend_map_rseq == NULL)
			printf("Failed to init restartable seq\n");
	}
}
/* called from main only */
static void test_extend(void)
{
	int x = 0;

	init_extend_map();
	if (extend_map == NULL && extend_map_rseq == NULL)
		return;
	printf("Spinning...");
	if (enable_extend == 1) {
		/* Request an extension, spin until the kernel grants it (sets 2) */
		extend_map->sched_delay = 1;
		while (extend_map->sched_delay != 2) {
			x++;
			barrier();
		}
	}
	if (enable_extend == 2) {
		/* Request an extension, spin until the kernel clears the flag */
		extend_map_rseq->flags = RSEQ_SCHED_DELAY;
		while (extend_map_rseq->flags & RSEQ_SCHED_DELAY) {
			x++;
			barrier();
		}
	}
	printf("Done\n");
}
struct thread_data {
	unsigned long long start_wait;
	unsigned long long x_count;
	unsigned long long total;
	unsigned long long max;
	unsigned long long min;
	unsigned long long total_wait;
	unsigned long long max_wait;
	unsigned long long min_wait;
	unsigned long long blocks;
	unsigned long long dlygrant;
	struct data *data;
};

struct data {
	unsigned long long x;
	unsigned long lock;
	bool done;
};

int nobj;
struct data *data;
struct thread_data *thrdata;
/* Byte-wide compare-and-swap; the lock values used here fit in the low byte */
static inline unsigned long
cmpxchg(volatile unsigned long *ptr, unsigned long old, unsigned long new)
{
	unsigned long prev;

	asm volatile("lock; cmpxchg %b1,%2"
		     : "=a"(prev)
		     : "q"(new), "m"(*(ptr)), "0"(old)
		     : "memory");
	return prev;
}

static inline unsigned long
xchg(volatile unsigned long *ptr, unsigned long new)
{
	unsigned long ret = new;

	asm volatile("xchg %b0,%1"
		     : "+r"(ret), "+m"(*(ptr))
		     : : "memory");
	return ret;
}
static void extend(struct thread_data *tdata)
{
	if (extend_map == NULL && extend_map_rseq == NULL)
		return;
	if (enable_extend == 1)
		extend_map->sched_delay = 1;
	if (enable_extend == 2)
		extend_map_rseq->flags = RSEQ_SCHED_DELAY;
}

static void unextend(struct thread_data *tdata)
{
	unsigned long prev;

	if (extend_map == NULL && extend_map_rseq == NULL)
		return;
	if (enable_extend == 1) {
		prev = extend_map->sched_delay;
		extend_map->sched_delay = 0;
		// prev = xchg(&extend_map->sched_delay, 0);
		if (prev == 2) {
			/* The kernel granted an extension: yield it back */
			tdata->dlygrant++;
			//printf("Yield!\n");
			sched_yield();
		}
	}
	if (enable_extend == 2) {
		prev = extend_map_rseq->flags;
		extend_map_rseq->flags = 0;
		/* The kernel clears the flag when it grants an extension */
		if (!(prev & RSEQ_SCHED_DELAY)) {
			tdata->dlygrant++;
			//printf("Yield!\n");
			sched_yield();
		}
	}
}
#define sec2usec(sec)	((sec) * 1000000ULL)
#define usec2sec(usec)	((usec) / 1000000ULL)

static unsigned long long get_time(void)
{
	struct timeval tv;
	unsigned long long time;

	gettimeofday(&tv, NULL);
	time = sec2usec(tv.tv_sec);
	time += tv.tv_usec;
	return time;
}
static void grab_lock(struct thread_data *tdata, struct data *data)
{
	unsigned long long start, end, delta;
	unsigned long long end_wait;
	unsigned long long last;
	unsigned long prev;

	if (!tdata->start_wait)
		tdata->start_wait = get_time();
	/* Wait until the lock looks free */
	while (data->lock && !data->done)
		rmb();
	/* Request the time slice extension before taking the lock */
	extend(tdata);
	start = get_time();
	prev = cmpxchg(&data->lock, 0, 1);
	if (prev) {
		/* Lost the race: drop the extension request and retry */
		unextend(tdata);
		tdata->blocks++;
		return;
	}
	end_wait = get_time();
	//printf("Have lock!\n");
	delta = end_wait - tdata->start_wait;
	tdata->start_wait = 0;
	if (!tdata->total_wait || tdata->max_wait < delta)
		tdata->max_wait = delta;
	if (!tdata->total_wait || tdata->min_wait > delta)
		tdata->min_wait = delta;
	tdata->total_wait += delta;
	data->x++;
	last = data->x;
	if (data->lock != 1) {
		printf("Failed locking\n");
		exit(-1);
	}
	/* Extend hold time (barrier() keeps the loop from being optimized out) */
	for (int i = 0; i < 3000000; i++)
		barrier();
	prev = cmpxchg(&data->lock, 1, 0);
	end = get_time();
	if (prev != 1) {
		printf("Failed unlocking\n");
		exit(-1);
	}
	//printf("released lock!\n");
	unextend(tdata);
	delta = end - start;
	if (!tdata->total || tdata->max < delta)
		tdata->max = delta;
	if (!tdata->total || tdata->min > delta)
		tdata->min = delta;
	tdata->total += delta;
	tdata->x_count++;
	/* Let someone else have a turn */
	while (data->x == last && !data->done)
		rmb();
}
static void *run_thread(void *d)
{
	struct thread_data *tdata = d;
	struct data *data = tdata->data;

	init_extend_map();
	pthread_barrier_wait(&pbarrier);
	while (!data->done)
		grab_lock(tdata, data);
	return NULL;
}
/* arg1 == 1 use shared struct, arg1 == 2 use rseq */
int main(int argc, char **argv)
{
	unsigned long long total_wait = 0;
	unsigned long long total_blocks = 0;
	unsigned long long total_grants = 0;
	unsigned long long total_runs = 0;
	unsigned long long secs;
	pthread_t *threads;
	int cpus;

	enable_extend = 0;
	if (argc < 2) {
		printf("Usage:%s <0 - no extend, 1 = shared page, 2 = rseq>\n",
		       argv[0]);
		exit(1);
	}
	enable_extend = atoi(argv[1]);
	printf("enable extend %d\n", enable_extend);
	test_extend();

	cpus = sysconf(_SC_NPROCESSORS_CONF);
	nobj = cpus / 10;
	if (!nobj)	/* avoid modulo by zero on systems with < 10 CPUs */
		nobj = 1;
	threads = calloc(cpus + 1, sizeof(*threads));
	if (!threads) {
		perror("threads");
		exit(-1);
	}
	thrdata = calloc(cpus + 1, sizeof(struct thread_data));
	if (!thrdata) {
		perror("Allocating thrdata");
		exit(-1);
	}
	data = calloc(nobj + 1, sizeof(struct data));
	if (!data) {
		perror("Allocating data");
		exit(-1);
	}

	pthread_barrier_init(&pbarrier, NULL, cpus + 1);
	for (int i = 0; i < cpus; i++) {
		int ret;

		thrdata[i].data = &data[i % nobj];
		ret = pthread_create(&threads[i], NULL, run_thread, &thrdata[i]);
		if (ret) {
			perror("creating threads");
			exit(-1);
		}
	}
	pthread_barrier_wait(&pbarrier);

	sleep(180);
	printf("Finish up\n");
	for (int i = 0; i < nobj; i++)
		data[i].done = true;
	wmb();

	for (int i = 0; i < cpus; i++) {
		pthread_join(threads[i], NULL);
		printf("thread %i:\n", i);
		printf(" count:\t%lld\n", thrdata[i].x_count);
		printf(" total:\t%lld\n", thrdata[i].total);
		printf(" max:\t%lld\n", thrdata[i].max);
		printf(" min:\t%lld\n", thrdata[i].min);
		printf(" total wait:\t%lld\n", thrdata[i].total_wait);
		printf(" max wait:\t%lld\n", thrdata[i].max_wait);
		printf(" min wait:\t%lld\n", thrdata[i].min_wait);
		printf(" #blocks :\t%lld\n", thrdata[i].blocks);
		printf(" #dlygrnt:\t%lld\n", thrdata[i].dlygrant);
		total_wait += thrdata[i].total_wait;
		total_blocks += thrdata[i].blocks;
		total_grants += thrdata[i].dlygrant;
	}
	for (int i = 0; i < nobj; i++)
		total_runs += data[i].x;

	secs = usec2sec(total_wait);
	printf("Ran for %lld times\n", total_runs);
	printf("Total wait time: %lld.%06lld\n", secs, total_wait - sec2usec(secs));
	printf("avg # grants %lld avg reqs %lld \n", total_grants / cpus, total_runs / cpus);
	printf("avg # blocks %lld\n", total_blocks / cpus);
	return 0;
}
/*--------------------------------*/