Message-ID: <C9D3DC1A-CBF5-4AB3-B500-C932A6868B13@oracle.com>
Date: Thu, 20 Nov 2025 07:37:34 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Boqun Feng <boqun.feng@...il.com>, Jonathan Corbet <corbet@....net>,
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Arnd Bergmann <arnd@...db.de>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
Subject: Re: [patch V3 07/12] rseq: Implement syscall entry work for time
slice extensions
> On Nov 19, 2025, at 7:25 AM, Thomas Gleixner <tglx@...utronix.de> wrote:
>
> On Wed, Nov 19 2025 at 00:20, Prakash Sangappa wrote:
>>> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@...utronix.de> wrote:
>>> + if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
>>> + force_sig(SIGSEGV);
>>> +}
>>
>> I have been trying to get our Database team to implement changes to
>> use the slice extension API. They run into the case where a system
>> call is made within the slice extension window, and the process dies
>> with SIGSEGV.
>
> Good. Works as designed.
>
>> Apparently it will be hard to guarantee that no system call is made
>> within the slice extension window, due to layering.
>
> Why do I have a smell of rotten onions in my nose right now?
>
>> For the DB use case, it is fine to terminate the slice extension if a
>> system call is made, but the process getting killed will not work.
>
> That's not a question of being fine or not.
>
> The point is that on PREEMPT_NONE/VOLUNTARY that arbitrary syscall can
> consume tons of CPU cycles until it either schedules out voluntarily or
> reaches __exit_to_user_mode_loop(), which is defeating the whole
> mechanism. The timer does not help in that case because once the task is
> in the kernel it won't be preempted on return from interrupt.
>
> sys_rseq_slice_yield() is time bound, which is why it was implemented
> that way.
>
> I was absolutely right when I asked to tie this mechanism to
> PREEMPT_LAZY|FULL in the first place. That would nicely avoid the whole
> problem.
>
> Something like the uncompiled and untested below should work. Though I
> hate it with a passion.
That works. It addresses the DB issue.
> + * Grudgingly support onion layer applications which cannot
> + * guarantee that rseq_slice_yield() is used to yield the CPU for
> + * terminating a grant. This is a NOP on PREEMPT_FULL/LAZY because
> + * enabling preemption above already scheduled, but required for
> + * PREEMPT_NONE/VOLUNTARY to prevent that the slice is further
> + * expanded up to the point where the syscall code schedules
> + * voluntarily or reaches exit_to_user_mode_loop().
> */
> - if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
> - force_sig(SIGSEGV);
> + if (syscall != __NR_rseq_slice_yield)
> + cond_resched();
> }
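
For context, the DB-side changes follow roughly the pattern sketched below. This is only a sketch against my reading of the V3 ABI: the slice control word sits in the registered struct rseq (the kernel code above clears rseq.usrptr->slice_ctrl.all) and a granted extension is terminated with the dedicated rseq_slice_yield() syscall. The SLICE_EXT_REQUEST/GRANTED bit names, the syscall number and the slice_ctrl pointer below are placeholders, not the actual uapi definitions.

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Placeholder bit layout and syscall number; the real definitions come
 * from the patched uapi headers. These are NOT the actual ABI values.
 */
#define SLICE_EXT_REQUEST	0x1U
#define SLICE_EXT_GRANTED	0x2U
#ifndef __NR_rseq_slice_yield
#define __NR_rseq_slice_yield	468
#endif

/*
 * Points at the slice control word inside this thread's registered
 * struct rseq; how it is located is ABI detail omitted here.
 */
static volatile uint32_t *slice_ctrl;

static void run_short_critical_section(void (*fn)(void))
{
	/* Ask the kernel to defer preemption across the critical section. */
	__atomic_store_n(slice_ctrl, SLICE_EXT_REQUEST, __ATOMIC_RELAXED);

	fn();		/* must not enter the kernel, see the discussion above */

	/*
	 * Clear the control word and, if the kernel actually granted an
	 * extension, hand the CPU back right away via the time-bound
	 * yield syscall instead of letting some arbitrary later syscall
	 * terminate the grant.
	 */
	if (__atomic_exchange_n(slice_ctrl, 0U, __ATOMIC_RELAXED) & SLICE_EXT_GRANTED)
		syscall(__NR_rseq_slice_yield);
}

With the change above, if such a critical section does end up entering the kernel through a lower layer, the grant is simply terminated via cond_resched() instead of the task being killed.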
With this change, here are the Swingbench performance results I received from our Database team.

https://www.dominicgiles.com/swingbench/

Kernel: based on rseq/slice v3 + the above change.
System: 2-socket AMD.
Config: cached DB, i.e. DB files cached on tmpfs.

Response from the Database performance engineer:
Overall the results are very positive and consistent with the earlier findings; we see a clear benefit from the optimization when running the same tests as before.
• The sgrant figure in /sys/kernel/debug/rseq/stats increases with the DB-side optimization enabled, while it stays flat when disabled. I believe this indicates that both the kernel-side code and the DB-side triggers are working as expected (a trivial way to check this counter is sketched after the numbers below).
• Due to the highly contended nature of the workload these tests produce highly erratic results, but the optimization shows improved performance across three runs each with and without the time slice extension.
• Swingbench throughput with the time slice optimization:
  • Run 1: 50,008.10
  • Run 2: 59,160.60
  • Run 3: 67,342.70
• Swingbench throughput without the time slice optimization:
  • Run 1: 36,422.80
  • Run 2: 33,186.00
  • Run 3: 44,309.80
• The application performs 55% better on average with the optimization.
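
For completeness, the sgrant observation above was made by simply dumping /sys/kernel/debug/rseq/stats before and after a run. Below is the kind of trivial check that can be used; the exact file format is whatever the debug code prints, so nothing is parsed here.

#include <stdio.h>
#include <stdlib.h>

/*
 * Dump the rseq debug statistics (including the sgrant counter); run it
 * before and after a benchmark and compare the counters.
 */
int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/rseq/stats", "r");
	char line[256];

	if (!f) {
		perror("fopen /sys/kernel/debug/rseq/stats");
		return EXIT_FAILURE;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return EXIT_SUCCESS;
}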
-Prakash