linux-kernel - Re: [RFC PATCH 0/4] Scheduler time slice extension

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CFB8E54E-0E12-438C-AD10-F8D2AF2755E5@oracle.com>
Date: Wed, 13 Nov 2024 23:24:57 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
CC: Peter Zijlstra <peterz@...radead.org>,
        "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>,
        "rostedt@...dmis.org" <rostedt@...dmis.org>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        Daniel Jordan
	<daniel.m.jordan@...cle.com>
Subject: Re: [RFC PATCH 0/4] Scheduler time slice extension



> On Nov 13, 2024, at 12:57 PM, Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
> 
> On 2024-11-13 15:10, Prakash Sangappa wrote:
>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>>> 
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>> 
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>> It mainly provides pinned memory, can be useful for  future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults.
> 
> Does the targeted use-case (scheduler time slice extension) require
> pinned memory, or just future use-cases ?

Probably not for sched time slice extension. Although, we have not run the database workload using restartable sequence. We can try that and get back.

What about publishing thread state in the shared structure? It would require updating user memory in context switching code path when thread is going off the cpu. Here having pinned memory would help. Thread state is the other requirement we are interested in.

> 
> Does having a missing time slice extension hint for a short while in
> case of high memory pressure (rseq page swapped out) have any measurable

Yes, but the restartable sequence based implementation does copy_from_user(), which would go thru the pagefault. It may be ok here. Perpaps it would require disabling page faults in this case? If the page is not present, then the hint would most likely not matter and kernel can skip.

> impact compared to the overhead of the page faults which will be
> happening in case of the high memory pressure required to trigger this
> scenario ?

We would have to test that. 

> 
>>> 
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>> 
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>> realistic to require that no system calls should be done within time extension
>>> slice. If we have this scenario:
>> I am also not sure if we need to prevent system calls in this scenario.
> 
> I suspect that if we prohibit system calls from being issued from a
> delay-resched userspace critical section, then loading the delay-resched
> rseq flag on return to userspace is always fine, because the kernel only
> reschedules on return from interrupt or trap.
> 
> But I see this no-syscall restriction as being very cumbersome for
> userspace.
> 
>> Was that restriction mainly because of restartable sequence API implements it?
> 
> I suspect that this restriction is just to avoid loading the
> delay-resched flag from the scheduler when reschedule is called
> from an interrupt handler nested over a system call for preemptible
> kernels, but Peter could tell us more.

Hoping Peter can comment on this.

> One open question here is whether we want to pin memory for
> each thread in the system to hold this shared data between
> userspace and the scheduler. AFAIU, this is a trade-off between
> slice extension accuracy in high memory usage scenarios and
> pinned memory footprint impact. If the tradeoff goes towards
> making this memory pinned, then we may want to consider pinning
> the per-thread rseq area on rseq registration.
> 
> Another option to consider is to use rseq to index a userspace
> per-cpu data structure, which will be used as shared memory
> between kernel and userspace. Userspace could store this
> delay-resched flag into the current CPU's shared data, and
> the scheduler could load it from there. If pinning per-cpu
> data is more acceptable than pinning per-thread data, then
> it could be an improvement.

Interesting. Having pre cpu shared memory may not work for all use cases.

Thanks,
-Prakash

> 
> This could be a new ABI between kernel and userspace, e.g.:
> 
> struct rseq_percpu_area {
>    __u32 sched_state; /* includes time slice extension flag. */
>    char end[];
> };
> 
> Registered to the kernel with the following parameters:
> 
> - Address of rseq_percpu_area for CPU 0,
> - The stride of the per-cpu indexing (see librseq mempool per-cpu
>  allocator [1]),
> - offsetof(struct rseq_percpu_area, end) to have the feature size
>  for extensibility.
> 
> Thanks,
> 
> Mathieu
> 
> [1] https://lpc.events/event/18/contributions/1720/attachments/1572/3268/presentation-lpc2024-rseq-mempool.pdf
> 
>> -Prakash
>>> 
>>> A) userspace grabs lock
>>>   - set rseq delay resched flag
>>> B) syscall
>>>   - reschedule
>>>    [...]
>>>   - return to userspace, load rseq delay-resched flag from userspace (too late)
>>> 
>>> I would have thought loading the delay resched flag should be attempted much
>>> earlier in the scheduler code. Perhaps we could do this from a page fault
>>> disable critical section, and accept that this hint may be a no-op if the
>>> rseq page happens to be swapped out (which is really unlikely). This is
>>> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
>>> which needed to be accessed from the scheduler:
>>> 
>>>  https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
>>>  https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
>>> 
>>> And we'd leave the delay-resched load in place on return to userspace, so
>>> in the unlikely scenario where it is swapped out, at least it gets paged
>>> back at that point.
>>> 
>>> Feel free to let me know if I'm missing an important point and/or saying
>>> nonsense here.
>>> 
>>> Thanks,
>>> 
>>> Mathieu
>>> 
>>> -- 
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>> 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com