Date:   Tue, 07 Nov 2023 22:49:07 -0800
From:   Ankur Arora <ankur.a.arora@...cle.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Christoph Lameter <cl@...ux.com>,
        Ankur Arora <ankur.a.arora@...cle.com>,
        linux-kernel@...r.kernel.org, tglx@...utronix.de,
        peterz@...radead.org, torvalds@...ux-foundation.org,
        paulmck@...nel.org, linux-mm@...ck.org, x86@...nel.org,
        akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        willy@...radead.org, mgorman@...e.de, jon.grimm@....com,
        bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        jgross@...e.com, andrew.cooper3@...rix.com, mingo@...nel.org,
        bristot@...nel.org, mathieu.desnoyers@...icios.com,
        geert@...ux-m68k.org, glaubitz@...sik.fu-berlin.de,
        anton.ivanov@...bridgegreys.com, mattst88@...il.com,
        krypton@...ich-teichert.org, David.Laight@...LAB.COM,
        richard@....at, mjguzik@...il.com
Subject: Re: [RFC PATCH 00/86] Make the kernel preemptible


Steven Rostedt <rostedt@...dmis.org> writes:

> On Tue, 7 Nov 2023 20:52:39 -0800 (PST)
> Christoph Lameter <cl@...ux.com> wrote:
>
>> On Tue, 7 Nov 2023, Ankur Arora wrote:
>>
>> > This came up in an earlier discussion (See
>> > https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/) and Thomas mentioned
>> > that preempt_enable/_disable() overhead was relatively minimal.
>> >
>> > Is your point that always-on preempt_count is far too expensive?
>>
>> Yes, over the years distros have traditionally delivered their kernels
>> without preemption by default because of these issues. If the overhead
>> has been minimized, then that may have changed. Even if so, there is
>> still a lot of code being generated that has questionable benefit and
>> just bloats the kernel.
>>
>> >> These are needed to avoid adding preempt_enable/disable to a lot of primitives
>> >> that are used for synchronization. You cannot remove those without changing a
>> >> lot of synchronization primitives to always have to consider being preempted
>> >> while operating.
>> >
>> > I'm afraid I don't understand why you would need to change any
>> > synchronization primitives. The code that does preempt_enable/_disable()
>> > is compiled out because CONFIG_PREEMPT_NONE/_VOLUNTARY don't define
>> > CONFIG_PREEMPT_COUNT.
>>
>> In the trivial cases it is simple like that. But look, f.e., in the
>> slub allocator at the #ifdef CONFIG_PREEMPTION section. There is
>> overhead added to allow the cpu to change under us. There are likely
>> other examples in the source.
>>
>
> preempt_disable() and preempt_enable() are much lower overhead today than
> they used to be.
>
> If you are worried about changing CPUs, there's also migrate_disable() too.
>
>> And the whole business of local data
>> access via per cpu areas suffers if we cannot rely on two accesses in a
>> section being able to see consistent values.
>>
>> > The intent here is to always have CONFIG_PREEMPT_COUNT=y.
>>
>> Just for fun? Code is most efficient if it does not have to consider too
>> many side conditions like suddenly running on a different processor. This
>> introduces needless complexity into the code. It would be better to remove
>> PREEMPT_COUNT for good to just rely on voluntary preemption. We could
>> probably reduce the complexity of the kernel source significantly.
>
> That is what caused this thread in the first place. Randomly scattered
> "preemption points" do not scale!
>
> And I'm sorry, we have latency sensitive use cases that require full
> preemption.
>
>>
>> I have never noticed a need for preemption at every instruction in the
>> kernel (if that would be possible at all... locks etc. prevent that
>> ideal scenario frequently). Preemption like that is more of a pipe dream.

The intent isn't to preempt at every other instruction in the kernel.

As Thomas describes, the idea is that on voluntary preemption kernels
rescheduling happens at cond_resched() points, which have been placed
heuristically. As a consequence, you can get both too little preemption
and too much preemption.

The intent is to bring preemption under the control of the scheduler,
which can do a better job than randomly placed cond_resched() points.
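
To make the contrast concrete, here is a schematic example (hypothetical
code, not from any particular subsystem):

  /* Voluntary preemption: latency depends on where the author
   * happened to place explicit scheduling points.
   */
  for (i = 0; i < nr; i++) {
          process_item(&items[i]);        /* hypothetical helper */
          cond_resched();                 /* heuristic, hand-placed */
  }

  /* Scheduler-controlled preemption: the loop body stays clean. The
   * scheduler sets NEED_RESCHED (or NEED_RESCHED_LAZY) when it decides
   * the task has run long enough, and the task is preempted at the
   * next point where preempt_count() == 0.
   */
  for (i = 0; i < nr; i++)
          process_item(&items[i]);

Place the cond_resched() too often and you pay for the check (and give
up the CPU) needlessly; too rarely and latency suffers. The scheduler
has the information to make that trade-off; the call site does not.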

>> High performance kernel solution usually disable
>> overhead like that.

You are also missing all the ways in which voluntary preemption points
are responsible for poor performance. For instance, look at
clear_huge_page(): it clears the huge page one base page at a time, with
a cond_resched() call after clearing each page.

But if you can expose the full extent to the CPU, it can optimize
differently (for a 1GB page it can now elide cacheline allocation):

  *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
                          (GB/s)                (GB/s)

  pg-sz=2MB                14.55                 19.29    +32.5%
  pg-sz=1GB                19.34                 49.60   +156.4%

(See https://lore.kernel.org/all/20230830184958.2333078-1-ankur.a.arora@oracle.com/)
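
Schematically, the two approaches look like this (a simplified sketch;
clear_pages() here is a hypothetical helper, not the actual mm/ or
x86 code):

  /* Voluntary preemption: clear one base page at a time so that a
   * cond_resched() can be slipped in between. The CPU only ever
   * sees a PAGE_SIZE extent.
   */
  for (i = 0; i < nr_pages; i++) {
          clear_page(addr + i * PAGE_SIZE);
          cond_resched();
  }

  /* Scheduler-controlled preemption: hand the whole extent to the
   * hardware in one go, which lets it pick a better strategy
   * (e.g. avoiding cacheline allocation for large extents).
   */
  clear_pages(addr, nr_pages);            /* hypothetical helper */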

> Please read the email from Thomas:
>
>    https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
>
> This is not technically getting rid of PREEMPT_NONE. It is adding a new
> NEED_RESCHED_LAZY flag, that will have the kernel preempt only when
> entering or in user space. It will behave the same as PREEMPT_NONE, but
> without the need for all the cond_resched() scattered randomly throughout
> the kernel.

And a corollary of that is that with a scheduler-controlled PREEMPT_NONE,
a task might end up running to completion where previously it would have
been preempted as soon as it crossed a cond_resched().

> If the task is in the kernel for more than one tick (1ms at 1000Hz, 4ms at
> 250Hz and 10ms at 100Hz), it will then set NEED_RESCHED, and you will
> preempt at the next available location (preempt_count == 0).
>
> But yes, all locations that do not explicitly disable preemption, will now
> possibly preempt (due to long running kernel threads).
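
Right. Schematically, the tick-side escalation could look something like
this (a hypothetical sketch, not the actual patches):

  /* On each scheduler tick: if the task was already marked for lazy
   * rescheduling and is still running, escalate to a full NEED_RESCHED
   * so it gets preempted at the next safe point (preempt_count() == 0)
   * instead of waiting until it returns to user space.
   */
  if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
          set_tsk_need_resched(curr);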

--
ankur
