linux-kernel - Re: [PATCH v2 7/9] sched: define TIF_ALLOW

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=whnwC01m_1f-gaM1xbvvwzwTiKitrWniA-ChZv+bM03dg@mail.gmail.com>
Date:   Mon, 18 Sep 2023 18:57:50 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ankur Arora <ankur.a.arora@...cle.com>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
        akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        willy@...radead.org, mgorman@...e.de, rostedt@...dmis.org,
        jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        jgross@...e.com, andrew.cooper3@...rix.com
Subject: Re: [PATCH v2 7/9] sched: define TIF_ALLOW_RESCHED

On Mon, 18 Sept 2023 at 16:42, Thomas Gleixner <tglx@...utronix.de> wrote:
>
> What about the following:
>
>    1) Keep preemption count and the real preemption points enabled
>       unconditionally.

Well, it's certainly the simplest solution, and gets rid of not just
the 'rep string' issue, but gets rid of all the cond_resched() hackery
entirely.

>       20 years ago this was a real issue because we did not have:
>
>        - the folding of NEED_RESCHED into the preempt count
>
>        - the cacheline optimizations which make the preempt count cache
>          pretty much always cache hot
>
>        - the hardware was way less capable
>
>       I'm not saying that preempt_count is completely free today as it
>       obviously adds more text and affects branch predictors, but as the
>       major distros ship with DYNAMIC_PREEMPT enabled it is obviously an
>       acceptable and tolerable tradeoff.

Yeah, the fact that we do presumably have PREEMPT_COUNT enabled in
most distros does speak for just admitting that the PREEMPT_NONE /
VOLUNTARY approach isn't actually used, and is only causing pain.

>    2) When the scheduler wants to set NEED_RESCHED due it sets
>       NEED_RESCHED_LAZY instead which is only evaluated in the return to
>       user space preemption points.

Is this just to try to emulate the existing PREEMPT_NONE behavior?

If the new world order is that the time slice is always honored, then
the "this might be a latency issue" goes away. Good.

And we'd also get better coverage for the *debug* aim of
"might_sleep()" and CONFIG_DEBUG_ATOMIC_SLEEP, since we'd rely on
PREEMPT_COUNT always existing.

But because the latency argument is gone, the "might_resched()" should
then just be removed entirely from "might_sleep()", so that
might_sleep() would *only* be that DEBUG_ATOMIC_SLEEP thing.

That argues for your suggestion too, since we had a performance issue
due to "might_sleep()" _not_ being just a debug thing, and pointlessly
causing a reschedule in a place where reschedules were _allowed_, but
certainly much less than optimal.

Which then caused that fairly recent commit 4542057e18ca ("mm: avoid
'might_sleep()' in get_mmap_lock_carefully()").

However, that does bring up an issue: even with full preemption, there
are certainly places where we are *allowed* to schedule (when the
preempt count is zero), but there are also some places that are
*better* than other places to schedule (for example, when we don't
hold any other locks).

So, I do think that if we just decide to go "let's just always be
preemptible", we might still have points in the kernel where
preemption might be *better* than in others points.

But none of might_resched(), might_sleep() _or_ cond_resched() are
necessarily that kind of "this is a good point" thing. They come from
a different background.

So what I think what you are saying is that we'd have the following situation:

 - scheduling at "return to user space" is presumably always a good thing.

A non-preempt-count bit NEED_RESCHED_LAZY (or TIF_RESCHED, or
whatever) would cover that, and would give us basically the existing
CONFIG_PREEMPT_NONE behavior.

So a config variable (either compile-time with PREEMPT_NONE or a
dynamic one with DYNAMIC_PREEMPT set to none) would make any external
wakeup only set that bit.

And then a "fully preemptible low-latency desktop" would set the
preempt-count bit too.

 - but the "timeslice over" case would always set the
preempt-count-bit, regardless of any config, and would guarantee that
we have reasonable latencies.

This all makes cond_resched() (and might_resched()) pointless, and
they can just go away.

Then the question becomes whether we'd want to introduce a *new*
concept, which is a "if you are going to schedule, do it now rather
than later, because I'm taking a lock, and while it's a preemptible
lock, I'd rather not sleep while holding this resource".

I suspect we want to avoid that for now, on the assumption that it's
hopefully not a problem in practice (the recently addressed problem
with might_sleep() was that it actively *moved* the scheduling point
to a bad place, not that scheduling could happen there, so instead of
optimizing scheduling, it actively pessimized it). But I thought I'd
mention it.

Anyway, I'm definitely not opposed. We'd get rid of a config option
that is presumably not very widely used, and we'd simplify a lot of
issues, and get rid of all these badly defined "cond_preempt()"
things.

                Linus