linux-kernel - Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1623715482.4lskm3cx10.astroid@bobo.none>
Date:   Tue, 15 Jun 2021 10:55:14 +1000
From:   Nicholas Piggin <npiggin@...il.com>
To:     Andrew Morton <akpm@...ux-foundation.org>,
        Andy Lutomirski <luto@...nel.org>
Cc:     Anton Blanchard <anton@...abs.org>, linux-arch@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linuxppc-dev@...ts.ozlabs.org,
        Randy Dunlap <rdunlap@...radead.org>,
        Rik van Riel <riel@...riel.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be
 configurable

Excerpts from Andy Lutomirski's message of June 15, 2021 2:20 am:
> Replying to several emails at once...
> 
> 
> On 6/13/21 10:21 PM, Nicholas Piggin wrote:
>> Excerpts from Nicholas Piggin's message of June 14, 2021 2:47 pm:
>>> Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
>>>> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>>>>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>>>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>>>>> when it is context switched. This can be disabled by architectures that
>>>>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>>>>
>>>>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>>>>
>>>>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>>>>> with the resulting code if task->active_mm were at least better
>>>>>>> documented and possibly even guarded by ifdefs.
>>>>>>
>>>>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>>>>> I don't think anything has changed in 20 years, I don't know what more
>>>>>> is needed, but if you can add to documentation that would be nice. Maybe
>>>>>> moving a bit of that into .c and .h files?
>>>>>>
>>>>>
>>>>> Quoting from that file:
>>>>>
>>>>>   - however, we obviously need to keep track of which address space we
>>>>>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>>>>     which shows what the currently active address space is.
>>>>>
>>>>> This isn't even true right now on x86.
>>>>
>>>> From the perspective of core code, it is. x86 might do something crazy 
>>>> with it, but it has to make it appear this way to non-arch code that
>>>> uses active_mm.
>>>>
>>>> Is x86's scheme documented?
> 
> arch/x86/include/asm/tlbflush.h documents it a bit:
> 
>         /*
>          * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
>          * are on.  This means that it may not match current->active_mm,
>          * which will contain the previous user mm when we're in lazy TLB
>          * mode even if we've already switched back to swapper_pg_dir.
>          *
>          * During switch_mm_irqs_off(), loaded_mm will be set to
>          * LOADED_MM_SWITCHING during the brief interrupts-off window
>          * when CR3 and loaded_mm would otherwise be inconsistent.  This
>          * is for nmi_uaccess_okay()'s benefit.
>          */

So the only documentation relating to the current active_mm value or 
refcounting is that it may not match what the x86 specific code is 
doing?

All this complexity you accuse me of adding is entirely in x86 code.
On other architectures, it's very simple and understandable, and 
documented. I don't know how else to explain this.

>>>>
>>>>> With your patch applied:
>>>>>
>>>>>  To support all that, the "struct mm_struct" now has two counters: a
>>>>>  "mm_users" counter that is how many "real address space users" there are,
>>>>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>>>>  users) plus one if there are any real users.
>>>>>
>>>>> isn't even true any more.
>>>>
>>>> Well yeah but the active_mm concept hasn't changed. The refcounting 
>>>> change is hopefully reasonably documented?
> 
> active_mm is *only* refcounting in the core code.  See below.

It's just not. It's passed in to switch_mm. Most architectures except 
for x86 require this.

>>>>>
>>>>> I looked through all active_mm references in core code.  We have:
>>>>>
>>>>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>>>>> with membarrier.
>>>>>
>>>>> kernel/kthread.c: same.  refcounting and membarrier stuff.
>>>>>
>>>>> kernel/exit.c: exit_mm() a BUG_ON().
>>>>>
>>>>> kernel/fork.c: initialization code and a warning.
>>>>>
>>>>> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
>>>>>
>>>>> fs/exec.c: nothing of interest
>>>>
>>>> I might not have been clear. Core code doesn't need active_mm if 
>>>> active_mm somehow goes away. I'm saying active_mm can't go away because
>>>> it's needed to support (most) archs that do lazy tlb mm switching.
>>>>
>>>> The part I don't understand is when you say it can just go away. How? 
> 
> #ifdef CONFIG_MMU_TLB_REFCOUNT
> 	struct mm_struct *active_mm;
> #endif

Thanks for returning the snark.

>>>>
>>>>> I didn't go through drivers, but I maintain my point.  active_mm is
>>>>> there for refcounting.  So please don't just make it even more confusing
>>>>> -- do your performance improvement, but improve the code at the same
>>>>> time: get rid of active_mm, at least on architectures that opt out of
>>>>> the refcounting.
>>>>
>>>> powerpc opts out of the refcounting and can not "get rid of active_mm".
>>>> Not even in theory.
>>>
>>> That is to say, it does do a type of reference management that requires 
>>> active_mm so you can argue it has not entirely opted out of refcounting.
>>> But we're not just doing refcounting for the sake of refcounting! That
>>> would make no sense.
>>>
>>> active_mm is required because that's the mm that we have switched to 
>>> (from core code's perspective), and it is integral to know when to 
>>> switch to a different mm. See how active_mm is a fundamental concept
>>> in core code? It's part of the contract between core code and the
>>> arch mm context management calls. reference counting follows from there
>>> but it's not the _reason_ for this code.
> 
> I don't understand what contract you're talking about.  The core code
> maintains an active_mm counter and keeps that mm_struct from
> disappearing.  That's *it*.  The core code does not care that active_mm
> is active, and x86 provides evidence of that -- on x86,
> current->active_mm may well be completely unused.

I already acknowledged archs can do their own thing under the covers if 
they want.

> 
>>>
>>> Pretend the reference problem does not exit (whether by refcounting or 
>>> shootdown or garbage collection or whatever). We still can't remove 
>>> active_mm! We need it to know how to call into arch functions like 
>>> switch_mm.
> 
> static inline void do_switch_mm(struct task_struct *prev_task, ...)
> {
> #ifdef CONFIG_MMU_TLB_REFCOUNT
> 	switch_mm(...);
> #else
> 	switch_mm(fewer parameters);
> 	/* or pass NULL or whatever. */
> #endif
> }

And prev_task comes from active_mm, ergo core code requires the concept 
of active_mm.

I already said I'm quite happy for wrappers to be added so those can 
compile out of an arch like x86. That's not doing anything for 
complexity of core code, because it still needs to maintain the active 
mm wrappers.

>>> I don't know if you just forgot that critical requirement in your above 
>>> list, or you actually are entirely using x86's mental model for this 
>>> code which is doing something entirely different that does not need it 
>>> at all. If that is the case I really don't mind some cleanup or wrapper 
>>> functions for x86 do entirely do its own thing, but if that's the case
>>> you can't criticize core code's use of active_mm due to the current
>>> state of x86. It's x86 that needs documentation and cleaning up.
>> 
>> Ah, that must be where your confusion is coming from: x86's switch_mm 
>> doesn't use prev anywhere, and the reference scheme it is using appears 
>> to be under-documented, although vague references in changelogs suggest 
>> it has not actually "opted out" of active_mm refcounting.
> 
> All of this is true, except I don't believe I'm confused.

Well someone is. My patches don't change any of the fundamental 
complexity of core code's maintaining of active_mm, nor alleviate the
requirement to keep active_mm around under the new config introduced,
and yet you are saying:

  I am in favor of this approach, but I would be a lot more comfortable
  with the resulting code if task->active_mm were at least better
  documented and possibly even guarded by ifdefs.

It doesn't make sense. It can't be guarded by ifdefs even if we took
away the shootdown IPI's use of it as part of reference / lifetime
management.

And that you don't like the state of the code but I'm trying to wrangle 
specifics out of you on that and it's difficult.

>> 
>> That's understandable, but please redirect your objections to the proper 
>> place. git blame suggests 3d28ebceaffab.
> 
> Thanks for the snark.

Is it not true? I don't mean that one patch causing all the x86 
complexity or even making the situation worse itself. But you seem to be 
asking my series to do things that really apply to the x86 changes over
the past few years that got us here.

> 
> Here's the situation.
> 
> Before that patch, x86 more or less fully used the core scheme.  With
> that patch, x86 has the property that loaded_mm (which is used) either
> points to init_mm (which is permanently live) or to active_mm (which the
> core code keeps alive, which is the whole point).  x86 still uses the
> core active_mm refcounting scheme.  The result is a bit overcomplicated,
> but it works, and it enabled massive improvements to the x86 arch code.
> 
> You are proposing a whole new simplification in which an arch can opt in
> to telling the core code that it doesn't need to keep active_mm alive.
> Great!  But now it's possible to get quite confused -- either an
> elaborate dance is needed or current->active_mm could point to freed
> memory.  This is poor design.

powerpc still needs active_mm, there's zero point to configuring it away 
and maintaining exactly the same thing in a per-cpu variable in the arch
code because the code has to be there in the core for all other archs
anyway so that would be stupid.

I would be more than happy to try help review a patch that helps the
x86 situation, but my patch is not intended to do what x86 code wants.

> I'm entirely in favor of allowing arches to opt out of active_mm
> refcounting.  As you've noticed, it's quite expensive on large systems.
> x86 would opt out, too, if given the opportunity [1].  But I think you
> should do it right.  If an arch opts out of active_mm refcounting, then
> keeping a lazy mm (if any!) alive becomes entirely the arch code's
> responsibility.
>
> Once that happens, task->active_mm is not just a waste
> of 4-8 bytes of memory per task, it's actively harmful -- some code,
> somewhere in the kernel, might dereference it and access freed memory!

That's not an accurate description of my patches. powerpc still requires 
active_mm exactly the same, and the "elaborate dance" amounts to having
CPUs stop using the mm as a lazy tlb when the last user goes away.

> 
> So please do your patches right.  By all means add a new config option,
> but make that config option make active_mm go away entirely.  Then it
> can't be misused.

I already have a patch that's halfway there I pulled out of the series
for the last submission. As I said, feel free to build on it.

> 
> NAK to the current set.

I propose instead we do take the series, and write some patches on top 
of that which helps x86. Here's 1/n

2/n is to make wrappers for active_mm maintenance in core code.

3/n is adjust context_switch_nolazy() to just use init_mm (or NULL) directly
and put ifdefs around active_mm.

Is that what x86 wants?

Thanks,
Nick

commit 99f90520821b9717d4447872354c2933894a1100
Author: Nicholas Piggin <npiggin@...il.com>
Date:   Thu Jul 9 15:01:25 2020 +1000

    lazy tlb: allow lazy tlb mm switching to be configurable
    
    Add CONFIG_MMU_LAZY_TLB which can be configured out to disable the lazy
    tlb mechanism entirely, and switches to init_mm when switching to a
    kernel thread.
    
    NOMMU systems could easily go without this and save a bit of code and
    the refcount atomics, because their mm switch is a no-op. They have not
    been switched over by default because the arch code needs to be audited
    and tested for lazy tlb mm refcounting and converted to _lazy_tlb
    refcounting if necessary.
    
    CONFIG_MMU_LAZY_TLB_REFCOUNT is also added, but it must always be
    enabled if CONFIG_MMU_LAZY_TLB is enabled until the next patch which
    provides an alternate scheme.
    
    Signed-off-by: Nicholas Piggin <npiggin@...il.com>

diff --git a/arch/Kconfig b/arch/Kconfig
index cf468c9777d8..3b80042bd0c2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -418,6 +418,26 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  irqs disabled over activate_mm. Architectures that do IPI based TLB
 	  shootdowns should enable this.
 
+# Enable "lazy TLB", which means a user->kernel thread context switch does not
+# switch the mm to init_mm and the kernel thread takes a reference to the user
+# mm to provide its kernel mapping. This is how Linux has traditionally worked
+# (see Documentation/vm/active_mm.rst), for performance. Switching to and from
+# idle thread is a performance-critical case.
+#
+# If mm context switches are inexpensive or free (in the case of NOMMU) then
+# this could be disabled.
+#
+# It would make sense to have this depend on MMU, but need to audit and test
+# the NOMMU architectures for lazy mm refcounting first.
+config MMU_LAZY_TLB
+	def_bool y
+	depends on !NO_MMU_LAZY_TLB
+
+# This allows archs to disable MMU_LAZY_TLB. mmgrab/mmdrop in arch/ code has
+# to be audited and switched to _lazy_tlb postfix as necessary.
+config NO_MMU_LAZY_TLB
+	def_bool n
+
 # Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
 # MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
 # to/from kernel threads when the same mm is running on a lot of CPUs (a large
@@ -431,6 +451,7 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # to a kthread ->active_mm (non-arch code has been converted already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on MMU_LAZY_TLB
 	depends on !MMU_LAZY_TLB_SHOOTDOWN
 
 # This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
@@ -445,6 +466,7 @@ config MMU_LAZY_TLB_REFCOUNT
 # MMU_LAZY_TLB_REFCOUNT=n (see above).
 config MMU_LAZY_TLB_SHOOTDOWN
 	bool
+	depends on MMU_LAZY_TLB
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10cb712be3..5cb039c686a6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4285,22 +4285,10 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	calculate_sigpending();
 }
 
-/*
- * context_switch - switch to the new MM and the new thread's register state.
- */
-static __always_inline struct rq *
-context_switch(struct rq *rq, struct task_struct *prev,
-	       struct task_struct *next, struct rq_flags *rf)
+static __always_inline void
+context_switch_mm(struct rq *rq, struct task_struct *prev,
+	       struct task_struct *next)
 {
-	prepare_task_switch(rq, prev, next);
-
-	/*
-	 * For paravirt, this is coupled with an exit in switch_to to
-	 * combine the page table reload and the switch backend into
-	 * one hypercall.
-	 */
-	arch_start_context_switch(prev);
-
 	/*
 	 * kernel -> kernel   lazy + transfer active
 	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
@@ -4345,6 +4333,40 @@ context_switch(struct rq *rq, struct task_struct *prev,
 #endif
 		}
 	}
+}
+
+static __always_inline void
+context_switch_mm_nolazy(struct rq *rq, struct task_struct *prev,
+	       struct task_struct *next)
+{
+	if (!next->mm)
+		next->active_mm = &init_mm;
+	membarrier_switch_mm(rq, prev->active_mm, next->active_mm);
+	switch_mm_irqs_off(prev->active_mm, next->active_mm, next);
+	if (!prev->mm)
+		prev->active_mm = NULL;
+}
+
+/*
+ * context_switch - switch to the new MM and the new thread's register state.
+ */
+static __always_inline struct rq *
+context_switch(struct rq *rq, struct task_struct *prev,
+	       struct task_struct *next, struct rq_flags *rf)
+{
+	prepare_task_switch(rq, prev, next);
+
+	/*
+	 * For paravirt, this is coupled with an exit in switch_to to
+	 * combine the page table reload and the switch backend into
+	 * one hypercall.
+	 */
+	arch_start_context_switch(prev);
+
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+		context_switch_mm(rq, prev, next);
+	else
+		context_switch_mm_nolazy(rq, prev, next);
 
 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);