Message-ID: <20230411110344.GC576825@hirez.programming.kicks-ass.net>
Date:   Tue, 11 Apr 2023 13:03:44 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:     linux-kernel@...r.kernel.org, Aaron Lu <aaron.lu@...el.com>,
        Olivier Dion <odion@...icios.com>, michael.christie@...cle.com,
        npiggin@...il.com
Subject: Re: [RFC PATCH v3] sched: Fix performance regression introduced by
 mm_cid

On Fri, Apr 07, 2023 at 09:14:36PM -0400, Mathieu Desnoyers wrote:

> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 2a243616f222..f20fc0600fcc 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -37,6 +37,11 @@ static inline void mmgrab(struct mm_struct *mm)
>  	atomic_inc(&mm->mm_count);
>  }
> +static inline void smp_mb__after_mmgrab(void)
> +{
> +	smp_mb__after_atomic();
> +}
> +
>  extern void __mmdrop(struct mm_struct *mm);
>  static inline void mmdrop(struct mm_struct *mm)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9e0fa4193499..8d410c0dcb39 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5117,7 +5117,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
>  	sched_info_switch(rq, prev, next);
>  	perf_event_task_sched_out(prev, next);
>  	rseq_preempt(prev);
> -	switch_mm_cid(prev, next);
>  	fire_sched_out_preempt_notifiers(prev, next);
>  	kmap_local_sched_out();
>  	prepare_task(next);
> @@ -5273,6 +5272,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	 *
>  	 * kernel ->   user   switch + mmdrop() active
>  	 *   user ->   user   switch
> +	 *
> +	 * switch_mm_cid() needs to be updated if the barriers provided
> +	 * by context_switch() are modified.
>  	 */
>  	if (!next->mm) {                                // to kernel
>  		enter_lazy_tlb(prev->active_mm, next);
> @@ -5302,6 +5304,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  		}
>  	}
> +	/* switch_mm_cid() requires the memory barriers above. */
> +	switch_mm_cid(prev, next);
> +
>  	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
>  	prepare_lock_switch(rq, next, rf);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index bc0e1cd0d6ac..f3e7dc2cd1cc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3354,6 +3354,37 @@ static inline int mm_cid_get(struct mm_struct *mm)
>  static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
>  {
> +	/*
> +	 * Provide a memory barrier between rq->curr store and load of
> +	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
> +	 *
> +	 * Should be adapted if context_switch() is modified.
> +	 */
> +	if (!next->mm) {                                // to kernel
> +		/*
> +		 * user -> kernel transition does not guarantee a barrier, but
> +		 * we can use the fact that it performs an atomic operation in
> +		 * mmgrab().
> +		 */
> +		if (prev->mm)                           // from user
> +			smp_mb__after_mmgrab();
> +		/*
> +		 * kernel -> kernel transition does not change rq->curr->mm
> +		 * state. It stays NULL.
> +		 */
> +	} else {                                        // to user
> +		/*
> +		 * kernel -> user transition does not provide a barrier
> +		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
> +		 * Provide it here.
> +		 */
> +		if (!prev->mm)                          // from kernel
> +			smp_mb();
> +		/*
> +		 * user -> user transition guarantees a memory barrier through
> +		 * switch_mm().
> +		 */
> +	}
>  	if (prev->mm_cid_active) {
>  		mm_cid_put_lazy(prev);
>  		prev->mm_cid = -1;
> 

This is going to be a pain wrt.:

  https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com

which is already in -next. Also, I reckon Nick isn't going to be too
happy -- although I reckon smp_mb() is better than an atomic op on
Power. But still.

Urgh...

For Nick: the TL;DR is we need an smp_mb() after setting rq->curr and
before calling switch_mm_cid() *IFF* rq->curr->mm changes. Normally this
is provided by switch_mm() itself by virtue of actually changing the
address space, except for the whole active_mm/lazy swizzle nonsense,
which still leaves a few holes.
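
To make the required ordering concrete: this is the classic
store-buffering shape, where each side needs a full barrier between its
store and its subsequent load, or both sides can read the stale value.
Below is a minimal userspace C11 sketch of that shape; every identifier
in it is an illustrative stand-in, not the kernel's code, and the
"remote side" pairing is my reading of the upthread explanation:

  /*
   * Illustrative sketch only -- models the ordering switch_mm_cid()
   * relies on, per the comment in the patch above.  rq_curr_mm stands
   * in for the rq->curr store, pcpu_cid for mm->pcpu_cid[cpu].
   */
  #include <stdatomic.h>
  #include <stdio.h>

  static _Atomic int rq_curr_mm;  /* models rq->curr(->mm)      */
  static _Atomic int pcpu_cid;    /* models mm->pcpu_cid[cpu]   */

  /* Context-switch side: publish rq->curr, fence, then load the
   * per-cpu cid.  The fence is what the patch supplies, either as
   * smp_mb() or via the atomic in mmgrab() + smp_mb__after_mmgrab(). */
  static int switch_side(void)
  {
          atomic_store_explicit(&rq_curr_mm, 1, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);  /* the smp_mb() */
          return atomic_load_explicit(&pcpu_cid, memory_order_relaxed);
  }

  /* Remote side (e.g. the cid compaction path; naming here is mine):
   * store the cid, fence, then check whether the mm is still current
   * on the other CPU. */
  static int remote_side(void)
  {
          atomic_store_explicit(&pcpu_cid, 1, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);
          return atomic_load_explicit(&rq_curr_mm, memory_order_relaxed);
  }

  int main(void)
  {
          /* Run sequentially just to show the shape; with both fences,
           * two racing CPUs cannot *both* read 0 above. */
          printf("%d %d\n", switch_side(), remote_side());
          return 0;
  }

Drop either fence and the both-read-0 outcome becomes allowed, which is
the window the hunks above are closing.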

The very much longer explanation is upthread here:

  https://lkml.kernel.org/r/fdaa7242-4ddd-fbe2-bc0e-6c62054dbde8@efficios.com
