Message-ID: <Z6s3u6Uu8I7r2Jox@tiehlicka>
Date: Tue, 11 Feb 2025 12:42:51 +0100
From: Michal Hocko <mhocko@...e.com>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Oleg Nesterov <oleg@...hat.com>, linux-mm@...ck.org
Subject: Re: [PATCH 6/6 v2] mm: Drain LRUs upon resume to userspace on
 nohz_full CPUs

On Sun 09-02-25 23:30:04, Frederic Weisbecker wrote:
> LRUs can be drained in several ways. One of them may add disturbances
> to isolated workloads by queuing a work at any time to any target,
> whether running in nohz_full mode or not.
> 
> Prevent that on isolated tasks by deferring LRU drains until the resume
> to userspace, using the isolated task work framework.

I have to say this is a rather cryptic description of the underlying
problem. What do you think about the following:

LRU batching can be a source of disturbance for isolated workloads
running in userspace, because draining the batches requires a kernel
worker and that worker preempts said task. The primary source of such
disruption is __lru_add_drain_all, which can be triggered from
non-isolated CPUs.

Why would an isolated CPU have anything on the pcp cache? Many syscalls
allocate pages that might end up there. A typical and unavoidable example
is fork/exec, which leaves pages behind in the cache, just waiting for
somebody to drain them.

This patch addresses the problem by noting that a page has been added to
the cache and scheduling the drain on the return path to userspace, so
the work is done while the syscall is still executing and there are no
surprises while the task runs in userspace, where it doesn't want to be
preempted.
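
To be more explicit about where the disturbance comes from, this is
roughly what __lru_add_drain_all does today (a simplified sketch only,
details elided from mm/swap.c):

	/*
	 * A drain request queues a worker on every CPU that has pending
	 * folios in its per-CPU batches, including isolated ones, and
	 * that worker preempts whatever is running there.
	 */
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		if (cpu_needs_drain(cpu)) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			queue_work_on(cpu, mm_percpu_wq, work);
		}
	}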

> 
> Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
> ---
>  include/linux/swap.h     | 1 +
>  kernel/sched/isolation.c | 3 +++
>  mm/swap.c                | 8 +++++++-
>  3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b13b72645db3..a6fdcc04403e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -406,6 +406,7 @@ extern void lru_add_drain(void);
>  extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_cpu_zone(struct zone *zone);
>  extern void lru_add_drain_all(void);
> +extern void lru_add_and_bh_lrus_drain(void);
>  void folio_deactivate(struct folio *folio);
>  void folio_mark_lazyfree(struct folio *folio);
>  extern void swap_setup(void);
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index f25a5cb33c0d..1f9ec201864c 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -8,6 +8,8 @@
>   *
>   */
>  
> +#include <linux/swap.h>
> +
>  enum hk_flags {
>  	HK_FLAG_DOMAIN		= BIT(HK_TYPE_DOMAIN),
>  	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
> @@ -253,6 +255,7 @@ __setup("isolcpus=", housekeeping_isolcpus_setup);
>  #if defined(CONFIG_NO_HZ_FULL)
>  static void isolated_task_work(struct callback_head *head)
>  {
> +	lru_add_and_bh_lrus_drain();
>  }
>  
>  int __isolated_task_work_queue(void)
> diff --git a/mm/swap.c b/mm/swap.c
> index fc8281ef4241..da1e569ee3ce 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -37,6 +37,7 @@
>  #include <linux/page_idle.h>
>  #include <linux/local_lock.h>
>  #include <linux/buffer_head.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "internal.h"
>  
> @@ -376,6 +377,8 @@ static void __lru_cache_activate_folio(struct folio *folio)
>  	}
>  
>  	local_unlock(&cpu_fbatches.lock);
> +
> +	isolated_task_work_queue();
>  }

This placement doesn't make much sense to me. I would put
isolated_task_work_queue() where we actually queue something up. That
would be folio_batch_add, when folio_batch_space(fbatch) > 0.
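
Something along these lines, just a sketch of the idea (condensed, not
the exact folio_batch_add_and_move code):

	if (folio_batch_add(fbatch, folio) && !lru_cache_disabled()) {
		/*
		 * The folio stays behind in the per-CPU batch, so note
		 * that there is something to drain on the way back to
		 * userspace.
		 */
		isolated_task_work_queue();
		return;
	}
	folio_batch_move_lru(fbatch, move_fn);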

>  
>  #ifdef CONFIG_LRU_GEN
> @@ -738,7 +741,7 @@ void lru_add_drain(void)
>   * the same cpu. It shouldn't be a problem in !SMP case since
>   * the core is only one and the locks will disable preemption.
>   */
> -static void lru_add_and_bh_lrus_drain(void)
> +void lru_add_and_bh_lrus_drain(void)
>  {
>  	local_lock(&cpu_fbatches.lock);
>  	lru_add_drain_cpu(smp_processor_id());
> @@ -769,6 +772,9 @@ static bool cpu_needs_drain(unsigned int cpu)
>  {
>  	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
>  
> +	if (!housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
> +		return false;
> +

Would it make more sense to use cpu_is_isolated() explicitly in
__lru_add_drain_all, so that it is clearly visible - with a comment that
isolated workloads deal with their caches on the return to userspace?
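
Something like this (sketch only, inside the CPU loop of
__lru_add_drain_all):

	for_each_online_cpu(cpu) {
		/*
		 * Isolated CPUs drain their own per-CPU caches on the
		 * return to userspace, so do not queue remote drain
		 * work on them and disturb the isolated workload.
		 */
		if (cpu_is_isolated(cpu))
			continue;

		if (!cpu_needs_drain(cpu))
			continue;

		/* queue the per-CPU drain work as done today */
		queue_work_on(cpu, mm_percpu_wq,
			      &per_cpu(lru_add_drain_work, cpu));
	}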

>  	/* Check these in order of likelihood that they're not zero */
>  	return folio_batch_count(&fbatches->lru_add) ||
>  		folio_batch_count(&fbatches->lru_move_tail) ||
> -- 
> 2.46.0

-- 
Michal Hocko
SUSE Labs
