linux-kernel - Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0d6bf5a97777bec1e0b425f2fb33dbb80d848621.camel@redhat.com>
Date:   Tue, 26 Apr 2022 18:42:43 +0200
From:   Nicolas Saenz Julienne <nsaenzju@...hat.com>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Marcelo Tosatti <mtosatti@...hat.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Michal Hocko <mhocko@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH 5/6] mm/page_alloc: Protect PCP lists with a spinlock

On Wed, 2022-04-20 at 10:59 +0100, Mel Gorman wrote:
> Currently the PCP lists are protected by using local_lock_irqsave to
> prevent migration and IRQ reentrancy but this is inconvenient. Remote
> draining of the lists is impossible and a workqueue is required and
> every task allocation/free must disable then enable interrupts which is
> expensive.
> 
> As preparation for dealing with both of those problems, protect the
> lists with a spinlock. The IRQ-unsafe version of the lock is used
> because IRQs are already disabled by local_lock_irqsave. spin_trylock
> is used in preparation for a time when local_lock could be used instead
> of lock_lock_irqsave.
> 
> The per_cpu_pages still fits within the same number of cache lines after
> this patch relative to before the series.
> 
> struct per_cpu_pages {
>         spinlock_t                 lock;                 /*     0     4 */
>         int                        count;                /*     4     4 */
>         int                        high;                 /*     8     4 */
>         int                        batch;                /*    12     4 */
>         short int                  free_factor;          /*    16     2 */
>         short int                  expire;               /*    18     2 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         struct list_head           lists[13];            /*    24   208 */
> 
>         /* size: 256, cachelines: 4, members: 7 */
>         /* sum members: 228, holes: 1, sum holes: 4 */
>         /* padding: 24 */
> } __attribute__((__aligned__(64)));
> 
> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
> ---
>  include/linux/mmzone.h |   1 +
>  mm/page_alloc.c        | 155 +++++++++++++++++++++++++++++++++++------
>  2 files changed, 136 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index abe530748de6..8b5757735428 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -385,6 +385,7 @@ enum zone_watermarks {
>  
>  /* Fields and list protected by pagesets local_lock in page_alloc.c */
>  struct per_cpu_pages {
> +	spinlock_t lock;	/* Protects lists field */
>  	int count;		/* number of pages in the list */
>  	int high;		/* high watermark, emptying needed */
>  	int batch;		/* chunk size for buddy add/remove */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dc0fdeb3795c..813c84b67c65 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -132,6 +132,17 @@ static DEFINE_PER_CPU(struct pagesets, pagesets) __maybe_unused = {
>  	.lock = INIT_LOCAL_LOCK(lock),
>  };
>  
> +#ifdef CONFIG_SMP
> +/* On SMP, spin_trylock is sufficient protection */
> +#define pcp_trylock_prepare(flags)	do { } while (0)
> +#define pcp_trylock_finish(flag)	do { } while (0)
> +#else
> +
> +/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */

This is only needed on non-RT kernels. RT UP builds go through
rt_spin_trylock(), which behaves like its SMP counterpart.

> +#define pcp_trylock_prepare(flags)	local_irq_save(flags)
> +#define pcp_trylock_finish(flags)	local_irq_restore(flags)
> +#endif
> +
>  #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
>  DEFINE_PER_CPU(int, numa_node);
>  EXPORT_PER_CPU_SYMBOL(numa_node);
> @@ -3082,15 +3093,22 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>   */
>  void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
>  {
> -	unsigned long flags;
>  	int to_drain, batch;
>  
> -	local_lock_irqsave(&pagesets.lock, flags);
>  	batch = READ_ONCE(pcp->batch);
>  	to_drain = min(pcp->count, batch);
> -	if (to_drain > 0)
> +	if (to_drain > 0) {
> +		unsigned long flags;
> +
> +		/* free_pcppages_bulk expects IRQs disabled for zone->lock */
> +		local_irq_save(flags);

Why dropping the local_lock? That approach is nicer to RT builds, and I don't
think it makes a difference from a non-RT perspective.

That said, IIUC, this will eventually disappear with subsequent patches, right?

> +
> +		spin_lock(&pcp->lock);
>  		free_pcppages_bulk(zone, to_drain, pcp, 0);
> -	local_unlock_irqrestore(&pagesets.lock, flags);
> +		spin_unlock(&pcp->lock);
> +
> +		local_irq_restore(flags);
> +	}
>  }
>  #endif
>  

[...]

> @@ -3668,9 +3748,30 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  			int migratetype,
>  			unsigned int alloc_flags,
>  			struct per_cpu_pages *pcp,
> -			struct list_head *list)
> +			struct list_head *list,
> +			bool locked)
>  {
>  	struct page *page;
> +	unsigned long __maybe_unused UP_flags;
> +
> +	/*
> +	 * spin_trylock is not necessary right now due to due to
> +	 * local_lock_irqsave and is a preparation step for
> +	 * a conversion to local_lock using the trylock to prevent
> +	 * IRQ re-entrancy. If pcp->lock cannot be acquired, the caller
> +	 * uses rmqueue_buddy.
> +	 *
> +	 * TODO: Convert local_lock_irqsave to local_lock. Care
> +	 * 	 is needed as the type of local_lock would need a
> +	 * 	 PREEMPT_RT version due to threaded IRQs.
> +	 */

I missed this in our previous conversation, but as far as RT is concerned
local_lock_irqsave() is the same as local_lock(), see in local_lock_internal.h:

#define __local_lock_irqsave(lock, flags)			\
	do {							\
		typecheck(unsigned long, flags);		\
		flags = 0;					\
		__local_lock(lock);				\
	} while (0)

Something similar happens with spin_lock_irqsave(), see in spinlock_rt.h:

#define spin_lock_irqsave(lock, flags)			 \
	do {						 \
		typecheck(unsigned long, flags);	 \
		flags = 0;				 \
		spin_lock(lock);			 \
	} while (0)

IIUC, RT will run this code paths in threaded IRQ context, where we can think
of RT spinlocks as mutexes (plus some extra priority inheritance magic).

Regards,

-- 
Nicolás Sáenz