linux-kernel - Re: [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page allocations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <f9acd363-bf53-c582-78ec-347fd7ec5c37@suse.cz>
Date:   Fri, 18 Nov 2022 15:30:57 +0100
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Mel Gorman <mgorman@...hsingularity.net>,
        Andrew Morton <akpm@...ux-foundation.org>
Cc:     Hugh Dickins <hughd@...gle.com>, Yu Zhao <yuzhao@...gle.com>,
        Marcelo Tosatti <mtosatti@...hat.com>,
        Michal Hocko <mhocko@...nel.org>,
        Marek Szyprowski <m.szyprowski@...sung.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH 2/2] mm/page_alloc: Leave IRQs enabled for per-cpu page
 allocations

On 11/18/22 11:17, Mel Gorman wrote:
> The pcp_spin_lock_irqsave protecting the PCP lists is IRQ-safe as a task
> allocating from the PCP must not re-enter the allocator from IRQ context.
> In each instance where IRQ-reentrancy is possible, the lock is acquired
> using pcp_spin_trylock_irqsave() even though IRQs are disabled and
> re-entrancy is impossible.
> 
> Demote the lock to pcp_spin_lock avoids an IRQ disable/enable in the common
> case at the cost of some IRQ allocations taking a slower path. If the PCP
> lists need to be refilled, the zone lock still needs to disable IRQs but
> that will only happen on PCP refill and drain. If an IRQ is raised when
> a PCP allocation is in progress, the trylock will fail and fallback to
> using the buddy lists directly. Note that this may not be a universal win
> if an interrupt-intensive workload also allocates heavily from interrupt
> context and contends heavily on the zone->lock as a result.
> 
> [yuzhao@...gle.com: Reported lockdep issue on IO completion from softirq]
> [hughd@...gle.com: Fix list corruption, lock improvements, micro-optimsations]
> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>

Reviewed-by: Vlastimil Babka <vbabka@...e.cz>

Some nits below:

> @@ -3516,10 +3485,10 @@ void free_unref_page(struct page *page, unsigned int order)
>   */
>  void free_unref_page_list(struct list_head *list)
>  {
> +	unsigned long __maybe_unused UP_flags;
>  	struct page *page, *next;
>  	struct per_cpu_pages *pcp = NULL;
>  	struct zone *locked_zone = NULL;
> -	unsigned long flags;
>  	int batch_count = 0;
>  	int migratetype;
>  
> @@ -3550,11 +3519,26 @@ void free_unref_page_list(struct list_head *list)
>  
>  		/* Different zone, different pcp lock. */
>  		if (zone != locked_zone) {
> -			if (pcp)
> -				pcp_spin_unlock_irqrestore(pcp, flags);
> +			if (pcp) {
> +				pcp_spin_unlock(pcp);
> +				pcp_trylock_finish(UP_flags);
> +			}
>  
> +			/*
> +			 * trylock is necessary as pages may be getting freed
> +			 * from IRQ or SoftIRQ context after an IO completion.
> +			 */
> +			pcp_trylock_prepare(UP_flags);
> +			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
> +			if (!pcp) {

Perhaps use unlikely() here?

> +				pcp_trylock_finish(UP_flags);
> +				free_one_page(zone, page, page_to_pfn(page),
> +					      0, migratetype, FPI_NONE);

Not critical for correctness, but the migratepage here might be stale and we
should do get_pcppage_migratetype(page);

> +				locked_zone = NULL;
> +				continue;
> +			}
>  			locked_zone = zone;
> -			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
> +			batch_count = 0;
>  		}
>  
>  		/*
> @@ -3569,18 +3553,23 @@ void free_unref_page_list(struct list_head *list)
>  		free_unref_page_commit(zone, pcp, page, migratetype, 0);
>  
>  		/*
> -		 * Guard against excessive IRQ disabled times when we get
> -		 * a large list of pages to free.
> +		 * Guard against excessive lock hold times when freeing
> +		 * a large list of pages. Lock will be reacquired if
> +		 * necessary on the next iteration.
>  		 */
>  		if (++batch_count == SWAP_CLUSTER_MAX) {
> -			pcp_spin_unlock_irqrestore(pcp, flags);
> +			pcp_spin_unlock(pcp);
> +			pcp_trylock_finish(UP_flags);
>  			batch_count = 0;
> -			pcp = pcp_spin_lock_irqsave(locked_zone->per_cpu_pageset, flags);
> +			pcp = NULL;
> +			locked_zone = NULL;

AFAICS if this block was just "locked_zone = NULL;" then the existing code
would do the right thing.
Or maybe to have simpler code, just do batch_count++ here and
make the relocking check do
if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX)

>  		}
>  	}
>  
> -	if (pcp)
> -		pcp_spin_unlock_irqrestore(pcp, flags);
> +	if (pcp) {
> +		pcp_spin_unlock(pcp);
> +		pcp_trylock_finish(UP_flags);
> +	}
>  }
>  
>  /*