linux-kernel - Re: [PATCH 6/6] mm: page_alloc: consolidate free page accounting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <37dbd4d0-c125-6694-dec4-6322ae5b6dee@suse.cz>
Date:   Wed, 13 Sep 2023 22:18:17 +0200
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Johannes Weiner <hannes@...xchg.org>,
        Andrew Morton <akpm@...ux-foundation.org>
Cc:     Mel Gorman <mgorman@...hsingularity.net>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Kefeng Wang <wangkefeng.wang@...wei.com>,
        Zi Yan <ziy@...dia.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 6/6] mm: page_alloc: consolidate free page accounting

On 9/11/23 21:41, Johannes Weiner wrote:
> Free page accounting currently happens a bit too high up the call
> stack, where it has to deal with guard pages, compaction capturing,
> block stealing and even page isolation. This is subtle and fragile,
> and makes it difficult to hack on the code.
> 
> Now that type violations on the freelists have been fixed, push the
> accounting down to where pages enter and leave the freelist.
> 
> v3:
> - fix CONFIG_UNACCEPTED_MEMORY build (lkp)
> v2:
> - fix CONFIG_DEBUG_PAGEALLOC build (Mel)
> 
> Signed-off-by: Johannes Weiner <hannes@...xchg.org>

<snip>

>  
>  /* Used for pages not on another list */
> -static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
> -					 unsigned int order, int migratetype)
> +static inline void add_to_free_list(struct page *page, struct zone *zone,
> +				    unsigned int order, int migratetype,
> +				    bool tail)
>  {
>  	struct free_area *area = &zone->free_area[order];
>  
> -	list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
> +	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
> +		     "page type is %lu, passed migratetype is %d (nr=%d)\n",
> +		     get_pageblock_migratetype(page), migratetype, 1 << order);

Ok, IIUC so you now assume pageblock migratetype is now matching freelist
placement at all times. This is a change from the previous treatment as a
heuristic that may be sometimes imprecise. Let's assume the previous patches
handled the deterministic reasons why those would deviate (modulo my concern
about pageblocks spanning multiple zones in reply to 5/6).

But unless I'm missing something, I don't think the possible race scenarios
were dealt with? Pageblock migratetype is set under zone->lock but there are
places that read it outside of zone->lock and then trust it to perform the
freelist placement. See for example __free_pages_ok(), or free_unref_page()
in the cases it calls free_one_page(). These determine pageblock migratetype
before taking the zone->lock. Only for has_isolate_pageblock() cases we are
more careful, because previously isolation was the only case where precision
was needed. So I think this warning is going to trigger?

> +
> +	if (tail)
> +		list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
> +	else
> +		list_add(&page->buddy_list, &area->free_list[migratetype]);
>  	area->nr_free++;
> +
> +	account_freepages(page, zone, 1 << order, migratetype);
>  }
>  
>  /*

<snip>

> @@ -757,23 +783,21 @@ static inline void __free_one_page(struct page *page,
>  	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
>  
>  	VM_BUG_ON(migratetype == -1);
> -	if (likely(!is_migrate_isolate(migratetype)))
> -		__mod_zone_freepage_state(zone, 1 << order, migratetype);
> -
>  	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>  
>  	while (order < MAX_ORDER) {
> -		if (compaction_capture(capc, page, order, migratetype)) {
> -			__mod_zone_freepage_state(zone, -(1 << order),
> -								migratetype);
> +		int buddy_mt;
> +
> +		if (compaction_capture(capc, page, order, migratetype))
>  			return;
> -		}
>  
>  		buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn);
>  		if (!buddy)
>  			goto done_merging;
>  
> +		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);

You should assume buddy_mt equals migratetype, no? It's the same assumption
as the VM_WARN_ONCE() I've discussed?

> +
>  		if (unlikely(order >= pageblock_order)) {

Only here buddy_mt can differ and the code in this block already handles that.

>  			/*
>  			 * We want to prevent merge between freepages on pageblock
> @@ -801,9 +825,9 @@ static inline void __free_one_page(struct page *page,
>  		 * merge with it and move up one order.
>  		 */
>  		if (page_is_guard(buddy))
> -			clear_page_guard(zone, buddy, order, migratetype);
> +			clear_page_guard(zone, buddy, order);
>  		else
> -			del_page_from_free_list(buddy, zone, order);
> +			del_page_from_free_list(buddy, zone, order, buddy_mt);

Ugh so this will add account_freepages() call to each iteration of the
__free_one_page() hot loop, which seems like a lot of unnecessary overhead -
as long as we are within pageblock_order the migratetype should be the same,
and thus also is_migrate_isolate() and is_migrate_cma() tests should return
the same value so we shouldn't need to call __mod_zone_page_state()
piecemeal like this.

>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;