Message-ID: <ce4214ef-706f-46b9-a88a-463fe0afe56b@huawei.com>
Date: Fri, 13 Dec 2024 20:09:37 +0800
From: Yunsheng Lin <linyunsheng@...wei.com>
To: Alexander Duyck <alexander.duyck@...il.com>
CC: <davem@...emloft.net>, <kuba@...nel.org>, <pabeni@...hat.com>,
	<netdev@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Shuah Khan
	<skhan@...uxfoundation.org>, Andrew Morton <akpm@...ux-foundation.org>,
	Linux-MM <linux-mm@...ck.org>
Subject: Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache
 (Part-2)

On 2024/12/11 20:52, Yunsheng Lin wrote:
> It seems that the bottleneck is still the freeing side, so the above
> result might not be as meaningful as it should be.

Annotating with 'perf top', there seems to be about 70%+ CPU usage
for the atomic operation in put_page_testzero() called from page_frag_free();
it was unexpected that the atomic operation had that much overhead :(
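
For reference, put_page_testzero() boils down to an atomic
decrement-and-test on the page refcount, so every freed fragment pays
for one atomic RMW on a potentially contended cacheline. A simplified
sketch of the include/linux/mm.h definition (the real one goes through
page_ref_dec_and_test() and has debug checks, details vary by kernel
version):

	/* Simplified; the atomic dec-and-test on page->_refcount is
	 * the operation dominating the 'perf top' profile above.
	 */
	static inline int put_page_testzero(struct page *page)
	{
		return atomic_dec_and_test(&page->_refcount);
	}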

> 
> As we can't use more than one CPU for the freeing side without some
> locking when using a single ptr_ring, it seems something more
> complicated might need to be done in order to support more than one
> CPU for the freeing side?
> 
> Before patch 1, __page_frag_alloc_align() took up to 3.62% of
> CPU using 'perf top'.
> After patch 1, __page_frag_cache_prepare() and __page_frag_cache_commit_noref()
> took up to 4.67% + 1.01% = 5.68%.
> Having a similar benchmark result, I am not sure if the CPU usage is
> able to tell us about the performance degradation here, as it seems
> to be quite large?
> 
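
On the quoted question about the freeing side: one possible direction,
sketched below purely as an assumption and not as part of this series,
is to give each freeing CPU its own ptr_ring so that consumers never
share a ring (NR_FREE_CPUS, free_rings and the helper names here are
all made up for illustration):

	#include <linux/gfp.h>
	#include <linux/ptr_ring.h>

	#define NR_FREE_CPUS	4	/* illustrative, not from the series */

	static struct ptr_ring free_rings[NR_FREE_CPUS];

	static int free_rings_init(void)
	{
		int i, err;

		for (i = 0; i < NR_FREE_CPUS; i++) {
			err = ptr_ring_init(&free_rings[i], 512, GFP_KERNEL);
			if (err) {
				while (i--)
					ptr_ring_cleanup(&free_rings[i], NULL);
				return err;
			}
		}

		return 0;
	}

	/* Producers pick a ring, e.g. by hash or round-robin; each
	 * ring's internal producer lock is still taken, but consumers
	 * no longer contend with each other across CPUs.
	 */
	static int free_ring_produce(void *va, unsigned int idx)
	{
		return ptr_ring_produce(&free_rings[idx % NR_FREE_CPUS], va);
	}

	/* Each freeing CPU drains only its own ring. */
	static void free_ring_drain(unsigned int cpu_idx)
	{
		void *va;

		while ((va = ptr_ring_consume(&free_rings[cpu_idx])))
			page_frag_free(va);
	}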

And using 'struct page_frag' to pass the parameters seems to cause some
observable overhead, as the testing is very low level. The performance
difference seems to be negligible when using the below patch to avoid
passing 'struct page_frag': 3.62% and 3.27% CPU usage for
__page_frag_alloc_align() before patch 1 and __page_frag_cache_prepare()
after patch 1, respectively.

The new refactoring avoids some overhead for the old API, but might
cause some overhead for the new API, as it is not able to skip the
virt_to_page() for the refilling and reusing case, though that seems to
be an unlikely case. Or is there any better idea of how to do the
refactoring to unify the page_frag API?

diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index 41a91df82631..b83e7655654e 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -39,8 +39,24 @@ static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)

 void page_frag_cache_drain(struct page_frag_cache *nc);
 void __page_frag_cache_drain(struct page *page, unsigned int count);
-void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
-			      gfp_t gfp_mask, unsigned int align_mask);
+void *__page_frag_cache_prepare(struct page_frag_cache *nc, unsigned int fragsz,
+				gfp_t gfp_mask, unsigned int align_mask);
+
+static inline void *__page_frag_alloc_align(struct page_frag_cache *nc,
+					    unsigned int fragsz, gfp_t gfp_mask,
+					    unsigned int align_mask)
+{
+	void *va;
+
+	va = __page_frag_cache_prepare(nc, fragsz, gfp_mask, align_mask);
+	if (likely(va)) {
+		va += nc->offset;
+		nc->offset += fragsz;
+		nc->pagecnt_bias--;
+	}
+
+	return va;
+}

 static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
 					  unsigned int fragsz, gfp_t gfp_mask,
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index 3f7a203d35c6..729309aee27a 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -90,9 +90,9 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);

-void *__page_frag_alloc_align(struct page_frag_cache *nc,
-			      unsigned int fragsz, gfp_t gfp_mask,
-			      unsigned int align_mask)
+void *__page_frag_cache_prepare(struct page_frag_cache *nc,
+				unsigned int fragsz, gfp_t gfp_mask,
+				unsigned int align_mask)
 {
 	unsigned long encoded_page = nc->encoded_page;
 	unsigned int size, offset;
@@ -151,12 +151,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
 		offset = 0;
 	}

-	nc->pagecnt_bias--;
-	nc->offset = offset + fragsz;
-
-	return encoded_page_decode_virt(encoded_page) + offset;
+	nc->offset = offset;
+	return encoded_page_decode_virt(encoded_page);
 }
-EXPORT_SYMBOL(__page_frag_alloc_align);
+EXPORT_SYMBOL(__page_frag_cache_prepare);

 /*
  * Frees a page fragment allocated out of either a compound or order 0 page.
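
For completeness, old-API callers should be unaffected by the above:
the inline wrapper does the commit (offset bump and pagecnt_bias
decrement) on their behalf. A minimal usage sketch, assuming an already
initialized cache:

	/* ~0u as align_mask means no extra alignment, matching what
	 * page_frag_alloc() passes today.
	 */
	static void *alloc_frag_example(struct page_frag_cache *nc)
	{
		return __page_frag_alloc_align(nc, 128, GFP_ATOMIC, ~0u);
	}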
