Message-ID: <20250604075830.27751-1-lizhe.67@bytedance.com>
Date: Wed, 4 Jun 2025 15:58:30 +0800
From: lizhe.67@...edance.com
To: akpm@...ux-foundation.org
Cc: david@...hat.com,
dev.jain@....com,
jgg@...pe.ca,
jhubbard@...dia.com,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
lizhe.67@...edance.com,
muchun.song@...ux.dev,
peterx@...hat.com
Subject: Re: [PATCH v2] gup: optimize longterm pin_user_pages() for large folio
On Tue, 3 Jun 2025 20:44:14 -0700, akpm@...ux-foundation.org wrote:
> On Wed, 4 Jun 2025 11:15:36 +0800 lizhe.67@...edance.com wrote:
>
> > From: Li Zhe <lizhe.67@...edance.com>
> >
> > In the current implementation of the longterm pin_user_pages() function,
> > we invoke collect_longterm_unpinnable_folios(). This function iterates
> > through the list to check whether each folio belongs to the
> > "longterm_unpinnable" category. The folios in this list essentially
> > correspond to a contiguous region of user-space addresses, with each folio
> > representing a physical address at increments of PAGE_SIZE. If this
> > user-space address range is mapped with a large folio, we can optimize the
> > performance of pin_user_pages() by reducing the number of memory accesses
> > performed via READ_ONCE(). This patch leverages this approach to achieve
> > a performance improvement.
> >
> > The performance test results obtained with the gup_test tool from the
> > kernel source tree are as follows. We achieve an improvement of over 70%
> > for large folios with pagesize=2M. For normal pages, we observed only a
> > very slight performance degradation.
> >
> > Without this patch:
> >
> > [root@...alhost ~] ./gup_test -HL -m 8192 -n 512
> > TAP version 13
> > 1..1
> > # PIN_LONGTERM_BENCHMARK: Time: get:13623 put:10799 us#
> > ok 1 ioctl status 0
> > # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> > [root@...alhost ~]# ./gup_test -LT -m 8192 -n 512
> > TAP version 13
> > 1..1
> > # PIN_LONGTERM_BENCHMARK: Time: get:129733 put:31753 us#
> > ok 1 ioctl status 0
> > # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> >
> > With this patch:
> >
> > [root@...alhost ~] ./gup_test -HL -m 8192 -n 512
> > TAP version 13
> > 1..1
> > # PIN_LONGTERM_BENCHMARK: Time: get:4075 put:10792 us#
> > ok 1 ioctl status 0
> > # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> > [root@...alhost ~]# ./gup_test -LT -m 8192 -n 512
> > TAP version 13
> > 1..1
> > # PIN_LONGTERM_BENCHMARK: Time: get:130727 put:31763 us#
> > ok 1 ioctl status 0
> > # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> I see no READ_ONCE()s in the patch and I had to go off and read the v1
> review to discover that the READ_ONCE is invoked in
> page_folio()->_compound_head(). Please help us out by including such
> details in the changelogs.
Sorry for the inconvenience. I will refine the wording of this part in
the next version.
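
For reference, the READ_ONCE() in question sits in _compound_head(), which
page_folio() expands to. Roughly, as a simplified sketch of the header
definition (omitting the hugetlb vmemmap fake-head handling):

	static __always_inline unsigned long _compound_head(const struct page *page)
	{
		unsigned long head = READ_ONCE(page->compound_head);

		/* Tail pages store the head page pointer with the low bit set. */
		if (unlikely(head & 1))
			return head - 1;
		return (unsigned long)page;
	}

So every pofs_get_folio() call re-reads page->compound_head through
READ_ONCE().
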
> Is it credible that a humble READ_ONCE could yield a 3x improvement in
> one case? Why would this happen?
Sorry for the incomplete description. I believe this optimization is the
result of multiple factors working together. In addition to reducing the
number of READ_ONCE() calls, when dealing with a large folio we replace
the check that invokes pofs_get_folio() and compares the result with
prev_folio with a simpler test of whether the next page still lies within
the current folio. This reduces the number of branches and increases the
cache hit rate. The overall effect is a combination of these
optimizations. I will incorporate these details into the commit message
in the next version.
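
To illustrate the direction, a rough sketch of the loop in
collect_longterm_unpinnable_folios() (not the exact v2 diff;
pofs_get_folio() and the prev_folio comparison are from the existing
code, and the skip logic below assumes the pinned range starts at a
folio boundary):

	for (i = 0; i < pofs->nr_entries; i++) {
		struct folio *folio = pofs_get_folio(pofs, i);	/* READ_ONCE() inside */

		if (folio == prev_folio)
			continue;
		prev_folio = folio;

		/*
		 * For a large folio, the following entries map the same
		 * folio, so skip them without re-reading compound_head
		 * and without re-running the checks below.
		 */
		if (folio_test_large(folio))
			i += folio_nr_pages(folio) - 1;

		/* longterm-unpinnable checks on @folio ... */
	}
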
Thanks,
Zhe