Message-ID: <20250604075830.27751-1-lizhe.67@bytedance.com>
Date: Wed, 4 Jun 2025 15:58:30 +0800
From: lizhe.67@...edance.com
To: akpm@...ux-foundation.org
Cc: david@...hat.com,
dev.jain@....com,
jgg@...pe.ca,
jhubbard@...dia.com,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
lizhe.67@...edance.com,
muchun.song@...ux.dev,
peterx@...hat.com
Subject: Re: [PATCH v2] gup: optimize longterm pin_user_pages() for large folio
On Tue, 3 Jun 2025 20:44:14 -0700, akpm@...ux-foundation.org wrote:
> On Wed, 4 Jun 2025 11:15:36 +0800 lizhe.67@...edance.com wrote:
>
> > From: Li Zhe <lizhe.67@...edance.com>
> >
> > In the current implementation of the longterm pin_user_pages() function,
> > we invoke collect_longterm_unpinnable_folios(). This function iterates
> > through the list to check whether each folio belongs to the
> > "longterm_unpinnable" category. The folios in this list essentially
> > correspond to a contiguous region of user-space addresses, with each folio
> > representing a physical address at increments of PAGE_SIZE. If this
> > user-space address range is mapped with a large folio, we can optimize the
> > performance of pin_user_pages() by reducing the number of memory accesses
> > performed via READ_ONCE(). This patch leverages this approach to achieve
> > a performance improvement.
> >
> > The performance test results obtained with the gup_test tool from the
> > kernel source tree are as follows. We achieve an improvement of over 70%
> > for large folios with pagesize=2M. For normal pages, we observed only a
> > very slight performance degradation.
> >
> > Without this patch:
> >
> > [root@...alhost ~] ./gup_test -HL -m 8192 -n 512
> > TAP version 13
> > 1..1
> > # PIN_LONGTERM_BENCHMARK: Time: get:13623 put:10799 us#
> > ok 1 ioctl status 0
> > # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> > [root@...alhost ~]# ./gup_test -LT -m 8192 -n 512
> > TAP version 13
> > 1..1
> > # PIN_LONGTERM_BENCHMARK: Time: get:129733 put:31753 us#
> > ok 1 ioctl status 0
> > # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> >
> > With this patch:
> >
> > [root@...alhost ~] ./gup_test -HL -m 8192 -n 512
> > TAP version 13
> > 1..1
> > # PIN_LONGTERM_BENCHMARK: Time: get:4075 put:10792 us#
> > ok 1 ioctl status 0
> > # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
> > [root@...alhost ~]# ./gup_test -LT -m 8192 -n 512
> > TAP version 13
> > 1..1
> > # PIN_LONGTERM_BENCHMARK: Time: get:130727 put:31763 us#
> > ok 1 ioctl status 0
> > # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> I see no READ_ONCE()s in the patch and I had to go off and read the v1
> review to discover that the READ_ONCE is invoked in
> page_folio()->_compound_head(). Please help us out by including such
> details in the changelogs.
Sorry for the inconvenience. I will refine the wording of this part in
the next version.
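
For reference, the READ_ONCE() in question sits in _compound_head(), which
page_folio() expands to. Roughly, as a simplified sketch of the header
definition (omitting the hugetlb vmemmap fake-head handling):

	static __always_inline unsigned long _compound_head(const struct page *page)
	{
		unsigned long head = READ_ONCE(page->compound_head);

		/* Tail pages store the head page pointer with the low bit set. */
		if (unlikely(head & 1))
			return head - 1;
		return (unsigned long)page;
	}

So every pofs_get_folio() call re-reads page->compound_head through
READ_ONCE().
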
> Is it credible that a humble READ_ONCE could yield a 3x improvement in
> one case? Why would this happen?
Sorry for the incomplete description. I believe this optimization is the
result of multiple factors working together. In addition to reducing the
number of READ_ONCE() calls, when dealing with a large folio we replace
the check that invokes pofs_get_folio() and compares the result with
prev_folio with a simpler test of whether the next page still lies within
the current folio. This reduces the number of branches and increases the
cache hit rate. The overall effect is a combination of these
optimizations. I will incorporate these details into the commit message
in the next version.
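
To illustrate the direction, a rough sketch of the loop in
collect_longterm_unpinnable_folios() (not the exact v2 diff;
pofs_get_folio() and the prev_folio comparison are from the existing
code, and the skip logic below assumes the pinned range starts at a
folio boundary):

	for (i = 0; i < pofs->nr_entries; i++) {
		struct folio *folio = pofs_get_folio(pofs, i);	/* READ_ONCE() inside */

		if (folio == prev_folio)
			continue;
		prev_folio = folio;

		/*
		 * For a large folio, the following entries map the same
		 * folio, so skip them without re-reading compound_head
		 * and without re-running the checks below.
		 */
		if (folio_test_large(folio))
			i += folio_nr_pages(folio) - 1;

		/* longterm-unpinnable checks on @folio ... */
	}
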
Thanks,
Zhe