linux-kernel - Re: [PATCH] mm: reduce lock contention of pcp buffer refill

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8dc0a48f-f29a-27de-0ced-d39a920b2afd@suse.cz>
Date:   Wed, 8 Feb 2023 16:11:28 +0100
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Alexander Halbuer <halbuer@....uni-hannover.de>,
        Andrew Morton <akpm@...ux-foundation.org>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [PATCH] mm: reduce lock contention of pcp buffer refill

On 2/7/23 17:11, Alexander Halbuer wrote:
> On 2/3/23 00:25, Andrew Morton wrote:
>> On Wed,  1 Feb 2023 17:25:49 +0100 Alexander Halbuer <halbuer@....uni-hannover.de> wrote:
>>
>>> The `rmqueue_bulk` function batches the allocation of multiple elements to
>>> refill the per-CPU buffers into a single hold of the zone lock. Each
>>> element is allocated and checked using the `check_pcp_refill` function.
>>> The check touches every related struct page which is especially expensive
>>> for higher order allocations (huge pages). This patch reduces the time
>>> holding the lock by moving the check out of the critical section similar
>>> to the `rmqueue_buddy` function which allocates a single element.
>>> Measurements of parallel allocation-heavy workloads show a reduction of
>>> the average huge page allocation latency of 50 percent for two cores and
>>> nearly 90 percent for 24 cores.
>> Sounds nice.
>>
>> Were you able to test how much benefit we get by simply removing the
>> check_new_pages() call from rmqueue_bulk()?
> I did some further investigations and measurements to quantify potential
> performance gains. Benchmarks ran on a machine with 24 physical cores
> fixed at 2.1 GHz. The results show significant performance gains with the
> patch applied in parallel scenarios. Eliminating the check reduces
> allocation latency further, especially for low core counts.

Are the benchmarks done via kernel module calling bulk allocation, or from
userspace?

> The following tables show the average allocation latencies for huge pages
> and normal pages for three different configurations:
> The unpatched kernel, the patched kernel, and an additional version
> without the check.
> 
> Huge pages
> +-------+---------+-------+----------+---------+----------+
> | Cores | Default | Patch |    Patch | NoCheck |  NoCheck |
> |       |    (ns) |  (ns) |     Diff |    (ns) |     Diff |
> +-------+---------+-------+----------+---------+----------+
> |     1 |     127 |   124 |  (-2.4%) |     118 |  (-7.1%) |
> |     2 |     140 |   140 |  (-0.0%) |     134 |  (-4.3%) |
> |     3 |     143 |   142 |  (-0.7%) |     134 |  (-6.3%) |
> |     4 |     178 |   159 | (-10.7%) |     156 | (-12.4%) |
> |     6 |     269 |   239 | (-11.2%) |     237 | (-11.9%) |
> |     8 |     363 |   321 | (-11.6%) |     319 | (-12.1%) |
> |    10 |     454 |   409 |  (-9.9%) |     409 |  (-9.9%) |
> |    12 |     545 |   494 |  (-9.4%) |     488 | (-10.5%) |
> |    14 |     639 |   578 |  (-9.5%) |     574 | (-10.2%) |
> |    16 |     735 |   660 | (-10.2%) |     653 | (-11.2%) |
> |    20 |     915 |   826 |  (-9.7%) |     815 | (-10.9%) |
> |    24 |    1105 |   992 | (-10.2%) |     982 | (-11.1%) |
> +-------+---------+-------+----------+---------+----------+
> 
> Normal pages
> +-------+---------+-------+----------+---------+----------+
> | Cores | Default | Patch |    Patch | NoCheck |  NoCheck |
> |       |    (ns) |  (ns) |     Diff |    (ns) |     Diff |
> +-------+---------+-------+----------+---------+----------+
> |     1 |    2790 |  2767 |  (-0.8%) |     171 | (-93.9%) |
> |     2 |    6685 |  3484 | (-47.9%) |     519 | (-92.2%) |
> |     3 |   10501 |  3599 | (-65.7%) |     855 | (-91.9%) |
> |     4 |   14264 |  3635 | (-74.5%) |    1139 | (-92.0%) |
> |     6 |   21800 |  3551 | (-83.7%) |    1713 | (-92.1%) |
> |     8 |   29563 |  3570 | (-87.9%) |    2268 | (-92.3%) |
> |    10 |   37210 |  3845 | (-89.7%) |    2872 | (-92.3%) |
> |    12 |   44780 |  4452 | (-90.1%) |    3417 | (-92.4%) |
> |    14 |   52576 |  5100 | (-90.3%) |    4020 | (-92.4%) |
> |    16 |   60118 |  5785 | (-90.4%) |    4604 | (-92.3%) |
> |    20 |   75037 |  7270 | (-90.3%) |    6486 | (-91.4%) |
> |    24 |   90226 |  8712 | (-90.3%) |    7061 | (-92.2%) |
> +-------+---------+-------+----------+---------+----------+

Nice improvements. I assume the table headings are switched? Why would
Normal be more epensive than Huge?

>>
>> Vlastimil, I find this quite confusing:
>>
>> #ifdef CONFIG_DEBUG_VM
>> /*
>>  * With DEBUG_VM enabled, order-0 pages are checked for expected state when
>>  * being allocated from pcp lists. With debug_pagealloc also enabled, they are
>>  * also checked when pcp lists are refilled from the free lists.
>>  */
>> static inline bool check_pcp_refill(struct page *page, unsigned int order)
>> {
>> 	if (debug_pagealloc_enabled_static())
>> 		return check_new_pages(page, order);
>> 	else
>> 		return false;
>> }
>>
>> static inline bool check_new_pcp(struct page *page, unsigned int order)
>> {
>> 	return check_new_pages(page, order);
>> }
>> #else
>> /*
>>  * With DEBUG_VM disabled, free order-0 pages are checked for expected state
>>  * when pcp lists are being refilled from the free lists. With debug_pagealloc
>>  * enabled, they are also checked when being allocated from the pcp lists.
>>  */
>> static inline bool check_pcp_refill(struct page *page, unsigned int order)
>> {
>> 	return check_new_pages(page, order);
>> }
>> static inline bool check_new_pcp(struct page *page, unsigned int order)
>> {
>> 	if (debug_pagealloc_enabled_static())
>> 		return check_new_pages(page, order);
>> 	else
>> 		return false;
>> }
>> #endif /* CONFIG_DEBUG_VM */
>>
>> and the 4462b32c9285b5 changelog is a struggle to follow.
>>
>> Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
>> debug_pagealloc_enabled is false?
>>
>> Anyway, these checks sounds quite costly so let's revisit their
>> desirability?
>>