Message-ID: <871pw33695.fsf@DESKTOP-5N7EMDA>
Date: Wed, 12 Feb 2025 16:40:54 +0800
From: "Huang, Ying" <ying.huang@...ux.alibaba.com>
To: Nikhil Dhama <nikhil.dhama@....com>
Cc: <akpm@...ux-foundation.org>,  <bharata@....com>,
  <huang.ying.caritas@...il.com>,  <linux-kernel@...r.kernel.org>,
  <linux-mm@...ck.org>,  <mgorman@...hsingularity.net>,
  <raghavendra.kodsarathimmappa@....com>
Subject: Re: [FIX PATCH] mm: pcp: fix pcp->free_count reduction on page
 allocation

Nikhil Dhama <nikhil.dhama@....com> writes:

> On 1/29/2025 10:01 AM, Andrew Morton wrote:
>>
>> On Wed, 15 Jan 2025 19:19:02 +0800 "Huang, Ying" <ying.huang@...ux.alibaba.com> wrote:
>>
>>> Andrew Morton <akpm@...ux-foundation.org> writes:
>>>
>>>> On Tue, 7 Jan 2025 14:47:24 +0530 Nikhil Dhama <nikhil.dhama@....com> wrote:
>>>>
>>>>> In the current PCP auto-tuning design, free_count was introduced to
>>>>> track consecutive page freeing with a counter. This counter is
>>>>> incremented by the exact number of pages that are freed, but reduced
>>>>> by half on allocation. This causes the network bandwidth of a 2-node
>>>>> iperf3 client-to-server test to drop by 30% when the number of
>>>>> client-server pairs is scaled from 32 (where peak network bandwidth
>>>>> was achieved) to 64.
>>>>>
>>>>> To fix this issue, on allocation, reduce free_count by the exact number
>>>>> of pages that are allocated instead of halving it.
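As a standalone illustration of the two free_count update policies being
compared here (this is not the actual mm/page_alloc.c code; the struct
and helper names below are made up purely for illustration):

/* Minimal model: only the counter relevant to this discussion. */
struct pcp_model {
	long free_count;	/* tracks consecutive page freeing */
};

/* Freeing path (same in both variants): count the pages freed. */
static void model_free(struct pcp_model *pcp, unsigned int order)
{
	pcp->free_count += 1 << order;
}

/* Allocation path, current behaviour: halve the counter. */
static void model_alloc_halve(struct pcp_model *pcp)
{
	pcp->free_count >>= 1;
}

/* Allocation path, proposed behaviour: subtract the pages allocated. */
static void model_alloc_exact(struct pcp_model *pcp, unsigned int order)
{
	if (pcp->free_count > (1 << order))
		pcp->free_count -= 1 << order;
	else
		pcp->free_count = 0;
}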
>>>> The present division by two appears to be somewhat randomly chosen.
>>>> And as far as I can tell, this patch proposes replacing that with
>>>> another somewhat random adjustment.
>>>>
>>>> What's the actual design here?  What are we attempting to do and why,
>>>> and why is the proposed design superior to the present one?
>>> Cc Mel for the original design.
>>>
>>> IIUC, pcp->free_count is used to identify a pattern of consecutive,
>>> pure, large-scale page freeing.  For that pattern, a larger batch will
>>> be used to free pages from the PCP to the buddy allocator to improve
>>> performance.  A mixed free/allocation pattern should not make
>>> pcp->free_count large, even if the number of pages freed is much
>>> larger than the number of pages allocated in the long run.  So,
>>> pcp->free_count decreases rapidly on page allocation.
>>>
>>> Hi, Mel, please correct me if my understanding isn't correct.
>>>
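To make the tuning idea quoted above concrete: the longer a pure freeing
streak runs (as tracked by free_count), the larger the batch used to
flush pages from the PCP list back to the buddy allocator.  A
hypothetical helper sketching that relationship follows; it is not the
kernel's actual batch-sizing code.

static int model_free_batch(long free_count, int base_batch, int max_shift)
{
	int shift = 0;

	/* Roughly: double the flush batch each time the streak doubles. */
	while (shift < max_shift && (free_count >> shift) >= base_batch)
		shift++;

	return base_batch << shift;
}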
>> hm, no Mel.
>>
>> Nikhil, please do continue to work on this - it seems that there will
>> be a significant benefit to retuning this.
>
>
> Hi Andrew,
>
> I have analyzed the performance of different memory-sensitive workloads
> for these two different ways of decrementing pcp->free_count. I compared
> the scores among v6.6 mainline, v6.7 mainline, and v6.7 with our patch.
>
> For all the benchmarks, I used a 2-socket AMD server with 382 logical CPUs.
>
> Results I got are as follows:
> All scores are normalized with respect to v6.6 (base).
>
>
> For all the benchmarks below (iperf3, lmbench3 unix, netperf, redis, gups, xsbench),
> a higher score is better.
>
>                     iperf3    lmbench3 Unix       1-node netperf       2-node netperf
>                                   (AF_UNIX)   (SCTP_STREAM_MANY)   (SCTP_STREAM_MANY)
>                    -------   --------------   ------------------   ------------------
> v6.6 (base)            100              100                  100                  100
> v6.7                    69            113.2                   99                98.59
> v6.7 with my patch     100            112.1                100.3               101.16
>
>
>                   redis standard    redis core    redis L3 Heavy    Gups    xsbench
>                   --------------    ----------    --------------    ----    -------
> v6.6 (base)                  100           100              100      100        100
> v6.7                       99.45        101.66            99.47      100      98.14
> v6.7 with my patch         99.76        101.12            99.75      100      99.56
>
>
> and for graph500, hashjoin, pagerank and Kbuild, a lower score is better.
>
>                      graph500     hashjoin      hashjoin    pagerank     Kbuild
>                                (THP always)   (THP never)
>                     ---------  ------------   -----------   --------     ------
> v6.6 (base)              100           100           100         100        100
> v6.7                  101.08         101.3         101.9         100       98.8
> v6.7 with my patch     99.73           100        101.66         100       99.6
>
> From these results, I conclude that this patch performs better than,
> or as well as, base v6.7 on almost all of these workloads.

Sorry, this change doesn't make sense to me.

For example, if a large process exits on a CPU, pcp->free_count will
increase on that CPU.  This is good, because the process can free pages
more quickly during exit thanks to the larger batching.  However, after
that, pcp->free_count may stay large for a long time unless a large
number of page allocations (without a comparable number of page frees)
are done on that CPU.  So, the page freeing parameter may be influenced
by some unrelated workload for a long time.  That doesn't sound good.
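As a worked example of how long that influence can last (the numbers are
hypothetical): assume a large exit has pushed free_count up to 8192 and
the PCP batch is 64.  The snippet below counts how many order-0
allocations each policy needs before free_count drops back below one
batch.

#include <stdio.h>

int main(void)
{
	long count;
	int allocs;

	/* Current behaviour: halve on every allocation. */
	for (count = 8192, allocs = 0; count >= 64; allocs++)
		count >>= 1;
	printf("halving:        %d allocations\n", allocs);	/* 8 */

	/* Proposed behaviour: subtract one page per order-0 allocation. */
	for (count = 8192, allocs = 0; count >= 64; allocs++)
		count -= 1;
	printf("exact subtract: %d allocations\n", allocs);	/* 8129 */

	return 0;
}

With halving, the counter decays within a handful of allocations; with
exact subtraction, it can shadow the workload for thousands.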

In effect, a larger pcp->free_count will increase the page freeing batch
size.  That will improve page freeing throughput but hurt page freeing
latency, so please check the page freeing latency too.  If a larger
batch number helps performance without regressions, just increase the
batch number directly instead of playing with pcp->free_count.

And, do you run the network-related workloads on a single machine?  If
so, please try to run them on two machines instead, with the clients and
servers on different machines.  At the very least, please use different
sockets for the clients and the servers.  This matters because a larger
pcp->free_count makes it easier to trigger the free_high heuristic.  If
that is the case, please try to optimize the free_high heuristic
directly too.
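For context, a very simplified model of the free_high trigger; the real
check in free_unref_page_commit() has additional conditions, but the
part relevant here is that the path only arms once free_count has
accumulated at least one batch worth of frees, so a persistently large
free_count keeps it armed.

static bool model_free_high(long free_count, int batch)
{
	/* Simplified: the real kernel code applies further conditions. */
	return free_count >= batch;
}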

---
Best Regards,
Huang, Ying
