Message-ID: <87edbulwom.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 28 Mar 2024 14:03:53 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Bharata B Rao <bharata@....com>
Cc: <linux-mm@...ck.org>,  <linux-kernel@...r.kernel.org>,
  <akpm@...ux-foundation.org>,  <mingo@...hat.com>,
  <peterz@...radead.org>,  <mgorman@...hsingularity.net>,
  <raghavendra.kt@....com>,  <dave.hansen@...ux.intel.com>,
  <hannes@...xchg.org>
Subject: Re: [RFC PATCH 0/2] Hot page promotion optimization for large
 address space

Bharata B Rao <bharata@....com> writes:

> On 28-Mar-24 11:05 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@....com> writes:
>> 
>>> In order to check how efficiently the existing NUMA balancing
>>> based hot page promotion mechanism can detect hot regions and
>>> promote pages for workloads with large memory footprints, I
>>> wrote and tested a program that allocates a huge amount of
>>> memory but routinely touches only small parts of it.
>>>
>>> This microbenchmark provisions memory on both the DRAM node and the
>>> CXL node. It then divides the entire allocated memory into smaller
>>> chunks and randomly chooses a chunk for generating memory accesses.
>>> Each chunk is then accessed for a fixed number of iterations to
>>> create the notion of hotness. Within each chunk, the individual
>>> pages at 4K granularity are again accessed in random fashion.
>>>
>>> When a chunk is taken up for access in this manner, its pages
>>> can either be residing on DRAM or CXL. In the latter case, the NUMA
>>> balancing driven hot page promotion logic is expected to detect and
>>> promote the hot pages that reside on CXL.
>>>
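(Purely for illustration, a minimal sketch of such a microbenchmark; this is
not the actual program used above, and the node numbers, footprint, chunk
size and iteration counts are assumptions:)

/*
 * Illustrative sketch only, not the benchmark described above.
 * Places half of the buffer on the DRAM node and half on the CXL node,
 * then repeatedly picks a random chunk and touches its 4K pages in
 * random order for a fixed number of iterations to make it "hot".
 *
 * Build: gcc -O2 chunk_bench.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

#define DRAM_NODE	0			/* assumed DRAM node */
#define CXL_NODE	2			/* assumed CXL node */
#define TOTAL_SIZE	(64UL << 30)		/* assumed 64 GB footprint */
#define CHUNK_SIZE	(1UL << 30)		/* assumed 1 GB chunks */
#define PAGE_SZ		4096UL
#define HOT_ITERS	(1UL << 24)		/* accesses per chosen chunk */
#define ROUNDS		256			/* chunks picked per run */

int main(void)
{
	unsigned long nchunks = TOTAL_SIZE / CHUNK_SIZE;
	unsigned long pages_per_chunk = CHUNK_SIZE / PAGE_SZ;
	char *dram, *cxl;

	if (numa_available() < 0) {
		fprintf(stderr, "libnuma not available\n");
		return 1;
	}

	/* Half of the footprint on DRAM, half on the CXL node. */
	dram = numa_alloc_onnode(TOTAL_SIZE / 2, DRAM_NODE);
	cxl  = numa_alloc_onnode(TOTAL_SIZE / 2, CXL_NODE);
	if (!dram || !cxl)
		return 1;

	for (int round = 0; round < ROUNDS; round++) {
		/* Randomly choose a chunk anywhere in the allocation. */
		unsigned long c = (unsigned long)rand() % nchunks;
		char *base = c < nchunks / 2 ?
			dram + c * CHUNK_SIZE :
			cxl + (c - nchunks / 2) * CHUNK_SIZE;

		/* Touch its 4K pages in random order to create hotness. */
		for (unsigned long i = 0; i < HOT_ITERS; i++) {
			unsigned long p = (unsigned long)rand() % pages_per_chunk;
			base[p * PAGE_SZ] += 1;
		}
	}

	numa_free(dram, TOTAL_SIZE / 2);
	numa_free(cxl, TOTAL_SIZE / 2);
	return 0;
}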
>>> The experiment was conducted on a 2P AMD Bergamo system that has
>>> CXL as the 3rd node.
>>>
>>> $ numactl -H
>>> available: 3 nodes (0-2)
>>> node 0 cpus: 0-127,256-383
>>> node 0 size: 128054 MB
>>> node 1 cpus: 128-255,384-511
>>> node 1 size: 128880 MB
>>> node 2 cpus:
>>> node 2 size: 129024 MB
>>> node distances:
>>> node   0   1   2 
>>>   0:  10  32  60 
>>>   1:  32  10  50 
>>>   2:  255  255  10
>>>
>>> It is seen that the number of pages that get promoted is really low,
>>> and the reason is that the NUMA hint fault latency turns out to be
>>> much higher than the hot threshold most of the time. Here are a few
>>> latency and threshold sample values captured from the
>>> should_numa_migrate_memory() routine when the benchmark was run:
>>>
>>> latency (ms)	threshold (ms)
>>> 20620	1125
>>> 56185	1125
>>> 98710	1250
>>> 148871	1375
>>> 182891	1625
>>> 369415	1875
>>> 630745	2000
>> 
>> The access latency of your workload is 20s to 630s, which appears too
>> long.  Can you try to increase the threshold range to deal with that?
>> For example,
>> 
>> echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms
>
> That of course should help. But I was exploring alternatives where the
> notion of hotness can be de-linked from the absolute scanning time to

In fact, only the relative time from scan to hint fault is recorded and
used in the calculation; we have only a limited number of bits to store it.
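A simplified illustration of that limitation (the bucket shift, field width
and names below are assumptions, not the exact kernel code):

/*
 * Sketch only.  The scan time is stored in a few spare bits, so the
 * hint-fault latency can only be recovered as a wrapping relative
 * interval from scan to fault.
 */
#define TIME_BUCKET_SHIFT	6			/* assumed ~64 ms buckets */
#define TIME_BITS		14			/* assumed stored field width */
#define TIME_MASK		((1u << TIME_BITS) - 1)

/* At scan time: remember a truncated, bucketed timestamp. */
static unsigned int record_scan_time(unsigned int now_ms)
{
	return (now_ms >> TIME_BUCKET_SHIFT) & TIME_MASK;
}

/* At hint-fault time: reconstruct the latency (in ms) as a wrapping
 * difference; intervals longer than the field can represent wrap around.
 */
static unsigned int hint_fault_latency(unsigned int stored, unsigned int now_ms)
{
	unsigned int delta = ((now_ms >> TIME_BUCKET_SHIFT) - stored) & TIME_MASK;

	return delta << TIME_BUCKET_SHIFT;
}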

> the extent possible. For large memory workloads where only parts of memory
> get accessed at once, the scanning time can lag significantly behind the
> actual access time, as the data above shows. Wondering if such cases can
> be addressed without having to be workload-specific.

Does it really matter to promote such cold pages (accessed less often than
once every 20s)?  And if so, how can we adjust the current algorithm to
cover that?  I think that may be possible by extending the threshold
range.  And I think that we can find some way to extend the range by
default if necessary.

--
Best Regards,
Huang, Ying
