linux-kernel - Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred

Message-ID:
 <SA2PR10MB47145047CBF0AE1B6E099E299BBD2@SA2PR10MB4714.namprd10.prod.outlook.com>
Date: Wed, 16 Apr 2025 21:13:46 +0000
From: Chris Hyser <chris.hyser@...cle.com>
To: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
CC: Peter Zijlstra <peterz@...radead.org>,
        Mel Gorman
	<mgorman@...hsingularity.net>,
        "longman@...hat.com" <longman@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Chris Hyser
	<chris.hyser@...cle.com>
Subject: Re: [PATCH 1/2] sched/numa: Add ability to override task's
 numa_preferred_nid.

> From: Madadi Vineeth Reddy
> Sent: Wednesday, April 16, 2025 3:00 AM
> To: Chris Hyser
> Cc: Peter Zijlstra; Mel Gorman; longman@...hat.com; linux-kernel@...r.kernel.org; Madadi Vineeth Reddy
> Subject: Re: [PATCH 1/2] sched/numa: Add ability to override task's numa_preferred_nid.
>
>
> Hi Chris,
>
> On 15/04/25 07:05, Chris Hyser wrote:
>> From: chris hyser <chris.hyser@...cle.com>
>> 
>
>[..snip..]
>
>> The following results were from TPCC runs on an Oracle Database. The system
>> was a 2-node Intel machine with a database running on each node with local
>> memory allocations. No tasks or memory were pinned.
>> 
>> There are four scenarios of interest:
>> 
>> - Auto NUMA Balancing OFF.
>>     base value
>> 
>> - Auto NUMA Balancing ON.
>>     1.2% - ANB ON better than ANB OFF.
>> 
>> - Use the prctl(), ANB ON, parameters set to prevent faulting.
>>     2.4% - prctl() better than ANB OFF.
>>     1.2% - prctl() better than ANB ON.
>> 
>> - Use the prctl(), ANB parameters normal.
>>     3.1% - prctl() and ANB ON better than ANB OFF.
>>     1.9% - prctl() and ANB ON better than just ANB ON.
>>     0.7% - prctl() and ANB ON better than prctl() and ANB ON/faulting off
>> 
>
> Are you using prctl() to set the preferred node id for all the tasks of your run?
> If yes, then how `prctl() and ANB ON better than prctl() and ANB ON/faulting off`
> case happens?

Not every task in the system (including some DB tasks) has a prctl()-set preferred node, as the expected preference is not always known. That is part of it. However, the bigger influence, even with a prctl()-set preferred node, is that faulting drives physical page migration. You only want to migrate pages that the task is accessing. The fault tells you that the page was accessed and which node it currently resides on, allowing a migration decision to be made.

> IIUC, when setting preferred node in numa_preferred_nid_force, the original
> numa_preferred_nid which is derived from page faults will be a nop which should
> be an overhead.

As mentioned above, faulting drives physical page migration, with the usual trade-off between faulting overhead and the benefits of consolidating pages on the same node.

One issue I've seen repeatedly is that if you monitor a task (the NUMA fields in /proc/<pid>/sched), some tasks keep changing their preferred node. This makes sense since spatial access locality can change over time, but you also see the migrated page count going up independent of which node is currently preferred. So on a two-node system, there are pages being migrated back and forth (not necessarily the same ones). One possible effect of forcing the preferred node is that it no longer changes, so migrated pages should all be going the same way.
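
For anyone wanting to reproduce the monitoring described above, here is a minimal sketch in Python. It assumes /proc/<pid>/sched exposes "name : value" lines called numa_preferred_nid, numa_pages_migrated and total_numa_faults, which is what I would expect on a kernel built with CONFIG_NUMA_BALANCING; the exact field names may differ on your kernel. The helper names (parse_numa_fields, read_numa_fields, describe_change) are made up for illustration.

```python
# Assumed field names in /proc/<pid>/sched; adjust for your kernel.
NUMA_FIELDS = ("numa_preferred_nid", "numa_pages_migrated", "total_numa_faults")


def parse_numa_fields(text):
    """Return {field: int} for the NUMA fields found in /proc/<pid>/sched text."""
    found = {}
    for line in text.splitlines():
        key, sep, value = line.rpartition(":")
        if not sep:
            continue  # not a "name : value" line
        key = key.strip()
        if key not in NUMA_FIELDS:
            continue
        try:
            found[key] = int(value.strip())
        except ValueError:
            pass  # skip lines whose value is not a plain integer
    return found


def read_numa_fields(pid):
    """Read and parse the NUMA fields of a live task."""
    with open(f"/proc/{pid}/sched") as f:
        return parse_numa_fields(f.read())


def describe_change(prev, cur):
    """Compare two samples of the same task taken at different times."""
    return {
        "preferred_nid_changed":
            prev.get("numa_preferred_nid") != cur.get("numa_preferred_nid"),
        "pages_migrated_delta":
            cur.get("numa_pages_migrated", 0) - prev.get("numa_pages_migrated", 0),
    }
```

Sampling read_numa_fields(pid) periodically and feeding consecutive samples to describe_change() should show both effects: the preferred node flipping, and the migrated page count growing regardless of which node is preferred.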

> Let me know if my understanding is correct. Also, can you tell how to set the
> parameters of ANB to prevent faulting.

Basically, I set the sampling periods to a large number of seconds. The sampling frequency is then 1/large, which is ~0. Monitoring the task again, it should show no NUMA faults and no pages migrated.

kernel.numa_balancing : 1
scan_period_max_ms: 4294967295
scan_period_min_ms: 4294967295
scan_delay_ms: 4294967295
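
To make the "1/large is ~0" arithmetic concrete, and to show one way these values might be applied, here is a rough Python sketch. The paths are assumptions on my part: kernel.numa_balancing is the usual sysctl, while on recent kernels I would expect the scan_* knobs under debugfs at /sys/kernel/debug/sched/numa_balancing/ (older kernels exposed them as kernel.numa_balancing_scan_* sysctls). Check where they live on your system. The function names are illustrative only.

```python
U32_MAX_MS = 4294967295  # the value used above, equal to 2**32 - 1

# Assumed locations; verify on your kernel before use.
SYSCTL_NUMA_BALANCING = "/proc/sys/kernel/numa_balancing"
DEBUGFS_DIR = "/sys/kernel/debug/sched/numa_balancing"


def scan_rate(period_ms):
    """Convert a scan period in milliseconds to days and scans per second."""
    seconds = period_ms / 1000.0
    return {
        "period_days": seconds / 86400.0,
        "scans_per_second": 1.0 / seconds,
    }


def planned_writes(period_ms=U32_MAX_MS):
    """List the (path, value) writes matching the settings above."""
    value = str(period_ms)
    return [
        (SYSCTL_NUMA_BALANCING, "1"),
        (f"{DEBUGFS_DIR}/scan_period_max_ms", value),
        (f"{DEBUGFS_DIR}/scan_period_min_ms", value),
        (f"{DEBUGFS_DIR}/scan_delay_ms", value),
    ]


def apply_writes(writes):
    """Write each value to its file; needs root and a mounted debugfs."""
    for path, value in writes:
        with open(path, "w") as f:
            f.write(value)


if __name__ == "__main__":
    print(scan_rate(U32_MAX_MS))
    for path, value in planned_writes():
        print(f"{path} <- {value}")  # dry run; call apply_writes() to apply
```

With 4294967295 ms the period comes out to roughly 49.7 days, so for any run of realistic length the task is effectively never scanned.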

-chrish