linux-kernel - Re: [RFC PATCH v0 2/3] sched/numa: Add cumulative history of per-process fault stats

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2866edcd-37e5-ac74-0788-ef1dac1b22d4@amd.com>
Date:   Tue, 1 Feb 2022 18:00:56 +0530
From:   Bharata B Rao <bharata@....com>
To:     Mel Gorman <mgorman@...e.de>
Cc:     linux-kernel@...r.kernel.org, mingo@...hat.com,
        peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, bristot@...hat.com,
        dishaa.talreja@....com, Wei Huang <wei.huang2@....com>
Subject: Re: [RFC PATCH v0 2/3] sched/numa: Add cumulative history of
 per-process fault stats


On 1/31/2022 5:47 PM, Mel Gorman wrote:
> On Fri, Jan 28, 2022 at 10:58:50AM +0530, Bharata B Rao wrote:
>> From: Disha Talreja <dishaa.talreja@....com>
>>
>> The cumulative history of local/remote (lr) and private/shared (ps)
>> will be used for calculating adaptive scan period.
>>
> 
> How it used to calculate adaptive scan period?

Fault stats from different windows are accumulated and the cumulative stats
are used to arrive at the per-mm scan period unlike in the current case
where the stats from the last window determines the per-task scan period.

> 
> As it is likely used in a later patch, note here that the per-thread
> stats are simply accumulated in the address space for now.

Yes, will make that clear in the patch description here.

> 
>> Co-developed-by: Wei Huang <wei.huang2@....com>
>> Signed-off-by: Wei Huang <wei.huang2@....com>
>> Signed-off-by: Disha Talreja <dishaa.talreja@....com>
>> Signed-off-by: Bharata B Rao <bharata@....com>
>> ---
>>  include/linux/mm_types.h |  2 ++
>>  kernel/sched/fair.c      | 49 +++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 4f978c09d3db..2c6f119b947f 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -614,6 +614,8 @@ struct mm_struct {
>>  		/* Process-based Adaptive NUMA */
>>  		atomic_long_t faults_locality[2];
>>  		atomic_long_t faults_shared[2];
>> +		unsigned long faults_locality_history[2];
>> +		unsigned long faults_shared_history[2];
>>  
>>  		spinlock_t pan_numa_lock;
>>  		unsigned int numa_scan_period;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1d6404b2d42e..4911b3841d00 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2102,14 +2102,56 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
>>  /**********************************************/
>>  /*  Process-based Adaptive NUMA (PAN) Design  */
>>  /**********************************************/
>> +/*
>> + * Update the cumulative history of local/remote and private/shared
>> + * statistics. If the numbers are too small worthy of updating,
>> + * return FALSE, otherwise return TRUE.
>> + */
>> +static bool pan_update_history(struct task_struct *p)
>> +{
>> +	unsigned long local, remote, shared, private;
>> +	long diff;
>> +	int i;
>> +
>> +	remote = atomic_long_read(&p->mm->faults_locality[0]);
>> +	local = atomic_long_read(&p->mm->faults_locality[1]);
>> +	shared = atomic_long_read(&p->mm->faults_shared[0]);
>> +	private = atomic_long_read(&p->mm->faults_shared[1]);
>> +
>> +	/* skip if the activities in this window are too small */
>> +	if (local + remote < 100)
>> +		return false;
>> +
> 
> Why 100?

We need some minimum number of faults to make a decision and
figured out 100 could be a good minimum here.

> 
>> +	/* decay over the time window by 1/4 */
>> +	diff = local - (long)(p->mm->faults_locality_history[1] / 4);
>> +	p->mm->faults_locality_history[1] += diff;
>> +	diff = remote - (long)(p->mm->faults_locality_history[0] / 4);
>> +	p->mm->faults_locality_history[0] += diff;
>> +
>> +	/* decay over the time window by 1/2 */
>> +	diff = shared - (long)(p->mm->faults_shared_history[0] / 2);
>> +	p->mm->faults_shared_history[0] += diff;
>> +	diff = private - (long)(p->mm->faults_shared_history[1] / 2);
>> +	p->mm->faults_shared_history[1] += diff;
>> +
> 
> Why are the decay windows different?

Like in the existing algorithm, we started with a decay factor of 1/2
for both local/remote and private/shared.  However we found lr_ratio
oscillating too much with that and hence dampened it to 1/4.

Decay factor of 1/4 for ps_ratio too may not change the overall
behaviour that much, but will have to experiment and check.

> 
> 
>> +	/* clear the statistics for the next window */
>> +	for (i = 0; i < 2; i++) {
>> +		atomic_long_set(&(p->mm->faults_locality[i]), 0);
>> +		atomic_long_set(&(p->mm->faults_shared[i]), 0);
>> +	}
>> +
>> +	return true;
>> +}
>> +
>>  /*
>>   * Updates mm->numa_scan_period under mm->pan_numa_lock.
>> - *
>>   * Returns p->numa_scan_period now but updated to return
>>   * p->mm->numa_scan_period in a later patch.
>>   */
> 
> Spurious whitespace change.

Sorry, will fix.

> 
>>  static unsigned long pan_get_scan_period(struct task_struct *p)
>>  {
>> +	pan_update_history(p);
>> +
>>  	return p->numa_scan_period;
>>  }
>>  
> 
> Ok, so the spinlock is protecting the RMW of the PAN history. It still
> may be a concern that task_numa_work gets aborted if the spinlock cannot
> be acquired.

As replied in 1/3, the thread which is holding the lock for stats update
should start the scanning is our current understanding.

Regards,
Bharata.