linux-kernel - Re: [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when moving task to resource group

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d80e3d5d-1db2-0f47-dbf3-dfc607e1b451@intel.com>
Date:   Wed, 9 Dec 2020 16:22:40 -0800
From:   Reinette Chatre <reinette.chatre@...el.com>
To:     James Morse <james.morse@....com>, fenghua.yu@...el.com
Cc:     tglx@...utronix.de, bp@...en8.de, tony.luck@...el.com,
        kuo-lang.tseng@...el.com, shakeelb@...gle.com,
        valentin.schneider@....com, mingo@...hat.com, babu.moger@....com,
        hpa@...or.com, x86@...nel.org, linux-kernel@...r.kernel.org,
        stable@...r.kernel.org
Subject: Re: [PATCH 2/3] x86/resctrl: Update PQR_ASSOC MSR synchronously when
 moving task to resource group

Hi James,

On 12/9/2020 8:51 AM, James Morse wrote:
> Hi Reinette, Fenghua,
> 
> Subject nit: I think 'use IPI instead of task_work() to update PQR_ASSOC_MSR' conveys the
> guts of this change more quickly!

Sure. Thank you. A small change is that I plan to write "PQR_ASSOC MSR" 
instead to closer match the name.

> 
> On 03/12/2020 23:25, Reinette Chatre wrote:
>> From: Fenghua Yu <fenghua.yu@...el.com>
>>
>> Currently when moving a task to a resource group the PQR_ASSOC MSR
>> is updated with the new closid and rmid in an added task callback.
>> If the task is running the work is run as soon as possible. If the
>> task is not running the work is executed later
> 
>> in the kernel exit path when the kernel returns to the task again.
> 
> kernel exit makes me thing of user-space... is it enough to just say:
> "by __switch_to() when task is next run"?

I do not think that would be accurate. The paragraph to which you are 
proposing the change states the current context, before the fix, when 
task_work_add() is still used to update PQR_ASSOC. The "kernel exit" 
text you refer to is quite close to task_work_add()'s comments and 
indeed refers to the work that is run before returning to user space. If 
a function name would make things clearer perhaps adding 
exit_to_user_mode_loop() instead? Changing the text to __switch_to() 
does not reflect the context described here since __switch_to() is what 
is called when task is context switched back from where 
resctrl_sched_in() is called, not the queued work being described here.

>> Updating the PQR_ASSOC MSR as soon as possible on the CPU a moved task
>> is running is the right thing to do. Queueing work for a task that is
>> not running is unnecessary (the PQR_ASSOC MSR is already updated when the
>> task is scheduled in) and causing system resource waste with the way in
>> which it is implemented: Work to update the PQR_ASSOC register is queued
>> every time the user writes a task id to the "tasks" file, even if the task
>> already belongs to the resource group. This could result in multiple pending
>> work items associated with a single task even if they are all identical and
>> even though only a single update with most recent values is needed.
>> Specifically, even if a task is moved between different resource groups
>> while it is sleeping then it is only the last move that is relevant but
>> yet a work item is queued during each move.
>> This unnecessary queueing of work items could result in significant system
>> resource waste, especially on tasks sleeping for a long time. For example,
>> as demonstrated by Shakeel Butt in [1] writing the same task id to the
>> "tasks" file can quickly consume significant memory. The same problem
>> (wasted system resources) occurs when moving a task between different
>> resource groups.
>>
>> As pointed out by Valentin Schneider in [2] there is an additional issue with
>> the way in which the queueing of work is done in that the task_struct update
>> is currently done after the work is queued, resulting in a race with the
>> register update possibly done before the data needed by the update is available.
>>
>> To solve these issues, the PQR_ASSOC MSR is updated in a synchronous way
>> right after the new closid and rmid are ready during the task movement,
>> only if the task is running. If a moved task is not running nothing is
>> done since the PQR_ASSOC MSR will be updated next time the task is scheduled.
>> This is the same way used to update the register when tasks are moved as
>> part of resource group removal.
> 
> (as t->on_cpu is already used...)
> 
> Reviewed-by: James Morse <james.morse@....com>

Thank you very much. I do plan to follow your suggestion below though.


>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 68db7d2dec8f..9d62f1fadcc3 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -525,6 +525,16 @@ static void rdtgroup_remove(struct rdtgroup *rdtgrp)
> 
> 
>> +static void update_task_closid_rmid(struct task_struct *t)
>>   {
>> +	int cpu;
>>   
>> +	if (task_on_cpu(t, &cpu))
>> +		smp_call_function_single(cpu, _update_task_closid_rmid, t, 1);
> 
> 
> I think:
> |	if (task_curr(t))
> |		smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
> 
> here would make for an easier backport as it doesn't depend on the previous patch.
> 

Will do, thank you very much.

The previous patch has now become a bugfix in its own right after you 
pointed out the issue with using t->on_cpu. In addressing that I plan to 
remove the helpers found in patch #1 so backporting should continue to 
be easier.

> 
>> +}
> 
> [...]
> 
>>   static int __rdtgroup_move_task(struct task_struct *tsk,
>>   				struct rdtgroup *rdtgrp)
>>   {
> 
>> +	if (rdtgrp->type == RDTCTRL_GROUP) {
>> +		tsk->closid = rdtgrp->closid;
>> +		tsk->rmid = rdtgrp->mon.rmid;
>> +	} else if (rdtgrp->type == RDTMON_GROUP) {
> 
> [...]
> 
>> +	} else {
> 
>> +		rdt_last_cmd_puts("Invalid resource group type\n");
>> +		return -EINVAL;
> 
> Wouldn't this be a kernel bug?
> I'd have thought there would be a WARN_ON_ONCE() here to make it clear this isn't
> user-space's fault!

You are right, this would be a kernel bug. I'll add a WARN_ON_ONCE().

> 
> 
>>   	}
>> -	return ret;
>> +
>> +	/*
>> +	 * By now, the task's closid and rmid are set. If the task is current
>> +	 * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
>> +	 * group go into effect. If the task is not current, the MSR will be
>> +	 * updated when the task is scheduled in.
>> +	 */
>> +	update_task_closid_rmid(tsk);
>> +
>> +	return 0;
>>   }
> 

Thank you very much.

Reinette