Date:   Fri, 20 Nov 2020 14:53:16 +0000
From:   James Morse <>
To:     Valentin Schneider <>,
        Fenghua Yu <>,
        Reinette Chatre <>,
        Thomas Gleixner <>,
        Ingo Molnar <>, Borislav Petkov <>,
        "H. Peter Anvin" <>
Subject: Re: [PATCH 2/2] x86/intel_rdt: Plug task_work vs task_struct
 {rmid,closid} update race

Hi Valentin,

On 18/11/2020 18:00, Valentin Schneider wrote:
> Upon moving a task to a new control / monitor group, said task's {closid,
> rmid} fields are updated *after* triggering the move_myself() task_work
> callback. This can cause said callback to miss the update, e.g. if the
> triggering thread got preempted before fiddling with task_struct, or if the
> targeted task was already on its way to return to userspace.

So, this is the case where move_myself() runs after task_work_add() but before tsk's
{closid, rmid} are written to. Sounds fun!
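
i.e. (my sketch of the interleaving, as I read it):

  CPU0 (__rdtgroup_move_task())       CPU1 (tsk, returning to user)
    task_work_add(tsk, ...)
                                        move_myself()
                                          __resctrl_sched_in()
                                            reads tsk->{closid,rmid}  /* old */
    tsk->closid = ...
    tsk->rmid = ...                     /* lost until the next context switch */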

> Update the task_struct's {closid, rmid} tuple *before* invoking
> task_work_add(). As they can happen concurrently, wrap {closid, rmid}
> accesses with READ_ONCE() and WRITE_ONCE(). Highlight the required ordering
> with a pair of comments.

... and this one is if move_myself() or __resctrl_sched_in() runs while tsk is being
written to on another CPU. It might see torn values, or multiple reads might return
different values.
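
(For reference, the reader side then ends up as something like, from memory of the
resctrl code, shape only:

	closid = READ_ONCE(current->closid);
	rmid = READ_ONCE(current->rmid);

in __resctrl_sched_in() and friends.)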

The READ_ONCE/WRITE_ONCEry would have been easier to review as a separate patch, since
you touch all the sites, and move/change some of them.

Reviewed-by: James Morse <>

I don't 'get' memory-ordering, so one curiosity below:

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index b6b5b95df833..135a51529f70 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -524,11 +524,13 @@ static void move_myself(struct callback_head *head)
>  	 * If resource group was deleted before this task work callback
>  	 * was invoked, then assign the task to root group and free the
>  	 * resource group.
> +	 *
> +	 * See pairing atomic_inc() in __rdtgroup_move_task()
>  	 */
>  	if (atomic_dec_and_test(&rdtgrp->waitcount) &&
>  	    (rdtgrp->flags & RDT_DELETED)) {
> -		current->closid = 0;
> -		current->rmid = 0;
> +		WRITE_ONCE(current->closid, 0);
> +		WRITE_ONCE(current->rmid, 0);
>  		kfree(rdtgrp);
>  	}
> @@ -553,14 +555,32 @@ static int __rdtgroup_move_task(struct task_struct *tsk,

>  	/*
>  	 * Take a refcount, so rdtgrp cannot be freed before the
>  	 * callback has been invoked.
> +	 *
> +	 * Also ensures above {closid, rmid} writes are observed by
> +	 * move_myself(), as it can run immediately after task_work_add().
> +	 * Otherwise old values may be loaded, and the move will only actually
> +	 * happen at the next context switch.

But __resctrl_sched_in() can still occur at any time and READ_ONCE() a pair of values
that don't go together?
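
e.g. a hypothetical interleaving:

  CPU0 (mover)                        CPU1 (tsk, scheduling in)
    WRITE_ONCE(tsk->closid, new);
                                        __resctrl_sched_in()
                                          READ_ONCE(tsk->closid); /* new */
                                          READ_ONCE(tsk->rmid);   /* old */
    WRITE_ONCE(tsk->rmid, new);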
I don't think this is a problem for RDT: with an old rmid the task was a member of that
monitor group previously, and 'freed' rmids are kept in limbo for a while afterwards.
(An old closid is the same as the task not having schedule()d since the change, which is fine.)

For MPAM this is more annoying, as changing just the closid may put the task in a
monitoring group that never existed, meaning it shows up as surprise dirty later.

If this all makes sense, I guess the fix (for much later) is to union closid/rmid, and
WRITE_ONCE() them together where necessary, something like the sketch below.
(I've made a note for when I next pass that part of the MPAM tree.)
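
Completely untested sketch of that union idea; 'rctrl_ids' and the helper names are
made up, and it assumes task_struct grows a 'union resctrl_ids rctrl_ids' member in
place of the separate closid/rmid fields:

	union resctrl_ids {
		u64	val;
		struct {
			u32	closid;
			u32	rmid;
		};
	};

	static inline void resctrl_set_ids(struct task_struct *tsk,
					   union resctrl_ids ids)
	{
		/* both IDs become visible (or not) together */
		WRITE_ONCE(tsk->rctrl_ids.val, ids.val);
	}

	static inline union resctrl_ids resctrl_get_ids(struct task_struct *tsk)
	{
		/* readers always see a consistent {closid, rmid} pair */
		return (union resctrl_ids){ .val = READ_ONCE(tsk->rctrl_ids.val) };
	}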

> +	 *
> +	 * Pairs with atomic_dec() in move_myself().
>  	 */
>  	atomic_inc(&rdtgrp->waitcount);
> +
>  	ret = task_work_add(tsk, &callback->work, TWA_RESUME);
>  	if (ret) {
>  		/*


