Message-ID: <jhjtutkuipe.mognet@arm.com>
Date: Fri, 20 Nov 2020 15:54:53 +0000
From: Valentin Schneider <valentin.schneider@....com>
To: James Morse <james.morse@....com>
Cc: linux-kernel@...r.kernel.org, x86@...nel.org,
Fenghua Yu <fenghua.yu@...el.com>,
Reinette Chatre <reinette.chatre@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
"H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH 2/2] x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race

Hi James,
On 20/11/20 14:53, James Morse wrote:
> Hi Valentin,
>
> On 18/11/2020 18:00, Valentin Schneider wrote:
>> Upon moving a task to a new control / monitor group, said task's {closid,
>> rmid} fields are updated *after* triggering the move_myself() task_work
>> callback. This can cause said callback to miss the update, e.g. if the
>> triggering thread got preempted before fiddling with task_struct, or if the
>> targeted task was already on its way to return to userspace.
>
> So, if move_myself() runs after task_work_add() but before tsk is written to.
> Sounds fun!
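Yep. For the archives, the window in the current code is (pseudo-timeline,
CPU1 being the targeted task on its way back to userspace):

	__rdtgroup_move_task() (CPU0)		tsk (CPU1)
	task_work_add(tsk, ...);
						move_myself()
						  resctrl_sched_in(); /* stale IDs */
	tsk->closid = ...;
	tsk->rmid = ...;

so the MSR write uses the old values, and the move only actually takes
effect at the next context switch.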
>
>
>> Update the task_struct's {closid, rmid} tuple *before* invoking
>> task_work_add(). As they can happen concurrently, wrap {closid, rmid}
>> accesses with READ_ONCE() and WRITE_ONCE(). Highlight the required ordering
>> with a pair of comments.
>
> ... and this one is if move_myself() or __resctrl_sched_in() runs while tsk is being
> written to on another CPU. It might get torn values, or multiple reads might return different values.
>
> The READ_ONCE/WRITE_ONCEry would have been easier to read as a separate patch as you touch
> all sites, and move/change some of them.
>
True, I initially only fixed up the reads/writes involved with
__rdtgroup_move_task(), but ended up coccinelle'ing the whole lot - which I
should have then moved to a dedicated patch. Thanks for powering through
it; I'll send a v2 with a neater split.
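For the record the conversion itself is mechanical: every plain
load/store of tsk->closid / tsk->rmid becomes an accessor, along these
lines (minimal sketch of both sides, not one of the actual hunks):

	/* writer, before task_work_add() */
	WRITE_ONCE(tsk->closid, rdtgrp->closid);
	WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);

	/* reader, e.g. on the context-switch path */
	u32 closid = READ_ONCE(current->closid);
	u32 rmid = READ_ONCE(current->rmid);

That only stops the compiler from tearing / fusing individual accesses;
it says nothing about how the two fields are ordered relative to one
another, which is what your question below gets at.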
> Regardless:
> Reviewed-by: James Morse <james.morse@....com>
>
Thanks!
>
> I don't 'get' memory-ordering, so one curiosity below:
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index b6b5b95df833..135a51529f70 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -524,11 +524,13 @@ static void move_myself(struct callback_head *head)
>> * If resource group was deleted before this task work callback
>> * was invoked, then assign the task to root group and free the
>> * resource group.
>> + *
>> + * See pairing atomic_inc() in __rdtgroup_move_task()
>> */
>> if (atomic_dec_and_test(&rdtgrp->waitcount) &&
>> (rdtgrp->flags & RDT_DELETED)) {
>> - current->closid = 0;
>> - current->rmid = 0;
>> + WRITE_ONCE(current->closid, 0);
>> + WRITE_ONCE(current->rmid, 0);
>> kfree(rdtgrp);
>> }
>>
>> @@ -553,14 +555,32 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>
>> /*
>> * Take a refcount, so rdtgrp cannot be freed before the
>> * callback has been invoked.
>> + *
>> + * Also ensures above {closid, rmid} writes are observed by
>> + * move_myself(), as it can run immediately after task_work_add().
>> + * Otherwise old values may be loaded, and the move will only actually
>> + * happen at the next context switch.
>
> But __resctrl_sched_in() can still occur at anytime and READ_ONCE() a pair of values that
> don't go together?
Yes, the thought did cross my mind...
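To spell out the interleaving (pseudo-timeline, not from the patch):

	move (CPU0)				__resctrl_sched_in() (CPU1)
	WRITE_ONCE(tsk->closid, new);
						closid = READ_ONCE(current->closid); /* new */
						rmid = READ_ONCE(current->rmid);     /* old! */
	WRITE_ONCE(tsk->rmid, new);

i.e. the reader can observe a {new closid, old rmid} pair that never
described any existing group.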
> I don't think this is a problem for RDT as with old-rmid the task was a member of that
> monitor-group previously, and 'freed' rmids are kept in limbo for a while after.
> (old-closid is the same as the task having not schedule()d since the change, which is fine).
>
> For MPAM, this is more annoying as changing just the closid may put the task in a
> monitoring group that never existed, meaning it's surprise-dirty later.
>
> If this all makes sense, I guess the fix (for much later) is to union closid/rmid, and
> WRITE_ONCE() them together where necessary.
> (I've made a note for when I next pass that part of the MPAM tree)
>
It does make sense to me - one more question back to you: can RDT exist on
an X86_32 system? It shouldn't be a stopper, but would be an inconvenience.
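Concretely I'm reading your suggestion as something like this (sketch
only, the u64 field name is invented):

	/* in struct task_struct */
	union {
		struct {
			u32	closid;
			u32	rmid;
		};
		u64	resctrl_ids;
	};

	/* a single store publishes both fields */
	WRITE_ONCE(tsk->resctrl_ids, ids);

which is one naturally-aligned 64-bit store on 64-bit kernels, but not
on 32-bit ones - hence the question.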
FWIW kernel/sched/fair.c uses two synced u64s for this; see
struct cfs_rq { .min_vruntime, .min_vruntime_copy }
and
kernel/sched/fair.c:update_min_vruntime()
kernel/sched/fair.c:migrate_task_rq_fair()
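i.e. the writer publishes a copy behind a write barrier and the reader
re-reads until both values agree - roughly (sketch of the fair.c idiom,
not something resctrl does today):

	/* writer */
	cfs_rq->min_vruntime = vruntime;
#ifndef CONFIG_64BIT
	smp_wmb();
	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif

	/* reader, !CONFIG_64BIT side */
	do {
		min_vruntime_copy = cfs_rq->min_vruntime_copy;
		smp_rmb();
		min_vruntime = cfs_rq->min_vruntime;
	} while (min_vruntime != min_vruntime_copy);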
>
>> + *
>> + * Pairs with atomic_dec() in move_myself().
>> */
>> atomic_inc(&rdtgrp->waitcount);
>> +
>> ret = task_work_add(tsk, &callback->work, TWA_RESUME);
>> if (ret) {
>> /*
>
>
> Thanks!
>
> James