Message-ID: <20190925073643.GA4536@hirez.programming.kicks-ass.net>
Date: Wed, 25 Sep 2019 09:36:43 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: "Eric W. Biederman" <ebiederm@...ssion.com>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Oleg Nesterov <oleg@...hat.com>,
Russell King - ARM Linux admin <linux@...linux.org.uk>,
Chris Metcalf <cmetcalf@...hip.com>,
Christoph Lameter <cl@...ux.com>,
Kirill Tkhai <tkhai@...dex.ru>, Mike Galbraith <efault@....de>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
Davidlohr Bueso <dave@...olabs.net>, mpe@...erman.id.au
Subject: Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
On Mon, Sep 09, 2019 at 07:22:15AM -0500, Eric W. Biederman wrote:
> Let me see if I can explain my confusion in terms of task_numa_compare.
>
> The function task_numa_compare now does:
>
> rcu_read_lock();
> cur = rcu_dereference(dst_rq->curr);
>
> Then it proceeds to examine a few fields of cur.
>
> cur->flags;
> cur->cpus_ptr;
> cur->numa_group;
> cur->numa_faults[];
> cur->total_numa_faults;
> cur->se.cfs_rq;
>
> My concern here is the ordering between setting rq->curr and setting
> those other fields. If we read values of those fields that were not
> current when we set cur, then I think task_numa_compare would give an
> incorrect answer.
That code is explicitly without serialization. One memory barrier isn't
going to make any difference whatsoever.
Specifically, those numbers will change _after_ we make it current, not
before.
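
To illustrate, a rough sketch of the interleaving (not exact kernel code;
task_numa_fault() etc. are where those stats actually get updated):

	__schedule():				task_numa_compare():

	  rq->curr = next;
						  rcu_read_lock();
						  cur = rcu_dereference(dst_rq->curr);
	  /* next starts running ... */
	  task_numa_fault() etc. now update	  /* reads cur->numa_faults[],    */
	    cur->numa_faults[] and friends	  /* cur->total_numa_faults, ...  */
						  rcu_read_unlock();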
> From what I can see pick_next_task_fair does not mess with any of
> the fields that task_numa_compare uses. So the existing memory barrier
> should be more than sufficient for that case.
There should not be, but even if there is, load-balancing in general
(very much including the NUMA balancing) is completely unserialized and
prone to reading stale data at all times.
This is on purpose; it minimizes the interference of load-balancing, and
if we make a 'wrong' choice, the next pass can fix it up.
> So I think I just need to write up a patch 4/3 that replaces my
> "rcu_assign_pointer(rq->curr, next)" with "RCU_INIT_POINTER(rq->curr,
> next)". And includes a great big comment to that effect.
>
>
>
> Now maybe I am wrong. Maybe we can afford to see data that was changed
> before the task became rq->curr in task_numa_compare. But I don't think
> so. I really think it is that full memory barrier just a bit earlier
> in __schedule that is keeping us safe.
>
> Am I wrong?
Not insofar as we now both seem to agree on using
RCU_INIT_POINTER(); but we're still disagreeing on why.
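
I.e. on the write side we both end up with something like this (a sketch,
heavily abbreviated from __schedule(), exact context elided):

	rq_lock(rq, &rf);
	smp_mb__after_spinlock();	/* the full barrier you refer to */
	...
	next = pick_next_task(rq, prev, &rf);
	...
	/*
	 * rq->curr only changes with rq->lock held; RCU readers get an
	 * existence guarantee for the task and nothing more, so no
	 * publish barrier is required here.
	 */
	RCU_INIT_POINTER(rq->curr, next);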
My argument is that since we can already obtain any task_struct pointer
through find_task_by_vpid() and friends, any access to its individual
members must already have serialization rules (or explicitly none, as for
load-balancing).
I'm viewing this as just another variant of find_task_by_vpid(), except
we index by CPU instead of PID. There can be changes to task_struct
between two invocations of find_task_by_vpid(); any user will have to
somehow deal with that. This is no different.
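
Something like (sketch):

	rcu_read_lock();
	p = find_task_by_vpid(pid);			/* index by PID */
	/* p can stop being the task for that PID at any time */
	rcu_read_unlock();

	rcu_read_lock();
	p = rcu_dereference(cpu_rq(cpu)->curr);		/* index by CPU */
	/* p can stop being rq->curr at any time */
	rcu_read_unlock();

In both cases RCU only guarantees the task_struct itself doesn't go away
while you look at it; what you're allowed to do with each member is that
member's own business.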
Take for instance p->cpus_ptr; it has very specific serialization rules
(which that NUMA balancer thing blatantly ignores), as do a lot of other
task_struct fields.
The memory barrier in question isn't going to help with any of that.
Specifically, if we care about the consistency of the numbers changed by
pick_next_task(), we must acquire rq->lock (of course, once we do that,
we can immediately use rq->curr without further RCU use).
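
Something like (sketch; use whichever rq_lock variant the calling context
requires):

	struct rq_flags rf;
	struct task_struct *cur;

	rq_lock_irqsave(dst_rq, &rf);
	cur = dst_rq->curr;	/* stable: __schedule() changes it under rq->lock */
	/* ... fields modified under rq->lock are consistent here ... */
	rq_unlock_irqrestore(dst_rq, &rf);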