linux-kernel - Re: [PATCH 2/3] task: RCU protect tasks on the runqueue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190925073643.GA4536@hirez.programming.kicks-ass.net>
Date:   Wed, 25 Sep 2019 09:36:43 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     "Paul E. McKenney" <paulmck@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Russell King - ARM Linux admin <linux@...linux.org.uk>,
        Chris Metcalf <cmetcalf@...hip.com>,
        Christoph Lameter <cl@...ux.com>,
        Kirill Tkhai <tkhai@...dex.ru>, Mike Galbraith <efault@....de>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Davidlohr Bueso <dave@...olabs.net>, mpe@...erman.id.au
Subject: Re: [PATCH 2/3] task: RCU protect tasks on the runqueue

On Mon, Sep 09, 2019 at 07:22:15AM -0500, Eric W. Biederman wrote:

> Let me see if I can explain my confusion in terms of task_numa_compare.
> 
> The function task_numa_comare now does:
> 
> 	rcu_read_lock();
> 	cur = rcu_dereference(dst_rq->curr);
> 
> Then it proceeds to examine a few fields of cur.
> 
> 	cur->flags;
>         cur->cpus_ptr;
>         cur->numa_group;
>         cur->numa_faults[];
>         cur->total_numa_faults;
>         cur->se.cfs_rq;
> 
> My concern here is the ordering between setting rq->curr and and setting
> those other fields.  If we read values of those fields that were not
> current when we set cur than I think task_numa_compare would give an
> incorrect answer.

That code is explicitly without serialization. One memory barrier isn't
going to make any difference what so ever.

Specifically, those numbers will change _after_ we make it current, not
before.

> From what I can see pick_next_task_fair does not mess with any of
> the fields that task_numa_compare uses.  So the existing memory barrier
> should be more than sufficient for that case.

There should not be, but even if there is, load-balancing in general
(very much including the NUMA balancing) is completely unserialized and
prone to reading stale data at all times.

This is on purpose; it minimizes the interference of load-balancing, and
if we make a 'wrong' choice, the next pass can fix it up.

> So I think I just need to write up a patch 4/3 that replaces the my
> "rcu_assign_pointer(rq->curr, next)" with "RCU_INIT_POINTER(rq->curr,
> next)".  And includes a great big comment to that effect.
> 
> 
> 
> Now maybe I am wrong.  Maybe we can afford to see data that was changed
> before the task became rq->curr in task_numa_compare.  But I don't think
> so.  I really think it is that full memory barrier just a bit earlier
> in __schedule that is keeping us safe.
> 
> Am I wrong?

Not in so far that we now both seem to agree on using
RCU_INIT_POINTER(); but we're still disagreeing on why.

My argument is that since we can already obtain any task_struct pointer
through find_task_by_vpid() and friends, any access to its individual
members must already have serialzation rules (or explicitly none, as for
load-balancing).

I'm viewing this as just another variant of find_task_by_vpid(), except
we index by CPU instead of PID. There can be changes to task_struct
between to invocations of find_task_by_vpid(), any user will have to
somehow deal with that. This is no different.

Take for instance p->cpus_ptr, it has very specific serialzation rules
(which that NUMA balancer thing blatantly ignores), as do a lot of other
task_struct fields.

The memory barrier in question isn't going to help with any of that.

Specifically, if we care about the consistency of the numbers changed by
pick_next_task(), we must acquire rq->lock (of course, once we do that,
we can immediately use rq->curr without further RCU use).