Message-ID: <20190920230247.GA6449@lenoir>
Date:   Sat, 21 Sep 2019 01:02:49 +0200
From:   Frederic Weisbecker <frederic@...nel.org>
To:     "Eric W. Biederman" <ebiederm@...ssion.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Oleg Nesterov <oleg@...hat.com>,
        Russell King - ARM Linux admin <linux@...linux.org.uk>,
        Chris Metcalf <cmetcalf@...hip.com>,
        Christoph Lameter <cl@...ux.com>,
        Kirill Tkhai <tkhai@...dex.ru>, Mike Galbraith <efault@....de>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Davidlohr Bueso <dave@...olabs.net>,
        "Paul E. McKenney" <paulmck@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v2 4/4] task: RCUify the assignment of rq->curr

On Sat, Sep 14, 2019 at 07:35:02AM -0500, Eric W. Biederman wrote:
> 
> The current task on the runqueue is currently read with rcu_dereference().
> 
> To obtain ordinary rcu semantics for an rcu_dereference of rq->curr it needs
> to be paired with rcu_assign_pointer of rq->curr, which provides the
> memory barrier necessary to order assignments to the task_struct
> and the assignment to rq->curr.
> 
> Unfortunately the assignment of rq->curr in __schedule is a hot path,
> and it has already been shown that additional barriers in that code
> will reduce the performance of the scheduler.  So I will attempt to
> describe below why you can effectively have ordinary rcu semantics
> without any additional barriers.
> 
> The assignment of rq->curr in init_idle is a slow path called once
> per cpu and that can use rcu_assign_pointer() without any concerns.
> 
> As I write this there are effectively two users of rcu_dereference on
> rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
> that only looks at "->mm" after the rcu_dereference.  Then there is
> task_numa_compare() in kernel/sched/fair.c.  My best reading of the
> code shows that task_numa_compare only accesses: "->flags",
> "->cpus_ptr", "->numa_group", "->numa_faults[]",
> "->total_numa_faults", and "->se.cfs_rq".
> 
> The code in __schedule() essentially does:
> 	rq_lock(...);
> 	smp_mb__after_spinlock();
> 
> 	next = pick_next_task(...);
> 	rq->curr = next;
> 
> 	context_switch(prev, next);
> 
> At the start of the function the rq_lock/smp_mb__after_spinlock
> pair provides a full memory barrier.  Further there is a full memory barrier
> in context_switch().
> 
> This means that any task that has already run and modified itself (the
> common case) has already seen two memory barriers before __schedule()
> runs and begins executing.  A task that modifies itself then sees a
> third full memory barrier pair with the rq_lock();
> 
> For a brand new task that is enqueued with wake_up_new_task() there
> are the memory barriers present from taking and releasing the
> pi_lock and the rq_lock as the process is enqueued as well as the
> full memory barrier at the start of __schedule() assuming __schedule()
> happens on the same cpu.
> 
> This means that by the time we reach the assignment of rq->curr
> except for values on the task struct modified in pick_next_task
> the code has the same guarantees as if it used rcu_assign_pointer.
> 
> Reading through all of the implementations of pick_next_task it
> appears pick_next_task is limited to modifying the task_struct fields
> "->se", "->rt", "->dl".  These fields are the sched_entity structures
> of the various schedulers.
> 
> Further "->se.cfs_rq" is only changed in cgroup attach/move operations
> initialized by userspace.
> 
> Unless I have missed something this means that in practice the
> users of "rcu_dereference(rq->curr)" get normal rcu semantics of
> rcu_dereference() for the fields they care about, despite the
> assignment of rq->curr in __schedule() not using rcu_assign_pointer.
> 
> Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
> Signed-off-by: "Eric W. Biederman" <ebiederm@...ssion.com>
> ---
>  kernel/sched/core.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 69015b7c28da..668262806942 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3857,7 +3857,11 @@ static void __sched notrace __schedule(bool preempt)
>  
>  	if (likely(prev != next)) {
>  		rq->nr_switches++;
> -		rq->curr = next;
> +		/*
> +		 * RCU users of rcu_dereference(rq->curr) may not see
> +		 * changes to task_struct made by pick_next_task().
> +		 */
> +		RCU_INIT_POINTER(rq->curr, next);

It would be nice to have more explanation in the comment as to why we
don't use rcu_assign_pointer() here (the very fast-path issue) and why,
and under which conditions, it is expected to be fine (the rq_lock() +
post-spinlock barrier). A short summary of the changelog would do,
because that line implies way too many subtleties.
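
To illustrate the distinction such a comment would need to capture, here
is a rough userspace analogy using C11 atomics. It is only a sketch, not
kernel code: struct task, curr and the three function names are
hypothetical stand-ins, the seq_cst fence merely plays the role of
rq_lock() + smp_mb__after_spinlock(), and the acquire load is stronger
than what rcu_dereference() actually requires.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct. */
struct task {
	int flags;
};

/* Hypothetical stand-in for rq->curr; static storage starts out NULL. */
static _Atomic(struct task *) curr;

/*
 * Rough analogue of rcu_assign_pointer(): the release store orders all
 * earlier writes to *t before the pointer becomes visible.
 */
static void publish_release(struct task *t)
{
	atomic_store_explicit(&curr, t, memory_order_release);
}

/*
 * Rough analogue of the __schedule() path with RCU_INIT_POINTER(): the
 * store itself carries no ordering. Writes to *t made before the fence
 * are ordered against the store; writes made between the fence and the
 * store (the pick_next_task() window) would not be.
 */
static void publish_after_fence(struct task *t)
{
	atomic_thread_fence(memory_order_seq_cst);
	/* Any update to *t placed here is NOT ordered before the store. */
	atomic_store_explicit(&curr, t, memory_order_relaxed);
}

/*
 * Reader side. rcu_dereference() relies on address dependency ordering;
 * acquire is used here as a portable, stronger approximation.
 */
static struct task *read_curr(void)
{
	return atomic_load_explicit(&curr, memory_order_acquire);
}
```

In this analogy a reader that obtains the pointer through read_curr()
sees every field written before the fence in publish_after_fence(), which
mirrors the argument in the changelog for fields not touched by
pick_next_task().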

Thanks.
