Date:   Thu,  8 Dec 2022 23:30:59 +0100
From:   Peter Newman <peternewman@...gle.com>
To:     reinette.chatre@...el.com
Cc:     bp@...en8.de, derkling@...gle.com, eranian@...gle.com,
        fenghua.yu@...el.com, hpa@...or.com, james.morse@....com,
        jannh@...gle.com, kpsingh@...gle.com, linux-kernel@...r.kernel.org,
        mingo@...hat.com, peternewman@...gle.com, tglx@...utronix.de,
        x86@...nel.org
Subject: Re: [PATCH v4 1/2] x86/resctrl: Update task closid/rmid with task_call_func()

Hi Reinette,

On Wed, Dec 7, 2022 at 7:41 PM Reinette Chatre <reinette.chatre@...el.com> wrote:
> On 12/7/2022 2:58 AM, Peter Newman wrote:
> >>>  2. resctrl_sched_in() loads t->{closid,rmid} before the calling context
> >>>     switch stores new task_curr() and task_cpu() values.
> >>
> >> This scenario is not clear to me. Could you please provide more detail about it?
> >> I was trying to follow the context_switch() flow and resctrl_sched_in() is
> >> one of the last things done (context_switch()->switch_to()->resctrl_sched_in()).
> >> From what I can tell rq->curr, as used by task_curr() is set before
> >> even context_switch() is called ... and since the next task is picked from
> >> the CPU's runqueue (and set_task_cpu() sets the task's cpu when moved to
> >> a runqueue) it seems to me that the value used by task_cpu() would also
> >> be set early (before context_switch() is called). It is thus not clear to
> >> me how the above reordering could occur so an example would help a lot.
> >
> > Perhaps in both scenarios I didn't make it clear that reordering in the
> > CPU can cause the incorrect behavior rather than the program order. In
> > this explanation, items 1. and 2. are supposed to be completing the
> > sentence ending with a ':' at the end of paragraph 3, so I thought that
> > would keep focus on the CPU.
>
> You did make it clear that the cause is reordering in the CPU. I am just
> not able to see where the reordering is occurring in your scenario (2).

It all comes down to whether the CPU can get from updating rq->curr to
reading task_struct::{closid,rmid} without encountering a full barrier.

I'll go into the details below.

> Please do. Could you start by highlighting which resctrl_sched_in()
> you are referring to? I am trying to dissect (2) with the given information:
> Through "the calling context switch" the scenario is written to create
> understanding that it refers to:
> context_switch()->switch_to()->resctrl_sched_in() - so the calling context
> switch is the first in the above call path ... where does it (context_switch())
> store the new task_curr() and task_cpu() values and how does that reorder with
> resctrl_sched_in() further down in call path?

Yes, the rq->curr update is actually in __schedule(). I was probably
still thinking it was in prepare_task_switch() (called from
context_switch()) because of older kernels where __rdtgroup_move_task()
is still reading task_struct::on_cpu.

There is an interesting code comment under the rq->curr update site in
__schedule():

	/*
	 * RCU users of rcu_dereference(rq->curr) may not see
	 * changes to task_struct made by pick_next_task().
	 */
	RCU_INIT_POINTER(rq->curr, next);
	/*
	 * The membarrier system call requires each architecture
	 * to have a full memory barrier after updating
	 * rq->curr, before returning to user-space.
	 *
	 * Here are the schemes providing that barrier on the
	 * various architectures:
	 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
	 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
	 * - finish_lock_switch() for weakly-ordered
	 *   architectures where spin_unlock is a full barrier,
	 * - switch_to() for arm64 (weakly-ordered, spin_unlock
	 *   is a RELEASE barrier),
	 */

Based on this, I believe (2) doesn't happen on x86 because switch_mm()
provides the required ordering.

On arm64, it won't happen as long as it calls its resctrl_sched_in()
after the dsb(ish) in __switch_to(), which seems to be the case:

https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git/tree/arch/arm64/kernel/process.c?h=mpam/snapshot/v6.0-rc1#n561

Based on this, I'll just sketch out the first scenario below and drop
(2) from the changelog. This also implies that the group update cases
can use a single smp_mb() to provide all the necessary ordering, because
there's a full barrier on context switch for it to pair with, so I no
longer need to broadcast an IPI. I don't know whether task_call_func()
is faster than an smp_mb(); I'll take some measurements to see.

The intended behavior is that if __rdtgroup_move_task() does not observe
t1 running, then t1 is guaranteed to observe the updated values the next
time it switches in:

CPU 0                                   CPU 1
-----                                   -----
(t1->{closid,rmid} -> {1,1})            (rq->curr -> t0)

__rdtgroup_move_task():
  t1->{closid,rmid} <- {2,2}
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {2,2}
  if (curr == t1) // false
    IPI(t1->cpu)

In (1), absent a barrier, CPU 0's store to {closid,rmid} is allowed to
be reordered after its read of whether t1 is current:

CPU 0                                   CPU 1
-----                                   -----
__rdtgroup_move_task():
  curr <- t1->cpu->rq->curr
                                        __schedule():
                                          rq->curr <- t1
                                        resctrl_sched_in():
                                          t1->{closid,rmid} -> {1,1}
  t1->{closid,rmid} <- {2,2}
  if (curr == t1) // false
    IPI(t1->cpu)

Please let me know if these diagrams clarify things.

-Peter
