Date:   Sat, 14 Jan 2023 23:59:39 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Andrei Vagin <avagin@...il.com>
CC:     Andrei Vagin <avagin@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        <linux-kernel@...r.kernel.org>, Kees Cook <keescook@...omium.org>,
        Christian Brauner <brauner@...nel.org>,
        Andy Lutomirski <luto@...capital.net>,
        Juri Lelli <juri.lelli@...hat.com>,
        Peter Oskolkov <posk@...gle.com>,
        Tycho Andersen <tycho@...ho.pizza>,
        "Will Drewry" <wad@...omium.org>
Subject: Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu

On 2023-01-13 at 13:39:46 -0800, Andrei Vagin wrote:
> On Wed, Jan 11, 2023 at 11:36 PM Chen Yu <yu.c.chen@...el.com> wrote:
> >
> > On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> > > From: Peter Oskolkov <posk@...gle.com>
> > >
> > > Add a WF_CURRENT_CPU wake flag that advises the scheduler to
> > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > context switching use cases.
> > >
> > > In addition, make ttwu external rather than static so that
> > > the flag could be passed to it from outside of sched/core.c.
> > >
> > > Signed-off-by: Peter Oskolkov <posk@...gle.com>
> > > Signed-off-by: Andrei Vagin <avagin@...il.com>
> > > @@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > >       if (wake_flags & WF_TTWU) {
> > >               record_wakee(p);
> > >
> > > +             if ((wake_flags & WF_CURRENT_CPU) &&
> > > +                 cpumask_test_cpu(cpu, p->cpus_ptr))
> > > +                     return cpu;
> > I agree that cross-CPU wake up brings pain to fast context switching
> > use cases, especially on high core count systems. We suffered from this
> > issue as well, so we previously raised it too. The difference is that we
> > used a dynamic "WF_CURRENT_CPU" mechanism[1] to deal with it. That is,
> > if the waker and wakee are both short-duration tasks, the waker wakes up
> > the wakee on the current CPU. So not only seccomp but also other
> > components/workloads could benefit from this without having to set the
> > WF_CURRENT_CPU flag.
> >
> > Link [1]:
> > https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/
> 
> Thanks for the link. I like the idea, but this change has no impact on the
> seccomp notify case. I used the benchmark from the fifth patch. It is a
> ping-pong benchmark in which one process triggers system calls and another
> process handles them. It measures the number of system calls that can be
> processed within a specified time slice.
>
Thanks for this information.
> $ cd tools/testing/selftests/seccomp/
> $ make
> $ ./seccomp_bpf  2>&1| grep user_notification_sync
> #  RUN           global.user_notification_sync ...
> # seccomp_bpf.c:4281:user_notification_sync:basic: 8489 nsec/syscall
> # seccomp_bpf.c:4281:user_notification_sync:sync: 3078 nsec/syscall
> #            OK  global.user_notification_sync
> ok 51 global.user_notification_sync
> 
> The results are the same with and without your change. I expected that
> your change would improve the basic case so that it reaches the sync one.
> 
The reason why the patch did not bring a benefit might be that the patch
aims to wake the task on the current CPU only when there are no idle cores.
The seccomp notify test launches 2 processes, which makes it hard to
saturate the system, if I understand correctly. I built a kernel based on
your repo and launched the above test; it seems that the average load is
quite low on my system. Is this expected? Besides, is 100 ms long enough
for the test (USER_NOTIF_BENCH_TIMEOUT)?
> I did some experiments and found that we can achieve the desirable
> outcome if we move the "short-task" checks prior to considering waking
> up on prev_cpu.
> 
> For example, with this patch:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2f89e44e237d..af20b58e3972 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6384,6 +6384,11 @@ static int wake_wide(struct task_struct *p)
>  static int
>  wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>  {
> +       /* The only running task is a short duration one. */
> +       if (cpu_rq(this_cpu)->nr_running == 1 &&
> +           is_short_task(cpu_curr(this_cpu)))
> +               return this_cpu;
> +
In v1 we put the above code before checking prev_cpu, but it was reported
to bring a regression on some systems[2]. The result suggested that we
should try to pick an idle CPU first, no matter whether it is the current
CPU or the previous CPU; if that fails, we can lower the bar and pick the
current CPU.

[2] https://lore.kernel.org/lkml/6c626e8d-4133-00ba-a765-bafe08034517@amd.com/
>         /*
>          * If this_cpu is idle, it implies the wakeup is from interrupt
>          * context. Only allow the move if cache is shared. Otherwise an
> @@ -6405,11 +6410,6 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>         if (available_idle_cpu(prev_cpu))
>                 return prev_cpu;
> 
> -       /* The only running task is a short duration one. */
> -       if (cpu_rq(this_cpu)->nr_running == 1 &&
> -           is_short_task(cpu_curr(this_cpu)))
> -               return this_cpu;
> -
>         return nr_cpumask_bits;
>  }
> 
> @@ -6897,6 +6897,10 @@ static int select_idle_sibling(struct
> task_struct *p, int prev, int target)
>             asym_fits_cpu(task_util, util_min, util_max, target))
>                 return target;
> 
> +       if (!has_idle_core && cpu_rq(target)->nr_running == 1 &&
> +           is_short_task(cpu_curr(target)) && is_short_task(p))
> +               return target;
> +
>         /*
>          * If the previous CPU is cache affine and idle, don't be stupid:
>          */
> 
> 
> the basic test case shows almost the same results as the sync one:
> 
> $ ./seccomp_bpf  2>&1| grep user_notification_sync
> #  RUN           global.user_notification_sync ...
> # seccomp_bpf.c:4281:user_notification_sync:basic: 3082 nsec/syscall
> # seccomp_bpf.c:4281:user_notification_sync:sync: 2690 nsec/syscall
> #            OK  global.user_notification_sync
> ok 51 global.user_notification_sync
> 
> If you want to do any experiments, you can find my tree here:
> [1] https://github.com/avagin/linux-task-diag/tree/wip/seccom-notify-sync-and-shed-short-task
> 
I'm happy to test this further. One thing I'm curious about is, where does the
benefit of waking up the seccomp task on the current CPU come from? Is it due
to a hot L1 cache?

thanks,
Chenyu
> Thanks,
> Andrei
