Message-ID: <CANaxB-wcpKS64q6_0+r+OwoZupRN-A-PQvPRiVsMmEgB1TRSrw@mail.gmail.com>
Date: Fri, 13 Jan 2023 13:39:46 -0800
From: Andrei Vagin <avagin@...il.com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: Andrei Vagin <avagin@...gle.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
linux-kernel@...r.kernel.org, Kees Cook <keescook@...omium.org>,
Christian Brauner <brauner@...nel.org>,
Andy Lutomirski <luto@...capital.net>,
Juri Lelli <juri.lelli@...hat.com>,
Peter Oskolkov <posk@...gle.com>,
Tycho Andersen <tycho@...ho.pizza>,
Will Drewry <wad@...omium.org>
Subject: Re: [PATCH 2/5] sched: add WF_CURRENT_CPU and externise ttwu
On Wed, Jan 11, 2023 at 11:36 PM Chen Yu <yu.c.chen@...el.com> wrote:
>
> On 2023-01-10 at 13:30:07 -0800, Andrei Vagin wrote:
> > From: Peter Oskolkov <posk@...gle.com>
> >
> > Add WF_CURRENT_CPU wake flag that advices the scheduler to
> > move the wakee to the current CPU. This is useful for fast on-CPU
> > context switching use cases.
> >
> > In addition, make ttwu external rather than static so that
> > the flag could be passed to it from outside of sched/core.c.
> >
> > Signed-off-by: Peter Oskolkov <posk@...gle.com>
> > Signed-off-by: Andrei Vagin <avagin@...il.com>
> > @@ -7380,6 +7380,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > if (wake_flags & WF_TTWU) {
> > record_wakee(p);
> >
> > + if ((wake_flags & WF_CURRENT_CPU) &&
> > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > + return cpu;
> I agree that cross-CPU wake-ups bring pain to fast context-switching
> use cases, especially on high-core-count systems. We suffered from this
> issue as well and presented it previously. The difference is that we
> used a dynamic "WF_CURRENT_CPU" mechanism[1] to deal with it: if the
> waker and wakee are both short-duration tasks, the waker wakes up the
> wakee on the current CPU. That way, not only seccomp but also other
> components/workloads could benefit without having to set the
> WF_CURRENT_CPU flag.
>
> Link [1]:
> https://lore.kernel.org/lkml/cover.1671158588.git.yu.c.chen@intel.com/
Thanks for the link. I like the idea, but this change has no impact on the
seccomp notify case. I used the benchmark from the fifth patch. It is a
ping-pong benchmark in which one process triggers system calls and another
process handles them. It measures the number of system calls that can be
processed within a specified time slice.
$ cd tools/testing/selftests/seccomp/
$ make
$ ./seccomp_bpf 2>&1| grep user_notification_sync
# RUN global.user_notification_sync ...
# seccomp_bpf.c:4281:user_notification_sync:basic: 8489 nsec/syscall
# seccomp_bpf.c:4281:user_notification_sync:sync: 3078 nsec/syscall
# OK global.user_notification_sync
ok 51 global.user_notification_sync
The results are the same with and without your change. I expected your
change to improve the basic case so that it reaches the sync one.
I did some experiments and found that we can achieve the desired outcome
if we move the "short-task" checks before considering a wakeup on
prev_cpu.
For example, with this patch:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f89e44e237d..af20b58e3972 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6384,6 +6384,11 @@ static int wake_wide(struct task_struct *p)
static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
+ /* The only running task is a short duration one. */
+ if (cpu_rq(this_cpu)->nr_running == 1 &&
+ is_short_task(cpu_curr(this_cpu)))
+ return this_cpu;
+
/*
* If this_cpu is idle, it implies the wakeup is from interrupt
* context. Only allow the move if cache is shared. Otherwise an
@@ -6405,11 +6410,6 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
if (available_idle_cpu(prev_cpu))
return prev_cpu;
- /* The only running task is a short duration one. */
- if (cpu_rq(this_cpu)->nr_running == 1 &&
- is_short_task(cpu_curr(this_cpu)))
- return this_cpu;
-
return nr_cpumask_bits;
}
@@ -6897,6 +6897,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
asym_fits_cpu(task_util, util_min, util_max, target))
return target;
+ if (!has_idle_core && cpu_rq(target)->nr_running == 1 &&
+ is_short_task(cpu_curr(target)) && is_short_task(p))
+ return target;
+
/*
* If the previous CPU is cache affine and idle, don't be stupid:
*/
the basic test case shows almost the same results as the sync one:
$ ./seccomp_bpf 2>&1| grep user_notification_sync
# RUN global.user_notification_sync ...
# seccomp_bpf.c:4281:user_notification_sync:basic: 3082 nsec/syscall
# seccomp_bpf.c:4281:user_notification_sync:sync: 2690 nsec/syscall
# OK global.user_notification_sync
ok 51 global.user_notification_sync
If you want to do any experiments, you can find my tree here:
[1] https://github.com/avagin/linux-task-diag/tree/wip/seccom-notify-sync-and-shed-short-task
Thanks,
Andrei