Message-ID: <20221114152756.aqfxp5wlh47ncjwi@suse.de>
Date:   Mon, 14 Nov 2022 15:27:56 +0000
From:   Mel Gorman <mgorman@...e.de>
To:     Tianchen Ding <dtcccc@...ux.alibaba.com>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched: Clear ttwu_pending after enqueue_task

On Fri, Nov 04, 2022 at 10:36:01AM +0800, Tianchen Ding wrote:
> We found a long tail latency in schbench when m*t is close to nr_cpus.
> (e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)
> 
> This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
> too early, and idle_cpu() will return true until the wakee task is enqueued.
> This will mislead the waker when selecting idle cpu, and wake multiple
> worker threads on the same wakee cpu. This situation is enlarged by
> commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on
> wakelist if wakee cpu is idle") because it tends to use wakelist.
> 
> Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
> (Intel(R) Xeon(R) Platinum 8369B).
> 
> Latency percentiles (usec):
>                 base      base+revert_f3dd3f674555   base+this_patch
> 50.0000th:         9                            13                 9
> 75.0000th:        12                            19                12
> 90.0000th:        15                            22                15
> 95.0000th:        18                            24                17
> *99.0000th:       27                            31                24
> 99.5000th:      3364                            33                27
> 99.9000th:     12560                            36                30
> 
> We also tested on unixbench and hackbench, and saw no performance
> change.
> 
> Signed-off-by: Tianchen Ding <dtcccc@...ux.alibaba.com>

I tested this on bare metal across a range of machines. The impact of the
patch is nowhere near as obvious as it is on a VM but even then, schbench
generally benefits (not by as much and not always at all percentiles). The
only workload that appeared to suffer was specjbb2015 but *only* on NUMA
machines; on UMA it was fine, and the benchmark can be a little flaky for
getting stable results anyway. In the few cases where it showed a problem,
the NUMA balancing behaviour was also different so I think it can be ignored.

In most cases it was better than vanilla and better than a revert or at
least made marginal differences that were borderline noise. However, avoiding
stacking tasks due to false positives is also important because even though
that can help performance in some cases (strictly sync wakeups), it's not
necessarily a good idea. So while it's not a universal win, it wins more
than it loses and it appears to be more clearly a win on VMs so on that basis

Acked-by: Mel Gorman <mgorman@...e.de>

-- 
Mel Gorman
SUSE Labs