Message-ID: <877c9zhk68.fsf@oracle.com>
Date: Tue, 22 Oct 2024 15:01:03 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Ankur Arora <ankur.a.arora@...cle.com>
Cc: Marc Zyngier <maz@...nel.org>, linux-pm@...r.kernel.org,
kvm@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
linux-kernel@...r.kernel.org, catalin.marinas@....com, will@...nel.org,
tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
pbonzini@...hat.com, wanpengli@...cent.com, vkuznets@...hat.com,
rafael@...nel.org, daniel.lezcano@...aro.org, peterz@...radead.org,
arnd@...db.de, lenb@...nel.org, mark.rutland@....com,
harisokn@...zon.com, mtosatti@...hat.com, sudeep.holla@....com,
cl@...two.org, misono.tomohiro@...itsu.com, maobibo@...ngson.cn,
joao.m.martins@...cle.com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com
Subject: Re: [PATCH v8 00/11] Enable haltpoll on arm64
Ankur Arora <ankur.a.arora@...cle.com> writes:
> Marc Zyngier <maz@...nel.org> writes:
>
>> On Wed, 16 Oct 2024 22:55:09 +0100,
>> Ankur Arora <ankur.a.arora@...cle.com> wrote:
>>>
>>>
>>> Marc Zyngier <maz@...nel.org> writes:
>>>
>>> > On Thu, 26 Sep 2024 00:24:14 +0100,
>>> > Ankur Arora <ankur.a.arora@...cle.com> wrote:
>>> >>
>>> >> This patchset enables the cpuidle-haltpoll driver and its namesake
>>> >> governor on arm64. This is particularly interesting for KVM guests,
>>> >> where it reduces IPC latencies.
>>> >>
>>> >> Comparing idle switching latencies on an arm64 KVM guest with
>>> >> perf bench sched pipe:
>>> >>
>>> >>                            usecs/op   %stdev
>>> >>
>>> >> no haltpoll (baseline)        13.48   +-  5.19%
>>> >> with haltpoll                  6.84   +- 22.07%
>>> >>
>>> >>
>>> >> No change in performance for a similar test on x86:
>>> >>
>>> >>                                      usecs/op   %stdev
>>> >>
>>> >> haltpoll w/ cpu_relax() (baseline)       4.75   +- 1.76%
>>> >> haltpoll w/ smp_cond_load_relaxed()      4.78   +- 2.31%
>>> >>
>>> >> Both sets of tests were on otherwise idle systems with guest VCPUs
>>> >> pinned to specific PCPUs. One reason for the higher stdev on arm64
>>> >> is that trapping of the WFE instruction by the host KVM is contingent
>>> >> on the number of tasks on the runqueue.
>>> >
>>> > Sorry to state the obvious, but if the variable trapping of
>>> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
>>> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
>>> > physical CPUs, there is no need for any trapping.
>>>
>>> Good point. Thanks. That should help reduce the guessing games around
>>> the variance in these tests.
>>
>> I'd be interested to find out whether there is still some benefit in
>> this series once you disable the WFx trapping heuristics.
>
> The benefit of polling in idle is more than just avoiding the cost of
> trapping and re-entering. The other benefit is that remote wakeups
> can now be done just by setting need-resched, instead of sending an
> IPI and incurring the cost of handling the interrupt on the receiver
> side.
>
> But let me get you some numbers with that.
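
To make that concrete, the remote-wakeup fast path is roughly of the
shape below. This is a simplified sketch modelled on the scheduler code,
not the literal implementation: idle_task_on() is a made-up stand-in for
looking up the target CPU's idle task, and the real code uses atomic
flag updates to close races that this version ignores.

    /* Sketch only: scheduler-internal code, simplified. */
    #include <linux/sched.h>          /* test_tsk_thread_flag(), set_tsk_need_resched() */
    #include <linux/smp.h>            /* smp_send_reschedule() */
    #include <trace/events/sched.h>   /* trace_sched_wake_idle_without_ipi() */

    static void wake_idle_cpu_sketch(int cpu)
    {
            struct task_struct *idle = idle_task_on(cpu);   /* hypothetical helper */

            if (test_tsk_thread_flag(idle, TIF_POLLING_NRFLAG)) {
                    /* Poller notices need-resched on its own: no interrupt needed. */
                    set_tsk_need_resched(idle);
                    trace_sched_wake_idle_without_ipi(cpu);
            } else {
                    /* Idle CPU is in WFI (or halted): send a reschedule IPI. */
                    smp_send_reschedule(cpu);
            }
    }

The sched:sched_wake_idle_without_ipi counts in the numbers below are the
wakeups that took the first branch.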
So, I ran the sched-pipe test with processes on VCPUs 4 and 5, and with
kvm-arm.wfi_trap_policy=notrap set on the host.
# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
perf bench sched pipe -l 1000000 -c 4
# No haltpoll (and, no TIF_POLLING_NRFLAG):
 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         25,229.57 msec task-clock                          #    2.000 CPUs utilized    ( +-  7.75% )
    45,821,250,284      cycles                              #    1.816 GHz              ( +- 10.07% )
    26,557,496,665      instructions                        #    0.58  insn per cycle   ( +-  0.21% )
                 0      sched:sched_wake_idle_without_ipi   #    0.000 /sec

            12.615 +- 0.977 seconds time elapsed  ( +-  7.75% )
# Haltpoll:
 Performance counter stats for 'CPU(s) 4,5' (5 runs):

         15,131.58 msec task-clock                          #    2.000 CPUs utilized    ( +- 10.00% )
    34,158,188,839      cycles                              #    2.257 GHz              ( +-  6.91% )
    20,824,950,916      instructions                        #    0.61  insn per cycle   ( +-  0.09% )
         1,983,822      sched:sched_wake_idle_without_ipi   #  131.105 K/sec            ( +-  0.78% )

             7.566 +- 0.756 seconds time elapsed  ( +- 10.00% )
We get a decent boost just from executing ~20% fewer instructions. I'm
not sure how CPU frequency scaling works in a VM, but we also end up
running at a higher frequency.
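
For reference, the polling loop being measured is the cpuidle poll_idle()
loop (drivers/cpuidle/poll_state.c). The idea in the series is roughly the
sketch below -- a simplified illustration, not the actual patch. It assumes
the caller has already set TIF_POLLING_NRFLAG, and the poll time-limit
handling in the real code is more careful than the bare clock check here.

    /* Sketch only: cpuidle-internal code, simplified. */
    #include <linux/sched/clock.h>    /* local_clock() */
    #include <linux/thread_info.h>    /* current_thread_info(), _TIF_NEED_RESCHED */
    #include <asm/barrier.h>          /* smp_cond_load_relaxed() */

    static void poll_idle_sketch(u64 limit_ns)
    {
            u64 start = local_clock();

            /*
             * Instead of spinning on cpu_relax() and re-checking need_resched(),
             * wait on the thread_info flags word. On arm64 this can be built on
             * LDXR/WFE, so the vCPU mostly sleeps until the waker writes the
             * flags cacheline -- avoiding both the wakeup IPI and the WFI exit.
             */
            smp_cond_load_relaxed(&current_thread_info()->flags,
                                  (VAL & _TIF_NEED_RESCHED) ||
                                  (local_clock() - start > limit_ns));
    }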
--
ankur