linux-kernel - Re: Regression on vcpu_is

Message-ID: <625ce99b-8ec3-f807-99ac-1dc32695deca@bytedance.com>
Date:   Fri, 28 Oct 2022 18:21:11 +0800
From:   Abel Wu <wuyun.abel@...edance.com>
To:     Miaohe Lin <linmiaohe@...wei.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, rohit.k.jain@...cle.com
Cc:     dietmar.eggemann@....com, Steven Rostedt <rostedt@...dmis.org>,
        bsegall@...gle.com, mgorman@...e.de, bristot@...hat.com,
        vschneid@...hat.com, linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: Regression on vcpu_is_preempted()

Hi Miaohe,

On 10/28/22 4:48 PM, Miaohe Lin wrote:
> Hi all scheduler experts:
>    When we run java gc in our 8 vcpus guest *without KVM_FEATURE_STEAL_TIME enabled*, the output looks like below:
>      With ParallelGCThreads=4 and ConcGCThreads=4, we have:
> 	G1 Young Generation: 1 times 1786 ms
> 	G1 Old Generation: 1 times 1022 ms
>      With ParallelGCThreads=5 and ConcGCThreads=5, we have:
> 	G1 Young Generation: 1 times 1557 ms
> 	G1 Old Generation: 1 times 1020 ms
> 
>    This meets our expectation. But *with KVM_FEATURE_STEAL_TIME enabled* in our guest, the output looks like this:
>      With ParallelGCThreads=4 and ConcGCThreads=4, we have:
> 	G1 Young Generation: 1 times 1637 ms
> 	G1 Old Generation: 1 times 1022 ms
>      With ParallelGCThreads=5 and ConcGCThreads=5, we have:
> 	G1 Young Generation: 1 times 2164 ms
> 				      ^^^^
> 	G1 Old Generation: 1 times 1024 ms
> 
>    The duration of G1 Young Generation is far beyond our expectation when gc threads = 5. And we found the root cause
> is that when KVM_FEATURE_STEAL_TIME is enabled *there are many more (3k+) cpu migrations for java gc threads*. It's due to
> the below commit:
> 
>    commit 247f2f6f3c706b40b5f3886646f3eb53671258bf
>    Author: Rohit Jain <rohit.k.jain@...cle.com>
>    Date:   Wed May 2 13:52:10 2018 -0700
> 
>      sched/core: Don't schedule threads on pre-empted vCPUs
> 
>      In paravirt configurations today, spinlocks figure out whether a vCPU is
>      running to determine whether or not spinlock should bother spinning. We
>      can use the same logic to prioritize CPUs when scheduling threads. If a
>      vCPU has been pre-empted, it will incur the extra cost of VMENTER and
>      the time it actually spends to be running on the host CPU. If we had
>      other vCPUs which were actually running on the host CPU and idle we
>      should schedule threads there.
> 
>    When the scheduler tries to select a CPU to run the gc thread, available_idle_cpu() will check vcpu_is_preempted().
> It will choose another vcpu to run gc threads when the current vcpu is preempted. But the preempted vcpu has no other work
> to do except continuing to do gc. In our guest, there are more vcpus than java gc threads. So there could always be some
> available vcpus when the scheduler tries to select an idle vcpu (running on host). This leads to lots of cpu migrations and results
> in regression.

So you want the preempted idle cpus to run gc threads to maximize the
gc throughput, but available_idle_cpu() keeps them from being selected.
In theory, load balancing will help spread load to these cpus (and
make them VMENTERed), so have you checked whether the gc threads show a
tendency to stack on the same cpus?
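
To make the selection behaviour under discussion concrete, here is a
minimal userspace sketch. It is a toy model, not the kernel code: the
struct and every toy_* name are hypothetical, and the selection loop is
a simplification that only illustrates why a preempted idle vcpu gets
skipped. It assumes vcpu_is_preempted() can only report true when
KVM_FEATURE_STEAL_TIME is enabled in the guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vcpu state; the kernel does not keep it in this form. */
struct toy_vcpu {
	bool idle;       /* nothing runnable on this vcpu */
	bool preempted;  /* the host has descheduled this vcpu */
};

/* Assumption: without steal time the guest cannot see host preemption. */
static bool toy_vcpu_is_preempted(const struct toy_vcpu *v, bool steal_time)
{
	return steal_time && v->preempted;
}

/* Mirrors the check described above: idle and not preempted. */
static bool toy_available_idle_cpu(const struct toy_vcpu *v, bool steal_time)
{
	if (!v->idle)
		return false;
	if (toy_vcpu_is_preempted(v, steal_time))
		return false;
	return true;
}

/*
 * Simplified selection: keep the previous cpu if it qualifies, otherwise
 * take the first qualifying cpu, otherwise fall back to the previous cpu.
 */
static size_t toy_select_cpu(const struct toy_vcpu *cpus, size_t ncpus,
			     size_t prev, bool steal_time)
{
	assert(ncpus > 0 && prev < ncpus);

	if (toy_available_idle_cpu(&cpus[prev], steal_time))
		return prev;

	for (size_t i = 0; i < ncpus; i++)
		if (toy_available_idle_cpu(&cpus[i], steal_time))
			return i;

	return prev;
}
```

In this model, a thread whose previous vcpu is idle but preempted stays
put when steal time is off and migrates to another idle vcpu when it is
on, which matches the extra migrations reported above.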