Message-ID: <6a70900a-649f-3a4d-2e47-61648bc95666@linux.alibaba.com>
Date:   Fri, 21 Jul 2023 10:58:50 +0800
From:   "Kenan.Liu" <Kenan.Liu@...ux.alibaba.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, luoben@...ux.alibaba.com,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/2] Adjust CFS loadbalance to adapt QEMU CPU
 topology.

Hi Peter, thanks for your attention,

please see my answers to your questions inline:


On 2023/7/20 4:50 PM, Peter Zijlstra wrote:
> On Thu, Jul 20, 2023 at 04:34:11PM +0800, Kenan.Liu wrote:
>> From: "Kenan.Liu" <Kenan.Liu@...ux.alibaba.com>
>>
>> Multithreading workloads in VM with Qemu may encounter an unexpected
>> phenomenon: one hyperthread of a physical core is busy while its sibling
>> is idle. Such as:
> Is this with vCPU pinning? Without that, guest topology makes no sense
> what so ever.


The vCPUs are pinned on the host; the imbalance we observed is inside
the VM, not among the vCPU threads on the host.


>> The main reason is that hyperthread index is consecutive in qemu native x86 CPU
>> model which is different from the physical topology.
> I'm sorry, what? That doesn't make sense. SMT enumeration is all over
> the place for Intel, but some actually do have (n,n+1) SMT. On AMD it's
> always (n,n+1) IIRC.
>
>> As the current kernel scheduler
>> implementation, hyperthread with an even ID number will be picked up in a much
>> higher probability during load-balancing and load-deploying.
> How so?


The SMT topology in the qemu native x86 CPU model is (0,1),…,(n,n+1),…,
but the SMT topology normally seen on a physical machine is like
(0,n),(1,n+1),…, where n is the total number of cores in the machine.
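
As a tiny illustration (made-up helper names, not kernel code), for a
machine with 4 cores the two schemes pair siblings like this:

#include <stdio.h>

#define NCORES 4   /* "n" in the notation above */

/* qemu native x86 model: siblings are (0,1),(2,3),... */
static int sibling_qemu(int cpu)     { return cpu ^ 1; }

/* typical physical enumeration: siblings are (0,n),(1,n+1),... */
static int sibling_physical(int cpu) { return (cpu + NCORES) % (2 * NCORES); }

int main(void)
{
    for (int cpu = 0; cpu < 2 * NCORES; cpu++)
        printf("cpu%d: qemu sibling=%d, physical sibling=%d\n",
               cpu, sibling_qemu(cpu), sibling_physical(cpu));
    return 0;
}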

The imbalance happens when the number of runnable threads is less
than the number of hyperthreads: select_idle_core() is called to
decide which cpu the woken-up task should run on.

select_idle_core() returns the checked cpu number if the whole core
is idle. Otherwise, if either HT of the core is busy,
select_idle_core() clears the whole core out of the cpumask and
checks the next core:

select_idle_core():
     …
     if (idle)           /* every HT of this core was idle */
         return core;

     /* core not fully idle: drop all of its HTs from the search mask */
     cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
     return -1;

In this manner, except at the very beginning of the
for_each_cpu_wrap() loop, the HT with the even ID number is always
checked first, and it is returned to the caller if the whole core is
idle, so the odd-indexed HT almost never has a chance to be selected.

select_idle_cpu():
     …
     for_each_cpu_wrap(cpu, cpus, target + 1) {
         if (has_idle_core) {
             i = select_idle_core(p, cpu, cpus, &idle_cpu);

And this does NOT happen when the SMT topology is (0,n),(1,n+1),…,
because when the loop starts from the bottom half of the cpu numbers,
HTs with larger numbers get checked first, and when it starts from the
top half, their siblings with smaller numbers take first place in the
intra-core search.
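
To make the asymmetry concrete, below is a minimal user-space sketch
(illustration only, not kernel code; the helper names and the 4-core
machine size are made up for the demo). It mimics for_each_cpu_wrap()
plus the whole-core removal of select_idle_core(): with every core
busy except one, it tallies which HT of the idle core gets picked for
each possible wrap start, under both numbering schemes.

#include <stdio.h>

#define NCORES 4
#define NCPUS  (2 * NCORES)

/* The two numbering schemes discussed above. */
static int sib_consec(int cpu) { return cpu ^ 1; }                /* (0,1),(2,3),... */
static int sib_split(int cpu)  { return (cpu + NCORES) % NCPUS; } /* (0,4),(1,5),... */

/* One HT of each core, used to place the idle core. */
static int first_consec(int core) { return 2 * core; }
static int first_split(int core)  { return core; }

/* Visit cpus from "start" in wrap order; return the visited cpu if
 * both HTs of its core are idle, otherwise drop the whole core and
 * keep going, mimicking for_each_cpu_wrap() + select_idle_core(). */
static int scan(int start, int (*sib)(int), const int busy[])
{
    int removed[NCPUS] = { 0 };

    for (int k = 0; k < NCPUS; k++) {
        int cpu = (start + k) % NCPUS;

        if (removed[cpu])
            continue;
        if (!busy[cpu] && !busy[sib(cpu)])
            return cpu;                         /* whole core idle */
        removed[cpu] = removed[sib(cpu)] = 1;   /* clear out the core */
    }
    return -1;
}

static void tally(const char *name, int (*sib)(int), int (*first)(int))
{
    int hits[NCPUS] = { 0 };

    for (int core = 0; core < NCORES; core++) {
        int busy[NCPUS];

        /* every HT busy except the two HTs of "core" */
        for (int cpu = 0; cpu < NCPUS; cpu++)
            busy[cpu] = 1;
        busy[first(core)] = busy[sib(first(core))] = 0;

        for (int target = 0; target < NCPUS; target++) {
            int picked = scan((target + 1) % NCPUS, sib, busy);

            if (picked >= 0)
                hits[picked]++;
        }
    }

    printf("%s:", name);
    for (int cpu = 0; cpu < NCPUS; cpu++)
        printf(" cpu%d=%d", cpu, hits[cpu]);
    printf("\n");
}

int main(void)
{
    tally("consecutive (0,1)(2,3)...", sib_consec, first_consec);
    tally("split (0,4)(1,5)...", sib_split, first_split);
    return 0;
}

With consecutive numbering, the even HT of the idle core is picked for
7 of the 8 possible starts and its odd sibling only once; with split
numbering, every HT is picked 4 times, i.e. the two siblings are
chosen equally often.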


>
>> This RFC targets to solve the problem by adjusting CFS loadbalance policy:
>> 1. Explore CPU topology and adjust CFS loadbalance policy when we found machine
>> with qemu native CPU topology.
>> 2. Export a procfs to control the traverse length when select idle cpu.
>>
>> Kenan.Liu (2):
>>    sched/fair: Adjust CFS loadbalance for machine with qemu native CPU
>>      topology.
>>    sched/fair: Export a param to control the traverse len when select
>>      idle cpu.
> NAK, qemu can either provide a fake topology to the guest using normal
> x86 means (MADT/CPUID) or do some paravirt topology setup, but this is
> quite insane.
Thanks,

Kenan.Liu
