netdev - Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZVoNw4ta_1oWIrKX@archie.me>
Date: Sun, 19 Nov 2023 20:29:39 +0700
From: Bagas Sanjaya <bagasdotme@...il.com>
To: Tobias Huschle <huschle@...ux.ibm.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Linux KVM <kvm@...r.kernel.org>,
	Linux Virtualization <virtualization@...ts.linux.dev>,
	Linux Networking <netdev@...r.kernel.org>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...nel.org>,
	Abel Wu <wuyun.abel@...edance.com>,
	Honglei Wang <wanghonglei@...ichuxing.com>, mst@...hat.com,
	jasowang@...hat.com
Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add
 lag based placement)

On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:
> Hi,
> 
> when testing the EEVDF scheduler we stumbled upon a performance regression
> in a uperf scenario and would like to
> kindly ask for feedback on whether we are going into the right direction
> with our analysis so far.
> 
> The base scenario are two KVM guests running on an s390 LPAR. One guest
> hosts the uperf server, one the uperf client.
> With EEVDF we observe a regression of ~50% for a strburst test.
> For a more detailed description of the setup see the section TEST SUMMARY at
> the bottom.
> 
> Bisecting led us to the following commit which appears to introduce the
> regression:
> 86bfbb7ce4f6 sched/fair: Add lag based placement
> 
> We then compared the last good commit we identified with a recent level of
> the devel branch.
> The issue still persists on 6.7 rc1 although there is some improvement (down
> from 62% regression to 49%)
> 
> All analysis described further are based on a 6.6 rc7 kernel.
> 
> We sampled perf data to get an idea on what is going wrong and ended up
> seeing an dramatic increase in the maximum
> wait times from 3ms up to 366ms. See section WAIT DELAYS below for more
> details.
> 
> We then collected tracing data to get a better insight into what is going
> on.
> The trace excerpt in section TRACE EXCERPT shows one example (of multiple
> per test run) of the problematic scenario where
> a kworker(pid=6525) has to wait for 39,718 ms.
> 
> Short summary:
> The mentioned kworker has been scheduled to CPU 14 before the tracing was
> enabled.
> A vhost process is migrated onto CPU 14.
> The vruntimes of kworker and vhost differ significantly (86642125805 vs
> 4242563284 -> factor 20)
> The vhost process wants to wake up the kworker, therefore the kworker is
> placed onto the runqueue again and set to runnable.
> The vhost process continues to execute, waking up other vhost processes on
> other CPUs.
> 
> So far this behavior is not different to what we see on pre-EEVDF kernels.
> 
> On timestamp 576.162767, the vhost process triggers the last wake up of
> another vhost on another CPU.
> Until timestamp 576.171155, we see no other activity. Now, the vhost process
> ends its time slice.
> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
> off to CPU 15.
> This does not occur with older kernels.
> The kworker has to wait for the migration to happen in order to be able to
> execute again.
> This is due to the fact, that the vruntime of the kworker is significantly
> larger than the one of vhost.
> 
> 
> We observed the large difference in vruntime between kworker and vhost in
> the same magnitude on
> a kernel built based on the parent of the commit mentioned above.
> With EEVDF, the kworker is doomed to wait until the vhost either catches up
> on vruntime (which would take 86 seconds)
> or the vhost is migrated off of the CPU.
> 
> We found some options which sound plausible but we are not sure if they are
> valid or not:
> 
> 1. The wake up path has a dependency on the vruntime metrics that now delays
> the execution of the kworker.
> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
> which updates the way cfs_rq->min_vruntime and
>     cfs_rq->avg_runtime are set might have introduced an issue which is
> uncovered with the commit mentioned above.
> 3. An assumption in the vhost code which causes vhost to rely on being
> scheduled off in time to allow the kworker to proceed.
> 
> We also stumbled upon the following mailing thread:
> https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
> That conversation, and the patches derived from it lead to the assumption
> that the wake up path might be adjustable in a way
> that this case in particular can be addressed.
> At the same time, the vast difference in vruntimes is concerning since, at
> least for some time frame, both processes are on the runqueue.
> 
> We would be glad to hear some feedback on which paths to pursue and which
> might just be a dead end in the first place.
> 
> 
> #################### TRACE EXCERPT ####################
> The sched_place trace event was added to the end of the place_entity
> function and outputs:
> sev -> sched_entity vruntime
> sed -> sched_entity deadline
> sel -> sched_entity vlag
> avg -> cfs_rq avg_vruntime
> min -> cfs_rq min_vruntime
> cpu -> cpu of cfs_rq
> nr  -> cfs_rq nr_running
> ---
>     CPU 3/KVM-2950    [014] d....   576.161432: sched_migrate_task:
> comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
> --> migrates task from cpu 15 to 14
>     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm=vhost-2920
> pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284
> cpu=14 nr=0
> --> places vhost 2920 on CPU 14 with vruntime 4242563284
>     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0
> sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14
> nr=0
>     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0
> sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157 cpu=14
> nr=0
>     CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0
> sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372 cpu=14
> nr=0
>     CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0
> sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980 cpu=14
> nr=0
>     CPU 3/KVM-2950    [014] dN...   576.161434: sched_stat_wait:
> comm=vhost-2920 pid=2941 delay=9958 [ns]
>     CPU 3/KVM-2950    [014] d....   576.161435: sched_switch: prev_comm=CPU
> 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> next_comm=vhost-2920
> next_pid=2941 next_prio=120
>    vhost-2920-2941    [014] D....   576.161439: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>    vhost-2920-2941    [014] d....   576.161446: sched_waking:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>    vhost-2920-2941    [014] d....   576.161447: sched_place:
> comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0
> avg=86642125805 min=86642125805 cpu=14 nr=1
> --> places kworker 6525 on cpu 14 with vruntime 86642125805
> -->  which is far larger than vhost vruntime of  4242563284
>    vhost-2920-2941    [014] d....   576.161447: sched_stat_blocked:
> comm=kworker/14:0 pid=6525 delay=10143757 [ns]
>    vhost-2920-2941    [014] dN...   576.161447: sched_wakeup:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>    vhost-2920-2941    [014] dN...   576.161448: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=13884 [ns] vruntime=4242577168 [ns]
> --> vhost 2920 finishes after 13884 ns of runtime
>    vhost-2920-2941    [014] dN...   576.161448: sched_stat_wait:
> comm=kworker/14:0 pid=6525 delay=0 [ns]
>    vhost-2920-2941    [014] d....   576.161448: sched_switch:
> prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
> next_comm=kworker/14:0 next_pid=6525 next_prio=120
> --> switch to kworker
>  kworker/14:0-6525    [014] d....   576.161449: sched_waking: comm=CPU 2/KVM
> pid=2949 prio=120 target_cpu=007
>  kworker/14:0-6525    [014] d....   576.161450: sched_stat_runtime:
> comm=kworker/14:0 pid=6525 runtime=3714 [ns] vruntime=86642129519 [ns]
> --> kworker finshes after 3714 ns of runtime
>  kworker/14:0-6525    [014] d....   576.161450: sched_stat_wait:
> comm=vhost-2920 pid=2941 delay=3714 [ns]
>  kworker/14:0-6525    [014] d....   576.161451: sched_switch:
> prev_comm=kworker/14:0 prev_pid=6525 prev_prio=120 prev_state=I ==>
> next_comm=vhost-2920 next_pid=2941 next_prio=120
> --> switch back to vhost
>    vhost-2920-2941    [014] d....   576.161478: sched_waking:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>    vhost-2920-2941    [014] d....   576.161478: sched_place:
> comm=kworker/14:0 pid=6525 sev=86642191859 sed=86645191859 sel=-1150
> avg=86642188144 min=86642188144 cpu=14 nr=1
> --> kworker placed again on cpu 14 with vruntime 86642191859, the problem
> occurs only if lag <= 0, having lag=0 does not always hit the problem though
>    vhost-2920-2941    [014] d....   576.161478: sched_stat_blocked:
> comm=kworker/14:0 pid=6525 delay=27943 [ns]
>    vhost-2920-2941    [014] d....   576.161479: sched_wakeup:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>    vhost-2920-2941    [014] D....   576.161511: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.161512: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>    vhost-2920-2941    [014] D....   576.161516: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.161773: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.161775: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>    vhost-2920-2941    [014] D....   576.162103: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.162105: sched_waking:
> comm=vhost-2286 pid=2307 prio=120 target_cpu=021
>    vhost-2920-2941    [014] D....   576.162326: sched_waking:
> comm=vhost-2286 pid=2305 prio=120 target_cpu=004
>    vhost-2920-2941    [014] D....   576.162437: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
>    vhost-2920-2941    [014] D....   576.162767: sched_waking:
> comm=vhost-2286 pid=2305 prio=120 target_cpu=004
>    vhost-2920-2941    [014] d.h..   576.171155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=9704465 [ns] vruntime=4252281633 [ns]
>    vhost-2920-2941    [014] d.h..   576.181155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=10000377 [ns] vruntime=4262282010 [ns]
>    vhost-2920-2941    [014] d.h..   576.191154: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=9999514 [ns] vruntime=4272281524 [ns]
>    vhost-2920-2941    [014] d.h..   576.201155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=10000246 [ns] vruntime=4282281770 [ns]
> --> vhost gets rescheduled multiple times because its vruntime is
> significantly smaller than the vruntime of the kworker
>    vhost-2920-2941    [014] dNh..   576.201176: sched_wakeup:
> comm=migration/14 pid=85 prio=0 target_cpu=014
>    vhost-2920-2941    [014] dN...   576.201191: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=25190 [ns] vruntime=4282306960 [ns]
>    vhost-2920-2941    [014] d....   576.201192: sched_switch:
> prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
> next_comm=migration/14 next_pid=85 next_prio=0
>  migration/14-85      [014] d..1.   576.201194: sched_migrate_task:
> comm=vhost-2920 pid=2941 prio=120 orig_cpu=14 dest_cpu=15
> --> vhost gets migrated off of cpu 14
>  migration/14-85      [014] d..1.   576.201194: sched_place: comm=vhost-2920
> pid=2941 sev=3198666923 sed=3201666923 sel=0 avg=3198666923 min=3198666923
> cpu=15 nr=0
>  migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0
> sev=12775683594 sed=12779398224 sel=0 avg=12775683594 min=12775683594 cpu=15
> nr=0
>  migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0
> sev=33655559178 sed=33661025369 sel=0 avg=33655559178 min=33655559178 cpu=15
> nr=0
>  migration/14-85      [014] d..1.   576.201195: sched_place: comm= pid=0
> sev=42240572785 sed=42244083642 sel=0 avg=42240572785 min=42240572785 cpu=15
> nr=0
>  migration/14-85      [014] d..1.   576.201196: sched_place: comm= pid=0
> sev=70190876523 sed=70194789898 sel=-13068763 avg=70190876523
> min=70190876523 cpu=15 nr=0
>  migration/14-85      [014] d....   576.201198: sched_stat_wait:
> comm=kworker/14:0 pid=6525 delay=39718472 [ns]
>  migration/14-85      [014] d....   576.201198: sched_switch:
> prev_comm=migration/14 prev_pid=85 prev_prio=0 prev_state=S ==>
> next_comm=kworker/14:0 next_pid=6525 next_prio=120
>  --> only now, kworker is eligible to run again, after a delay of 39718472
> ns
>  kworker/14:0-6525    [014] d....   576.201200: sched_waking: comm=CPU 0/KVM
> pid=2947 prio=120 target_cpu=012
>  kworker/14:0-6525    [014] d....   576.201290: sched_stat_runtime:
> comm=kworker/14:0 pid=6525 runtime=92941 [ns] vruntime=86642284800 [ns]
> 
> #################### WAIT DELAYS - PERF LATENCY ####################
> last good commit --> perf sched latency -s max
> -------------------------------------------------------------------------------------------------------------------------------------------
>   Task                  |   Runtime ms  | Switches | Avg delay ms    | Max
> delay ms    | Max delay start           | Max delay end          |
> -------------------------------------------------------------------------------------------------------------------------------------------
>   CPU 2/KVM:(2)         |   5399.650 ms |   108698 | avg:   0.003 ms | max:
> 3.077 ms | max start:   544.090322 s | max end:   544.093399 s
>   CPU 7/KVM:(2)         |   5111.132 ms |    69632 | avg:   0.003 ms | max:
> 2.980 ms | max start:   544.690994 s | max end:   544.693974 s
>   kworker/22:3-ev:723   |    342.944 ms |    63417 | avg:   0.005 ms | max:
> 1.880 ms | max start:   545.235430 s | max end:   545.237310 s
>   CPU 0/KVM:(2)         |   8171.431 ms |   433099 | avg:   0.003 ms | max:
> 1.004 ms | max start:   547.970344 s | max end:   547.971348 s
>   CPU 1/KVM:(2)         |   5486.260 ms |   258702 | avg:   0.003 ms | max:
> 1.002 ms | max start:   548.782514 s | max end:   548.783516 s
>   CPU 5/KVM:(2)         |   4766.143 ms |    65727 | avg:   0.003 ms | max:
> 0.997 ms | max start:   545.313610 s | max end:   545.314607 s
>   vhost-2268:(6)        |  13206.503 ms |   315030 | avg:   0.003 ms | max:
> 0.989 ms | max start:   550.887761 s | max end:   550.888749 s
>   vhost-2892:(6)        |  14467.268 ms |   214005 | avg:   0.003 ms | max:
> 0.981 ms | max start:   545.213819 s | max end:   545.214800 s
>   CPU 3/KVM:(2)         |   5538.908 ms |    85105 | avg:   0.003 ms | max:
> 0.883 ms | max start:   547.138139 s | max end:   547.139023 s
>   CPU 6/KVM:(2)         |   5289.827 ms |    72301 | avg:   0.003 ms | max:
> 0.836 ms | max start:   551.094590 s | max end:   551.095425 s
> 
> 6.6 rc7 --> perf sched latency -s max
> -------------------------------------------------------------------------------------------------------------------------------------------
>   Task                  |   Runtime ms  | Switches | Avg delay ms    | Max
> delay ms    | Max delay start           | Max delay end          |
> -------------------------------------------------------------------------------------------------------------------------------------------
>   kworker/19:2-ev:1071  |     69.482 ms |    12700 | avg:   0.050 ms | max:
> 366.314 ms | max start: 54705.674294 s | max end: 54706.040607 s
>   kworker/13:1-ev:184   |     78.048 ms |    14645 | avg:   0.067 ms | max:
> 287.738 ms | max start: 54710.312863 s | max end: 54710.600602 s
>   kworker/12:1-ev:46148 |    138.488 ms |    26660 | avg:   0.021 ms | max:
> 147.414 ms | max start: 54706.133161 s | max end: 54706.280576 s
>   kworker/16:2-ev:33076 |    149.175 ms |    29491 | avg:   0.026 ms | max:
> 139.752 ms | max start: 54708.410845 s | max end: 54708.550597 s
>   CPU 3/KVM:(2)         |   1934.714 ms |    41896 | avg:   0.007 ms | max:
> 92.126 ms | max start: 54713.158498 s | max end: 54713.250624 s
>   kworker/7:2-eve:17001 |     68.164 ms |    11820 | avg:   0.045 ms | max:
> 69.717 ms | max start: 54707.100903 s | max end: 54707.170619 s
>   kworker/17:1-ev:46510 |     68.804 ms |    13328 | avg:   0.037 ms | max:
> 67.894 ms | max start: 54711.022711 s | max end: 54711.090605 s
>   kworker/21:1-ev:45782 |     68.906 ms |    13215 | avg:   0.021 ms | max:
> 59.473 ms | max start: 54709.351135 s | max end: 54709.410608 s
>   ksoftirqd/17:101      |      0.041 ms |        2 | avg:  25.028 ms | max:
> 50.047 ms | max start: 54711.040578 s | max end: 54711.090625 s
> 
> #################### TEST SUMMARY ####################
>  Setup description:
> - single KVM host with 2 identical guests
> - guests are connected virtually via Open vSwitch
> - guests run uperf streaming read workload with 50 parallel connections
> - one guests acts as uperf client, the other one as uperf server
> 
> Regression:
> kernel-6.5.0-rc2: 78 Gb/s (before 86bfbb7ce4f6 sched/fair: Add lag based
> placement)
> kernel-6.5.0-rc2: 29 Gb/s (with 86bfbb7ce4f6 sched/fair: Add lag based
> placement)
> kernel-6.7.0-rc1: 41 Gb/s
> 
> KVM host:
> - 12 dedicated IFLs, SMT-2 (24 Linux CPUs)
> - 64 GiB memory
> - FEDORA 38
> - kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
> audit_debug=0 selinux=0
> 
> KVM guests:
> - 8 vCPUs
> - 8 GiB memory
> - RHEL 9.2
> - kernel: 5.14.0-162.6.1.el9_1.s390x
> - kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
> audit_debug=0 selinux=0
> 
> Open vSwitch:
> - Open vSwitch with 2 ports, each with mtu=32768 and qlen=15000
> - Open vSwitch ports attached to guests via virtio-net
> - each guest has 4 vhost-queues
> 
> Domain xml snippet for Open vSwitch port:
> <interface type="bridge" dev="OVS">
>   <source bridge="vswitch0"/>
>   <mac address="02:bb:97:28:02:02"/>
>   <virtualport type="openvswitch"/>
>   <model type="virtio"/>
>   <target dev="vport1"/>
>   <driver name="vhost" queues="4"/>
>   <address type="ccw" cssid="0xfe" ssid="0x0" devno="0x0002"/>
> </interface>
> 
> Benchmark: uperf
> - workload: str-readx30k, 50 active parallel connections
> - uperf server permanently sends data in 30720-byte chunks
> - uperf client receives and acknowledges this data
> - Server: uperf -s
> - Client: uperf -a -i 30 -m uperf.xml
> 
> uperf.xml:
> <?xml version="1.0"?>
> <profile name="strburst">
>   <group nprocs="50">
>     <transaction iterations="1">
>       <flowop type="connect" options="remotehost=10.161.28.3 protocol=tcp
> "/>
>     </transaction>
>     <transaction duration="300">
>       <flowop type="read" options="count=640 size=30k"/>
>     </transaction>
>     <transaction iterations="1">
>       <flowop type="disconnect" />
>     </transaction>
>   </group>
> </profile>

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: 86bfbb7ce4f67a

-- 
An old man doll... just what I always wanted! - Clara

Download attachment "signature.asc" of type "application/pgp-signature" (229 bytes)