Message-ID: <0e64ed24-e676-6cfc-376f-f404e759f6f1@redhat.com>
Date: Thu, 21 Apr 2022 12:21:10 -0400
From: Nico Pache <npache@...hat.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, aquini@...hat.com,
shakeelb@...gle.com, mhocko@...e.com, hakavlad@...ox.lv,
David Hildenbrand <david@...hat.com>, llong@...hat.com
Subject: Re: [PATCH v3] vm_swappiness=0 should still try to avoid swapping
anon memory
On 4/20/22 14:44, Johannes Weiner wrote:
>>
>> The larger issue is that our workload has regressed in performance.
>>
>> With V2 and swappiness=10 we are still seeing some swap, but very little tearing
>> down of THPs over time. With swappiness=0 it did some when swap but we are not
>> losings GBs of THPS (with your patch swappiness=0 has swap or THP issues on V2).
I meant to say `with your patch swappiness=0 does not swap or have thp issues on v2`
>>
>> With V1 and swappiness=(0|10)(with and without your patch), it swaps a ton and
>> ultimately leads to a significant amount of THP splitting. So the longer the
>> system/workload runs, the less likely we are to get THPs backing the guest and
>> the performance gain from THPs is lost.
>
> I hate to ask, but is it possible this is a configuration issue
I'm very glad you asked :)
>
> One significant difference between V1 and V2 is that V1 has per-cgroup
> swappiness, which is inherited when the cgroup is created. So if you
> set sysctl vm.swappiness=0 after cgroups have been created, it will
> not update them. V2 cgroups do use vm.swappiness:
This is something I did not consider... Thank you for pointing that out!
The issue still occurs whether or not I set the swappiness value before the VM
boots. However, this led me to find the icing on the cake :)
Even though I set vm.swappiness=0 at boot using sysctl.conf, I was not considering
the fact that libvirtd creates its own cgroup for the machines it starts...
Additionally, that cgroup does not inherit the sysctl value (even when it is set at
boot)?!? How annoying...
The cgroup's swappiness value defaults to 60. This seems wrong to me from a
libvirt/systemd POV. If the system is booted with swappiness=0, then why does the
(user|machine|system).slice cgroup ignore this value when it creates its cgroups
(see below)?
I will have to dig a little further to find a cause/fix for this. This requires
libvirt users to understand a number of intricacies that they really
shouldn't have to consider, and may lead to headaches like these ;P
Values of the memcgs created on boot (with sysctl.swappiness=0 on V1 boot)
------------------------------------------------------------------------
/sys/fs/cgroup/memory/memory.swappiness =0
/sys/fs/cgroup/memory/dev-hugepages.mount/memory.swappiness =60
/sys/fs/cgroup/memory/dev-mqueue.mount/memory.swappiness =60
/sys/fs/cgroup/memory/machine.slice/memory.swappiness =60
/sys/fs/cgroup/memory/proc-sys-fs-binfmt_misc.mount/memory.swappiness =0
/sys/fs/cgroup/memory/sys-fs-fuse-connections.mount/memory.swappiness =0
/sys/fs/cgroup/memory/sys-kernel-config.mount/memory.swappiness =0
/sys/fs/cgroup/memory/sys-kernel-debug.mount/memory.swappiness =60
/sys/fs/cgroup/memory/sys-kernel-tracing.mount/memory.swappiness =60
/sys/fs/cgroup/memory/system.slice/memory.swappiness =60
/sys/fs/cgroup/memory/user.slice/memory.swappiness =60
Some seem to inherit the cgroup/memory/memory.swappiness value and some do
not... This issue was brought up in a systemd issue with no solution or
documentation [1].
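A listing like the one above can be reproduced with a short loop over the v1 memory controller hierarchy. This is only a minimal sketch, not necessarily the command used to produce the table; the helper name `list_swappiness` is hypothetical, and the default path assumes the v1 memory controller is mounted at /sys/fs/cgroup/memory.

```shell
#!/bin/sh
# Hypothetical helper: print "<path> =<value>" for the root memcg and for
# each first-level child cgroup under it.
# Assumption: cgroup v1 memory controller mounted at /sys/fs/cgroup/memory.
list_swappiness() {
    root=${1:-/sys/fs/cgroup/memory}
    for f in "$root"/memory.swappiness "$root"/*/memory.swappiness; do
        # Skip unmatched globs and unreadable files.
        [ -r "$f" ] || continue
        printf '%s =%s\n' "$f" "$(cat "$f")"
    done
}
```

Calling `list_swappiness` with no argument walks the default mount point; passing a directory as the first argument lets it run against any other hierarchy root.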
Libvirt in particular is using the machine.slice cgroup, so it inherits the 60.
If I change that value to 0 and then start the machine, it now has swappiness 0.
$ echo 0 > /sys/fs/cgroup/memory/machine.slice/memory.swappiness
$ virsh start <guest-name>
$ cat /sys/fs/cgroup/memory/machine.slice/machine-qemu.scope/memory.swappiness
0
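Until the inheritance behaviour is sorted out, one possible stopgap is to push the desired value into every existing child cgroup before any guest is started. The sketch below is an assumption about how that could be scripted, not something libvirt or systemd provides; `set_swappiness_tree` is a hypothetical name, the root memcg is deliberately left alone, and writing the real files requires root.

```shell
#!/bin/sh
# Hypothetical helper: write VALUE into memory.swappiness of every cgroup
# below ROOT, skipping ROOT's own file (-mindepth 2).
# Assumption: cgroup v1 memory controller mounted at /sys/fs/cgroup/memory.
set_swappiness_tree() {
    value=$1
    root=${2:-/sys/fs/cgroup/memory}
    find "$root" -mindepth 2 -name memory.swappiness 2>/dev/null |
    while IFS= read -r f; do
        echo "$value" > "$f" || echo "failed to update $f" >&2
    done
}

# Example (as root): set_swappiness_tree 0
```

This only touches cgroups that already exist when it runs, so it would have to run after the slices are created and before the guests start.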
Thank you so much for your very insightful note that led to the real issue :)
> Thanks for verifying. I'll prepare a proper patch.
My issue with v1 vs v2 seems to go away with a much more sane value of
swappiness=10 on v1 (when actually set properly lol).
Also, as per my results below, I actually don't think your patch caused much
change to my workload. I'm not sure what happened the first time I ran it that
caused the swapping on v2 (before your patch)... perhaps I ran the older kernel
(~v5.14) that was still having issues with v2, or it's the fact that the results
can differ between runs. Sorry about that.
Here are the test results for your patch with V1 and V2 (swappiness=0/10):
Before Patch
-------------
-- V1(swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257465704     1100160        4224     5505568     5064964
Swap:       4194300       47828     4146472
Node 0 AnonPages: 128068580 kB Node 1 AnonPages: 128120400 kB
Node 0 AnonHugePages: 128012288 kB Node 1 AnonHugePages: 128090112 kB
^^^^^ no loss
-- V1(swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257364436      972048        3972     5734948     5164520
Swap:       4194300      235028     3959272
Node 0 AnonPages: 128015404 kB Node 1 AnonPages: 128002788 kB
Node 0 AnonHugePages: 128002048 kB Node 1 AnonHugePages: 120576000 kB
^^^^^ some loss
-- V2(swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257609352      924692        4664     5537388     4921236
Swap:       4194300           0     4194300
^^^^^ No Swap
Node 0 AnonPages: 128083104 kB Node 1 AnonPages: 128180576 kB
Node 0 AnonHugePages: 128002048 kB Node 1 AnonHugePages: 128124928 kB
^^^^^ No loss
-- V2(swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257407576      918124        4632     5745732     5101764
Swap:       4194300      220424     3973876
^^^^^ Some Swap
Node 0 AnonPages: 128109700 kB Node 1 AnonPages: 127918164 kB
Node 0 AnonHugePages: 128006144 kB Node 1 AnonHugePages: 120569856 kB
^^^^^ some loss
After Patch
-------------
-- V1(swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257538832      945276        4368     5587324     4991852
Swap:       4194300        9276     4185024
Node 0 AnonPages: 128133932 kB Node 1 AnonPages: 128100540 kB
Node 0 AnonHugePages: 128047104 kB Node 1 AnonHugePages: 128061440 kB
-- V1(swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257428564      969252        4384     5673616     5100824
Swap:       4194300      138936     4055364
^^^^^ Some Swap
Node 0 AnonPages: 128161724 kB Node 1 AnonPages: 127945368 kB
Node 0 AnonHugePages: 128043008 kB Node 1 AnonHugePages: 120221696 kB
^^^^^ some loss
-- V2(swappiness=0):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257536896      927424        4664     5607112     4993184
Swap:       4194300           0     4194300
Node 0 AnonPages: 128145476 kB Node 1 AnonPages: 128111908 kB
Node 0 AnonHugePages: 128026624 kB Node 1 AnonHugePages: 128090112 kB
-- V2(swappiness=10):
              total        used        free      shared  buff/cache   available
Mem:      264071432   257423936     1007076        4548     5640420     5106544
Swap:       4194300      156016     4038284
Node 0 AnonPages: 128133264 kB Node 1 AnonPages: 127955952 kB
Node 0 AnonHugePages: 128018432 kB Node 1 AnonHugePages: 122507264 kB
^^^^ slightly better
The only notable difference between before/after your patch is that with your
patch the THP tearing was slightly better, resulting in an extra 2GB as seen in
the last result. This may just be noise.
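To put a number on that difference: Node 1 AnonHugePages for V2 with swappiness=10 is 120569856 kB before the patch and 122507264 kB after it. The small helper below (hypothetical name `kb_delta_gib`) just makes the kB-to-GiB arithmetic explicit; it is an illustration of the calculation, not part of the test setup.

```shell
#!/bin/sh
# Hypothetical helper: difference between two kB values, printed in GiB
# with two decimals. 1 GiB = 1024 * 1024 kB.
kb_delta_gib() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.2f\n", (after - before) / (1024 * 1024) }'
}

# Node 1 AnonHugePages, V2 swappiness=10: before vs after the patch.
kb_delta_gib 120569856 122507264   # prints 1.85
```

That is 1937408 kB, or about 1.85 GiB, which matches the "extra 2GB" above once rounded.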
I'll have to see if I can find a fix for this in either the kernel, libvirt, or
systemd, and will follow up if I do. If not, this should at least be documented
correctly. Given the fact that cgroup V1 is in limited support mode upstream, and
systemd's hesitancy to make changes for V1, we may have to go down our own
avenues to ensure our customers don't run into this issue.
Big Thanks!
-- Nico
[1] - https://github.com/systemd/systemd/issues/9276