Message-ID: <0e64ed24-e676-6cfc-376f-f404e759f6f1@redhat.com>
Date:   Thu, 21 Apr 2022 12:21:10 -0400
From:   Nico Pache <npache@...hat.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     linux-mm@...ck.org, akpm@...ux-foundation.org,
        linux-kernel@...r.kernel.org, aquini@...hat.com,
        shakeelb@...gle.com, mhocko@...e.com, hakavlad@...ox.lv,
        David Hildenbrand <david@...hat.com>, llong@...hat.com
Subject: Re: [PATCH v3] vm_swappiness=0 should still try to avoid swapping
 anon memory



On 4/20/22 14:44, Johannes Weiner wrote:
>>
>> The larger issue is that our workload has regressed in performance.
>>
>> With V2 and swappiness=10 we are still seeing some swap, but very little tearing
>> down of THPs over time. With swappiness=0 it did some when swap but we are not
>> losings GBs of THPS (with your patch swappiness=0 has swap or THP issues on V2).
I meant to say `with your patch swappiness=0 does not swap or have thp issues on v2`
>>
>> With V1 and swappiness=(0|10)(with and without your patch), it swaps a ton and
>> ultimately leads to a significant amount of THP splitting. So the longer the
>> system/workload runs, the less likely we are to get THPs backing the guest and
>> the performance gain from THPs is lost.
> 
> I hate to ask, but is it possible this is a configuration issue?
I'm very glad you asked :)
> 
> One significant difference between V1 and V2 is that V1 has per-cgroup
> swappiness, which is inherited when the cgroup is created. So if you
> set sysctl vm.swappiness=0 after cgroups have been created, it will
> not update them. V2 cgroups do use vm.swappiness:

This is something I did not consider... Thank you for pointing that out!
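
If I understand the v1 semantics right, a quick session along these lines
illustrates the non-inheritance (a sketch, not captured output):

$ sysctl -w vm.swappiness=0
$ mkdir /sys/fs/cgroup/memory/demo    # child copies its parent's value at creation
$ cat /sys/fs/cgroup/memory/demo/memory.swappiness
0
$ sysctl -w vm.swappiness=60          # later sysctl writes do not propagate...
$ cat /sys/fs/cgroup/memory/demo/memory.swappiness
0                                     # ...so the existing cgroup keeps 0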

The issue still occurs whether or not I set the swappiness value before the VM
boots. However, this led me to find the icing on the cake :)

Even though I set vm.swappiness=0 at boot via sysctl.conf, I was not considering
the fact that libvirtd creates its own cgroup for the machines you start with
it... and, additionally, that cgroup does not inherit the sysctl value (even
when it is set at boot)?!? How annoying...

The cgroup's swappiness value defaults to 60. This seems wrong to me from a
libvirt/systemd POV: if the system is booted with swappiness=0, why does the
(user|machine|system).slice cgroup ignore this value when it creates its
cgroups (see below)?

I will have to dig a little further to find a cause/fix for this. As it stands,
libvirt users must understand a number of intricacies that they really
shouldn't have to consider, and that may lead to headaches like these ;P
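
For reference, a one-line scan like the following (standard tools, nothing
libvirt-specific; illustrative) gathers the values dumped below:

$ find /sys/fs/cgroup/memory -name memory.swappiness -exec grep -H . {} +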

Values of the memcgs created at boot (with vm.swappiness=0 on V1 boot)
------------------------------------------------------------------------
/sys/fs/cgroup/memory/memory.swappiness 				=0
/sys/fs/cgroup/memory/dev-hugepages.mount/memory.swappiness		=60
/sys/fs/cgroup/memory/dev-mqueue.mount/memory.swappiness		=60
/sys/fs/cgroup/memory/machine.slice/memory.swappiness			=60
/sys/fs/cgroup/memory/proc-sys-fs-binfmt_misc.mount/memory.swappiness	=0
/sys/fs/cgroup/memory/sys-fs-fuse-connections.mount/memory.swappiness	=0
/sys/fs/cgroup/memory/sys-kernel-config.mount/memory.swappiness		=0
/sys/fs/cgroup/memory/sys-kernel-debug.mount/memory.swappiness		=60
/sys/fs/cgroup/memory/sys-kernel-tracing.mount/memory.swappiness	=60
/sys/fs/cgroup/memory/system.slice/memory.swappiness			=60
/sys/fs/cgroup/memory/user.slice/memory.swappiness			=60

Some of these seem to inherit the root cgroup/memory/memory.swappiness value
and some do not... This was brought up in a systemd issue, with no solution or
documentation to date [1].

Libvirt in particular uses the machine.slice cgroup, so it inherits the 60.
If I change that value to 0 and then start the machine, it now has swappiness 0:
$ echo 0 > /sys/fs/cgroup/memory/machine.slice/memory.swappiness
$ virsh start <guest-name>
$ cat /sys/fs/cgroup/memory/machine.slice/machine-qemu.scope/memory.swappiness
0
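
A possible stopgap, purely an untested sketch on my part, would be a libvirt
qemu hook that re-syncs the slice with the global sysctl before each guest
starts (this assumes the hook file below, and that machine.slice already
exists by the time the hook runs):

#!/bin/sh
# /etc/libvirt/hooks/qemu -- untested sketch, not a proposed fix.
# libvirt invokes this with $1=guest name, $2=operation; on "prepare"
# (which runs before the guest starts), copy the global sysctl into
# machine.slice so the new machine-qemu*.scope inherits it.
if [ "$2" = "prepare" ] && \
   [ -w /sys/fs/cgroup/memory/machine.slice/memory.swappiness ]; then
    cat /proc/sys/vm/swappiness \
        > /sys/fs/cgroup/memory/machine.slice/memory.swappiness
fi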

Thank you so much for your very insightful note that led to the real issue :)

> Thanks for verifying. I'll prepare a proper patch.

My issue with v1 vs v2 seems to go away with a much saner value of
swappiness=10 on v1 (when it is actually set properly, lol).

Also, as per my results below, I don't actually think your patch caused much
change to my workload. I'm not sure what happened the first time I ran it that
caused the swapping on v2 (before your patch)... perhaps I ran the older kernel
(~v5.14) that was still having issues with v2, or it's simply that the results
can differ between runs. Sorry about that.

Here are the test results for your patch with V1 and V2 (swappiness=0/10):
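
(For context: the per-node Anon numbers below come from the node meminfo
files; something like this reproduces them:

$ grep -E 'Anon(Huge)?Pages:' /sys/devices/system/node/node*/meminfo
)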

Before Patch
-------------
-- V1(swappiness=0):
               total        used        free      shared  buff/cache   available
Mem:       264071432   257465704     1100160        4224     5505568     5064964
Swap:        4194300       47828     4146472

Node 0 AnonPages:      128068580 kB	Node 1 AnonPages:      128120400 kB
Node 0 AnonHugePages:  128012288 kB	Node 1 AnonHugePages:  128090112 kB
                                                               ^^^^^ no loss

-- V1(swappiness=10):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257364436      972048        3972     5734948     5164520
Swap:        4194300      235028     3959272

Node 0 AnonPages:      128015404 kB     Node 1 AnonPages:      128002788 kB
Node 0 AnonHugePages:  128002048 kB     Node 1 AnonHugePages:  120576000 kB
                                                               ^^^^^ some loss

-- V2(swappiness=0):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257609352      924692        4664     5537388     4921236
Swap:        4194300           0     4194300
                           ^^^^^ No Swap
Node 0 AnonPages:      128083104 kB     Node 1 AnonPages:      128180576 kB
Node 0 AnonHugePages:  128002048 kB     Node 1 AnonHugePages:  128124928 kB
                                                               ^^^^^ No loss

-- V2(swappiness=10):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257407576      918124        4632     5745732     5101764
Swap:        4194300      220424     3973876
                           ^^^^^ Some Swap
Node 0 AnonPages:      128109700 kB	Node 1 AnonPages:      127918164 kB
Node 0 AnonHugePages:  128006144 kB	Node 1 AnonHugePages:  120569856 kB
                                                               ^^^^^ some loss

After Patch
-------------
-- V1(swappiness=0):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257538832      945276        4368     5587324     4991852
Swap:        4194300        9276     4185024

Node 0 AnonPages:      128133932 kB	Node 1 AnonPages:      128100540 kB
Node 0 AnonHugePages:  128047104 kB	Node 1 AnonHugePages:  128061440 kB


-- V1(swappiness=10):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257428564      969252        4384     5673616     5100824
Swap:        4194300      138936     4055364
                           ^^^^^ Some Swap
Node 0 AnonPages:      128161724 kB     Node 1 AnonPages:      127945368 kB
Node 0 AnonHugePages:  128043008 kB     Node 1 AnonHugePages:  120221696 kB
                                                               ^^^^^ some loss

-- V2(swappiness=0):
               total        used        free      shared  buff/cache   available
Mem:       264071432   257536896      927424        4664     5607112     4993184
Swap:        4194300           0     4194300

Node 0 AnonPages:      128145476 kB	Node 1 AnonPages:      128111908 kB
Node 0 AnonHugePages:  128026624 kB	Node 1 AnonHugePages:  128090112 kB

-- V2(swappiness=10):
               total        used        free	  shared  buff/cache   available
Mem:	   264071432   257423936     1007076        4548     5640420     5106544
Swap:        4194300	  156016     4038284

Node 0 AnonPages:      128133264 kB     Node 1 AnonPages:      127955952 kB
Node 0 AnonHugePages:  128018432 kB     Node 1 AnonHugePages:  122507264 kB
                                                           ^^^^ slightly better

The only notable difference between before/after your patch is that, with your
patch, the THP tearing was slightly reduced, preserving an extra ~2GB of
AnonHugePages as seen in the last result. This may just be noise.

I'll have to see if I can find a fix for this in the kernel, libvirt, or
systemd, and will follow up if I do. If not, this should at least be documented
properly. Given that cgroup v1 is in limited-support mode upstream, and given
systemd's hesitancy to make changes for v1, we may have to go down our own
avenues to ensure our customers don't run into this issue.

Big Thanks!
-- Nico

[1] - https://github.com/systemd/systemd/issues/9276
