linux-kernel - Re: [PATCH 0/2] minimize swapping on zswap store failure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKEwX=Npoa5pxeqOHqNTreqVEKpnFQ2rg=yN6eOgFQngBp6X+w@mail.gmail.com>
Date:   Tue, 17 Oct 2023 12:03:10 -0700
From:   Nhat Pham <nphamcs@...il.com>
To:     Yosry Ahmed <yosryahmed@...gle.com>
Cc:     akpm@...ux-foundation.org, hannes@...xchg.org,
        cerasuolodomenico@...il.com, sjenning@...hat.com,
        ddstreet@...e.org, vitaly.wool@...sulko.com, hughd@...gle.com,
        corbet@....net, konrad.wilk@...cle.com, senozhatsky@...omium.org,
        rppt@...nel.org, linux-mm@...ck.org, kernel-team@...a.com,
        linux-kernel@...r.kernel.org, david@...t.cz
Subject: Re: [PATCH 0/2] minimize swapping on zswap store failure

On Mon, Oct 16, 2023 at 5:58 PM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Mon, Oct 16, 2023 at 5:35 PM Nhat Pham <nphamcs@...il.com> wrote:
> >
> > Currently, when a zswap store attempt fails, the page is immediately
> > swapped out. This could happen for a variety of reasons. For instance,
> > the compression algorithm could fail (such as when the data is not
> > compressible), or the backend allocator might not be able to find a
> > suitable slot for the compressed page. If these pages are needed
> > later on, users will incur IOs from swapins.
> >
> > This issue prevents the adoption of zswap for potential users who
> > cannot tolerate the latency associated with swapping. In many cases,
> > these IOs are avoidable if we just keep in memory the pages that zswap
> > fail to store.
> >
> > This patch series add two new features for zswap that will alleviate
> > the risk of swapping:
> >
> > a) When a store attempt fail, keep the page untouched in memory
> > instead of swapping it out.
>
> What about writeback when the zswap limit is hit? I understand the
> problem, but I am wondering if this is the correct way of fixing it.
> We really need to make zswap work without a backing swapfile, which I
> think is the correct way to fix all these problems. I was working on
> that, but unfortunately I had to pivot to something else before I had
> something that was working.
>
> At Google, we have "ghost" swapfiles that we use just to use zswap
> without a swapfile. They are sparse files, and we have internal kernel
> patches to flag them and never try to actually write to them.
>
> I am not sure how many bandaids we can afford before doing the right
> thing. I understand it's a much larger surgery, perhaps there is a way
> to get a short-term fix that is also a step towards the final state we
> want to reach instead?

Regarding the writeback - I'll make sure to also short-circuit writeback
when the bypass_swap option is enabled in v2 :) I'll probably send out
v2 after

I absolutely agree that we must decouple zswap and swap (and would
be happy to help out in any capacity I could - we have heard similar
concerns/complaints about swap wastage from internal parties as well).

However, as Johannes has pointed out, this feature still has its place,
given our already existing swapfile deployments. I do agree that a
global knob is insufficient tho. I'll add a per-cgroup knob in v2 so that
we can enable/disable this feature on a per-workload basis.

>
> >
> > b) If the store attempt fails at the compression step, allow the page
> > to be stored in its uncompressed form in the zswap pool. This maintains
> > the LRU ordering of pages, which will be helpful for accurate
> > memory reclaim (zswap writeback in particular).
>
> This is dangerous. Johannes and I discussed this before. This means
> that reclaim can end up allocating more memory instead of freeing.
> Allocations made in the reclaim path are made under the assumption
> that we will eventually free memory. In this case, we won't. In the
> worst case scenario, reclaim can leave the system/memcg in a worse
> state than before it started.
>
> Perhaps there is a way we can do this without allocating a zswap entry?
>
> I thought before about having a special list_head that allows us to
> use the lower bits of the pointers as markers, similar to the xarray.
> The markers can be used to place different objects on the same list.
> We can have a list that is a mixture of struct page and struct
> zswap_entry. I never pursued this idea, and I am sure someone will
> scream at me for suggesting it. Maybe there is a less convoluted way
> to keep the LRU ordering intact without allocating memory on the
> reclaim path.

Hmm yeah you're right about these concerns. That seems like a lot more
involved than what I envisioned initially.

Let's put this aside for now. I'll just send the first patch in v2, and we can
work on + discuss more about uncompressed pages storing later on.

>
> >
> > These features could be enabled independently via two new zswap module
> > parameters.
> >
> > Nhat Pham (2):
> >   swap: allows swap bypassing on zswap store failure
> >   zswap: store uncompressed pages when compression algorithm fails
> >
> >  Documentation/admin-guide/mm/zswap.rst | 16 +++++++
> >  include/linux/zswap.h                  |  9 ++++
> >  mm/page_io.c                           |  6 +++
> >  mm/shmem.c                             |  8 +++-
> >  mm/zswap.c                             | 64 +++++++++++++++++++++++---
> >  5 files changed, 95 insertions(+), 8 deletions(-)
> >
> > --
> > 2.34.1