lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKEwX=NX-4dbietxy-25F-OotuGGL0F9h+hwV76b9Ap5nSy9uw@mail.gmail.com>
Date: Thu, 30 May 2024 12:18:13 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Yosry Ahmed <yosryahmed@...gle.com>
Cc: Johannes Weiner <hannes@...xchg.org>, Usama Arif <usamaarif642@...il.com>, 
	akpm@...ux-foundation.org, chengming.zhou@...ux.dev, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, kernel-team@...a.com, 
	Hugh Dickins <hughd@...gle.com>, Huang Ying <ying.huang@...el.com>
Subject: Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap

On Thu, May 30, 2024 at 9:24 AM Yosry Ahmed <yosryahmed@...gle.com> wrote:
>
> On Thu, May 30, 2024 at 5:27 AM Johannes Weiner <hannes@...xchg.org> wrote:
> >
> > On Thu, May 30, 2024 at 11:19:07AM +0100, Usama Arif wrote:
> > > Approximately 10-20% of pages to be swapped out are zero pages [1].
> > > Rather than reading/writing these pages to flash resulting
> > > in increased I/O and flash wear, a bitmap can be used to mark these
> > > pages as zero at write time, and the pages can be filled at
> > > read time if the bit corresponding to the page is set.
> > > With this patch, NVMe writes in Meta server fleet decreased
> > > by almost 10% with conventional swap setup (zswap disabled).
> > >
> > > [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> > >
> > > Signed-off-by: Usama Arif <usamaarif642@...il.com>
> >
> > This is awesome.
> >
> > > ---
> > >  include/linux/swap.h |  1 +
> > >  mm/page_io.c         | 86 ++++++++++++++++++++++++++++++++++++++++++--
> > >  mm/swapfile.c        | 10 ++++++
> > >  3 files changed, 95 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index a11c75e897ec..e88563978441 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -299,6 +299,7 @@ struct swap_info_struct {
> > >       signed char     type;           /* strange name for an index */
> > >       unsigned int    max;            /* extent of the swap_map */
> > >       unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
> > > +     unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
> >
> > One bit per swap slot, so 1 / (4096 * 8) = 0.003% static memory
> > overhead for configured swap space. That seems reasonable for what
> > appears to be a fairly universal 10% reduction in swap IO.
> >
> > An alternative implementation would be to reserve a bit in
> > swap_map. This would be no overhead at idle, but would force
> > continuation counts earlier on heavily shared page tables, and AFAICS
> > would get complicated in terms of locking, whereas this one is pretty
> > simple (atomic ops protect the map, swapcache lock protects the bit).
> >
> > So I prefer this version. But a few comments below:
>
> I am wondering if it's even possible to take this one step further and
> avoid reclaiming zero-filled pages in the first place. Can we just
> unmap them and let the first read fault allocate a zero'd page like
> uninitialized memory, or point them at the zero page and make them
> read-only, or something? Then we could free them directly without
> going into the swap code to begin with.
>
> That's how I thought about it initially when I attempted to support
> only zero-filled pages in zswap. It could be a more complex
> implementation though.

We can aim for this eventually, but yeah the implementation will be
more complex. We'll need to be careful in handling shared zero pages,
synchronizing accesses and maintaining reference counts. I think we
will need to special-case swap cache and swap map for these zero pages
(a ghost zero swap device perhaps), or reinvent the wheel to manage
these pieces of information.

Not impossible, but annoying :) For now, I think Usama's approach is
clean enough and does the job.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ