linux-kernel - Re: [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMgjq7D43TQtvdrUyECq0GfaobbEpA2ANgyAqBnh-yJ3n74YAQ@mail.gmail.com>
Date: Sun, 15 Dec 2024 03:48:05 +0800
From: Kairui Song <ryncsn@...il.com>
To: Chris Li <chrisl@...nel.org>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>, 
	Hugh Dickins <hughd@...gle.com>, "Huang, Ying" <ying.huang@...ux.alibaba.com>, 
	Yosry Ahmed <yosryahmed@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Johannes Weiner <hannes@...xchg.org>, 
	Barry Song <baohua@...nel.org>, Michal Hocko <mhocko@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock

On Sun, Dec 15, 2024 at 12:07 AM Chris Li <chrisl@...nel.org> wrote:
>
> On Tue, Dec 10, 2024 at 1:29 AM Kairui Song <ryncsn@...il.com> wrote:
> >
> > From: Kairui Song <kasong@...cent.com>
> >
> > commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> > replaced the cmpxchg/xchg with a global irq spinlock because some archs
> > doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well.
> >
> > And as commented in swap_cgroup.c, this lock is not needed for map
> > synchronization.
> >
> > Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement
> > it to get rid of this lock. Introduced two helpers for doing so and they
> > can be easily dropped if a generic 2 byte xchg is support.
> >
> > Testing using 64G brd and build with build kernel with make -j96 in 1.5G
> > memory cgroup using 4k folios showed below improvement (10 test run):
> >
> > Before this series:
> > Sys time: 10809.46 (stdev 80.831491)
> > Real time: 171.41 (stdev 1.239894)
> >
> > After this commit:
> > Sys time: 9621.26 (stdev 34.620000), -10.42%
> > Real time: 160.00 (stdev 0.497814), -6.57%
> >
> > With 64k folios and 2G memcg:
> > Before this series:
> > Sys time: 8231.99 (stdev 30.030994)
> > Real time: 143.57 (stdev 0.577394)
> >
> > After this commit:
> > Sys time: 7403.47 (stdev 6.270000), -10.06%
> > Real time: 135.18 (stdev 0.605000), -5.84%
> >
> > Sequential swapout of 8G 64k zero folios with madvise (24 test run):
> > Before this series:
> > 5461409.12 us (stdev 183957.827084)
> >
> > After this commit:
> > 5420447.26 us (stdev 196419.240317)
> >
> > Sequential swapin of 8G 4k zero folios (24 test run):
> > Before this series:
> > 19736958.916667 us (stdev 189027.246676)
> >
> > After this commit:
> > 19662182.629630 us (stdev 172717.640614)
> >
> > Performance is better or at least not worse for all tests above.
> >
> > Signed-off-by: Kairui Song <kasong@...cent.com>
> > ---
> >  mm/swap_cgroup.c | 73 +++++++++++++++++++++++++++++-------------------
> >  1 file changed, 45 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> > index 1770b076f6b7..a0a8547dc85d 100644
> > --- a/mm/swap_cgroup.c
> > +++ b/mm/swap_cgroup.c
> > @@ -7,19 +7,20 @@
> >
> >  static DEFINE_MUTEX(swap_cgroup_mutex);
> >
> > +/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */
>
> Might not be two short if the atomic_t is more than 4 bytes. The
> assumption here is that short is 2 bytes and atomic_t is 4 bytes. It
> is hard to conclude that is the case for all architecture.
>
> > +#define ID_PER_SC (sizeof(atomic_t) / sizeof(unsigned short))
>
> You should use "sizeof(struct swap_cgroup) / sizeof(unsigned short)",
> or get rid of struct swap_cgroup and directly use atomic_t.
>
> > +#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
> > +#define ID_MASK (BIT(ID_SHIFT) - 1)
> >  struct swap_cgroup {
> > -       unsigned short          id;
> > +       atomic_t ids;
>
> You use struct swap_cgroup and atomic_t which assumes no padding added
> to the struct. You might want to build an assert on sizeof(atomic_t)
> == sizeof(struct swap_cgroup).

Good idea, asserting struct swap_croup is an atomic_t ensures no
unexpected behaviour.

>
> >  };
> >
> >  struct swap_cgroup_ctrl {
> >         struct swap_cgroup *map;
> > -       spinlock_t      lock;
> >  };
> >
> >  static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> >
> > -#define SC_PER_PAGE    (PAGE_SIZE/sizeof(struct swap_cgroup))
> > -
> >  /*
> >   * SwapCgroup implements "lookup" and "exchange" operations.
> >   * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
> > @@ -30,19 +31,32 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
> >   *    SwapCache(and its swp_entry) is under lock.
> >   *  - When called via swap_free(), there is no user of this entry and no race.
> >   * Then, we don't need lock around "exchange".
> > - *
> > - * TODO: we can push these buffers out to HIGHMEM.
> >   */
> > -static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> > -                                       struct swap_cgroup_ctrl **ctrlp)
> > +static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
> > +                                             pgoff_t offset)
> >  {
> > -       pgoff_t offset = swp_offset(ent);
> > -       struct swap_cgroup_ctrl *ctrl;
> > +       unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
>
> Might not want to assume the ID_PER_SC is two. If some architecture
> atomic_t is 64 bits then that code will break.

Good idea, atomic_t is by defining an int, not sure if there is any
strange archs will change the size though, more robust code is always
better.

Can change this to (offset % ID_PER_SC) instead, the generated machine
code should be still the same for most archs.

>
> > +       unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);
>
> Here assume sizeof(unsigned int) == sizeof(atomic_t). Again,some
> strange architecture might break it. Better use unsigned version of
> aotmic_t;
>
>
> >
> > -       ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > -       if (ctrlp)
> > -               *ctrlp = ctrl;
> > -       return &ctrl->map[offset];
> > +       return (old_ids & (ID_MASK << shift)) >> shift;
>
> Can be simplified as (old_ids >> shift) & ID_MASK. You might want to
> double check that.
>
> > +}
> > +
> > +static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
> > +                                           pgoff_t offset,
> > +                                           unsigned short new_id)
> > +{
> > +       unsigned short old_id;
> > +       unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;
>
> Same here, it assumes ID_PER_SC is 2.
>
> > +       struct swap_cgroup *sc = &map[offset / ID_PER_SC];
> > +       unsigned int new_ids, old_ids = atomic_read(&sc->ids);
>
> Again it assumes  sizeof(unsigned int) == sizeof(atomic_t).

I think this should be OK? The document says "atomic_t, atomic_long_t
and atomic64_t use int, long and s64 respectively".

Could change this with some wrapper but I think it's unnecessary.

>
> > +
> > +       do {
> > +               old_id = (old_ids & (ID_MASK << shift)) >> shift;
> Can be simplify:
> old_id = (old_ids >> shift) & ID_MASK;
>
> > +               new_ids = (old_ids & ~(ID_MASK << shift));
> > +               new_ids |= ((unsigned int)new_id) << shift;
>
> new_ids |= (atomic_t) new_id << shift;

atomic_t doesn't work with bit operations, it must be an arithmetic
type, so here I think it has to stay like this.

>
> > +       } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
> > +
> > +       return old_id;
> >  }
> >
> >  /**
> > @@ -58,21 +72,19 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> >                                   unsigned int nr_ents)
> >  {
> >         struct swap_cgroup_ctrl *ctrl;
> > -       struct swap_cgroup *sc;
> > -       unsigned short old;
> > -       unsigned long flags;
> >         pgoff_t offset = swp_offset(ent);
> >         pgoff_t end = offset + nr_ents;
> > +       unsigned short old, iter;
> > +       struct swap_cgroup *map;
> >
> > -       sc = lookup_swap_cgroup(ent, &ctrl);
> > +       ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > +       map = ctrl->map;
> >
> > -       spin_lock_irqsave(&ctrl->lock, flags);
> > -       old = sc->id;
> > -       for (; offset < end; offset++, sc++) {
> > -               VM_BUG_ON(sc->id != old);
> > -               sc->id = id;
> > -       }
> > -       spin_unlock_irqrestore(&ctrl->lock, flags);
>
> The above will always assign nr_ents of swap entry atomically.
>
> > +       old = __swap_cgroup_id_lookup(map, offset);
> > +       do {
> > +               iter = __swap_cgroup_id_xchg(map, offset, id);
> > +               VM_BUG_ON(iter != old);
> > +       } while (++offset != end);
>
> Here it is possible that some of the nr_ents can be changed while the
> offset is still in the loop. Might want to examine if the caller can
> trigger that or not. We want to make sure it is safe to do so, when
> removing the spin lock, the nr_ents might not update to the same value
> if two callers race it.

Right, the problem is explained with the comments in the beginning of
this file, I can update that with more details:

Entries are always charged / uncharged belonging to one folio, or
being uncharged with no folio related (no more users, other folios
can't use these entries unless freeing is done).

As a folio is always charged or uncharged as a whole, and charge /
uncharge never happens concurrently, and the xchg here implies full
barrier, so the set of entries will also always be used as a whole.

So yes this function does implies the caller must always pass in swap
entries belonging to one single folio, or entries that have no users.
It's quite fragile indeed, I can make the caller pass in the folio as
an argument to clarify this, with some more WARN or BUG_ON.

> >
> >         return old;
> >  }
> > @@ -85,9 +97,13 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> >   */
> >  unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
> >  {
> > +       struct swap_cgroup_ctrl *ctrl;
> > +
> >         if (mem_cgroup_disabled())
> >                 return 0;
> > -       return lookup_swap_cgroup(ent, NULL)->id;
> > +
> > +       ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> > +       return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
> >  }
> >
> >  int swap_cgroup_swapon(int type, unsigned long max_pages)
> > @@ -98,14 +114,15 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
> >         if (mem_cgroup_disabled())
> >                 return 0;
> >
> > -       map = vcalloc(max_pages, sizeof(struct swap_cgroup));
> > +       BUILD_BUG_ON(!ID_PER_SC);
>
> It is simpler just to assert on: sizeof(atomic_t) >= sizeof(unsigned short).
> I think that is what it does here.
>
> You might also want to assert:  !(sizeof(atomic_t) % sizeof(unsigned short))

Good idea.


>
> Chris
>
> > +       map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
> > +                     sizeof(struct swap_cgroup));
> >         if (!map)
> >                 goto nomem;
> >
> >         ctrl = &swap_cgroup_ctrl[type];
> >         mutex_lock(&swap_cgroup_mutex);
> >         ctrl->map = map;
> > -       spin_lock_init(&ctrl->lock);
> >         mutex_unlock(&swap_cgroup_mutex);
> >
> >         return 0;
> > --
> > 2.47.1
> >
>