linux-kernel - Re: [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACePvbUoije1wgy3jPambP9-rbYs_Yq1Pajnv3U1MDOxFGU2fg@mail.gmail.com>
Date: Sat, 14 Dec 2024 08:07:08 -0800
From: Chris Li <chrisl@...nel.org>
To: Kairui Song <kasong@...cent.com>
Cc: linux-mm@...ck.org, Andrew Morton <akpm@...ux-foundation.org>, 
	Hugh Dickins <hughd@...gle.com>, "Huang, Ying" <ying.huang@...ux.alibaba.com>, 
	Yosry Ahmed <yosryahmed@...gle.com>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Shakeel Butt <shakeel.butt@...ux.dev>, Johannes Weiner <hannes@...xchg.org>, 
	Barry Song <baohua@...nel.org>, Michal Hocko <mhocko@...nel.org>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 3/3] mm, swap_cgroup: remove global swap cgroup lock

On Tue, Dec 10, 2024 at 1:29 AM Kairui Song <ryncsn@...il.com> wrote:
>
> From: Kairui Song <kasong@...cent.com>
>
> commit e9e58a4ec3b1 ("memcg: avoid use cmpxchg in swap cgroup maintainance")
> replaced the cmpxchg/xchg with a global irq spinlock because some archs
> doesn't support 2 bytes cmpxchg/xchg. Clearly this won't scale well.
>
> And as commented in swap_cgroup.c, this lock is not needed for map
> synchronization.
>
> Emulation of 2 bytes xchg with atomic cmpxchg isn't hard, so implement
> it to get rid of this lock. Introduced two helpers for doing so and they
> can be easily dropped if a generic 2 byte xchg is support.
>
> Testing using 64G brd and build with build kernel with make -j96 in 1.5G
> memory cgroup using 4k folios showed below improvement (10 test run):
>
> Before this series:
> Sys time: 10809.46 (stdev 80.831491)
> Real time: 171.41 (stdev 1.239894)
>
> After this commit:
> Sys time: 9621.26 (stdev 34.620000), -10.42%
> Real time: 160.00 (stdev 0.497814), -6.57%
>
> With 64k folios and 2G memcg:
> Before this series:
> Sys time: 8231.99 (stdev 30.030994)
> Real time: 143.57 (stdev 0.577394)
>
> After this commit:
> Sys time: 7403.47 (stdev 6.270000), -10.06%
> Real time: 135.18 (stdev 0.605000), -5.84%
>
> Sequential swapout of 8G 64k zero folios with madvise (24 test run):
> Before this series:
> 5461409.12 us (stdev 183957.827084)
>
> After this commit:
> 5420447.26 us (stdev 196419.240317)
>
> Sequential swapin of 8G 4k zero folios (24 test run):
> Before this series:
> 19736958.916667 us (stdev 189027.246676)
>
> After this commit:
> 19662182.629630 us (stdev 172717.640614)
>
> Performance is better or at least not worse for all tests above.
>
> Signed-off-by: Kairui Song <kasong@...cent.com>
> ---
>  mm/swap_cgroup.c | 73 +++++++++++++++++++++++++++++-------------------
>  1 file changed, 45 insertions(+), 28 deletions(-)
>
> diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
> index 1770b076f6b7..a0a8547dc85d 100644
> --- a/mm/swap_cgroup.c
> +++ b/mm/swap_cgroup.c
> @@ -7,19 +7,20 @@
>
>  static DEFINE_MUTEX(swap_cgroup_mutex);
>
> +/* Pack two cgroup id (short) of two entries in one swap_cgroup (atomic_t) */

Might not be two short if the atomic_t is more than 4 bytes. The
assumption here is that short is 2 bytes and atomic_t is 4 bytes. It
is hard to conclude that is the case for all architecture.

> +#define ID_PER_SC (sizeof(atomic_t) / sizeof(unsigned short))

You should use "sizeof(struct swap_cgroup) / sizeof(unsigned short)",
or get rid of struct swap_cgroup and directly use atomic_t.

> +#define ID_SHIFT (BITS_PER_TYPE(unsigned short))
> +#define ID_MASK (BIT(ID_SHIFT) - 1)
>  struct swap_cgroup {
> -       unsigned short          id;
> +       atomic_t ids;

You use struct swap_cgroup and atomic_t which assumes no padding added
to the struct. You might want to build an assert on sizeof(atomic_t)
== sizeof(struct swap_cgroup).

>  };
>
>  struct swap_cgroup_ctrl {
>         struct swap_cgroup *map;
> -       spinlock_t      lock;
>  };
>
>  static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
>
> -#define SC_PER_PAGE    (PAGE_SIZE/sizeof(struct swap_cgroup))
> -
>  /*
>   * SwapCgroup implements "lookup" and "exchange" operations.
>   * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
> @@ -30,19 +31,32 @@ static struct swap_cgroup_ctrl swap_cgroup_ctrl[MAX_SWAPFILES];
>   *    SwapCache(and its swp_entry) is under lock.
>   *  - When called via swap_free(), there is no user of this entry and no race.
>   * Then, we don't need lock around "exchange".
> - *
> - * TODO: we can push these buffers out to HIGHMEM.
>   */
> -static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
> -                                       struct swap_cgroup_ctrl **ctrlp)
> +static unsigned short __swap_cgroup_id_lookup(struct swap_cgroup *map,
> +                                             pgoff_t offset)
>  {
> -       pgoff_t offset = swp_offset(ent);
> -       struct swap_cgroup_ctrl *ctrl;
> +       unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;

Might not want to assume the ID_PER_SC is two. If some architecture
atomic_t is 64 bits then that code will break.

> +       unsigned int old_ids = atomic_read(&map[offset / ID_PER_SC].ids);

Here assume sizeof(unsigned int) == sizeof(atomic_t). Again,some
strange architecture might break it. Better use unsigned version of
aotmic_t;


>
> -       ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> -       if (ctrlp)
> -               *ctrlp = ctrl;
> -       return &ctrl->map[offset];
> +       return (old_ids & (ID_MASK << shift)) >> shift;

Can be simplified as (old_ids >> shift) & ID_MASK. You might want to
double check that.

> +}
> +
> +static unsigned short __swap_cgroup_id_xchg(struct swap_cgroup *map,
> +                                           pgoff_t offset,
> +                                           unsigned short new_id)
> +{
> +       unsigned short old_id;
> +       unsigned int shift = (offset & 1) ? 0 : ID_SHIFT;

Same here, it assumes ID_PER_SC is 2.

> +       struct swap_cgroup *sc = &map[offset / ID_PER_SC];
> +       unsigned int new_ids, old_ids = atomic_read(&sc->ids);

Again it assumes  sizeof(unsigned int) == sizeof(atomic_t).

> +
> +       do {
> +               old_id = (old_ids & (ID_MASK << shift)) >> shift;
Can be simplify:
old_id = (old_ids >> shift) & ID_MASK;

> +               new_ids = (old_ids & ~(ID_MASK << shift));
> +               new_ids |= ((unsigned int)new_id) << shift;

new_ids |= (atomic_t) new_id << shift;

> +       } while (!atomic_try_cmpxchg(&sc->ids, &old_ids, new_ids));
> +
> +       return old_id;
>  }
>
>  /**
> @@ -58,21 +72,19 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
>                                   unsigned int nr_ents)
>  {
>         struct swap_cgroup_ctrl *ctrl;
> -       struct swap_cgroup *sc;
> -       unsigned short old;
> -       unsigned long flags;
>         pgoff_t offset = swp_offset(ent);
>         pgoff_t end = offset + nr_ents;
> +       unsigned short old, iter;
> +       struct swap_cgroup *map;
>
> -       sc = lookup_swap_cgroup(ent, &ctrl);
> +       ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> +       map = ctrl->map;
>
> -       spin_lock_irqsave(&ctrl->lock, flags);
> -       old = sc->id;
> -       for (; offset < end; offset++, sc++) {
> -               VM_BUG_ON(sc->id != old);
> -               sc->id = id;
> -       }
> -       spin_unlock_irqrestore(&ctrl->lock, flags);

The above will always assign nr_ents of swap entry atomically.

> +       old = __swap_cgroup_id_lookup(map, offset);
> +       do {
> +               iter = __swap_cgroup_id_xchg(map, offset, id);
> +               VM_BUG_ON(iter != old);
> +       } while (++offset != end);

Here it is possible that some of the nr_ents can be changed while the
offset is still in the loop. Might want to examine if the caller can
trigger that or not. We want to make sure it is safe to do so, when
removing the spin lock, the nr_ents might not update to the same value
if two callers race it.
>
>         return old;
>  }
> @@ -85,9 +97,13 @@ unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
>   */
>  unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>  {
> +       struct swap_cgroup_ctrl *ctrl;
> +
>         if (mem_cgroup_disabled())
>                 return 0;
> -       return lookup_swap_cgroup(ent, NULL)->id;
> +
> +       ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> +       return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
>  }
>
>  int swap_cgroup_swapon(int type, unsigned long max_pages)
> @@ -98,14 +114,15 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
>         if (mem_cgroup_disabled())
>                 return 0;
>
> -       map = vcalloc(max_pages, sizeof(struct swap_cgroup));
> +       BUILD_BUG_ON(!ID_PER_SC);

It is simpler just to assert on: sizeof(atomic_t) >= sizeof(unsigned short).
I think that is what it does here.

You might also want to assert:  !(sizeof(atomic_t) % sizeof(unsigned short))

Chris

> +       map = vcalloc(DIV_ROUND_UP(max_pages, ID_PER_SC),
> +                     sizeof(struct swap_cgroup));
>         if (!map)
>                 goto nomem;
>
>         ctrl = &swap_cgroup_ctrl[type];
>         mutex_lock(&swap_cgroup_mutex);
>         ctrl->map = map;
> -       spin_lock_init(&ctrl->lock);
>         mutex_unlock(&swap_cgroup_mutex);
>
>         return 0;
> --
> 2.47.1
>