linux-kernel - Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKEwX=OP_32757xb6ogZ5OaDk_6bhkm--F4akhQ-tj_WTzA4CA@mail.gmail.com>
Date: Fri, 26 Jan 2024 11:50:06 -0800
From: Nhat Pham <nphamcs@...il.com>
To: chengming.zhou@...ux.dev
Cc: hannes@...xchg.org, yosryahmed@...gle.com, akpm@...ux-foundation.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, 
	Chengming Zhou <zhouchengming@...edance.com>
Subject: Re: [PATCH 2/2] mm/zswap: fix race between lru writeback and swapoff

On Fri, Jan 26, 2024 at 12:32 AM <chengming.zhou@...ux.dev> wrote:
>
> From: Chengming Zhou <zhouchengming@...edance.com>
>
> LRU writeback has race problem with swapoff, as spotted by Yosry[1]:
>
> CPU1                    CPU2
> shrink_memcg_cb         swap_off
>   list_lru_isolate        zswap_invalidate
>                           zswap_swapoff
>                             kfree(tree)
>   // UAF
>   spin_lock(&tree->lock)
>
> The problem is that the entry in lru list can't protect the tree from
> being swapoff and freed, and the entry also can be invalidated and freed
> concurrently after we unlock the lru lock.
>
> We can fix it by moving the swap cache allocation ahead before
> referencing the tree, then check invalidate race with tree lock,
> only after that we can safely deref the entry. Note we couldn't
> deref entry or tree anymore after we unlock the folio, since we
> depend on this to hold on swapoff.
>
> So this patch moves all tree and entry usage to zswap_writeback_entry(),
> we only use the copied swpentry on the stack to allocate swap cache
> and return with folio locked, after which we can reference the tree.
> Then check invalidate race with tree lock, the following things is
> much the same like zswap_load().
>
> Since we can't deref the entry after zswap_writeback_entry(), we
> can't use zswap_lru_putback() anymore, instead we rotate the entry

I added list_lru_putback to the list_lru API specifically for this use
case (zswap_lru_putback()). Now that we no longer need it, maybe we
can also remove this as well (assuming no-one else is using this?).

This can be done in a separate patch though.

> in the LRU list so concurrent reclaimers have little chance to see it.
> Or it will be deleted from LRU list if writeback success.
>
> Another confusing part to me is the update of memcg nr_zswap_protected
> in zswap_lru_putback(). I'm not sure why it's needed here since
> if we raced with swapin, memcg nr_zswap_protected has already been
> updated in zswap_folio_swapin(). So not include this part for now.
>
> [1] https://lore.kernel.org/all/CAJD7tkasHsRnT_75-TXsEe58V9_OW6m3g6CF7Kmsvz8CKRG_EA@mail.gmail.com/
>
> Signed-off-by: Chengming Zhou <zhouchengming@...edance.com>

LGTM! This is quite elegant.
Acked-by: Nhat Pham <nphamcs@...il.com>

> ---
>  mm/zswap.c | 93 ++++++++++++++++++------------------------------------
>  1 file changed, 31 insertions(+), 62 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 81cb3790e0dd..fa2bdb7ec1d8 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -277,7 +277,7 @@ static inline struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
>                  zpool_get_type((p)->zpools[0]))
>
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -                                struct zswap_tree *tree);
> +                                swp_entry_t swpentry);
>  static int zswap_pool_get(struct zswap_pool *pool);
>  static void zswap_pool_put(struct zswap_pool *pool);
>
> @@ -445,27 +445,6 @@ static void zswap_lru_del(struct list_lru *list_lru, struct zswap_entry *entry)
>         rcu_read_unlock();
>  }
>
> -static void zswap_lru_putback(struct list_lru *list_lru,
> -               struct zswap_entry *entry)
> -{
> -       int nid = entry_to_nid(entry);
> -       spinlock_t *lock = &list_lru->node[nid].lock;
> -       struct mem_cgroup *memcg;
> -       struct lruvec *lruvec;
> -
> -       rcu_read_lock();
> -       memcg = mem_cgroup_from_entry(entry);
> -       spin_lock(lock);
> -       /* we cannot use list_lru_add here, because it increments node's lru count */
> -       list_lru_putback(list_lru, &entry->lru, nid, memcg);
> -       spin_unlock(lock);
> -
> -       lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry_to_nid(entry)));
> -       /* increment the protection area to account for the LRU rotation. */
> -       atomic_long_inc(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> -       rcu_read_unlock();
> -}
> -
>  /*********************************
>  * rbtree functions
>  **********************************/
> @@ -860,40 +839,34 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>  {
>         struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>         bool *encountered_page_in_swapcache = (bool *)arg;
> -       struct zswap_tree *tree;
> -       pgoff_t swpoffset;
> +       swp_entry_t swpentry;
>         enum lru_status ret = LRU_REMOVED_RETRY;
>         int writeback_result;
>
> +       /*
> +        * First rotate to the tail of lru list before unlocking lru lock,
> +        * so the concurrent reclaimers have little chance to see it.
> +        * It will be deleted from the lru list if writeback success.
> +        */
> +       list_move_tail(item, &l->list);
> +
>         /*
>          * Once the lru lock is dropped, the entry might get freed. The
> -        * swpoffset is copied to the stack, and entry isn't deref'd again
> +        * swpentry is copied to the stack, and entry isn't deref'd again
>          * until the entry is verified to still be alive in the tree.
>          */
> -       swpoffset = swp_offset(entry->swpentry);
> -       tree = swap_zswap_tree(entry->swpentry);
> -       list_lru_isolate(l, item);

nit: IIUC, now that we're no longer removing the entry from the
list_lru, we protect against concurrent shrinking action via this
check inside zswap_writeback_entry() too right:

if (!folio_was_allocated) {
   folio_put(folio);
   return -EEXIST;
}

Maybe update the comment above it too?

> +       swpentry = entry->swpentry;
> +
>         /*
>          * It's safe to drop the lock here because we return either
>          * LRU_REMOVED_RETRY or LRU_RETRY.
>          */
>         spin_unlock(lock);
>
> -       /* Check for invalidate() race */
> -       spin_lock(&tree->lock);
> -       if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
> -               goto unlock;
> -
> -       /* Hold a reference to prevent a free during writeback */
> -       zswap_entry_get(entry);
> -       spin_unlock(&tree->lock);
> -
> -       writeback_result = zswap_writeback_entry(entry, tree);
> +       writeback_result = zswap_writeback_entry(entry, swpentry);
>
> -       spin_lock(&tree->lock);
>         if (writeback_result) {
>                 zswap_reject_reclaim_fail++;
> -               zswap_lru_putback(&entry->pool->list_lru, entry);
>                 ret = LRU_RETRY;
>
>                 /*
> @@ -903,27 +876,10 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>                  */
>                 if (writeback_result == -EEXIST && encountered_page_in_swapcache)
>                         *encountered_page_in_swapcache = true;
> -
> -               goto put_unlock;
> +       } else {
> +               zswap_written_back_pages++;
>         }
> -       zswap_written_back_pages++;
> -
> -       if (entry->objcg)
> -               count_objcg_event(entry->objcg, ZSWPWB);
> -
> -       count_vm_event(ZSWPWB);
> -       /*
> -        * Writeback started successfully, the page now belongs to the
> -        * swapcache. Drop the entry from zswap - unless invalidate already
> -        * took it out while we had the tree->lock released for IO.
> -        */
> -       zswap_invalidate_entry(tree, entry);
>
> -put_unlock:
> -       /* Drop local reference */
> -       zswap_entry_put(entry);
> -unlock:
> -       spin_unlock(&tree->lock);
>         spin_lock(lock);
>         return ret;
>  }
> @@ -1408,9 +1364,9 @@ static void __zswap_load(struct zswap_entry *entry, struct page *page)
>   * freed.
>   */
>  static int zswap_writeback_entry(struct zswap_entry *entry,
> -                                struct zswap_tree *tree)
> +                                swp_entry_t swpentry)
>  {
> -       swp_entry_t swpentry = entry->swpentry;
> +       struct zswap_tree *tree;
>         struct folio *folio;
>         struct mempolicy *mpol;
>         bool folio_was_allocated;
> @@ -1442,18 +1398,31 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>          * backs (our zswap_entry reference doesn't prevent that), to
>          * avoid overwriting a new swap folio with old compressed data.
>          */
> +       tree = swap_zswap_tree(swpentry);
>         spin_lock(&tree->lock);
> -       if (zswap_rb_search(&tree->rbroot, swp_offset(entry->swpentry)) != entry) {
> +       if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
>                 spin_unlock(&tree->lock);
>                 delete_from_swap_cache(folio);
>                 folio_unlock(folio);
>                 folio_put(folio);
>                 return -ENOMEM;
>         }
> +
> +       /* Safe to deref entry after the entry is verified above. */
> +       zswap_entry_get(entry);
>         spin_unlock(&tree->lock);
>
>         __zswap_load(entry, &folio->page);
>
> +       count_vm_event(ZSWPWB);
> +       if (entry->objcg)
> +               count_objcg_event(entry->objcg, ZSWPWB);
> +
> +       spin_lock(&tree->lock);
> +       zswap_invalidate_entry(tree, entry);
> +       zswap_entry_put(entry);
> +       spin_unlock(&tree->lock);
> +
>         /* folio is up to date */
>         folio_mark_uptodate(folio);
>
> --
> 2.40.1
>