linux-kernel - Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJD7tkaVdJ9B_UDQs+o1nLdbs62CeKgbCyEXbMdezaBgOruEWw@mail.gmail.com>
Date: Tue, 23 Jan 2024 22:57:30 -0800
From: Yosry Ahmed <yosryahmed@...gle.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Nhat Pham <nphamcs@...il.com>, 
	Chris Li <chrisl@...nel.org>, Chengming Zhou <zhouchengming@...edance.com>, 
	Huang Ying <ying.huang@...el.com>, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()

> > > > > However, while I have your attention on the swapoff path, there's a
> > > > > slightly irrelevant problem that I think might be there, but I am not
> > > > > sure.
> > > > >
> > > > > It looks to me like swapoff can race with writeback, and there may be
> > > > > a chance of UAF for the zswap tree. For example, if zswap_swapoff()
> > > > > races with shrink_memcg_cb(), I feel like we may free the tree as it
> > > > > is being used. For example if zswap_swapoff()->kfree(tree) happen
> > > > > right before shrink_memcg_cb()->list_lru_isolate(l, item).
> > > > >
> > > > > Please tell me that I am being paranoid and that there is some
> > > > > protection against zswap writeback racing with swapoff. It feels like
> > > > > we are very careful with zswap entries refcounting, but not with the
> > > > > zswap tree itself.
> > > >
> > > > Hm, I don't see how.
> > > >
> > > > Writeback operates on entries from the LRU. By the time
> > > > zswap_swapoff() is called, try_to_unuse() -> zswap_invalidate() should
> > > > will have emptied out the LRU and tree.
> > > >
> > > > Writeback could have gotten a refcount to the entry and dropped the
> > > > tree->lock. But then it does __read_swap_cache_async(), and while
> > > > holding the page lock checks the tree under lock once more; if that
> > > > finds the entry valid, it means try_to_unuse() hasn't started on this
> > > > page yet, and would be held up by the page lock/writeback state.
> > >
> > > Consider the following race:
> > >
> > > CPU 1                                 CPU 2
> > > # In shrink_memcg_cb()     # In swap_off
> > > list_lru_isolate()
> > >                                             zswap_invalidate()
> > >                                             ..
> > >                                             zswap_swapoff() -> kfree(tree)
> > > spin_lock(&tree->lock);
> > >
> > > Isn't this a UAF or am I missing something here?
> >
> > Oof. You're right, it looks like there is a bug. Digging through the
> > history, I think this is actually quite old: the original backend
> > shrinkers would pluck something off their LRU, drop all locks, then
> > try to acquire tree->lock. There is no protection against swapoff.
> >
> > The lock that is supposed to protect this is the LRU lock. That's
> > where reclaim and invalidation should synchronize. But it's not right:
> >
> > 1. We drop the LRU lock before acquiring the tree lock. We should
> >    instead trylock the tree while still holding the LRU lock to make
> >    sure the tree is safe against swapoff.
> >
> > 2. zswap_invalidate() acquires the LRU lock when refcount hits 0. But
> >    it must always cycle the LRU lock before freeing the tree for that
> >    synchronization to work.
> >
> > Once we're holding a refcount to the entry, it's safe to drop all
> > locks for the next step because we'll then work against the swapcache
> > and entry: __read_swap_cache_async() will try to pin and lock whatever
> > swap entry is at that type+offset. This also pins the type's current
> > tree. HOWEVER, if swapoff + swapon raced, this could be a different
> > tree than what we had in @tree, so
> >
> > 3. we shouldn't pass @tree to zswap_writeback_entry(). It needs to
> >    look up zswap_trees[] again after __read_swap_cache_async()
> >    succeeded to validate the entry.
> >
> > Once it succeeded, we can validate the entry. The entry is valid due
> > to our refcount. The zswap_trees[type] is valid due to the cache pin.
> >
> > However, if validation failed and we have a non-zero writeback_result,
> > there is one last bug:
> >
> > 4. the original entry's tree is no longer valid for the entry put.
> >
> > The current scheme handles invalidation fine (which is good because
> > that's quite common). But it's fundamentally unsynchronized against
> > swapoff (which has probably gone undetected because that's rare).
> >
> > I can't think of an immediate solution to this, but I wanted to put my
> > analysis out for comments.
>
>
> Thanks for the great analysis, I missed the swapoff/swapon race myself :)
>
> The first solution that came to mind for me was refcounting the zswap
> tree with RCU with percpu-refcount, similar to how cgroup refs are
> handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> the percpu-refcount may be an overkill in terms of memory usage
> though. I think we can still do our own refcounting with RCU, but it
> may be more complicated.

FWIW, I was able to reproduce the problem in a vm with the following
kernel diff:
diff --git a/mm/zswap.c b/mm/zswap.c
index 78df16d307aa8..6580a4be52a18 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -880,6 +880,9 @@ static enum lru_status shrink_memcg_cb(struct
list_head *item, struct list_lru_o
         */
        spin_unlock(lock);

+       pr_info("Sleeping in shrink_memcg_cb() before
spin_lock(&tree->lock)\n");
+       schedule_timeout_uninterruptible(msecs_to_jiffies(10 * 1000));
+
        /* Check for invalidate() race */
        spin_lock(&tree->lock);
        if (entry != zswap_rb_search(&tree->rbroot, swpoffset))

This basically expands the race window to 10 seconds. I have a
reproducer script attached that utilizes the zswap shrinker (which
makes this much easier to reproduce btw). The steps are as follows:
- Compile the kernel with the above diff, and both ZRAM & KASAN enabled.
- In one terminal, run zswap_wb_swapoff_race.sh.
- In a different terminal, once the "Sleeping in shrink_memcg_cb()"
message is logged, run "swapoff /dev/zram0".
- In the first terminal, once the 10 seconds elapse, I get a UAF BUG
from KASAN (log attached as well).

View attachment "zswap_wb_swapoff_race.sh" of type "text/x-sh" (988 bytes)

Download attachment "uaf_log" of type "application/octet-stream" (5550 bytes)