Message-ID: <30e949b5-9455-4053-93ef-1ca0ceb10c3f@bytedance.com>
Date: Tue, 30 Jan 2024 11:31:04 +0800
From: Chengming Zhou <zhouchengming@...edance.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Yosry Ahmed <yosryahmed@...gle.com>, Nhat Pham <nphamcs@...il.com>,
 Andrew Morton <akpm@...ux-foundation.org>, Chris Li <chriscli@...gle.com>,
 linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2 2/3] mm/zswap: fix race between lru writeback and
 swapoff

On 2024/1/30 11:17, Johannes Weiner wrote:
> On Tue, Jan 30, 2024 at 10:30:20AM +0800, Chengming Zhou wrote:
>> On 2024/1/30 08:22, Yosry Ahmed wrote:
>>> On Sun, Jan 28, 2024 at 01:28:50PM +0000, Chengming Zhou wrote:
>>>> @@ -860,40 +839,47 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
>>>>  {
>>>>  	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
>>>>  	bool *encountered_page_in_swapcache = (bool *)arg;
>>>> -	struct zswap_tree *tree;
>>>> -	pgoff_t swpoffset;
>>>> +	swp_entry_t swpentry;
>>>>  	enum lru_status ret = LRU_REMOVED_RETRY;
>>>>  	int writeback_result;
>>>>  
>>>> +	/*
>>>> +	 * Rotate the entry to the tail before unlocking the LRU,
>>>> +	 * so that in case of an invalidation race concurrent
>>>> +	 * reclaimers don't waste their time on it.
>>>> +	 *
>>>> +	 * If writeback succeeds, or failure is due to the entry
>>>> +	 * being invalidated by the swap subsystem, the invalidation
>>>> +	 * will unlink and free it.
>>>> +	 *
>>>> +	 * Temporary failures, where the same entry should be tried
>>>> +	 * again immediately, almost never happen for this shrinker.
>>>> +	 * We don't do any trylocking; -ENOMEM comes closest,
>>>> +	 * but that's extremely rare and doesn't happen spuriously
>>>> +	 * either. Don't bother distinguishing this case.
>>>> +	 *
>>>> +	 * But since they do exist in theory, the entry cannot just
>>>> +	 * be unlinked, or we could leak it. Hence, rotate.
>>>
>>> The entry cannot be unlinked because we cannot get a ref on it without
>>> holding the tree lock, and we cannot deref the tree before we acquire a
>>> swap cache ref in zswap_writeback_entry() -- or if
>>> zswap_writeback_entry() fails. This should be called out explicitly
>>> somewhere. Perhaps we can describe this whole deref dance with added
>>> docs to shrink_memcg_cb().
>>
>> Maybe we should add some comments before or after zswap_writeback_entry()?
>> Or do you have some suggestions? I'm not good at this. :)
> 
> I agree with the suggestion of a central point to document this.
> 
> How about something like this:
> 
> /*
>  * As soon as we drop the LRU lock, the entry can be freed by
>  * a concurrent invalidation. This means the following:
>  *
>  * 1. We extract the swp_entry_t to the stack, allowing
>  *    zswap_writeback_entry() to pin the swap entry and
>  *    then validate the zswap entry against that swap entry's
>  *    tree using pointer value comparison. Only when that
>  *    is successful can the entry be dereferenced.
>  *
>  * 2. Usually, objects are taken off the LRU for reclaim. In
>  *    this case this isn't possible, because if reclaim fails
>  *    for whatever reason, we have no means of knowing if the
>  *    entry is alive to put it back on the LRU.
>  *
>  *    So rotate it before dropping the lock. If the entry is
>  *    written back or invalidated, the free path will unlink
>  *    it. For failures, rotation is the right thing as well.
>  *
>  *    Temporary failures, where the same entry should be tried
>  *    again immediately, almost never happen for this shrinker.
>  *    We don't do any trylocking; -ENOMEM comes closest,
>  *    but that's extremely rare and doesn't happen spuriously
>  *    either. Don't bother distinguishing this case.
>  */
> 

Thanks! I think this comment documents it well enough.
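
To make the ordering concrete for myself, this is roughly how I read the
scheme in shrink_memcg_cb() after this patch (just a condensed sketch, not
the exact code; the callback signature and helpers are from my reading of
the series and may differ in detail):

static enum lru_status shrink_memcg_cb(struct list_head *item,
				       struct list_lru_one *l,
				       spinlock_t *lock, void *arg)
{
	struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
	swp_entry_t swpentry;
	enum lru_status ret = LRU_REMOVED_RETRY;
	int writeback_result;

	/*
	 * Copy the swap entry to the stack while the LRU lock still
	 * keeps the zswap entry alive, then rotate it to the tail so
	 * concurrent reclaimers skip it in case it gets invalidated.
	 */
	swpentry = entry->swpentry;
	list_move_tail(item, &l->list);
	spin_unlock(lock);

	/*
	 * From here on, a concurrent invalidation can free the entry
	 * at any time; it must not be dereferenced until
	 * zswap_writeback_entry() has pinned the swap slot via the
	 * swapcache and re-validated the entry under the tree lock.
	 */
	writeback_result = zswap_writeback_entry(entry, swpentry);

	/* ... handle writeback_result and re-take the LRU lock ... */
	spin_lock(lock);
	return ret;
}

The key point being that the entry is never unlinked here, so even a
temporary failure cannot leak it.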

>>> We could also use a comment in zswap_writeback_entry() (or above it) to
>>> state that we only deref the tree *after* we get the swapcache ref.
>>
>> I just noticed there are already some comments in zswap_writeback_entry();
>> should we add more comments there?
>>
>> 	/*
>> 	 * folio is locked, and the swapcache is now secured against
>> 	 * concurrent swapping to and from the slot. Verify that the
>> 	 * swap entry hasn't been invalidated and recycled behind our
>> 	 * backs (our zswap_entry reference doesn't prevent that), to
>> 	 * avoid overwriting a new swap folio with old compressed data.
>> 	 */
> 
> The bit in () is now stale, since we're not even holding a ref ;)

Right.
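
Maybe the parenthetical just needs updating to say that no reference is
held at all at this point; something like (only a wording suggestion):

	/*
	 * folio is locked, and the swapcache is now secured against
	 * concurrent swapping to and from the slot. Verify that the
	 * swap entry hasn't been invalidated and recycled behind our
	 * backs (we hold no reference on the zswap entry at this
	 * point, so it may already have been freed and must not be
	 * dereferenced until this check passes), to avoid overwriting
	 * a new swap folio with old compressed data.
	 */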

> 
> Otherwise, a brief note that the entry cannot be dereferenced until
> validation would be helpful in zswap_writeback_entry(). The core of

Ok.
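
For that note, maybe something along these lines at the top of
zswap_writeback_entry() (a rough sketch of how I understand the current
code; the tree lookup helper and locking shown here are illustrative, not
a verbatim copy):

static int zswap_writeback_entry(struct zswap_entry *entry,
				 swp_entry_t swpentry)
{
	struct zswap_tree *tree;
	struct folio *folio;

	/*
	 * entry is only guaranteed to be alive while the LRU lock is
	 * held in shrink_memcg_cb(); once that lock is dropped it can
	 * be freed by a concurrent invalidation. Don't dereference it
	 * (and don't even look up the tree, which swapoff could free)
	 * until the swap slot has been pinned in the swapcache and
	 * the check below passes.
	 */

	/* ... get the locked folio for swpentry in the swapcache ... */

	/*
	 * folio is locked and the slot is pinned; now it is safe to
	 * look up the tree and validate the entry by pointer value.
	 */
	tree = zswap_trees[swp_type(swpentry)];
	spin_lock(&tree->lock);
	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
		spin_unlock(&tree->lock);
		return -ENOMEM;	/* raced with invalidation */
	}
	spin_unlock(&tree->lock);

	/* ... decompress the entry into the folio and write it back ... */
	return 0;
}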

> the scheme I'd probably describe in shrink_memcg_cb(), though.
> 
> Can I ask a favor, though?
> 
> For non-critical updates to this patch, can you please make them
> follow-up changes? I just sent out 20 cleanup patches on top of this
> patch which would be super painful and error prone to rebase. I'd like
> to avoid that if at all possible.

Ok, so these comment changes should be made on top of your cleanup series
and sent as a follow-up patch.

Thanks.

