[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <c942e424-276d-4df7-4917-d61063ab8502@bytedance.com>
Date: Thu, 27 Jul 2023 11:34:30 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Dave Chinner <david@...morbit.com>
Cc: akpm@...ux-foundation.org, tkhai@...ru, vbabka@...e.cz,
roman.gushchin@...ux.dev, djwong@...nel.org, brauner@...nel.org,
paulmck@...nel.org, tytso@....edu, steven.price@....com, cel@...nel.org,
senozhatsky@...omium.org, yujie.liu@...el.com, gregkh@...uxfoundation.org,
muchun.song@...ux.dev, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
x86@...nel.org, kvm@...r.kernel.org, xen-devel@...ts.xenproject.org,
linux-erofs@...ts.ozlabs.org, linux-f2fs-devel@...ts.sourceforge.net,
cluster-devel@...hat.com, linux-nfs@...r.kernel.org,
linux-mtd@...ts.infradead.org, rcu@...r.kernel.org, netdev@...r.kernel.org,
dri-devel@...ts.freedesktop.org, linux-arm-msm@...r.kernel.org,
dm-devel@...hat.com, linux-raid@...r.kernel.org,
linux-bcache@...r.kernel.org, virtualization@...ts.linux-foundation.org,
linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
linux-xfs@...r.kernel.org, linux-btrfs@...r.kernel.org
Subject: Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless
Hi Dave,
On 2023/7/27 07:09, Dave Chinner wrote:
> On Wed, Jul 26, 2023 at 05:14:09PM +0800, Qi Zheng wrote:
>> On 2023/7/26 16:08, Dave Chinner wrote:
>>> On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
>>>> @@ -122,6 +126,13 @@ void shrinker_free_non_registered(struct shrinker *shrinker);
>>>> void shrinker_register(struct shrinker *shrinker);
>>>> void shrinker_unregister(struct shrinker *shrinker);
>>>> +static inline bool shrinker_try_get(struct shrinker *shrinker)
>>>> +{
>>>> + return READ_ONCE(shrinker->registered) &&
>>>> + refcount_inc_not_zero(&shrinker->refcount);
>>>> +}
>>>
>>> Why do we care about shrinker->registered here? If we don't set
>>> the refcount to 1 until we have fully initialised everything, then
>>> the shrinker code can key entirely off the reference count and
>>> none of the lookup code needs to care about whether the shrinker is
>>> registered or not.
>>
>> The purpose of checking shrinker->registered here is to stop running
>> shrinker after calling shrinker_free(), which can prevent the following
>> situations from happening:
>>
>> CPU 0 CPU 1
>>
>> shrinker_try_get()
>>
>> shrinker_try_get()
>>
>> shrinker_put()
>> shrinker_try_get()
>> shrinker_put()
>
> I don't see any race here? What is wrong with having multiple active
> users at once?
Maybe I'm overthinking. What I think is that if there are multiple users
at once, it may cause the above-mentioned livelock, which will cause
shrinker_free() to wait for a long time. But this probability should be
very low.
>
>>>
>>> This should use a completion, then it is always safe under
>>> rcu_read_lock(). This also gets rid of the shrinker_lock spin lock,
>>> which only exists because we can't take a blocking lock under
>>> rcu_read_lock(). i.e:
>>>
>>>
>>> void shrinker_put(struct shrinker *shrinker)
>>> {
>>> if (refcount_dec_and_test(&shrinker->refcount))
>>> complete(&shrinker->done);
>>> }
>>>
>>> void shrinker_free()
>>> {
>>> .....
>>> refcount_dec(&shrinker->refcount);
>>
>> I guess what you mean is shrinker_put(), because here may be the last
>> refcount.
>
> Yes, I did.
>
>>> wait_for_completion(&shrinker->done);
>>> /*
>>> * lookups on the shrinker will now all fail as refcount has
>>> * fallen to zero. We can now remove it from the lists and
>>> * free it.
>>> */
>>> down_write(shrinker_rwsem);
>>> list_del_rcu(&shrinker->list);
>>> up_write(&shrinker_rwsem);
>>> call_rcu(shrinker->rcu_head, shrinker_free_rcu_cb);
>>> }
>>>
>>> ....
>>>
>>>> @@ -686,11 +711,14 @@ EXPORT_SYMBOL(shrinker_free_non_registered);
>>>> void shrinker_register(struct shrinker *shrinker)
>>>> {
>>>> - down_write(&shrinker_rwsem);
>>>> - list_add_tail(&shrinker->list, &shrinker_list);
>>>> - shrinker->flags |= SHRINKER_REGISTERED;
>>>> + refcount_set(&shrinker->refcount, 1);
>>>> +
>>>> + spin_lock(&shrinker_lock);
>>>> + list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>>> + spin_unlock(&shrinker_lock);
>>>> +
>>>> shrinker_debugfs_add(shrinker);
>>>> - up_write(&shrinker_rwsem);
>>>> + WRITE_ONCE(shrinker->registered, true);
>>>> }
>>>> EXPORT_SYMBOL(shrinker_register);
>>>
>>> This just looks wrong - you are trying to use WRITE_ONCE() as a
>>> release barrier to indicate that the shrinker is now set up fully.
>>> That's not necessary - the refcount is an atomic and along with the
>>> rcu locks they should provides all the barriers we need. i.e.
>>
>> The reason I used WRITE_ONCE() here is because the shrinker->registered
>> will be read and written concurrently (read in shrinker_try_get() and
>> written in shrinker_free()), which is why I added shrinker::registered
>> field instead of using SHRINKER_REGISTERED flag (this can reduce the
>> addition of WRITE_ONCE()/READ_ONCE()).
>
> Using WRITE_ONCE/READ_ONCE doesn't provide memory barriers needed to
> use the field like this. You need release/acquire memory ordering
> here. i.e. smp_store_release()/smp_load_acquire().
>
> As it is, the refcount_inc_not_zero() provides a control dependency,
> as documented in include/linux/refcount.h, refcount_dec_and_test()
> provides release memory ordering. The only thing I think we may need
> is a write barrier before refcount_set(), such that if
> refcount_inc_not_zero() sees a non-zero value, it is guaranteed to
> see an initialised structure...
>
> i.e. refcounts provide all the existence and initialisation
> guarantees. Hence I don't see the need to use shrinker->registered
> like this and it can remain a bit flag protected by the
> shrinker_rwsem().
Ah, I didn't consider the memory order with refcount when I added
WRITE_ONCE/READ_ONCE to shrinker->registered, just didn't want KCSAN
to complain (there are multiple visitors at the same time, one of which
is a writer).
And the livelock case mentioned above is indeed unlikely to happen, so
I will delete shrinker->registered in the next version.
>
>
>>> void shrinker_register(struct shrinker *shrinker)
>>> {
>>> down_write(&shrinker_rwsem);
>>> list_add_tail_rcu(&shrinker->list, &shrinker_list);
>>> shrinker->flags |= SHRINKER_REGISTERED;
>>> shrinker_debugfs_add(shrinker);
>>> up_write(&shrinker_rwsem);
>>>
>>> /*
>>> * now the shrinker is fully set up, take the first
>>> * reference to it to indicate that lookup operations are
>>> * now allowed to use it via shrinker_try_get().
>>> */
>>> refcount_set(&shrinker->refcount, 1);
>>> }
>>>
>>>> diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
>>>> index f1becfd45853..c5573066adbf 100644
>>>> --- a/mm/shrinker_debug.c
>>>> +++ b/mm/shrinker_debug.c
>>>> @@ -5,6 +5,7 @@
>>>> #include <linux/seq_file.h>
>>>> #include <linux/shrinker.h>
>>>> #include <linux/memcontrol.h>
>>>> +#include <linux/rculist.h>
>>>> /* defined in vmscan.c */
>>>> extern struct rw_semaphore shrinker_rwsem;
>>>> @@ -161,17 +162,21 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
>>>> {
>>>> struct dentry *entry;
>>>> char buf[128];
>>>> - int id;
>>>> -
>>>> - lockdep_assert_held(&shrinker_rwsem);
>>>> + int id, ret = 0;
>>>> /* debugfs isn't initialized yet, add debugfs entries later. */
>>>> if (!shrinker_debugfs_root)
>>>> return 0;
>>>> + down_write(&shrinker_rwsem);
>>>> + if (shrinker->debugfs_entry)
>>>> + goto fail;
>>>> +
>>>> id = ida_alloc(&shrinker_debugfs_ida, GFP_KERNEL);
>>>> - if (id < 0)
>>>> - return id;
>>>> + if (id < 0) {
>>>> + ret = id;
>>>> + goto fail;
>>>> + }
>>>> shrinker->debugfs_id = id;
>>>> snprintf(buf, sizeof(buf), "%s-%d", shrinker->name, id);
>>>> @@ -180,7 +185,8 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
>>>> entry = debugfs_create_dir(buf, shrinker_debugfs_root);
>>>> if (IS_ERR(entry)) {
>>>> ida_free(&shrinker_debugfs_ida, id);
>>>> - return PTR_ERR(entry);
>>>> + ret = PTR_ERR(entry);
>>>> + goto fail;
>>>> }
>>>> shrinker->debugfs_entry = entry;
>>>> @@ -188,7 +194,10 @@ int shrinker_debugfs_add(struct shrinker *shrinker)
>>>> &shrinker_debugfs_count_fops);
>>>> debugfs_create_file("scan", 0220, entry, shrinker,
>>>> &shrinker_debugfs_scan_fops);
>>>> - return 0;
>>>> +
>>>> +fail:
>>>> + up_write(&shrinker_rwsem);
>>>> + return ret;
>>>> }
>>>> int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...)
>>>> @@ -243,6 +252,11 @@ struct dentry *shrinker_debugfs_detach(struct shrinker *shrinker,
>>>> shrinker->name = NULL;
>>>> *debugfs_id = entry ? shrinker->debugfs_id : -1;
>>>> + /*
>>>> + * Ensure that shrinker->registered has been set to false before
>>>> + * shrinker->debugfs_entry is set to NULL.
>>>> + */
>>>> + smp_wmb();
>>>> shrinker->debugfs_entry = NULL;
>>>> return entry;
>>>> @@ -266,14 +280,26 @@ static int __init shrinker_debugfs_init(void)
>>>> shrinker_debugfs_root = dentry;
>>>> /* Create debugfs entries for shrinkers registered at boot */
>>>> - down_write(&shrinker_rwsem);
>>>> - list_for_each_entry(shrinker, &shrinker_list, list)
>>>> + rcu_read_lock();
>>>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>> + if (!shrinker_try_get(shrinker))
>>>> + continue;
>>>> + rcu_read_unlock();
>>>> +
>>>> if (!shrinker->debugfs_entry) {
>>>> - ret = shrinker_debugfs_add(shrinker);
>>>> - if (ret)
>>>> - break;
>>>> + /* Paired with smp_wmb() in shrinker_debugfs_detach() */
>>>> + smp_rmb();
>>>> + if (READ_ONCE(shrinker->registered))
>>>> + ret = shrinker_debugfs_add(shrinker);
>>>> }
>>>> - up_write(&shrinker_rwsem);
>>>> +
>>>> + rcu_read_lock();
>>>> + shrinker_put(shrinker);
>>>> +
>>>> + if (ret)
>>>> + break;
>>>> + }
>>>> + rcu_read_unlock();
>>>> return ret;
>>>> }
>>>
>>> And all this churn and complexity can go away because the
>>> shrinker_rwsem is still used to protect shrinker_register()
>>> entirely....
>>
>> My consideration is that during this process, there may be a
>> driver probe failure and then shrinker_free() is called (the
>> shrinker_debugfs_init() is called in late_initcall stage). In
>> this case, we need to use RCU+refcount to ensure that the shrinker
>> is not freed.
>
> Yeah, you're trying to work around the lack of a
> wait_for_completion() call in shrinker_free().
>
> With that, this doesn't need RCU at all, and the iteration can be
> done fully under the shrinker_rwsem() safely and so none of this
> code needs to change.
Oh, indeed, here does not need to be changed.
Thanks,
Qi
>
> Cheers,
>
> Dave.
Powered by blists - more mailing lists