Message-ID: <f3fb2869-e89f-707b-c6e2-05d6e3abfb74@virtuozzo.com>
Date: Thu, 19 Dec 2019 10:20:36 +0000
From: Pavel Tikhomirov <ptikhomirov@...tuozzo.com>
To: Michal Hocko <mhocko@...nel.org>
CC: Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
Johannes Weiner <hannes@...xchg.org>,
Vladimir Davydov <vdavydov.dev@...il.com>,
Roman Gushchin <guro@...com>,
Shakeel Butt <shakeelb@...gle.com>,
Chris Down <chris@...isdown.name>,
Yang Shi <yang.shi@...ux.alibaba.com>,
Tejun Heo <tj@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Konstantin Khorenko <khorenko@...tuozzo.com>,
Kirill Tkhai <ktkhai@...tuozzo.com>,
Andrey Ryabinin <aryabinin@...tuozzo.com>,
Dave Chinner <david@...morbit.com>,
Trond Myklebust <trond.myklebust@...merspace.com>,
Anna Schumaker <anna.schumaker@...app.com>,
"J. Bruce Fields" <bfields@...ldses.org>,
Chuck Lever <chuck.lever@...cle.com>,
"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
Alexander Viro <viro@...iv.linux.org.uk>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH] mm: fix hanging shrinker management on long
do_shrink_slab
On 12/4/19 11:35 AM, Michal Hocko wrote:
> On Sat 30-11-19 00:45:41, Pavel Tikhomirov wrote:
>> We have a problem that shrinker_rwsem can be held for a long time for
>> read in shrink_slab, while any process trying to register or unregister
>> a shrinker hangs waiting for it.
>>
>> The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list.
>> It tries to shrink something on nfs (hard mount) but the nfs server is
>> already dead at that moment and the rpc will never succeed. Generally
>> any shrinker can spend significant time in do_shrink_slab, so it's a
>> bad idea to hold the list lock here.
>
> Yes, this is a known problem and people have already tried to address it
> in the past. Have you checked previous attempts? SRCU based one
> http://lkml.kernel.org/r/153365347929.19074.12509495712735843805.stgit@localhost.localdomain
> but I believe there were others (I only had this one in my notes).
> Please make sure to Cc Dave Chinner when posting a next version because
> he had some concerns about the change of the behavior.
The approach of the patch you are referencing is quite different: it 
still holds the global srcu_read_lock(&srcu) while diving into 
do_shrink_slab, so a hanging nfs shrinker would still block every 
[un]register_shrinker.
>
>> We have a similar problem in shrink_slab_memcg, except that we are
>> traversing shrinker_map+shrinker_idr there.
>>
>> The idea of the patch is to inc a refcount to the chosen shrinker so it
>> won't disappear and release shrinker_rwsem while we are in
>> do_shrink_slab, after that we will reacquire shrinker_rwsem, dec
>> the refcount and continue the traversal.
>
> The reference count part makes sense to me. RCU role needs a better
> explanation.
I have two RCU uses in the patch: the first protects shrinker_map, the 
same way memcg_set_shrinker_bit did before; the second protects the 
shrinker struct in put_shrinker from being freed, because 
unregister_shrinker can see refcnt==0 while put_shrinker is still 
running, without it actually going to schedule().
> Also do you have any reason to not use completion for
> the final step? Openconding essentially the same concept sounds a bit
> awkward to me.
Thanks for the good hint; at first glance we can rework the wait_event 
part to use wait_for_completion.
>
>> We also need a wait_queue so that unregister_shrinker can wait for the
>> refcnt to become zero. Only after these we can safely remove the
>> shrinker from list and idr, and free the shrinker.
> [...]
>> crash> bt ...
>> PID: 18739 TASK: ... CPU: 3 COMMAND: "bash"
>> #0 [...] __schedule at ...
>> #1 [...] schedule at ...
>> #2 [...] rpc_wait_bit_killable at ... [sunrpc]
>> #3 [...] __wait_on_bit at ...
>> #4 [...] out_of_line_wait_on_bit at ...
>> #5 [...] _nfs4_proc_delegreturn at ... [nfsv4]
>> #6 [...] nfs4_proc_delegreturn at ... [nfsv4]
>> #7 [...] nfs_do_return_delegation at ... [nfsv4]
>> #8 [...] nfs4_evict_inode at ... [nfsv4]
>> #9 [...] evict at ...
>> #10 [...] dispose_list at ...
>> #11 [...] prune_icache_sb at ...
>> #12 [...] super_cache_scan at ...
>> #13 [...] do_shrink_slab at ...
>
> Are NFS people aware of this? Because this is simply not acceptable
> behavior. Memory reclaim cannot be blocked indefinitely or for a long
> time. There must be a way to simply give up if the underlying inode
> cannot be reclaimed.
Sorry that I didn't cc the nfs people from the beginning.
>
> I still have to think about the proposed solution. It sounds a bit over
> complicated to me.
>
--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.