linux-kernel - Re: [4.15-rc9] fs_reclaim lockdep trace

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <201801292047.EHC05241.OHSQOJOVtFMFLF@I-love.SAKURA.ne.jp>
Date:   Mon, 29 Jan 2018 20:47:20 +0900
From:   Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:     peterz@...radead.org
Cc:     torvalds@...ux-foundation.org, davej@...emonkey.org.uk,
        npiggin@...il.com, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, netdev@...r.kernel.org, mhocko@...nel.org
Subject: Re: [4.15-rc9] fs_reclaim lockdep trace

Peter Zijlstra wrote:
> On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> > This warning seems to be caused by commit d92a8cfcb37ecd13
> > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> > location of
> > 
> >   /* this guy won't enter reclaim */
> >   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> >           return false;
> > 
> > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> > (__GFP_NOFS)").
> 
> I'm not entirly sure I get what you mean here. How did I move it? It was
> part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not
> mark the lock as held.

d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with
fs_reclaim_acquire(), and removed current->lockdep_recursion handling.

----------
# git show d92a8cfcb37ecd13 | grep recursion
-# define INIT_LOCKDEP                          .lockdep_recursion = 0, .lockdep_reclaim_gfp = 0,
+# define INIT_LOCKDEP                          .lockdep_recursion = 0,
        unsigned int                    lockdep_recursion;
-       if (unlikely(current->lockdep_recursion))
-       current->lockdep_recursion = 1;
-       current->lockdep_recursion = 0;
-        * context checking code. This tests GFP_FS recursion (a lock taken
----------

> 
> The new code has it in fs_reclaim_acquire/release to the same effect, if
> __GFP_NOMEMALLOC, we'll not acquire/release the lock.

Excuse me, but I can't catch.
We currently acquire/release __fs_reclaim_map if __GFP_NOMEMALLOC.

----------
+static bool __need_fs_reclaim(gfp_t gfp_mask)
+{
(...snipped...)
+       /* this guy won't enter reclaim */
+       if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+               return false;
(...snipped...)
+}
----------

> 
> 
> > Since __kmalloc_reserve() from __alloc_skb() adds
> > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> > failing to return false despite PF_MEMALLOC context (and resulted in
> > lockdep warning).
> 
> But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC.
> That's what the name says.

__GFP_NOMEMALLOC negates PF_MEMALLOC regarding what watermark that allocation
request should use.

----------
static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
{
        if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
                return 0;
        if (gfp_mask & __GFP_MEMALLOC)
                return ALLOC_NO_WATERMARKS;
        if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
                return ALLOC_NO_WATERMARKS;
        if (!in_interrupt()) {
                if (current->flags & PF_MEMALLOC)
                        return ALLOC_NO_WATERMARKS;
                else if (oom_reserves_allowed(current))
                        return ALLOC_OOM;
        }

        return 0;
}
----------

But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM.

----------
        /* Attempt with potentially adjusted zonelist and alloc_flags */
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
        if (page)
                goto got_pg;

        /* Caller is not willing to reclaim, we can't balance anything */
        if (!can_direct_reclaim)
                goto nopage;

        /* Avoid recursion of direct reclaim */
        if (current->flags & PF_MEMALLOC)
                goto nopage;

        /* Try direct reclaim and then allocating */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                                        &did_some_progress);
        if (page)
                goto got_pg;

        /* Try direct compaction and then allocating */
        page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                        compact_priority, &compact_result);
        if (page)
                goto got_pg;

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;
----------

Then, how can fs_reclaim contribute to deadlock?

> 
> > Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking
> > __GFP_NOMEMALLOC might make sense. But since this safeguard was added by
> > commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for
> > allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense.
> > Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to
> > return false.
> 
> This does not in fact explain what's going on, it just points to
> 'random' patches.
> 
> Are you talking about this:
> 
> +       /* Avoid recursion of direct reclaim */
> +       if (p->flags & PF_MEMALLOC)
> +               goto nopage;
> 
> bit?

Yes.

> 
> > Reported-by: Dave Jones <davej@...emonkey.org.uk>
> > Signed-off-by: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
> > Cc: Peter Zijlstra <peterz@...radead.org>
> > Cc: Nick Piggin <npiggin@...il.com>
> > ---
> >  mm/page_alloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 76c9688..7804b0e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3583,7 +3583,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
> >  		return false;
> >  
> >  	/* this guy won't enter reclaim */
> > -	if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> > +	if (current->flags & PF_MEMALLOC)
> >  		return false;
> 
> I'm _really_ uncomfortable doing that. Esp. without a solid explanation
> of how this raelly can't possibly lead to trouble. Which the above semi
> incoherent rambling is not.
> 
> Your backtrace shows the btrfs shrinker doing an allocation, that's the
> exact kind of thing we need to be extremely careful with.
> 

If btrfs is already holding some lock (and thus __GFP_FS is not safe),
that lock must be printed at

  2 locks held by sshd/24800:
   #0:  (sk_lock-AF_INET6){+.+.}, at: [<000000001a069652>] tcp_sendmsg+0x19/0x40
   #1:  (fs_reclaim){+.+.}, at: [<0000000084f438c2>] fs_reclaim_acquire.part.102+0x5/0x30

doesn't it? But sk_lock-AF_INET6 is not a FS lock, and fs_reclaim does not
actually lock something.