Message-ID: <alpine.LSU.2.11.2103311413560.1201@eggly.anvils>
Date:   Wed, 31 Mar 2021 14:58:12 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Matthew Wilcox <willy@...radead.org>
cc:     Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, linux-nvdimm@...ts.01.org,
        linux-kernel@...r.kernel.org
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Wed, 31 Mar 2021, Matthew Wilcox wrote:
> On Tue, Mar 30, 2021 at 06:30:22PM -0700, Hugh Dickins wrote:
> > Running my usual tmpfs kernel builds swapping load, on Sunday's rc4-mm1
> > mmotm (I never got to try rc3-mm1 but presume it behaved the same way),
> > I hit clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)); on two
> > machines, within an hour or few, repeatably though not to order.
> > 
> > The stack backtrace has always been clear_inode < ext4_clear_inode <
> > ext4_evict_inode < evict < dispose_list < prune_icache_sb <
> > super_cache_scan < do_shrink_slab < shrink_slab_memcg < shrink_slab <
> > shrink_node_memcgs < shrink_node < balance_pgdat < kswapd.
> > 
> > ext4 is the disk filesystem from which the builds read source, and also
> > the filesystem I use on a loop device on a tmpfs file: I have not tried
> > with other filesystems, nor checked whether perhaps it happens always on
> > the loop one or always on the disk one.  I have not seen it happen with
> > tmpfs - probably because its inodes cannot be evicted by the shrinker
> > anyway; I have not seen it happen when "rm -rf" evicts ext4 or tmpfs
> > inodes (but suspect that may be down to timing, or less pressure).
> > I doubt it's a matter of filesystem: think it's an XArray thing.
> > 
> > Whenever I've looked at the XArray nodes involved, the root node
> > (shift 6) contained one or three (adjacent) pointers to empty shift
> > 0 nodes, which each had offset and parent and array correctly set.
> > Is there some way in which empty nodes can get left behind, and so
> > fail eviction's mapping_empty() check?
> 
> There isn't _supposed_ to be.  The XArray is supposed to delete nodes
> whenever the ->count reaches zero.  It might give me a clue if you could
> share a dump of the tree, if you still have that handy.
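> 
> (For the record, the invariant is easiest to see through the public
> API; here is a minimal sketch in the style of the lib/test_xarray.c
> tests - check_node_freed() is a made-up name:
> 
> 	static noinline void check_node_freed(struct xarray *xa)
> 	{
> 		/* storing at a non-zero index allocates a node */
> 		xa_store(xa, 100, xa_mk_value(100), GFP_KERNEL);
> 		/* erasing the last entry drops ->count to 0 ... */
> 		xa_erase(xa, 100);
> 		/* ... and the empty node should be freed with it */
> 		XA_BUG_ON(xa, !xa_empty(xa));
> 	}
> 
> So an empty node surviving means some path got a node's ->count to 0
> without xas_delete_node() being called on it.)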

Very useful suggestion: the xa_dump() may not give you more of a clue,
but just running again last night to gather that info has revealed more.

> 
> > I did wonder whether some might get left behind if xas_alloc() fails
> > (though probably the tree here is too shallow to show that).  Printks
> > showed that occasionally xas_alloc() did fail while testing (maybe at
> > memcg limit), but there was no correlation with the BUG_ONs.
> 
> This is a problem inherited from the radix tree, and I really want to
> justify fixing it ... I think I may have enough infrastructure in place
> to do it now (as part of the xas_split() commit we can now allocate
> multiple xa_nodes in xas->xa_alloc).  But you're right; if we allocated
> all the way down to an order-0 node, then this isn't the bug.
> 
> Were you using the ALLOW_ERROR_INJECTION feature on
> __add_to_page_cache_locked()?  I haven't looked into how that works,
> and maybe that could leave us in an inconsistent state.

No, no error injection: not something I've ever looked at either.
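
(For reference, the hook in question is - from memory, so treat this
as a sketch - the error-injection annotation on the function in
mm/filemap.c:

	ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);

which whitelists __add_to_page_cache_locked() for the fault-injection
machinery to force an error return, exercising failure paths that are
otherwise hard to reach.)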

> 
> > I did wonder whether this is a long-standing issue, which your new
> > BUG_ON is the first to detect: so tried 5.12-rc5 clear_inode() with
> > a BUG_ON(!xa_empty(&inode->i_data.i_pages)) after its nrpages and
> > nrexceptional BUG_ONs.  The result there surprised me: I expected
> > it to behave the same way, but it hits that BUG_ON in a minute or
> > so, instead of an hour or so.  Was there a fix you made somewhere,
> > to avoid the BUG_ON(!mapping_empty) most of the time, one that still
> > needs more work?  I looked around a little, but didn't find any.
> 
> I didn't make a fix for this issue; indeed I haven't observed it myself.

That was interesting to me last night, but not so interesting now that
we have more info (below).

> It seems like cgroups are a good way to induce allocation failures, so
> I should play around with that a bit.  The userspace test-suite has a
> relatively malicious allocator that will fail every allocation not marked
> as GFP_KERNEL, so it always exercises the fallback path for GFP_NOWAIT,
> but then it will always succeed eventually.
> 
> > I had hoped to work this out myself, and save us both some writing:
> > but better to hand it over to you, in the hope that you'll quickly
> > guess what's up; then I can try patches.  I do like the
> > no-nrexceptionals series, but there's something still to be fixed.
> 
> Agreed.  It seems like it's unmasking a bug that already existed, so
> it's not an argument for dropping the series, but we should fix the bug
> so we don't crash people's machines.
> 
> Arguably, the condition being checked for is not serious enough for a
> BUG_ON.  A WARN_ON, yes, and dump the tree for later perusal, but it's
> just a memory leak, and not (I think?) likely to lead to later memory
> corruption.  The nodes don't contain any pages, so there's nothing to
> point to the mapping.

Good suggestion, yes, use a WARN_ON instead of a BUG_ON.  And even when
this immediate bug is fixed, the WARN_ON will still be necessary for as
long as a radix-tree -ENOMEM can leave intermediate nodes behind.  (An
easy way to fix that might be just to add a cleanup pass when
!mapping_empty().)

I followed your suggestion and used WARN_ON instead of BUG_ON in last
night's runs.  (When you make that change, do remember to do what I at
first forgot: reset i_pages to NULL; otherwise anyone who inherits that
struct inode thereafter inherits these nodes - I got lots of warnings
on tiny files!)
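
Concretely, the debug check now amounts to something like this (a
sketch, not the exact patch: xa_dump() needs XA_DEBUG, and the
xa_destroy() is my lazy way of freeing the stale nodes and resetting
the head):

	if (WARN_ON(!mapping_empty(&inode->i_data))) {
		xa_dump(&inode->i_data.i_pages);
		/* don't pass the stale nodes on to the next owner */
		xa_destroy(&inode->i_data.i_pages);
	}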

And I also printed out s_dev, i_ino and i_size, in the hope that they
might shed some more light.  Hope fulfilled: *every* instance, on every
machine, has been /usr/bin/ld.bfd.

My kernels are built with CONFIG_READ_ONLY_THP_FOR_FS=y.  And checking
old /var/log/messages from last summer, when I was first playing around
with that, I see the temporary debug messages I had added for it:
"collapse_file(/usr/bin/ld.bfd, 0)" meaning that it was making a THP
at offset 0.  (Other executables too, notably cc1; but maybe ld.bfd
is the one most likely to get evicted.)

I suspect there's a bug in the XArray handling in collapse_file(),
which sometimes leaves empty nodes behind.
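
If that's right, it ought to be demonstrable in the
tools/testing/radix-tree test-suite with something like the following
(a sketch - check_create_range_leak() is a made-up name, and I'm
guessing at the failure mode: collapse_file() pre-creates the slots
for the whole huge page range with xas_create_range(), and an abort
may leave the never-populated nodes behind):

	static noinline void check_create_range_leak(struct xarray *xa)
	{
		/* a 512-slot range, as for a pmd-sized THP */
		XA_STATE_ORDER(xas, xa, 0, 9);

		xas_lock(&xas);
		xas_create_range(&xas);	/* allocates nodes down to order 0 */
		xas_unlock(&xas);

		/* nothing was ever stored, so the tree is logically
		 * empty: are the freshly created nodes freed? */
		XA_BUG_ON(xa, !xa_empty(xa));
	}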

Here's the xa_dump() info from one of the machines:

[  348.026010] xarray: ffff88800f01eb98 head ffff88800855f2a2 flags 21 marks 0 0 0
[  348.028101] 0-4095: node ffff88800855f2a0 max 0 parent 0000000000000000 shift 6 count 1 values 0 array ffff88800f01eb98 list ffff88800855f2b8 ffff88800855f2b8 marks 0 0 0
[  348.030335] 960-1023: node ffff888023dbd890 offset 15 parent ffff88800855f2a0 shift 0 count 0 values 0 array ffff88800f01eb98 list ffff888023dbd8a8 ffff888023dbd8a8 marks 0 0 0
[  348.032736] s_dev 259:5 i_ino 935588 i_size 9112288

[ 4521.543662] xarray: ffff888003e8f898 head ffff88801256963a flags 21 marks 0 0 0
[ 4521.545683] 0-4095: node ffff888012569638 max 0 parent 0000000000000000 shift 6 count 1 values 0 array ffff888003e8f898 list ffff888012569650 ffff888012569650 marks 0 0 0
[ 4521.547889] 384-447: node ffff8880215b2088 offset 6 parent ffff888012569638 shift 0 count 0 values 0 array ffff888003e8f898 list ffff8880215b20a0 ffff8880215b20a0 marks 0 0 0
[ 4521.550151] s_dev 259:5 i_ino 935588 i_size 9112288

[17526.162821] xarray: ffff88800f09a998 head ffff88801610808a flags 21 marks 0 0 0
[17526.163936] 0-4095: node ffff888016108088 max 0 parent 0000000000000000 shift 6 count 2 values 0 array ffff88800f09a998 list ffff8880161080a0 ffff8880161080a0 marks 0 0 0
[17526.165176] 384-447: node ffff888022a5fc00 offset 6 parent ffff888016108088 shift 0 count 0 values 0 array ffff88800f09a998 list ffff888022a5fc18 ffff888022a5fc18 marks 0 0 0
[17526.167012] 960-1023: node ffff888008656188 offset 15 parent ffff888016108088 shift 0 count 0 values 0 array ffff88800f09a998 list ffff8880086561a0 ffff8880086561a0 marks 0 0 0
[17526.168225] s_dev 259:5 i_ino 935588 i_size 9112288

[27201.416553] xarray: ffff888026638a98 head ffff8880241fd14a flags 21 marks 0 0 0
[27201.418724] 0-4095: node ffff8880241fd148 max 0 parent 0000000000000000 shift 6 count 1 values 0 array ffff888026638a98 list ffff8880241fd160 ffff8880241fd160 marks 0 0 0
[27201.421067] 384-447: node ffff888026471aa8 offset 6 parent ffff8880241fd148 shift 0 count 0 values 0 array ffff888026638a98 list ffff888026471ac0 ffff888026471ac0 marks 0 0 0
[27201.423353] s_dev 259:5 i_ino 935588 i_size 9112288

[28189.092410] xarray: ffff888022468858 head ffff888016358262 flags 21 marks 0 0 0
[28189.094548] 0-4095: node ffff888016358260 max 0 parent 0000000000000000 shift 6 count 1 values 0 array ffff888022468858 list ffff888016358278 ffff888016358278 marks 0 0 0
[28189.096821] 384-447: node ffff8880124455f8 offset 6 parent ffff888016358260 shift 0 count 0 values 0 array ffff888022468858 list ffff888012445610 ffff888012445610 marks 0 0 0
[28189.099217] s_dev 259:5 i_ino 935588 i_size 9112288

935588 -rwxr-xr-x 1 root root 9112288 Dec  7 14:28 /usr/bin/ld.bfd

Hugh
