linux-kernel - Re: [PATCH] nilfs2: drop the inode which has been removed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKFNMono7BGpLOOjF1TcUpj7GM=x-aATHUv+fCXTs6=WVhYMUw@mail.gmail.com>
Date: Sun, 8 Dec 2024 02:49:28 +0900
From: Ryusuke Konishi <konishi.ryusuke@...il.com>
To: Edward Adam Davis <eadavis@...com>
Cc: linux-kernel@...r.kernel.org, linux-nilfs@...r.kernel.org, 
	syzbot+9260555647a5132edd48@...kaller.appspotmail.com, 
	syzkaller-bugs@...glegroups.com
Subject: Re: [PATCH] nilfs2: drop the inode which has been removed

On Fri, Dec 6, 2024 at 10:14 PM Edward Adam Davis wrote:
>
> On Fri, 6 Dec 2024 01:04:23 +0900, Ryusuke Konishi wrote:
> > > syzbot reported a WARNING in nilfs_rmdir. [1]
> > >
> > > The inode is used twice by the same task to unmount and remove directories
> > > ".nilfs" and "file0", it trigger warning in nilfs_rmdir.
> > >
> > > Avoid to this issue, check i_size and i_nlink in nilfs_iget(), if they are
> > > both 0, it means that this inode has been removed, and iput is executed to
> > > reclaim it.
> > >
> > > [1]
> > > WARNING: CPU: 1 PID: 5824 at fs/inode.c:407 drop_nlink+0xc4/0x110 fs/inode.c:407
> > > Modules linked in:
> > > CPU: 1 UID: 0 PID: 5824 Comm: syz-executor223 Not tainted 6.12.0-syzkaller-12113-gbcc8eda6d349 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> > > RIP: 0010:drop_nlink+0xc4/0x110 fs/inode.c:407
> > > Code: bb 70 07 00 00 be 08 00 00 00 e8 57 0b e6 ff f0 48 ff 83 70 07 00 00 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc e8 9d 4c 7e ff 90 <0f> 0b 90 eb 83 44 89 e1 80 e1 07 80 c1 03 38 c1 0f 8c 5c ff ff ff
> > > RSP: 0018:ffffc900037f7c70 EFLAGS: 00010293
> > > RAX: ffffffff822124a3 RBX: 1ffff1100e7ae034 RCX: ffff88807cf53c00
> > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > > RBP: 0000000000000000 R08: ffffffff82212423 R09: 1ffff1100f8ba8ee
> > > R10: dffffc0000000000 R11: ffffed100f8ba8ef R12: ffff888073d701a0
> > > R13: 1ffff1100e79f5c4 R14: ffff888073d70158 R15: dffffc0000000000
> > > FS:  0000555558d1e480(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000555558d37878 CR3: 000000007d920000 CR4: 00000000003526f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > >  <TASK>
> > >  nilfs_rmdir+0x1b0/0x250 fs/nilfs2/namei.c:342
> > >  vfs_rmdir+0x3a3/0x510 fs/namei.c:4394
> > >  do_rmdir+0x3b5/0x580 fs/namei.c:4453
> > >  __do_sys_rmdir fs/namei.c:4472 [inline]
> > >  __se_sys_rmdir fs/namei.c:4470 [inline]
> > >  __x64_sys_rmdir+0x47/0x50 fs/namei.c:4470
> > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > >  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
> > >  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > >
> > > Reported-and-tested-by: syzbot+9260555647a5132edd48@...kaller.appspotmail.com
> > > Closes: https://syzkaller.appspot.com/bug?extid=9260555647a5132edd48
> > > Signed-off-by: Edward Adam Davis <eadavis@...com>
> > > ---
> > >  fs/nilfs2/inode.c | 9 ++++++++-
> > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
> > > index cf9ba481ae37..254a5e46f8ea 100644
> > > --- a/fs/nilfs2/inode.c
> > > +++ b/fs/nilfs2/inode.c
> > > @@ -544,8 +544,15 @@ struct inode *nilfs_iget(struct super_block *sb, struct nilfs_root *root,
> > >         inode = nilfs_iget_locked(sb, root, ino);
> > >         if (unlikely(!inode))
> > >                 return ERR_PTR(-ENOMEM);
> > > -       if (!(inode->i_state & I_NEW))
> > > +
> > > +       if (!(inode->i_state & I_NEW)) {
> > > +               if (!inode->i_size && !inode->i_nlink) {
> > > +                       make_bad_inode(inode);
> > > +                       iput(inode);
> > > +                       return ERR_PTR(-EIO);
> > > +               }
> > >                 return inode;
> > > +       }
> > >
> > >         err = __nilfs_read_inode(sb, root, ino, inode);
> > >         if (unlikely(err)) {
> > > --
> > > 2.47.0
> >
> > Thank you Edward.
> >
> > This fix seems good except for the i_size check, but I think we need
> > to look into what's going on a bit more.
> >
> > I was unable to work for a while due to machine trouble, so I'd like
> > to know if you have made any progress on your investigation.
> >
> > First, is this caused by a corrupted filesystem image? Or is it that
> > the directories or files with the same inode number were generated
> > during the namespace operations (due to a timing issue or something),
> > and could this problem occur even if the original filesystem image is
> > normal?
> According to the log when I reproduced the problem, I analyzed that the
> problem occurred like this:
>
>                 CPU0                                    CPU1
>                 ====                                    ====
>                 nilfs_mkdir      // file0
>                 nilfs_new_inode  // ino is 11
>                 mount            // mount file0
>                                                         umount       // .nilfs, ino is 11
>                                                         nilfs_rmdir  // ino is 11, i_size = 0, i_nlink = 0
>                                                         umount       // file0, ino is 11
>                                                         nilfs_rmdir  // ino 11, i_size = 0, i_nlink = 0, trigger warning
>

Thank you for explaining the situation.  I understand what's going on.

In the end, going back to my question, this is caused by file system corruption.
Because the inode bitmap is corrupted, an inode with an inode number
that should exist as a ".nilfs" file was reassigned by nilfs_mkdir for
"file0", causing an inode duplication during execution.
And this causes an underflow of i_nlink in rmdir operations.

After considering the issue, I came to the conclusion that although
there is an approach that strengthens checks for bitmap
inconsistencies and inode type inconsistencies to detect errors in
advance, these approaches are difficult to fundamentally solve.

This is caused by a corrupted inode bitmap, but checking is difficult
in the first place when there are multiple directories with the same
inode number in the name space.

Therefore, the appropriate solution is to either check for i_nlink
underflow during rmdir, or to detect i_nlink == 0 in the call path of
nilfs_lookup() --> nilfs_iget(), and I think the latter approach is
better, as you have chosen.

> >
> > When I mounted the mount_0 image as read-only, the filesystem looked
> > normal without such inode duplication.
> >
> > At least, nilfs_read_inode_common(), which reads inodes from block
> > devices, is implemented to return an error with -ESTALE if i_nlink ==
> > 0.  So it seems that nilfs_iget() picked up this inode with i_nlilnk
> > == 0 because it hit an inode being deleted in the inode cache.  Why is
> > that happening?
> Are you talking about the following call trace?
> If so, then because the value of inode->i_state is I_DIRTY (set in nilfs_mkdir)
> it will not enter __nilfs_read_inode().
>
> nilfs_iget()->
>   __nilfs_read_inode()->
>     nilfs_read_inode_common()

Yes, in this case __nifls_read_inode() is not called.  As mentioned
above, the conclusion is that i_nlink abnormalities can occur in other
FS corruption patterns such as inode bitmap corruption and directory
inconsistency.

> >
> > Also, why do you put the i_size check as an AND condition?
> i_size will set to 0 in nilfs_rmdir(), so check it too.
> > i_size is independent of i_nlink and the inode lifecycles.  If i_size
> > is also broken, this check will not work properly.
> > If something is not working and you have included it as a workaround,
> > I would like to know about it.

To fix the problem, please modify the patch as follows:

(1) In nilfs_iget(), if the inode without I_NEW obtained by
nilfs_iget_locked() has i_nlink == 0, call iput() and return
ERR_PTR(-ESTALE).  Do not call make_bad_inode(). Also do not check for
i_size == 0 (it is not a good idea to check this to identify the case
of rmdir).

(2) In nilfs_lookup(), if the return value of nilfs_iget() is
ERR_PTR(-ESTALE), output a message with nilfs_error() indicating that
file system corruption has been detected , and returns ERR_PTR(-EIO).
Please refer to ext2_lookup() for the concrete implementation.

The reason for not calling make_bad_inode() is to prevent interference
from the side when i_nlink is set to 0 and nilfs_evict() is trying to
delete the inode. The VFS layer guarantees that inode acquisition and
evict are mutually exclusive, but it is possible to grab it before
evict(), and the result of interference in that case is unpredictable.
Instead, as mentioned above, I think it is safer to call nilfs_error()
at the nilfs_lookup() level, where namespace operations are involved.

Also, could you please explain in the commit log that inode
duplication has occurred due to file system corruption (in this case,
inode bitmap corruption), and that when inode duplication occurs due
to file system inconsistencies like this, a link count underflow can
occur during an rmdir operation, so that a link count check is
necessary at runtime?

Thank you in advance.

Ryusuke Konishi