linux-kernel - Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110228175734.GC20805@quack.suse.cz>
Date:	Mon, 28 Feb 2011 18:57:34 +0100
From:	Jan Kara <jack@...e.cz>
To:	Josh Hunt <johunt@...mai.com>
Cc:	Jan Kara <jack@...e.cz>, Al Viro <viro@...IV.linux.org.uk>,
	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	"sandeen@...hat.com" <sandeen@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>
Subject: Re: [RFC][PATCH] ext2: Resolve i_nlink race in ext2_rename

On Thu 24-02-11 12:18:32, Josh Hunt wrote:
> On 02/24/2011 03:20 AM, Jan Kara wrote:
> > On Thu 24-02-11 06:37:49, Al Viro wrote:
> >> On Wed, Feb 23, 2011 at 10:21:41PM -0800, Josh Hunt wrote:
> >>> [resending: left Jan off the original mail by accident]
> >>>
> >>> We have a multi-threaded workload which is currently "losing" files in the form
> >>> of unattached inodes. The workload is link, rename, unlink intensive. This is
> >>> happening on an ext2 filesystem and have reproduced the issue in kernel
> >>> 2.6.37.  Here's a sample strace:
> >>>
> >>>     open("/a/tmp/tmpfile.1296184058", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 9
> >>>     link("/a/tmp/tmpfile.1296184058", "/a/tmp/tmpfile.28117.1296184059") = 0
> >>>     rename("/a/tmp/tmpfile.28117.1296184059", "/a/tmp/tmpfile") = 0
> >>>     stat64("/a/tmp/tmpfile", {st_mode=S_IFREG|0644, st_size=24248267, ...}) = 0
> >>>     link("/a/tmp/tmpfile", "/a/tmp/submit/tmpfile")        = 0
> >>>     open("/a/tmp/tmpfile.1296184058", O_RDONLY) = 13
> >>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDWR|O_CREAT|O_EXCL, 0600) = 824
> >>>     rename("/a/tmp/submit/tmpfile", "/a/tmp/submit/tmpfile.send.q9SNoL")                = 0
> >>>     unlink("/a/tmp/tmpfile.1296184058") = 0
> >>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 827
> >>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 828
> >>>     open("/a/tmp/submit/tmpfile.send.q9SNoL", O_RDONLY) = 829
> >>>     unlink("/a/tmp/submit/tmpfile.send.q9SNoL") = 0
> >>>
> >>> The application behavior shown above repeats indefinitely with most filenames
> >>> changing during each iteration except for 'tmpfile'. Looking into this issue I
> >>> see that vfs_rename_other() only takes i_mutex for the new inode and the new
> >>> inode's directory as well as the old directory's mutex. This works for
> >>> modifying the dir entry and appears to be fine for most filesystems, but
> >>> ext2 and a few others (exofs, minix, nilfs2, omfs, sysv, ufs) modify i_nlink
> >>> inside of their respective rename functions without grabbing the i_mutex. The
> >>> modifications are done through calls to inode_inc_link_count(old_inode) and
> >>> inode_dec_link_count(old_inode), etc.
> >>>
> >>> Taking the mutex for the old inode appears to resolve the issue of the
> >>> lost files/unattached inodes that I am seeing with this workload. I've attached
> >>> a patch below doing what I've described above. If this is an accepted solution
> >>> I believe other filesystems may also be affected by this and I could provide
> >>> a patch for those as well.
> >>
> >> I don't know...  The thing is, we mostly do that to make life easier for
> >> fsck in case of crash.  Other than that, there's no reason to play with
> >> link count of that sucker at all.  The question is, do we really want
> >> such rename() interrupted by dirty shutdown to result in what looks like two
> >> legitimate links to that inode without any indications of what had happened?
> >> Note that fsck (at least on ext2) will correct link counts anyway and if
> >> nothing else, we probably want some noise pointing to the inode in question...
> >   Yeah, I agree that playing with the link count is not worth it. It is
> > even more disputable because it would have some reasonable effect only if
> > we happened to write out the moved inode after it is linked to the new
> > directory and before it is unlinked from the old one. Moreover we'd need
> > to writeout the new directory and not the old directory before crash
> > happens. All this is highly unlikely and even if that happens, it is
> > questionable whether the result is worth it. So I'll just do away with
> > those games with link count...
> >   The patch is attached. Josh, can you test it as well? Thanks.
> > 
> > 								Honza
> Jan
> 
> I'm not seeing the problem with your patch as was expected since we're
> not messing with i_nlink anymore. Al suggested marking the inode as
> dirty where we were previously doing the old_inode dec. I believe this
> is needed as well since we are updating it's ctime. I've attached a
> version marking the inode dirty and it also fixes the comment making
> reference to calling inode_dec_link_count().
  Yeah, good catch. Thanks.

> I'm not completely clear on the historical reasons for messing with the
> link count of old_inode in the first place. It was just to simulate the
> linking and unlinking of the old_inode?
  Yes.

So I took your patch and used a changelog from mine as I find it more
descriptive. The resulting patch is in my tree (and attached).

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR

View attachment "0001-ext2-Fix-link-count-corruption-under-heavy-link-rena.patch" of type "text/x-patch" (2369 bytes)