Message-ID: <CAOQ4uxj8NVwrCTswut+icF2t1-7gtW_cmyuGO7WUWdNZLHOBYA@mail.gmail.com>
Date: Thu, 26 Dec 2019 20:04:00 +0200
From: Amir Goldstein <amir73il@...il.com>
To: Chris Down <chris@...isdown.name>
Cc: linux-fsdevel <linux-fsdevel@...r.kernel.org>,
Al Viro <viro@...iv.linux.org.uk>,
Matthew Wilcox <willy@...radead.org>,
Jeff Layton <jlayton@...nel.org>,
Johannes Weiner <hannes@...xchg.org>,
Tejun Heo <tj@...nel.org>,
linux-kernel <linux-kernel@...r.kernel.org>, kernel-team@...com,
"zhengbin (A)" <zhengbin13@...wei.com>
Subject: Re: [PATCH] fs: inode: Recycle inodenum from volatile inode slabs
On Thu, Dec 26, 2019 at 5:48 PM Chris Down <chris@...isdown.name> wrote:
>
> In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.
> On affected tiers, in excess of 10% of hosts show multiple files with
> different content and the same inode number, with some servers even
> having as many as 150 duplicated inode numbers with differing file
> content.
>
> This causes actual, tangible problems in production. For example, we
> have complaints from those working on remote caches that their
> application reports cache corruption because it uses (device,
> inodenum) to establish the identity of a particular cache object, and
> since that pair is no longer unique, the application refuses to
> continue. Even worse, applications may sometimes not detect the
> collision at all and continue anyway, causing phantom and
> hard-to-debug behaviour.
>
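To make the failure mode concrete, the identity key such a cache would
derive from stat(2) looks roughly like the snippet below (illustration
only, not anyone's actual code):

#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/* Cache object identity of the kind described above: (device, inode). */
struct cache_key {
	dev_t dev;
	ino_t ino;
};

static int cache_key_for(const char *path, struct cache_key *key)
{
	struct stat st;

	if (stat(path, &st) != 0) {
		perror("stat");
		return -1;
	}
	key->dev = st.st_dev;
	key->ino = st.st_ino;
	return 0;
}

int main(int argc, char **argv)
{
	struct cache_key a, b;

	if (argc != 3 || cache_key_for(argv[1], &a) || cache_key_for(argv[2], &b))
		return 1;
	if (a.dev == b.dev && a.ino == b.ino)
		printf("same identity: dev=%ju ino=%ju\n",
		       (uintmax_t)a.dev, (uintmax_t)a.ino);
	return 0;
}

Once two different live files on the same device report the same st_ino,
this key stops naming a single object, which is exactly the corruption the
cache layer reports.
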
> In general, userspace applications expect (device, inodenum) to be
> enough to uniquely identify one inode, which seems fair enough.
> One might also need to check the generation, but in this case:
>
> 1. That's not currently exposed to userspace
> (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY);
> 2. Even with generation, there shouldn't be two live inodes with the
> same inode number on one device.
>
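On point 1 above, the missing generation is easy to demonstrate from
userspace (illustration only); on tmpfs the ioctl fails with ENOTTY:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	int fd, generation;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	if (ioctl(fd, FS_IOC_GETVERSION, &generation) != 0) {
		/* tmpfs does not implement this, so the call fails with ENOTTY. */
		perror("FS_IOC_GETVERSION");
		return errno == ENOTTY ? 2 : 1;
	}
	printf("i_generation: %d\n", generation);
	return 0;
}
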
> In order to fix this, we reuse inode numbers from recycled slabs where
> possible, allowing us to significantly reduce the risk of 32-bit
> wraparound.
>
> There are probably some other potential users of this, like some FUSE
> internals and {proc,sys,kern}fs-style APIs, but doing a general opt-out
> codemod requires some thought: depending on the particular callsites and
> how far up the stack they are, we might end up recycling an i_ino value
> that actually does have some semantic meaning. As such, to start with,
> this patch only opts in a few get_next_ino-heavy filesystems, those
> which looked straightforward and unlikely to hit corner cases:
>
> - bpffs
> - configfs
> - debugfs
> - efivarfs
> - hugetlbfs
> - ramfs
> - tmpfs
>
I'm confused about this list.

I suggested converting tmpfs and hugetlbfs because they use a private
inode cache pool, so you can know for sure that a recycled i_ino
was allocated by get_next_ino().
If I am not mistaken, the other filesystems above use the common inode_cache
pool, so when you recycle an i_ino from that pool you don't know where it
came from and cannot trust its uniqueness in the get_next_ino() domain.

Even if *all* filesystems that currently use the common inode_cache used
get_next_ino() exclusively to allocate inode numbers, that could change
in the future.
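
For readers not staring at fs/inode.c every day, the distinction is roughly
the following (simplified and paraphrased from memory, not a verbatim quote
of mainline):

/* tmpfs: inodes come from shmem's own slab cache, so whatever is found
 * in a recycled object was put there by shmem itself. */
static struct inode *shmem_alloc_inode(struct super_block *sb)
{
	struct shmem_inode_info *info;

	info = kmem_cache_alloc(shmem_inode_cachep, GFP_KERNEL);
	if (!info)
		return NULL;
	return &info->vfs_inode;
}

/* VFS fallback: filesystems with no ->alloc_inode of their own (ramfs,
 * debugfs, ...) share the global inode_cachep, so a recycled object may
 * have last belonged to any filesystem and any ino allocator. */
static struct inode *alloc_inode(struct super_block *sb)
{
	const struct super_operations *ops = sb->s_op;
	struct inode *inode;

	if (ops->alloc_inode)
		inode = ops->alloc_inode(sb);
	else
		inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);
	/* ... inode_init_always() etc. ... */
	return inode;
}
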
I'd go even further and say that introducing a generic helper for this sort
of thing is asking for trouble. It is best to keep the recycling logic well within
the bounds of the specific filesystem driver, which owns the
private inode cache and is responsible for allocating inode numbers in
that pool.
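
If tmpfs really wants the optimization, a version that never leaves
mm/shmem.c could look very roughly like the sketch below; recycled_ino,
shmem_save_ino() and shmem_next_ino() are made-up names for illustration,
not anything in mainline or in the patch:

/* Hypothetical shmem-local recycler, only a sketch. */
struct shmem_inode_info {
	unsigned long	recycled_ino;	/* zeroed once by the slab constructor */
	/* ... existing fields ... */
	struct inode	vfs_inode;
};

/* Called from the eviction path: remember the number this slab object
 * carried, so it can be handed out again if the object is recycled. */
static void shmem_save_ino(struct inode *inode)
{
	SHMEM_I(inode)->recycled_ino = inode->i_ino;
}

/* Called from shmem_get_inode() instead of a bare get_next_ino().
 * Anything stored in recycled_ino was itself produced by get_next_ino()
 * for a shmem inode, so reuse stays within that namespace. */
static unsigned long shmem_next_ino(struct shmem_inode_info *info)
{
	return info->recycled_ino ? info->recycled_ino : get_next_ino();
}

That keeps the invariant ("every number in this pool came from
get_next_ino()") where it can actually be enforced.
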
Thanks and happy holidays,
Amir.