linux-kernel - Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20191225125448.GA309148@chrisdown.name>
Date:   Wed, 25 Dec 2019 12:54:48 +0000
From:   Chris Down <chris@...isdown.name>
To:     Amir Goldstein <amir73il@...il.com>
Cc:     Matthew Wilcox <willy@...radead.org>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        Jeff Layton <jlayton@...nel.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Tejun Heo <tj@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>, kernel-team@...com,
        Hugh Dickins <hughd@...gle.com>,
        Miklos Szeredi <miklos@...redi.hu>,
        "zhengbin (A)" <zhengbin13@...wei.com>,
        Roman Gushchin <guro@...com>
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when
 ino_t is 64 bit

Amir Goldstein writes:
>> The slab i_ino recycling approach works somewhat, but is unfortunately neutered
>> quite a lot by the fact that slab recycling is per-memcg. That is, replacing
>> with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
>> callsites only leads to about 10% slab reuse, which doesn't really stem the
>> bleeding of 32-bit inums on an affected workload:
>>
>>      # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>>          4454 recycle_or_get_next_ino: not recycled
>>           546 recycle_or_get_next_ino: recycled
>>
>
>Too bad..
>Maybe recycled ino should be implemented all the same because it is simple
>and may improve workloads that are not so MEMCG intensive.

Yeah, I agree. I'll send the full patch over separately (ie. not as v2 for 
this) since it's not a total solution for the problem, but still helps somewhat 
and we all seem to agree that it's overall an uncontroversial improvement.

>> Roman (who I've just added to cc) tells me that currently we only have
>> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
>> contributes fairly significantly here since there are multiple tasks across
>> multiple cgroups which are contributing to the get_next_ino() thrash.
>>
>> I think this is a good start, but we need something of a different magnitude in
>> order to actually solve this problem with the current slab infrastructure. How
>> about something like the following?
>>
>> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
>> 2. Use get_next_ino_full in tmpfs (et al.)
>
>I would prefer that filesystems making heavy use of get_next_ino, be converted
>to use a private ino pool per sb:
>
>ino_pool_create()
>ino_pool_get_next()
>
>flags to ino_pool_create() can determine the desired ino range.
>Does the Facebook use case involve a single large tmpfs or many
>small ones? I would guess the latter and therefore we are trying to solve
>a problem that nobody really needs to solve (i.e. global efficient ino pool).

Unfortunately in the case under discussion, it's all in one large tmpfs in 
/dev/shm. I can empathise with that -- application owners often prefer to use 
the mounts provided to them rather than having to set up their own. For this 
one case we can change that, but I think it seems reasonable to support this 
case since using a single tmpfs can be a reasonable decision as an application 
developer, especially if you only have unprivileged access to the system.

>> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
>>     pass if they want the 32-bit inode numbers back. This would still allow
>>     people who want to make this tradeoff to use xino.
>
>inode32|inode64 (see man xfs(5)).

Ah great, thanks! I'll reuse precedent from those.

>> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>>
>
>I Don't know about disable, but the default mode for tmpfs (inode32|inode64)
>might me best determined by CONFIG option, so distro builders could decide
>if they want to take the risk of breaking applications on tmpfs.

Sounds good.

>But if you implement per sb ino pool, maybe inode64 will no longer be
>required for your use case?

In this case I think per-sb ino pool will help a bit, but unfortunately not by 
an order of magnitude. As with the recycling patch this will help reduce thrash 
a bit but not conclusively prevent the problem from happening long-term. To fix 
that, I think we really do need the option to use ino_t-sized get_next_ino_full 
(or per-sb equivalent).

Happy holidays, and thanks for your feedback!

Chris