Message-ID: <ffp7bvnaa3qxjdc54gj3tlhgryctyguzzcax7kqnh7tumotqet@4rjsmb2zos5i>
Date: Wed, 28 Feb 2024 15:50:08 +0000
From: Daniel Gomez <da.gomez@...sung.com>
To: Jan Kara <jack@...e.cz>
CC: Hugh Dickins <hughd@...gle.com>, "viro@...iv.linux.org.uk"
	<viro@...iv.linux.org.uk>, "brauner@...nel.org" <brauner@...nel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>, "dagmcr@...il.com"
	<dagmcr@...il.com>, "linux-fsdevel@...r.kernel.org"
	<linux-fsdevel@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
	"willy@...radead.org" <willy@...radead.org>, "hch@...radead.org"
	<hch@...radead.org>, "mcgrof@...nel.org" <mcgrof@...nel.org>, Pankaj Raghav
	<p.raghav@...sung.com>, "gost.dev@...sung.com" <gost.dev@...sung.com>
Subject: Re: [RFC PATCH 0/9] shmem: fix llseek in hugepages

On Tue, Feb 27, 2024 at 11:42:01AM +0000, Daniel Gomez wrote:
> On Tue, Feb 20, 2024 at 01:39:05PM +0100, Jan Kara wrote:
> > On Tue 20-02-24 10:26:48, Daniel Gomez wrote:
> > > On Mon, Feb 19, 2024 at 02:15:47AM -0800, Hugh Dickins wrote:
> > > I'm uncertain when we may want to be more elastic. In the case of XFS with iomap
> > > and support for large folios, for instance, we are 'less' elastic than here. So,
> > > what exactly is the rationale behind wanting shmem to be 'more elastic'?
> > 
> > Well, but if you allocate space in larger chunks - as is the case with
> > ext4's bigalloc feature - you will be similarly 'elastic' to tmpfs with
> > large folio support... So it is simply the granularity of allocation of the
> > underlying space that matters here. And for tmpfs the underlying space
> > happens to be the page cache.
> 
> But it seems like the underlying space 'behaves' differently when we talk about
> large folios and huge pages. Is that correct? And this is reflected in fstat's
> st_blksize: the first is always based on the host base page size, regardless of
> the order we get, while the second is always based on the configured host huge
> page size (at the moment I've tested 2 MiB and 1 GiB for x86-64, and 2 MiB,
> 512 MiB and 16 GiB for ARM64).

Apologies, I was mixing up the values available in HugeTLB with those supported
by THP (PMD-size only). Thus, it is 2 MiB for x86-64, and 2 MiB, 32 MiB and
512 MiB for ARM64 with 4k, 16k and 64k base page sizes, respectively.
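
For reference, this is roughly how I'm checking st_blksize. Just a minimal
sketch, assuming a file on a tmpfs mount; the path and mount options are
hypothetical examples, not taken from any of the series:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
        /* Hypothetical path: a file on a tmpfs mount (e.g. huge=always). */
        int fd = open("/mnt/tmpfs/testfile", O_RDWR | O_CREAT, 0644);
        struct stat st;

        if (fd < 0 || fstat(fd, &st) < 0) {
                perror("open/fstat");
                return 1;
        }

        /*
         * With huge pages in use on the mount, st_blksize reports the PMD
         * size (e.g. 2 MiB on x86-64); otherwise the host base page size.
         */
        printf("st_blksize = %ld\n", (long)st.st_blksize);
        close(fd);
        return 0;
}

Mounting the same tmpfs with huge=always vs. huge=never is enough to see the
two behaviours described above.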

> 
> If that is the case, I'd agree this is not needed for huge pages but only when
> we adopt large folios. Otherwise, we won't have a way to determine the step/
> granularity for seeking data/holes, as it could be anything from order-0 to
> order-9. Note: order-1 support is currently in the LBS v1 thread here [1].
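
To make the seek step issue concrete, this is the kind of probe I have in mind.
A minimal sketch only; the path and sizes are hypothetical examples:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        /* Hypothetical example: a sparse file on tmpfs, sizes are arbitrary. */
        int fd = open("/mnt/tmpfs/sparse", O_RDWR | O_CREAT | O_TRUNC, 0644);
        char buf[4096] = { 0 };

        if (fd < 0)
                return 1;
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf) ||
            ftruncate(fd, 1 << 20) != 0) {
                perror("write/ftruncate");
                return 1;
        }

        /*
         * Where the first hole starts depends on the size of the folio
         * backing offset 0: anything from one base page (order-0) up to a
         * PMD-sized folio (order-9), which is the unpredictable step
         * discussed above.
         */
        printf("first hole at %lld\n", (long long)lseek(fd, 0, SEEK_HOLE));
        close(fd);
        return 0;
}

With block tracking, the reported offset could reflect what was actually
written rather than the order of the folio we happened to allocate.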
> 
> Regarding large folio adoption, we have the following implementations [2]
> already sent to the mailing list. Would it make sense, then, to have this block
> tracking for the large folio case? Note that my last attempt includes a partial
> implementation of the block tracking discussed here.
> 
> [1] https://lore.kernel.org/all/20240226094936.2677493-2-kernel@pankajraghav.com/
> 
> [2] shmem: high order folios support in write path
> v1: https://lore.kernel.org/all/20230915095042.1320180-1-da.gomez@samsung.com/
> v2: https://lore.kernel.org/all/20230919135536.2165715-1-da.gomez@samsung.com/
> v3 (RFC): https://lore.kernel.org/all/20231028211518.3424020-1-da.gomez@samsung.com/
> 
> > 
> > > If we ever move shmem to large folios [1], and we use them in an opportunistic
> > > way, then we are going to be more elastic in the default path.
> > > 
> > > [1] https://lore.kernel.org/all/20230919135536.2165715-1-da.gomez@samsung.com
> > > 
> > > In addition, I think that having this block granularity can benefit quota
> > > support and the reclaim path. For example, in the generic/100 fstest, around
> > > 26M of data is reported as 1G of used disk when using tmpfs with huge pages.
> > 
> > And I'd argue this is a desirable thing. If 1G worth of pages is attached
> > to the inode, then quota should be accounting 1G usage even though you've
> > written just 26MB of data to the file. Quota is about constraining used
> > resources, not about "how much did I write to the file".
> 
> But these are two separate values. I get that the system wants to track how many
> pages are attached to the inode, so is there a way to also report how much of
> those pages is actually used?
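
To illustrate the two values I mean, something along these lines (again just a
sketch; the file is assumed to live on a tmpfs mount with huge=always):

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        struct stat st;

        /* Hypothetical usage: pass a file on a tmpfs mount (huge=always). */
        if (argc < 2 || stat(argv[1], &st) < 0)
                return 1;

        /*
         * st_blocks reflects the space attached to the inode (roughly what
         * quota/df end up accounting, e.g. rounded up to whole huge pages),
         * while st_size is only the apparent file size.
         */
        printf("apparent size: %lld bytes\n", (long long)st.st_size);
        printf("allocated:     %lld bytes\n", (long long)st.st_blocks * 512);
        return 0;
}

With ~26M written, the second number is the one that grows to 1G in the
generic/100 case mentioned above.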
> 
> > 
> > 								Honza
> > -- 
> > Jan Kara <jack@...e.com>
> > SUSE Labs, CR
