Message-ID: <D51IU4N746MI.FDS6C7GYO4RP@samsung.com>
Date: Mon, 21 Oct 2024 15:34:03 +0200
From: Daniel Gomez <da.gomez@...sung.com>
To: "Kirill A. Shutemov" <kirill@...temov.name>, Baolin Wang
	<baolin.wang@...ux.alibaba.com>
CC: Matthew Wilcox <willy@...radead.org>, <akpm@...ux-foundation.org>,
	<hughd@...gle.com>, <david@...hat.com>, <wangkefeng.wang@...wei.com>,
	<21cnbao@...il.com>, <ryan.roberts@....com>, <ioworker0@...il.com>,
	<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>, "Kirill A . Shutemov"
	<kirill.shutemov@...ux.intel.com>
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs

On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>> 
>> 
>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>> > On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>> > > + Kirill
>> > > 
>> > > On 2024/10/16 22:06, Matthew Wilcox wrote:
>> > > > On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>> > > > > Considering that tmpfs already has the 'huge=' option to control the THP
>> > > > > allocation, it is necessary to maintain compatibility with the 'huge='
>> > > > > option, as well as considering the 'deny' and 'force' option controlled
>> > > > > by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>> > > > 
>> > > > No, it's not.  No other filesystem honours these settings.  tmpfs would
>> > > > not have had these settings if it were written today.  It should simply
>> > > > ignore them, the way that NFS ignores the "intr" mount option now that
>> > > > we have a better solution to the original problem.
>> > > > 
>> > > > To reiterate my position:
>> > > > 
>> > > >    - When using tmpfs as a filesystem, it should behave like other
>> > > >      filesystems.
>> > > >    - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>> > > >      behave like anonymous memory.
>> > > 
>> > > I do agree with your point to some extent, but the ‘huge=’ option has
>> > > existed for nearly 8 years, and the huge orders based on write size may not
>> > > achieve the performance of PMD-sized THP in some scenarios, such as when the
>> > > write length is consistently 4K. So, I am still concerned that ignoring the
>> > > 'huge' option could lead to compatibility issues.
>> > 
>> > Yeah, I don't think we are there yet to ignore the mount option.
>> 
>> OK.
>> 
>> > Maybe we need to get a new generic interface to request the semantics
>> > tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>> > handles to make kernel allocate PMD-size folio on any allocation or on
>> > allocations within i_size. I think this behaviour is useful beyond tmpfs.
>> > 
>> > Then huge= implementation for tmpfs can be re-defined to set these
>> > per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>> > with current deployments and less special comparing to rest of
>> > filesystems on kernel side.
>> 
>> I did a quick search, and I didn't find any other fs that require PMD-sized
>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>> tmpfs. Please correct me if I missed something.
>
> What do you mean by "require"? THPs are always opportunistic.
>
> IIUC, we don't have a way to hint kernel to use huge pages for a file on
> read from backing storage. Readahead is not always the right way.
>
>> > If huge= is not set, tmpfs would behave the same way as the rest of
>> > filesystems.
>> 
>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>> folios based on the write size? If yes, that means it will change the
>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>> mentioned:
>> "Another possible choice is to make the huge pages allocation based on write
>> size as the *default* behavior for tmpfs, ..."
>
> I am more worried about breaking existing users of huge pages. So changing
> behaviour of users who don't specify huge is okay to me.

I think moving tmpfs to allocate large folios opportunistically by
default (as was initially proposed) doesn't necessarily conflict with
the default behaviour (huge=never). We just need to clarify that in
the documentation.

However, IIRC, one of the requests from Hugh was to have a way to
disable large folios, which is something other filesystems have no
control over today. Ryan sent a proposal to control that globally,
but I think it didn't move forward. So, what are we missing to go back
to implementing large folios in tmpfs in the default case, like any
other fs leveraging large folios?
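
For reference, the global policy mentioned earlier in the thread
('/sys/kernel/mm/transparent_hugepage/shmem_enabled') can be inspected
from userspace. Below is a minimal sketch, assuming the usual sysfs
convention where the active value is the one in square brackets (e.g.
"always within_size advise [never] deny force"). The helper names
parse_active_policy and read_shmem_enabled are hypothetical, chosen only
for illustration; they are not part of any kernel or library API.

```python
import re
from pathlib import Path

# Path taken from the thread; it may be absent on kernels built without THP.
SHMEM_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/shmem_enabled")


def parse_active_policy(text):
    """Return the bracketed (active) token from a sysfs option list, or None.

    Hypothetical helper: assumes the active option is wrapped in [ ].
    """
    match = re.search(r"\[([^\]\s]+)\]", text)
    return match.group(1) if match else None


def read_shmem_enabled(path=SHMEM_ENABLED):
    """Read the sysfs file and return the active policy, or None if unreadable."""
    try:
        return parse_active_policy(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    # Prints the active policy name, or None if the file cannot be read.
    print(read_shmem_enabled())
```

This only reads the global knob; it says nothing about the per-mount
'huge=' option, which would have to be checked separately in the mount
options of each tmpfs instance.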
