Message-ID: <ec5d4e52-658b-4fdc-b7f9-f844ab29665c@redhat.com>
Date: Wed, 2 Jul 2025 10:45:16 +0200
From: David Hildenbrand <david@...hat.com>
To: Baolin Wang <baolin.wang@...ux.alibaba.com>, akpm@...ux-foundation.org,
hughd@...gle.com
Cc: ziy@...dia.com, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com,
npache@...hat.com, ryan.roberts@....com, dev.jain@....com,
baohua@...nel.org, vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com,
mhocko@...e.com, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: support large mapping building for tmpfs
>> Hm, are we sure about that?
>
> IMO, referring to the definition of RSS:
> "resident set size (RSS) is the portion of memory (measured in
> kilobytes) occupied by a process that is held in main memory (RAM). "
>
> Seems we should report the whole large folio already in file to users.
> Moreover, the tmpfs mount already adds the 'huge=always (or within)'
> option to allocate large folios, so the increase in RSS seems also expected?

Well, traditionally we only account what is actually mapped. If you
MADV_DONTNEED part of the large folio, or only mmap() parts of it,
the RSS would never cover the whole folio -- only what is mapped.

I discuss part of that in:

commit 749492229e3bd6222dda7267b8244135229d1fd8
Author: David Hildenbrand <david@...hat.com>
Date:   Mon Mar 3 17:30:13 2025 +0100

    mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)

And how my changes there affect some system stats (e.g., "AnonPages",
"Mapped"). But the RSS stays unchanged and corresponds to what is actually
mapped into the process.

Doing something similar for the RSS would be extremely hard (single page
mapped into process -> account whole folio to RSS), because it's
per-folio-per-process information, not per-folio information.

So by mapping more in a single page fault, you end up increasing "RSS". But
I wouldn't call that "expected". I rather suspect that nobody will really
care :)
>
>> Also, how does fault_around_bytes interact
>> here?
>
> The ‘fault_around’ is a bit tricky. Currently, 'fault_around' only
> applies to read faults (via do_read_fault()) and does not control write
> shared faults (via do_shared_fault()). Additionally, in the
> do_shared_fault() function, PMD-sized large folios are also not
> controlled by 'fault_around', so I just follow the handling of PMD-sized
> large folios.
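
(For context, the read vs. shared fault dispatch looks roughly like the
following -- a heavily simplified sketch of mm/memory.c, details elided:)

static vm_fault_t do_fault(struct vm_fault *vmf)
{
	vm_fault_t ret;

	if (!(vmf->flags & FAULT_FLAG_WRITE))
		ret = do_read_fault(vmf);	/* may use fault-around */
	else if (!(vmf->vma->vm_flags & VM_SHARED))
		ret = do_cow_fault(vmf);
	else
		ret = do_shared_fault(vmf);	/* never uses fault-around */
	return ret;
}

do_read_fault() is the only path that consults fault_around_bytes (via
do_fault_around()); do_shared_fault() goes straight to __do_fault() and
finish_fault().
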
>
>>> In order to support large mappings for tmpfs, besides checking VMA limits
>>> and PMD pagetable limits, it is also necessary to check if the linear page
>>> offset of the VMA is order-aligned within the file.
>>
>> Why?
>>
>> This only applies to PMD mappings. See below.
>
> I previously had the same question, but I saw the comments for the
> ‘thp_vma_suitable_order’ function, so I added the check here. If it's
> not necessary to check non-PMD-sized large folios, should we update the
> comments for 'thp_vma_suitable_order'?

I was not quite clear about PMD vs. !PMD.

The thing is, when you *allocate* a new folio, it must adhere at least to
pagecache alignment (e.g., cannot place an order-2 folio at pgoff 1) -- that
is what thp_vma_suitable_order() checks. Otherwise you cannot add it to the
pagecache.
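
IOW, the invariant is simply that a folio's page index in the file is
aligned to its order. Something like the following (hypothetical helper,
just to illustrate the rule, not code from the patch):

/*
 * An order-N pagecache folio may only start at a page index that is a
 * multiple of 1 << N: an order-2 folio can sit at index 0, 4, 8, ...,
 * but never at index 1.
 */
static inline bool folio_index_suitable(pgoff_t index, unsigned int order)
{
	return IS_ALIGNED(index, 1UL << order);
}
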
But once you *obtain* a folio from the pagecache and are supposed to map it
into the page tables, that must already hold true.

So you should be able to just blindly map whatever is given to you here
AFAIKS.

If you would get a pagecache folio that violates the linear page offset
requirement at that point, something else would have messed up the
pagecache.

Or am I missing something?
--
Cheers,
David / dhildenb