Message-ID: <17807dce-0894-08b9-d53a-55e3696d6c08@oracle.com>
Date: Tue, 14 Jun 2022 11:33:01 +0100
From: Joao Martins <joao.m.martins@...cle.com>
To: Muchun Song <songmuchun@...edance.com>,
Miaohe Lin <linmiaohe@...wei.com>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] mm/page_alloc: minor clean up for
memmap_init_compound()
[was out the past couple of days, hence the late response]
On 6/12/22 16:44, Muchun Song wrote:
> On Sat, Jun 11, 2022 at 10:13:52AM +0800, Miaohe Lin wrote:
>> Since commit 5232c63f46fd ("mm: Make compound_pincount always available"),
>> compound_pincount_ptr is now stored in the first tail page. So we should
>> call prep_compound_head() after the first tail page is initialized, to take
>> advantage of the likelihood of that tail struct page being cached, given
>> that we read it right after in prep_compound_head().
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@...wei.com>
>> Cc: Joao Martins <joao.m.martins@...cle.com>
>> ---
>> v2:
>> Don't move prep_compound_head() outside loop per Joao.
>> ---
>> mm/page_alloc.c | 17 +++++++++++------
>> 1 file changed, 11 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 4c7d99ee58b4..048df5d78add 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6771,13 +6771,18 @@ static void __ref memmap_init_compound(struct page *head,
>> set_page_count(page, 0);
>>
>> /*
>> - * The first tail page stores compound_mapcount_ptr() and
>> - * compound_order() and the second tail page stores
>> - * compound_pincount_ptr(). Call prep_compound_head() after
>> - * the first and second tail pages have been initialized to
>> - * not have the data overwritten.
>> + * The first tail page stores compound_mapcount_ptr(),
>> + * compound_order() and compound_pincount_ptr(). Call
>> + * prep_compound_head() after the first tail page has
>> + * been initialized to not have the data overwritten.
>> + *
>> + * Note that doing this right after we initialize the
>> + * offending tail page is an attempt to take advantage
>> + * of the likelihood of that tail struct page being
>> + * cached, given that we will read it right after in
>> + * prep_compound_head().
>> */
>> - if (pfn == head_pfn + 2)
>> + if (unlikely(pfn == head_pfn + 1))
>> prep_compound_head(head, order);
>
> To me it seems weird not to move this out of the loop. I see the reason
> is the caching benefit suggested by Joao. But I think this is not a hot
> path, and putting it outside the loop may be more intuitive, at least to
> me. Maybe this optimization is unnecessary (maybe I am wrong).
So, depending on your setup, this might actually sit in the boot path. Yes, it is at
bringup/teardown of new memory, so it does not sit in a 'hot path', and struct pages are
cold. But it is part of a function that initializes a whole DIMM's worth of struct pages.
And PMEM DIMMs can be denser than RAM ones IIRC. In my case we usually have 128G PMEM
DIMMs in our servers.
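For context, the loop in question with this v2 change applied looks roughly like
the below (reconstructed for illustration; treat it as a sketch rather than a
verbatim copy of the tree):

static void __ref memmap_init_compound(struct page *head,
				       unsigned long head_pfn,
				       unsigned long zone_idx, int nid,
				       struct dev_pagemap *pgmap,
				       unsigned long nr_pages)
{
	unsigned long pfn, end_pfn = head_pfn + nr_pages;
	unsigned int order = pgmap->vmemmap_shift;

	__SetPageHead(head);
	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		/* Initialize this tail page's struct page. */
		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
		prep_compound_tail(head, pfn - head_pfn);
		set_page_count(page, 0);

		/* v2 placement: the first tail page was just written above. */
		if (unlikely(pfn == head_pfn + 1))
			prep_compound_head(head, order);
	}
}

This runs once per device-dax/PMEM range at memmap init, over every base page of
every compound page.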
> And it will be consistent with prep_compound_page() (which at least does
> not do a similar optimization) if we drop this optimization.
>
> Hi Joao,
>
> I am wondering: is it a significant optimization for zone device memory?
> I found this code has existed since the 1st version you introduced.
Not quite. It did not exist in the RFC. As a matter of fact, the very first
version totally ignored anything cache related (i.e. it just called
prep_compound_page() in the loop for all struct pages, after all the struct pages
were inited), until Dan suggested I fix that part because people have spent time
optimizing it in the past.
> So
> I suspect maybe you have some numbers; would you like to share them with us?
>
128G DIMMs /with struct pages placed in DRAM/ with 2M page size used to take around
250-400ms. Once you place the struct pages in PMEM, those numbers go up to anywhere
from 4 secs all the way up to 8 secs (there's a lot of variance). Now imagine 12 DIMMs,
and those numbers get into the range of 3-4.8 secs with DRAM struct pages, and with
PMEM struct pages to more than 48 secs.
Note that initializing as compound pages brought those numbers closer to the middle of
that interval, given that we need to do more work than just initializing the raw struct
page. With DRAM struct pages and the vmemmap deduplication trick (which is now used by
default) these decreased to 80-110ms per DIMM. But I originally started with numbers on
the order of ~180-190ms per PMEM DIMM (ignoring cache effects). I should note that I
haven't measured /exactly/ the benefit of calling prep_compound_head() early, but the
other numbers help gauge the cache effects in this path.
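For reference, after commit 5232c63f46fd all the fields that prep_compound_head()
writes live in the first tail page; paraphrasing from memory (so double-check
against the tree, this is a sketch, not verbatim source):

static void prep_compound_head(struct page *page, unsigned int order)
{
	set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);	/* page[1].compound_dtor */
	set_compound_order(page, order);			/* page[1].compound_order */
	atomic_set(compound_mapcount_ptr(page), -1);		/* &page[1].compound_mapcount */
	atomic_set(compound_pincount_ptr(page), 0);		/* &page[1].compound_pincount */
}

All of these land on the first tail page's cachelines, which is the whole basis
for placing the call right after that page is initialized.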
Earlier (in v1) I merely expressed a minor concern. /Maybe/ this matters, or maybe the
cost of prep_compound_head() outweighs the cache-miss avoidance on the succeeding two
tail-page cache lines per 2M page. Well, now it's one tail page. Nonetheless, I would
expect this to be part of the testing the submitter performs, given that this is not one
of those 'no functional change' patches as the v1 commit message claimed :( Should that
be the case, then let's go with v1, as that certainly brings consistency with
prep_compound_page().
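To be clear, the v1 shape, i.e. moving the call out of the loop, would look
something along these lines (sketch only, for the sake of the comparison):

	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_page(pfn);

		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
		prep_compound_tail(head, pfn - head_pfn);
		set_page_count(page, 0);
	}
	/*
	 * By now page[1] was initialized nr_pages iterations ago and is
	 * likely cache-cold again; that is the trade-off versus calling
	 * this from inside the loop.
	 */
	prep_compound_head(head, order);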