Message-ID: <fcd99cb8-0ccf-49f3-3450-63b5ca5eac1d@oracle.com>
Date: Wed, 14 Oct 2020 12:15:35 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
kirill@...temov.name, mhocko@...nel.org,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 6/8] mm, clear_huge_page: use clear_page_uncached() for
gigantic pages
On 2020-10-14 8:28 a.m., Ingo Molnar wrote:
>
> * Ankur Arora <ankur.a.arora@...cle.com> wrote:
>
>> Uncached writes are suitable for circumstances where the region written to
>> is not expected to be read again soon, or the region written to is large
>> enough that there's no expectation that we will find the writes in the
>> cache.
>>
>> Accordingly switch to using clear_page_uncached() for gigantic pages.
>>
>> Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
>> ---
>> mm/memory.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index eeae590e526a..4d2c58f83ab1 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5092,7 +5092,7 @@ static void clear_gigantic_page(struct page *page,
>> for (i = 0; i < pages_per_huge_page;
>> i++, p = mem_map_next(p, page, i)) {
>> cond_resched();
>> - clear_user_highpage(p, addr + i * PAGE_SIZE);
>> + clear_user_highpage_uncached(p, addr + i * PAGE_SIZE);
>> }
>> }
>
> So this does the clearing in 4K chunks, and your measurements suggest that
> short memory clearing is not as efficient, right?
I did not measure that separately (though I should), but the performance numbers
around that were somewhat puzzling.
For MOVNTI, the performance via perf bench (a single call to memset_movnti())
is pretty close (within the margin of error) to what we see with the page-fault
workload (4K chunks in clear_page_nt()).
With 'REP;STOS' though, there's a degradation (~30% on Broadwell, ~5% on Rome)
going from perf bench (a single call to memset_erms()) to the page-fault
workload (4K chunks in clear_page_erms()).
In the page-fault case we execute many more 'REP;STOS' invocations, while the
total instruction count is pretty much the same in both, so maybe that's what
accounts for it. But I checked, and we are not frontend bound.
Maybe there are high setup costs for 'REP;STOS' on Broadwell? It does advertise
X86_FEATURE_ERMS though...
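The ERMS variant, for comparison, comes down to a single 'REP;STOSB' per call,
along the lines of clear_page_erms() -- sketch below. If there is a fixed
per-invocation startup cost, the 4K-chunked path pays it 512 times per 2MB,
which would fit the numbers, though that's speculation on my part:

/* Sketch of a 'REP;STOSB' based clear, a la clear_page_erms(). */
static void clear_page_erms_sketch(void *dst, unsigned long size)
{
	asm volatile("rep stosb"
		     : "+D" (dst), "+c" (size)
		     : "a" (0)
		     : "memory");
}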
>
> I'm wondering whether it would make sense to do 2MB chunked clearing on
> 64-bit CPUs, instead of 512x 4k clearing? Both 2MB and GB pages are
> continuous in memory, so accessible to these instructions in a single
> narrow loop.
Yeah, I think it makes sense to do and should be quite straightforward as
well. I'll try that out. I suspect it might help the X86_FEATURE_NT_BAD models
more, but there's no reason for it to hurt anywhere.
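Roughly something like the below, where clear_huge_page_chunk() is a made-up
helper that clears an arbitrary extent with a single uncached-store loop (and
I'm ignoring the user-address bookkeeping that clear_user_highpage() does):

/*
 * Sketch only: on 64-bit (!HIGHMEM) a gigantic page is physically
 * contiguous and fully mapped in the direct map, so we can clear it
 * in PMD_SIZE chunks instead of 512 separate 4K calls per 2MB.
 */
static void clear_gigantic_page_chunked(struct page *page,
					unsigned int pages_per_huge_page)
{
	void *base = page_address(page);
	unsigned long bytes = (unsigned long)pages_per_huge_page * PAGE_SIZE;
	unsigned long off;

	for (off = 0; off < bytes; off += PMD_SIZE) {
		cond_resched();
		/* hypothetical helper, not an existing interface */
		clear_huge_page_chunk(base + off, PMD_SIZE);
	}
}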
Ankur
>
> Thanks,
>
> Ingo
>