Date:   Thu, 22 Apr 2021 12:31:52 -0700
From:   Florian Fainelli <f.fainelli@...il.com>
To:     David Hildenbrand <david@...hat.com>,
        Michal Hocko <mhocko@...e.com>
Cc:     Vlastimil Babka <vbabka@...e.cz>, Mel Gorman <mgorman@...e.de>,
        Minchan Kim <minchan@...nel.org>,
        Johannes Weiner <hannes@...xchg.org>, l.stach@...gutronix.de,
        LKML <linux-kernel@...r.kernel.org>,
        Jaewon Kim <jaewon31.kim@...sung.com>,
        Michal Nazarewicz <mina86@...a86.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Oscar Salvador <OSalvador@...e.com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: alloc_contig_range() with MIGRATE_MOVABLE performance regression since 4.9



On 4/22/2021 11:35 AM, David Hildenbrand wrote:
> On 22.04.21 19:50, Florian Fainelli wrote:
>>
>>
>> On 4/22/2021 1:56 AM, David Hildenbrand wrote:
>>> On 22.04.21 09:49, Michal Hocko wrote:
>>>> Cc David and Oscar who are familiar with this code as well.
>>>>
>>>> On Wed 21-04-21 11:36:01, Florian Fainelli wrote:
>>>>> Hi all,
>>>>>
>>>>> I have been trying for the past few days to identify the source of a
>>>>> performance regression that we are seeing with the 5.4 kernel but not
>>>>> with the 4.9 kernel on ARM64. Testing something newer like 5.10 is
>>>>> a bit
>>>>> challenging at the moment but will happen eventually.
>>>>>
>>>>> What we are seeing is a ~3x increase in the time needed for
>>>>> alloc_contig_range() to allocate 1GB in blocks of 2MB pages. The
>>>>> system
>>>>> is idle at the time and there are no other contenders for memory other
>>>>> than the user-space programs already started (DHCP client, shell,
>>>>> etc.).
>>>
>>> Hi,
>>>
>>> If you can easily reproduce it, it might be worth just bisecting;
>>> that could be faster than manually poking around in the code.
>>>
>>> Also, it would be worth having a look at the state of upstream Linux.
>>> Upstream Linux developers tend to not care about minor performance
>>> regressions on oldish kernels.
>>
>> This is a big pain point here, and I cannot agree more, but until we
>> bridge that gap this is not exactly easy for me to do, unfortunately,
>> and neither is bisection :/
>>
>>>
>>> There has been work on improving exactly the situation you are
>>> describing -- a "fail fast" / "no retry" mode for alloc_contig_range().
>>> Maybe it tackles exactly this issue.
>>>
>>> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org
>>>
>>> Minchan is already on cc.
>>
>> This patch does not appear to be helping, in fact, I had locally applied
>> this patch from way back when:
>>
>> https://lkml.org/lkml/2014/5/28/113
>>
>> which would effectively do this unconditionally. Let me see if I can
>> showcase this problem in an x86 virtual machine operating in
>> conditions similar to ours.
> 
> How exactly are you allocating these 2MiB blocks?
> 
> Via CMA->alloc_contig_range() or via alloc_contig_range() directly? I
> assume via CMA.

I am allocating this memory directly via alloc_contig_range(start, end,
MIGRATE_MOVABLE, GFP_KERNEL), looping over 1024MB in 2MB increments.
This is just a synthetic benchmark, though we do have an allocator that
behaves the same way.
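
For reference, the benchmark loop described above can be sketched as
kernel-side code roughly like this (a sketch only, not the actual test
code; the starting PFN, the pr_* reporting, and the assumption of 4KB
base pages, i.e. 512 pages per 2MB block, are all mine):

```c
/* Sketch: allocate 1024MB in 2MB chunks via alloc_contig_range()
 * and time each call. Assumes 4KB base pages; blocks are left
 * allocated so later allocations see the same conditions as the
 * benchmark, and freeing is omitted for brevity. */
#include <linux/gfp.h>
#include <linux/ktime.h>
#include <linux/sizes.h>

#define BLOCK_PAGES	(SZ_2M / PAGE_SIZE)

static void contig_bench(unsigned long pfn_start)
{
	unsigned long pfn;
	ktime_t t0, t1;
	int i, ret;

	for (i = 0; i < SZ_1G / SZ_2M; i++) {
		pfn = pfn_start + i * BLOCK_PAGES;
		t0 = ktime_get();
		ret = alloc_contig_range(pfn, pfn + BLOCK_PAGES,
					 MIGRATE_MOVABLE, GFP_KERNEL);
		t1 = ktime_get();
		if (ret) {
			pr_err("block %d failed: %d\n", i, ret);
			continue;
		}
		pr_info("block %d took %lld us\n", i,
			ktime_us_delta(t1, t0));
	}
}
```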

> 
> For
> 
> https://lkml.kernel.org/r/20210121175502.274391-3-minchan@kernel.org
> 
> to do its work you'll have to pass __GFP_NORETRY to
> alloc_contig_range(). This requires CMA adaptions, from where we call
> alloc_contig_range().

Yes, I did modify the alloc_contig_range() caller to pass GFP_KERNEL |
__GFP_NORETRY. I ran for more iterations (1000) and the results are not
very conclusive: with __GFP_NORETRY the time per allocation was not
significantly better; in fact it was slightly worse, by about 100us,
than without.
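
The caller modification mentioned here amounts to something like the
following one-line change (a sketch; the surrounding loop and variable
names are assumed to match the benchmark above):

```c
/* Same benchmark call, but asking alloc_contig_range() to fail
 * fast rather than retry migration, per the fail-fast patch. */
ret = alloc_contig_range(pfn, pfn + BLOCK_PAGES, MIGRATE_MOVABLE,
			 GFP_KERNEL | __GFP_NORETRY);
```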

My x86 VM with 1GB of DRAM, including 512MB in ZONE_MOVABLE, shows
identical numbers for both 4.9 and 5.4, so this must be something
specific to ARM64 and/or the code we added to create a ZONE_MOVABLE on
that architecture, since movablecore does not appear to have any effect
there, unlike on x86.
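
For comparison, the x86 VM's movable zone was created with the generic
boot parameter rather than architecture-specific code; assuming the
512MB figure above, the kernel command-line addition looks like:

```
movablecore=512M
```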
-- 
Florian
