linux-kernel - Re: darktable performance regression on AMD systems caused by "mm: align larger anonymous mappings on THP boundaries"

Message-ID: <f81ef5bd-e930-4982-a5a8-cd4aca272912@suse.cz>
Date: Thu, 24 Oct 2024 11:58:43 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Thorsten Leemhuis <regressions@...mhuis.info>,
 Rik van Riel <riel@...riel.com>
Cc: Matthias <matthias@...enbinder.de>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Linux kernel regressions list <regressions@...ts.linux.dev>,
 LKML <linux-kernel@...r.kernel.org>, Linux-MM <linux-mm@...ck.org>,
 Yang Shi <yang@...amperecomputing.com>, Petr Tesarik <ptesarik@...e.com>
Subject: Re: darktable performance regression on AMD systems caused by "mm:
 align larger anonymous mappings on THP boundaries"

On 10/24/24 09:45, Thorsten Leemhuis wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
> 
> Rik, I noticed a report about a regression in bugzilla.kernel.org that
> appears to be caused by the following change of yours:
> 
> efa7df3e3bb5da ("mm: align larger anonymous mappings on THP boundaries")
> [v6.7]
> 
> It might be one of those "some things got faster, a few things became
> slower" situations. Not sure. Felt odd that the reporter was able to
> reproduce it on two AMD systems, but not on an Intel system. Maybe there
> is a bug somewhere else that was exposed by this.

It seems very similar to what we've seen with SPEC benchmarks such as cactus,
which we bisected to the same commit:

https://bugzilla.suse.com/show_bug.cgi?id=1229012

The exact regression varies per system. Intel regresses too, but relatively
less. The theory is that there are many large-ish allocations whose
individual sizes are not aligned to 2MB and which would previously have been
merged. Commit efa7df3e3bb5da causes them to become separate areas, each with
its start aligned to a 2MB boundary and with gaps in between. This (gaps and
vma fragmentation) is not great in itself, but most of the problem seemed to
come from the start alignment, which together with the access pattern causes
more TLB or cache misses due to limited associativity.
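
To illustrate the theory, here is a toy model in Python. It is only a sketch,
not the kernel's allocator: it places mappings bottom-up from address 0
(the kernel typically works top-down), and the four 2.5MB buffers are a
made-up workload. It shows the gaps that appear and that every aligned
mapping starts at the same offset within a 2MB region, so buffers walked in
lockstep would tend to compete for the same cache sets.

```python
THP = 2 * 1024 * 1024  # 2MB THP size, as discussed above


def round_up(value, align):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align


def layout(sizes, thp_align):
    """Place mappings one after another, starting at address 0 (toy model).

    With thp_align=True, a mapping of at least THP size gets its start
    rounded up to a 2MB boundary. Returns a list of (start, end) tuples.
    """
    areas, cursor = [], 0
    for size in sizes:
        start = round_up(cursor, THP) if thp_align and size >= THP else cursor
        areas.append((start, start + size))
        cursor = start + size
    return areas


def gap_bytes(areas):
    """Total unused address space between consecutive areas."""
    return sum(nxt[0] - cur[1] for cur, nxt in zip(areas, areas[1:]))


sizes = [THP + 512 * 1024] * 4  # four hypothetical 2.5MB buffers
packed = layout(sizes, thp_align=False)
aligned = layout(sizes, thp_align=True)

for name, areas in (("packed", packed), ("aligned", aligned)):
    offsets = [start % THP for start, _ in areas]
    print(name, "gap bytes:", gap_bytes(areas), "start offsets in 2MB:", offsets)
```

In the packed case the areas are adjacent (and so could be merged) and their
starts fall at different offsets; in the aligned case there is 1.5MB of gap
after each buffer and all starts share offset 0.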

So maybe darktable has a similar problem. A simple candidate fix could
change the behavior of commit efa7df3e3bb5da so that the mapping size has to
be a multiple of THP size (2MB) in order to become aligned; right now it is
enough for it to be THP-sized or larger.
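
The difference between the two conditions can be sketched as a pair of
predicates. This is Python for illustration only, not kernel code; the
function names are hypothetical and the real check would also involve other
conditions not modeled here.

```python
THP_SIZE = 2 * 1024 * 1024  # 2MB


def aligns_today(length):
    """Condition as described above: THP-sized or larger gets aligned."""
    return length >= THP_SIZE


def aligns_with_candidate_fix(length):
    """Candidate condition: length must be a nonzero multiple of THP size."""
    return length >= THP_SIZE and length % THP_SIZE == 0


# Hypothetical mapping lengths: 1MB, 2MB, 2MB + 4KB, 6MB
for length in (1024 * 1024, THP_SIZE, THP_SIZE + 4096, 3 * THP_SIZE):
    print(length, aligns_today(length), aligns_with_candidate_fix(length))
```

Only the 2MB + 4KB case changes: it is aligned under the current condition
but would be left unaligned under the candidate one.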

> So in the end it felt worth forwarding by mail to me. Not tracking this
> yet, first waiting for feedback.
> 
> To quote from https://bugzilla.kernel.org/show_bug.cgi?id=219366 :
> 
>> Matthias 2024-10-09 05:37:51 UTC
>> 
>> I am using a darktable benchmark and I am finding that RAW-to-JPG
>> conversion is about 15-25 % slower with kernels 6.7-6.10. The last
>> fast kernel series is 6.6. I also tested kernel series 6.5 and it is
>> as fast as 6.6
>> 
>> I know this sounds weird. What has darktable to do with the kernel?
>> But the numbers are true. And the darktable devs tell me that this
>> is a kernel regression. The darktable github issue is:
>> https://github.com/darktable-org/darktable/issues/17397
>> You can find more details there.
>> 
>> What do I do to measure the performance?
>> 
>> I am executing darktable on the command line. opencl is disabled so
>> that all activities are only on the CPU:
>> 
>> darktable-cli bench.SRW /tmp/test.jpg --core --disable-opencl -d perf -d opencl --configdir /tmp
>> 
>> (bench.SRW and the sidecar file can be found here:
>> https://drive.google.com/drive/folders/1cfV2b893JuobVwGiZXcaNv5-yszH6j-N )
>> 
>> This will show some debug output. The line to look for is
>> 
>> 4,2765 [dev_process_export] pixel pipeline processing took 3,811 secs (81,883 CPU)
>> 
>> This gives an exact number for how much time darktable needed to
>> convert the image. The time darktable needs has a clear dependency on
>> the kernel version. It is fast with kernel 6.6 and older and slow with
>> kernel 6.7 and newer. Something must have happened from 6.6 to 6.7
>> which slows down darktable.
>> 
>> The darktable debug output shows that basically only one module is
>> responsible for the slow down: 'atrous'
>> 
>> with kernel 6.6.47:
>> 
>> 4,0548 [dev_pixelpipe] took 0,635 secs (14,597 CPU) [export] processed 'atrous' on CPU, blended on CPU
>> ...
>> 4,2765 [dev_process_export] pixel pipeline processing took 3,811 secs (81,883 CPU)
>> 
>> with kernel 6.10.6:
>> 
>> 4,9645 [dev_pixelpipe] took 1,489 secs (33,736 CPU) [export] processed 'atrous' on CPU, blended on CPU
>> ...
>> 5,2151 [dev_process_export] pixel pipeline processing took 4,773 secs (102,452 CPU)
>> 
>> 
>> This is also being discussed here:
>> https://discuss.pixls.us/t/darktable-performance-regression-with-kernel-6-7-and-newer/45945/1
>> And other users confirm the performance degradation.
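
For anyone comparing kernels with the numbers quoted above, a small Python
sketch for pulling the timings out of the debug output might help. It
assumes each log line is on a single line (not wrapped as in the mail) and
uses a decimal comma as in the reporter's locale; the regular expression is
an assumption based only on the sample lines quoted here, not on darktable's
source.

```python
import re

# Matches e.g. "took 3,811 secs (81,883 CPU)" with decimal commas.
TIMING_RE = re.compile(r"took (\d+,\d+) secs \((\d+,\d+) CPU\)")


def parse_timing(line):
    """Return (wall_seconds, cpu_seconds) from one debug output line."""
    match = TIMING_RE.search(line)
    if match is None:
        raise ValueError("no timing found in line: " + line)
    wall, cpu = (float(value.replace(",", ".")) for value in match.groups())
    return wall, cpu


# Lines taken from the report quoted above (6.6.47 vs 6.10.6).
total_fast = "4,2765 [dev_process_export] pixel pipeline processing took 3,811 secs (81,883 CPU)"
total_slow = "5,2151 [dev_process_export] pixel pipeline processing took 4,773 secs (102,452 CPU)"
atrous_fast = "4,0548 [dev_pixelpipe] took 0,635 secs (14,597 CPU) [export] processed 'atrous' on CPU, blended on CPU"
atrous_slow = "4,9645 [dev_pixelpipe] took 1,489 secs (33,736 CPU) [export] processed 'atrous' on CPU, blended on CPU"

total_ratio = parse_timing(total_slow)[0] / parse_timing(total_fast)[0]
atrous_ratio = parse_timing(atrous_slow)[0] / parse_timing(atrous_fast)[0]

print(f"whole pipeline: {total_ratio:.2f}x wall time")
print(f"atrous module:  {atrous_ratio:.2f}x wall time")
```

With the quoted figures the whole pipeline takes about 1.25x as long on
6.10.6, while the 'atrous' module alone takes about 2.34x as long.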
> 
> [...]
> 
>> This seems to affect AMD only. I reproduced this performance
>> degradation on two different Ryzen Desktop PCs (Ryzen 5 and Ryzen
>> 9). But I can not reproduce it on my Intel PC (Lenovo X1 Carbon,
>> core i5).
> 
> [...]
> 
>> By the way, there is also a thread in the darktable forum on this topic:
>> https://discuss.pixls.us/t/darktable-performance-regression-with-kernel-6-7-and-newer/45945
>>  
>> Some users reproduced it there as well.
> 
> See the ticket for more details. The reporter is CCed. openZFS is in
> use, but the problem was reproduced on vanilla kernels.
> 
> Ciao, Thorsten
>