Message-ID: <3227b440-5dbf-433d-8658-f37f9561554a@amd.com>
Date: Thu, 11 Sep 2025 16:31:27 +0200
From: Christian König <christian.koenig@....com>
To: Michel Dänzer <michel.daenzer@...lbox.org>,
 Thadeu Lima de Souza Cascardo <cascardo@...lia.com>
Cc: Huang Rui <ray.huang@....com>, Matthew Auld <matthew.auld@...el.com>,
 Matthew Brost <matthew.brost@...el.com>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
 kernel-dev@...lia.com, Sergey Senozhatsky <senozhatsky@...omium.org>
Subject: Re: [PATCH] drm: ttm: do not direct reclaim when allocating high
 order pages

On 11.09.25 14:49, Michel Dänzer wrote:
>>>> What we are seeing here is on a low memory (4GiB) single node system with
>>>> an APU, that it will have lots of latencies trying to allocate memory by
>>>> doing direct reclaim trying to allocate order-10 pages, which will fail and
>>>> down it goes until it gets to order-4 or order-3. With this change, we
>>>> don't see those latencies anymore and memory pressure goes down as well.
>>> That reminds me of the scenario I described in the 00862edba135 ("drm/ttm: Use GFP_TRANSHUGE_LIGHT for allocating huge pages") commit log, where taking a filesystem backup could cause Firefox to freeze for on the order of a minute.
>>>
>>> Something like that can't just be ignored as "not a problem" for a potential 30% performance gain.
>>
>> Well, using 2MiB pages is actually a must-have for certain HW features, and we have quite a lot of people pushing to always use them.
> 
> Latency can't just be ignored though. Interactive apps intermittently freezing because this code desperately tries to reclaim huge pages while the system is under memory pressure isn't acceptable.

Why should that not be acceptable?

The purpose of the fallback is to allow displaying messages like "Your system is low on memory, please close some applications!" instead of triggering the OOM killer directly.

In that situation latency is no longer the priority; functionality is.

> Maybe there could be some kind of mechanism which periodically scans BOs for sub-optimal page orders and tries migrating their storage to more optimal pages.

Well, the problem usually happens because automatic page de-fragmentation is turned off; we have had quite a number of bug reports about that.

So you are basically suggesting implementing something at the BO level which the system administrator has previously turned off at the page level.

On the other hand, in this particular case it could be that the system simply doesn't have enough memory for the particular use case.

>> So that TTM still falls back to lower order allocations is just a compromise to not trigger the OOM killer.
>>
>> What we could do is remove the fallback, but then Cascardo's use case wouldn't work at all any more.
> 
> Surely the issue is direct reclaim, not the fallback.

I would rather say the issue is that the fallback makes people think direct reclaim isn't mandatory.

Regards,
Christian.
