linux-kernel - Re: [RFC PATCH] mm: Introduce new MADV

Message-ID: <70610ea1-5932-a19f-5eba-c4fba06335da@linux.alibaba.com>
Date:   Thu, 20 Oct 2022 15:15:26 +0800
From:   Baolin Wang <baolin.wang@...ux.alibaba.com>
To:     David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org
Cc:     arnd@...db.de, jingshan@...ux.alibaba.com, linux-mm@...ck.org,
        linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] mm: Introduce new MADV_NOMOVABLE behavior



On 10/19/2022 11:17 PM, David Hildenbrand wrote:
>> I observed one migration failure case (which is not easy to reproduce)
>> where the 'thp_migration_fail' count is 1 and the
>> 'thp_split_page_failed' count is also 1.
>>
>> That means a THP in the CMA area was being migrated, but a new THP
>> could not be allocated due to memory fragmentation, so the THP had to
>> be split. However, the THP split also failed, probably because of a
>> temporary reference count on this THP. The temporary reference count
>> can be caused by dropping page caches (I observed the drop caches
>> operation in the system), but we can not drop the shmem page caches
>> because they are already dirty at that time.
>>
>> So can we try again in migrate_pages() if the THP split fails, to
>> mitigate the migration failure, especially when the failure reason is
>> a temporary reference count? Does this sound reasonable to you?
> 
> It sounds reasonable, and I understand that debugging these issues is 
> tricky. But we really have to figure out the root cause to make these 
> pages that are indeed movable (but only temporarily not movable for 
> reason XYZ) movable.
> 
> We'd need some indication to retry migration longer / again.

OK. Let me try this and see if there are other possible failure cases in 
the products.
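
To make the retry idea concrete, here is a minimal userspace sketch, not
kernel code. The names fake_thp, try_split() and split_with_retry() are
hypothetical, and the assumption that one temporary reference drains per
attempt is only a model of "temporarily not movable"; the real change
would live in the migrate_pages() path.

```c
/* Hypothetical outcome codes for one split attempt; not kernel API. */
enum split_result { SPLIT_OK, SPLIT_BUSY /* transient extra reference */ };

/* Hypothetical model of a THP with temporary references held elsewhere. */
struct fake_thp {
	int extra_refs;	/* temporary references still outstanding */
};

/*
 * One split attempt. It fails while extra references remain.
 * Assumption: one temporary reference is dropped between attempts.
 */
enum split_result try_split(struct fake_thp *thp)
{
	if (thp->extra_refs > 0) {
		thp->extra_refs--;
		return SPLIT_BUSY;
	}
	return SPLIT_OK;
}

/*
 * Retry the split up to max_tries times.
 * Returns the number of attempts used, or -1 if every attempt failed.
 */
int split_with_retry(struct fake_thp *thp, int max_tries)
{
	for (int attempt = 1; attempt <= max_tries; attempt++) {
		if (try_split(thp) == SPLIT_OK)
			return attempt;
	}
	return -1;
}
```

In this model a bounded retry only helps when the extra references are
really transient; a long-lived reference still ends in failure, which is
why the root cause matters.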

>>
>> However, I am still worried that there are other possible cases that
>> cause migration failure, so no CMA allocation for our case seems more
>> stable IMO.
> 
> Yes, I can understand that. But as one example, your approach doesn't 
> handle the case where a page that was allocated on !CMA/!ZONE_MOVABLE 
> would get migrated to CMA/ZONE_MOVABLE just before you would try pinning 
> the page (to migrate it again off CMA/ZONE_MOVABLE).

Indeed. Like you said before, this is just helpful for minimizing page 
migration for now. Maybe I can take MADV_PINNABLE into consideration when 
allocating new pages, such as in alloc_migration_target().

Anyway let me try to fix the root cause first to see if it can solve our 
problem.
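
For tracking this while testing, a small userspace helper along these
lines could sample the two counters mentioned above. This is only a
sketch: vmstat_lookup() and read_vmstat() are hypothetical helper names,
and it assumes the usual "name value" per-line layout of /proc/vmstat.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text formatted like /proc/vmstat
 * ("name value" per line). Returns 0 and stores the value on success,
 * -1 if the name is not present or the value cannot be parsed.
 */
int vmstat_lookup(const char *text, const char *name,
		  unsigned long long *out)
{
	size_t name_len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Match the whole name followed by a space, not a prefix. */
		if (strncmp(line, name, name_len) == 0 &&
		    line[name_len] == ' ') {
			if (sscanf(line + name_len + 1, "%llu", out) == 1)
				return 0;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read /proc/vmstat into buf; returns 0 on success, -1 on failure. */
int read_vmstat(char *buf, size_t size)
{
	FILE *f = fopen("/proc/vmstat", "r");
	size_t n;

	if (!f)
		return -1;
	n = fread(buf, 1, size - 1, f);
	buf[n] = '\0';
	fclose(f);
	return n > 0 ? 0 : -1;
}
```

Sampling 'thp_migration_fail' and 'thp_split_page_failed' before and
after a test run, and comparing the deltas, would show whether a
migration failure coincided with a failed split.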

> We really have to fix the root cause.

OK. Thanks for your input.