Message-ID: <70610ea1-5932-a19f-5eba-c4fba06335da@linux.alibaba.com>
Date:   Thu, 20 Oct 2022 15:15:26 +0800
From:   Baolin Wang <baolin.wang@...ux.alibaba.com>
To:     David Hildenbrand <david@...hat.com>, akpm@...ux-foundation.org
Cc:     arnd@...db.de, jingshan@...ux.alibaba.com, linux-mm@...ck.org,
        linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] mm: Introduce new MADV_NOMOVABLE behavior



On 10/19/2022 11:17 PM, David Hildenbrand wrote:
>> One migration failure case I observed (which is not easy to reproduce)
>> showed a 'thp_migration_fail' count of 1 and a
>> 'thp_split_page_failed' count of 1 as well.
>>
>> That means that when migrating a THP in the CMA area, a new THP could
>> not be allocated due to memory fragmentation, so migration fell back to
>> splitting the THP. However, the THP split also failed, probably because
>> of a temporary reference count on this THP. The temporary reference
>> count can be caused by dropping page caches (I observed a drop caches
>> operation in the system), but the shmem page caches could not be
>> dropped because they were already dirty at that time.
>>
>> So could we retry in migrate_pages() if the THP split fails, to
>> mitigate the migration failure, especially when the failure is caused
>> by a temporary reference count? Does this sound reasonable to you?
> 
> It sounds reasonable, and I understand that debugging these issues is 
> tricky. But we really have to figure out the root cause to make these 
> pages that are indeed movable (but only temporarily not movable for 
> reason XYZ) movable.
> 
> We'd need some indication to retry migration longer / again.

OK. Let me try this and see whether there are other possible failure
cases in our products.
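
For watching the two counters mentioned above while testing a retry, a
minimal user-space sketch could look like the following. It assumes the
counters are exposed in /proc/vmstat under the names 'thp_migration_fail'
and 'thp_split_page_failed' on the running kernel; the helper names
(parse_vmstat, thp_failure_delta, read_vmstat) are made up for
illustration and are not part of any existing tool.

```python
# Sketch only: compare THP failure counters before and after a test run.
# Assumption: the kernel exposes these names in /proc/vmstat.
WATCHED = ("thp_migration_fail", "thp_split_page_failed")


def parse_vmstat(text):
    """Parse 'name value' lines (the /proc/vmstat layout) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly a name followed by an integer.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def thp_failure_delta(before, after, names=WATCHED):
    """Return how much each watched counter grew between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}


def read_vmstat(path="/proc/vmstat"):
    """Take a snapshot of the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

Taking one read_vmstat() snapshot before the CMA allocation test and one
after, then passing both to thp_failure_delta(), shows whether either
counter grew during the run.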

>>
>> However, I am still worried that other cases could cause migration
>> failure, so avoiding CMA allocation seems more stable for our case,
>> IMO.
> 
> Yes, I can understand that. But as one example, your approach doesn't 
> handle the case that a page that was allocated on !CMA/!ZONE_MOVABLE 
> would get migrated to CMA/ZONE_MOVABLE just before you would try pinning 
> the page (to migrate it again off CMA/ZONE_MOVABLE).

Indeed, as you said before, this is only helpful for minimizing page
migration for now. Maybe I can take MADV_PINNABLE into consideration when
allocating new pages, such as in alloc_migration_target().

Anyway, let me try to fix the root cause first and see whether that
solves our problem.

> We really have to fix the root cause.

OK. Thanks for your input.
