linux-kernel - Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before pinning

Message-ID: <55D1A525.5090706@ozlabs.ru>
Date:	Mon, 17 Aug 2015 19:11:01 +1000
From:	Alexey Kardashevskiy <aik@...abs.ru>
To:	Vlastimil Babka <vbabka@...e.cz>, linux-mm@...ck.org
Cc:	Alexander Duyck <alexander.h.duyck@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Benjamin Herrenschmidt <benh@...nel.crashing.org>,
	David Gibson <david@...son.dropbear.id.au>,
	Johannes Weiner <hannes@...xchg.org>,
	Joonsoo Kim <js1304@...il.com>, Mel Gorman <mgorman@...e.de>,
	Michal Hocko <mhocko@...e.cz>,
	Paul Mackerras <paulus@...ba.org>,
	Sasha Levin <sasha.levin@...cle.com>,
	linux-kernel@...r.kernel.org,
	Alex Williamson <alex.williamson@...hat.com>,
	Alexander Graf <agraf@...e.de>,
	Paolo Bonzini <pbonzini@...hat.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@...ux.vnet.ibm.com>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [RFC PATCH kernel vfio] mm: vfio: Move pages out of CMA before
 pinning

On 08/17/2015 05:45 PM, Vlastimil Babka wrote:
> On 08/05/2015 10:08 AM, Alexey Kardashevskiy wrote:
>> This is about VFIO aka PCI passthrough used from QEMU.
>> KVM is irrelevant here.
>>
>> QEMU is a machine emulator. It allocates guest RAM from anonymous memory
>> and these pages are movable which is ok. They may happen to be allocated
>> from the contiguous memory allocation zone (CMA). Which is also ok as
>> long as they are movable.
>>
>> However if the guest starts using VFIO (which can be hotplugged into
>> the guest), in most cases it involves DMA which requires guest RAM pages
>> to be pinned and not move once their addresses are programmed to
>> the hardware for DMA.
>>
>> So we end up in a situation where quite a few pages in CMA are not movable
>> anymore. And we get a bunch of these:
>>
>> [77306.513966] alloc_contig_range: [1f3800, 1f78c4) PFNs busy
>> [77306.514448] alloc_contig_range: [1f3800, 1f78c8) PFNs busy
>> [77306.514927] alloc_contig_range: [1f3800, 1f78cc) PFNs busy
>
> IIRC CMA was for mobile devices and their camera/codec drivers and you
> don't use QEMU on those? What do you need CMA for in your case?


I do not want QEMU to get memory from CMA; that is my point. It just
sometimes happens that the kernel allocates movable pages from there.


>
>> This is a very rough patch to start the conversation about how to move
>> pages properly. mm/page_alloc.c does this and
>> arch/powerpc/mm/mmu_context_iommu.c exploits it.
>
> OK such conversation should probably start by mentioning the VM_PINNED
> effort by Peter Zijlstra: https://lkml.org/lkml/2014/5/26/345
>
> It's a more general approach to dealing with pinned pages, and moving them
> out of the CMA area (and compacting them in general) prior to pinning is one
> of the things that should be done within that framework.


And I assume these patches did not go anywhere, right?...



> Then there's the effort to enable migrating pages other than LRU during
> compaction (and thus CMA allocation): https://lwn.net/Articles/650864/
> I don't know if that would be applicable in your use case, i.e. are the
> pins for DMA short-lived and can the isolation/migration code wait a bit
> for the transfer to finish so it can grab them, or something?


Pins for DMA are long-lived: they last pretty much as long as the guest is
running. So this "compaction" comes too late.


>>
>> Please do not comment on the style and code placement,
>> this is just to give some context :)
>>
>> Obviously, this does not work well - it manages to migrate only a few pages
>> and crashes as it is missing locks/disabling interrupts and I probably
>> should not just remove pages from LRU list (normally, I guess, only these
>> can migrate) and a million other things.
>>
>> The questions are:
>>
>> - what is the correct way of telling if the page is in CMA?
>> is (get_pageblock_migratetype(page) == MIGRATE_CMA) good enough?
>
> Should be.
>
>> - how to tell MM to move a page away? I am calling migrate_pages() with
>> a get_new_page callback which allocates a page with GFP_USER but without
>> GFP_MOVABLE which should allocate a new page out of CMA which seems ok but
>> there is a little concern that we might want to add MOVABLE back when
>> the VFIO device is unplugged from the guest.
>
> Hmm, once the page is allocated, then the migratetype is not tracked
> anywhere (except in page_owner debug data). But the unmovable allocations
> might exhaust available unmovable pageblocks and lead to fragmentation. So
> "add MOVABLE back" would be too late. Instead we would need to tell the
> allocator somehow to give us movable page but outside of CMA.

If it is movable, why do we care whether it is in CMA or not?

> CMA's own
> __alloc_contig_migrate_range() avoids this problem by allocating movable
> pages, but the range has been already page-isolated and thus the allocator
> won't see the pages there. You obviously can't take this approach and
> isolate all CMA pageblocks like that.  That smells like a new __GFP_FLAG, meh.


I understood (more or less) all of it except the new __GFP_FLAG: when would
it be used, and by what?



>> - do I need to isolate pages by using isolate_migratepages_range,
>> reclaim_clean_pages_from_list like __alloc_contig_migrate_range does?
>> I dropped them for now and the patch uses only @migratepages from
>> the compact_control struct.
>
> You don't have to do reclaim_clean_pages_from_list(), but the isolation has
> to be careful, yeah.


Does the isolation here mean isolating the whole CMA zone, i.e. the approach
I "obviously can't take"? :)


>> - are there any flags in madvise() to address this (could not
>> locate any relevant)?
>
> AFAIK there's no madvise(I_WILL_BE_PINNING_THIS_RANGE)
>
>> - what else is missing? disabled interrupts? locks?
>
> See what isolate_migratepages_block() does.


Thanks for the pointers! I'll have a closer look at Peter's patchset.
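
To make the idea under discussion concrete (check whether a page sits in a
CMA pageblock and, if so, move it to a movable page outside CMA before
pinning it), here is a minimal userspace toy model. It is only a sketch:
every toy_* name and the three-pageblock layout are hypothetical and exist
purely for illustration. None of this is kernel API, and it ignores locking,
LRU isolation, interrupts and everything else the real migrate_pages() path
has to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: names mirror kernel concepts but are NOT the kernel API. */
enum toy_migratetype { TOY_UNMOVABLE, TOY_MOVABLE, TOY_CMA };

#define TOY_PAGES_PER_BLOCK 4
#define TOY_NR_BLOCKS       3
#define TOY_NR_PAGES        (TOY_PAGES_PER_BLOCK * TOY_NR_BLOCKS)

struct toy_page {
    bool in_use;  /* page currently holds data */
    bool pinned;  /* page must not move any more (e.g. DMA target) */
    int  owner;   /* stand-in for the page contents, -1 when free */
};

struct toy_mem {
    enum toy_migratetype block_type[TOY_NR_BLOCKS];
    struct toy_page pages[TOY_NR_PAGES];
};

/* Assumed layout: block 0 unmovable, block 1 movable, block 2 CMA. */
static void toy_mem_init(struct toy_mem *m)
{
    m->block_type[0] = TOY_UNMOVABLE;
    m->block_type[1] = TOY_MOVABLE;
    m->block_type[2] = TOY_CMA;
    for (size_t i = 0; i < TOY_NR_PAGES; i++) {
        m->pages[i].in_use = false;
        m->pages[i].pinned = false;
        m->pages[i].owner = -1;
    }
}

/* Rough analogue of get_pageblock_migratetype(page) == MIGRATE_CMA. */
static bool toy_page_in_cma(const struct toy_mem *m, size_t pfn)
{
    assert(pfn < TOY_NR_PAGES);
    return m->block_type[pfn / TOY_PAGES_PER_BLOCK] == TOY_CMA;
}

/*
 * Grab a free page from a movable pageblock that is not CMA, i.e. the
 * "movable but outside of CMA" allocation discussed in the thread.
 * Returns TOY_NR_PAGES when no such page is free.
 */
static size_t toy_alloc_outside_cma(struct toy_mem *m)
{
    for (size_t pfn = 0; pfn < TOY_NR_PAGES; pfn++) {
        if (m->block_type[pfn / TOY_PAGES_PER_BLOCK] != TOY_MOVABLE)
            continue;
        if (!m->pages[pfn].in_use) {
            m->pages[pfn].in_use = true;
            return pfn;
        }
    }
    return TOY_NR_PAGES;
}

/*
 * Pin a page, first moving it out of CMA if needed. Returns the PFN that
 * ends up pinned, or TOY_NR_PAGES if no destination page was available.
 */
static size_t toy_pin_page(struct toy_mem *m, size_t pfn)
{
    assert(pfn < TOY_NR_PAGES && m->pages[pfn].in_use);
    if (toy_page_in_cma(m, pfn)) {
        size_t dst = toy_alloc_outside_cma(m);
        if (dst == TOY_NR_PAGES)
            return TOY_NR_PAGES;
        m->pages[dst] = m->pages[pfn];   /* "migrate" the contents */
        m->pages[pfn].in_use = false;    /* CMA page becomes free again */
        m->pages[pfn].owner = -1;
        pfn = dst;
    }
    m->pages[pfn].pinned = true;
    return pfn;
}
```

In this model, calling toy_pin_page() on an in-use page in the CMA block
returns a PFN in the movable block and leaves the CMA page free, so a later
contiguous allocation from that block would not find it busy.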


-- 
Alexey