linux-kernel - Re: [PATCH v14 4/5] mm: support reporting free page blocks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <599A79DF.2000707@intel.com>
Date:   Mon, 21 Aug 2017 14:12:47 +0800
From:   Wei Wang <wei.w.wang@...el.com>
To:     Michal Hocko <mhocko@...nel.org>
CC:     virtio-dev@...ts.oasis-open.org, linux-kernel@...r.kernel.org,
        qemu-devel@...gnu.org, virtualization@...ts.linux-foundation.org,
        kvm@...r.kernel.org, linux-mm@...ck.org, mst@...hat.com,
        akpm@...ux-foundation.org, mawilcox@...rosoft.com,
        david@...hat.com, cornelia.huck@...ibm.com,
        mgorman@...hsingularity.net, aarcange@...hat.com,
        amit.shah@...hat.com, pbonzini@...hat.com, willy@...radead.org,
        liliang.opensource@...il.com, yang.zhang.wz@...il.com,
        quan.xu@...yun.com
Subject: Re: [PATCH v14 4/5] mm: support reporting free page blocks

On 08/18/2017 09:46 PM, Michal Hocko wrote:
> On Thu 17-08-17 11:26:55, Wei Wang wrote:
>> This patch adds support to walk through the free page blocks in the
>> system and report them via a callback function. Some page blocks may
>> leave the free list after zone->lock is released, so it is the caller's
>> responsibility to either detect or prevent the use of such pages.
> This could see more details to be honest. Especially the usecase you are
> going to use this for. This will help us to understand the motivation
> in future when the current user might be gone a new ones largely diverge
> into a different usage. This wouldn't be the first time I have seen
> something like that.

OK, I will more details here about how it's used to accelerate live 
migration.

>> Signed-off-by: Wei Wang <wei.w.wang@...el.com>
>> Signed-off-by: Liang Li <liang.z.li@...el.com>
>> Cc: Michal Hocko <mhocko@...nel.org>
>> Cc: Michael S. Tsirkin <mst@...hat.com>
>> ---
>>   include/linux/mm.h |  6 ++++++
>>   mm/page_alloc.c    | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 50 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 46b9ac5..cd29b9f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1835,6 +1835,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
>>   		unsigned long zone_start_pfn, unsigned long *zholes_size);
>>   extern void free_initmem(void);
>>   
>> +extern void walk_free_mem_block(void *opaque1,
>> +				unsigned int min_order,
>> +				void (*visit)(void *opaque2,
>> +					      unsigned long pfn,
>> +					      unsigned long nr_pages));
>> +
>>   /*
>>    * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
>>    * into the buddy system. The freed pages will be poisoned with pattern
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 6d00f74..a721a35 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4762,6 +4762,50 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>>   	show_swap_cache_info();
>>   }
>>   
>> +/**
>> + * walk_free_mem_block - Walk through the free page blocks in the system
>> + * @opaque1: the context passed from the caller
>> + * @min_order: the minimum order of free lists to check
>> + * @visit: the callback function given by the caller
> The original suggestion for using visit was motivated by a visit design
> pattern but I can see how this can be confusing. Maybe a more explicit
> name wold be better. What about report_free_range.


I'm afraid that name would be too long to fit in nicely.
How about simply naming it "report"?


>
>> + *
>> + * The function is used to walk through the free page blocks in the system,
>> + * and each free page block is reported to the caller via the @visit callback.
>> + * Please note:
>> + * 1) The function is used to report hints of free pages, so the caller should
>> + * not use those reported pages after the callback returns.
>> + * 2) The callback is invoked with the zone->lock being held, so it should not
>> + * block and should finish as soon as possible.
> I think that the explicit note about zone->lock is not really need. This
> can change in future and I would even bet that somebody might rely on
> the lock being held for some purpose and silently get broken with the
> change. Instead I would much rather see something like the following:
> "
> Please note that there are no locking guarantees for the callback

Just a little confused with this one:

The callback is invoked within zone->lock, why would we claim it "no
locking guarantees for the callback"?

> and
> that the reported pfn range might be freed or disappear after the
> callback returns so the caller has to be very careful how it is used.
>
> The callback itself must not sleep or perform any operations which would
> require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> or via any lock dependency. It is generally advisable to implement
> the callback as simple as possible and defer any heavy lifting to a
> different context.
>
> There is no guarantee that each free range will be reported only once
> during one walk_free_mem_block invocation.
>
> pfn_to_page on the given range is strongly discouraged and if there is
> an absolute need for that make sure to contact MM people to discuss
> potential problems.
>
> The function itself might sleep so it cannot be called from atomic
> contexts.
>
> In general low orders tend to be very volatile and so it makes more
> sense to query larger ones for various optimizations which like
> ballooning etc... This will reduce the overhead as well.
> "

I think it looks quite comprehensive. Thanks.


Best,
Wei