Message-ID: <ca39c36e-2646-d04c-e5ae-5fd2896f1279@redhat.com>
Date: Tue, 4 Jun 2019 09:16:15 -0400
From: Nitesh Narayan Lal <nitesh@...hat.com>
To: David Hildenbrand <david@...hat.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, pbonzini@...hat.com, lcapitulino@...hat.com,
pagupta@...hat.com, wei.w.wang@...el.com, yang.zhang.wz@...il.com,
riel@...riel.com, mst@...hat.com, dodgen@...gle.com,
konrad.wilk@...cle.com, dhildenb@...hat.com, aarcange@...hat.com,
alexander.duyck@...il.com
Subject: Re: [RFC][Patch v10 1/2] mm: page_hinting: core infrastructure
On 6/3/19 3:57 PM, David Hildenbrand wrote:
> On 03.06.19 19:03, Nitesh Narayan Lal wrote:
>> This patch introduces the core infrastructure for free page hinting in
>> virtual environments. It enables the kernel to track the free pages which
>> can be reported to its hypervisor so that the hypervisor can
>> free and reuse that memory as needed.
>>
>> While the pages are getting processed in the hypervisor (e.g.,
>> via MADV_FREE), the guest must not use them; otherwise, data loss
>> would be possible. To avoid such a situation, these pages are
>> temporarily removed from the buddy. The number of pages removed
>> temporarily from the buddy is governed by the backend (virtio-balloon
>> in our case).
>>
>> To efficiently identify free pages that can be hinted to the
>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
>> chunks ("MAX_ORDER - 2" on x86) are reported to the hypervisor, both to
>> avoid breaking up THP in the hypervisor and to save space. The bits
>> in the bitmap are an indication whether a page *might* be free, not a
>> guarantee. A new hook after buddy merging sets the bits.
>>
>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
>> asynchronously processes the bitmaps, trying to isolate and report pages
>> that are still free. The backend (virtio-balloon) is responsible for
>> reporting these batched pages to the host synchronously. Once reporting/
>> freeing is complete, isolated pages are returned back to the buddy.
>>
>> There are still various things to look into (e.g., memory hotplug, more
>> efficient locking, possible races when disabling).
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@...hat.com>
>> ---
>> drivers/virtio/Kconfig | 1 +
>> include/linux/page_hinting.h | 46 +++++++
>> mm/Kconfig | 6 +
>> mm/Makefile | 2 +
>> mm/page_alloc.c | 17 +--
>> mm/page_hinting.c | 236 +++++++++++++++++++++++++++++++++++
>> 6 files changed, 301 insertions(+), 7 deletions(-)
>> create mode 100644 include/linux/page_hinting.h
>> create mode 100644 mm/page_hinting.c
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index 35897649c24f..5a96b7a2ed1e 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -46,6 +46,7 @@ config VIRTIO_BALLOON
>> tristate "Virtio balloon driver"
>> depends on VIRTIO
>> select MEMORY_BALLOON
>> + select PAGE_HINTING
>> ---help---
>> This driver supports increasing and decreasing the amount
>> of memory within a KVM guest.
>> diff --git a/include/linux/page_hinting.h b/include/linux/page_hinting.h
>> new file mode 100644
>> index 000000000000..e65188fe1e6b
>> --- /dev/null
>> +++ b/include/linux/page_hinting.h
>> @@ -0,0 +1,46 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_PAGE_HINTING_H
>> +#define _LINUX_PAGE_HINTING_H
>> +
>> +/*
>> + * Minimum page order required for a page to be hinted to the host.
>> + */
>> +#define PAGE_HINTING_MIN_ORDER (MAX_ORDER - 2)
>> +
>> +/*
>> + * struct page_hinting_cb: holds the callbacks to store, report and cleanup
>> + * isolated pages.
>> + * @prepare: Callback responsible for allocating an array to hold
>> + * the isolated pages.
>> + * @hint_pages: Callback which reports the isolated pages synchronously
>> + * to the host.
>> + * @cleanup: Callback to free the array used for reporting the
>> + * isolated pages.
>> + * @max_pages: Maximum number of pages that are going to be hinted to the host
>> + * at a time of granularity >= PAGE_HINTING_MIN_ORDER.
>> + */
>> +struct page_hinting_cb {
>> + int (*prepare)(void);
>> + void (*hint_pages)(struct list_head *list);
>> + void (*cleanup)(void);
>> + int max_pages;
> If we allocate the array in virtio-balloon differently (e.g. similar to
> bulk inflation/deflation of pfn's right now), we can most probably get
> rid of prepare() and cleanup(), simplifying the code further.
Yeap, as Alexander suggested. I will go for static allocation and remove
the prepare() and cleanup() callbacks.
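Something like this untested sketch is what I have in mind (the name and
size below are placeholders, not final):

	/* in virtio_balloon.c, sized for the largest batch we hint */
	#define VIRTIO_BALLOON_ARRAY_HINT_PAGES	16

	static __virtio64 hint_pfns[VIRTIO_BALLOON_ARRAY_HINT_PAGES];

With a static array, prepare() and cleanup() can simply be dropped from
struct page_hinting_cb, leaving only hint_pages() and max_pages.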
>> +};
>> +
>> +#ifdef CONFIG_PAGE_HINTING
>> +void page_hinting_enqueue(struct page *page, int order);
>> +void page_hinting_enable(const struct page_hinting_cb *cb);
>> +void page_hinting_disable(void);
>> +#else
>> +static inline void page_hinting_enqueue(struct page *page, int order)
>> +{
>> +}
>> +
>> +static inline void page_hinting_enable(struct page_hinting_cb *cb)
>> +{
>> +}
>> +
>> +static inline void page_hinting_disable(void)
>> +{
>> +}
>> +#endif
>> +#endif /* _LINUX_PAGE_HINTING_H */
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index ee8d1f311858..177d858de758 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -764,4 +764,10 @@ config GUP_BENCHMARK
>> config ARCH_HAS_PTE_SPECIAL
>> bool
>>
>> +# PAGE_HINTING will allow the guest to report the free pages to the
>> +# host at regular intervals.
>> +config PAGE_HINTING
>> + bool
>> + def_bool n
>> + depends on X86_64
>> endmenu
>> diff --git a/mm/Makefile b/mm/Makefile
>> index ac5e5ba78874..bec456dfee34 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -41,6 +41,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
>> interval_tree.o list_lru.o workingset.o \
>> debug.o $(mmu-y)
>>
>> +
>> # Give 'page_alloc' its own module-parameter namespace
>> page-alloc-y := page_alloc.o
>> page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
>> @@ -94,6 +95,7 @@ obj-$(CONFIG_Z3FOLD) += z3fold.o
>> obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>> obj-$(CONFIG_CMA) += cma.o
>> obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
>> +obj-$(CONFIG_PAGE_HINTING) += page_hinting.o
>> obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
>> obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
>> obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3b13d3914176..d12f69e0e402 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -68,6 +68,7 @@
>> #include <linux/lockdep.h>
>> #include <linux/nmi.h>
>> #include <linux/psi.h>
>> +#include <linux/page_hinting.h>
>>
>> #include <asm/sections.h>
>> #include <asm/tlbflush.h>
>> @@ -873,10 +874,10 @@ compaction_capture(struct capture_control *capc, struct page *page,
>> * -- nyc
>> */
>>
>> -static inline void __free_one_page(struct page *page,
>> +inline void __free_one_page(struct page *page,
>> unsigned long pfn,
>> struct zone *zone, unsigned int order,
>> - int migratetype)
>> + int migratetype, bool hint)
>> {
>> unsigned long combined_pfn;
>> unsigned long uninitialized_var(buddy_pfn);
>> @@ -951,6 +952,8 @@ static inline void __free_one_page(struct page *page,
>> done_merging:
>> set_page_order(page, order);
>>
>> + if (hint)
>> + page_hinting_enqueue(page, order);
>> /*
>> * If this is not the largest possible page, check if the buddy
>> * of the next-highest order is free. If it is, it's possible
>> @@ -1262,7 +1265,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>> if (unlikely(isolated_pageblocks))
>> mt = get_pageblock_migratetype(page);
>>
>> - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
>> + __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
>> trace_mm_page_pcpu_drain(page, 0, mt);
>> }
>> spin_unlock(&zone->lock);
>> @@ -1271,14 +1274,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>> static void free_one_page(struct zone *zone,
>> struct page *page, unsigned long pfn,
>> unsigned int order,
>> - int migratetype)
>> + int migratetype, bool hint)
>> {
>> spin_lock(&zone->lock);
>> if (unlikely(has_isolate_pageblock(zone) ||
>> is_migrate_isolate(migratetype))) {
>> migratetype = get_pfnblock_migratetype(page, pfn);
>> }
>> - __free_one_page(page, pfn, zone, order, migratetype);
>> + __free_one_page(page, pfn, zone, order, migratetype, hint);
>> spin_unlock(&zone->lock);
>> }
>>
>> @@ -1368,7 +1371,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>> migratetype = get_pfnblock_migratetype(page, pfn);
>> local_irq_save(flags);
>> __count_vm_events(PGFREE, 1 << order);
>> - free_one_page(page_zone(page), page, pfn, order, migratetype);
>> + free_one_page(page_zone(page), page, pfn, order, migratetype, true);
>> local_irq_restore(flags);
>> }
>>
>> @@ -2968,7 +2971,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
>> */
>> if (migratetype >= MIGRATE_PCPTYPES) {
>> if (unlikely(is_migrate_isolate(migratetype))) {
>> - free_one_page(zone, page, pfn, 0, migratetype);
>> + free_one_page(zone, page, pfn, 0, migratetype, true);
>> return;
>> }
>> migratetype = MIGRATE_MOVABLE;
>> diff --git a/mm/page_hinting.c b/mm/page_hinting.c
>> new file mode 100644
>> index 000000000000..7341c6462de2
>> --- /dev/null
>> +++ b/mm/page_hinting.c
>> @@ -0,0 +1,236 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Page hinting support to enable a VM to report the freed pages back
>> + * to the host.
>> + *
>> + * Copyright Red Hat, Inc. 2019
>> + *
>> + * Author(s): Nitesh Narayan Lal <nitesh@...hat.com>
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/page_hinting.h>
>> +#include <linux/kvm_host.h>
>> +
>> +/*
>> + * struct hinting_bitmap: holds the bitmap which tracks the freed PFNs
>> + * and the parameters required to recover the original PFN value from
>> + * a bit position.
>> + * @bitmap: Pointer to the bitmap of free PFN.
>> + * @base_pfn: Starting PFN value for the zone whose bitmap is stored.
>> + * @free_pages: Tracks the number of free pages of granularity
>> + * PAGE_HINTING_MIN_ORDER.
>> + * @nbits: Indicates the total size of the bitmap in bits allocated
>> + * at the time of initialization.
>> + */
>> +struct hinting_bitmap {
>> + unsigned long *bitmap;
>> + unsigned long base_pfn;
>> + atomic_t free_pages;
>> + unsigned long nbits;
>> +} bm_zone[MAX_NR_ZONES];
>> +
>> +static void init_hinting_wq(struct work_struct *work);
>> +extern int __isolate_free_page(struct page *page, unsigned int order);
>> +extern void __free_one_page(struct page *page, unsigned long pfn,
>> + struct zone *zone, unsigned int order,
>> + int migratetype, bool hint);
>> +const struct page_hinting_cb *hcb;
>> +struct work_struct hinting_work;
>> +
>> +static unsigned long find_bitmap_size(struct zone *zone)
>> +{
>> + unsigned long nbits = ALIGN(zone->spanned_pages,
>> + PAGE_HINTING_MIN_ORDER);
>> +
>> + nbits = nbits >> PAGE_HINTING_MIN_ORDER;
>> + return nbits;
> I think we can simplify this to
>
> return (zone->spanned_pages >> PAGE_HINTING_MIN_ORDER) + 1;
>
I will check this.
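While double-checking, I think there is also a subtle problem here:
ALIGN() expects a power-of-two alignment, so passing the raw order value
(PAGE_HINTING_MIN_ORDER) does not round the way I intended. Assuming
that reading is correct, an untested variant could be:

static unsigned long find_bitmap_size(struct zone *zone)
{
	/* one bit per chunk of (1 << PAGE_HINTING_MIN_ORDER) pages */
	return DIV_ROUND_UP(zone->spanned_pages,
			    1UL << PAGE_HINTING_MIN_ORDER);
}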
>> +}
>> +
>> +void page_hinting_enable(const struct page_hinting_cb *callback)
>> +{
>> + struct zone *zone;
>> + int idx = 0;
>> + unsigned long bitmap_size = 0;
> You should probably protect enabling via a mutex and return -EINVAL or
> similar if we already have a callback set (if we ever have different
> drivers). But this has very little priority :)
I will have to look into this.
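As a rough sketch of what I am thinking (untested, and it changes the
return type so callers can detect failure):

static DEFINE_MUTEX(hinting_mutex);

int page_hinting_enable(const struct page_hinting_cb *callback)
{
	int ret = -EBUSY;

	mutex_lock(&hinting_mutex);
	if (!hcb) {
		/* ... allocate the per-zone bitmaps as before ... */
		hcb = callback;
		ret = 0;
	}
	mutex_unlock(&hinting_mutex);
	return ret;
}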
>> +
>> + for_each_populated_zone(zone) {
>> + spin_lock(&zone->lock);
>> + bitmap_size = find_bitmap_size(zone);
>> + bm_zone[idx].bitmap = bitmap_zalloc(bitmap_size, GFP_KERNEL);
>> + if (!bm_zone[idx].bitmap)
>> + return;
>> + bm_zone[idx].nbits = bitmap_size;
>> + bm_zone[idx].base_pfn = zone->zone_start_pfn;
>> + spin_unlock(&zone->lock);
>> + idx++;
>> + }
>> + hcb = callback;
>> + INIT_WORK(&hinting_work, init_hinting_wq);
> There are also possible races when enabling, you will have to take care
> of at one point.
No page will be enqueued until hcb is set.
I can probably move it below INIT_WORK.
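i.e., something like:

	INIT_WORK(&hinting_work, init_hinting_wq);
	/* publish hcb only after the bitmaps and work are ready */
	smp_wmb();
	hcb = callback;

(whether an explicit barrier is actually needed here is something I
still need to think through).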
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_enable);
>> +
>> +void page_hinting_disable(void)
>> +{
>> + struct zone *zone;
>> + int idx = 0;
>> +
>> + cancel_work_sync(&hinting_work);
>> + hcb = NULL;
>> + for_each_populated_zone(zone) {
>> + spin_lock(&zone->lock);
>> + bitmap_free(bm_zone[idx].bitmap);
>> + bm_zone[idx].base_pfn = 0;
>> + bm_zone[idx].nbits = 0;
>> + atomic_set(&bm_zone[idx].free_pages, 0);
>> + spin_unlock(&zone->lock);
>> + idx++;
>> + }
>> +}
>> +EXPORT_SYMBOL_GPL(page_hinting_disable);
>> +
>> +static unsigned long pfn_to_bit(struct page *page, int zonenum)
>> +{
>> + unsigned long bitnr;
>> +
>> + bitnr = (page_to_pfn(page) - bm_zone[zonenum].base_pfn)
>> + >> PAGE_HINTING_MIN_ORDER;
>> + return bitnr;
>> +}
>> +
>> +static void release_buddy_pages(struct list_head *pages)
> maybe "release_isolated_pages", not sure.
>
>> +{
>> + int mt = 0, zonenum, order;
>> + struct page *page, *next;
>> + struct zone *zone;
>> + unsigned long bitnr;
>> +
>> + list_for_each_entry_safe(page, next, pages, lru) {
>> + zonenum = page_zonenum(page);
>> + zone = page_zone(page);
>> + bitnr = pfn_to_bit(page, zonenum);
>> + spin_lock(&zone->lock);
>> + list_del(&page->lru);
>> + order = page_private(page);
>> + set_page_private(page, 0);
>> + mt = get_pageblock_migratetype(page);
>> + __free_one_page(page, page_to_pfn(page), zone,
>> + order, mt, false);
>> + spin_unlock(&zone->lock);
>> + }
>> +}
>> +
>> +static void bm_set_pfn(struct page *page)
>> +{
>> + unsigned long bitnr = 0;
>> + int zonenum = page_zonenum(page);
>> + struct zone *zone = page_zone(page);
>> +
>> + lockdep_assert_held(&zone->lock);
>> + bitnr = pfn_to_bit(page, zonenum);
>> + if (bm_zone[zonenum].bitmap &&
>> + bitnr < bm_zone[zonenum].nbits &&
>> + !test_and_set_bit(bitnr, bm_zone[zonenum].bitmap))
>> + atomic_inc(&bm_zone[zonenum].free_pages);
>> +}
>> +
>> +static void scan_hinting_bitmap(int zonenum, int free_pages)
>> +{
>> + unsigned long set_bit, start = 0;
>> + struct page *page;
>> + struct zone *zone;
>> + int scanned_pages = 0, ret = 0, order, isolated_cnt = 0;
>> + LIST_HEAD(isolated_pages);
>> +
>> + ret = hcb->prepare();
>> + if (ret < 0)
>> + return;
>> + for (;;) {
>> + ret = 0;
>> + set_bit = find_next_bit(bm_zone[zonenum].bitmap,
>> + bm_zone[zonenum].nbits, start);
>> + if (set_bit >= bm_zone[zonenum].nbits)
>> + break;
>> + page = pfn_to_online_page((set_bit << PAGE_HINTING_MIN_ORDER) +
>> + bm_zone[zonenum].base_pfn);
>> + if (!page)
>> + continue;
> You are not clearing the bit / decrementing the counter.
>
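Good catch, I will fix this. Roughly (sketch only):

		if (!page) {
			clear_bit(set_bit, bm_zone[zonenum].bitmap);
			start = set_bit + 1;
			continue;
		}

(advancing start also avoids rescanning the same bit forever; how the
free_pages counter gets adjusted here depends on how the accounting
below is simplified).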
>> + zone = page_zone(page);
>> + spin_lock(&zone->lock);
>> +
>> + if (PageBuddy(page) && page_private(page) >=
>> + PAGE_HINTING_MIN_ORDER) {
>> + order = page_private(page);
>> + ret = __isolate_free_page(page, order);
>> + }
>> + clear_bit(set_bit, bm_zone[zonenum].bitmap);
>> + spin_unlock(&zone->lock);
>> + if (ret) {
>> + /*
>> + * restoring page order to use it while releasing
>> + * the pages back to the buddy.
>> + */
>> + set_page_private(page, order);
>> + list_add_tail(&page->lru, &isolated_pages);
>> + isolated_cnt++;
>> + if (isolated_cnt == hcb->max_pages) {
>> + hcb->hint_pages(&isolated_pages);
>> + release_buddy_pages(&isolated_pages);
>> + isolated_cnt = 0;
>> + }
>> + }
>> + start = set_bit + 1;
>> + scanned_pages++;
>> + }
>> + if (isolated_cnt) {
>> + hcb->hint_pages(&isolated_pages);
>> + release_buddy_pages(&isolated_pages);
>> + }
>> + hcb->cleanup();
>> + if (scanned_pages > free_pages)
>> + atomic_sub((scanned_pages - free_pages),
>> + &bm_zone[zonenum].free_pages);
> This looks overly complicated. Can't we somehow simply decrement when
> clearing a bit?
>
>> +}
>> +
>> +static bool check_hinting_threshold(void)
>> +{
>> + int zonenum = 0;
>> +
>> + for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> + if (atomic_read(&bm_zone[zonenum].free_pages) >=
>> + hcb->max_pages)
>> + return true;
>> + }
>> + return false;
>> +}
>> +
>> +static void init_hinting_wq(struct work_struct *work)
>> +{
>> + int zonenum = 0, free_pages = 0;
>> +
>> + for (; zonenum < MAX_NR_ZONES; zonenum++) {
>> + free_pages = atomic_read(&bm_zone[zonenum].free_pages);
>> + if (free_pages >= hcb->max_pages) {
>> + /* Find a better way to synchronize per zone
>> + * free_pages.
>> + */
>> + atomic_sub(free_pages,
>> + &bm_zone[zonenum].free_pages);
> I can't follow yet why we need that information.
We don't want to enqueue multiple jobs just because we are delaying the
decrementing of free_pages.
I agree, I am not convinced by the approach I have right now either.
I will try to come up with a better way.
> Wouldn't it be enough
> to just track the number of set bits in the bitmap and start hinting
> depending on that count? (there are false positives, but do we really care?)
In an attempt to minimize the false positives, I might have overly
complicated this part.
I will try to simplify this in the next posting.
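Roughly what I have in mind is pairing the counter directly with the
bitmap, e.g. (sketch):

static void bm_clear_pfn_bit(int zonenum, unsigned long bitnr)
{
	if (test_and_clear_bit(bitnr, bm_zone[zonenum].bitmap))
		atomic_dec(&bm_zone[zonenum].free_pages);
}

so that free_pages always matches the number of set bits, false
positives included.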
>
>> + scan_hinting_bitmap(zonenum, free_pages);
>> + }
>> + }
>> +}
>> +
>> +void page_hinting_enqueue(struct page *page, int order)
>> +{
>> + if (hcb && order >= PAGE_HINTING_MIN_ORDER)
>> + bm_set_pfn(page);
>> + else
>> + return;
>> +
>> + if (check_hinting_threshold()) {
>> + int cpu = smp_processor_id();
>> +
>> + queue_work_on(cpu, system_wq, &hinting_work);
>> + }
>> +}
>>
>
--
Regards
Nitesh