linux-kernel - Re: [RFC][Patch v11 1/2] mm: page

Message-ID: <CAKgT0Uf7xsdh9OgBq-kyTkyvh8Qo9kV4uiWTVP7NKqzO4X0wyg@mail.gmail.com>
Date:   Fri, 12 Jul 2019 09:22:37 -0700
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     Nitesh Narayan Lal <nitesh@...hat.com>
Cc:     kvm list <kvm@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-mm <linux-mm@...ck.org>,
        Paolo Bonzini <pbonzini@...hat.com>, lcapitulino@...hat.com,
        pagupta@...hat.com, wei.w.wang@...el.com,
        Yang Zhang <yang.zhang.wz@...il.com>,
        Rik van Riel <riel@...riel.com>,
        David Hildenbrand <david@...hat.com>,
        "Michael S. Tsirkin" <mst@...hat.com>, dodgen@...gle.com,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        dhildenb@...hat.com, Andrea Arcangeli <aarcange@...hat.com>,
        john.starks@...rosoft.com, Dave Hansen <dave.hansen@...el.com>,
        Michal Hocko <mhocko@...e.com>
Subject: Re: [RFC][Patch v11 1/2] mm: page_hinting: core infrastructure

On Thu, Jul 11, 2019 at 6:13 PM Nitesh Narayan Lal <nitesh@...hat.com> wrote:
>
>
> On 7/11/19 7:20 PM, Alexander Duyck wrote:
> > On Thu, Jul 11, 2019 at 10:58 AM Nitesh Narayan Lal <nitesh@...hat.com> wrote:
> >>
> >> On 7/10/19 5:56 PM, Alexander Duyck wrote:
> >>> On Wed, Jul 10, 2019 at 12:52 PM Nitesh Narayan Lal <nitesh@...hat.com> wrote:
> >>>> This patch introduces the core infrastructure for free page hinting in
> >>>> virtual environments. It enables the kernel to track the free pages which
> >>>> can be reported to its hypervisor so that the hypervisor could
> >>>> free and reuse that memory as per its requirement.
> >>>>
> >>>> While the pages are getting processed in the hypervisor (e.g.,
> >>>> via MADV_FREE), the guest must not use them, otherwise, data loss
> >>>> would be possible. To avoid such a situation, these pages are
> >>>> temporarily removed from the buddy. The amount of pages removed
> >>>> temporarily from the buddy is governed by the backend(virtio-balloon
> >>>> in our case).
> >>>>
> >>>> To efficiently identify free pages that can be hinted to the
> >>>> hypervisor, bitmaps in a coarse granularity are used. Only fairly big
> >>>> chunks are reported to the hypervisor - especially, to not break up THP
> >>>> in the hypervisor - "MAX_ORDER - 2" on x86, and to save space. The bits
> >>>> in the bitmap are an indication whether a page *might* be free, not a
> >>>> guarantee. A new hook after buddy merging sets the bits.
> >>>>
> >>>> Bitmaps are stored per zone, protected by the zone lock. A workqueue
> >>>> asynchronously processes the bitmaps, trying to isolate and report pages
> >>>> that are still free. The backend (virtio-balloon) is responsible for
> >>>> reporting these batched pages to the host synchronously. Once reporting/
> >>>> freeing is complete, isolated pages are returned back to the buddy.
> >>>>
> >>>> There are still various things to look into (e.g., memory hotplug, more
> >>>> efficient locking, possible races when disabling).
> >>>>
> >>>> Signed-off-by: Nitesh Narayan Lal <nitesh@...hat.com>
> > So just FYI, I thought I would try the patches. It looks like there
> > might be a bug somewhere that is causing it to free memory it
> > shouldn't be. After about 10 minutes my VM crashed with a system log
> > full of various NULL pointer dereferences.
>
> That's interesting, I have tried the patches with MADV_DONTNEED as well.
> I just retried it but didn't see any crash. May I know what kind of
> workload you are running?

I was running the page_fault1 test on a VM with 80G of memory.

> >  The only change I had made
> > is to use MADV_DONTNEED instead of MADV_FREE in QEMU since my headers
> > didn't have MADV_FREE on the host. It occurs to me one advantage of
> > MADV_DONTNEED over MADV_FREE is that you are more likely to catch
> > these sorts of errors since it zeros the pages instead of leaving them
> > intact.
> For development purpose maybe. For the final patch-set I think we
> discussed earlier why we should keep MADV_FREE.

I'm still not convinced MADV_FREE is a net win, at least for
performance. You are still paying the cost for the VMEXIT in order to
regain ownership of the page. In the case that you are under memory
pressure it is essentially equivalent to MADV_DONTNEED. Also it
doesn't really do much to help with the memory footprint of the VM
itself. With MADV_DONTNEED the pages are freed back and you have a
greater likelihood of reducing the overall memory footprint of the
entire system since you would be more likely to be assigned pages that
were recently used rather than having to access a cold page.
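
To illustrate the zeroing behavior that makes MADV_DONTNEED useful for
catching these bugs, here is a minimal userspace sketch. It assumes a
Linux host and a private anonymous mapping; the function name
dontneed_zeroes_page() is made up for this example and is not part of
the patch set or of QEMU.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty one anonymous page, drop it with
 * MADV_DONTNEED, then read it back.
 * Returns 1 if the page reads back as all zeros, 0 if any old data
 * survived, and -1 if mmap() or madvise() failed.
 */
int dontneed_zeroes_page(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;

    unsigned char *buf = mmap(NULL, (size_t)page_size,
                              PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* Fill the page with a recognizable pattern. */
    memset(buf, 0xAB, (size_t)page_size);

    /* Tell the kernel the contents are no longer needed. */
    if (madvise(buf, (size_t)page_size, MADV_DONTNEED) != 0) {
        munmap(buf, (size_t)page_size);
        return -1;
    }

    /* The next access faults in a fresh zero-filled page. */
    int zeroed = 1;
    for (long i = 0; i < page_size; i++) {
        if (buf[i] != 0) {
            zeroed = 0;
            break;
        }
    }

    munmap(buf, (size_t)page_size);
    return zeroed;
}
```

Calling dontneed_zeroes_page() from a small main() should return 1 on
Linux. With MADV_FREE the old contents may still be readable until the
kernel actually reclaims the page, which is why a stale use in the
guest is less likely to show up as an immediate crash.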

<snip>

> >>>> +void page_hinting_enqueue(struct page *page, int order)
> >>>> +{
> >>>> +       int zone_idx;
> >>>> +
> >>>> +       if (!page_hitning_conf || order < PAGE_HINTING_MIN_ORDER)
> >>>> +               return;
> >>> I would think it is going to be expensive to be jumping into this
> >>> function for every freed page. You should probably have an inline
> >>> taking care of the order check before you even get here since it would
> >>> be faster that way.
> >> I see, I can take a look. Thanks.
> >>>> +
> >>>> +       bm_set_pfn(page);
> >>>> +       if (atomic_read(&page_hinting_active))
> >>>> +               return;
> >>> So I would think this piece is racy. Specifically if you set a PFN
> >>> that is somewhere below the PFN you are currently processing in your
> >>> scan it is going to remain unset until you have another page freed
> >>> after the scan is completed. I would worry you can end up with a batch
> >>> free of memory resulting in a group of pages sitting at the start of
> >>> your bitmap unhinted.
> >> True, but that will be hinted the next time the threshold is met.
> > Yes, but that assumes that there is another free immediately coming.
> > It is possible that you have a big application run and then
> > immediately shut down and have it free all its memory at once. Worst
> > case scenario would be that it starts by freeing from the end and
> > works toward the start. With that you could theoretically end up with
> > a significant chunk of memory waiting some time for another big free
> > to come along.
>
> Any suggestion on some benchmark/test application which I could run to
> see this kind of behavior?

Like I mentioned before, try doing a VM with a bigger memory
footprint. You could probably just do a stack of VMs like what we were
doing with the memhog test. Basically the longer it takes to process
all the pages the greater the likelihood that there are still pages
left when they are freed.
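
The concern about pages being left unhinted can be modeled with a
small single-threaded sketch. This is not the patch's code: the struct,
the function names, and the 16-bit bitmap are all assumptions made for
illustration, and the threshold and workqueue are left out. It only
models the two properties under discussion: enqueue sets a bit and
returns early while a scan is active, and the scan walks the bitmap
once from the start.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBITS 16 /* hypothetical bitmap size for the model */

struct hint_state {
    bool bitmap[NBITS]; /* bit set: block "might be free" */
    bool scan_active;   /* stands in for page_hinting_active */
    size_t cursor;      /* next bit the scan will look at */
    int scans_started;
};

/* Model of the enqueue path: set the bit, start a scan only if idle. */
void enqueue(struct hint_state *s, size_t idx)
{
    s->bitmap[idx] = true;
    if (s->scan_active)
        return; /* a scan is running, so no new scan is kicked off */
    s->scan_active = true;
    s->cursor = 0;
    s->scans_started++;
}

/* Process one bit. Returns true while the scan is still running. */
bool scan_step(struct hint_state *s)
{
    if (!s->scan_active)
        return false;
    s->bitmap[s->cursor++] = false; /* "hint" this block */
    if (s->cursor == NBITS)
        s->scan_active = false;
    return s->scan_active;
}

size_t count_set(const struct hint_state *s)
{
    size_t n = 0;
    for (size_t i = 0; i < NBITS; i++)
        n += s->bitmap[i];
    return n;
}

/*
 * Worst case from the discussion: frees arrive from the end of the
 * bitmap toward the start while the scan walks from the start toward
 * the end. Returns the number of bits still set once the scan ends.
 */
size_t leftover_after_reverse_free(struct hint_state *s)
{
    *s = (struct hint_state){0};
    enqueue(s, NBITS - 1); /* first free starts the scan */
    for (size_t i = NBITS - 1; i-- > 0; ) {
        scan_step(s);  /* scan advances by one bit */
        enqueue(s, i); /* next free lands one bit lower */
    }
    while (scan_step(s))
        ; /* let the scan run to completion */
    return count_set(s);
}
```

In this model the lower half of the bitmap (8 of 16 bits) is still set
when the single scan finishes, and nothing clears those bits until
another enqueue() happens to start a second scan.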