linux-kernel - Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm


Message-ID: <CABKxMyO21rF+f2vpS+t++DAFHiy_MeWDBjB-AvupysKnDHRJfA@mail.gmail.com>
Date:   Tue, 18 Oct 2022 17:27:42 +0800
From:   黄杰 <huangjie.albert@...edance.com>
To:     David Hildenbrand <david@...hat.com>
Cc:     Muchun Song <songmuchun@...edance.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm: hugetlb: support get/set_policy for hugetlb_vm_ops

On Mon, Oct 17, 2022 at 20:00, David Hildenbrand <david@...hat.com> wrote:
>
> On 17.10.22 13:46, 黄杰 wrote:
> > On Mon, Oct 17, 2022 at 19:33, David Hildenbrand <david@...hat.com> wrote:
> >>
> >> On 17.10.22 11:48, 黄杰 wrote:
> >>> On Mon, Oct 17, 2022 at 16:44, David Hildenbrand <david@...hat.com> wrote:
> >>>>
> >>>> On 12.10.22 10:15, Albert Huang wrote:
> >>>>> From: "huangjie.albert" <huangjie.albert@...edance.com>
> >>>>>
> >>>>> implement these two functions so that we can set the mempolicy to
> >>>>> the inode of the hugetlb file. This ensures that the mempolicy of
> >>>>> all processes sharing this huge page file is consistent.
> >>>>>
> >>>>> In some scenarios where huge pages are shared:
> >>>>> to limit a VM's memory usage to node0, I bind qemu's mempolicy to
> >>>>> node0. But if another process (such as virtiofsd) shares memory
> >>>>> with the VM and a page fault is triggered by virtiofsd, the
> >>>>> allocated memory may go to node1, depending on virtiofsd's own
> >>>>> policy.
> >>>>>
> >>>>
> >>>> Any VM that uses hugetlb should be preallocating memory. For example,
> >>>> this is the expected default under QEMU when using huge pages.
> >>>>
> >>>> Once preallocation does the right thing regarding NUMA policy, there is
> >>>> no need to worry about it in other sub-processes.
> >>>>
> >>>
> >>> Hi David,
> >>> thanks for the reminder.
> >>>
> >>> Yes, you are absolutely right: the pre-allocation mechanism does
> >>> solve this problem.
> >>> However, some scenarios do not suit pre-allocation, such as those
> >>> that are sensitive to virtual machine startup time or that require
> >>> high memory utilization. On-demand allocation may be better there,
> >>> so the key point is to find a way to support a shared policy.
> >>
> >> Using hugetlb -- with a fixed pool size -- without preallocation is like
> >> playing with fire. Hugetlb reservation makes one believe that on-demand
> >> allocation is going to work, but there are various scenarios where that
> >> can go seriously wrong, and you can run out of huge pages.
> >>
> >> If you're using hugetlb as memory backend for a VM without
> >> preallocation, you really have to be very careful. I can only advise
> >> against doing that.
> >>
> >>
> >> Also: why does another process read/write *first* to a guest physical
> >> memory location before the OS running inside the VM even initialized
> >> that memory? That sounds very wrong. What am I missing?
> >>
> >
> > For example: the virtio ring buffer.
> > For an avail descriptor, the guest kernel only gives an address to
> > the backend and does not actually access the memory.
>
> Okay, thanks. So we're essentially providing uninitialized memory to a
> device? Hm, that implies that the device might have access to memory
> that was previously used by someone else ... not sure how to feel about
> that, but maybe this is just the way of doing things.
>
> The "easy" user-space fix would be to simply mbind() in the same way
> in the other processes where we mmap(). Has that option been explored?

That would also work as a stopgap, but we would need to change every
process that shares memory with the VM, so the problem cannot be solved
once and for all.
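
For reference, the per-process approach suggested above might look
roughly like the following. This is only a minimal sketch, not code from
the patch: the helper names (nodemask_set_single, bind_mapping_to_node)
and the MASK_WORDS size are hypothetical, and it calls the mbind system
call directly through syscall() so that it does not depend on libnuma.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>    /* MPOL_BIND */

#define MASK_WORDS    16        /* hypothetical nodemask size, in longs */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

/* Clear the mask and set only the bit for `node`. Returns 0 or -EINVAL. */
int nodemask_set_single(unsigned long *mask, size_t words, unsigned int node)
{
    assert(mask != NULL);
    if (node >= words * BITS_PER_WORD)
        return -EINVAL;
    memset(mask, 0, words * sizeof(*mask));
    mask[node / BITS_PER_WORD] |= 1UL << (node % BITS_PER_WORD);
    return 0;
}

/*
 * Apply MPOL_BIND for a single node to an existing mapping
 * [addr, addr + len). Returns 0 on success or a negative errno value.
 */
int bind_mapping_to_node(void *addr, size_t len, unsigned int node)
{
    unsigned long mask[MASK_WORDS];
    int ret = nodemask_set_single(mask, MASK_WORDS, node);

    if (ret)
        return ret;
    /*
     * maxnode is passed as "bits + 1", a common idiom because the
     * kernel is understood to consider maxnode - 1 bits of the mask.
     */
    if (syscall(SYS_mbind, (unsigned long)addr, (unsigned long)len,
                (unsigned long)MPOL_BIND, mask,
                (unsigned long)(MASK_WORDS * BITS_PER_WORD + 1), 0U) != 0)
        return -errno;
    return 0;
}
```

Each process that maps the hugetlb file (qemu, virtiofsd, and so on)
would have to call something like bind_mapping_to_node(addr, len, 0)
right after its own mmap() and before first touching the memory, which
is why this approach requires changing every participant.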

>
> --
> Thanks,
>
> David / dhildenb
>