[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <diqz8qqa9t84.fsf@ackerleytng-ctop.c.googlers.com>
Date: Thu, 13 Feb 2025 07:52:43 +0000
From: Ackerley Tng <ackerleytng@...gle.com>
To: Peter Xu <peterx@...hat.com>
Cc: tabba@...gle.com, quic_eberman@...cinc.com, roypat@...zon.co.uk,
jgg@...dia.com, david@...hat.com, rientjes@...gle.com, fvdl@...gle.com,
jthoughton@...gle.com, seanjc@...gle.com, pbonzini@...hat.com,
zhiquan1.li@...el.com, fan.du@...el.com, jun.miao@...el.com,
isaku.yamahata@...el.com, muchun.song@...ux.dev, mike.kravetz@...cle.com,
erdemaktas@...gle.com, vannapurve@...gle.com, qperret@...gle.com,
jhubbard@...dia.com, willy@...radead.org, shuah@...nel.org,
brauner@...nel.org, bfoster@...hat.com, kent.overstreet@...ux.dev,
pvorel@...e.cz, rppt@...nel.org, richard.weiyang@...il.com,
anup@...infault.org, haibo1.xu@...el.com, ajones@...tanamicro.com,
vkuznets@...hat.com, maciej.wieczor-retman@...el.com, pgonda@...gle.com,
oliver.upton@...ux.dev, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
kvm@...r.kernel.org, linux-kselftest@...r.kernel.org, linux-fsdevel@...ck.org
Subject: Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and
truncate from hugetlb
Peter Xu <peterx@...hat.com> writes:
> On Tue, Sep 10, 2024 at 11:43:46PM +0000, Ackerley Tng wrote:
>> +static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
>> + struct hugepage_subpool *spool)
>> +{
>> + bool memcg_charge_was_prepared;
>> + struct mem_cgroup *memcg;
>> + struct mempolicy *mpol;
>> + nodemask_t *nodemask;
>> + struct folio *folio;
>> + gfp_t gfp_mask;
>> + int ret;
>> + int nid;
>> +
>> + gfp_mask = htlb_alloc_mask(h);
>> +
>> + memcg = get_mem_cgroup_from_current();
>> + ret = mem_cgroup_hugetlb_try_charge(memcg,
>> + gfp_mask | __GFP_RETRY_MAYFAIL,
>> + pages_per_huge_page(h));
>> + if (ret == -ENOMEM)
>> + goto err;
>> +
>> + memcg_charge_was_prepared = ret != -EOPNOTSUPP;
>> +
>> + /* Pages are only to be taken from guest_memfd subpool and nowhere else. */
>> + if (hugepage_subpool_get_pages(spool, 1))
>> + goto err_cancel_charge;
>> +
>> + nid = kvm_gmem_get_mpol_node_nodemask(htlb_alloc_mask(h), &mpol,
>> + &nodemask);
>> + /*
>> + * charge_cgroup_reservation is false because we didn't make any cgroup
>> + * reservations when creating the guest_memfd subpool.
>
> Hmm.. isn't this the exact reason to set charge_cgroup_reservation==true
> instead?
>
> IIUC gmem hugetlb pages should participate in the hugetlb cgroup resv
> charge as well. It is already involved in the rest cgroup charges, and I
> wonder whether it's intended that the patch treated the resv accounting
> specially.
>
> Thanks,
>
Thank you for your careful reviews!
I misunderstood charging a cgroup for hugetlb reservations when I was
working on this patch.
Before this, I thought hugetlb_cgroup_charge_cgroup_rsvd() was only for
resv_map reservations, so I set charge_cgroup_reservation to false since
guest_memfd didn't use resv_map, but I understand better now. Please
help me check my understanding:
+ All reservations are made at the hstate
+ In addition, every reservation is associated with a subpool (through
spool->rsv_hpages) or recorded in a resv_map
+ Reservations are either in a subpool or in a resv_map but not both
+ hugetlb_cgroup_charge_cgroup_rsvd() is for any reservation
Regarding the time that a cgroup is charged for reservations:
+ If a reservation is made during subpool creation, the cgroup is not
charged during the reservation by the subpool, probably by design
since the process doing the mount may not be the process using the
pages
+ Charging a cgroup for the reservation happens in
hugetlb_reserve_pages(), which is called at mmap() time.
For guest_memfd, I see two options:
Option 1: Charge cgroup for reservations at fault time
Pros:
+ Similar in behavior to a fd on a hugetlbfs mount, where the cgroup of
the process calling fallocate() is charged for the reservation.
+ Symmetric approach, since uncharging happens when the hugetlb folio is
freed.
Cons:
+ Room for allocation failure after guest_memfd creation. Even though
this guest_memfd had been created with a subpool and pages have been
reserved, there is a chance of hitting the cgroup's hugetlb
reservation cap and failing to allocate a page.
Option 2 (preferred): Charge cgroup for reservations at guest_memfd
creation time
Pros:
+ Once guest_memfd file is created, a page is guaranteed at fault time.
+ Simplifies/doesn't carry over the complexities of the hugetlb(fs)
reservation system
Cons:
+ The cgroup being charged is the cgroup of the process creating
guest_memfd, which might be an issue if users expect the process
faulting the page to be charged.
Implementation:
+ At guest_memfd creation time, when creating the subpool, charge the
cgroups for everything:
+ for hugetlb usage
+ hugetlb reservation usage and
+ hugetlb usage by page count (as in mem_cgroup_charge_hugetlb(),
which is new since [1])
+ Refactoring in [1] would be focused on just dequeueing a folio or
failing which, allocating a surplus folio.
+ After allocation, don't set cgroup on the folio so that the freeing
process doesn't uncharge anything
+ Uncharge when the file is closed
Please let me know if anyone has any thoughts/suggestions!
>> + *
>> + * use_hstate_resv is true because we reserved from global hstate when
>> + * creating the guest_memfd subpool.
>> + */
>> + folio = hugetlb_alloc_folio(h, mpol, nid, nodemask, false, true);
>> + mpol_cond_put(mpol);
>> +
>> + if (!folio)
>> + goto err_put_pages;
>> +
>> + hugetlb_set_folio_subpool(folio, spool);
>> +
>> + if (memcg_charge_was_prepared)
>> + mem_cgroup_commit_charge(folio, memcg);
>> +
>> +out:
>> + mem_cgroup_put(memcg);
>> +
>> + return folio;
>> +
>> +err_put_pages:
>> + hugepage_subpool_put_pages(spool, 1);
>> +
>> +err_cancel_charge:
>> + if (memcg_charge_was_prepared)
>> + mem_cgroup_cancel_charge(memcg, pages_per_huge_page(h));
>> +
>> +err:
>> + folio = ERR_PTR(-ENOMEM);
>> + goto out;
>> +}
[1] https://lore.kernel.org/all/7348091f4c539ed207d9bb0f3744d0f0efb7f2b3.1726009989.git.ackerleytng@google.com/
Powered by blists - more mailing lists