Message-ID: <23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com>
Date: Sun, 12 Jan 2025 20:43:04 +0530
From: Dev Jain <dev.jain@....com>
To: Nico Pache <npache@...hat.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Cc: ryan.roberts@....com, anshuman.khandual@....com, catalin.marinas@....com,
cl@...two.org, vbabka@...e.cz, mhocko@...e.com, apopple@...dia.com,
dave.hansen@...ux.intel.com, will@...nel.org, baohua@...nel.org,
jack@...e.cz, srivatsa@...il.mit.edu, haowenchao22@...il.com,
hughd@...gle.com, aneesh.kumar@...nel.org, yang@...amperecomputing.com,
peterx@...hat.com, ioworker0@...il.com, wangkefeng.wang@...wei.com,
ziy@...dia.com, jglisse@...gle.com, surenb@...gle.com,
vishal.moola@...il.com, zokeefe@...gle.com, zhengqi.arch@...edance.com,
jhubbard@...dia.com, 21cnbao@...il.com, willy@...radead.org,
kirill.shutemov@...ux.intel.com, david@...hat.com, aarcange@...hat.com,
raquini@...hat.com, sunnanyong@...wei.com, usamaarif642@...il.com,
audra@...hat.com, akpm@...ux-foundation.org
Subject: Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP
support
On 09/01/25 5:01 am, Nico Pache wrote:
> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> mTHP support we use this scan to instead record chunks of fully utilized
> sections of the PMD.
>
> create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> by default we will set this to order 3. The reasoning is that for 4K 512
> PMD size this results in a 64 bit bitmap which has some optimizations.
> For other arches like ARM64 64K, we can set a larger order if needed.
>
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of fully utilized regions. We can then determine
> what mTHP size fits best and in the following patch, we set this bitmap
> while scanning the PMD.
>
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
>
> Signed-off-by: Nico Pache <npache@...hat.com>
Here is my objective and subjective analysis.
--------------------- Mathematical Analysis ----------------------------
First off, in my series, I am missing one thing: when I fail to collapse
a range as a result of exhausting all orders, I should jump to the next
range starting at the alignment of the order at which I just failed
(i.e., the minimum allowable order). Currently I exit instead, which is
wasteful. This should be easy to extend.
Let's call Nico Pache's method NP, and Dev Jain's method DJ.
The only difference between NP and DJ is the remembrance of the state of
the PTEs (I have already reverted to using write lock for my v2, see
this reply:
https://lore.kernel.org/all/71a2f471-3082-4ca2-ac48-2f664977282f@arm.com/).
NP saves empty and filled PTEs in a bitmap, and then uses the optimized
(let us assume them to be constant-time operations, hopefully?) bitmap
APIs, like bitmap_shift_right() and bitmap_weight(). The latter is what
determines whether, for a particular order, the range has enough filled
PTEs to justify calling collapse_huge_page(). DJ does this naively with
a brute-force iteration. So the edge NP has over DJ is just before
calling collapse_huge_page(). After that call, everything remains the
same; assuming that both DJ and NP derive the same collapse ranges,
collapse_huge_page() succeeds in NP if and only if it succeeds in DJ.
NP simply knows more quickly when and when not to call collapse_huge_page().
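To make the comparison concrete, here is a minimal userspace sketch of the
two checks as I understand them (all names are mine, not from the patch; a
single u64 models the 64-bit bitmap of order-3 chunks for a 4K/512-PTE PMD):

```c
#include <assert.h>
#include <stdint.h>

/* 512 PTEs / 8 PTEs per order-3 chunk = 64 bits: one u64 models the bitmap. */
#define CHUNKS_PER_PMD 64

/* NP-style check: is the range of 'nbits' chunks starting at 'start' full
 * enough? A single popcount over the masked bitmap replaces a per-PTE walk. */
static int np_range_full_enough(uint64_t bitmap, unsigned start,
				unsigned nbits, unsigned min_set)
{
	uint64_t mask = (nbits >= 64) ? ~0ULL : ((1ULL << nbits) - 1) << start;

	return (unsigned)__builtin_popcountll(bitmap & mask) >= min_set;
}

/* DJ-style check: brute-force walk over the same chunks (a stand-in for
 * re-scanning the PTEs of the range each time). */
static int dj_range_full_enough(const uint8_t *chunk_full, unsigned start,
				unsigned nbits, unsigned min_set)
{
	unsigned i, set = 0;

	for (i = start; i < start + nbits; i++)
		set += chunk_full[i];
	return set >= min_set;
}
```

Both return the same verdict for the same state; the difference is only the
cost of reaching it, which is what the iteration counts below quantify.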
So the question is, how many iterations of PTE scans does NP save over
DJ? We prove a stronger result:
Let the PTE table consist of 2^x PTE entries, where x >= 2 belongs to
the set of natural numbers (x >= 2 because anon collapse is not
supported for x < 2). Let f(x) = the number of iterations performed by
DJ in the worst case. The worst case is: all orders are enabled, and
the distribution of the PTEs is adversarial.
Lemma: f(x) <= 2^x * (x-1).
Proof: We perform weak mathematical induction on x. Assume
zero-indexing, and assume the worst case that all orders are enabled.
Base case: Let x = 4. We have 16 entries. NP does 16 iterations. In the
worst case, this is what DJ may do: it will iterate over all 16 and not
collapse. Then it will iterate over PTE entries 0-7 and not collapse.
Then it will iterate over 0-3, and may or may not collapse. Here is the
worst case (when I write l-r below, I mean the range l-r, both inclusive):
0-15 fail -> 0-7 fail -> 0-3 fail/success -> 4-7 fail/success -> 8-15
fail -> 8-11 fail/success -> 12-15 fail/success
#iterations = 16+8+4+4+8+4+4 = 48 = 2^4 * (4-1).
Convince yourself that f(2) == 4 and f(3) <= 16.
Inductive hypothesis: Assume the lemma is true for some x >= 4.
We need to prove for x+1. Let X = 2^(x+1) - 1, and Y = 2^x - 1.
Let DJ start scanning from 0. If 0-X succeeds, we are done. So, assume
0-X fails. Now, DJ looks at 0-Y. Note that, for any position p s.t.
0 <= p <= X, if DJ starts scanning from p, there is no way the scan will
cross from one half into the other (i.e., from 0-Y into (Y+1)-X), since
the scan length from p is at most the highest power-of-two alignment of
p. Given this, we scan 0-Y completely, and then start from Y+1. Having
established the above argument, we can apply the inductive hypothesis to
0-Y and (Y+1)-X to derive that f(x+1) <= 2^(x+1) + 2f(x) <= 2^(x+1) +
2(2^x * (x-1)) = 2^(x+1) + 2^(x+1) * (x-1) = 2^(x+1) * x. Q.E.D.
(You can simulate the proof for x=9; what I mean to say is, we can
divide 0-511 into 0-255 and 256-511).
So, for our case, NP performs 512 iterations, and DJ performs in the
worst case, 512 * 8 = 4096 iterations. Hmm...
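The recurrence from the inductive step can also be checked numerically; a
tiny sketch of my own, just to sanity-check the closed form against the
bound:

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case DJ iteration count, computed from the recurrence
 * f(2) = 4, f(x+1) = 2^(x+1) + 2*f(x).
 * The lemma's closed form is f(x) = 2^x * (x-1). */
static uint64_t f_worst(unsigned x)
{
	uint64_t f = 4; /* base case f(2) */

	for (unsigned i = 2; i < x; i++)
		f = (1ULL << (i + 1)) + 2 * f;
	return f;
}
```

For x = 9 (a 512-entry PTE table) this yields 2^9 * 8 = 4096, matching the
figure above.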
----------------------- Subjective Analysis --------------------------
[1] The worst case is, well, the worst case; the practical case on arm64
machines is that only 2048K and 64K are enabled. So NP performs 512
iterations, and DJ performs 512 + 16 * (number of 64K chunks) = 512 +
512 = 1024 iterations. That is not much of a difference.
[2] Both implementations have the creep problem described here:
https://lore.kernel.org/all/20241216165105.56185-13-dev.jain@arm.com/
[3] The bitmaps are being created only for the pte_none case, whereas we
also have the shared and the swap cases. In fact, for the none case, if
we have PMD order enabled, we will almost surely collapse to PMD size,
given that the common case is khugepaged_max_ptes_none = 511: if we have
even one PTE filled, we will call collapse_huge_page(), and both DJ and
NP will perform 512 iterations. Therefore, the bitmaps also need to be
extended to the shared and the swap cases so as to get any potential
benefit from the idea in a practical scenario.
[4] NP does not handle scanning VMAs of size less than a PMD. Since NP
introduces a single entry point, khugepaged_collapse_single_pmd(), I am
not sure how difficult it will be to extend the implementation,
especially given that MTHP_BITMAP_SIZE is a compile-time constant. I
have extended this in my v2 and it works.
[5] In NP, for a bit to be set, the chunk needs to be completely
filled/shared/swapped out. This completely changes the meaning of the
sysfs parameters max_ptes_*. It also makes the behaviour very hard to
debug, since it may happen that distribution D1 has more PTEs filled but
fewer bits set in the bitmap than distribution D2. DJ also changes the
meaning of the parameters due to scaling errors, but those are only
off-by-one errors, therefore the behaviour is easier to predict.
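To make the off-by-one point concrete, here is a userspace sketch of the
kind of per-order scaling I have in mind (the names and the exact shift are
mine for illustration, not the exact kernel code):

```c
#include <assert.h>

#define PMD_ORDER 9 /* 4K base pages, 2M PMD: 512 PTEs */

/* Hypothetical per-order scaling of max_ptes_none: shift the PMD-level
 * threshold down to the target order. Integer truncation under-estimates
 * the exact ratio by less than one PTE, hence "off-by-one". */
static int scaled_max_ptes_none(int max_ptes_none, int order)
{
	return max_ptes_none >> (PMD_ORDER - order);
}
```

For max_ptes_none = 511 and a 64K (order-4, 16-PTE) range, the exact ratio
would be 511/32 = 15.97 but the shift gives 15; the error is bounded and
monotonic, unlike the chunk-granularity rounding in [5].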
[6] In NP, the sequence is: remember the state of the PTEs ->
alloc_charge_folio() -> read_lock(), unlock() -> mmap_write_lock() ->
anon_vma_lock_write() -> TLB flush for the PMD. There is a significant
time gap here, and the remembered PTE state may be vastly different from
what we have now. Obviously I cannot pinpoint an exact number as to how
bad this is for the accuracy of khugepaged. For DJ, since a particular
PTE may come into the scan range multiple times, DJ gives the range
another chance if the distribution changed recently.
[7] The last time I tried to save on #iterations of PTE entries, this
happened:
https://lore.kernel.org/all/ZugxqJ-CjEi5lEW_@casper.infradead.org/
Matthew Wilcox pointed out a potential regression in a patch which had
looked like an "obvious optimization" to me on paper; I tested, and it
turned out he was correct:
https://lore.kernel.org/all/8700274f-b521-444e-8d17-c06039a1376c@arm.com/
So we could argue over whether the bitmap memory initialization,
copying, weight checking, and recursion overhead are worth it.
This is the most I can come up with by analyzing from a third person
perspective :)