Message-Id: <20230403052233.1880567-1-ankur.a.arora@oracle.com>
Date: Sun, 2 Apr 2023 22:22:24 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc: torvalds@...ux-foundation.org, akpm@...ux-foundation.org,
luto@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, mingo@...hat.com, juri.lelli@...hat.com,
willy@...radead.org, mgorman@...e.de, peterz@...radead.org,
rostedt@...dmis.org, tglx@...utronix.de,
vincent.guittot@...aro.org, jon.grimm@....com, bharata@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
ankur.a.arora@...cle.com
Subject: [PATCH 0/9] x86/clear_huge_page: multi-page clearing

This series introduces multi-page clearing for hugepages.

This is a follow-up to some of the ideas discussed at:
  https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/

On x86, page clearing is typically done via string instructions. These,
unlike a MOV loop, allow us to explicitly advertise the region size to
the processor, which could serve as a hint to current (and/or
future) uarchs to elide cacheline allocation.

Among current-generation processors, Milan (and presumably other Zen
variants) uses the hint to elide cacheline allocation (for
region-size > LLC-size).

An additional reason for doing this is that string instructions are
typically microcoded, and clearing in bigger chunks than the current
page-at-a-time logic amortizes some of that cost.

All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.
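
To make the difference between the two clearing patterns concrete, here
is a minimal user-space sketch. It is not code from this series: the
names clear_by_page(), clear_by_chunk(), SUBPAGE_SIZE and HUGE_SIZE are
made up for illustration, and memset() merely stands in for the string
instruction. The point is only that the chunked variant hands each
clearing call a much larger length, which is the region size the
processor gets to see.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SUBPAGE_SIZE 4096UL        /* hypothetical: 4K base page */
#define HUGE_SIZE    (2UL << 20)   /* hypothetical: 2MB huge page */

/* Page-at-a-time: every clearing call only ever sees one base page. */
static size_t clear_by_page(unsigned char *base, size_t len)
{
	size_t calls = 0;

	for (size_t off = 0; off < len; off += SUBPAGE_SIZE, calls++)
		memset(base + off, 0, SUBPAGE_SIZE);
	return calls;
}

/*
 * Chunked: every clearing call sees up to 'chunk' contiguous bytes.
 * memset() is a stand-in for the string instruction here.
 */
static size_t clear_by_chunk(unsigned char *base, size_t len, size_t chunk)
{
	size_t calls = 0;

	for (size_t off = 0; off < len; off += chunk, calls++) {
		size_t n = len - off < chunk ? len - off : chunk;

		memset(base + off, 0, n);
	}
	return calls;
}
```

For a 2MB region the first variant issues 512 clearing calls of 4K
each, while the second, with chunk == HUGE_SIZE, issues a single call
covering the whole region.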
There are, however, some problems:

1. Extended zeroing periods mean increased latency due to the now
   missing preemption points.

   That's handled in patches 7, 8, 9:
     "sched: define TIF_ALLOW_RESCHED"
     "irqentry: define irqentry_exit_allow_resched()"
     "x86/clear_huge_page: make clear_contig_region() preemptible"
   by the context marking itself reschedulable, and rescheduling in
   irqexit context if needed (for PREEMPTION_NONE/_VOLUNTARY).

2. The current page-at-a-time clearing logic does left-right narrowing
   towards the faulting page, which maintains cache locality for
   workloads that have a sequential access pattern. Clearing in large
   chunks loses that.

   Some (but not all) of that could be ameliorated by something like
   this patch:
   https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/

   But before doing that, I'd like some comments on whether it is
   worth doing for this specific use case.

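
As a rough illustration of the left-right narrowing mentioned in
problem 2, the sketch below computes an order in which subpages could
be cleared so that the faulting subpage comes last and is therefore the
most recently touched. This is a simplified model, not the in-tree
process_huge_page() logic (whose exact order differs in detail), and
narrowing_order() is a hypothetical name used only here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill order[] with the subpage indices 0..nr-1 so that the clearing
 * converges on 'target' from both ends and 'target' is cleared last.
 * Requires nr >= 1 and target < nr. Simplified model, see lead-in.
 */
static void narrowing_order(size_t nr, size_t target, size_t *order)
{
	size_t left = 0, right = nr - 1, n = 0;

	while (left < right) {
		/* Clear whichever end is currently farther from the target. */
		if (target - left >= right - target)
			order[n++] = left++;
		else
			order[n++] = right--;
	}
	order[n] = left;	/* left == right == target at this point */
}
```

With nr = 8 and target = 2 this yields 7, 6, 5, 0, 4, 1, 3, 2. A single
large chunked clear has no such ordering, which is the locality that is
lost.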
Rest of the series:

Patches 1, 2, 3:
  "huge_pages: get rid of process_huge_page()"
  "huge_page: get rid of {clear,copy}_subpage()"
  "huge_page: allow arch override for clear/copy_huge_page()"
are mechanical and they simplify some of the current clear_huge_page()
logic.

Patches 4, 5:
  "x86/clear_page: parameterize clear_page*() to specify length"
  "x86/clear_pages: add clear_pages()"
add clear_pages() and helpers.

Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
chunked x86 clear_huge_page() implementation.

Performance
==
Demand fault performance gets a decent boost:

  *Icelakex*    mm/clear_huge_page   x86/clear_huge_page   change
                      (GB/s)               (GB/s)

  pg-sz=2MB            8.76                11.82           +34.93%
  pg-sz=1GB            8.99                12.18           +35.48%

  *Milan*       mm/clear_huge_page   x86/clear_huge_page   change
                      (GB/s)               (GB/s)

  pg-sz=2MB           12.24                17.54           +43.30%
  pg-sz=1GB           17.98                37.24          +107.11%

vm-scalability/case-anon-w-seq-hugetlb gains in stime, but performs
worse when user space subsequently touches those pages:

  *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
  (mem=4GB/task, tasks=128)

  stime                       293.02 +-  .49%      239.39 +-  .83%      -18.30%
  utime                       440.11 +-  .28%      508.74 +-  .60%      +15.59%
  wall-clock                    5.96 +-  .33%        6.27 +- 2.23%      + 5.20%

  *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
  (mem=1GB/task, tasks=512)

  stime                       490.95 +- 3.55%      466.90 +- 4.79%      - 4.89%
  utime                       276.43 +- 2.85%      311.97 +- 5.15%      +12.85%
  wall-clock                    3.74 +- 6.41%        3.58 +- 7.82%      - 4.27%

Also at:
github.com/terminus/linux clear-pages.v1
Comments appreciated!

Ankur Arora (9):
  huge_pages: get rid of process_huge_page()
  huge_page: get rid of {clear,copy}_subpage()
  huge_page: allow arch override for clear/copy_huge_page()
  x86/clear_page: parameterize clear_page*() to specify length
  x86/clear_pages: add clear_pages()
  mm/clear_huge_page: use multi-page clearing
  sched: define TIF_ALLOW_RESCHED
  irqentry: define irqentry_exit_allow_resched()
  x86/clear_huge_page: make clear_contig_region() preemptible

 arch/x86/include/asm/page.h        |   6 +
 arch/x86/include/asm/page_32.h     |   6 +
 arch/x86/include/asm/page_64.h     |  25 +++--
 arch/x86/include/asm/thread_info.h |   2 +
 arch/x86/lib/clear_page_64.S       |  45 ++++++--
 arch/x86/mm/hugetlbpage.c          |  59 ++++++++++
 include/linux/sched.h              |  29 +++++
 kernel/entry/common.c              |   8 ++
 kernel/sched/core.c                |  36 +++---
 mm/memory.c                        | 174 +++++++++++++++--------------
 10 files changed, 270 insertions(+), 120 deletions(-)

--
2.31.1