Message-ID: <CA+CK2bBt0Gujv9BdhghVkbFRirAxCYXbpH-nquccPsKGnGwOBQ@mail.gmail.com>
Date: Thu, 9 Feb 2023 13:15:56 -0500
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: Chih-En Lin <shiyn.lin@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Qi Zheng <zhengqi.arch@...edance.com>,
David Hildenbrand <david@...hat.com>,
"Matthew Wilcox (Oracle)" <willy@...radead.org>,
Christophe Leroy <christophe.leroy@...roup.eu>,
John Hubbard <jhubbard@...dia.com>,
Nadav Amit <namit@...are.com>, Barry Song <baohua@...nel.org>,
Steven Rostedt <rostedt@...dmis.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Yang Shi <shy828301@...il.com>, Peter Xu <peterx@...hat.com>,
Vlastimil Babka <vbabka@...e.cz>,
"Zach O'Keefe" <zokeefe@...gle.com>,
Yun Zhou <yun.zhou@...driver.com>,
Hugh Dickins <hughd@...gle.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Yu Zhao <yuzhao@...gle.com>, Juergen Gross <jgross@...e.com>,
Tong Tiangen <tongtiangen@...wei.com>,
Liu Shixin <liushixin2@...wei.com>,
Anshuman Khandual <anshuman.khandual@....com>,
Li kunyu <kunyu@...china.com>,
Minchan Kim <minchan@...nel.org>,
Miaohe Lin <linmiaohe@...wei.com>,
Gautam Menghani <gautammenghani201@...il.com>,
Catalin Marinas <catalin.marinas@....com>,
Mark Brown <broonie@...nel.org>, Will Deacon <will@...nel.org>,
Vincenzo Frascino <Vincenzo.Frascino@....com>,
Thomas Gleixner <tglx@...utronix.de>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
Andy Lutomirski <luto@...nel.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
Fenghua Yu <fenghua.yu@...el.com>,
Andrei Vagin <avagin@...il.com>,
Barret Rhoden <brho@...gle.com>,
Michal Hocko <mhocko@...e.com>,
"Jason A. Donenfeld" <Jason@...c4.com>,
Alexey Gladkov <legion@...nel.org>,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, linux-trace-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org,
Dinglan Peng <peng301@...due.edu>,
Pedro Fonseca <pfonseca@...due.edu>,
Jim Huang <jserv@...s.ncku.edu.tw>,
Huichun Feng <foxhoundsk.tw@...il.com>
Subject: Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
On Mon, Feb 6, 2023 at 10:52 PM Chih-En Lin <shiyn.lin@...il.com> wrote:
>
> v3 -> v4
> - Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g.,
> s390 and powerpc32, don't support the PMD entry and PTE table
> operations.
> - Fix unmatch type of break_cow_pte_range() in
> migrate_vma_collect_pmd().
> - Don’t break COW PTE in folio_referenced_one().
> - Fix the wrong VMA range checking in break_cow_pte_range().
> - Only break COW when we modify the soft-dirty bit in
> clear_refs_pte_range().
> - Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
> - Change the tlb flush from flush_tlb_mm_range() (x86 specific) to
> tlb_flush_pmd_range().
> - Handle VM_DONTCOPY with COW PTE fork.
> - Fix the wrong address and invalid vma in recover_pte_range().
> - Fix the infinite page fault loop in GUP routine.
> In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
> handler, we return -EMLINK to let the GUP handles the page fault
> (call faultin_page() in __get_user_pages()).
> - return not_found(pvmw) if the break COW PTE failed in
> page_vma_mapped_walk().
> - Since COW PTE has the same result as the normal COW selftest, it
> probably passed the COW selftest.
>
> # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
> not ok 33 No leak from parent into child
> # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
> not ok 44 No leak from parent into child
> # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
> not ok 55 No leak from child into parent
> # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
> not ok 66 No leak from child into parent
>
> Bail out! 4 out of 147 tests failed
> # Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0
> See the more information about anon cow hugetlb tests:
> https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/
>
>
> v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/
>
> RFC v2 -> v3
> - Change the sysctl with PID to prctl(PR_SET_COW_PTE).
> - Account all the COW PTE mapped pages in fork() instead of defer it to
> page fault (break COW PTE).
> - If there is an unshareable mapped page (maybe pinned or private
> device), recover all the entries that are already handled by COW PTE
> fork, then copy to the new one.
> - Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP,
> follow_pfn_pte().
> - Remove the PTE ownership since we don't need it.
> - Use pte lock to protect the break COW PTE and free COW-ed PTE.
> - Do TLB flushing in break COW PTE handler.
> - Handle THP, KSM, madvise, mprotect, uffd and migrate device.
> - Handle the replacement page of uprobe.
> - Handle the clear_refs_write() of fs/proc.
> - All of the benchmarks dropped since the accounting and pte lock.
> The benchmarks of v3 is worse than RFC v2, most of the cases are
> similar to the normal fork, but there still have an use case
> (TriforceAFL) is better than the normal fork version.
>
> RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/
>
> RFC v1 -> RFC v2
> - Change the clone flag method to sysctl with PID.
> - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
> MMF_COW_PTE_READY, for the sysctl.
> - Change the owner pointer to use the folio padding.
> - Handle all the VMAs that cover the PTE table when doing the break COW PTE.
> - Remove the self-defined refcount to use the _refcount for the page
> table page.
> - Add the exclusive flag to let the page table only own by one task in
> some situations.
> - Invalidate address range MMU notifier and start the write_seqcount
> when doing the break COW PTE.
> - Handle the swap cache and swapoff.
>
> RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/
>
> ---
>
> Currently, copy-on-write is only used for the mapped memory; the child
> process still needs to copy the entire page table from the parent
> process during forking. The parent process might take a lot of time and
> memory to copy the page table when the parent has a big page table
> allocated. For example, the memory usage of a process after forking with
> 1 GB mapped memory is as follows:
For some reason, I was not able to reproduce performance improvements
with a simple fork() performance measurement program. The results that
I saw are the following:
Base:
Fork latency per gigabyte: 0.004416 seconds
Fork latency per gigabyte: 0.004382 seconds
Fork latency per gigabyte: 0.004442 seconds
COW kernel:
Fork latency per gigabyte: 0.004524 seconds
Fork latency per gigabyte: 0.004764 seconds
Fork latency per gigabyte: 0.004547 seconds
AMD EPYC 7B12 64-Core Processor
Base:
Fork latency per gigabyte: 0.003923 seconds
Fork latency per gigabyte: 0.003909 seconds
Fork latency per gigabyte: 0.003955 seconds
COW kernel:
Fork latency per gigabyte: 0.004221 seconds
Fork latency per gigabyte: 0.003882 seconds
Fork latency per gigabyte: 0.003854 seconds
Given that the page table is not copied for the child, I was expecting
fork performance to be better with the COW kernel, and also not to
depend on the size of the parent.
Test program:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>

#define USEC	1000000
#define GIG	(1ul << 30)
#define NGIG	32
#define SIZE	(NGIG * GIG)
#define NPROC	16

int main(void)
{
	int page_size = getpagesize();
	struct timeval start, end;
	unsigned long i;
	long duration;
	char *p;

	p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	/* Use base pages only, so the PTE tables are fully populated */
	if (madvise(p, SIZE, MADV_NOHUGEPAGE))
		perror("madvise");

	/* Touch every page */
	for (i = 0; i < SIZE; i += page_size)
		p[i] = 0;

	gettimeofday(&start, NULL);
	for (i = 0; i < NPROC; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			exit(1);
		}
		if (pid == 0) {
			/* Child: keep the mm alive while the parent forks */
			sleep(30);
			exit(0);
		}
	}
	gettimeofday(&end, NULL);

	/* Normalize per proc and per gig */
	duration = ((end.tv_sec - start.tv_sec) * USEC
		    + (end.tv_usec - start.tv_usec)) / NPROC / NGIG;
	printf("Fork latency per gigabyte: %ld.%06ld seconds\n",
	       duration / USEC, duration % USEC);
	return 0;
}