Message-Id: <20220106004656.126790-9-daniel.m.jordan@oracle.com>
Date: Wed, 5 Jan 2022 19:46:48 -0500
From: Daniel Jordan <daniel.m.jordan@...cle.com>
To: Alexander Duyck <alexanderduyck@...com>,
Alex Williamson <alex.williamson@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Ben Segall <bsegall@...gle.com>,
Cornelia Huck <cohuck@...hat.com>,
Dan Williams <dan.j.williams@...el.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Herbert Xu <herbert@...dor.apana.org.au>,
Ingo Molnar <mingo@...hat.com>,
Jason Gunthorpe <jgg@...dia.com>,
Johannes Weiner <hannes@...xchg.org>,
Josh Triplett <josh@...htriplett.org>,
Michal Hocko <mhocko@...e.com>, Nico Pache <npache@...hat.com>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Peter Zijlstra <peterz@...radead.org>,
Steffen Klassert <steffen.klassert@...unet.com>,
Steve Sistare <steven.sistare@...cle.com>,
Tejun Heo <tj@...nel.org>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Vincent Guittot <vincent.guittot@...aro.org>
Cc: linux-mm@...ck.org, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-crypto@...r.kernel.org,
Daniel Jordan <daniel.m.jordan@...cle.com>
Subject: [RFC 08/16] vfio/type1: Cache locked_vm to ease mmap_lock contention
padata threads hold mmap_lock as reader for the majority of their
runtime in order to call pin_user_pages_remote(), but they also
periodically take mmap_lock as writer for short periods to adjust
mm->locked_vm, hurting parallelism.
Alleviate the write-side contention with a per-thread cache of locked_vm
which allows taking mmap_lock as writer far less frequently.
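Failure to refill the cache due to insufficient locked_vm will not cause
the entire pinning operation to error out. Instead, the thread falls back
to accounting locked_vm as it pins, still subject to RLIMIT_MEMLOCK. This
avoids spurious failures in case some pinned pages aren't accounted to
locked_vm.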
The cache size is bounded to limit the damage in the unlikely event that
a concurrent locked_vm accounting operation in the same address space
fails needlessly because the cache has taken more locked_vm than it
needs.
Performance Testing
===================
The tests measure the time from qemu invocation to roughly the end of
qemu guest initialization, and cover all combinations of these
parameters:
- guest memory type (hugetlb, THP)
- guest memory size in G (16, 128, and 360 on Intel or 980 on AMD)
- number of qemu prealloc threads (0, 16)
The goal is to find reasonable values for
- number of padata threads (0, 8, 16, 24, 32)
- locked_vm cache size in pages (0, 32768, 65536, 131072)
The best compromises seem to be 16 threads and 65536 pages. They balance
performance on the one hand against threading efficiency and needless
locked_vm accounting failures, respectively, on the other.
Hardware info:
- Intel Xeon Platinum 8167M (Skylake)
    2 nodes * 26 cores * 2 threads = 104 CPUs
    2.00GHz, performance scaling governor, turbo enabled
    384G/node = 768G memory
- AMD EPYC 7J13 (Milan)
    2 nodes * 64 cores * 2 threads = 256 CPUs
    2.50GHz, performance scaling governor, turbo enabled
    ~1T/node = ~2T memory
The base kernel is 5.14. I had to downgrade from 5.15 because of an
Intel IOMMU bug that has since been fixed. The qemu version is 6.2.0-rc4.
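Key:
qthr: number of qemu prealloc threads
mem: guest memory size
pin: wall time of the largest VFIO page pin (qemu does several)
qemu init: wall time from qemu invocation to roughly the end of qemu init
thr: number of padata threads
lockedvm cache: size of the per-thread locked_vm cache, in pages
speedup: base kernel time divided by test kernel time
std: standard deviation of the preceding time, in seconds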
Summary Data
============
All tests in the summary section use 16 padata threads and 65536 pages
of locked_vm cache.
With these settings, there's some contention on the pmd lock. When I
increased the padata min chunk size from 128M to 1G to align threads on
PMD page table boundaries, the contention dropped significantly but the
times got worse (don't you hate it when that happens?). I'm planning to
look into this more.
qemu prealloc significantly reduces the pinning time, as expected, but
counterintuitively makes qemu on the base kernel take longer to
initialize a THP-backed guest than when qemu prealloc isn't used.
That's something to investigate, but not in this series.
Intel
~~~~~
                      base                              test
            ........................  ..........................................
                          qemu                                 qemu   qemu
       mem    pin         init            pin    pin           init   init
qthr   (G)    (s) (std)    (s) (std)  speedup    (s) (std)  speedup    (s) (std)
hugetlb
   0    16    2.9 (0.0)    3.8 (0.0)    11.2x    0.3 (0.0)     5.2x    0.7 (0.0)
   0   128   26.6 (0.1)   28.0 (0.1)    12.0x    2.2 (0.0)     8.7x    3.2 (0.0)
   0   360   75.1 (0.5)   77.5 (0.5)    11.9x    6.3 (0.0)     9.2x    8.4 (0.0)
  16    16    0.1 (0.0)    0.7 (0.0)     2.5x    0.0 (0.0)     1.1x    0.7 (0.0)
  16   128    0.6 (0.0)    3.6 (0.0)     7.9x    0.1 (0.0)     1.2x    3.0 (0.0)
  16   360    1.8 (0.0)    9.4 (0.0)     8.5x    0.2 (0.0)     1.2x    7.8 (0.0)
THP
   0    16    3.3 (0.0)    4.2 (0.0)     7.3x    0.4 (0.0)     4.2x    1.0 (0.0)
   0   128   29.5 (0.2)   30.5 (0.2)    11.8x    2.5 (0.0)     9.6x    3.2 (0.0)
   0   360   83.8 (0.6)   85.1 (0.6)    11.9x    7.0 (0.0)    10.7x    8.0 (0.0)
  16    16    0.6 (0.0)    6.1 (0.0)     4.0x    0.1 (0.0)     1.1x    5.6 (0.1)
  16   128    5.1 (0.0)   44.5 (0.0)     9.6x    0.5 (0.0)     1.1x   40.3 (0.4)
  16   360   14.4 (0.0)  125.4 (0.3)     9.7x    1.5 (0.0)     1.1x  111.5 (0.8)
AMD
~~~
                      base                              test
            ........................  ..........................................
                          qemu                                 qemu   qemu
       mem    pin         init            pin    pin           init   init
qthr   (G)    (s) (std)    (s) (std)  speedup    (s) (std)  speedup    (s) (std)
hugetlb
   0    16    1.1 (0.0)    1.5 (0.0)     4.3x    0.2 (0.0)     2.6x    0.6 (0.0)
   0   128    9.6 (0.1)   10.2 (0.1)     4.3x    2.2 (0.0)     3.6x    2.8 (0.0)
   0   980   74.1 (0.7)   75.7 (0.7)     4.3x   17.1 (0.0)     3.9x   19.2 (0.0)
  16    16    0.0 (0.0)    0.6 (0.0)     3.2x    0.0 (0.0)     1.0x    0.6 (0.0)
  16   128    0.3 (0.0)    2.7 (0.0)     8.5x    0.0 (0.0)     1.1x    2.4 (0.0)
  16   980    2.0 (0.0)   18.2 (0.1)     8.1x    0.3 (0.0)     1.1x   16.4 (0.0)
THP
   0    16    1.2 (0.0)    1.7 (0.0)     4.0x    0.3 (0.0)     2.3x    0.7 (0.0)
   0   128   10.9 (0.1)   11.4 (0.1)     4.1x    2.7 (0.2)     3.7x    3.1 (0.2)
   0   980   85.3 (0.6)   86.1 (0.6)     4.7x   18.3 (0.0)     4.5x   19.0 (0.0)
  16    16    0.5 (0.3)    6.2 (0.4)     5.1x    0.1 (0.0)     1.1x    5.7 (0.1)
  16   128    3.4 (0.8)   45.5 (1.0)     8.5x    0.4 (0.1)     1.1x   42.1 (0.2)
  16   980   19.6 (0.9)  337.9 (0.7)     6.5x    3.0 (0.2)     1.1x  320.4 (0.7)
All Data
========
The first row of each table is the base kernel (0 threads). The
remaining rows are all the test kernel, sorted by time with the fastest
first; they leave the qthr and mem columns blank because those match the
first row.
Intel
~~~~~
hugetlb
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0    16        --    0       --    2.9 (0.0)       --    3.8 (0.0)
               65536   16    11.2x    0.3 (0.0)     5.2x    0.7 (0.0)
               65536   24    11.2x    0.3 (0.0)     5.2x    0.7 (0.0)
              131072   16    11.2x    0.3 (0.0)     5.1x    0.7 (0.0)
              131072   24    10.9x    0.3 (0.0)     5.1x    0.7 (0.0)
               32768   16    10.2x    0.3 (0.0)     5.1x    0.7 (0.0)
              131072   32    10.4x    0.3 (0.0)     5.1x    0.7 (0.0)
               65536   32    10.4x    0.3 (0.0)     5.1x    0.7 (0.0)
               32768   32    10.5x    0.3 (0.0)     5.1x    0.7 (0.0)
               32768   24    10.0x    0.3 (0.0)     5.1x    0.7 (0.0)
              131072    8     7.4x    0.4 (0.0)     4.2x    0.9 (0.0)
               65536    8     7.1x    0.4 (0.0)     4.1x    0.9 (0.0)
               32768    8     6.8x    0.4 (0.0)     4.1x    0.9 (0.0)
                   0    8     2.7x    1.1 (0.3)     2.3x    1.6 (0.3)
                   0   16     1.9x    1.6 (0.0)     1.7x    2.2 (0.0)
                   0   32     1.9x    1.6 (0.0)     1.7x    2.2 (0.0)
                   0   24     1.8x    1.6 (0.0)     1.7x    2.2 (0.0)
              131072    1     1.0x    2.9 (0.0)     1.0x    3.8 (0.0)
                   0    1     1.0x    2.9 (0.0)     1.0x    3.8 (0.0)
               65536    1     1.0x    2.9 (0.0)     1.0x    3.8 (0.0)
               32768    1     1.0x    3.0 (0.0)     1.0x    3.8 (0.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0   128        --    0       --   26.6 (0.1)       --   28.0 (0.1)
              131072   24    13.1x    2.0 (0.0)     9.2x    3.0 (0.0)
              131072   32    12.9x    2.1 (0.0)     9.2x    3.1 (0.0)
              131072   16    12.7x    2.1 (0.0)     9.1x    3.1 (0.0)
               65536   24    12.3x    2.2 (0.0)     8.9x    3.1 (0.0)
               65536   32    12.3x    2.2 (0.0)     8.9x    3.2 (0.0)
               65536   16    12.0x    2.2 (0.0)     8.7x    3.2 (0.0)
               32768   24    11.1x    2.4 (0.0)     8.3x    3.4 (0.0)
               32768   32    11.0x    2.4 (0.0)     8.2x    3.4 (0.0)
               32768   16    11.0x    2.4 (0.0)     8.2x    3.4 (0.0)
              131072    8     7.5x    3.6 (0.0)     6.1x    4.6 (0.0)
               65536    8     7.1x    3.7 (0.1)     5.9x    4.8 (0.0)
               32768    8     6.8x    3.9 (0.1)     5.7x    4.9 (0.1)
                   0    8     3.0x    8.9 (0.6)     2.8x   10.0 (0.7)
                   0   16     1.9x   13.8 (0.3)     1.9x   14.9 (0.3)
                   0   32     1.9x   14.1 (0.2)     1.8x   15.2 (0.3)
                   0   24     1.8x   14.4 (0.1)     1.8x   15.6 (0.1)
              131072    1     1.0x   26.4 (0.2)     1.0x   27.8 (0.2)
               65536    1     1.0x   26.5 (0.0)     1.0x   27.9 (0.0)
                   0    1     1.0x   26.6 (0.3)     1.0x   27.9 (0.3)
               32768    1     1.0x   26.6 (0.2)     1.0x   28.0 (0.2)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0   360        --    0       --   75.1 (0.5)       --   77.5 (0.5)
              131072   24    13.0x    5.8 (0.0)     9.9x    7.8 (0.0)
              131072   32    12.9x    5.8 (0.0)     9.8x    7.9 (0.0)
              131072   16    12.6x    6.0 (0.0)     9.6x    8.1 (0.0)
               65536   24    12.4x    6.0 (0.0)     9.6x    8.1 (0.0)
               65536   32    12.1x    6.2 (0.0)     9.4x    8.3 (0.0)
               65536   16    11.9x    6.3 (0.0)     9.2x    8.4 (0.0)
               32768   24    11.3x    6.6 (0.0)     8.9x    8.7 (0.0)
               32768   16    10.9x    6.9 (0.0)     8.7x    9.0 (0.0)
               32768   32    10.7x    7.0 (0.1)     8.6x    9.1 (0.1)
              131072    8     7.4x   10.1 (0.0)     6.3x   12.3 (0.0)
               65536    8     7.2x   10.5 (0.1)     6.2x   12.6 (0.1)
               32768    8     6.8x   11.1 (0.1)     5.9x   13.2 (0.1)
                   0    8     3.2x   23.6 (0.3)     3.0x   25.7 (0.3)
                   0   32     1.9x   39.2 (0.2)     1.9x   41.5 (0.2)
                   0   16     1.9x   39.8 (0.4)     1.8x   42.0 (0.4)
                   0   24     1.8x   40.9 (0.4)     1.8x   43.1 (0.4)
               32768    1     1.0x   74.9 (0.5)     1.0x   77.3 (0.5)
              131072    1     1.0x   75.3 (0.6)     1.0x   77.7 (0.6)
                   0    1     1.0x   75.6 (0.2)     1.0x   78.1 (0.2)
               65536    1     1.0x   75.9 (0.1)     1.0x   78.4 (0.1)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16    16        --    0       --    0.1 (0.0)       --    0.7 (0.0)
               65536   24     8.3x    0.0 (0.0)     1.1x    0.7 (0.0)
               65536   32     6.3x    0.0 (0.0)     1.1x    0.7 (0.0)
              131072   32     4.2x    0.0 (0.0)     1.1x    0.7 (0.0)
               65536    8     4.2x    0.0 (0.0)     1.1x    0.7 (0.0)
              131072   24     4.2x    0.0 (0.0)     1.1x    0.7 (0.0)
               32768   16     3.9x    0.0 (0.0)     1.1x    0.7 (0.0)
               32768   32     3.5x    0.0 (0.0)     1.1x    0.7 (0.0)
               32768   24     4.0x    0.0 (0.0)     1.1x    0.7 (0.0)
               32768    8     2.6x    0.0 (0.0)     1.1x    0.7 (0.0)
                   0   16     3.1x    0.0 (0.0)     1.1x    0.7 (0.0)
              131072   16     2.7x    0.0 (0.0)     1.1x    0.7 (0.0)
               65536   16     2.5x    0.0 (0.0)     1.1x    0.7 (0.0)
                   0   24     2.5x    0.0 (0.0)     1.1x    0.7 (0.0)
                   0    8     2.8x    0.0 (0.0)     1.1x    0.7 (0.0)
              131072    8     2.2x    0.0 (0.0)     1.1x    0.7 (0.0)
                   0   32     2.3x    0.0 (0.0)     1.1x    0.7 (0.0)
               32768    1     0.9x    0.1 (0.0)     1.0x    0.8 (0.0)
              131072    1     0.9x    0.1 (0.0)     1.0x    0.8 (0.0)
               65536    1     0.9x    0.1 (0.0)     1.0x    0.8 (0.0)
                   0    1     0.9x    0.1 (0.0)     1.0x    0.8 (0.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16   128        --    0       --    0.6 (0.0)       --    3.6 (0.0)
              131072   24    13.4x    0.0 (0.0)     1.2x    3.0 (0.0)
               65536   32    11.8x    0.1 (0.0)     1.2x    3.0 (0.0)
              131072   32    11.8x    0.1 (0.0)     1.2x    3.0 (0.0)
               32768   32    10.4x    0.1 (0.0)     1.2x    3.0 (0.0)
               32768   24     9.3x    0.1 (0.0)     1.2x    3.0 (0.0)
              131072   16     8.7x    0.1 (0.0)     1.2x    3.0 (0.0)
               65536   16     7.9x    0.1 (0.0)     1.2x    3.0 (0.0)
               32768   16     7.7x    0.1 (0.0)     1.2x    3.0 (0.0)
               65536   24     7.6x    0.1 (0.0)     1.2x    3.0 (0.0)
              131072    8     5.7x    0.1 (0.0)     1.2x    3.0 (0.0)
               65536    8     4.9x    0.1 (0.0)     1.2x    3.1 (0.0)
               32768    8     4.6x    0.1 (0.0)     1.2x    3.1 (0.0)
                   0    8     3.9x    0.2 (0.0)     1.1x    3.1 (0.0)
                   0   16     3.1x    0.2 (0.1)     1.1x    3.1 (0.1)
                   0   24     2.9x    0.2 (0.0)     1.1x    3.2 (0.0)
                   0   32     2.6x    0.2 (0.0)     1.1x    3.2 (0.0)
              131072    1     0.9x    0.7 (0.0)     1.0x    3.6 (0.0)
               65536    1     0.9x    0.7 (0.0)     1.0x    3.6 (0.0)
               32768    1     0.9x    0.7 (0.0)     1.0x    3.6 (0.0)
                   0    1     0.9x    0.7 (0.0)     1.0x    3.6 (0.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16   360        --    0       --    1.8 (0.0)       --    9.4 (0.0)
              131072   32    15.1x    0.1 (0.0)     1.2x    7.7 (0.0)
               65536   32    13.5x    0.1 (0.0)     1.2x    7.7 (0.0)
               65536   24    11.6x    0.2 (0.0)     1.2x    7.8 (0.0)
              131072   16    11.5x    0.2 (0.0)     1.2x    7.8 (0.0)
               32768   32    11.3x    0.2 (0.0)     1.2x    7.8 (0.0)
               32768   24    10.5x    0.2 (0.0)     1.2x    7.8 (0.0)
              131072   24    10.4x    0.2 (0.0)     1.2x    7.8 (0.0)
               32768   16     8.8x    0.2 (0.0)     1.2x    7.8 (0.0)
               65536   16     8.5x    0.2 (0.0)     1.2x    7.8 (0.0)
              131072    8     6.1x    0.3 (0.0)     1.2x    7.9 (0.1)
               65536    8     5.5x    0.3 (0.0)     1.2x    7.9 (0.0)
               32768    8     5.3x    0.3 (0.0)     1.2x    7.9 (0.0)
                   0    8     4.8x    0.4 (0.1)     1.2x    8.0 (0.1)
                   0   16     3.3x    0.5 (0.1)     1.2x    8.1 (0.1)
                   0   24     3.1x    0.6 (0.0)     1.1x    8.2 (0.0)
                   0   32     2.7x    0.7 (0.0)     1.1x    8.3 (0.0)
              131072    1     0.9x    1.9 (0.0)     1.0x    9.5 (0.0)
               32768    1     0.9x    1.9 (0.0)     1.0x    9.5 (0.0)
               65536    1     0.9x    1.9 (0.0)     1.0x    9.6 (0.0)
                   0    1     0.9x    1.9 (0.0)     1.0x    9.5 (0.0)
THP
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0    16        --    0       --    3.3 (0.0)       --    4.2 (0.0)
               32768   32     7.5x    0.4 (0.0)     4.3x    1.0 (0.0)
              131072   32     7.6x    0.4 (0.0)     4.3x    1.0 (0.0)
               65536   16     7.3x    0.4 (0.0)     4.2x    1.0 (0.0)
               65536   32     7.5x    0.4 (0.0)     4.3x    1.0 (0.0)
              131072   16     7.2x    0.5 (0.0)     4.2x    1.0 (0.0)
               65536   24     7.0x    0.5 (0.0)     4.1x    1.0 (0.0)
              131072   24     6.9x    0.5 (0.0)     4.1x    1.0 (0.0)
               32768   16     6.3x    0.5 (0.0)     3.9x    1.1 (0.0)
               32768   24     5.7x    0.6 (0.0)     3.8x    1.1 (0.0)
               32768    8     5.0x    0.7 (0.0)     3.5x    1.2 (0.0)
               65536    8     5.4x    0.6 (0.0)     3.4x    1.2 (0.1)
              131072    8     5.7x    0.6 (0.0)     3.5x    1.2 (0.1)
                   0   32     2.0x    1.6 (0.1)     1.8x    2.3 (0.1)
                   0   24     1.9x    1.7 (0.0)     1.7x    2.5 (0.1)
                   0   16     1.8x    1.8 (0.3)     1.6x    2.6 (0.3)
                   0    8     1.9x    1.7 (0.3)     1.7x    2.5 (0.3)
                   0    1     1.0x    3.3 (0.0)     1.0x    4.2 (0.0)
              131072    1     1.0x    3.3 (0.0)     1.0x    4.2 (0.0)
               65536    1     1.0x    3.3 (0.0)     1.0x    4.2 (0.0)
               32768    1     1.0x    3.3 (0.0)     1.0x    4.2 (0.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0   128        --    0       --   29.5 (0.2)       --   30.5 (0.2)
              131072   24    12.9x    2.3 (0.0)    10.3x    2.9 (0.0)
              131072   32    12.8x    2.3 (0.0)    10.2x    3.0 (0.0)
              131072   16    12.5x    2.4 (0.0)    10.0x    3.0 (0.0)
               65536   24    12.1x    2.4 (0.0)     9.8x    3.1 (0.0)
               65536   32    12.0x    2.4 (0.0)     9.8x    3.1 (0.0)
               65536   16    11.8x    2.5 (0.0)     9.6x    3.2 (0.0)
               32768   24    11.1x    2.7 (0.0)     9.1x    3.3 (0.0)
               32768   32    10.7x    2.7 (0.0)     8.9x    3.4 (0.0)
               32768   16    10.6x    2.8 (0.0)     8.8x    3.5 (0.0)
              131072    8     7.3x    4.0 (0.0)     6.4x    4.8 (0.0)
               65536    8     7.1x    4.2 (0.0)     6.2x    4.9 (0.0)
               32768    8     6.6x    4.4 (0.0)     5.8x    5.2 (0.0)
                   0    8     3.6x    8.1 (0.7)     3.4x    9.0 (0.7)
                   0   32     2.2x   13.6 (1.9)     2.1x   14.5 (1.9)
                   0   16     2.1x   14.0 (3.2)     2.1x   14.8 (3.2)
                   0   24     2.1x   14.1 (3.1)     2.0x   15.0 (3.1)
                   0    1     1.0x   29.6 (0.2)     1.0x   30.6 (0.2)
               32768    1     1.0x   29.6 (0.2)     1.0x   30.7 (0.2)
              131072    1     1.0x   29.7 (0.0)     1.0x   30.7 (0.0)
               65536    1     1.0x   29.8 (0.1)     1.0x   30.8 (0.1)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0   360        --    0       --   83.8 (0.6)       --   85.1 (0.6)
              131072   24    13.6x    6.2 (0.0)    12.0x    7.1 (0.0)
              131072   32    13.4x    6.2 (0.0)    11.9x    7.2 (0.0)
               65536   24    12.8x    6.6 (0.1)    11.3x    7.5 (0.1)
              131072   16    12.7x    6.6 (0.0)    11.3x    7.5 (0.0)
               65536   32    12.4x    6.8 (0.0)    11.0x    7.7 (0.0)
               65536   16    11.9x    7.0 (0.0)    10.7x    8.0 (0.0)
               32768   24    11.4x    7.4 (0.0)    10.3x    8.3 (0.0)
               32768   32    11.0x    7.6 (0.0)    10.0x    8.5 (0.0)
               32768   16    10.7x    7.8 (0.0)     9.7x    8.8 (0.0)
              131072    8     7.4x   11.4 (0.0)     6.8x   12.4 (0.0)
               65536    8     7.2x   11.7 (0.0)     6.7x   12.7 (0.0)
               32768    8     6.7x   12.6 (0.1)     6.2x   13.6 (0.1)
                   0    8     3.1x   27.2 (6.1)     3.0x   28.3 (6.1)
                   0   32     2.1x   39.9 (6.4)     2.1x   41.0 (6.4)
                   0   24     2.1x   40.6 (6.6)     2.0x   41.7 (6.6)
                   0   16     2.0x   42.6 (0.1)     1.9x   43.8 (0.1)
              131072    1     1.0x   83.9 (0.5)     1.0x   85.2 (0.5)
               65536    1     1.0x   84.2 (0.3)     1.0x   85.5 (0.3)
               32768    1     1.0x   84.6 (0.1)     1.0x   85.9 (0.1)
                   0    1     1.0x   84.9 (0.1)     1.0x   86.2 (0.1)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16    16        --    0       --    0.6 (0.0)       --    6.1 (0.0)
               65536   32     3.9x    0.1 (0.0)     1.1x    5.5 (0.0)
               32768   32     3.9x    0.1 (0.0)     1.1x    5.6 (0.0)
              131072   24     3.9x    0.1 (0.0)     1.1x    5.6 (0.0)
              131072   32     3.9x    0.1 (0.0)     1.1x    5.5 (0.0)
               65536   24     3.9x    0.1 (0.0)     1.1x    5.6 (0.0)
               32768   24     3.9x    0.1 (0.0)     1.1x    5.6 (0.1)
               65536   16     4.0x    0.1 (0.0)     1.1x    5.6 (0.1)
               32768   16     3.9x    0.1 (0.0)     1.1x    5.6 (0.0)
              131072   16     3.9x    0.1 (0.0)     1.1x    5.6 (0.1)
               65536    8     4.0x    0.1 (0.0)     1.1x    5.6 (0.0)
              131072    8     4.0x    0.1 (0.0)     1.1x    5.7 (0.1)
               32768    8     4.0x    0.1 (0.0)     1.1x    5.6 (0.0)
                   0   32     1.6x    0.4 (0.0)     1.0x    5.9 (0.1)
                   0   24     1.6x    0.4 (0.0)     1.0x    5.9 (0.0)
                   0   16     1.5x    0.4 (0.0)     1.0x    6.0 (0.0)
                   0    8     1.5x    0.4 (0.0)     1.0x    5.9 (0.1)
               65536    1     1.0x    0.6 (0.0)     1.0x    6.1 (0.1)
               32768    1     1.0x    0.6 (0.0)     1.0x    6.1 (0.0)
                   0    1     1.0x    0.6 (0.0)     1.0x    6.2 (0.0)
              131072    1     1.0x    0.6 (0.0)     1.0x    6.2 (0.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16   128        --    0       --    5.1 (0.0)       --   44.5 (0.0)
              131072   32    16.5x    0.3 (0.0)     1.1x   40.4 (0.3)
               65536   32    15.7x    0.3 (0.0)     1.1x   40.4 (0.6)
              131072   24    13.9x    0.4 (0.0)     1.1x   39.8 (0.3)
               32768   32    14.1x    0.4 (0.0)     1.1x   40.0 (0.5)
               65536   24    12.9x    0.4 (0.0)     1.1x   39.8 (0.5)
               32768   24    12.2x    0.4 (0.0)     1.1x   40.1 (0.1)
               65536   16     9.6x    0.5 (0.0)     1.1x   40.3 (0.4)
              131072   16     9.7x    0.5 (0.0)     1.1x   40.4 (0.5)
               32768   16     9.2x    0.5 (0.0)     1.1x   40.8 (0.5)
               65536    8     5.5x    0.9 (0.0)     1.1x   40.5 (0.5)
              131072    8     5.5x    0.9 (0.0)     1.1x   40.7 (0.6)
               32768    8     5.2x    1.0 (0.0)     1.1x   40.7 (0.3)
                   0   32     1.6x    3.1 (0.0)     1.0x   43.5 (0.8)
                   0   24     1.6x    3.2 (0.0)     1.0x   42.9 (0.5)
                   0   16     1.5x    3.3 (0.0)     1.0x   43.5 (0.4)
                   0    8     1.5x    3.5 (0.0)     1.0x   43.4 (0.5)
               65536    1     1.0x    5.0 (0.0)     1.0x   44.6 (0.1)
               32768    1     1.0x    5.0 (0.0)     1.0x   44.9 (0.2)
              131072    1     1.0x    5.0 (0.0)     1.0x   44.8 (0.2)
                   0    1     1.0x    5.0 (0.0)     1.0x   44.8 (0.3)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16   360        --    0       --   14.4 (0.0)       --  125.4 (0.3)
              131072   32    16.5x    0.9 (0.0)     1.1x  112.0 (0.7)
               65536   32    14.9x    1.0 (0.0)     1.1x  113.3 (1.3)
               32768   32    14.0x    1.0 (0.0)     1.1x  112.6 (1.0)
              131072   24    13.5x    1.1 (0.0)     1.1x  111.3 (0.9)
               65536   24    13.3x    1.1 (0.0)     1.1x  112.3 (0.8)
               32768   24    12.4x    1.2 (0.0)     1.1x  111.1 (0.8)
               65536   16     9.7x    1.5 (0.0)     1.1x  111.5 (0.8)
              131072   16     9.7x    1.5 (0.0)     1.1x  112.1 (1.2)
               32768   16     9.3x    1.5 (0.0)     1.1x  113.2 (0.4)
              131072    8     5.5x    2.6 (0.0)     1.1x  114.8 (1.3)
               32768    8     5.5x    2.6 (0.0)     1.1x  114.1 (1.0)
               65536    8     5.4x    2.6 (0.0)     1.1x  115.0 (3.3)
                   0   32     1.6x    8.8 (0.0)     1.0x  120.7 (0.7)
                   0   24     1.6x    8.9 (0.0)     1.1x  119.4 (0.1)
                   0   16     1.5x    9.5 (0.0)     1.0x  120.1 (0.7)
                   0    8     1.4x   10.1 (0.2)     1.0x  123.6 (1.9)
               32768    1     1.0x   14.3 (0.0)     1.0x  126.2 (0.9)
               65536    1     1.0x   14.3 (0.0)     1.0x  125.4 (0.6)
              131072    1     1.0x   14.3 (0.0)     1.0x  126.5 (1.0)
                   0    1     1.0x   14.3 (0.0)     1.0x  124.7 (1.2)
AMD
~~~
hugetlb
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0    16        --    0       --    1.1 (0.0)       --    1.5 (0.0)
              131072    8     4.3x    0.2 (0.0)     2.5x    0.6 (0.0)
               65536   16     4.3x    0.2 (0.0)     2.6x    0.6 (0.0)
               65536    8     4.0x    0.3 (0.0)     2.5x    0.6 (0.0)
               65536   24     3.8x    0.3 (0.0)     2.4x    0.6 (0.0)
               32768   32     3.6x    0.3 (0.0)     2.3x    0.6 (0.0)
              131072   32     3.6x    0.3 (0.0)     2.3x    0.6 (0.0)
               65536   32     3.5x    0.3 (0.0)     2.3x    0.6 (0.0)
               32768    8     3.4x    0.3 (0.0)     2.3x    0.7 (0.0)
              131072   24     3.0x    0.3 (0.0)     2.1x    0.7 (0.0)
              131072   16     2.8x    0.4 (0.0)     2.0x    0.8 (0.1)
               32768   16     2.6x    0.4 (0.0)     1.9x    0.8 (0.0)
               32768   24     2.6x    0.4 (0.0)     1.9x    0.8 (0.0)
                   0   32     1.3x    0.8 (0.0)     1.2x    1.2 (0.0)
                   0   24     1.3x    0.8 (0.0)     1.2x    1.3 (0.0)
                   0   16     1.2x    0.9 (0.0)     1.2x    1.3 (0.0)
                   0    8     1.1x    0.9 (0.0)     1.1x    1.4 (0.0)
               32768    1     1.0x    1.1 (0.0)     1.0x    1.5 (0.0)
              131072    1     1.0x    1.1 (0.0)     1.0x    1.5 (0.0)
                   0    1     1.0x    1.1 (0.0)     1.0x    1.5 (0.0)
               65536    1     1.0x    1.1 (0.0)     1.0x    1.5 (0.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0   128        --    0       --    9.6 (0.1)       --   10.2 (0.1)
              131072   32     4.5x    2.1 (0.0)     3.9x    2.6 (0.0)
              131072    8     4.4x    2.2 (0.0)     3.7x    2.8 (0.1)
               65536   16     4.3x    2.2 (0.0)     3.6x    2.8 (0.0)
              131072   16     4.2x    2.3 (0.1)     3.6x    2.9 (0.0)
               65536    8     4.1x    2.3 (0.0)     3.6x    2.8 (0.0)
              131072   24     4.1x    2.4 (0.1)     3.5x    3.0 (0.1)
               65536   24     3.8x    2.5 (0.0)     3.4x    3.0 (0.0)
               65536   32     3.8x    2.5 (0.0)     3.3x    3.1 (0.0)
               32768   32     3.6x    2.6 (0.0)     3.3x    3.1 (0.0)
               32768    8     3.3x    2.9 (0.1)     2.9x    3.5 (0.1)
               32768   16     3.2x    3.0 (0.3)     2.9x    3.5 (0.3)
               32768   24     2.5x    3.8 (0.0)     2.3x    4.4 (0.0)
                   0   16     1.2x    7.8 (0.1)     1.2x    8.4 (0.1)
                   0    8     1.2x    8.3 (0.1)     1.1x    8.9 (0.1)
               32768    1     1.0x    9.6 (0.1)     1.0x   10.3 (0.1)
               65536    1     1.0x    9.6 (0.0)     1.0x   10.3 (0.1)
              131072    1     1.0x    9.7 (0.0)     1.0x   10.3 (0.0)
                   0    1     1.0x    9.7 (0.0)     1.0x   10.4 (0.0)
                   0   24     0.9x   10.2 (0.6)     0.9x   10.8 (0.6)
                   0   32     0.9x   10.5 (0.5)     0.9x   11.2 (0.5)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0   980        --    0       --   74.1 (0.7)       --   75.7 (0.7)
              131072   16     4.7x   15.9 (0.1)     4.3x   17.4 (0.1)
              131072   24     4.6x   16.0 (0.0)     4.2x   18.1 (0.0)
              131072   32     4.6x   16.3 (0.0)     4.1x   18.4 (0.0)
              131072    8     4.4x   16.9 (0.1)     4.1x   18.5 (0.1)
               65536   16     4.3x   17.1 (0.0)     3.9x   19.2 (0.0)
               65536   24     4.3x   17.4 (0.0)     3.9x   19.5 (0.0)
               65536   32     4.2x   17.7 (0.0)     3.8x   19.9 (0.1)
               65536    8     4.1x   18.2 (0.0)     3.7x   20.4 (0.0)
               32768   24     3.7x   19.8 (0.1)     3.4x   22.0 (0.1)
               32768   16     3.7x   20.2 (0.2)     3.5x   21.8 (0.2)
               32768   32     3.6x   20.4 (0.1)     3.4x   22.5 (0.1)
               32768    8     3.4x   21.6 (0.5)     3.3x   23.1 (0.5)
                   0   16     1.2x   60.4 (0.6)     1.2x   62.0 (0.6)
                   0    8     1.1x   65.3 (1.0)     1.1x   67.6 (1.0)
                   0   24     1.0x   73.1 (2.7)     1.0x   75.4 (2.6)
              131072    1     1.0x   75.0 (0.7)     1.0x   77.3 (0.7)
               65536    1     1.0x   75.4 (0.7)     1.0x   77.7 (0.7)
                   0    1     1.0x   75.6 (0.7)     1.0x   77.8 (0.7)
               32768    1     1.0x   75.8 (0.0)     1.0x   78.0 (0.0)
                   0   32     0.8x   92.9 (1.2)     0.8x   95.3 (1.1)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16    16        --    0       --    0.0 (0.0)       --    0.6 (0.0)
              131072   24     5.6x    0.0 (0.0)     1.0x    0.6 (0.0)
               32768   16     4.6x    0.0 (0.0)     1.0x    0.6 (0.0)
               32768   32     4.8x    0.0 (0.0)     0.9x    0.6 (0.0)
              131072   16     4.6x    0.0 (0.0)     1.0x    0.6 (0.0)
              131072   32     4.3x    0.0 (0.0)     1.0x    0.6 (0.0)
              131072    8     4.5x    0.0 (0.0)     1.0x    0.6 (0.0)
               32768    8     4.4x    0.0 (0.0)     1.0x    0.6 (0.0)
               65536   24     3.7x    0.0 (0.0)     1.0x    0.6 (0.0)
               65536   16     3.2x    0.0 (0.0)     1.0x    0.6 (0.0)
               65536    8     2.8x    0.0 (0.0)     1.0x    0.6 (0.0)
               32768   24     3.0x    0.0 (0.0)     1.0x    0.6 (0.0)
               65536   32     2.6x    0.0 (0.0)     1.0x    0.6 (0.0)
                   0   32     2.1x    0.0 (0.0)     1.0x    0.6 (0.0)
                   0   16     2.3x    0.0 (0.0)     0.9x    0.6 (0.0)
                   0    8     2.2x    0.0 (0.0)     0.9x    0.6 (0.0)
                   0   24     1.2x    0.0 (0.0)     1.0x    0.6 (0.0)
              131072    1     1.0x    0.0 (0.0)     0.9x    0.6 (0.0)
               65536    1     1.0x    0.0 (0.0)     0.9x    0.7 (0.0)
               32768    1     0.8x    0.0 (0.0)     0.9x    0.6 (0.0)
                   0    1     0.9x    0.0 (0.0)     1.0x    0.6 (0.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16   128        --    0       --    0.3 (0.0)       --    2.7 (0.0)
              131072   24    10.4x    0.0 (0.0)     1.1x    2.4 (0.0)
               65536   16     8.5x    0.0 (0.0)     1.1x    2.4 (0.0)
               32768   24     7.7x    0.0 (0.0)     1.1x    2.4 (0.0)
               32768   32     7.6x    0.0 (0.0)     1.1x    2.4 (0.0)
               65536   24     6.1x    0.0 (0.0)     1.1x    2.4 (0.0)
              131072   16     5.8x    0.0 (0.0)     1.1x    2.4 (0.0)
              131072   32     5.6x    0.0 (0.0)     1.1x    2.4 (0.0)
               32768    8     5.2x    0.1 (0.0)     1.1x    2.4 (0.0)
               65536   32     4.8x    0.1 (0.0)     1.1x    2.5 (0.0)
               32768   16     4.9x    0.1 (0.0)     1.1x    2.4 (0.0)
              131072    8     4.4x    0.1 (0.0)     1.1x    2.4 (0.0)
               65536    8     4.2x    0.1 (0.0)     1.1x    2.4 (0.0)
                   0    8     2.9x    0.1 (0.0)     1.1x    2.4 (0.0)
                   0   16     2.9x    0.1 (0.0)     1.1x    2.5 (0.0)
                   0   24     2.8x    0.1 (0.0)     1.1x    2.4 (0.0)
                   0   32     1.2x    0.2 (0.0)     1.0x    2.6 (0.0)
               32768    1     1.0x    0.3 (0.0)     1.0x    2.7 (0.0)
              131072    1     1.0x    0.3 (0.0)     1.0x    2.7 (0.0)
               65536    1     1.0x    0.3 (0.0)     1.0x    2.7 (0.0)
                   0    1     0.9x    0.3 (0.0)     1.0x    2.7 (0.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16   980        --    0       --    2.0 (0.0)       --   18.2 (0.1)
              131072   32    11.2x    0.2 (0.0)     1.2x   15.7 (0.0)
              131072   16     9.4x    0.2 (0.0)     1.2x   15.7 (0.0)
               65536   24     9.2x    0.2 (0.0)     1.1x   16.3 (0.0)
               65536   16     8.1x    0.3 (0.0)     1.1x   16.4 (0.0)
               32768   16     7.1x    0.3 (0.0)     1.1x   15.8 (0.0)
              131072   24     7.1x    0.3 (0.0)     1.1x   15.8 (0.0)
               65536   32     6.2x    0.3 (0.0)     1.1x   16.4 (0.0)
               65536    8     5.7x    0.4 (0.0)     1.1x   16.5 (0.1)
               32768   32     5.6x    0.4 (0.0)     1.1x   16.5 (0.0)
               32768   24     5.6x    0.4 (0.0)     1.1x   15.9 (0.0)
              131072    8     5.0x    0.4 (0.0)     1.1x   16.0 (0.0)
               32768    8     3.0x    0.7 (0.0)     1.1x   16.3 (0.1)
                   0    8     2.8x    0.7 (0.0)     1.1x   16.2 (0.0)
                   0   16     2.7x    0.8 (0.1)     1.1x   16.9 (0.1)
                   0   24     1.6x    1.2 (0.4)     1.0x   17.4 (0.4)
               32768    1     1.0x    2.1 (0.0)     1.0x   18.1 (0.0)
                   0   32     1.0x    2.1 (0.0)     1.0x   17.7 (0.0)
               65536    1     1.0x    2.1 (0.0)     1.0x   18.2 (0.1)
              131072    1     1.0x    2.1 (0.0)     1.0x   18.3 (0.0)
                   0    1     0.9x    2.2 (0.0)     1.0x   17.7 (0.0)
THP
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0    16        --    0       --    1.2 (0.0)       --    1.7 (0.0)
              131072    8     4.3x    0.3 (0.0)     2.4x    0.7 (0.0)
              131072   32     3.1x    0.4 (0.0)     2.1x    0.8 (0.0)
               65536   16     4.0x    0.3 (0.0)     2.3x    0.7 (0.0)
               65536    8     3.9x    0.3 (0.0)     2.3x    0.7 (0.0)
               65536   24     3.3x    0.4 (0.0)     2.1x    0.8 (0.0)
               65536   32     3.3x    0.4 (0.0)     2.2x    0.8 (0.0)
               32768   16     2.6x    0.5 (0.0)     1.9x    0.9 (0.0)
              131072   24     3.3x    0.4 (0.0)     2.1x    0.8 (0.0)
               32768   32     3.3x    0.4 (0.0)     2.1x    0.8 (0.0)
              131072   16     3.1x    0.4 (0.0)     2.0x    0.8 (0.0)
               32768   24     2.5x    0.5 (0.0)     1.9x    0.9 (0.0)
               32768    8     3.2x    0.4 (0.0)     1.9x    0.9 (0.0)
                   0   24     1.3x    1.0 (0.0)     1.2x    1.4 (0.0)
                   0   32     1.2x    1.0 (0.0)     1.1x    1.5 (0.1)
                   0    8     1.2x    1.0 (0.0)     1.1x    1.5 (0.0)
                   0   16     1.2x    1.0 (0.0)     1.1x    1.5 (0.0)
              131072    1     1.0x    1.2 (0.0)     1.0x    1.7 (0.0)
               65536    1     1.0x    1.2 (0.0)     1.0x    1.7 (0.0)
                   0    1     1.0x    1.2 (0.0)     1.0x    1.7 (0.0)
               32768    1     1.0x    1.2 (0.0)     1.0x    1.7 (0.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0   128        --    0       --   10.9 (0.1)       --   11.4 (0.1)
              131072   16     5.0x    2.2 (0.0)     4.3x    2.7 (0.0)
              131072   32     4.8x    2.3 (0.0)     4.2x    2.7 (0.0)
              131072   24     4.6x    2.4 (0.0)     4.1x    2.8 (0.1)
              131072    8     4.7x    2.3 (0.0)     4.1x    2.8 (0.0)
               65536   24     4.4x    2.5 (0.0)     3.9x    2.9 (0.0)
               65536   32     4.1x    2.7 (0.1)     3.7x    3.1 (0.1)
               65536   16     4.1x    2.7 (0.2)     3.7x    3.1 (0.2)
               65536    8     4.0x    2.7 (0.1)     3.6x    3.2 (0.1)
               32768   24     3.8x    2.9 (0.0)     3.4x    3.4 (0.0)
               32768   32     3.6x    3.0 (0.1)     3.3x    3.5 (0.1)
               32768    8     3.5x    3.1 (0.0)     3.2x    3.6 (0.1)
               32768   16     3.3x    3.3 (0.2)     3.1x    3.7 (0.2)
                   0   16     1.3x    8.3 (0.4)     1.3x    8.8 (0.4)
                   0    8     1.2x    8.8 (0.4)     1.2x    9.3 (0.4)
                   0   24     1.1x   10.1 (1.2)     1.1x   10.7 (1.3)
                   0   32     1.1x   10.3 (1.3)     1.1x   10.8 (1.3)
              131072    1     1.0x   10.9 (0.0)     1.0x   11.5 (0.0)
               32768    1     1.0x   11.0 (0.1)     1.0x   11.6 (0.1)
               65536    1     1.0x   11.1 (0.0)     1.0x   11.6 (0.0)
                   0    1     1.0x   11.1 (0.2)     1.0x   11.6 (0.2)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
   0   980        --    0       --   85.3 (0.6)       --   86.1 (0.6)
              131072   16     5.2x   16.4 (0.0)     5.0x   17.1 (0.0)
              131072   24     5.1x   16.7 (0.1)     4.9x   17.4 (0.1)
              131072   32     5.0x   17.1 (0.0)     4.8x   17.8 (0.0)
              131072    8     4.7x   18.3 (0.1)     4.5x   19.0 (0.1)
               65536   16     4.7x   18.3 (0.0)     4.5x   19.0 (0.0)
               65536   24     4.6x   18.5 (0.0)     4.5x   19.2 (0.0)
               65536   32     4.5x   18.8 (0.0)     4.4x   19.6 (0.0)
               65536    8     4.3x   19.6 (0.0)     4.2x   20.4 (0.0)
               32768   16     3.9x   21.6 (0.0)     3.9x   22.4 (0.0)
               32768   24     3.9x   22.1 (0.3)     3.8x   22.8 (0.3)
               32768   32     3.8x   22.4 (0.1)     3.7x   23.1 (0.1)
               32768    8     3.8x   22.7 (0.0)     3.7x   23.5 (0.0)
                   0   16     1.3x   64.6 (2.7)     1.3x   65.4 (2.7)
                   0    8     1.2x   70.0 (2.7)     1.2x   70.8 (2.7)
                   0   32     1.0x   82.4 (5.7)     1.0x   83.2 (5.7)
                   0   24     1.0x   83.4 (6.9)     1.0x   84.1 (6.9)
              131072    1     1.0x   84.2 (0.3)     1.0x   85.0 (0.3)
                   0    1     1.0x   84.8 (1.3)     1.0x   85.6 (1.3)
               65536    1     1.0x   84.9 (0.4)     1.0x   85.7 (0.4)
               32768    1     1.0x   85.6 (1.3)     1.0x   86.4 (1.3)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16    16        --    0       --    0.5 (0.3)       --    6.2 (0.4)
              131072   32     4.9x    0.1 (0.0)     1.1x    5.6 (0.0)
               65536   16     5.1x    0.1 (0.0)     1.1x    5.7 (0.1)
               65536   32     5.0x    0.1 (0.0)     1.1x    5.6 (0.1)
               32768   16     5.0x    0.1 (0.0)     1.1x    5.7 (0.0)
               32768    8     5.8x    0.1 (0.0)     1.1x    5.6 (0.0)
               65536   24     5.7x    0.1 (0.0)     1.1x    5.7 (0.0)
               32768   32     3.9x    0.1 (0.0)     1.0x    5.9 (0.1)
              131072   16     3.7x    0.1 (0.1)     1.0x    6.0 (0.3)
               65536    8     4.0x    0.1 (0.1)     1.1x    5.9 (0.1)
              131072   24     3.6x    0.1 (0.1)     1.0x    5.9 (0.5)
              131072    8     2.5x    0.2 (0.1)     1.0x    6.0 (0.6)
               32768   24     1.7x    0.3 (0.1)     1.0x    6.5 (0.2)
              131072    1     1.8x    0.3 (0.0)     1.1x    5.9 (0.0)
                   0   32     1.6x    0.3 (0.0)     1.0x    6.2 (0.2)
                   0    8     1.0x    0.5 (0.0)     1.0x    6.2 (0.1)
                   0   24     0.9x    0.5 (0.3)     1.0x    6.3 (0.5)
                   0    1     0.9x    0.5 (0.4)     1.0x    6.2 (0.5)
               32768    1     0.8x    0.6 (0.3)     1.0x    6.5 (0.4)
                   0   16     0.7x    0.7 (0.7)     0.9x    6.6 (0.8)
               65536    1     0.6x    0.9 (0.5)     0.9x    6.7 (0.7)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16   128        --    0       --    3.4 (0.8)       --   45.5 (1.0)
              131072   32    11.7x    0.3 (0.1)     1.1x   42.1 (0.2)
               65536   16     8.5x    0.4 (0.1)     1.1x   42.1 (0.2)
               32768   32     8.6x    0.4 (0.2)     1.1x   43.0 (0.2)
               65536   32     8.9x    0.4 (0.1)     1.0x   43.6 (0.3)
               32768   24     7.9x    0.4 (0.1)     1.1x   42.3 (0.3)
               32768   16     6.5x    0.5 (0.2)     1.1x   42.5 (0.5)
               65536   24     6.7x    0.5 (0.2)     1.1x   42.6 (0.5)
              131072   24     5.8x    0.6 (0.5)     1.1x   42.5 (0.6)
              131072   16     5.0x    0.7 (0.6)     1.1x   42.4 (0.8)
              131072    8     3.8x    0.9 (0.4)     1.1x   42.7 (0.5)
               65536    8     3.2x    1.1 (0.5)     1.1x   42.9 (0.6)
               32768    8     3.1x    1.1 (0.4)     1.1x   43.3 (1.0)
                   0   32     1.1x    3.0 (0.2)     1.0x   45.1 (0.2)
                   0   24     1.2x    2.9 (0.1)     1.0x   44.6 (0.2)
                   0    8     1.0x    3.5 (1.1)     1.0x   45.5 (1.2)
               32768    1     1.0x    3.6 (0.9)     1.0x   45.5 (0.7)
              131072    1     1.0x    3.5 (1.1)     1.0x   45.6 (1.4)
                   0    1     0.9x    3.6 (0.5)     1.0x   45.6 (0.4)
                   0   16     0.9x    3.6 (0.2)     1.0x   45.7 (0.2)
               65536    1     0.9x    3.6 (1.0)     1.0x   45.8 (1.0)
            lockedvm                                qemu   qemu
       mem     cache           pin    pin           init   init
qthr   (G)   (pages)  thr  speedup    (s) (std)  speedup    (s) (std)
  16   980        --    0       --   19.6 (0.9)       --  337.9 (0.7)
              131072   32     9.7x    2.0 (0.4)     1.0x  323.0 (0.7)
              131072   24     8.8x    2.2 (0.4)     1.0x  324.6 (0.8)
               65536   32     8.4x    2.3 (0.2)     1.0x  323.1 (0.5)
               32768   24     7.9x    2.5 (0.1)     1.1x  319.4 (1.0)
               65536   24     8.1x    2.4 (0.1)     1.0x  322.3 (0.8)
               32768   32     7.4x    2.6 (0.2)     1.1x  321.2 (0.8)
              131072   16     6.9x    2.8 (0.3)     1.0x  331.0 (8.8)
               65536   16     6.5x    3.0 (0.2)     1.1x  320.4 (0.7)
               32768   16     5.9x    3.3 (0.5)     1.0x  328.3 (1.5)
               65536    8     5.3x    3.7 (0.4)     1.1x  320.8 (1.0)
               32768    8     4.8x    4.1 (0.2)     1.0x  328.9 (0.8)
              131072    8     4.7x    4.1 (0.2)     1.1x  319.4 (0.9)
                   0    8     1.2x   16.9 (0.7)     1.0x  333.9 (3.1)
                   0   32     1.1x   18.0 (0.7)     1.0x  336.1 (0.8)
                   0   24     1.1x   18.0 (1.6)     1.0x  336.7 (1.7)
               65536    1     1.0x   19.0 (0.5)     1.0x  341.0 (0.3)
              131072    1     1.0x   19.7 (1.0)     1.0x  335.7 (1.0)
                   0   16     1.0x   19.8 (1.8)     1.0x  338.8 (1.8)
               32768    1     0.9x   20.7 (1.5)     1.0x  337.6 (1.9)
                   0    1     0.9x   21.3 (1.4)     1.0x  339.5 (1.8)
Signed-off-by: Daniel Jordan <daniel.m.jordan@...cle.com>
---
drivers/vfio/vfio_iommu_type1.c | 51 +++++++++++++++++++++++++++++----
1 file changed, 45 insertions(+), 6 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index faee849f1cce..c2edc5a4c727 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -651,7 +651,7 @@ static int vfio_wait_all_valid(struct vfio_iommu *iommu)
static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
long npage, unsigned long *pfn_base,
unsigned long limit, struct vfio_batch *batch,
- struct mm_struct *mm)
+ struct mm_struct *mm, long *lock_cache)
{
unsigned long pfn;
long ret, pinned = 0, lock_acct = 0;
@@ -709,15 +709,25 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
* the user.
*/
if (!rsvd && !vfio_find_vpfn(dma, iova)) {
- if (!dma->lock_cap &&
+ if (!dma->lock_cap && *lock_cache == 0 &&
mm->locked_vm + lock_acct + 1 > limit) {
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
__func__, limit << PAGE_SHIFT);
ret = -ENOMEM;
goto unpin_out;
}
- lock_acct++;
- }
+ /*
+ * Draw from the cache if possible to avoid
+ * taking the write-side mmap_lock in
+ * vfio_lock_acct(), which will alleviate
+ * contention with the read-side mmap_lock in
+ * vaddr_get_pfn().
+ */
+ if (*lock_cache > 0)
+ (*lock_cache)--;
+ else
+ lock_acct++;
+ }
pinned++;
npage--;
@@ -1507,6 +1517,13 @@ static void vfio_pin_map_dma_undo(unsigned long start_vaddr,
vfio_unmap_unpin(args->iommu, args->dma, iova, end, true);
}
+/*
+ * Relieve mmap_lock contention when multithreading page pinning by caching
+ * locked_vm locally. Bound the locked_vm that a thread will cache but not use
+ * with this constant, which compromises between performance and overaccounting.
+ */
+#define LOCKED_VM_CACHE_PAGES 65536
+
static int vfio_pin_map_dma_chunk(unsigned long start_vaddr,
unsigned long end_vaddr, void *arg)
{
@@ -1515,6 +1532,7 @@ static int vfio_pin_map_dma_chunk(unsigned long start_vaddr,
dma_addr_t iova = dma->iova + (start_vaddr - dma->vaddr);
unsigned long unmapped_size = end_vaddr - start_vaddr;
unsigned long pfn, mapped_size = 0;
+ long cache_size, lock_cache = 0;
struct vfio_batch batch;
long npage;
int ret = 0;
@@ -1522,11 +1540,29 @@ static int vfio_pin_map_dma_chunk(unsigned long start_vaddr,
vfio_batch_init(&batch);
while (unmapped_size) {
+ if (lock_cache == 0) {
+ cache_size = min_t(long, unmapped_size >> PAGE_SHIFT,
+ LOCKED_VM_CACHE_PAGES);
+ ret = vfio_lock_acct(dma, cache_size, false);
+ /*
+ * More locked_vm is cached than might be used, so
+ * don't fail on -ENOMEM, i.e. exceeding RLIMIT_MEMLOCK.
+ */
+ if (ret) {
+ if (ret != -ENOMEM) {
+ vfio_batch_unpin(&batch, dma);
+ break;
+ }
+ cache_size = 0;
+ }
+ lock_cache = cache_size;
+ }
+
/* Pin a contiguous chunk of memory */
npage = vfio_pin_pages_remote(dma, start_vaddr + mapped_size,
unmapped_size >> PAGE_SHIFT,
&pfn, args->limit, &batch,
- args->mm);
+ args->mm, &lock_cache);
if (npage <= 0) {
WARN_ON(!npage);
ret = (int)npage;
@@ -1548,6 +1584,7 @@ static int vfio_pin_map_dma_chunk(unsigned long start_vaddr,
}
vfio_batch_fini(&batch);
+ vfio_lock_acct(dma, -lock_cache, false);
/*
* Undo the successfully completed part of this chunk now. padata will
@@ -1771,6 +1808,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
struct rb_node *n;
unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
int ret;
+ long lock_cache = 0;
ret = vfio_wait_all_valid(iommu);
if (ret < 0)
@@ -1832,7 +1870,8 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
n >> PAGE_SHIFT,
&pfn, limit,
&batch,
- current->mm);
+ current->mm,
+ &lock_cache);
if (npage <= 0) {
WARN_ON(!npage);
ret = (int)npage;
--
2.34.1