[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <28b421af-838d-e70a-ec95-2f14f21e3a90@igel.co.jp>
Date: Wed, 11 Jan 2023 13:10:15 +0900
From: Shunsuke Mie <mie@...l.co.jp>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: Jason Wang <jasowang@...hat.com>,
Rusty Russell <rusty@...tcorp.com.au>, kvm@...r.kernel.org,
virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 4/9] vringh: unify the APIs for all accessors
On 2022/12/28 16:20, Michael S. Tsirkin wrote:
> On Wed, Dec 28, 2022 at 11:24:10AM +0900, Shunsuke Mie wrote:
>> 2022年12月27日(火) 23:37 Michael S. Tsirkin <mst@...hat.com>:
>>> On Tue, Dec 27, 2022 at 07:22:36PM +0900, Shunsuke Mie wrote:
>>>> 2022年12月27日(火) 16:49 Shunsuke Mie <mie@...l.co.jp>:
>>>>> 2022年12月27日(火) 16:04 Michael S. Tsirkin <mst@...hat.com>:
>>>>>> On Tue, Dec 27, 2022 at 11:25:26AM +0900, Shunsuke Mie wrote:
>>>>>>> Each vringh memory accessors that are for user, kern and iotlb has own
>>>>>>> interfaces that calls common code. But some codes are duplicated and that
>>>>>>> becomes loss extendability.
>>>>>>>
>>>>>>> Introduce a struct vringh_ops and provide a common APIs for all accessors.
>>>>>>> It can bee easily extended vringh code for new memory accessor and
>>>>>>> simplified a caller code.
>>>>>>>
>>>>>>> Signed-off-by: Shunsuke Mie <mie@...l.co.jp>
>>>>>>> ---
>>>>>>> drivers/vhost/vringh.c | 667 +++++++++++------------------------------
>>>>>>> include/linux/vringh.h | 100 +++---
>>>>>>> 2 files changed, 225 insertions(+), 542 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
>>>>>>> index aa3cd27d2384..ebfd3644a1a3 100644
>>>>>>> --- a/drivers/vhost/vringh.c
>>>>>>> +++ b/drivers/vhost/vringh.c
>>>>>>> @@ -35,15 +35,12 @@ static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
>>>>>>> }
>>>>>>>
>>>>>>> /* Returns vring->num if empty, -ve on error. */
>>>>>>> -static inline int __vringh_get_head(const struct vringh *vrh,
>>>>>>> - int (*getu16)(const struct vringh *vrh,
>>>>>>> - u16 *val, const __virtio16 *p),
>>>>>>> - u16 *last_avail_idx)
>>>>>>> +static inline int __vringh_get_head(const struct vringh *vrh, u16 *last_avail_idx)
>>>>>>> {
>>>>>>> u16 avail_idx, i, head;
>>>>>>> int err;
>>>>>>>
>>>>>>> - err = getu16(vrh, &avail_idx, &vrh->vring.avail->idx);
>>>>>>> + err = vrh->ops.getu16(vrh, &avail_idx, &vrh->vring.avail->idx);
>>>>>>> if (err) {
>>>>>>> vringh_bad("Failed to access avail idx at %p",
>>>>>>> &vrh->vring.avail->idx);
>>>>>> I like that this patch removes more lines of code than it adds.
>>>>>>
>>>>>> However one of the design points of vringh abstractions is that they were
>>>>>> carefully written to be very low overhead.
>>>>>> This is why we are passing function pointers to inline functions -
>>>>>> compiler can optimize that out.
>>>>>>
>>>>>> I think that introducing ops indirect functions calls here is going to break
>>>>>> these assumptions and hurt performance.
>>>>>> Unless compiler can somehow figure it out and optimize?
>>>>>> I don't see how it's possible with ops pointer in memory
>>>>>> but maybe I'm wrong.
>>>>> I think your concern is correct. I have to understand the compiler
>>>>> optimization and redesign this approach If it is needed.
>>>>>> Was any effort taken to test effect of these patches on performance?
>>>>> I just tested vringh_test and already faced little performance reduction.
>>>>> I have to investigate that, as you said.
>>>> I attempted to test with perf. I found that the performance of patched code
>>>> is almost the same as the upstream one. However, I have to investigate way
>>>> this patch leads to this result, also the profiling should be run on
>>>> more powerful
>>>> machines too.
>>>>
>>>> environment:
>>>> $ grep 'model name' /proc/cpuinfo
>>>> model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
>>>> model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
>>>> model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
>>>> model name : Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
>>>>
>>>> results:
>>>> * for patched code
>>>> Performance counter stats for 'nice -n -20 ./vringh_test_patched
>>>> --parallel --eventidx --fast-vringh --indirect --virtio-1' (20 runs):
>>>>
>>>> 3,028.05 msec task-clock # 0.995 CPUs
>>>> utilized ( +- 0.12% )
>>>> 78,150 context-switches # 25.691 K/sec
>>>> ( +- 0.00% )
>>>> 5 cpu-migrations # 1.644 /sec
>>>> ( +- 3.33% )
>>>> 190 page-faults # 62.461 /sec
>>>> ( +- 0.41% )
>>>> 6,919,025,222 cycles # 2.275 GHz
>>>> ( +- 0.13% )
>>>> 8,990,220,160 instructions # 1.29 insn per
>>>> cycle ( +- 0.04% )
>>>> 1,788,326,786 branches # 587.899 M/sec
>>>> ( +- 0.05% )
>>>> 4,557,398 branch-misses # 0.25% of all
>>>> branches ( +- 0.43% )
>>>>
>>>> 3.04359 +- 0.00378 seconds time elapsed ( +- 0.12% )
>>>>
>>>> * for upstream code
>>>> Performance counter stats for 'nice -n -20 ./vringh_test_base
>>>> --parallel --eventidx --fast-vringh --indirect --virtio-1' (10 runs):
>>>>
>>>> 3,058.41 msec task-clock # 0.999 CPUs
>>>> utilized ( +- 0.14% )
>>>> 78,149 context-switches # 25.545 K/sec
>>>> ( +- 0.00% )
>>>> 5 cpu-migrations # 1.634 /sec
>>>> ( +- 2.67% )
>>>> 194 page-faults # 63.414 /sec
>>>> ( +- 0.43% )
>>>> 6,988,713,963 cycles # 2.284 GHz
>>>> ( +- 0.14% )
>>>> 8,512,533,269 instructions # 1.22 insn per
>>>> cycle ( +- 0.04% )
>>>> 1,638,375,371 branches # 535.549 M/sec
>>>> ( +- 0.05% )
>>>> 4,428,866 branch-misses # 0.27% of all
>>>> branches ( +- 22.57% )
>>>>
>>>> 3.06085 +- 0.00420 seconds time elapsed ( +- 0.14% )
>>>
>>> How you compiled it also matters. ATM we don't enable retpolines
>>> and it did not matter since we didn't have indirect calls,
>>> but we should. Didn't yet investigate how to do that for virtio tools.
>> I think the retpolines certainly affect performance. Thank you for pointing
>> it out. I'd like to start the investigation that how to apply the
>> retpolines to the
>> virtio tools.
>>>>> Thank you for your comments.
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>> Best,
>>>>> Shunsuke.
> This isn't all that trivial if we want this at runtime.
> But compile time is kind of easy.
> See Documentation/admin-guide/hw-vuln/spectre.rst
Thank you for showing it.
I followed the document and added options to CFLAGS to the tools Makefile.
That is
---
diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
index 1b25cc7c64bb..7b7139d97d74 100644
--- a/tools/virtio/Makefile
+++ b/tools/virtio/Makefile
@@ -4,7 +4,7 @@ test: virtio_test vringh_test
virtio_test: virtio_ring.o virtio_test.o
vringh_test: vringh_test.o vringh.o virtio_ring.o
-CFLAGS += -g -O2 -Werror -Wno-maybe-uninitialized -Wall -I.
-I../include/ -I ../../usr/include/ -Wno-pointer-sign
-fno-strict-overflow -fno-strict-aliasing -fno-common -MMD
-U_FORTIFY_SOURCE -include ../../include/linux/kconfig.h
+CFLAGS += -g -O2 -Werror -Wno-maybe-uninitialized -Wall -I.
-I../include/ -I ../../usr/include/ -Wno-pointer-sign
-fno-strict-overflow -fno-strict-aliasing -fno-common -MMD
-U_FORTIFY_SOURCE -include ../../include/linux/kconfig.h
-mfunction-return=thunk -fcf-protection=none -mindirect-branch-register
CFLAGS += -pthread
LDFLAGS += -pthread
vpath %.c ../../drivers/virtio ../../drivers/vhost
---
And results of evaluation are following:
- base with retpoline
$ sudo perf stat --repeat 20 -- nice -n -20 ./vringh_test_retp_origin
--parallel --eventidx --fast-vringh
Using CPUS 0 and 3
Guest: notified 0, pinged 98040
Host: notified 98040, pinged 0
...
Performance counter stats for 'nice -n -20 ./vringh_test_retp_origin
--parallel --eventidx --fast-vringh' (20 runs):
6,228.33 msec task-clock # 1.004 CPUs
utilized ( +- 0.05% )
196,110 context-switches # 31.616
K/sec ( +- 0.00% )
6 cpu-migrations # 0.967
/sec ( +- 2.39% )
205 page-faults # 33.049
/sec ( +- 0.46% )
14,218,527,987 cycles # 2.292
GHz ( +- 0.05% )
10,342,897,254 instructions # 0.73 insn per
cycle ( +- 0.02% )
2,310,572,989 branches # 372.500
M/sec ( +- 0.03% )
178,273,068 branch-misses # 7.72% of all
branches ( +- 0.04% )
6.20406 +- 0.00308 seconds time elapsed ( +- 0.05% )
- patched (unified APIs) with retpoline
$ sudo perf stat --repeat 20 -- nice -n -20 ./vringh_test_retp_patched
--parallel --eventidx --fast-vringh
Using CPUS 0 and 3
Guest: notified 0, pinged 98040
Host: notified 98040, pinged 0
...
Performance counter stats for 'nice -n -20 ./vringh_test_retp_patched
--parallel --eventidx --fast-vringh' (20 runs):
6,103.94 msec task-clock # 1.001 CPUs
utilized ( +- 0.03% )
196,125 context-switches # 32.165
K/sec ( +- 0.00% )
7 cpu-migrations # 1.148
/sec ( +- 1.56% )
196 page-faults # 32.144
/sec ( +- 0.41% )
13,933,055,778 cycles # 2.285
GHz ( +- 0.03% )
10,309,004,718 instructions # 0.74 insn per
cycle ( +- 0.03% )
2,368,447,519 branches # 388.425
M/sec ( +- 0.04% )
211,364,886 branch-misses # 8.94% of all
branches ( +- 0.05% )
6.09888 +- 0.00155 seconds time elapsed ( +- 0.03% )
As a result, at the patched code, the branch-misses was increased but
elapsed time became faster than the based code. The number of
page-faults was
a little different. I'm suspicious of that the page-fault penalty leads the
performance result.
I think that a pattern of memory access for data is same with those, but
for instruction is different. Actually a code size (.text segment) was a
little smaller. 0x6a65 and 0x63f5.
$ readelf -a ./vringh_test_retp_origin |grep .text -1
0000000000000008 0000000000000008 AX 0 0 8
[14] .text PROGBITS 0000000000001230 00001230
0000000000006a65 0000000000000000 AX 0 0 16
--
02 .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym
.dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
03 .init .plt .plt.got .text .fini
04 .rodata .eh_frame_hdr .eh_frame
$ readelf -a ./vringh_test_retp_patched |grep .text -1
0000000000000008 0000000000000008 AX 0 0 8
[14] .text PROGBITS 0000000000001230 00001230
00000000000063f5 0000000000000000 AX 0 0 16
--
02 .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym
.dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
03 .init .plt .plt.got .text .fini
04 .rodata .eh_frame_hdr .eh_frame
I'll keep this investigation. I was wondering if you could comment me.
Best
>
Powered by blists - more mailing lists