netdev - Re: recvfrom/recvmsg performance and CONFIG_HARDENED

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8b8a3cc1c3341912e0db5c55cd0e504dd4371588.camel@redhat.com>
Date:   Fri, 06 Dec 2019 17:09:35 +0100
From:   Paolo Abeni <pabeni@...hat.com>
To:     Eric Dumazet <eric.dumazet@...il.com>,
        David Laight <David.Laight@...LAB.COM>,
        network dev <netdev@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: recvfrom/recvmsg performance and CONFIG_HARDENED_USERCOPY

On Fri, 2019-12-06 at 06:21 -0800, Eric Dumazet wrote:
> 
> On 12/6/19 5:39 AM, David Laight wrote:
> > Some tests I've done seem to show that recvmsg() is much slower that recvfrom()
> > even though most of what they do is the same.
> 
> Not really.
> 
> > One thought is that the difference is all the extra copy_from_user() needed by
> > recvmsg. CONFIG_HARDENED_USERCOPY can add a significant cost.
> > 
> > I've built rebuilt my 5.4-rc7 kernel with all the copy_to/from_user() in net/socket.c
> > replaced with the '_' prefixed versions (that don't call check_object()).
> > And also changed rw_copy_check_uvector() in fs/read_write.c.
> > 
> > Schedviz then showed the time spent by the application thread that calls
> > recvmsg() (about) 225 times being reduced from 0.9ms to 0.75ms.
> > 
> > I've now instrumented the actual recv calls. It show some differences,
> > but now enough to explain the 20% difference above.
> > (This is all made more difficult because my Ivy Bridge i7-3770 refuses
> > to run at a fixed frequency.)
> > 
> > Anyway using PERF_COUNT_HW_CPU_CYCLES I've got the following
> > histograms for the number of cycles in each recv call.
> > There are about the same number (2.8M) in each column over
> > an elapsed time of 20 seconds.
> > There are 450 active UDP sockets, each receives 1 message every 20ms.
> > Every 10ms a RT thread that is pinned to a cpu reads all the pending messages.
> > This is a 4 core hyperthreading (8 cpu) system.
> > During these tests 5 other threads are also busy.
> > There are no sends (on those sockets).
> > 
> >          |       recvfrom      |       recvmsg
> >  cycles  |   unhard  |    hard |   unhard  |    hard
> > -----------------------------------------------------
> >    1472:         29          1          0          0
> >    1600:       8980       4887          3          0
> >    1728:     112540     159518       5393       2895
> >    1856:     174555     270148     119054     111230
> >    1984:     126007     168383     152310     195288
> >    2112:      80249      87045     118941     168801
> >    2240:      61570      54790      81847     110561
> >    2368:      95088      61796      57496      71732
> >    2496:     193633     155870      54020      54801
> >    2624:     274997     284921     102465      74626
> >    2752:     276661     295715     160492     119498
> >    2880:     248751     264174     206327     186028
> >    3008:     207532     213067     230704     229232
> >    3136:     167976     164804     226493     238555
> >    3264:     133708     124857     202639     220574
> >    3392:     107859      95696     172949     189475
> >    3520:      88599      75943     141056     153524
> >    3648:      74290      61586     115873     120994
> >    3776:      62253      50891      96061      95040
> >    3904:      52213      42482      81113      76577
> >    4032:      42920      34632      69077      63131
> >    4160:      35472      28327      60074      53631
> >    4288:      28787      22603      51345      46620
> >    4416:      24072      18496      44006      40325
> >    4544:      20107      14886      37185      34516
> >    4672:      16759      12206      31408      29031
> >    4800:      14195       9991      26843      24396
> >    4928:      12356       8167      22775      20165
> >    5056:      10387       6931      19404      16591
> >    5184:       9284       5916      16817      13743
> >    5312:       7994       5116      14737      11452
> >    5440:       7152       4495      12592       9607
> >    5568:       6300       3969      11117       8592
> >    5696:       5445       3421       9988       7237
> >    5824:       4683       2829       8839       6368
> >    5952:       3959       2643       7652       5652
> >    6080:       3454       2377       6442       4814
> >    6208:       3041       2219       5735       4170
> >    6336:       2840       2060       5059       3615
> >    6464:       2428       1975       4433       3201
> >    6592:       2109       1794       4078       2823
> >    6720:       1871       1382       3549       2558
> >    6848:       1706       1262       3110       2328
> >    6976:       1567       1001       2733       1991
> >    7104:       1436        873       2436       1819
> >    7232:       1417        860       2102       1652
> >    7360:       1414        741       1823       1429
> >    7488:       1372        814       1663       1239
> >    7616:       1201        896       1430       1152
> >    7744:       1275       1008       1364       1049
> >    7872:       1382       1120       1367        925
> >    8000:       1316       1282       1253        815
> >    8128:       1264       1266       1313        792
> >   8256+:      19252      19450      34703      30228
> > ----------------------------------------------------
> >   Total:    2847707    2863582    2853688    2877088
> > 
> > This does show a few interesting things:
> > 1) The 'hardened' kernel is slower, especially for recvmsg.
> > 2) The difference for recvfrom isn't enough for the 20% reduction I saw.
> > 3) There are two peaks at the top a 'not insubstantial' number are a lot
> >    faster than the main peak.
> > 4) There is second peak way down at 8000 cycles.
> >    This is repeatable.
> > 
> > Any idea what is actually going on??
> > 
> 
> Real question is : Do you actually need to use recvmsg() instead of recvfrom() ?
> 
> If recvmsg() provides additional cmsg, this is not surprising it is more expensive.
> 
> recvmsg() also uses an indirect call, so CONFIG_RETPOLINE=y is probably hurting.
> 
> err = (nosec ? sock_recvmsg_nosec : sock_recvmsg)(sock, msg_sys, flags);
> 
> Maybe a INDIRECT_CALL annotation could help, or rewriting this to not let gcc
> use an indirect call.
> 
> 
> diff --git a/net/socket.c b/net/socket.c
> index ea28cbb9e2e7a7180ee63de2d09a81aacb001ab7..752714281026dab6db850ec7fa75b7aa6240661f 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -2559,7 +2559,10 @@ static int ____sys_recvmsg(struct socket *sock, struct msghdr *msg_sys,
>  
>         if (sock->file->f_flags & O_NONBLOCK)
>                 flags |= MSG_DONTWAIT;
> -       err = (nosec ? sock_recvmsg_nosec : sock_recvmsg)(sock, msg_sys, flags);
> +       if (nosec)
> +               err = sock_recvmsg_nosec(sock, msg_sys, flags);
> +       else
> +               err = sock_recvmsg(sock, msg_sys, flags);
>         if (err < 0)
>                 goto out;
>         len = err;

Oh, nice! I though the compiler was smart enough to avoid the indirect
call with the current code, but it looks like that least gcc 9.2.1 is
not.

Thanks for pointing that out!

In this specific scenario I think the code you propose above is better
than INDIRECT_CALL.

Would you submit the patch formally?

Thank you!

Paolo