netdev - Re: [PATCH v3 net-next 3/3] tcp: add one skb cache for rx

Message-ID: <6a065bc7-ea28-79f8-1479-29261366721a@gmail.com>
Date:   Wed, 3 Apr 2019 01:15:54 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Jakub Kicinski <jakub.kicinski@...ronome.com>,
        Eric Dumazet <edumazet@...gle.com>
Cc:     "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Soheil Hassas Yeganeh <soheil@...gle.com>,
        Willem de Bruijn <willemb@...gle.com>
Subject: Re: [PATCH v3 net-next 3/3] tcp: add one skb cache for rx



On 04/02/2019 06:17 PM, Jakub Kicinski wrote:
> On Fri, 22 Mar 2019 08:56:40 -0700, Eric Dumazet wrote:
>> Often times, recvmsg() system calls and BH handling for a particular
>> TCP socket are done on different cpus.
>>
>> This means the incoming skb has to be allocated on one cpu,
>> but freed on another.
>>
>> This incurs high spinlock contention in the slab layer for small RPCs,
>> and also a high number of cache line ping pongs for larger packets.
>>
>> A full size GRO packet might use 45 page fragments, meaning
>> that up to 45 put_page() can be involved.
>>
>> Moreover, performing the __kfree_skb() in the recvmsg() context
>> adds latency for user applications, and increases the probability
>> of trapping them in backlog processing, since the BH handler
>> might find the socket owned by the user.
>>
>> This patch, combined with the prior one, increases rpc
>> performance by about 10 % on servers with a large number of cores.
>>
>> (tcp_rr workload with 10,000 flows and 112 threads reaches 9 Mpps
>>  instead of 8 Mpps)
>>
>> This also increases single bulk flow performance on 40Gbit+ links,
>> since in this case there are often two cpus working in tandem :
>>
>>  - CPU handling the NIC rx interrupts, feeding the receive queue,
>>   and (after this patch) freeing the skbs that were consumed.
>>
>>  - CPU in recvmsg() system call, essentially 100 % busy copying out
>>   data to user space.
>>
>> Having at most one skb in a per-socket cache has very little risk
>> of memory exhaustion, and since it is protected by socket lock,
>> its management is essentially free.
>>
>> Note that if rps/rfs is used, we do not enable this feature, because
>> there is a high chance that the same cpu is handling both the recvmsg()
>> system call and the TCP rx path, but that another cpu did the skb
>> allocations in the device driver right before the RPS/RFS logic.
>>
>> To properly handle this case, it seems we would need to record
>> on which cpu the skb was allocated, and use a different channel
>> to give skbs back to this cpu.
>>
>> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
>> Acked-by: Soheil Hassas Yeganeh <soheil@...gle.com>
>> Acked-by: Willem de Bruijn <willemb@...gle.com>
> 
> Hi Eric!
> 
> Somehow this appears to make ktls run out of stack:

Are you sure it is this commit, and not another one?

The tx part was buggy (recycling is harder); the rx part simply defers the freeing.
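
To make "deferring the freeing" concrete, here is a minimal user-space sketch of a one-slot cache. It is a model only, not the code from the patch: struct buf, struct sock_model, consumer_release(), producer_reap() and run_rounds() are hypothetical names, free() stands in for __kfree_skb(), and the behavior when the slot is already occupied (free immediately) is an assumption. Both helpers are assumed to run with the socket lock held, which is why the slot needs no locking of its own.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an skb. */
struct buf {
	int id;
};

/* Hypothetical stand-in for a socket with a one-slot rx cache. */
struct sock_model {
	struct buf *rx_cache;   /* at most one consumed buffer awaiting free */
	int freed_in_consumer;  /* frees paid for in the recvmsg()-like path */
	int freed_in_producer;  /* frees paid for in the BH-like path */
};

/*
 * Consumer side (recvmsg()-like), socket lock assumed held.
 * Park the consumed buffer if the slot is empty, so that the
 * producer frees it later; otherwise free it right away
 * (assumption about the fallback).
 */
static void consumer_release(struct sock_model *sk, struct buf *b)
{
	assert(b != NULL);
	if (sk->rx_cache == NULL) {
		sk->rx_cache = b;
		return;
	}
	free(b);
	sk->freed_in_consumer++;
}

/*
 * Producer side (BH-like), socket lock assumed held.
 * Free the parked buffer, if any, on the producer's side.
 */
static void producer_reap(struct sock_model *sk)
{
	if (sk->rx_cache != NULL) {
		free(sk->rx_cache);
		sk->rx_cache = NULL;
		sk->freed_in_producer++;
	}
}

/* Each round: producer reaps, allocates one buffer, consumer eats it. */
static void run_rounds(struct sock_model *sk, int rounds)
{
	for (int i = 0; i < rounds; i++) {
		producer_reap(sk);
		struct buf *b = malloc(sizeof(*b));
		assert(b != NULL);
		b->id = i;
		consumer_release(sk, b);
	}
	producer_reap(sk); /* drain the slot at the end */
}
```

With a zero-initialized struct sock_model, run_rounds(&sk, 5) leaves every free on the producer side. The consumer only pays for a free when it releases two buffers without the producer running in between.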

> 
> [  132.022746][ T1597] BUG: stack guard page was hit at 00000000d40fad41 (stack is 0000000029dde9f4..000000008cce03d5)
> [  132.034492][ T1597] kernel stack overflow (double-fault): 0000 [#1] PREEMPT SMP
> [  132.042733][ T1597] CPU: 1 PID: 1597 Comm: hurl Not tainted 5.1.0-rc2-perf-00642-g179e7e21995d-dirty #683
> [  132.053500][ T1597] Hardware name: ...
> [  132.062714][ T1597] RIP: 0010:free_one_page+0x2b/0x490
> [  132.068526][ T1597] Code: 1f 44 00 00 41 57 48 8d 87 40 05 00 00 49 89 f7 41 56 49 89 d6 41 55 41 54 49 89 fc 48 89 c7 55 89 cd 532
> [  132.090369][ T1597] RSP: 0018:ffffb46c03d9fff8 EFLAGS: 00010092
> [  132.097054][ T1597] RAX: ffff91ed7fffd240 RBX: 0000000000000000 RCX: 0000000000000003
> [  132.105874][ T1597] RDX: 0000000000469c68 RSI: ffffd6e151a71a00 RDI: ffff91ed7fffd240
> [  132.114697][ T1597] RBP: 0000000000000003 R08: 0000000000000000 R09: dead000000000200
> [  132.123521][ T1597] R10: ffffd6e151a71808 R11: 0000000000000000 R12: ffff91ed7fffcd00
> [  132.132344][ T1597] R13: ffffd6e140000000 R14: 0000000000469c68 R15: ffffd6e151a71a00
> [  132.141209][ T1597] FS:  00007f1545154700(0000) GS:ffff91f16f600000(0000) knlGS:0000000000000000
> [  132.151143][ T1597] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  132.158433][ T1597] CR2: ffffb46c03d9ffe8 CR3: 00000004587e6006 CR4: 00000000003606e0
> [  132.167299][ T1597] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  132.176166][ T1597] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  132.185027][ T1597] Call Trace:
> [  132.188628][ T1597]  __free_pages_ok+0x143/0x2c0
> [  132.193881][ T1597]  skb_release_data+0x8e/0x140
> [  132.199131][ T1597]  ? skb_release_data+0xad/0x140
> [  132.204566][ T1597]  kfree_skb+0x32/0xb0
> 
> [...]
> 
> [  135.889113][ T1597]  skb_release_data+0xad/0x140
> [  135.894363][ T1597]  ? skb_release_data+0xad/0x140
> [  135.899806][ T1597]  kfree_skb+0x32/0xb0
> [  135.904279][ T1597]  skb_release_data+0xad/0x140
> [  135.909528][ T1597]  ? skb_release_data+0xad/0x140
> [  135.914972][ T1597]  kfree_skb+0x32/0xb0
> [  135.919444][ T1597]  skb_release_data+0xad/0x140
> [  135.924694][ T1597]  ? skb_release_data+0xad/0x140
> [  135.930138][ T1597]  kfree_skb+0x32/0xb0
> [  135.934610][ T1597]  skb_release_data+0xad/0x140
> [  135.939860][ T1597]  ? skb_release_data+0xad/0x140
> [  135.945295][ T1597]  kfree_skb+0x32/0xb0
> [  135.949767][ T1597]  skb_release_data+0xad/0x140
> [  135.955017][ T1597]  __kfree_skb+0xe/0x20
> [  135.959578][ T1597]  tcp_disconnect+0xd6/0x4d0
> [  135.964632][ T1597]  tcp_close+0xf4/0x430
> [  135.969200][ T1597]  ? tcp_check_oom+0xf0/0xf0
> [  135.974255][ T1597]  tls_sk_proto_close+0xe4/0x1e0 [tls]
> [  135.980283][ T1597]  inet_release+0x36/0x60
> [  135.985047][ T1597]  __sock_release+0x37/0xa0
> [  135.990004][ T1597]  sock_close+0x11/0x20
> [  135.994574][ T1597]  __fput+0xa2/0x1d0
> [  135.998853][ T1597]  task_work_run+0x89/0xb0
> [  136.003715][ T1597]  exit_to_usermode_loop+0x9a/0xa0
> [  136.009345][ T1597]  do_syscall_64+0xc0/0xf0
> [  136.014207][ T1597]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  136.020710][ T1597] RIP: 0033:0x7f1546cb5447
> [  136.025570][ T1597] Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 53 89 fb 48 83 ec 10 e8 c4 fb ff ff 89 df 89 c24
> [  136.047476][ T1597] RSP: 002b:00007f1545153ba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
> [  136.056827][ T1597] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007f1546cb5447
> [  136.065692][ T1597] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000008
> [  136.074556][ T1597] RBP: 00007f1538000b20 R08: 0000000000000008 R09: 0000000000000000
> [  136.083419][ T1597] R10: 00007f1545153bc0 R11: 0000000000000293 R12: 00005631f41cf1a0
> [  136.092285][ T1597] R13: 00005631f41cf1b8 R14: 00007f1538003330 R15: 00007f1538003330
> [  136.101151][ T1597] Modules linked in: ctr ghash_generic gf128mul gcm rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache bis
> [  136.150271][ T1597] ---[ end trace 67081a0c8ea38611 ]---
> 
> 
> This is hurl <> nginx running over loopback doing a 100 MB GET.
> 
> 🙄
>
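
The quoted trace shows kfree_skb() and skb_release_data() frames repeating until the stack guard page is hit. As a generic illustration only, and not a diagnosis of this report, the user-space sketch below shows why freeing a deeply nested chain recursively costs one stack frame per level while an iterative walk does not. struct node, free_recursive(), free_iterative() and build_nested() are hypothetical names and do not correspond to kernel functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node: each buffer owns at most one nested child. */
struct node {
	struct node *child;
};

static int cur_depth;   /* current recursion depth */
static int max_depth;   /* deepest recursion seen so far */

/* Recursive free: one stack frame per nesting level. */
static void free_recursive(struct node *n)
{
	if (n == NULL)
		return;
	cur_depth++;
	if (cur_depth > max_depth)
		max_depth = cur_depth;
	free_recursive(n->child);
	free(n);
	cur_depth--;
}

/* Iterative free: constant stack use; returns the number of nodes freed. */
static int free_iterative(struct node *n)
{
	int freed = 0;

	while (n != NULL) {
		struct node *next = n->child;

		free(n);
		n = next;
		freed++;
	}
	return freed;
}

/* Build a chain nested 'levels' deep and return its head. */
static struct node *build_nested(int levels)
{
	struct node *head = NULL;

	for (int i = 0; i < levels; i++) {
		struct node *n = malloc(sizeof(*n));

		assert(n != NULL);
		n->child = head;
		head = n;
	}
	return head;
}
```

Calling free_recursive(build_nested(100)) drives max_depth to 100, whereas free_iterative() releases the same chain from a single frame.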