Message-ID: <CAOesGMi31UA2d-Bj2jo53Wz_YV424-rD3qk9rS5_-Yng0VC=0w@mail.gmail.com>
Date: Sat, 8 Sep 2018 10:02:42 -0700
From: Olof Johansson <olof@...om.net>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Herbert Xu <herbert@...dor.apana.org.au>,
David Miller <davem@...emloft.net>,
Neil Horman <nhorman@...driver.com>,
Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
Vladislav Yasevich <vyasevich@...il.com>,
Alexey Kuznetsov <kuznet@....inr.ac.ru>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
linux-crypto@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
linux-sctp@...r.kernel.org, netdev <netdev@...r.kernel.org>,
linux-decnet-user@...ts.sourceforge.net,
kernel-team <kernel-team@...com>,
Yuchung Cheng <ycheng@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>
Subject: Re: [PATCH] net/sock: move memory_allocated over to percpu_counter variables
Hi,
On Fri, Sep 7, 2018 at 12:21 AM, Eric Dumazet <edumazet@...gle.com> wrote:
> On Fri, Sep 7, 2018 at 12:03 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
>> The problem is: we have platforms with more than 100 cpus, where the
>> cost of sk_memory_allocated() will be too expensive, especially when
>> the host is under memory pressure, since every cpu will have touched
>> its private counter.
>>
>> Per-cpu variables do not really scale; they were fine 10 years ago,
>> when no more than 16 cpus were the norm.
>>
>> I would prefer changing TCP not to call __sk_mem_reduce_allocated()
>> so aggressively from tcp_write_timer().
>>
>> Ideally, only tcp_retransmit_timer() should attempt to reduce forward
>> allocations, after a recurring timeout.
>>
>> Note that after 20c64d5cd5a2bdcdc8982a06cb05e5e1bd851a3d ("net: avoid
>> sk_forward_alloc overflows")
>> we have better control over sockets having huge forward allocations.
>>
>> Something like :
>
> Or something less risky :
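If I'm reading the first suggestion right, it is roughly the sketch
below; the icsk_retransmits > 1 test as a reading of "recurring
timeout" is my guess, not taken from the (elided) patch:

#include <net/inet_connection_sock.h>
#include <net/sock.h>

/* Hypothetical helper: called from the write-timer path instead of the
 * unconditional sk_mem_reclaim(sk) that tcp_write_timer_handler() does
 * today, so forward allocations only get reduced once the retransmit
 * timeout has actually recurred. */
static void tcp_timer_mem_reclaim(struct sock *sk, int event)
{
	const struct inet_connection_sock *icsk = inet_csk(sk);

	if (event == ICSK_TIME_RETRANS && icsk->icsk_retransmits > 1)
		sk_mem_reclaim(sk);
}
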
I gave both of these patches a run, and neither does as well on the
system that has slower atomics. :(
The percpu version:

  8.05%  workload  [kernel.vmlinux]  [k] __do_softirq
  7.04%  swapper   [kernel.vmlinux]  [k] cpuidle_enter_state
  5.54%  workload  [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
  1.66%  swapper   [kernel.vmlinux]  [k] __do_softirq
  1.55%  workload  [kernel.vmlinux]  [k] finish_task_switch
  1.24%  swapper   [kernel.vmlinux]  [k] finish_task_switch
  1.07%  workload  [kernel.vmlinux]  [k] net_rx_action
The first patch from you still spends a significant amount of time in
the atomics paths (non-inlined versions used):

  7.87%  workload  [kernel.vmlinux]  [k] __ll_sc_atomic64_sub
  7.48%  workload  [kernel.vmlinux]  [k] __do_softirq
  5.05%  workload  [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
  2.42%  workload  [kernel.vmlinux]  [k] __ll_sc_atomic64_add_return
  1.49%  swapper   [kernel.vmlinux]  [k] cpuidle_enter_state
  1.31%  workload  [kernel.vmlinux]  [k] finish_task_switch
  1.09%  workload  [kernel.vmlinux]  [k] tcp_sendmsg_locked
  1.08%  workload  [kernel.vmlinux]  [k] __arch_copy_from_user
  1.02%  workload  [kernel.vmlinux]  [k] net_rx_action
I think a lot of the overhead of the percpu approach can be alleviated
if we can use percpu_counter_read() instead of _sum() (i.e. no need to
iterate through each cpu's recent local delta). I don't know the TCP
stack well enough to tell where it's OK to tolerate a bit of slack in
the numbers, though -- by default the count will be off by at most
32 * (number of online cpus). That might not be a significant error in
practice.
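Concretely, what I have in mind is the cheap reader below; the sketch
assumes memory_allocated has been converted to a struct percpu_counter
as in this patch, and the helper names are mine:

#include <linux/percpu_counter.h>

/* Approximate but O(1): returns only the shared fbc->count, skipping
 * the per-cpu deltas. */
static inline s64 sk_memory_allocated_fast(struct percpu_counter *memory_allocated)
{
	return percpu_counter_read(memory_allocated);
}

/* What percpu_counter_sum() does instead (simplified, locking elided):
 * walk every online cpu and fold in its private delta, pulling one
 * remote cacheline per cpu -- the part that hurts with 100+ cpus. */
static s64 sk_memory_allocated_exact(struct percpu_counter *fbc)
{
	s64 ret = fbc->count;
	int cpu;

	for_each_online_cpu(cpu)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	return ret;
}

With the default percpu_counter batch of 32, each cpu's private delta
stays within +/-32 before it is folded into the shared count, which is
where the 32 * online-cpus bound above comes from.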
-Olof