[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADVnQyk9Vt-EV6yovMHkRaXsUYwDXpA=iBheUEaH71SLzMA03g@mail.gmail.com>
Date: Thu, 14 Oct 2021 13:13:58 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: "David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
netdev <netdev@...r.kernel.org>,
Eric Dumazet <edumazet@...gle.com>,
Stefan Bach <sfb@...gle.com>
Subject: Re: [PATCH v2 net-next] tcp: switch orphan_count to bare per-cpu counters
On Thu, Oct 14, 2021 at 9:41 AM Eric Dumazet <eric.dumazet@...il.com> wrote:
>
> From: Eric Dumazet <edumazet@...gle.com>
>
> Use of percpu_counter structure to track count of orphaned
> sockets is causing problems on modern hosts with 256 cpus
> or more.
>
> Stefan Bach reported a serious spinlock contention in real workloads,
> that I was able to reproduce with a netfilter rule dropping
> incoming FIN packets.
>
> 53.56% server [kernel.kallsyms] [k] queued_spin_lock_slowpath
> |
> ---queued_spin_lock_slowpath
> |
> --53.51%--_raw_spin_lock_irqsave
> |
> --53.51%--__percpu_counter_sum
> tcp_check_oom
> |
> |--39.03%--__tcp_close
> | tcp_close
> | inet_release
> | inet6_release
> | sock_close
> | __fput
> | ____fput
> | task_work_run
> | exit_to_usermode_loop
> | do_syscall_64
> | entry_SYSCALL_64_after_hwframe
> | __GI___libc_close
> |
> --14.48%--tcp_out_of_resources
> tcp_write_timeout
> tcp_retransmit_timer
> tcp_write_timer_handler
> tcp_write_timer
> call_timer_fn
> expire_timers
> __run_timers
> run_timer_softirq
> __softirqentry_text_start
>
> As explained in commit cf86a086a180 ("net/dst: use a smaller percpu_counter
> batch for dst entries accounting"), default batch size is too big
> for the default value of tcp_max_orphans (262144).
>
> But even if we reduce batch sizes, there would still be cases
> where the estimated count of orphans is beyond the limit,
> and where tcp_too_many_orphans() has to call the expensive
> percpu_counter_sum_positive().
>
> One solution is to use plain per-cpu counters, and have
> a timer to periodically refresh this cache.
>
> Updating this cache every 100ms seems about right, tcp pressure
> state is not radically changing over shorter periods.
>
> percpu_counter was nice 15 years ago while hosts had less
> than 16 cpus, not anymore by current standards.
>
> v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
> reported by kernel test robot <lkp@...el.com>
> Remove unused socket argument from tcp_too_many_orphans()
>
> Fixes: dd24c00191d5 ("net: Use a percpu_counter for orphan_count")
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Reported-by: Stefan Bach <sfb@...gle.com>
> Cc: Neal Cardwell <ncardwell@...gle.com>
> ---
Fantastic. Thanks, Eric!
Acked-by: Neal Cardwell <ncardwell@...gle.com>
neal
Powered by blists - more mailing lists