Message-ID: <20220429161810.GA175@qian>
Date: Fri, 29 Apr 2022 12:18:10 -0400
From: Qian Cai <quic_qiancai@...cinc.com>
To: Eric Dumazet <eric.dumazet@...il.com>
CC: "David S . Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
netdev <netdev@...r.kernel.org>,
Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH v2 net-next] net: generalize skb freeing deferral to
per-cpu lists
On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@...gle.com>
>
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skb
> frees outside of the critical section where the socket lock was held.
>
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
>
> For RPC traffic, recvmsg() has to return to user space right after
> the skb payload has been consumed, meaning that the BH handler has no
> chance to pick up the skb before the recvmsg() thread does. This issue
> is more visible with BIG TCP, as more RPCs fit in one skb.
>
> For RFS, even if the BH handler picks up the skbs, they are still
> picked up on the cpu on which the user thread is running.
>
> Ideally, it is better to free the skbs (and associated page frags)
> on the cpu that originally allocated them.
>
> This patch removes the per socket anchor (sk->defer_list) and
> instead uses a per-cpu list, which will hold more skbs per round.
>
> This new per-cpu list is drained at the end of net_rx_action(),
> after incoming packets have been processed, to lower latencies.
>
> In normal conditions, skbs are added to the per-cpu list with
> no further action. In the (unlikely) cases where the cpu does not
> run the net_rx_action() handler fast enough, we use an IPI to raise
> NET_RX_SOFTIRQ on the remote cpu.
>
> Also, we do not bother draining the per-cpu list from dev_cpu_dead().
> This is because skbs in this list have no requirement on how fast
> they should be freed.
>
> Note that we can add in the future a small per-cpu cache
> if we see any contention on sd->defer_lock.
Hmm, since yesterday's linux-next tree, which included this commit, we
have started to see memory leak reports from kmemleak: objects that have
stayed unreferenced for hours without being freed. Any thoughts?
unreferenced object 0xffff400610f55cc0 (size 216):
comm "git-remote-http", pid 781180, jiffies 4314091475 (age 4323.740s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00 ..~..?..........
backtrace:
kmem_cache_alloc_node
__alloc_skb
__tcp_send_ack.part.0
tcp_send_ack
tcp_cleanup_rbuf
tcp_recvmsg_locked
tcp_recvmsg
inet_recvmsg
__sys_recvfrom
__arm64_sys_recvfrom
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync
unreferenced object 0xffff4001e58f0c40 (size 216):
comm "git-remote-http", pid 781180, jiffies 4314091483 (age 4323.968s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 c0 7e 87 ff 3f ff ff 00 00 00 00 00 00 00 00 ..~..?..........
backtrace:
kmem_cache_alloc_node
__alloc_skb
__tcp_send_ack.part.0
tcp_send_ack
tcp_cleanup_rbuf
tcp_recvmsg_locked
tcp_recvmsg
inet_recvmsg
__sys_recvfrom
__arm64_sys_recvfrom
invoke_syscall
el0_svc_common.constprop.0
do_el0_svc
el0_svc
el0t_64_sync_handler
el0t_64_sync