Message-Id: <200702201209.52388.dada1@cosmosbay.com>
Date: Tue, 20 Feb 2007 12:09:51 +0100
From: Eric Dumazet <dada1@...mosbay.com>
To: Evgeniy Polyakov <johnpol@....mipt.ru>
Cc: akepner@....com, linux@...izon.com, davem@...emloft.net,
netdev@...r.kernel.org, bcrl@...ck.org
Subject: Re: Extensible hashing and RCU
On Tuesday 20 February 2007 11:44, Evgeniy Polyakov wrote:
> On Tue, Feb 20, 2007 at 11:04:15AM +0100, Eric Dumazet (dada1@...mosbay.com) wrote:
> > You totally miss the fact that the 1-2-4 MB cache is not available for
> > you at all. It is filled by User accesses. I dont care about DOS. I care
> > about real servers, servicing tcp clients. The TCP service/stack should
> > not take more than 10% of CPU (cycles and caches). The User application
> > is certainly more important because it hosts the real added value.
>
> A TCP socket is 4k in size, while one tree entry can be reduced to 200 bytes?
>
> No one talks about _that_ cache miss, it is considered OK to have, but
> a tree cache miss becomes the worst thing ever.
> In softirq we process the socket's state, lock, reference counter, several
> pointers, and if we are happy - the whole set of TCP state machine fields -
> and most of it stays in cache when kernel processing is over - userspace
> issues syscalls which must populate it back. Why don't we see that it is
> moved into cache each time a syscall is invoked? Because it is in the cache,
> along with the part of the hash table associated with the most recently used
> hash entries - which should not be there; part of the tree could be there instead.
No, I see cache misses everywhere...
This is because my machines are doing real work in user land. They are not lab
machines. Even if I had CPUs with 16-32MB of cache, it would be the same,
because user land wants GBs...
For example, sock_wfree() uses 1.6612% of cpu because of false sharing of
sk_flags (dirtied each time SOCK_QUEUE_SHRUNK is set) :(
ffffffff803c2850 <sock_wfree>: /* sock_wfree total: 714241 1.6613 */
1307 0.0030 :ffffffff803c2850: push %rbp
55056 0.1281 :ffffffff803c2851: mov %rsp,%rbp
94 2.2e-04 :ffffffff803c2854: push %rbx
:ffffffff803c2855: sub $0x8,%rsp
1090 0.0025 :ffffffff803c2859: mov 0x10(%rdi),%rbx
3 7.0e-06 :ffffffff803c285d: mov 0xb8(%rdi),%eax
38 8.8e-05 :ffffffff803c2863: lock sub %eax,0x90(%rbx)
/* HOT : access to sk_flags */
81979 0.1907 :ffffffff803c286a: mov 0x100(%rbx),%eax
512119 1.1912 :ffffffff803c2870: test $0x2,%ah
262 6.1e-04 :ffffffff803c2873: jne ffffffff803c2880 <sock_wfree+0x30>
142 3.3e-04 :ffffffff803c2875: mov %rbx,%rdi
14467 0.0336 :ffffffff803c2878: callq *0x200(%rbx)
63 1.5e-04 :ffffffff803c287e: data16
:ffffffff803c287f: nop
9046 0.0210 :ffffffff803c2880: lock decl 0x28(%rbx)
29792 0.0693 :ffffffff803c2884: sete %al
56 1.3e-04 :ffffffff803c2887: test %al,%al
789 0.0018 :ffffffff803c2889: je ffffffff803c2893 <sock_wfree+0x43>
:ffffffff803c288b: mov %rbx,%rdi
144 3.3e-04 :ffffffff803c288e: callq ffffffff803c0f90 <sk_free>
1685 0.0039 :ffffffff803c2893: add $0x8,%rsp
2462 0.0057 :ffffffff803c2897: pop %rbx
684 0.0016 :ffffffff803c2898: leaveq
2963 0.0069 :ffffffff803c2899: retq
This is why tcp lookups should not take more than 1% themselves: other parts
of the stack *want* to make many cache misses too.
If we want to optimize tcp, we should reorder fields to reduce the number of
cache lines touched, not change algos. struct sock fields are currently placed
to reduce holes, while they should be grouped so that related fields share
cache lines.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html