Message-ID: <20130524151647.18388e27@redhat.com>
Date:	Fri, 24 May 2013 15:16:47 +0200
From:	Jesper Dangaard Brouer <jbrouer@...hat.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	Pablo Neira Ayuso <pablo@...filter.org>,
	netfilter-devel@...r.kernel.org, netdev <netdev@...r.kernel.org>,
	Tom Herbert <therbert@...gle.com>,
	Patrick McHardy <kaber@...sh.net>
Subject: Re: [PATCH v2 nf-next] netfilter: conntrack: remove the central
 spinlock

On Wed, 22 May 2013 10:47:48 -0700
Eric Dumazet <eric.dumazet@...il.com> wrote:

> nf_conntrack_lock is a monolithic lock and suffers from huge
> contention on current generation servers (8 or more core/threads).
> 
[...]
> Results on a 32 threads machine, 200 concurrent instances of "netperf
> -t TCP_CRR" : 
> 
> ~390000 tps instead of ~300000 tps.

Tested-by: Jesper Dangaard Brouer <brouer@...hat.com>

I gave the patch a quick run in my testlab, and the results are
amazing. You are amazing, Eric! :-)

Basic testlab setup:
 I'm generating a 2700 Kpps SYN-flood against port 80 (with trafgen)
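
For reference, a flood like this can be described in a small trafgen
config. This is only a rough sketch, assuming trafgen's packet-builder
syntax; the MAC/IP addresses and device name below are placeholders,
not my actual testlab values:

 /* syn.cfg: a TCP SYN to port 80 with a random source port */
 {
   eth(da=00:1b:21:aa:bb:cc),
   ipv4(saddr=10.0.0.2, daddr=10.0.0.1),
   tcp(sp=drnd(), dp=80, syn)
 }

and fired at the wire with something like:
 trafgen --dev eth0 --conf syn.cfg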

Baseline result from a 3.9.0-rc5 kernel:
- With nf_conntrack loaded, my performance is 749 Kpps.

After removing all iptables and nf_conntrack modules:
- the performance hits 1095 Kpps.
But it looks like we are hitting a new spin_lock in ip_send_reply().

If I start a LISTEN process on the port, then we hit the "old" SYN
scalability issues again, and performance drops to 227 Kpps.
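
For the LISTEN case I just need some process bound to the port; any
trivial listener will do, e.g. netcat (note the flag syntax differs
between netcat variants):
 nc -l -p 80   # traditional netcat
 nc -l 80      # OpenBSD netcat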

On a patched net-next (close to 3.10.0-rc1) kernel, with Eric's new
locking scheme patch:
- I measured an amazing 2431 Kpps.

 13.45%  [kernel]                [k] fib_table_lookup
  9.07%  [nf_conntrack]          [k] __nf_conntrack_alloc
  6.50%  [nf_conntrack]          [k] nf_conntrack_free
  5.24%  [ip_tables]             [k] ipt_do_table
  3.66%  [nf_conntrack]          [k] nf_conntrack_in
  3.54%  [kernel]                [k] inet_getpeer
  3.52%  [nf_conntrack]          [k] tcp_packet
  2.44%  [ixgbe]                 [k] ixgbe_poll
  2.30%  [kernel]                [k] __ip_route_output_key
  2.04%  [nf_conntrack]          [k] nf_conntrack_tuple_taken
  1.98%  [kernel]                [k] icmp_send
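
For reference, profiles like the one above can be captured with
something along these lines (not my exact command line; perf top gives
a similar live view):
 perf record -a -g sleep 30
 perf report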

Then I realized that I didn't have any iptables rule that accepted
port 80 on my testlab system; thus this was basically a packet-drop
test with an nf_conntrack lookup.

If I add a rule that accepts new connections to that port, e.g.:
 iptables -I INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT

New ruleset:
 -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT 
 -A INPUT -p icmp -j ACCEPT 
 -A INPUT -i lo -j ACCEPT 
 -A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT 
 -A INPUT -p tcp -m state --state NEW -m tcp --dport 80 -j ACCEPT 
 -A INPUT -j REJECT --reject-with icmp-host-prohibited 

Then, performance drops again:
- to approx 883 Kpps.

I discovered that the NAT stuff is to blame:

-  17.71%        swapper  [kernel.kallsyms]       [k] _raw_spin_lock_bh
   - _raw_spin_lock_bh
      + 47.17% nf_nat_cleanup_conntrack
      + 45.81% nf_nat_setup_info
      + 6.43% nf_nat_get_offset

Removing the NAT modules improves the performance:
- to 1182 Kpps (no process listening on port 80)

 sudo iptables -t nat -F
 sudo rmmod iptable_nat nf_nat_ipv4
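
A quick check that nothing NAT-related is still loaded:
 lsmod | grep -e nf_nat -e iptable_nat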

And the perf output looks more like what I would expect:

-  14.85%       swapper  [kernel.kallsyms]        [k] _raw_spin_lock
   - _raw_spin_lock
      + 82.86% mod_timer
      + 11.14% nf_conntrack_double_lock
      + 2.50% nf_ct_del_from_dying_or_unconfirmed_list
      + 1.48% nf_conntrack_in
      + 1.30% nf_ct_delete_from_lists
-  12.78%       swapper  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 99.44% lock_timer_base
         + 99.07% del_timer
         + 0.93% mod_timer
+   2.69%       swapper  [ip_tables]              [k] ipt_do_table
+   2.28%   ksoftirqd/0  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
+   2.18%       swapper  [nf_conntrack]           [k] tcp_packet
+   2.16%       swapper  [kernel.kallsyms]        [k] fib_table_lookup


Again, if I start a LISTEN process on the port, performance drops to
169 Kpps, due to the LISTEN and SYN-cookie scalability issues.

I'm amazed; this patch will actually make it a viable choice to load
the conntrack modules on a DDoS filtering box, and to use conntrack to
protect against ACK and SYN+ACK attacks.

This works simply by not allowing an ACK or SYN+ACK to create a
conntrack entry, via the command:
 sysctl -w net/netfilter/nf_conntrack_tcp_loose=0
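
To keep this setting across reboots, the usual sysctl persistence
applies (the file location can vary by distro):
 echo 'net.netfilter.nf_conntrack_tcp_loose = 0' >> /etc/sysctl.conf
 sysctl -p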

A quick test shows that I can now run a LISTEN process on the port
and handle a SYN+ACK attack of approx 2580 Kpps (and the same for ACK
attacks).

Thanks for the great work Eric!

P.S. I also tested resizing the conntrack tables: both the maximum
number of entries via:
 /proc/sys/net/netfilter/nf_conntrack_max
and the number of hash buckets via:
 /sys/module/nf_conntrack/parameters/hashsize
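
For reference, both knobs in one go (the values here are purely
illustrative, not what I used):
 # max number of tracked connections
 sysctl -w net.netfilter.nf_conntrack_max=1048576
 # number of hash buckets (needs root; module must be loaded)
 echo 262144 > /sys/module/nf_conntrack/parameters/hashsize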

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
