Message-ID: <CACSApva=N_+qKLfS3Uw0AM=fzXV6vuBtctH+Gwb4xqAAS=4nKQ@mail.gmail.com>
Date: Mon, 15 May 2017 23:49:53 -0400
From: Soheil Hassas Yeganeh <soheil@...gle.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, netdev <netdev@...r.kernel.org>,
    Eric Dumazet <eric.dumazet@...il.com>, Neal Cardwell <ncardwell@...gle.com>,
    Yuchung Cheng <ycheng@...gle.com>, Van Jacobson <vanj@...gle.com>,
    Jerry Chu <hkchu@...gle.com>
Subject: Re: [PATCH net-next] tcp: internal implementation for pacing

On Mon, May 15, 2017 at 11:43 PM, Eric Dumazet <edumazet@...gle.com> wrote:
> BBR congestion control depends on pacing, and pacing is
> currently handled by sch_fq packet scheduler for performance reasons,
> and also because implementing pacing with FQ was convenient to truly
> avoid bursts.
>
> However there are many cases where this packet scheduler constraint
> is not practical.
> - Many linux hosts are not focusing on handling thousands of TCP
>   flows in the most efficient way.
> - Some routers use fq_codel or other AQM, but still would like
>   to use BBR for the few TCP flows they initiate/terminate.
>
> This patch implements an automatic fallback to internal pacing.
>
> Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
>
> If sch_fq happens to be in the egress path, pacing is delegated to
> the qdisc, otherwise pacing is done by TCP itself.
>
> One advantage of pacing from TCP stack is to get more precise rtt
> estimations, and less work done from TX completion, since TCP Small
> queue limits are not generally hit. Setups with single TX queue but
> many cpus might even benefit from this.
>
> Note that unlike sch_fq, we do not take into account header sizes.
> Taking care of these headers would add additional complexity for
> no practical differences in behavior.
>
> Some performance numbers using 800 TCP_STREAM flows rate limited to
> ~48 Mbit per second on 40Gbit NIC.
>
> If MQ+pfifo_fast is used on the NIC :
>
> $ sar -n DEV 1 5 | grep eth
> 14:48:44   eth0  725743.00  2932134.00  46776.76  4335184.68  0.00  0.00  1.00
> 14:48:45   eth0  725349.00  2932112.00  46751.86  4335158.90  0.00  0.00  0.00
> 14:48:46   eth0  725101.00  2931153.00  46735.07  4333748.63  0.00  0.00  0.00
> 14:48:47   eth0  725099.00  2931161.00  46735.11  4333760.44  0.00  0.00  1.00
> 14:48:48   eth0  725160.00  2931731.00  46738.88  4334606.07  0.00  0.00  0.00
> Average:   eth0  725290.40  2931658.20  46747.54  4334491.74  0.00  0.00  0.40
> $ vmstat 1 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>  r  b  swpd      free    buff   cache  si  so  bi  bo      in      cs us sy  id wa st
>  4  0     0 259825920  45644 2708324   0   0  21   2     247      98  0  0 100  0  0
>  4  0     0 259823744  45644 2708356   0   0   0   0 2400825  159843  0 19  81  0  0
>  0  0     0 259824208  45644 2708072   0   0   0   0 2407351  159929  0 19  81  0  0
>  1  0     0 259824592  45644 2708128   0   0   0   0 2405183  160386  0 19  80  0  0
>  1  0     0 259824272  45644 2707868   0   0   0  32 2396361  158037  0 19  81  0  0
>
> Now use MQ+FQ :
>
> lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
> lpaa23:~# tc qdisc replace dev eth0 root mq
>
> $ sar -n DEV 1 5 | grep eth
> 14:49:57   eth0  678614.00  2727930.00  43739.13  4033279.14  0.00  0.00  0.00
> 14:49:58   eth0  677620.00  2723971.00  43674.69  4027429.62  0.00  0.00  1.00
> 14:49:59   eth0  676396.00  2719050.00  43596.83  4020125.02  0.00  0.00  0.00
> 14:50:00   eth0  675197.00  2714173.00  43518.62  4012938.90  0.00  0.00  1.00
> 14:50:01   eth0  676388.00  2719063.00  43595.47  4020171.64  0.00  0.00  0.00
> Average:   eth0  676843.00  2720837.40  43624.95  4022788.86  0.00  0.00  0.40
> $ vmstat 1 5
> procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
>  r  b  swpd      free    buff   cache  si  so  bi  bo      in      cs us sy  id wa st
>  2  0     0 259832240  46008 2710912   0   0  21   2     223     192  0  1  99  0  0
>  1  0     0 259832896  46008 2710744   0   0   0   0 1702206  198078  0 17  82  0  0
>  0  0     0 259830272  46008 2710596   0   0   0   0 1696340  197756  1 17  83  0  0
>  4  0     0 259829168  46024 2710584   0   0  16   0 1688472  197158  1 17  82  0  0
>  3  0     0 259830224  46024 2710408   0   0   0   0 1692450  197212  0 18  82  0  0
>
> As expected, number of interrupts per second is very different.
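
For anyone who wants to exercise the fallback path without touching qdiscs: a
minimal user-space sketch (illustrative only, not part of the patch) that
requests pacing through SO_MAX_PACING_RATE; the 6 MB/s value is just picked to
roughly match the ~48 Mbit/s flows above.

    #include <stdio.h>
    #include <sys/socket.h>

    #ifndef SO_MAX_PACING_RATE
    #define SO_MAX_PACING_RATE 47  /* value from include/uapi/asm-generic/socket.h */
    #endif

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        unsigned int rate = 6 * 1000 * 1000;  /* bytes per second, ~48 Mbit/s */

        /* ~0U means "no limit" and leaves pacing off; any other value asks
         * the kernel to pace this socket.
         */
        if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &rate, sizeof(rate)) < 0)
            perror("setsockopt(SO_MAX_PACING_RATE)");

        /* ... connect() and send as usual: packets are then spaced out by
         * sch_fq when it sits on the egress path, or by the TCP hrtimer
         * added in this patch otherwise.
         */
        return 0;
    }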
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>

Acked-by: Soheil Hassas Yeganeh <soheil@...gle.com>

> Cc: Neal Cardwell <ncardwell@...gle.com>
> Cc: Yuchung Cheng <ycheng@...gle.com>
> Cc: Soheil Hassas Yeganeh <soheil@...gle.com>
> Cc: Van Jacobson <vanj@...gle.com>
> Cc: Jerry Chu <hkchu@...gle.com>
> ---
>  include/linux/tcp.h   |  2 ++
>  include/net/sock.h    |  8 +++++-
>  include/net/tcp.h     |  3 ++
>  net/core/sock.c       |  4 +++
>  net/ipv4/tcp_bbr.c    |  9 +++---
>  net/ipv4/tcp_output.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  net/ipv4/tcp_timer.c  |  3 ++
>  net/sched/sch_fq.c    |  8 ++++++
>  8 files changed, 112 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index b6d5adcee8fcb611de202993623cc80274d262e4..22854f0284347a3bb047709478525ee5a9dd9b36 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -293,6 +293,8 @@ struct tcp_sock {
>      u32  sacked_out;   /* SACK'd packets */
>      u32  fackets_out;  /* FACK'd packets */
>
> +    struct hrtimer  pacing_timer;
> +
>      /* from STCP, retrans queue hinting */
>      struct sk_buff* lost_skb_hint;
>      struct sk_buff *retransmit_skb_hint;
> diff --git a/include/net/sock.h b/include/net/sock.h
> index f33e3d134e0b7f66329f2122d7acc8b396c1787b..f21e07563991fecb2b67e092617aef0d63954c86 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -396,7 +396,7 @@ struct sock {
>      __s32              sk_peek_off;
>      int                sk_write_pending;
>      __u32              sk_dst_pending_confirm;
> -    /* Note: 32bit hole on 64bit arches */
> +    u32                sk_pacing_status; /* see enum sk_pacing */
>      long               sk_sndtimeo;
>      struct timer_list  sk_timer;
>      __u32              sk_priority;
> @@ -475,6 +475,12 @@ struct sock {
>      struct rcu_head    sk_rcu;
>  };
>
> +enum sk_pacing {
> +    SK_PACING_NONE   = 0,
> +    SK_PACING_NEEDED = 1,
> +    SK_PACING_FQ     = 2,
> +};
> +
>  #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
>
>  #define rcu_dereference_sk_user_data(sk) rcu_dereference(__sk_user_data((sk)))
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 38a7427ae902e35973a8b7fa0e95ff602ede0e87..b4dc93dae98c2d175ccadce150083705d237555e 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -574,6 +574,7 @@ void tcp_fin(struct sock *sk);
>  void tcp_init_xmit_timers(struct sock *);
>  static inline void tcp_clear_xmit_timers(struct sock *sk)
>  {
> +    hrtimer_cancel(&tcp_sk(sk)->pacing_timer);
>      inet_csk_clear_xmit_timers(sk);
>  }
>
> @@ -1945,4 +1946,6 @@ static inline void tcp_listendrop(const struct sock *sk)
>      __NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
>  }
>
> +enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
> +
>  #endif /* _TCP_H */
> diff --git a/net/core/sock.c b/net/core/sock.c
> index e43e71d7856b385111cd4c4b1bd835a78c670c60..93d011e35b8349954db6918055c2f90ae473d254 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1041,6 +1041,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>  #endif
>
>      case SO_MAX_PACING_RATE:
> +        if (val != ~0U)
> +            cmpxchg(&sk->sk_pacing_status,
> +                    SK_PACING_NONE,
> +                    SK_PACING_NEEDED);
>          sk->sk_max_pacing_rate = val;
>          sk->sk_pacing_rate = min(sk->sk_pacing_rate,
>                                   sk->sk_max_pacing_rate);
> diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
> index b89bce4c721eed530f5cfc725b759147b38cef42..92b045c72163def1c1d6aa0f2002760186aa5dc3 100644
> --- a/net/ipv4/tcp_bbr.c
> +++ b/net/ipv4/tcp_bbr.c
> @@ -52,10 +52,9 @@
>   * There is a public e-mail list for discussing BBR development and testing:
>   *   https://groups.google.com/forum/#!forum/bbr-dev
>   *
> - * NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing enabled,
> - * since pacing is integral to the BBR design and implementation.
> - * BBR without pacing would not function properly, and may incur unnecessary
> - * high packet loss rates.
> + * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing enabled,
> + * otherwise TCP stack falls back to an internal pacing using one high
> + * resolution timer per TCP socket and may use more resources.
>   */
>  #include <linux/module.h>
>  #include <net/tcp.h>
> @@ -830,6 +829,8 @@ static void bbr_init(struct sock *sk)
>      bbr->cycle_idx = 0;
>      bbr_reset_lt_bw_sampling(sk);
>      bbr_reset_startup_mode(sk);
> +
> +    cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED);
>  }
>
>  static u32 bbr_sndbuf_expand(struct sock *sk)
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 4858e190f6ac130c9441f58cb8944cc82bf67270..a32172d69a03cbe76b45ec3094222f6c3a73e27d 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -904,6 +904,72 @@ void tcp_wfree(struct sk_buff *skb)
>      sk_free(sk);
>  }
>
> +/* Note: Called under hard irq.
> + * We can not call TCP stack right away.
> + */
> +enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
> +{
> +    struct tcp_sock *tp = container_of(timer, struct tcp_sock, pacing_timer);
> +    struct sock *sk = (struct sock *)tp;
> +    unsigned long nval, oval;
> +
> +    for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
> +        struct tsq_tasklet *tsq;
> +        bool empty;
> +
> +        if (oval & TSQF_QUEUED)
> +            break;
> +
> +        nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
> +        nval = cmpxchg(&sk->sk_tsq_flags, oval, nval);
> +        if (nval != oval)
> +            continue;
> +
> +        if (!atomic_inc_not_zero(&sk->sk_wmem_alloc))
> +            break;
> +        /* queue this socket to tasklet queue */
> +        tsq = this_cpu_ptr(&tsq_tasklet);
> +        empty = list_empty(&tsq->head);
> +        list_add(&tp->tsq_node, &tsq->head);
> +        if (empty)
> +            tasklet_schedule(&tsq->tasklet);
> +        break;
> +    }
> +    return HRTIMER_NORESTART;
> +}
> +
> +/* BBR congestion control needs pacing.
> + * Same remark for SO_MAX_PACING_RATE.
> + * sch_fq packet scheduler is efficiently handling pacing,
> + * but is not always installed/used.
> + * Return true if TCP stack should pace packets itself.
> + */
> +static bool tcp_needs_internal_pacing(const struct sock *sk)
> +{
> +    return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
> +}
> +
> +static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
> +{
> +    u64 len_ns;
> +    u32 rate;
> +
> +    if (!tcp_needs_internal_pacing(sk))
> +        return;
> +    rate = sk->sk_pacing_rate;
> +    if (!rate || rate == ~0U)
> +        return;
> +
> +    /* Should account for header sizes as sch_fq does,
> +     * but lets make things simple.
> +     */
> +    len_ns = (u64)skb->len * NSEC_PER_SEC;
> +    do_div(len_ns, rate);
> +    hrtimer_start(&tcp_sk(sk)->pacing_timer,
> +                  ktime_add_ns(ktime_get(), len_ns),
> +                  HRTIMER_MODE_ABS_PINNED);
> +}
> +
>  /* This routine actually transmits TCP packets queued in by
>   * tcp_do_sendmsg().  This is used by both the initial
>   * transmission and possible later retransmissions.
> @@ -1034,6 +1100,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
>      if (skb->len != tcp_header_size) {
>          tcp_event_data_sent(tp, sk);
>          tp->data_segs_out += tcp_skb_pcount(skb);
> +        tcp_internal_pacing(sk, skb);
>      }
>
>      if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq)
> @@ -2086,6 +2153,12 @@ static int tcp_mtu_probe(struct sock *sk)
>      return -1;
>  }
>
> +static bool tcp_pacing_check(const struct sock *sk)
> +{
> +    return tcp_needs_internal_pacing(sk) &&
> +           hrtimer_active(&tcp_sk(sk)->pacing_timer);
> +}
> +
>  /* TCP Small Queues :
>   * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>   * (These limits are doubled for retransmits)
> @@ -2210,6 +2283,9 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>      while ((skb = tcp_send_head(sk))) {
>          unsigned int limit;
>
> +        if (tcp_pacing_check(sk))
> +            break;
> +
>          tso_segs = tcp_init_tso_segs(skb, mss_now);
>          BUG_ON(!tso_segs);
>
> @@ -2878,6 +2954,10 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
>
>          if (skb == tcp_send_head(sk))
>              break;
> +
> +        if (tcp_pacing_check(sk))
> +            break;
> +
>          /* we could do better than to assign each time */
>          if (!hole)
>              tp->retransmit_skb_hint = skb;
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 14672543cf0bd27bc59976d5cec38d2d3bbcdd2c..86934bcf685a65ec3af3d22f1801ffa33eea76e2 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -710,4 +710,7 @@ void tcp_init_xmit_timers(struct sock *sk)
>  {
>      inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
>                                &tcp_keepalive_timer);
> +    hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
> +                 HRTIMER_MODE_ABS_PINNED);
> +    tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
>  }
> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> index b488721a0059adb24aea47240afa0164a6e467a9..147fde73a0f566e8f6a26718adf176ef3943afa0 100644
> --- a/net/sched/sch_fq.c
> +++ b/net/sched/sch_fq.c
> @@ -390,9 +390,17 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>          q->stat_tcp_retrans++;
>      qdisc_qstats_backlog_inc(sch, skb);
>      if (fq_flow_is_detached(f)) {
> +        struct sock *sk = skb->sk;
> +
>          fq_flow_add_tail(&q->new_flows, f);
>          if (time_after(jiffies, f->age + q->flow_refill_delay))
>              f->credit = max_t(u32, f->credit, q->quantum);
> +        if (sk && q->rate_enable) {
> +            if (unlikely(smp_load_acquire(&sk->sk_pacing_status) !=
> +                         SK_PACING_FQ))
> +                smp_store_release(&sk->sk_pacing_status,
> +                                  SK_PACING_FQ);
> +        }
>          q->inactive_flows--;
>      }
>
> --
> 2.13.0.303.g4ebf302169-goog
>

This is superb! Thank you, Eric!
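
To make the delay that the new hrtimer is armed with concrete:
tcp_internal_pacing() computes len_ns = skb->len * NSEC_PER_SEC / sk_pacing_rate
and fires that far in the future. A stand-alone sketch of the same arithmetic
(the helper name is made up for illustration):

    #include <stdio.h>
    #include <stdint.h>

    #define NSEC_PER_SEC 1000000000ULL

    /* Same computation as the len_ns/do_div() pair in tcp_internal_pacing():
     * delay the next transmit by skb->len / pacing_rate seconds, ignoring
     * header sizes as the changelog notes.
     */
    static uint64_t pacing_delay_ns(uint32_t skb_len, uint32_t pacing_rate)
    {
        return (uint64_t)skb_len * NSEC_PER_SEC / pacing_rate;
    }

    int main(void)
    {
        /* A 1500-byte skb at 6,000,000 bytes/s (~48 Mbit/s) gives 250000 ns,
         * i.e. packets leave roughly every 250 us.
         */
        printf("%llu ns\n",
               (unsigned long long)pacing_delay_ns(1500, 6 * 1000 * 1000));
        return 0;
    }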