Date: Fri, 4 Dec 2020 11:27:30 +0800
From: "dust.li" <dust.li@...ux.alibaba.com>
To: Julian Anastasov <ja@....bg>
Cc: yunhong-cgl jiang <xintian1976@...il.com>, horms@...ge.net.au,
	netdev@...r.kernel.org, lvs-devel@...r.kernel.org,
	Yunhong Jiang <yunhjiang@...y.com>
Subject: Re: Long delay on estimation_timer causes packet latency

On 12/3/20 4:48 PM, Julian Anastasov wrote:
>	Hello,
>
> On Thu, 3 Dec 2020, dust.li wrote:
>
>> Hi Yunhong & Julian, any updates ?
>>
>> We've encountered the same problem. With lots of ipvs
>> services plus many CPUs, it's easy to reproduce this issue.
>>
>> I have a simple script to reproduce:
>>
>> First add many ipvs services:
>>
>>   for((i=0;i<50000;i++)); do
>>     ipvsadm -A -t 10.10.10.10:$((2000+$i))
>>   done
>>
>> Then, check the latency of estimation_timer() using bpftrace:
>>
>>   #!/usr/bin/bpftrace
>>
>>   kprobe:estimation_timer {
>>     @enter = nsecs;
>>   }
>>
>>   kretprobe:estimation_timer {
>>     $exit = nsecs;
>>     printf("latency: %ld us\n", (nsecs - @enter)/1000);
>>   }
>>
>> I observed about 268ms delay on my 104 CPUs test server.
>>
>>   Attaching 2 probes...
>>   latency: 268807 us
>>   latency: 268519 us
>>   latency: 269263 us
>>
>> And I tried moving estimation_timer() into a delayed
>> workqueue; this does make things better. But since the
>> estimation won't give up the CPU, it can run for a pretty
>> long time without scheduling on a server which doesn't have
>> preempt enabled, so tasks on that CPU can't get executed
>> during that period.
>>
>> Since the estimation is repeated every 2s, we can't call
>> cond_resched() to give up the CPU in the middle of iterating
>> the est_list, or the estimation will be quite inaccurate.
>> Besides, the est_list needs to be protected.
>>
>> I haven't found any ideal solution yet; currently, we just
>> moved the estimation into a kworker and added a sysctl to allow
>> us to disable the estimation, since we don't need the
>> estimation anyway.
>>
>> Our patches are pretty simple now; if you think they're useful,
>> I can paste them.
>>
>> Do you guys have any suggestions or solutions ?
>
>	When I saw the report the first time, I thought on this
> issue and here is what I think:
>
> - single delayed work (slow to stop them if using many)
>
> - the delayed work walks, say, 64 lists with estimators and
> reschedules itself for the next 30ms; as a result, 30ms*64=1920ms,
> 80ms will be idle up to the 2000ms period for estimation
> for every list. As a result, every list is walked once per 2000ms.
> If 30ms is odd for all kinds of HZ values, this can be
> 20ms * 100.
>
> - work will use spin_lock_bh(&s->lock) to protect the
> entries; we do not want delays between /proc readers and
> the work if using a mutex. But _bh locks stop timers and
> networking for a short time :( Not sure yet if just spin_lock
> is safe for both /proc and the estimator's work.
>
> - while walking one of the 64 lists we should use just
> rcu read locking for the current list, no cond_resched_rcu,
> because we do not know how to synchronize while entries are
> removed at the same time. For now, using an array with 64 lists
> solves this problem.
>
> - the algorithm will look like this:
>
>	int row = 0;
>
>	for () {
>		rcu_read_lock();
>		list_for_each_entry_rcu(e, &ipvs->est_list[row], list) {
>
>			...
>
>			/* Should we stop immediately ? */
>			if (!ipvs->enable || stopping delayed work) {
>				rcu_read_unlock();
>				goto out;
>			}
>		}
>		/* rcu_read_unlock will reschedule if needed
>		 * but we know where to start from the next time,
>		 * i.e. from next slot
>		 */
>		rcu_read_unlock();
>		reschedule delayed work for +30ms, or +110ms if last row,
>		by using queue_delayed_work*()
>		row++;
>		if (row >= 64)
>			row = 0;
>	}
>
> out:
>
> - the drawback is that without cond_resched_rcu we are not
> fair to other processes; a solution is needed, we just reduce
> the problem by using 64 lists, which can be imbalanced.
>
> - next goal: entries should be removed with list_del_rcu,
> without any locks, we do not want to delay configurations,
> ipvs->est_lock should be removed.
>
> - now the problem is how to spread the entries over the 64 lists.
> One way is randomly; in this case the first estimation may be
> for less than 2000ms. In this case, fewer entries means
> less work for all 64 steps. But what if entries are removed
> and some slots become empty and others too large?
>
> - if we want to make the work equal for all 64 passes, we
> can rebalance the lists, say on every 2000ms move some
> entries to a neighbour list. As a result, one entry can be
> estimated after 1970ms or 2030ms. But this is a complication
> not possible with the RCU locking; we need a safe way
> to move entries to a neighbour list, say if the work walks
> row 0 we can rebalance between rows 32 and 33, which are
> 1 second away from row 0. And not all list primitives allow
> it for _rcu.
>
> - the next option is to insert entries into some current list,
> and if their count reaches, say, 128, move on to the next list
> for inserting. This option tries to provide an exact 2000ms
> delay for the first estimation of a newly added entry.
>
>	We can start with some implementation and see if
> your tests are happy.

Thanks for sharing your thoughts !

I think it's a good idea to split the est_list into different slots;
I believe it will dramatically reduce the delay brought by the
estimation.

My only concern is the cost of the estimation when the number of
services is large. Splitting the est_list won't reduce the real
work to do.

In our case, each estimation costs at most 268ms out of every 2000ms
period, which is about 13% of one CPU hyper-thread, and this should
be a common case in a large K8S cluster with lots of services.

Since the estimation is not needed in our environment at all, it's
just a waste of CPU resources. Have you ever considered adding a
switch to let the user turn the estimator off ?

Thanks !

Dust
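
A rough sketch of how the scheme Julian outlines above could look, for
illustration only: a single delayed work walks one of 64 estimator rows
per invocation under rcu_read_lock() and reschedules itself so that each
row is visited once per 2000ms, while entries are unlinked with
list_del_rcu() so the walker needs no lock. All names here (est_ctx,
est_entry, est_work_fn, estimate_entry) are illustrative stand-ins, not
actual kernel code:

	#include <linux/workqueue.h>
	#include <linux/rculist.h>
	#include <linux/rcupdate.h>
	#include <linux/jiffies.h>
	#include <linux/slab.h>

	#define EST_NROWS	64
	#define EST_TICK	msecs_to_jiffies(30)

	struct est_entry {			/* stand-in for struct ip_vs_estimator */
		struct list_head	list;
		struct rcu_head		rcu;
		/* rate counters would live here */
	};

	struct est_ctx {			/* would live in struct netns_ipvs */
		struct list_head	rows[EST_NROWS];
		struct delayed_work	work;
		int			row;	/* next row to walk */
		int			enable;
	};

	static void estimate_entry(struct est_entry *e)
	{
		/* update the per-entry rate estimates from its counters */
	}

	static void est_work_fn(struct work_struct *work)
	{
		struct est_ctx *ctx = container_of(work, struct est_ctx, work.work);
		int row = ctx->row;
		struct est_entry *e;

		rcu_read_lock();
		list_for_each_entry_rcu(e, &ctx->rows[row], list) {
			estimate_entry(e);
			if (!ctx->enable) {	/* stop immediately on shutdown */
				rcu_read_unlock();
				return;
			}
		}
		rcu_read_unlock();

		ctx->row = (row + 1) % EST_NROWS;
		/* 30ms between rows, a longer 110ms gap after the last row,
		 * so one full cycle over all 64 rows takes 2000ms.
		 */
		queue_delayed_work(system_wq, &ctx->work,
				   row == EST_NROWS - 1 ?
				   msecs_to_jiffies(110) : EST_TICK);
	}

	/* Removal does not block the walker: unlink, free after a grace period. */
	static void est_remove(struct est_entry *e)
	{
		list_del_rcu(&e->list);
		kfree_rcu(e, rcu);
	}

With 63 gaps of 30ms plus a final 110ms gap, one full pass over all 64
rows takes exactly 2000ms, and each invocation only touches 1/64th of
the entries, which is what bounds the per-invocation latency.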
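
As for the switch dust.li asks about at the end, a minimal version would
be a boolean sysctl that the estimation work consults before touching any
list. The knob name run_estimation below is an assumption for
illustration, not an existing interface, and a real patch would register
it per-netns alongside the other ipvs sysctls:

	#include <linux/sysctl.h>

	static int sysctl_run_estimation = 1;	/* default on: current behaviour */

	/* Entry to add to the existing net.ipv4.vs ctl_table. */
	static struct ctl_table est_switch_vars[] = {
		{
			.procname	= "run_estimation",
			.data		= &sysctl_run_estimation,
			.maxlen		= sizeof(int),
			.mode		= 0644,
			.proc_handler	= proc_dointvec,
		},
		{ }	/* terminator */
	};

The estimation work function would then start with a check such as
"if (!sysctl_run_estimation) goto resched;", skipping the list walk but
still rescheduling itself, so flipping the knob back to 1 re-enables the
estimator without any other action.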