netdev - Re: [*v3 PATCH 00/22] IPVS Network Name Space aware

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <201101021727.42759.hans@schillstrom.com>
Date:	Sun, 2 Jan 2011 17:27:42 +0100
From:	Hans Schillstrom <hans@...illstrom.com>
To:	Julian Anastasov <ja@....bg>
Cc:	horms@...ge.net.au, daniel.lezcano@...e.fr, wensong@...ux-vs.org,
	lvs-devel@...r.kernel.org, netdev@...r.kernel.org,
	netfilter-devel@...r.kernel.org,
	Hans Schillstrom <hans.schillstrom@...csson.com>
Subject: Re: [*v3 PATCH 00/22] IPVS Network Name Space aware

On Saturday, January 01, 2011 13:27:00 Julian Anastasov wrote:
> 
>  	Hello,
> 
> On Thu, 30 Dec 2010, hans@...illstrom.com wrote:
> 
> > From: Hans Schillstrom <hans.schillstrom@...csson.com>
> >
> > This patch series adds network name space support to the LVS.
> >
> > REVISION
> >
> > This is version 3
> >
> > OVERVIEW
> >
> > The patch doesn't remove or add any functionality except for netns.
> > For users that don't use network name space (netns) this patch is
> > completely transparent.
> >
> > Now it's possible to run LVS in a Linux container (see lxc-tools)
> > i.e.  a light weight visualization. For example it's possible to run
> > one or several lvs on a real server in their own network name spaces.
> >> From the LVS point of view it looks like it runs on it's own machine.
> 
>  	Only some comments for two of the patches:
> 
> v3 PATCH 10/22 use ip_vs_proto_data as param
> 
>  	- Can ip_vs_protocol_timeout_change() walk
>  	proto_data_table instead of ip_vs_proto_table to avoid the
>  	__ipvs_proto_data_get call?
> 

I just forget to do that ...

> v3 PATCH 15/22 ip_vs_stats
> 
>  	- Is ustats_seq allocated with alloc_percpu?
> 
>  	Such reader sections should be changed to use tmp vars because
>  	on retry we risk to add the values multiple times. For example:
> 
>  		do {
>  			start = read_seqcount_begin(seq_count);
>  			ipvs->ctl_stats->ustats.inbytes += u->inbytes;
>  			ipvs->ctl_stats->ustats.outbytes += u->outbytes;
>  		} while (read_seqcount_retry(seq_count, start));
> 
>  	should be changed as follows:
> 
>  		u64 inbytes, outbytes;
> 
>  		do {
>  			start = read_seqcount_begin(seq_count);
>  			inbytes = u->inbytes;
>  			outbytes = u->outbytes;
>  		} while (read_seqcount_retry(seq_count, start));
>  		ipvs->ctl_stats->ustats.inbytes += inbytes;
>  		ipvs->ctl_stats->ustats.outbytes += outbytes;
> 

Ooops, that was a bug :-(

>  	Or it is better to create new struct for percpu stats,
>  	they will have their own syncp, because we can not
>  	change struct ip_vs_stats_user. syncp should be percpu
>  	because we remove locks.
> 
>  	For example:
> 
>  	struct ip_vs_cpu_stats {
>  		struct ip_vs_stats_user	ustats;
>  		struct u64_stats_sync	syncp;
>  	};
> 
>  	Then we can add this in struct netns_ipvs:
> 
>  	struct ip_vs_cpu_stats __percpu *stats;	/* Statistics */
> 
>  	without the seqcount_t * ustats_seq;
> 
>  	Then syncp does not need any initialization, it seems
>  	alloc_percpu returns zeroed area.
> 
>  	When we use percpu stats for all places (dest and svc) we
>  	can create new struct struct ip_vs_counters, so that we
>  	can reduce the memory usage from percpu data. Now stats
>  	include counters and estimated values. The estimated
>  	values should not be percpu. Then ip_vs_cpu_stats
>  	will be shorter (it is not visible to user space):
> 
>  	struct ip_vs_cpu_stats {
>  		struct ip_vs_counters	ustats;
>  		struct u64_stats_sync	syncp;
>  	};
> 
>  	For writer side in softirq context we should protect the whole
>  	section with u64 counters, for example this code:
> 

There is only two u64,  inbytes and outbytes

>  	spin_lock(&ip_vs_stats.lock);
>  	ip_vs_stats.ustats.outpkts++;
>  	ip_vs_stats.ustats.outbytes += skb->len;
>  	spin_unlock(&ip_vs_stats.lock);
> 
>  	should be changed to:
> 
>  	struct ip_vs_cpu_stats *s = this_cpu_ptr(ipvs->stats);
> 
>  	u64_stats_update_begin(&s->syncp);
>  	s->ustats.outpkts++;
>  	s->ustats.outbytes += skb->len;
>  	u64_stats_update_end(&s->syncp);
> 
>  	Readers should look in similar way, only that _bh fetching
>  	should be used only for the u64 counters because writer is in
>  	softirq context:
> 
>  	u64 inbytes, outbytes;
> 
>  	for_each_possible_cpu(i) {
>  		const struct ip_vs_cpu_stats *s = per_cpu_ptr(ipvs->stats, i);
>  		unsigned int start;
> 
>  		...
>  		do {
>  			start = u64_stats_fetch_begin_bh(&s->syncp);
>  			inbytes = s->ustats.inbytes;
>  			outbytes = s->ustats.outbytes;
>  		} while (u64_stats_fetch_retry_bh(&s->syncp, start));
>  		ipvs->ctl_stats->ustats.inbytes += inbytes;
>  		ipvs->ctl_stats->ustats.outbytes += outbytes;
>  	}
> 
>  	Then IPVS_STAT_ADD and IPVS_STAT_INC can not access
>  	the syncp. They do not look generic enough, in case we
>  	decide to use struct ip_vs_cpu_stats for dest and svc stats.
>  	I think, these macros are not needed.
>  	It is possible to have other macros for write side,
>  	for percpu stats:
> 
>  	#define IP_VS_STATS_UPDATE_BEGIN(stats)		\
>  		{ struct ip_vs_cpu_stats *s = this_cpu_ptr(stats); \
>  		u64_stats_update_begin(&s->syncp)
> 
>  	#define IP_VS_STATS_UPDATE_END()		\
>  		u64_stats_update_end(&s->syncp);	\
>  		}
> 
>  	Then usage can be:
> 
>  	IP_VS_STATS_UPDATE_BEGIN(ipvs->stats);
>  	s->ustats.outpkts++;
>  	s->ustats.outbytes += skb->len;
>  	IP_VS_STATS_UPDATE_END();
> 
>  	Then we can rename ctl_stats to total_stats or full_stats.
>  	get_stats() can be changed to ip_vs_read_cpu_stats(), it
>  	will be used to read all percpu stats as sum in the
>  	total_stats/full_stats structure or tmp space:
> 
>  	/* Get sum of percpu stats */
>  	... ip_vs_read_cpu_stats(struct ip_vs_stats_user *sum,
>  		struct ip_vs_cpu_stats *stats)
>  	{
>  		...
>  		for_each_possible_cpu(i) {
>  			const struct ip_vs_cpu_stats *s = per_cpu_ptr(stats, i);
>  			unsigned int start;
> 
>  			if (i) {...
>  			...
>  			do {
>  				start = u64_stats_fetch_begin_bh(&s->syncp);
>  				inbytes = s->ustats.inbytes;
>  				outbytes = s->ustats.outbytes;
>  			} while (u64_stats_fetch_retry_bh(&s->syncp, start));
>  			sum->inbytes += inbytes;
>  			sum->outbytes += outbytes;
>  		}
>  	}
> 
>  	As result, ip_vs_read_cpu_stats() will be used in
>  	estimation_timer() instead of get_stats():
> 
>  	ip_vs_read_cpu_stats(&ipvs->total_stats->ustats, ipvs->stats);
> 
>  	but also in ip_vs_stats_show()
>  	{
>  		// I hope struct ip_vs_stats_user can fit in stack:
>  		struct ip_vs_stats_user sum;
> 
>  		here we do not need to use spin_lock_bh(&ctl_stats->lock);
>  		because now the stats are lockless, estimation_timer
>  		does not use ctl_stats->lock, so we have to call
>  		ip_vs_read_cpu_stats to avoid problems with
>  		reading incorrect u64 values directly from
>  		ipvs->total_stats->ustats.
> 
>  		ip_vs_read_cpu_stats(&sum, ipvs->stats);
>  		seq_printf for sum
>  	}
> 
>  	ip_vs_read_cpu_stats can be used also in ip_vs_copy_stats()
>  	which copies values to user space and needs proper u64 reading.
>  	But it is used only for svc and dest stats which are not
>  	percpu yet.
> 
>  	Now this code does not look ok:
> 
>  	ip_vs_zero_stats(net_ipvs(net)->ctl_stats);
> 
>  	May be we need new func ip_vs_zero_percpu_stats that will
>  	reset stats for all CPUs, i.e. ipvs->stats in our case?
>  	Even if this operation is not very safe if done from
>  	user space while the softirq can modify u64 counters
>  	on another CPU.

OK, I will take a look at this when I'm back from my vacation at 20 Jan.
I guess it might be worth the work to make all the stat counters per_cpu.

My tests shows that the big "CPU cycle bandit" was the one that I made per_cpu.

> 
> 
> Happy New Year!
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html