lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090426213954.GA825@Krystal>
Date:	Sun, 26 Apr 2009 17:39:54 -0400
From:	Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To:	Eric Dumazet <dada1@...mosbay.com>
Cc:	Stephen Hemminger <shemminger@...tta.com>,
	David Miller <davem@...emloft.net>,
	Jarek Poplawski <jarkao2@...il.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ingo Molnar <mingo@...e.hu>, Paul Mackerras <paulus@...ba.org>,
	paulmck@...ux.vnet.ibm.com, Evgeniy Polyakov <zbr@...emap.net>,
	kaber@...sh.net, jeff.chua.linux@...il.com, laijs@...fujitsu.com,
	jengelh@...ozas.de, r000n@...0n.net, linux-kernel@...r.kernel.org,
	netfilter-devel@...r.kernel.org, netdev@...r.kernel.org,
	benh@...nel.crashing.org
Subject: Re: [PATCH] netfilter: use per-CPU recursive lock {XV}

* Eric Dumazet (dada1@...mosbay.com) wrote:
> Mathieu Desnoyers a écrit :
> > * Eric Dumazet (dada1@...mosbay.com) wrote:
> >> From: Stephen Hemminger <shemminger@...tta.com>
> >>
> >>> Epilogue due to master Jarek. Lockdep carest not about the locking
> >>> doth bestowed. Therefore no keys are needed.
> >>>
> >>> Signed-off-by: Stephen Hemminger <shemminger@...tta.com>
> >> So far, so good, should be ready for inclusion now, nobody complained :)
> >>
> >> I include the final patch, merge of your last two patches.
> >>
> >> David, could you please review it once again and apply it if it's OK ?
> >>
> > [...]
> >> +/*
> >> + * Per-CPU read/write lock associated with per-cpu table entries.
> >> + * This is not a general solution but makes reader locking fast since
> >> + * there is no shared variable to cause cache ping-pong; but adds an
> >> + * additional write-side penalty since update must lock all
> >> + * possible CPU's.
> >> + *
> >> + * Read lock is used by ip/arp/ip6 tables rule processing which runs per-cpu.
> >> + * It needs to ensure that the rules are not being changed while packet
> >> + * is being processed. In some cases, the read lock will be acquired
> >> + * twice on the same CPU; this is okay because read locks handle nesting.
> >> + *
> >> + * Write lock is used in two cases:
> >> + *    1. reading counter values
> >> + *       all readers need to be stopped and the per-CPU values are summed.
> >> + *
> >> + *    2. replacing tables
> >> + *       any readers that are using the old tables have to complete
> >> + *       before freeing the old table. This is handled by reading
> >> + *	  as a side effect of reading counters
> >> + */
> >> +DECLARE_PER_CPU(rwlock_t, xt_info_locks);
> >> +
> >> +static inline void xt_info_rdlock_bh(void)
> >> +{
> >> +	/*
> >> +	 * Note: can not use read_lock_bh(&__get_cpu_var(xt_info_locks))
> >> +	 * because need to ensure that preemption is disable before
> >> +	 * acquiring per-cpu-variable, so do it as a two step process
> >> +	 */
> >> +	local_bh_disable();
> > 
> > Why do you need to disable bottom halves on the read-side ? You could
> > probably just disable preemption, given this lock is nestable on the
> > read-side anyway. Or I'm missing something obvious ?
> 
> It may not be obvious, but subject already raised on this list, so I'll
> try to be as precise as possible (But may be wrong on some points, I'll
> let Patrick correct me if necessary)
> 
> ipt_do_table() is not a readonly function returning a verdict.
> 
> 1) It handles a stack (check how is used next->comefrom) that seems to
> be stored on rules themselves. (This is how I understand this code)
> This is safe as each cpu has its own copy of rules/counters, and BH protected.
> 
> 2) It also updates two 64 bit counters (bytes/packets) on each matched rule.
> 
> 3) Some netfilter matches/targets probably rely on the fact their handlers
> are run with BH disabled by their caller (ipt_do_table()/arp/ip6...)
> 
> These must be BH protected (and preempt disabled too), or else :
> 
> 1) A softirq could interrupt a process in the middle of ipt_do_table()
> and corrupt its "stack".
> 
> 2) A softirq could interrupt a process in ipt_do_table() in the middle
>  of the ADD_COUNTER(). Some counters could be corrupted.
> 
> 3) Some netfiler extensions would break.
> 
> Previous linux versions already used a read_lock_bh() here, on a single
> and shared rwlock, there is nothing new on this BH locking AFAIK.
> 
> Thank you

Thanks for the explanation. It might help to document the role of bh
disabling for the reader in a supplementary code comment, otherwise one
might think it's been put there to match the bottom half disabling used
on the write-side, which has the supplementary role of making sure bh
will not deadlock (and this precise behavior is not needed usually on
the read-side).

One more point :

 *    1. reading counter values
 *       all readers need to be stopped and the per-CPU values are summed.

Maybe it's just me, but this sentence does not seem to clearly indicate
that we have :

for_each_cpu()
  write lock()
  read data
  write unlock()

One might interpret it as :

for_each_cpu()
  write lock()

read data

for_each_cpu()
  write unlock()

Or maybe it's just my understanding of English that's not perfect.
Anyhow, rewording this sentence might not hurt. Something along the
lines of :

"reading counter values
 all readers are iteratively stopped to have their per-CPU values
 summed"

This is an important difference, as this behaves more like a RCU-based
mechanism than a global per-cpu read/write lock where all the write
locks would be taken at once.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ