linux-kernel - Re: [ltt-dev] cli/sti vs local_cmpxchg and local_add


Message-Id: <200903182243.34090.nickpiggin@yahoo.com.au>
Date:	Wed, 18 Mar 2009 22:43:33 +1100
From:	Nick Piggin <nickpiggin@...oo.com.au>
To:	Mathieu Desnoyers <compudj@...stal.dyndns.org>
Cc:	ltt-dev@...ts.casi.polymtl.ca, Ingo Molnar <mingo@...e.hu>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Josh Boyer <jwboyer@...ux.vnet.ibm.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [ltt-dev] cli/sti vs local_cmpxchg and local_add_return

On Wednesday 18 March 2009 02:14:37 Mathieu Desnoyers wrote:
> * Nick Piggin (nickpiggin@...oo.com.au) wrote:
> > On Tuesday 17 March 2009 12:32:20 Mathieu Desnoyers wrote:
> > > Hi,
> > >
> > > I am trying to get access to some non-x86 hardware to run some atomic
> > > primitive benchmarks for a paper on LTTng I am preparing. That should
> > > be useful to argue about performance benefit of per-cpu atomic
> > > operations vs interrupt disabling. I would like to run the following
> > > benchmark module on CONFIG_SMP :
> > >
> > > - PowerPC
> > > - MIPS
> > > - ia64
> > > - alpha
> > >
> > > usage :
> > > make
> > > insmod test-cmpxchg-nolock.ko
> > > insmod: error inserting 'test-cmpxchg-nolock.ko': -1 Resource temporarily unavailable
> > > dmesg (see dmesg output)
> > >
> > > If some of you would be kind enough to run my test module provided
> > > below and provide the results of these tests on a recent kernel
> > > (2.6.26~2.6.29 should be good) along with their cpuinfo, I would
> > > greatly appreciate it.
> > >
> > > Here are the CAS results for various x86 processors :
> > >
> > > Architecture         | Speedup                      |      CAS     |         Interrupts         |
> > >                      | (cli + sti) / local cmpxchg  | local | sync | Enable (sti) | Disable (cli)
> > > -------------------------------------------------------------------------------------------------
> > > Intel Pentium 4      | 5.24                         |  25   | 81   | 70           | 61          |
> > > AMD Athlon(tm)64 X2  | 4.57                         |  7    | 17   | 17           | 15          |
> > > Intel Core2          | 6.33                         |  6    | 30   | 20           | 18          |
> > > Intel Xeon E5405     | 5.25                         |  8    | 24   | 20           | 22          |
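
As a quick check of the Speedup column, here is a minimal C sketch of the
ratio named in the table header, (cli + sti) / local cmpxchg. The struct and
function names are made up for illustration and are not part of the benchmark
module; the numbers are the ones reported in the table, in whatever unit the
table uses.

```c
#include <assert.h>

/* One row of the table; field names are illustrative only. */
struct cas_row {
    const char *arch;
    double local_cas;   /* "CAS local" column */
    double sti;         /* "Enable (sti)" column */
    double cli;         /* "Disable (cli)" column */
};

/* Speedup as defined by the table header: (cli + sti) / local cmpxchg. */
static double cas_speedup(const struct cas_row *r)
{
    assert(r->local_cas > 0.0);   /* guard against division by zero */
    return (r->cli + r->sti) / r->local_cas;
}
```

For the Pentium 4 row this gives (61 + 70) / 25 = 5.24, which matches the
value reported in the table.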
> > >
> > > The benefit expected on PowerPC, ia64 and alpha should principally come
> > > from removed memory barriers in the local primitives.
> >
> > Benefit versus what? I think all of those architectures can do SMP
> > atomic compare exchange sequences without barriers, can't they?
>
> Hi Nick,
>
> I want to determine whether it is faster to synchronize the tracing hot
> path wrt interrupts using SMP cas without barriers, or by disabling
> interrupts. That decision will depend on the benchmark I propose, because
> it compares the time it takes to perform both.
>
> Overall, the benchmarks will make it possible to choose between these two
> simplified hot path pseudo-codes (offset is global to the buffer,
> commit_count is per-subbuffer).
>
>
> * lockless :
>
> do {
>   old_offset = local_read(&offset);
>   get_cycles();
>   compute needed size;
>   new_offset = old_offset + size;
> } while (local_cmpxchg(&offset, old_offset, new_offset) != old_offset);
>
> /*
>  * note : writing to buffer is done out-of-order wrt buffer slot
>  * physical order.
>  */
> write_to_buffer(offset);
>
> /*
>  * Make sure the data is written in the buffer before commit count is
>  * incremented.
>  */
> smp_wmb();
>
> /* note : incrementing the commit count is also done out-of-order */
> count = local_add_return(size, &commit_count[subbuf_index]);
> if (count is filling a subbuffer)
>   allow to wake up readers

Ah OK, so you just mean the benefit of using local atomics is avoiding
the barriers that you get with atomic_t.

I'd thought you were referring to some benefit over the irq disable pattern.
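
To make the lockless hot path and the barrier point concrete, here is a rough
userspace sketch using C11 atomics. It is not kernel code: relaxed atomics
stand in for local_cmpxchg()/local_add_return(), a release fence stands in
for smp_wmb(), and struct ring, SUBBUF_SIZE, N_SUBBUF and the function names
are assumptions made up for the example. It also assumes a record never
crosses a sub-buffer boundary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 64u   /* hypothetical sub-buffer size in bytes */
#define N_SUBBUF    4u    /* hypothetical number of sub-buffers */

struct ring {
    atomic_ulong offset;                  /* global to the buffer */
    atomic_ulong commit_count[N_SUBBUF];  /* one counter per sub-buffer */
    unsigned char data[SUBBUF_SIZE * N_SUBBUF];
};

static void ring_init(struct ring *r)
{
    atomic_init(&r->offset, 0);
    for (size_t i = 0; i < N_SUBBUF; i++)
        atomic_init(&r->commit_count[i], 0);
    memset(r->data, 0, sizeof(r->data));
}

/* Reserve `size` bytes with a barrier-free CAS loop; returns the slot start. */
static unsigned long ring_reserve(struct ring *r, unsigned long size)
{
    unsigned long old = atomic_load_explicit(&r->offset, memory_order_relaxed);

    /* On failure the CAS stores the current offset into `old`, so just retry. */
    while (!atomic_compare_exchange_weak_explicit(&r->offset, &old, old + size,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;
    return old;
}

/* Write the payload, then commit. Returns 1 when the sub-buffer is full. */
static int ring_write_commit(struct ring *r, unsigned long start,
                             const void *src, unsigned long size)
{
    size_t pos = start % sizeof(r->data);
    size_t idx = pos / SUBBUF_SIZE;

    assert(pos % SUBBUF_SIZE + size <= SUBBUF_SIZE);  /* sketch-only limit */
    memcpy(&r->data[pos], src, size);

    /* Stand-in for smp_wmb(): payload stores are ordered before the commit. */
    atomic_thread_fence(memory_order_release);

    unsigned long count = atomic_fetch_add_explicit(&r->commit_count[idx], size,
                                                    memory_order_relaxed) + size;
    return count % SUBBUF_SIZE == 0;
}
```

Reserving 16 and then 48 bytes and committing them in the opposite order
still ends with the first sub-buffer reported full on the second commit,
which is the out-of-order commit behaviour the pseudo-code comments describe.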


> * irq off :
>
> (note : offset and commit count would each be written to atomically
> (type unsigned long))
>
> local_irq_save(flags);
>
> get_cycles();
> compute needed size;
> offset += size;
>
> write_to_buffer(offset);
>
> /*
>  * Make sure the data is written in the buffer before commit count is
>  * incremented.
>  */
> smp_wmb();
>
> count = (commit_count[subbuf_index] += size);
> if (count is filling a subbuffer)
>   allow to wake up readers
>
> local_irq_restore(flags);
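
For comparison, the irq-off variant can be approximated in userspace by
blocking signals around plain, non-atomic updates. This is only an analogy:
sigprocmask() stands in for local_irq_save()/local_irq_restore(), the sketch
assumes a single-threaded process, and struct ring_plain, the constants and
ring_write_masked() are hypothetical names. Unlike the pseudo-code, the
record is written at the offset read before the increment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 64u   /* hypothetical sub-buffer size in bytes */
#define N_SUBBUF    4u    /* hypothetical number of sub-buffers */

struct ring_plain {
    unsigned long offset;                  /* plain unsigned long, as noted */
    unsigned long commit_count[N_SUBBUF];
    unsigned char data[SUBBUF_SIZE * N_SUBBUF];
};

/*
 * Reserve, write and commit one record with signals blocked.
 * Returns 1 if the sub-buffer is now full, 0 if not, -1 if masking failed.
 */
static int ring_write_masked(struct ring_plain *r, const void *src,
                             unsigned long size)
{
    sigset_t all, saved;

    sigfillset(&all);
    if (sigprocmask(SIG_BLOCK, &all, &saved) != 0)  /* ~ local_irq_save() */
        return -1;

    unsigned long start = r->offset;   /* slot begins at the old offset */
    r->offset += size;

    size_t pos = start % sizeof(r->data);
    size_t idx = pos / SUBBUF_SIZE;
    assert(pos % SUBBUF_SIZE + size <= SUBBUF_SIZE);  /* sketch-only limit */
    memcpy(&r->data[pos], src, size);

    /* Stand-in for smp_wmb(): payload stores are ordered before the commit. */
    atomic_thread_fence(memory_order_release);

    r->commit_count[idx] += size;
    int full = (r->commit_count[idx] % SUBBUF_SIZE == 0);

    sigprocmask(SIG_SETMASK, &saved, NULL);         /* ~ local_irq_restore() */
    return full;
}
```

The updates themselves need no atomic instructions here; the cost moves to
the mask and unmask calls, which mirrors the cli/sti cost the benchmark is
meant to measure on the kernel side.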
