Date:	Wed, 21 Dec 2011 14:15:11 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc:	Vince Weaver <vweaver1@...s.utk.edu>,
	William Cohen <wcohen@...hat.com>,
	Stephane Eranian <eranian@...gle.com>,
	Arun Sharma <asharma@...com>, Vince Weaver <vince@...ter.net>,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC][PATCH 0/6] perf: x86 RDPMC and RDTSC support


* Peter Zijlstra <a.p.zijlstra@...llo.nl> wrote:

> > I used the mmap_read_self() routine from your example as the
> > "read" operation whose performance I measured.
> 
> Yeah, that's about it. If you want to discard the overload
> scenario, e.g. if you use pinned counters, you can optimize it
> further by stripping out the TSC and scaling muck.

With the TSC scaling muck it's a bit above 50 cycles here.
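
For reference, here's a sketch of what that unstripped, scaling read
looks like. The time_offset/time_mult/time_shift field names and the
extrapolation formula are my reading of this patch set, so treat it
as illustrative rather than the exact code from the series; it reuses
the rdpmc() helper visible in the annotated profile below, plus an
analogous rdtsc() wrapper:

static u64 rdtsc(void)
{
        unsigned int low, high;

        asm volatile("rdtsc" : "=a" (low), "=d" (high));

        return low | ((u64)high) << 32;
}

static u64 mmap_read_self_scaled(void *addr)
{
        struct perf_event_mmap_page *pc = addr;
        u64 count, enabled, running, cyc, quot, rem, delta;
        u32 seq, idx;

        do {
                seq = pc->lock;
                barrier();

                enabled = pc->time_enabled;
                running = pc->time_running;

                /* extrapolate the kernel's timestamps to "now";
                 * split the multiply to avoid 64-bit overflow */
                cyc = rdtsc();
                quot = cyc >> pc->time_shift;
                rem = cyc & (((u64)1 << pc->time_shift) - 1);
                delta = pc->time_offset + quot * pc->time_mult +
                        ((rem * pc->time_mult) >> pc->time_shift);
                enabled += delta;

                idx = pc->index;
                count = pc->offset;
                if (idx) {
                        /* event is on the PMU right now */
                        running += delta;
                        count += rdpmc(idx - 1);
                }

                barrier();
        } while (pc->lock != seq);

        /* scale up for any time the event was scheduled out */
        if (running)
                count = count * enabled / running;

        return count;
}

The extra cost relative to the pinned variant further below comes
from the RDTSC plus the multiply/shift scaling.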

Without that, by optimizing it further for pinned counters, the 
overhead of mmap_read_self() gets down to 36 cycles on a Nehalem 
box.

A PEBS profile run of it shows that 90% of the overhead is in 
the RDPMC instruction:

    2.92 :          42a919:       lea    -0x1(%rax),%ecx
         :
         :        static u64 rdpmc(unsigned int counter)
         :        {
         :                unsigned int low, high;
         :
         :                asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
   89.20 :          42a91c:       rdpmc
         :                        count = pc->offset;
         :                        if (idx)
         :                                count += rdpmc(idx - 1);

So the RDPMC instruction alone costs about 32 cycles (roughly 90% of 
the 36-cycle total); the perf way of reading it adds only another 4 
cycles.

So the measured 'perf overhead' is so ridiculously low that it's 
virtually non-existent.

Here's the "pinned events" variant I've measured:

#include <linux/perf_event.h>
#include <stdint.h>

typedef uint32_t u32;
typedef uint64_t u64;

/* compiler barrier, as in the kernel */
#define barrier() asm volatile("" ::: "memory")

static u64 rdpmc(unsigned int counter)
{
        unsigned int low, high;

        asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));

        return low | ((u64)high) << 32;
}

static u64 mmap_read_self(void *addr)
{
        struct perf_event_mmap_page *pc = addr;
        u32 seq, idx;
        u64 count;

        do {
                /* seqlock: retry if the kernel updated the page meanwhile */
                seq = pc->lock;
                barrier();

                idx = pc->index;        /* HW counter index + 1; 0 == not active */
                count = pc->offset;
                if (idx)
                        count += rdpmc(idx - 1);

                barrier();
        } while (pc->lock != seq);

        return count;
}
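
For completeness, here's a minimal usage sketch of my own (not part
of the measurements above) showing how the mmap page gets set up in
the first place: open a pinned hardware event with perf_event_open(),
mmap its first page, and pass that to mmap_read_self(). The event
type and flags are arbitrary picks:

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
        struct perf_event_attr attr;
        u64 start, end;
        void *addr;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.pinned = 1;        /* pinned, so the fast path above applies */

        /* measure the current task, on whatever CPU it runs on */
        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        /* the first page of the mapping is struct perf_event_mmap_page */
        addr = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        start = mmap_read_self(addr);
        /* ... code being measured ... */
        end = mmap_read_self(addr);

        printf("instructions: %llu\n", (unsigned long long)(end - start));
        return 0;
}

Once the page is mapped, reading the count needs no syscall at all,
which is the whole point of the RDPMC path.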

Thanks,

	Ingo