Date:	Mon, 1 Jun 2009 16:53:26 +0200
From:	Borislav Petkov <petkovbb@...glemail.com>
To:	"H. Peter Anvin" <hpa@...or.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Borislav Petkov <borislav.petkov@....com>, greg@...ah.com,
	mingo@...e.hu, norsk5@...oo.com, tglx@...utronix.de,
	mchehab@...hat.com, aris@...hat.com, edt@....ca,
	linux-kernel@...r.kernel.org, randy.dunlap@...cle.com
Subject: Re: [PATCH 0/4] amd64_edac: misc fixes

> Obviously not, since it's a relatively new opcode.  However, it is
> supported by both Intel and AMD with the opcode F3 0F B8 /r.
> 
> The "/r" is the real problem ... it means one can't just mimic it with
> hard-coding .byte directives without fixing the arguments (which means a
> performance hit.)  Furthermore, the 0F B8 opcode is JMPE, which doesn't
> take the same arguments either.

How about we pin both src and dst to a fixed register (%rax), which
makes the ModRM byte a constant:

/* f3 48 0f b8 c0: REX.W-prefixed popcnt, ModRM 0xc0 = %rax,%rax */
#define popcnt_spelled(x)                                       \
({                                                              \
        typeof(x) __ret;                                        \
        __asm__(".byte 0xf3\n\t.byte 0x48\n\t.byte 0x0f\n\t"    \
                ".byte 0xb8\n\t.byte 0xc0\n\t"                  \
                : "=a" (__ret)                                  \
                : "0" (x));                                     \
        __ret;                                                  \
})

which generates

  40055e:       48 8b 45 e8             mov    -0x18(%rbp),%rax
  400562:       f3 48 0f b8 c0          popcnt %rax,%rax
  400567:       48 89 45 f8             mov    %rax,-0x8(%rbp)

here.
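
For anyone who wants to try this outside the kernel, here is a minimal,
self-contained sketch of the same idea as a function. The name
popcnt64_fixed_rax, the "cc" clobber and the fallback to
__builtin_popcountll on non-x86-64 targets are assumptions made for
illustration, not part of the macro above. It also assumes the CPU
actually implements POPCNT; real code would check CPUID first.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the original mail): POPCNT with source
 * and destination both pinned to %rax, so the whole instruction can be
 * emitted as constant bytes. Assumes a POPCNT-capable x86-64 CPU.
 */
static inline uint64_t popcnt64_fixed_rax(uint64_t x)
{
#if defined(__x86_64__)
	uint64_t ret;

	/* f3 48 0f b8 c0 = popcnt %rax,%rax */
	__asm__(".byte 0xf3, 0x48, 0x0f, 0xb8, 0xc0"
		: "=a" (ret)	/* result comes back in %rax */
		: "0" (x)	/* input goes in the same register */
		: "cc");	/* popcnt rewrites the flags */
	return ret;
#else
	/* Assumed fallback so the sketch still builds elsewhere. */
	return (uint64_t)__builtin_popcountll(x);
#endif
}
```

Since the argument is already a full 64-bit value, there is nothing in
the upper bits of %rax that could disturb the count.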

For operand sizes below 64 bits, the compiler zero-extends the operand
when loading it, so that garbage in the high 32/48 bits of %rax doesn't
corrupt the result. The movzwl below is enough for that, since a write
to a 32-bit register clears bits 32-63 in 64-bit mode. We might even
want to do the movzwq explicitly so that some compiler doesn't pick a
movzx form with a 16-bit destination (e.g. "66 0f b6"), which leaves the
upper 48 bits of %rax untouched. This way, you can popcnt even single
bytes, although the popcnt instruction itself has no byte-sized operand
form.

  400572:       0f b7 45 f2             movzwl -0xe(%rbp),%eax
  400579:       f3 48 0f b8 c0          popcnt %rax,%rax
  40057e:       66 89 45 f6             mov    %ax,-0xa(%rbp)
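
The explicit zero-extension mentioned above could look roughly like
this. It is a sketch only: the function names are hypothetical, the
"r" input constraint and the "cc" clobber are assumptions, and the
non-x86-64 branch just falls back to the compiler builtin so that the
example builds everywhere. As before, it assumes a CPU with POPCNT.

```c
#include <stdint.h>

/* Hypothetical 16-bit variant: zero-extend in the asm itself. */
static inline unsigned int popcnt16_explicit(uint16_t x)
{
#if defined(__x86_64__)
	uint64_t ret;

	/* movzwq clears bits 16-63 of %rax before the 64-bit popcnt. */
	__asm__("movzwq %w1, %%rax\n\t"
		".byte 0xf3, 0x48, 0x0f, 0xb8, 0xc0"	/* popcnt %rax,%rax */
		: "=a" (ret)
		: "r" (x)
		: "cc");
	return (unsigned int)ret;
#else
	return (unsigned int)__builtin_popcount(x);
#endif
}

/* Hypothetical 8-bit variant: same trick with movzbq. */
static inline unsigned int popcnt8_explicit(uint8_t x)
{
#if defined(__x86_64__)
	uint64_t ret;

	/* movzbq clears bits 8-63 of %rax before the 64-bit popcnt. */
	__asm__("movzbq %b1, %%rax\n\t"
		".byte 0xf3, 0x48, 0x0f, 0xb8, 0xc0"	/* popcnt %rax,%rax */
		: "=a" (ret)
		: "r" (x)
		: "cc");
	return (unsigned int)ret;
#else
	return (unsigned int)__builtin_popcount(x);
#endif
}
```

With the extension inside the asm statement, the result no longer
depends on which load instruction the compiler happens to choose.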


So, in addition to popcnt itself, we get two movs. This is still
fewer than the 30+ ops (+ function call overhead) that hweight* get
translated into. I'll redo my kernel build benchmarks tomorrow to get
some more recent numbers on the performance gain.
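
For comparison, a software bit count of the kind hweight64() stands for
can be sketched as below. This is the well-known parallel (SWAR) bit
count, written here for illustration under the hypothetical name
hweight64_sw; it is not copied from the kernel sources, and the exact
instruction count depends on compiler and options.

```c
#include <stdint.h>

/* Illustrative software population count, not the kernel's code. */
static inline unsigned int hweight64_sw(uint64_t w)
{
	/* Count bits in each 2-bit group. */
	w -= (w >> 1) & 0x5555555555555555ULL;
	/* Sum neighbouring 2-bit counts into 4-bit groups. */
	w  = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
	/* Sum neighbouring 4-bit counts into bytes. */
	w  = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* Add all eight byte counts; the total ends up in the top byte. */
	return (unsigned int)((w * 0x0101010101010101ULL) >> 56);
}
```

Each step is a handful of shifts, masks and adds, and this is the work
that the single popcnt instruction replaces.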

-- 
Regards/Gruss,
    Boris.