Message-ID: <m1bp8ypcyb.fsf@fess.ebiederm.org>
Date:	Thu, 19 Aug 2010 13:25:16 -0700
From:	ebiederm@...ssion.com (Eric W. Biederman)
To:	"Zhang\, Yanmin" <yanmin_zhang@...ux.intel.com>
Cc:	LKML <linux-kernel@...r.kernel.org>, alex.shi@...el.com,
	Pavel Emelyanov <xemul@...nvz.org>,
	"David S. Miller" <davem@...emloft.net>
Subject: Re: hackbench regression with 2.6.36-rc1

"Zhang, Yanmin" <yanmin_zhang@...ux.intel.com> writes:

> On Wed, 2010-08-18 at 03:56 -0700, Eric W. Biederman wrote:
>> "Zhang, Yanmin" <yanmin_zhang@...ux.intel.com> writes:
>> 
>> > Comparing with 2.6.35's result, hackbench (thread mode) has about
>> > 80% regression on dual-socket Nehalem machine and about 90% regression
>> > on 4-socket Tigerton machines.
>> 
>> That seems unfortunate.  
>
>> Do you only show a regression in the pthread
>> hackbench test?
> Yes.
>
>>   Do you show a regression when you use pipes?
> No.
>
>> 
>> Does the size of the regression vary based on the number of loop
>> iterations?
> No. I tried 1000 and got a similar regression ratio.
> I chose a large loop count of 2000 because I wanted a stable
> result.
>
> It's easy to reproduce. We found it on almost all our machines.
>
>>   I ask because it appears that on the last message the
>> sender will exit, necessitating that the receiver put the sender's pid.
>> Which should be atypical.
> I don't agree with that. With hackbench, a sender sends loops*receiver_num_per_group
> messages before exiting.
> In addition, 'perf top' shows put_pid is the hottest function right
> after I start hackbench.

If increasing the number of loops does not improve the performance, then
the hypothesis that only the last message incurs the regression
is shot.
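
To put a rough number on that, here is a small sketch of the per-sender message count for the command quoted in this thread (./hackbench 100 thread 2000), using the loops*receiver_num_per_group formula above. The value of 20 receivers per group is an assumption (hackbench's usual default, not stated in this thread), and messages_per_sender is a name made up for illustration.

```python
def messages_per_sender(loops, receivers_per_group):
    """Messages one sender writes before exiting: loops * receiver_num_per_group."""
    return loops * receivers_per_group


LOOPS = 2000              # from "./hackbench 100 thread 2000"
RECEIVERS_PER_GROUP = 20  # ASSUMPTION: hackbench default, not given in the thread

total = messages_per_sender(LOOPS, RECEIVERS_PER_GROUP)

# If only the final message paid the extra cost, it would be 1 out of `total`.
last_message_share = 1 / total

print(f"messages per sender: {total}")
print(f"share of the last message: {last_message_share:.6%}")
```

Under that assumption the last message is a tiny fraction of the traffic, which is consistent with the regression ratio not changing between 1000 and 2000 loops.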


>> > Command to start hackbench:
>> > #./hackbench 100 thread 2000
>> >
>> > process mode has no such regression.
>> >
>> > Profiling shows:
>> > #perf top
>> >              samples  pcnt function                 DSO
>> >              _______ _____ ________________________ ________________________
>> >
>> >             74415.00 29.9% put_pid                  [kernel.kallsyms]       
>> >             38395.00 15.4% unix_stream_recvmsg      [kernel.kallsyms]       
>> >             34877.00 14.0% unix_stream_sendmsg      [kernel.kallsyms]       
>> >             25204.00 10.1% pid_vnr                  [kernel.kallsyms]       
>> >             21864.00  8.8% unix_scm_to_skb          [kernel.kallsyms]       
>> >             13637.00  5.5% cred_to_ucred            [kernel.kallsyms]       
>> >              6520.00  2.6% unix_destruct_scm        [kernel.kallsyms]       
>> >              4731.00  1.9% sock_alloc_send_pskb     [kernel.kallsyms]       
>> >
>> >
>> > With 2.6.35, perf doesn't show put_pid/pid_vnr.
>> 
>> Yes.  2.6.35 is imperfect and can report the wrong pid in some
>> circumstances.  I am surprised nothing related to the reference count on
>> struct cred shows up in your profiling traces.
>> 
>
>> You are performing statistical sampling so I don't believe the
>> percentage of hits per function is the same as the percentage of
>> time per function.
> Agreed. But from a performance tuning point of view, the percentage of hits is
> enough to help developers investigate.
>
> I provide the 'perf top' data to help you debug, not to prove your patches
> cause the regression. We used bisect to locate them.

Sure.  I was just trying to figure out how to explain why the creds
don't show a similar hit.  I still don't have a complete explanation
for the profile, but the cred put and get are inline functions, so they
won't be present as distinct functions in the profile.
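
For reference, the pid-related entries can be totalled from the 'perf top' listing quoted in this thread. This is only a sketch over the sample counts shown there; the dictionary and variable names are made up for illustration, and sample counts are hits, not measured time.

```python
# Sample counts copied from the 'perf top' listing quoted in this thread.
SAMPLES = {
    "put_pid": 74415,
    "unix_stream_recvmsg": 38395,
    "unix_stream_sendmsg": 34877,
    "pid_vnr": 25204,
    "unix_scm_to_skb": 21864,
    "cred_to_ucred": 13637,
    "unix_destruct_scm": 6520,
    "sock_alloc_send_pskb": 4731,
}

# Functions that deal directly with struct pid.
PID_RELATED = ("put_pid", "pid_vnr")

pid_samples = sum(SAMPLES[name] for name in PID_RELATED)
listed_total = sum(SAMPLES.values())

# Share of hits among the listed functions only (the listing is truncated,
# so this is not a share of all samples taken).
pid_share_of_listed = pid_samples / listed_total

print(f"pid-related samples: {pid_samples} of {listed_total} listed")
print(f"share among listed functions: {pid_share_of_listed:.1%}")
```

Inline helpers such as the cred get and put would be counted under whichever listed function they were inlined into, so they cannot be separated out from a table like this.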

>> Given that we are talking about a scheduler benchmark that is
>> doing something rather artificial (inter thread communication via
>> sockets), I don't know that this case is worth worrying about.
> Good question. How about the scenario below?
> Start 2 processes, each of which creates many threads. Threads of process 1
> communicate with threads of process 2.

Maybe.  A lot depends on the timing, and what it takes to trigger
the cross-CPU cache line bounce.

And we still have pipes for ultimate performance.  Grrr.
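
For anyone who wants to see the shape of the workload, below is a minimal sketch of inter-thread communication over a Unix stream socket pair, which is the pattern under discussion. It is not hackbench itself: the function names, MSG_SIZE, and the single sender/receiver pair are assumptions for illustration, and it makes no attempt to measure anything.

```python
import socket
import threading

MSG_SIZE = 100  # bytes per message; hypothetical value chosen for the sketch


def sender(sock, loops):
    """Write `loops` fixed-size messages, then close to signal EOF."""
    payload = b"x" * MSG_SIZE
    for _ in range(loops):
        sock.sendall(payload)
    sock.close()


def receiver(sock, result):
    """Read until EOF and record the total number of bytes received."""
    total = 0
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        total += len(chunk)
    sock.close()
    result.append(total)


def run_exchange(loops):
    """Run one sender thread and one receiver thread over a socket pair."""
    # socketpair() defaults to AF_UNIX/SOCK_STREAM where AF_UNIX is available.
    a, b = socket.socketpair()
    result = []
    t_recv = threading.Thread(target=receiver, args=(b, result))
    t_send = threading.Thread(target=sender, args=(a, loops))
    t_recv.start()
    t_send.start()
    t_send.join()
    t_recv.join()
    return result[0]


if __name__ == "__main__":
    print(run_exchange(1000), "bytes received")
```

Both threads share one process, which is the thread-mode case where the regression was reported.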

I will give it some thought to see if I can find a less expensive way
but I don't have any good ideas at the moment.


Eric