Message-ID: <cf3edd8d0801151505r54bfd9d0r89f57e3835f295e1@mail.gmail.com>
Date:	Tue, 15 Jan 2008 23:05:17 +0000
From:	"Colin Fowler" <elethiomel@...il.com>
To:	"Ingo Molnar" <mingo@...e.hu>
Cc:	linux-kernel@...r.kernel.org,
	"Peter Zijlstra" <a.p.zijlstra@...llo.nl>
Subject: Re: Performance loss 2.6.22->2.6.23->2.6.24-rc7 on CPU intensive benchmark on 8 Core Xeon

Hi Ingo,
      I'll get the results tomorrow as I'm now out of the office, but
I can perhaps answer some of your queries now.

On Jan 15, 2008 10:06 PM, Ingo Molnar <mingo@...e.hu> wrote:

> hm, the system has considerable idle time left:
>
>  r  b swpd   free   buff  cache   si so  bi bo   in    cs  us sy id wa
>  8  0    0 1201920 683840 1039100  0  0   3  2   27    46   1  0 99  0
>  2  0    0 1202168 683840 1039112  0  0   0  0  245 45339  80  2 17  0
>  2  0    0 1202168 683840 1039112  0  0   0  0  263 47349  84  3 14  0
>  2  0    0 1202300 683848 1039112  0  0   0 76  255 47057  84  3 13  0
>
> and context-switches 45K times a second. Do you know what is going on
> there? I thought ray-tracing is something that can be parallelized
> pretty efficiently, without having to contend and schedule too much.
>

This is an RTRT (real-time ray tracing) system, so unlike traditional
offline ray tracers it is optimised for speed. The benchmark I ran
while these data were collected renders an 80K-polygon scene to a
512x512 buffer at just over 100 fps.

The context switches are most likely caused by the pthreads
synchronisation code. There are two mutexes. Each job is a 32x32 tile,
so each mutex is unlocked (512/32) * (512/32) * 100 (for 100 fps) =
25,600 times a second, or roughly 50k unlocks a second across the two
of them. That's very likely where our context switches are coming
from. Larger tile sizes would of course reduce the locking overhead,
but then the ray tracer suffers from load imbalance as some tiles are
much quicker to render than others. Empirically we've found that this
tile size works best for us.
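
To make that concrete, here is a minimal pthreads sketch of the kind
of per-frame tile queue described above. It is illustrative only, not
our actual code; the names (tile_queue, pop_tile, render_tile) and the
use of 8 worker threads are assumptions. It shows why every 32x32 tile
costs at least one lock/unlock pair on a shared mutex:

/* Hypothetical sketch of the tile job queue -- not the real renderer.
 * With (512/32) * (512/32) = 256 tiles per frame at ~100 fps, the
 * queue mutex alone is taken roughly 25,600 times a second. */
#include <pthread.h>
#include <stdio.h>

#define TILES_PER_FRAME ((512 / 32) * (512 / 32))   /* 256 */
#define NWORKERS 8

struct tile_queue {
    pthread_mutex_t lock;
    int next;                 /* index of the next unrendered tile */
    int total;                /* tiles in the current frame        */
};

static struct tile_queue queue = {
    .lock  = PTHREAD_MUTEX_INITIALIZER,
    .next  = 0,
    .total = TILES_PER_FRAME,
};

/* Returns a tile index, or -1 when the frame has no tiles left. */
static int pop_tile(struct tile_queue *q)
{
    int tile = -1;

    pthread_mutex_lock(&q->lock);
    if (q->next < q->total)
        tile = q->next++;
    pthread_mutex_unlock(&q->lock);

    return tile;
}

static void render_tile(int tile)
{
    /* trace the rays for one 32x32 tile (omitted) */
    (void)tile;
}

static void *worker(void *arg)
{
    int tile;

    (void)arg;
    while ((tile = pop_tile(&queue)) >= 0)
        render_tile(tile);
    return NULL;
}

int main(void)
{
    pthread_t threads[NWORKERS];
    int i;

    for (i = 0; i < NWORKERS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(threads[i], NULL);

    printf("frame done: %d tiles rendered\n", queue.total);
    return 0;
}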

The CPU idling occurs because the system doesn't yet perform
asynchronous rendering. When all tiles in the current job queue are
finished, the current frame is done. At this point all worker threads
sleep while the master thread blits the image to the screen and fills
the job queue for the next frame. The data probably show that one CPU
is kept maxed out while the others reach about 90% most of the time.
This is something on my TODO list to fix, along with a myriad of other
optimisations :)
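
For clarity, here is a rough sketch of that synchronous frame
boundary -- again illustrative rather than our real code, with
NWORKERS, blit_to_screen() and refill_job_queue() as placeholders. The
workers sleep on a condition variable once the tile queue is empty and
stay asleep while the single master thread blits and refills, which is
exactly when the other CPUs go idle:

/* Hypothetical frame-boundary code; a worker calls
 * worker_wait_next_frame() once pop_tile() (see the earlier sketch)
 * returns -1, and the master calls master_frame_boundary() in its
 * per-frame loop. */
#include <pthread.h>

#define NWORKERS 8

static pthread_mutex_t frame_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  frame_start = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  frame_done  = PTHREAD_COND_INITIALIZER;
static int workers_idle;      /* workers that ran out of tiles    */
static unsigned frame_no;     /* bumped by the master each frame  */

/* Worker side: park until the master has prepared the next frame. */
void worker_wait_next_frame(void)
{
    unsigned my_frame;

    pthread_mutex_lock(&frame_lock);
    if (++workers_idle == NWORKERS)
        pthread_cond_signal(&frame_done);   /* frame is complete  */
    my_frame = frame_no;
    while (frame_no == my_frame)            /* sleep during blit  */
        pthread_cond_wait(&frame_start, &frame_lock);
    pthread_mutex_unlock(&frame_lock);
}

/* Master side: wait for the frame, blit it, refill, release workers. */
void master_frame_boundary(void)
{
    pthread_mutex_lock(&frame_lock);
    while (workers_idle < NWORKERS)
        pthread_cond_wait(&frame_done, &frame_lock);

    /* blit_to_screen();     <- every other CPU is idle here */
    /* refill_job_queue();                                   */

    workers_idle = 0;
    frame_no++;
    pthread_cond_broadcast(&frame_start);
    pthread_mutex_unlock(&frame_lock);
}

Overlapping that blit/refill with rendering of the next frame is the
asynchronous rendering mentioned above.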

regards,
        Colin
