Message-Id: <1206122240.14638.31.camel@bobble.smo.corp.google.com>
Date: Fri, 21 Mar 2008 10:57:20 -0700
From: Frank Mayhar <fmayhar@...gle.com>
To: Roland McGrath <roland@...hat.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: posix-cpu-timers revamp
On Fri, 2008-03-21 at 00:18 -0700, Roland McGrath wrote:
> > Please take a look and let me know what you think. In the meantime I'll
> > be working on a similar patch to 2.6-head that has optimizations for
> > uniprocessor and two-CPU operation, to avoid the overhead of the percpu
> > functions when they are unneeded.
> My mention of a 2-CPU special case was just an off-hand idea. I don't
> really have any idea if that would be optimal given the tradeoff of
> increasing signal_struct size. The performance needs to be analyzed.
I would really like to ignore the 2-CPU scenario entirely and just have
two versions, the UP version and the n-way SMP version. It would make
life, and maintenance, simpler.
> > disappeared entirely and the arm_timer() routine merely fills
> > p->signal->it_*_expires from timer->it.cpu.expires.*. The
> > cpu_clock_sample_group_locked() loses its summing loops, using the
> > shared structure instead. Finally, set_process_cpu_timer() sets
> > tsk->signal->it_*_expires directly rather than calling the deleted
> > rebalance routine.
> I think I misled you about the use of the it_*_expires fields, sorry.
> The task_struct.it_*_expires fields are used solely as a cache of the
> head of cpu_timers[]. Despite the poor choice of the same name, the
> signal_struct.it_*_expires fields serve a different purpose. For an
> analogous cache of the soonest timer to expire, you need to add new
> fields. The signal_struct.it_{prof,virt}_{expires,incr} fields hold
> the setitimer settings for ITIMER_{PROF,VTALRM}. You can't change
> those in arm_timer. For a quick cache you need a new field that is
> the sooner of it_foo_expires or the head cpu_timers[foo] expiry time.
Okay, I'll go back over this and make sure I got it right. It's
interesting, though, that my current patch (written without this
particular bit of knowledge) actually performs no differently from the
existing mechanism.
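Just to be sure I'm following you, the cached value would be maintained
along these lines whenever a timer is armed (a rough sketch against my
2.6.18 tree, PROF case only; "prof_expires_cache" is just a placeholder
name for the new signal_struct field you're describing):

static void update_prof_expires_cache(struct task_struct *tsk)
{
	struct signal_struct *sig = tsk->signal;
	struct list_head *head = &sig->cpu_timers[CPUCLOCK_PROF];
	cputime_t soonest = sig->it_prof_expires;	/* setitimer value */

	if (!list_empty(head)) {
		struct cpu_timer_list *t = list_entry(head->next,
				struct cpu_timer_list, entry);

		/* The list is kept sorted, so the head expires first. */
		if (cputime_eq(soonest, cputime_zero) ||
		    cputime_lt(t->expires.cpu, soonest))
			soonest = t->expires.cpu;
	}
	sig->prof_expires_cache = soonest;	/* placeholder field name */
}
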
From my handy four-core AMD64 test system running 2.6.18.5, the old
kernel gets:
./nohangc-3 1300 200000
Interval timer off.
Threads: 1300
Max prime: 200000
Elapsed: 95.421s
Execution: User 356.001s, System 0.029s, Total 356.030s
Context switches: vol 1319, invol 7402
./hangc-3 1300 200000
Interval timer set to 0.010 sec.
Threads: 1300
Max prime: 200000
Elapsed: 131.457s
Execution: User 435.037s, System 59.495s, Total 494.532s
Context switches: vol 1464, invol 10123
Ticks: 22612, tics/sec 45.724, secs/tic 0.022
(More than 1300 threads hangs the old kernel with this test.)
With my patch it gets:
./nohangc-3 1300 200000
Interval timer off.
Threads: 1300
Max prime: 200000
Elapsed: 94.097s
Execution: User 366.000s, System 0.052s, Total 366.052s
Context switches: vol 1336, invol 28928
./hangc-3 1300 200000
Interval timer set to 0.010 sec.
Threads: 1300
Max prime: 200000
Elapsed: 93.583s
Execution: User 366.117s, System 0.047s, Total 366.164s
Context switches: vol 1323, invol 28875
Ticks: 12131, tics/sec 33.130, secs/tic 0.030
Also see below.
> The shared_utime_sum et al names are somewhat oblique to anyone who
> hasn't just been hacking on exactly this thing like you and I have.
> Things like thread_group_*time make more sense.
In the latest cut I've named them "process_*" but "thread_group" makes
more sense.
> There are now several places where you call both shared_utime_sum and
> shared_stime_sum. It looks simple because they're nicely encapsulated.
> But now you have two loops through all CPUs, and three loops in
> check_process_timers.
Good point, although so far it's been undetectable in my performance
testing. (I can't say that it will stay that way down the road a bit,
when we have systems with large numbers of cores.)
> I think what we want instead is this:
>
> struct task_cputime
> {
> 	cputime_t utime;
> 	cputime_t stime;
> 	unsigned long long schedtime;
> };
>
> Use one in task_struct to replace the utime, stime, and sum_sched_runtime
> fields, and another to replace it_*_expires. Use a single inline function
> thread_group_cputime() that fills a sum struct task_cputime using a single
> loop. For the places where only one or two of the sums are actually
> used, the compiler should optimize away the extra summing from the loop.
Excellent idea! This method hadn't occurred to me since I was looking
at it from the viewpoint of the existing structure and keeping the
fields separated, but this makes more sense.
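To make sure we mean the same thing, here is roughly how I picture that
single-loop helper against the percpu area in my current patch (a sketch
only; "cputime" is a stand-in name for whatever the percpu pointer in
signal_struct ends up being called):

static inline void thread_group_cputime(struct signal_struct *sig,
					struct task_cputime *times)
{
	int i;

	times->utime = cputime_zero;
	times->stime = cputime_zero;
	times->schedtime = 0;
	/* One pass over the CPUs fills all three sums; callers that need
	 * only one or two of them should see the rest optimized away. */
	for_each_possible_cpu(i) {
		struct task_cputime *tot = per_cpu_ptr(sig->cputime, i);

		times->utime = cputime_add(times->utime, tot->utime);
		times->stime = cputime_add(times->stime, tot->stime);
		times->schedtime += tot->schedtime;
	}
}
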
> Don't use __cacheline_aligned on this struct type itself, because most of
> the uses don't need that. When using alloc_percpu, you can rely on it to
> take care of those needs--that's what it's for. If you implement a
> variant that uses a flat array, you can use a wrapper struct with
> __cacheline_aligned for that.
Yeah, I had caught that one.
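For the flat-array variant I have in mind something as simple as this
(the wrapper name is illustrative):

/* Pad each entry to a cache line so per-CPU slots don't false-share;
 * alloc_percpu() already takes care of that in the percpu variant. */
struct task_cputime_padded {
	struct task_cputime cputime;
} ____cacheline_aligned;
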
FYI, I've attached the latest version of the 2.6.18 patch; you might
want to take a look as it has changed a bit. I generated some numbers
as well (from a new README):
Testing was performed using a heavily-modified version of the test
that originally showed the problem. The test sets ITIMER_PROF (if
not run with "nohang" in the name of the executable) and catches
the SIGPROF signal (in any event), then starts some number of threads,
each of which computes the prime numbers up to a given maximum (this
function was lifted from the "cpu" benchmark of sysbench version
0.4.8). It takes as parameters the number of threads to create and
the maximum value for the prime number calculation. It starts the
threads, calls pthread_barrier_wait() to wait for them to complete and
rendezvous, then joins the threads. It uses gettimeofday() to get
the time and getrusage() to get resource usage before and after the
threads run, then reports the number of threads; the differences in
elapsed time, user and system CPU time, and voluntary and involuntary
context switches; and the total number of SIGPROF signals received
(which will be zero if the test is run as "nohang").
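In skeleton form the test looks something like the following (only the
rough shape, not the actual source; error handling and some of the
reporting are omitted, and the worker loop stands in for the
sysbench-derived prime calculation):

#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>

static pthread_barrier_t barrier;
static volatile sig_atomic_t ticks;	/* SIGPROF count reported at the end */
static long max_prime;

static void sigprof_handler(int sig)
{
	ticks++;
}

/* Stand-in for the prime loop lifted from sysbench's "cpu" test. */
static void *worker(void *arg)
{
	long n, i, nprimes = 0;

	for (n = 3; n <= max_prime; n++) {
		for (i = 2; i * i <= n; i++)
			if (n % i == 0)
				break;
		if (i * i > n)
			nprimes++;
	}
	pthread_barrier_wait(&barrier);		/* rendezvous with main() */
	return (void *)nprimes;		/* keep the loop from optimizing away */
}

int main(int argc, char **argv)
{
	int nthreads = atoi(argv[1]);
	pthread_t *threads = calloc(nthreads, sizeof(*threads));
	struct itimerval itv = { { 0, 10000 }, { 0, 10000 } };	/* 0.010s */
	struct sigaction sa;
	struct timeval tv0, tv1;
	struct rusage ru0, ru1;
	int i;

	max_prime = atol(argv[2]);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigprof_handler;
	sigaction(SIGPROF, &sa, NULL);			/* always catch SIGPROF */
	if (strstr(argv[0], "nohang") == NULL)
		setitimer(ITIMER_PROF, &itv, NULL);	/* only the "hang" variant */

	pthread_barrier_init(&barrier, NULL, nthreads + 1);
	gettimeofday(&tv0, NULL);
	getrusage(RUSAGE_SELF, &ru0);

	for (i = 0; i < nthreads; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	pthread_barrier_wait(&barrier);		/* wait for all workers */
	for (i = 0; i < nthreads; i++)
		pthread_join(threads[i], NULL);

	getrusage(RUSAGE_SELF, &ru1);
	gettimeofday(&tv1, NULL);

	printf("Threads: %d\nMax prime: %ld\n", nthreads, max_prime);
	printf("Elapsed: %.3fs\n",
	       (tv1.tv_sec - tv0.tv_sec) + (tv1.tv_usec - tv0.tv_usec) / 1e6);
	printf("Execution: User %.3fs, System %.3fs\n",
	       (ru1.ru_utime.tv_sec - ru0.ru_utime.tv_sec) +
	       (ru1.ru_utime.tv_usec - ru0.ru_utime.tv_usec) / 1e6,
	       (ru1.ru_stime.tv_sec - ru0.ru_stime.tv_sec) +
	       (ru1.ru_stime.tv_usec - ru0.ru_stime.tv_usec) / 1e6);
	printf("Context switches: vol %ld, invol %ld\n",
	       ru1.ru_nvcsw - ru0.ru_nvcsw, ru1.ru_nivcsw - ru0.ru_nivcsw);
	printf("Ticks: %d\n", (int)ticks);
	return 0;
}
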
On a four-core AMD64 system (two dual-core AMD64s), for 1300 threads
(more than that hung the old kernel) and a max prime of 120,000, the old
kernel averaged roughly 70s elapsed, with about 240s user CPU and 35s
system CPU, with the profile timer ticking about every 0.02s. The new
kernel averaged roughly 45s elapsed, with about 181s user CPU and 0.04s
system CPU, and with the profile timer ticking about every 0.01s.
On a sixteen-core system (four quad-core AMD64s), for 1300 threads as
above but with a max prime of 300,000, the old kernel averaged roughly
65s elapsed, with about 600s user CPU and 91s system CPU, with the
profile timer ticking about every 0.02s. The new kernel averaged
roughly 70s elapsed, with about 239s user CPU and 35s system CPU,
and with the profile timer ticking about every 0.02s.
On the same sixteen-core system, 100,000 threads with a max prime of
100,000 run in roughly 975s elapsed, with about 5,538s user CPU and
751s system CPU, with the profile timer ticking about every 0.025s.
In summary, the performance of the kernel with the fix is comparable to
the performance without it, with the advantage that many threads will
no longer hang the system.
The patch is attached.
--
Frank Mayhar <fmayhar@...gle.com>
Google, Inc.