Message-ID: <654964868.14006956.1454063625314.JavaMail.zimbra@redhat.com>
Date: Fri, 29 Jan 2016 05:33:45 -0500 (EST)
From: Jan Stancek <jstancek@...hat.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: alex shi <alex.shi@...el.com>, guz fnst <guz.fnst@...fujitsu.com>,
mingo@...hat.com, jolsa@...hat.com, riel@...hat.com,
linux-kernel@...r.kernel.org
Subject: Re: [BUG] scheduler doesn't balance thread to idle cpu for 3
seconds
----- Original Message -----
> From: "Peter Zijlstra" <peterz@...radead.org>
> To: "Jan Stancek" <jstancek@...hat.com>
> Cc: "alex shi" <alex.shi@...el.com>, "guz fnst" <guz.fnst@...fujitsu.com>, mingo@...hat.com, jolsa@...hat.com,
> riel@...hat.com, linux-kernel@...r.kernel.org
> Sent: Friday, 29 January, 2016 11:15:22 AM
> Subject: Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
>
> On Thu, Jan 28, 2016 at 01:43:13PM -0500, Jan Stancek wrote:
> > > How long should I have to wait for a fail?
> >
> > It's about 1000-2000 iterations for me, which I think you covered
> > by now in those 2 hours.
>
> So I've been running:
>
> while ! ./pthread_cond_wait_1 ; do sleep 1; done
>
> overnight on the machine, and have yet to hit a wobbly -- that is, its
> still running.
I have seen a similar result.
Then, instead of turning CPUs off, I spawned more low-priority threads, scaling
their number with the number of CPUs on the system:
@@ -213,10 +213,14 @@
 		printf(ERROR_PREFIX "pthread_attr_setschedparam\n");
 		exit(PTS_UNRESOLVED);
 	}
-	rc = pthread_create(&low_id, &low_attr, low_priority_thread, NULL);
-	if (rc != 0) {
-		printf(ERROR_PREFIX "pthread_create\n");
-		exit(PTS_UNRESOLVED);
+
+	int i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
+	for (i = 0; i < ncpus - 1; i++) {
+		rc = pthread_create(&low_id, &low_attr, low_priority_thread, NULL);
+		if (rc != 0) {
+			printf(ERROR_PREFIX "pthread_create\n");
+			exit(PTS_UNRESOLVED);
+		}
and let this run on 3 bare-metal x86 systems overnight (v4.5-rc1). It
failed on 2 systems (12 and 24 CPUs) in roughly 1 out of 1000 runs; it never
failed on the 3rd one (4 CPUs).
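For anyone who wants to try the same idea outside the test suite, here is a
minimal standalone sketch of "one busy thread per online CPU, minus one". It
is not the test's code: spin_thread(), spinner_count(), start_spinners() and
stop_spinners() are hypothetical names, the threads use default attributes
rather than the test's low_attr (so no real-time privileges are needed), and
each thread id is kept so the threads can be joined afterwards.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical stand-in for low_priority_thread(): spin until told to stop. */
static atomic_int stop_spinning;

static void *spin_thread(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop_spinning))
		;	/* burn CPU so the scheduler has something to balance */
	return NULL;
}

/* One spinner per online CPU, minus one; 0 on a single-CPU system or on error. */
static long spinner_count(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	return ncpus > 1 ? ncpus - 1 : 0;
}

/* Signal all spinners to exit and wait for the first n of them. */
static void stop_spinners(pthread_t *ids, long n)
{
	atomic_store(&stop_spinning, 1);
	for (long i = 0; i < n; i++)
		pthread_join(ids[i], NULL);
}

/*
 * Start n spinners; ids must have room for n entries.
 * Returns 0 on success, or the pthread_create() error code after
 * stopping and joining the threads that were already started.
 */
static int start_spinners(pthread_t *ids, long n)
{
	atomic_store(&stop_spinning, 0);
	for (long i = 0; i < n; i++) {
		int rc = pthread_create(&ids[i], NULL, spin_thread, NULL);

		if (rc != 0) {
			stop_spinners(ids, i);
			return rc;
		}
	}
	return 0;
}
```

A caller would allocate spinner_count() pthread_t slots, call
start_spinners(), run the timing-sensitive part, and then call
stop_spinners(). Build with -pthread.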
>
> Also note that I don't think failing this test is a bug per se.
> Undesirable maybe, but within spec, since SIGALRM is process wide, so it
> being delivered to the SCHED_OTHER task is accepted, and SCHED_OTHER has
> no timeliness guarantees.
>
> That said; if I could reliably reproduce I'd have a go at fixing this, I
> suspect there's a 'fun' problem at the bottom of this.
Thanks for trying. I'll see if I can find a more reliable way to reproduce it.
Regards,
Jan