Message-ID: <CAB=NE6Wti1RpAFk5q_YeZn2F9Rd=wsiwhyPszu74nG9fXwH5vQ@mail.gmail.com>
Date: Fri, 5 Sep 2014 00:47:16 -0700
From: "Luis R. Rodriguez" <mcgrof@...not-panic.com>
To: Tejun Heo <tj@...nel.org>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Dmitry Torokhov <dmitry.torokhov@...il.com>,
Wu Zhangjin <falcon@...zu.com>, Takashi Iwai <tiwai@...e.de>,
Arjan van de Ven <arjan@...ux.intel.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Oleg Nesterov <oleg@...hat.com>, hare@...e.com,
Andrew Morton <akpm@...ux-foundation.org>,
Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
Joseph Salisbury <joseph.salisbury@...onical.com>,
Benjamin Poirier <bpoirier@...e.de>,
Santosh Rastapur <santosh@...lsio.com>,
Kay Sievers <kay@...y.org>,
One Thousand Gnomes <gnomes@...rguk.ukuu.org.uk>,
Tim Gardner <tim.gardner@...onical.com>,
Pierre Fersing <pierre-fersing@...rref.org>,
Nagalakshmi Nandigama <nagalakshmi.nandigama@...gotech.com>,
Praveen Krishnamoorthy <praveen.krishnamoorthy@...gotech.com>,
Sreekanth Reddy <sreekanth.reddy@...gotech.com>,
Abhijit Mahajan <abhijit.mahajan@...gotech.com>,
Casey Leedom <leedom@...lsio.com>,
Hariprasad S <hariprasad@...lsio.com>,
MPT-FusionLinux.pdl@...gotech.com,
Linux SCSI List <linux-scsi@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [RFC v2 3/6] kthread: warn on kill signal if not OOM
On Fri, Sep 5, 2014 at 12:19 AM, Tejun Heo <tj@...nel.org> wrote:
> On Thu, Sep 04, 2014 at 11:37:24PM -0700, Luis R. Rodriguez wrote:
> ...
>> + /*
>> + * I got SIGKILL, but wait for 60 more seconds for completion
>> + * unless chosen by the OOM killer. This delay is there as a
>> + * workaround for boot failure caused by SIGKILL upon device
>> + * driver initialization timeout.
>> + *
>> + * N.B. this will actually let the thread complete regularly,
>> + * wait_for_completion() will be used eventually, the 60 second
>> + * try here is just to check for the OOM over that time.
>> + */
>> + WARN_ONCE(!test_thread_flag(TIF_MEMDIE),
>> + "Got SIGKILL but not from OOM, if this issue is on probe use .driver.async_probe\n");
>> + for (i = 0; i < 60 && !test_thread_flag(TIF_MEMDIE); i++)
>> + if (wait_for_completion_timeout(&done, HZ))
>> + goto wait_done;
>> +
>
> Ugh... Jesus, this is way too hacky, so now we fail on 90s timeout
> instead of 30?
Nope! I fell into the same trap, and it took a lot of patience from
Tetsuo before I grokked that the 60 seconds here do not extend the
timeout; they are just time spent checking that the OOM killer wasn't
the one that triggered the SIGKILL. Even if the drivers took eons it
should be fine now, I tried it :D
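
To make the loop's behavior concrete, here is a rough userspace sketch of
just the quoted hunk. It is not kernel code: every sim_* name is made up
for illustration, one "tick" stands in for one HZ-long
wait_for_completion_timeout() call, and what happens after the window
closes is deliberately left out because the quoted hunk does not show it.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the quoted loop looks at. */
struct sim_state {
	int done_at_tick; /* tick at which the completion would fire */
	int oom_at_tick;  /* tick at which TIF_MEMDIE would be set, -1 = never */
	int now;          /* elapsed ticks; one tick models one second */
};

enum sim_result {
	SIM_COMPLETED,      /* models the "goto wait_done" path */
	SIM_LEFT_OOM,       /* loop ended because the OOM flag was seen */
	SIM_WINDOW_EXPIRED  /* 60 polls passed, no completion, no OOM */
};

/* Models wait_for_completion_timeout(&done, HZ): wait one second. */
static bool sim_wait_one_second(struct sim_state *s)
{
	s->now++;
	return s->now >= s->done_at_tick;
}

/* Models test_thread_flag(TIF_MEMDIE). */
static bool sim_oom_victim(const struct sim_state *s)
{
	return s->oom_at_tick >= 0 && s->now >= s->oom_at_tick;
}

/* Mirrors the shape of the quoted for-loop after SIGKILL arrives. */
static enum sim_result sim_window_after_sigkill(struct sim_state *s)
{
	int i;

	for (i = 0; i < 60 && !sim_oom_victim(s); i++)
		if (sim_wait_one_second(s))
			return SIM_COMPLETED;

	return sim_oom_victim(s) ? SIM_LEFT_OOM : SIM_WINDOW_EXPIRED;
}
```

With done_at_tick = 5 and no OOM the sketch returns SIM_COMPLETED after
five ticks; with an OOM mark at tick 3 it leaves the loop at tick 3
without waiting out the rest of the window.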
> Why do we even need this with the proposed async
> probing changes?
Ah -- well, without it the way we "find" drivers that need this new
"async feature" is through bug reports from folks saying their system
can't boot or their device doesn't come up. That's all. Tracing
this to systemd and a timeout was one of the ugliest things ever.
There are two insane bug reports you can go check:
mptsas was the first:
http://article.gmane.org/gmane.linux.kernel/1669550
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1297248
Then cxgb4:
https://bugzilla.novell.com/show_bug.cgi?id=877622
I had only Cc'd you on the newest gem, pata_marvell:
https://bugzilla.kernel.org/show_bug.cgi?id=59581
We can't seriously expect to do all this work for every driver.
A WARN_ONCE() would let us find the drivers that need this new
async probe "feature".
Luis