Message-ID: <4D3C1FF5.2010607@gmail.com>
Date: Sun, 23 Jan 2011 13:32:53 +0100
From: Nicolas de Pesloüan <nicolas.2p.debian@...il.com>
To: Patrick Schaaf <netdev@....de>
CC: netdev@...r.kernel.org
Subject: Re: RFC: pid "ownership" of ip config information
On 23/01/2011 11:24, Patrick Schaaf wrote:
> On Fri, 2011-01-21 at 11:17 +0100, Nicolas de Pesloüan wrote:
>> On 21/01/2011 10:28, Patrick Schaaf wrote:
>>> The alternative to such a feature, would be to have an additional
>>> monitoring process, which would watch the PID somehow, and need to
>>> be configured to know what to withdraw when it dies.
>
>> There are user space clustering systems that should provide the same functionality. Did you
>> have a look at http://www.linux-ha.org/ ?
>
> Those would be the more complex instances of "an additional monitoring
> process", right?
>
> What happens when heartbeat is "kill -9"ed? Assume that I want to avoid
> STONITH-like approaches.
>
> My proposal could be _used_ by such complex clustering managers, too.
>
> Or, did I overlook a kernel-based solution there to "withdraw IP config
> when processes die"?
>
> Can you provide a direct link on linux-ha?
Do you consider "withdraw IP config" the only feature needed when a process dies? Or should we
instead design a more generic framework to run a command or invoke a system call when a process
dies? /sbin/init probably already does something similar. Arguably, even init may hang...
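To make the generic "run a command when a process dies" idea concrete, here is a minimal userland sketch in Python. It assumes a supervising parent that starts the service itself; the function name run_with_cleanup and both command lists are hypothetical, and the cleanup command stands in for whatever would withdraw the IP config. Note that the supervisor is itself an ordinary process that can hang or be killed, which is exactly the weakness discussed in this thread.

```python
import subprocess


def run_with_cleanup(service_cmd, cleanup_cmd):
    """Start service_cmd, wait for it to exit, then run cleanup_cmd.

    Both arguments are argv-style lists. cleanup_cmd is a placeholder
    (an assumption for illustration) for the action that withdraws
    whatever configuration the service "owned".
    """
    service = subprocess.Popen(service_cmd)
    try:
        # Blocks until the service exits, whether normally or by a signal.
        exit_status = service.wait()
    finally:
        # Runs even if waiting was interrupted in the supervisor.
        cleanup_status = subprocess.call(cleanup_cmd)
    return exit_status, cleanup_status
```

This only covers a service that actually exits. A service stuck in a loop never triggers the cleanup, and nothing runs at all if the supervisor itself dies first.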
If your point is to provide a safety net for a node that is very sick but not quite dead, then no
userland system would help. In that case, I agree with you that an automatic withdrawal of the IP
config might help.
However, how would you protect against a simple never-ending loop in the process, or against a very
slow process due to high load on the node? You would probably also need to guard against a process
that no longer reads its network receive queue.
This might end up as some sort of in-kernel local heartbeat monitoring of userland processes, and
I'm not sure anyone would support this.
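As a rough illustration of why watching for process exit is not enough, here is a small Python sketch of the heartbeat logic such a monitor would need, wherever it lives. The class name HeartbeatMonitor and its methods are hypothetical, and this is plain userland code, not a proposal for a kernel interface. The monitored process is assumed to call beat() periodically; a process stuck in a loop or starved by load stops doing so while its PID still exists.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat and report a stall after `timeout` seconds."""

    def __init__(self, timeout, now=None):
        self.timeout = timeout
        # An explicit `now` allows deterministic use; default is the monotonic clock.
        self.last_beat = time.monotonic() if now is None else now

    def beat(self, now=None):
        """Record a heartbeat from the monitored process."""
        self.last_beat = time.monotonic() if now is None else now

    def is_stalled(self, now=None):
        """Return True if no heartbeat arrived within the timeout."""
        now = time.monotonic() if now is None else now
        return (now - self.last_beat) > self.timeout
```

Whoever polls is_stalled() would then trigger the withdrawal, and that poller can of course be sick too, which brings back the need for a check from outside the node.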
And whatever you do locally on a node to ensure proper operation, you also need a way to check for
proper operation from outside the node. A STONITH system is always required, in order to kill a
totally mad node. Even the kernel may go mad.
Nicolas.