linux-kernel - Re: Kernel 3.4.X NFS server regression


Message-ID: <4FD60839.3000508@panasas.com>
Date:	Mon, 11 Jun 2012 18:01:13 +0300
From:	Boaz Harrosh <bharrosh@...asas.com>
To:	Jeff Layton <jlayton@...hat.com>
CC:	bfields <bfields@...ldses.org>, Steve Dickson <steved@...hat.com>,
	"Myklebust, Trond" <Trond.Myklebust@...app.com>,
	Joerg Platte <jplatte@...sa.net>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-nfs@...r.kernel.org" <linux-nfs@...r.kernel.org>,
	Hans de Bruin <jmdebruin@...net.nl>
Subject: Re: Kernel 3.4.X NFS server regression

On 06/11/2012 05:29 PM, Jeff Layton wrote:

> On Mon, 11 Jun 2012 16:44:09 +0300
> Boaz Harrosh <bharrosh@...asas.com> wrote:
> 
>> On 06/11/2012 04:32 PM, Boaz Harrosh wrote:
>>
>>> On 06/11/2012 03:39 PM, Jeff Layton wrote:
>>>
>>>>>
>>>>> But I'm guessing we were wrong to assume that existing setups that
>>>>> people perceived as working would have that path, because the failures
>>>>> in the absence of that path were probably less obvious.
>>>>>
>>
>>
>> One more thing, the most important one. We have already fixed this in the
>> past and I was hoping the lesson had been learned. Apparently it was not, and
>> we are doomed to make this mistake forever!!
>>
>> Whatever crap fails, times out, or crashes in the recovery code, we don't
>> give a damn. It should never affect any server-client communication.
>>
>> When the grace period ends, the client gates open, period. *Any* error
>> returned from the state recovery code must be carefully ignored and normal
>> operations resumed. At most, on error, we move into a mode where any
>> recovery request from a client is accepted, since we don't have any better
>> data to verify it.
>>
>> Please comb the recovery code to make sure any catastrophe is safely ignored.
>> We already did that before and it used to work.
>>
> 
> That's not the case, and hasn't ever been AFAICT. The code has changed
> a bit recently, but the existing behavior in this regard was preserved.
> From nfs4_check_open_reclaim:
> 
>         return nfsd4_client_record_check(clp) ? nfserr_reclaim_bad : nfs_ok;
> 
> ...if there is no client record, then the reclaim request fails. Doesn't
> the RFC mandate that?
> 


Regardless of what the RFC mandates and what is returned to the client (which
sounds very unrobust to me), I'm sure the client handles nfserr_reclaim_bad
just fine.

It's the server that's tripping over its own feet and stops responding.
That's what I meant. We should always resume normal operations after
the grace period ends.

I did not see any reports of a client getting into trouble because of an
unexpected nfserr_reclaim_bad, did you?

Thanks
Boaz