linux-kernel - Re: [PATCH] CIFS: Decrease reconnection delay when switching nics

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAH2r5mvw8ZR3sH9LamdpauyCeV7hniRkdE1n9HM7Y4aSWVACpw@mail.gmail.com>
Date:	Thu, 28 Feb 2013 11:45:53 -0600
From:	Steve French <smfrench@...il.com>
To:	Dave Chiluk <dave.chiluk@...onical.com>
Cc:	Jeff Layton <jlayton@...ba.org>,
	"Stefan (metze) Metzmacher" <metze@...ba.org>,
	Steve French <sfrench@...ba.org>, linux-cifs@...r.kernel.org,
	samba-technical@...ts.samba.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] CIFS: Decrease reconnection delay when switching nics

On Thu, Feb 28, 2013 at 11:31 AM, Dave Chiluk <dave.chiluk@...onical.com> wrote:
> On 02/28/2013 10:47 AM, Jeff Layton wrote:
>> On Thu, 28 Feb 2013 10:04:36 -0600
>> Steve French <smfrench@...il.com> wrote:
>>
>>> On Thu, Feb 28, 2013 at 9:26 AM, Jeff Layton <jlayton@...ba.org> wrote:
>>>> On Wed, 27 Feb 2013 16:24:07 -0600
>>>> Dave Chiluk <dave.chiluk@...onical.com> wrote:
>>>>
>>>>> On 02/27/2013 10:34 AM, Jeff Layton wrote:
>>>>>> On Wed, 27 Feb 2013 12:06:14 +0100
>>>>>> "Stefan (metze) Metzmacher" <metze@...ba.org> wrote:
>>>>>>
>>>>>>> Hi Dave,
>>>>>>>
>>>>>>>> When messages are currently in queue awaiting a response, decrease amount of
>>>>>>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>>>>>>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>>>>>>>> seconds) since the last response was recieved.  This does not take into account
>>>>>>>> the fact that messages waiting for a response should be serviced within a
>>>>>>>> reasonable round trip time.
>>>>>>>
>>>>>>> Wouldn't that mean that the client will disconnect a good connection,
>>>>>>> if the server doesn't response within 10 seconds?
>>>>>>> Reads and Writes can take longer than 10 seconds...
>>>>>>>
>>>>>>
>>>>>> Where does this magic value of 10s come from? Note that a slow server
>>>>>> can take *minutes* to respond to writes that are long past the EOF.
>>>>> It comes from the desire to decrease the reconnection delay to something
>>>>> better than a random number between 60 and 120 seconds.  I am not
>>>>> committed to this number, and it is open for discussion.  Additionally
>>>>> if you look closely at the logic it's not 10 seconds per request, but
>>>>> actually when requests have been in flight for more than 10 seconds make
>>>>> sure we've heard from the server in the last 10 seconds.
>>>>>
>>>>> Can you explain more fully your use case of writes that are long past
>>>>> the EOF?  Perhaps with a test-case or script that I can test?  As far as
>>>>> I know writes long past EOF will just result in a sparse file, and
>>>>> return in a reasonable round trip time *(that's at least what I'm seeing
>>>>> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
>>>>> seek=100000, starts receiving responses from the server in about .05
>>>>> seconds with subsequent responses following at roughly .002-.01 second
>>>>> intervals.  This is well within my 10 second value.  Even adding the
>>>>> latency of AT&T's 2g cell network brings it up to only 1s.  Still 10x
>>>>> less than my 10 second value.
>>>>>
>>>>> The new logic goes like this
>>>>> if( we've been expecting a response from the server (in_flight), and
>>>>>  message has been in_flight for more than 10 seconds and
>>>>>  we haven't had any other contact from the server in that time
>>>>>   reconnect
>>>>>
>>>>
>>>> That will break writes long past the EOF. Note too that reconnects on
>>>> CIFS are horrifically expensive and problematic. Much of the state on a
>>>> CIFS mount is tied to the connection. When that drops, open files are
>>>> closed and things like locks are dropped. SMB1 has no real mechanism
>>>> for state recovery, so that can really be a problem.
>>>>
>>>>> On a side note, I discovered a small race condition in the previous
>>>>> logic while working on this, that my new patch also fixes.
>>>>> 1s  request
>>>>> 2s  response
>>>>> 61.995 echo job pops
>>>>> 121.995 echo job pops and sends echo
>>>>> 122 server_unresponsive called.  Finds no response and attempts to
>>>>>        reconnect
>>>>> 122.95 response to echo received
>>>>>
>>>>
>>>> Sure, here's a reproducer. Do this against a windows server, preferably
>>>> one exporting NTFS on relatively slow storage. Make sure that
>>>> "testfile" doesn't exist first:
>>>>
>>>>      $ dd if=/dev/zero of=/path/to/cifs/share/testfile bs=1M count=1 seek=3192
>>>>
>>>> NTFS doesn't support sparse files, so the OS has to zero-fill up to the
>>>> point where you're writing. That can take a looooong time on slow
>>>> storage (minutes even). What we do now is periodically send a SMB echo
>>>> to make sure the server is alive rather than trying to time out a
>>>> particular call.
>>>
>>> Writing past end of file in Windows can be very slow, but note that it
>>> is possible for a windows to set as sparse a file on an NTFS
>>> partition.   Quoting from
>>> http://msdn.microsoft.com/en-us/library/windows/desktop/aa365566%28v=vs.85%29.aspx
>>> Windows NTFS does support sparse files (and we could even send it over
>>> cifs if we want) but it has to be explicitly set by the app on the
>>> file:
>>>
>>> "To determine whether a file system supports sparse files, call the
>>> GetVolumeInformation function and examine the
>>> FILE_SUPPORTS_SPARSE_FILES bit flag returned through the
>>> lpFileSystemFlags parameter.
>>>
>>> Most applications are not aware of sparse files and will not create
>>> sparse files. The fact that an application is reading a sparse file is
>>> transparent to the application. An application that is aware of
>>> sparse-files should determine whether its data set is suitable to be
>>> kept in a sparse file. After that determination is made, the
>>> application must explicitly declare a file as sparse, using the
>>> FSCTL_SET_SPARSE control code."
>>>
>>>
>>
>> That's interesting. I didn't know about the fsctl.
>>
>> It doesn't really help us though. Not all servers support passthrough
>> infolevels, and there are other filesystems (e.g. FAT) that don't
>> support sparse files at all.
>>
>> In any case, the upshot of all of this is that we simply can't assume
>> that we'll get the response to a particular call in any given amount of
>> time, so we have to periodically check that the server is still
>> responding via echoes before giving up on it completely.
>>
>
> I just verified this by running the dd testcase against a windows 7
> server.  I'm going to rewrite my patch to optimise the echo logic as
> Jeff suggested earlier.  The only difference being that, I think we
> should still have regular echos when nothing else is happening, so that
> the connection can be rebuilt when nothing urgent is going on.
>
> It still makes more sense to me that we should be checking the status of
> the tcp socket, and it's underlying nic, but I'm still not completely
> clear on how that could be accomplished.  Any pointers to that regard
> would be appreciated.

It is also worth checking if the witness protocol would help us (even
in a nonclustered environment) because it was designed to allow (at
least for smb3 mounts) a client to tell when a server is up or down


-- 
Thanks,

Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/