Message-ID: <72e16f18-d4ae-f963-fd09-5f1fa6885a1d@cambridgegreys.com>
Date: Fri, 26 Feb 2021 15:40:13 +0000
From: Anton Ivanov <anton.ivanov@...bridgegreys.com>
To: Timo Rothenpieler <timo@...henpieler.org>,
Bruce Fields <bfields@...ldses.org>
Cc: Salvatore Bonaccorso <carnil@...ian.org>,
Chuck Lever <chuck.lever@...cle.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
"940821@...s.debian.org" <940821@...s.debian.org>,
Linux NFS Mailing List <linux-nfs@...r.kernel.org>,
trond.myklebust@...merspace.com, anna.schumaker@...app.com
Subject: Re: NFS Caching broken in 4.19.37
On 26/02/2021 15:03, Timo Rothenpieler wrote:
> I think I can reproduce this, or something that at least looks very
> similar to this, on 5.10. Namely on 5.10.17 (On both Client and Server).
I think this is a different issue - see below.
>
> We are running slurm, and for a while now (coincides with updating
> from 5.4 to 5.10, but a whole bunch of other stuff was updated at the
> same time, so it took me a while to correlate this) the logs it writes
> have been truncated, but only while they're being observed on the
> client, using tail -f or something like that.
>
> Looks like this then:
>
> On Server:
>> store01 /srv/export/home/users/timo/TestRun # ls -l slurm-41101.out
>> -rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
>> store01 /srv/export/home/users/timo/TestRun # wc -l slurm-41101.out
>> 61 slurm-41101.out
>
> On Client:
>> timo@...in01 ~/TestRun $ ls -l slurm-41101.out
>> -rw-r--r-- 1 timo timo 1931 Feb 26 15:46 slurm-41101.out
>> timo@...in01 ~/TestRun $ wc -l slurm-41101.out
>> 24 slurm-41101.out
>
> See https://gist.github.com/BtbN/b9eb4fc08ccc53bb20087bce0bf9f826 for
> the respective file-contents.
>
> If I run the same test job, wait until it's done, and then look at its
> slurm.out file, it matches between NFS Client and Server.
> If I tail -f the slurm.out on an NFS client, the file stops getting
> updated on the client, but keeps getting more logs written to it on
> the NFS server.
>
> The slurm.out file is being written to by another NFS client, which is
> running on one of the compute nodes of the system. It's being read
> from a login node.
If these are two different clients, then what you see is possible on NFS
with client-side caching. If you have multiple clients reading/writing
to the same files you usually need to tune the caching options and/or
use locking. I suspect that if you leave it for a while (until the cache
expires) it will sort itself out.
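To illustrate the locking route, here is a minimal sketch of a reader that
takes a shared advisory lock before reading. It relies on an assumption I
have not verified on your setup: that the Linux NFS client revalidates its
cached data for a file when a lock is acquired on it. The function name
read_with_shared_lock is made up for this example. The caching route would
instead mean adjusting mount options such as actimeo, noac or lookupcache
(see nfs(5)).

```python
import fcntl


def read_with_shared_lock(path):
    """Return the file's bytes, read while holding a shared advisory lock."""
    with open(path, "rb") as f:
        # Assumption: on an NFS mount, acquiring the lock prompts the client
        # to revalidate cached data for this file before the read below.
        fcntl.lockf(f, fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            # Release the lock explicitly; closing the file would also drop it.
            fcntl.lockf(f, fcntl.LOCK_UN)
```

A log viewer on the login node could call this in a polling loop instead of
tail -f. It only helps if locking actually works on the mount (for example,
not mounted with nolock), so treat it as something to try rather than a
confirmed fix.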
In my test case there is just one client. It missed a file deletion, and
nothing short of an unmount and remount fixes that. I have waited for 30+
minutes; it does not seem to refresh or expire. I also see the opposite
pattern across versions: the bug shows up on 4.x up to at least 5.4, and
I do not see it on 5.10.
Brgds,
>
>
>
>
> Timo
>
>
> On 21.02.2021 16:53, Anton Ivanov wrote:
>> Client side. This seems to be an entirely client side issue.
>>
>> A variety of kernels on the clients starting from 4.9 and up to 5.10
>> using 4.19 servers. I have observed it on a 4.9 client versus 4.9
>> server earlier.
>>
>> 4.9 fails, 4.19 fails, 5.2 fails, 5.4 fails, 5.10 works.
>>
>> At present the server is at 4.19.67 in all tests.
>>
>> Linux jain 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2
>> (2019-11-11) x86_64 GNU/Linux
>>
>> I can set-up a couple of alternative servers during the week, but so
>> far everything is pointing towards a client fs cache issue, not a
>> server one.
>>
>> Brgds,
>>
>
>
--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/