linux-kernel - Re: Regression in NFS probably due to very large amounts of readahead

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <6777d050-99a2-4f3c-b398-4b4271c427d5@gmail.com>
Date: Tue, 26 Nov 2024 13:49:09 +0100
From: Anders Blomdell <anders.blomdell@...il.com>
To: Jan Kara <jack@...e.cz>
Cc: Philippe Troin <phil@...i.org>,
 "Matthew Wilcox (Oracle)" <willy@...radead.org>,
 Andrew Morton <akpm@...ux-foundation.org>, linux-fsdevel@...r.kernel.org,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: Regression in NFS probably due to very large amounts of readahead



On 2024-11-26 11:37, Jan Kara wrote:
> On Tue 26-11-24 09:01:35, Anders Blomdell wrote:
>> On 2024-11-26 02:48, Philippe Troin wrote:
>>> On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote:
>>>> When we (re)started one of our servers with 6.11.3-200.fc40.x86_64,
>>>> we got terrible performance (lots of nfs: server x.x.x.x not
>>>> responding).
>>>> What triggered this problem was virtual machines with NFS-mounted
>>>> qcow2 disks
>>>> that often triggered large readaheads that generates long streaks of
>>>> disk I/O
>>>> of 150-600 MB/s (4 ordinary HDD's) that filled up the buffer/cache
>>>> area of the
>>>> machine.
>>>>
>>>> A git bisect gave the following suspect:
>>>>
>>>> git bisect start
>>>
>>> 8< snip >8
>>>
>>>> # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25]
>>>> readahead: properly shorten readahead when falling back to
>>>> do_page_cache_ra()
>>>
>>> Thank you for taking the time to bisect, this issue has been bugging
>>> me, but it's been non-deterministic, and hence hard to bisect.
>>>
>>> I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in
>>> slightly different setups:
>>>
>>> (1) On machines mounting NFSv3 shared drives. The symptom here is a
>>> "nfs server XXX not responding, still trying" that never recovers
>>> (while the server remains pingable and other NFSv3 volumes from the
>>> hanging server can be mounted).
>>>
>>> (2) On VMs running over qemu-kvm, I see very long stalls (can be up to
>>> several minutes) on random I/O. These stalls eventually recover.
>>>
>>> I've built a 6.11.10 kernel with
>>> 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to
>>> normal (no more NFS hangs, no more VM stalls).
>>>
>> Some printk debugging, seems to indicate that the problem
>> is that the entity 'ra->size - (index - start)' goes
>> negative, which then gets cast to a very large unsigned
>> 'nr_to_read' when calling 'do_page_cache_ra'. Where the true
>> bug is still eludes me, though.
> 
> Thanks for the report, bisection and debugging! I think I see what's going
> on. read_pages() can go and reduce ra->size when ->readahead() callback
> failed to read all folios prepared for reading and apparently that's what
> happens with NFS and what can lead to negative argument to
> do_page_cache_ra(). Now at this point I'm of the opinion that updating
> ra->size / ra->async_size does more harm than good (because those values
> show *desired* readahead to happen, not exact number of pages read),
> furthermore it is problematic because ra can be shared by multiple
> processes and so updates are inherently racy. If we indeed need to store
> number of read pages, we could do it through ractl which is call-site local
> and used for communication between readahead generic functions and callers.
> But I have to do some more history digging and code reading to understand
> what is using this logic in read_pages().
> 
> 								Honza
Good, look forward to a quick revert, and don't forget to CC GKH, so I get kernels recent  that work ASAP.

Regards

/Anders