Date:   Mon, 11 Apr 2022 07:54:47 +0900
From:   asmadeus@...ewreck.org
To:     Christian Schoenebeck <linux_oss@...debyte.com>
Cc:     David Kahurani <k.kahurani@...il.com>, davem@...emloft.net,
        ericvh@...il.com, kuba@...nel.org, linux-kernel@...r.kernel.org,
        lucho@...kov.net, netdev@...r.kernel.org,
        v9fs-developer@...ts.sourceforge.net,
        David Howells <dhowells@...hat.com>, Greg Kurz <groug@...d.org>
Subject: Re: 9p fs-cache tests/benchmark (was: 9p fscache Duplicate cookie
 detected)

Thanks for keeping it up!

Christian Schoenebeck wrote on Sun, Apr 10, 2022 at 06:18:38PM +0200:
> > I used git-bisect to identify the commit that broke 9p behaviour, and it is
> > indeed this one:
> > 
> > commit eb497943fa215897f2f60fd28aa6fe52da27ca6c (HEAD, refs/bisect/bad)
> > Author: David Howells <dhowells@...hat.com>
> > Date:   Tue Nov 2 08:29:55 2021 +0000
> > 
> >     9p: Convert to using the netfs helper lib to do reads and caching

Yes, quite a few things changed with that.

> I looked into the errors I get, and as far as I can see, all the misbehaviours
> I observe boil down to "Bad file descriptor" (EBADF) errors as the
> originating cause.
> 
> The easiest misbehaviours on the guest system for me to look into are errors
> with the git client. For instance 'git fetch origin' fails this way:

FWIW I didn't report it but did try to reproduce: on my machines (tried a
couple) booting a small alpine rootfs over 9p works, and I tried some
git clone/git fetch of varying sizes of local repos (tmpfs in VM -> 9p
mount of tmpfs on host) to no avail.
Perhaps it's backing-filesystem dependent? qemu version? virtfs access options?

It's all extremely slow though... like the final checkout counting files
at less than 10/s
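
For reference, the rough shape of what I tried (paths and qemu options
here are just illustrative, from memory):

  # host: export a tmpfs over virtfs
  mount -t tmpfs tmpfs /srv/9p-export
  qemu-system-x86_64 ... \
      -virtfs local,path=/srv/9p-export,mount_tag=test9p,security_model=mapped-xattr

  # guest: mount it and run the git workload on top
  mount -t 9p -o trans=virtio,version=9p2000.L,cache=loose test9p /mnt
  cd /mnt && git clone /path/to/local/repo && (cd repo && git fetch origin)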

> ...
> write(3, "d16782889ee07005d1f57eb884f4a06b"..., 40) = 40
> write(3, "\n", 1)                       = 1
> close(3)                                = 0
> access(".git/hooks/reference-transaction", X_OK) = -1 ENOENT (No such file or directory)
> openat(AT_FDCWD, ".git/logs/refs/remotes/origin/master", O_WRONLY|O_CREAT|O_APPEND, 0666) = 3
> openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 7
> fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
> fstat(7, {st_mode=S_IFREG|0644, st_size=2326, ...}) = 0
> read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 2326
> lseek(7, -1467, SEEK_CUR)               = 859
> read(7, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\t\0\0\0\t\0\0\0\0"..., 8192) = 1467
> close(7)                                = 0
> write(3, "d8a68c5027ef629d93b9d9519ff4da95"..., 168) = -1 EBADF (Bad file descriptor)
> ...
> error: cannot update the ref 'refs/remotes/origin/master': unable to append to '.git/logs/refs/remotes/origin/master': Bad file descriptor
> 
> I tried to manually replicate those file access operations on that
> .git/logs/refs/remotes/origin/master file in question, and it worked. But when
> I look at the strace output above, I see there is a close(3) call just before
> the subsequent openat(".git/logs/refs/remotes/origin/master") call returning 3,
> which makes me wonder: is this maybe a concurrency issue in file descriptor
> management?

hmm, in the cache=loose case a write should just update the page cache, with
the buffers flushed later, so this is definitely weird.

If you can reproduce this reliably enough, could you first confirm that
the EBADF comes from the client and not from qemu? Either mounting with debug
enabled or getting protocol-level traces out of qemu would show that.
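
Something along these lines should do; the debug mask is from memory
(0x5 should be P9_DEBUG_ERROR|P9_DEBUG_9P, double-check against
include/net/9p/9p.h), and the v9fs_* trace events live in qemu's
hw/9pfs:

  # client side: log 9P messages and errors to dmesg
  mount -t 9p -o trans=virtio,version=9p2000.L,cache=loose,debug=0x5 test9p /mnt

  # qemu side: enable the 9p server trace events (output on stderr)
  qemu-system-x86_64 ... -trace 'v9fs_*'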

If it's local, there are only so many places EBADF can come from, and it
should be possible to trace it back with e.g. perf probe or bpftrace. But
even if we confirm that e.g. the process' fd table is messed up, that won't
tell us why it happened, so it's going to be annoying... I'd really
like to be able to reproduce this somehow :/
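
As a first stab, something like this (untested, and the 9p function
name might need adjusting to your kernel) would at least tell us which
layer returns it:

  # any write(2) failing with EBADF (-9), and the offending process
  bpftrace -e 'tracepoint:syscalls:sys_exit_write /args->ret == -9/
      { printf("%s (pid %d): write() = -EBADF\n", comm, pid); }'

  # is 9p's own write path the one returning it?
  bpftrace -e 'kretprobe:v9fs_file_write_iter /(int64)retval == -9/
      { printf("v9fs_file_write_iter -> -EBADF (pid %d)\n", pid); }'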

-- 
Dominique
