lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200310221406.GO10776@dread.disaster.area>
Date:   Wed, 11 Mar 2020 09:14:06 +1100
From:   Dave Chinner <david@...morbit.com>
To:     "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@...ia.com>
Cc:     "darrick.wong@...cle.com" <darrick.wong@...cle.com>,
        "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "hch@....de" <hch@....de>
Subject: Re: 5.5 XFS getdents regression?

On Tue, Mar 10, 2020 at 08:45:58AM +0000, Rantala, Tommi T. (Nokia - FI/Espoo) wrote:
> Hello,
> 
> One of my GitLab CI jobs stopped working after upgrading server 5.4.18-
> 100.fc30.x86_64 -> 5.5.7-100.fc30.x86_64.
> (tested 5.5.8-100.fc30.x86_64 too, no change)
> The server is fedora30 with XFS rootfs.
> The problem reproduces always, and takes only couple minutes to run.
> 
> The CI job fails in the beginning when doing "git clean" in docker
> container, and failing to rmdir some directory:
> "warning: failed to remove 
> .vendor/pkg/mod/golang.org/x/net@...0.0-20200114155413-6afb5195e5aa/intern
> al/socket: Directory not empty"
> 
> Quick google search finds some other people reporting similar problems
> with 5.5.0:
> https://gitlab.com/gitlab-org/gitlab-runner/issues/3185

Which appears to be caused by multiple gitlab processes modifying
the directory at the same time. i.e. something is adding an entry to
the directory at the same time something is trying to rm -rf it.
That's a race condition, and would lead to the exact symptoms you
see here, depending on where in the directory the new entry is
added.

> Collected some data with strace, and it seems that getdents is not
> returning all entries:
> 
> 5.4 getdents64() returns 52+50+1+0 entries 
> => all files in directory are deleted and rmdir() is OK
> 
> 5.5 getdents64() returns 52+50+0+0 entries
> => rmdir() fails with ENOTEMPTY

Yup, that's a classic userspace TOCTOU race.

Remember, getdents() is effectively a sequential walk through the
directory data - subsequent calls start at the offset (cookie) where
the previous one left off. New entries can be added between
getdents() syscalls.

If that new entry is put at the tail of the directory, then the last
getdents() call will return that entry rather than none because it
was placed at an offset in the directory that the getdents() sweep
has not yet reached, and hence will be found by a future getdents()
call in the sweep.


However, if there is a hole in the directory structure before the
current getdents cookie offset, a new entry can be added in that
hole. i.e. at an offset in the directory that getdents has already
passed over. That dirent will never be reported by the current
getdents() sequence - a directory rewind and re-read is required to
find it. i.e. there's an inherent userspace TOUTOC race condition in
'rm -rf' operations.

IOWs, this is exactly what you'd expect to see when there are
concurrent userspace modifications to a directory that is currently
being read. Hence you need to rule out an application and userspace
level issues before looking for filesystem level problems.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ