lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 09 May 2008 07:22:46 +0200
From:	Jesper Krogh <jesper@...gh.cc>
To:	linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: Many open/close on same files yeilds "No such file or directory".

Hi.

A week has passed the problem still persist and I have done a fair 
amount of testing.

I have now been able to reproduce it on 2 different servers, a 
dual-core,8 cpu Sun X4600 amd64 and a 2 cpu single-core Sun V20Z.

It has been reproduced against 4 different diskarrays one of them being
the IDE-SCSI(We have 2 of them) array mentioned before. The other one is
an iSCSI SAN. And lately when trying to move of the "trouble systems" we
hit the bug on a FC-SAN.

I belive that we can rule out hardware now.

I have both reproduced it on "old" ext3 volumes and freshly created
ones.

I havent been able to reproduce it on the internal system disks of
the servers, but thinking a bit more about the setup it seems that
the filesystem need to have a significant load on the
filesystem-structures (not data) to exploit the bug. The typical
environment where I found it was serving (the same) ~5GB of files of the
NFS-server to 48 dual-core machines, this bacically never hits the
actual disk (with 32GB of ram in the machine). In this setup
a single process could hit the bug on the server (the problem could
also be seen over NFS from the clients).

It is worth keeping in mind that the problem is "rare". Thus all
applications only opening and reading a single file every 10 minutes
or so probably never hits it.

Altering the test-script so it just tries again to pick up the file,
it succeeds on the second pass.

 From my perspective this defininately has something to do with how the
OS caches/invalidates the filesystem structures it has in memory (or
somewhere near that).

I still haven't been able to produce a small pack-and-shit test that
knocks the problem out on every system. But please come with suggestions
about how to write a piece of code that tries to hit the problem from
my descriptions.

My feeling is that the script below may reveal the bug on any "busy" 
volume, where busy is lots of activity in the OS-cache of the volume,
not on the actual drives.

All suggestions are welcome.

I have tried 4 different kernel versions:
2.6.20, 2.6.22, 2.6.24, 2.6.25.2

Jesper

Jesper Krogh wrote:
> Hi list.
> 
> I have a "fairly" reproducible problem. When a program opens and closes
> the same file many times, it eventually ends up with a "no such file or
> directory". Test program that can reproduce the problem on my setup:
> 
> root@...t:~# cat test-file-c.c
> #include <stdlib.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
> 
> int main(int argc, char *argv[]) {
>    unsigned long i=0;
>    int fh;
>    char *filename;
> 
>    filename=argv[1];
> 
>    while(1) {
>       fh=open(filename, O_RDONLY);
>       if (fh==-1) {
>          printf("Failed to open %s\n", filename);
>          printf("Open number: %ld\n",i);
>          exit(10);
>       }
>       close(fh);
>       i++;
>    }
> 
>    exit(0);
> }
> root@...t:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
> Failed to open /z/bio/databases/online/index/index-by-accno
> Open number: 61785000
> root@...t:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
> Failed to open /z/bio/databases/online/index/index-by-accno
> Open number: 120929685
> (The problem is not isolate to a single file on the filesystem).
> 
> 
> strace on the program reviel that the system indeed return a "No such
> file or directory" to the program.
> 
> This is run on an Ubuntu Gutsy (vendor kernel): 2.6.22-14-server on an
> 4.5TB ext3 filesystem on an LVM volume. The volume was created on a
> dapper (2 releases back) and has just followed with during upgrades.
> I cannot reproduce it on other disks attached to the same server or on
> other servers attached to similar disksystems.
> 
> The filesystem was taken offline yesterday for a forced fsck and it was
> found to be clean.
> 
> The diskarray is a quite old Fibrenetix FX1200 with 12xPATA disk
> in raid5 (with hotspare) exposed to the OS as 3 SCSI-disks of
> 2+2+0.5TB assembled with LVM afterwards. The SCSI-controller is a:
> 05:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X 
> Fusion-MPT Dual Ultra320 SCSI (rev c1)
> 
> What suggestions do you have to solve this problem?
> 
> I'm about to mkfs.ext3 the volume and spool it back in from the backup,
> but somehow I'm not convinced that it will solve the problem at all.
> It may just be a hardware problem, but dmesg doesnt tell anything.
> 
> We actually got the problem from a perl-script, but this seems to be the
> minimal program that reproduces the problem.
> 
> Jesper

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ