Message-ID: <CALQm4jg-f9iBXx8B5jnP0_6t4xqVbLGSP8AXBczLTOfT9kcDyQ@mail.gmail.com>
Date:	Thu, 12 Sep 2013 09:58:29 -0700
From:	Cuong Tran <cuonghuutran@...il.com>
To:	"Sidorov, Andrei" <Andrei.Sidorov@...isi.com>
Cc:	"linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>
Subject: Re: Java Stop-the-World GC stall induced by FS flush or many large
 file deletions

Andrei, regarding core binding: our test program has only one thread,
which appends to the log, and I did not explicitly bind the test to any
core. Even without core binding, would a thread already scheduled on
that core (say, due to NUMA) get stuck until the core is available, or
would the kernel migrate the poor thread?

Regarding journaling, your explanation is very clear. I read that a
record takes up one or more 4 KB buffers, and that the default journal
size is 128 MB regardless of partition size. Thus the journal can hold
up to roughly 32,000 records before it has to checkpoint. Is this
correct?
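
(For my own sanity, the back-of-the-envelope arithmetic, assuming the
best case of one 4 KB buffer per record:

    128 MB / 4 KB = 32,768 buffers

hence roughly 32,000 records before a checkpoint is forced.)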

Thanks again for your help.

--Cuong

On Thu, Sep 12, 2013 at 8:47 AM, Sidorov, Andrei
<Andrei.Sidorov@...isi.com> wrote:
> If there are threads bound to this core, they probably won't be able to
> make progress for the duration of the commit. I don't think the
> scheduler would migrate them to a different core immediately. So, if GC
> wants to talk to these threads, it will block as well.
>
> Of course deleting many small files is much slower than deleting a
> single file of the same total size. The number of records is roughly the
> number of files deleted, split at allocation group boundaries. The group
> size is 128M unless you enabled the bigalloc feature. That is, in the
> simplest case, the release-blocks entry for a 256M file contains one
> record with two groups to free, but the release-blocks entry for 256 1M
> files will contain 256 records that have to be processed separately.
> Before that change, the commit time for removing one 256M file was the
> same as for removing 256 1M files (with respect to the time taken to
> release blocks). After the change, releasing the blocks of a 256M file
> takes 2 “iterations” as opposed to 256.
>
> Basically, there is no such thing as “deleting N blocks”, unless you
> delete a file by progressively truncating it towards zero.
>
> Regards,
> Andrei.
>
> On 12.09.2013 02:08, Cuong Tran wrote:
>> My desktop has 8 cores, including hyperthreading. Thus deleting files
>> would lock up one core, but that should not affect the GC threads if
>> core lock-up is the issue? Would the number of journal records be
>> proportional to the number of blocks deleted, and would deleting N
>> blocks one block at a time thus create N times more journal records
>> than deleting all N blocks in "one shot"?
>>
>> --Cuong
>>
>> On Wed, Sep 11, 2013 at 11:02 PM, Sidorov, Andrei
>> <Andrei.Sidorov@...isi.com> wrote:
>>> It would lock up whichever core jbd2/sdaX runs on. This will usually
>>> happen upon the commit that runs every x seconds, 5 by default (see the
>>> “commit” mount option for ext4). I.e., deleting 5 files one by one with
>>> a 1-second interval in between is basically the same as deleting all of
>>> them “at once”.
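>>>
>>> (For example, shortening the commit interval to 1 second on a live
>>> filesystem would look something like this; the device and mount point
>>> are placeholders:
>>>
>>>     mount -o remount,commit=1 /dev/sdaX /mnt
>>> )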
>>>
>>> Yes, fallocated files are the same with respect to releasing blocks.
>>>
>>> Regards,
>>> Andrei.
>>>
>>> On 12.09.2013 01:45, Cuong Tran wrote:
>>>> Awesome fix, and thanks for the very speedy response. I have some
>>>> questions. We delete files one at a time; would that lock up one core
>>>> or all cores?
>>>>
>>>> And in our test, we use fallocate() without writing to the files. That
>>>> would still cause freeing block-by-block, correct?
>>>> --Cuong
>>>>
>>>> On Wed, Sep 11, 2013 at 10:32 PM, Sidorov, Andrei
>>>> <Andrei.Sidorov@...isi.com> wrote:
>>>>> Hi,
>>>>>
>>>>> Large file deletions are likely to lock up a CPU for seconds if
>>>>> you're running a non-preemptible kernel older than 3.10.
>>>>> Make sure you have this change:
>>>>> http://patchwork.ozlabs.org/patch/232172/ (available in 3.10 if I
>>>>> remember right).
>>>>> Turning on preemption may be a good idea as well.
>>>>>
>>>>> Regards,
>>>>> Andrei.
>>>>>
>>>>> On 12.09.2013 00:18, Cuong Tran wrote:
>>>>>> We have seen GC stalls that are NOT due to the applications' memory
>>>>>> usage.
>>>>>>
>>>>>> The GC log reports the user and system CPU time of the GC threads,
>>>>>> which are almost 0, and the stop-the-world time, which can be
>>>>>> multiple seconds. This indicates that the GC threads are waiting on
>>>>>> IO, even though GC threads should be CPU-bound in user mode.
>>>>>>
>>>>>> We can reproduce the problem using a simple Java program that just
>>>>>> appends to a log file via log4j. If the test runs by itself, it does
>>>>>> not incur any GC stalls. However, if we also run a script that loops,
>>>>>> creating multiple large files via fallocate() and then deleting them,
>>>>>> GC stalls of 1+ seconds happen fairly predictably.
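>>>>>>
>>>>>> For concreteness, here is a rough C sketch of what our create/delete
>>>>>> script does (the file count, sizes, and names are made up for
>>>>>> illustration; the actual test is a shell script):
>>>>>>
>>>>>> #include <fcntl.h>
>>>>>> #include <stdio.h>
>>>>>> #include <string.h>
>>>>>> #include <unistd.h>
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>>     enum { NFILES = 16 };
>>>>>>     const off_t size = 1024L * 1024 * 1024;   /* 1 GiB per file */
>>>>>>     char name[64];
>>>>>>
>>>>>>     for (;;) {
>>>>>>         /* create the files without writing any data */
>>>>>>         for (int i = 0; i < NFILES; i++) {
>>>>>>             snprintf(name, sizeof(name), "big.%d", i);
>>>>>>             int fd = open(name, O_CREAT | O_WRONLY, 0644);
>>>>>>             if (fd < 0)
>>>>>>                 continue;
>>>>>>             int err = posix_fallocate(fd, 0, size);
>>>>>>             if (err)
>>>>>>                 fprintf(stderr, "fallocate: %s\n", strerror(err));
>>>>>>             close(fd);
>>>>>>         }
>>>>>>         /* delete them; the journal commit does the heavy lifting */
>>>>>>         for (int i = 0; i < NFILES; i++) {
>>>>>>             snprintf(name, sizeof(name), "big.%d", i);
>>>>>>             unlink(name);
>>>>>>         }
>>>>>>         sleep(1);
>>>>>>     }
>>>>>> }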
>>>>>>
>>>>>> We can also reproduce the problem by periodically switching the log
>>>>>> and gzipping the older log. The IO device, a single disk drive, is
>>>>>> overloaded by the FS flush when this happens.
>>>>>>
>>>>>> Our guess is that GC has to quiesce its threads, and if one of the
>>>>>> threads is stuck in the kernel (say, in uninterruptible sleep), GC
>>>>>> has to wait until that thread unblocks. In the meantime, it has
>>>>>> already stopped the world.
>>>>>>
>>>>>> Another test that shows a similar problem is doing deferred writes
>>>>>> to append a file. The latency of the deferred writes is usually very
>>>>>> low, but once in a while a write can take more than 1 second.
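>>>>>>
>>>>>> A rough sketch of that timing test (the record size, file name, and
>>>>>> 100 ms threshold are illustrative):
>>>>>>
>>>>>> #include <fcntl.h>
>>>>>> #include <stdio.h>
>>>>>> #include <string.h>
>>>>>> #include <time.h>
>>>>>> #include <unistd.h>
>>>>>>
>>>>>> /* append 4 KB records; report any write() slower than 100 ms */
>>>>>> int main(void)
>>>>>> {
>>>>>>     char buf[4096];
>>>>>>     memset(buf, 'x', sizeof(buf));
>>>>>>
>>>>>>     int fd = open("app.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
>>>>>>     if (fd < 0)
>>>>>>         return 1;
>>>>>>
>>>>>>     for (;;) {
>>>>>>         struct timespec t0, t1;
>>>>>>         clock_gettime(CLOCK_MONOTONIC, &t0);
>>>>>>         write(fd, buf, sizeof(buf));   /* deferred (buffered) write */
>>>>>>         clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>>>
>>>>>>         double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
>>>>>>                     (t1.tv_nsec - t0.tv_nsec) / 1e6;
>>>>>>         if (ms > 100.0)
>>>>>>             printf("slow append: %.1f ms\n", ms);
>>>>>>         usleep(10 * 1000);             /* ~100 appends per second */
>>>>>>     }
>>>>>> }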
>>>>>>
>>>>>> We would really appreciate it if you could shed some light on
>>>>>> possible causes (threads blocked on a journal checkpoint? delayed
>>>>>> allocation unable to proceed?). We can alleviate the problem by
>>>>>> lowering vm.dirty_expire_centisecs and vm.dirty_writeback_centisecs
>>>>>> so that flushing happens more frequently, which evens out the
>>>>>> workload to the disk drive. But we would like to know whether there
>>>>>> is a methodology for modeling the rate of flushing vs. the rate of
>>>>>> changes and the IO throughput of the drive (SAS, 15K RPM).
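>>>>>>
>>>>>> (For reference, these are the knobs we tweak, with illustrative
>>>>>> values; the kernel defaults are 3000 and 500 respectively:
>>>>>>
>>>>>>     vm.dirty_expire_centisecs = 1000
>>>>>>     vm.dirty_writeback_centisecs = 100
>>>>>> )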
>>>>>>
>>>>>> Many thanks.