lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9cc97eac-ce0b-4492-08f7-32b3884e24d7@linux.alibaba.com>
Date:   Mon, 23 Aug 2021 09:42:17 +0800
From:   Joseph Qi <joseph.qi@...ux.alibaba.com>
To:     Eric Whitney <enwlinux@...il.com>
Cc:     Jeffle Xu <jefflexu@...ux.alibaba.com>, tytso@....edu,
        adilger.kernel@...ger.ca, linux-ext4@...r.kernel.org
Subject: Re: [PATCH] ext4: fix reserved space counter leakage



On 8/23/21 5:52 AM, Eric Whitney wrote:
> * Joseph Qi <joseph.qi@...ux.alibaba.com>:
>>
>>
>> On 8/22/21 9:06 PM, Joseph Qi wrote:
>>>
>>>
>>> On 8/21/21 12:45 AM, Eric Whitney wrote:
>>>> * Jeffle Xu <jefflexu@...ux.alibaba.com>:
>>>>> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM,
>>>>> previously reserved space is not released as the error handling,
>>>>> in which case @s_dirtyclusters_counter is left over. Since this delayed
>>>>> extent failes to be inserted into extent status tree, when inode is
>>>>> written back, the extra @s_dirtyclusters_counter won't be subtracted and
>>>>> remains there forever.
>>>>>
>>>>> This can leads to /sys/fs/ext4/<dev>/delayed_allocation_blocks remains
>>>>> non-zero even when syncfs is executed on the filesystem.
>>>>>
>>>>
>>>> Hi:
>>>>
>>>> I think the fix below looks fine.  However, this comment doesn't look right
>>>> to me.  Are you really seeing delayed_allocation_blocks values that remain
>>>> incorrectly elevated across last closes (or across file system unmounts and
>>>> remounts)?  s_dirtyclusters_counter isn't written out to stable storage -
>>>> it's an in-memory only variable that's created when a file is first opened
>>>> and destroyed on last close.
>>>>
>>>
>>> Actually we've encountered a real case in our production environment,
>>> which has about 20G space lost (df - du = ~20G).
>>> After some investigation, we've confirmed that it cause by leaked
>>> s_dirtyclusters_counter (~5M), and even we do manually sync, it remains.
>>> Since there is no error messages, we've checked all logic around
>>> s_dirtyclusters_counter and found this. Also we can manually inject
>>> error and reproduce the leaked s_dirtyclusters_counter.
>>>
> 
> Sure - as I noted, the fix looks good - I agree that you could see inaccurate
> s_dirtyclusters_counter (and i_reserved_data_blocks) values.  This is a good
> catch and a good fix.  It's the comment I find misleading / inaccurate, and
> I'd like to see that improved for the sake of developers reading commit
> histories in the future.
> 
> Also, Gao Xiang's idea of checking i_reserved_data_blocks in the inode evict
> path sounds good to me - I'd considered doing that in the past but never
> actually did it.
> 
>>
>> BTW, it's a runtime lost, but not about on-disk.
>> If umount and then mount it again, it becomes normal. But
>> application also should be restarted...
> 
> And this is where the comment could use a little help.  "when inode is
> written back, the extra @s_dirtyclusters_counter won't be subtracted and
> remains there forever" suggests to me that s_dirtyclusters_counter is
> being persisted on stable storage.  But as you note, simply umounting and
> remounting the filesystem clears up the problem.  (And in my rush to get
> my feedback out earlier I incorrectly stated that s_dirtyclusters_counter
> would get created and destroyed on first open and last close - that's
> i_reserved_data_blocks, of course.)
> 
> So, in order to speed things along, please allow me to suggest some edits
> for the commit comment:
> 
> When ext4_insert_delayed block receives and recovers from an error from
> ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the
> space it has reserved for that block insertion as it should.  One effect
> of this bug is that s_dirtyclusters_counter is not decremented and remains
> incorrectly elevated until the file system has been unmounted.  This can
> result in premature ENOSPC returns and apparent loss of free space.
> 
> Another effect of this bug is that /sys/fs/ext4/<dev>/delayed_allocation_blocks
> can remain non-zero even after syncfs has been executed on the filesystem.
> 
> Does that sound right?
> 
Yes, looks better. Thanks for your comments.
We'll update the commit log in v2.

Thanks,
Joseph

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ