linux-kernel - Re: [PATCH v2] checkpatch: fix false positives in REPEATED

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5121bf7c-a126-6178-62ff-e54f0bb4cb6e@gmail.com>
Date:   Fri, 23 Oct 2020 00:44:59 +0530
From:   Aditya <yashsri421@...il.com>
To:     Joe Perches <joe@...ches.com>
Cc:     linux-kernel@...r.kernel.org,
        linux-kernel-mentees@...ts.linuxfoundation.org,
        lukas.bulwahn@...il.com, dwaipayanray1@...il.com
Subject: Re: [PATCH v2] checkpatch: fix false positives in REPEATED_WORD
 warning

On 22/10/20 9:40 pm, Joe Perches wrote:
> On Thu, 2020-10-22 at 20:20 +0530, Aditya Srivastava wrote:
>> Presence of hexadecimal address or symbol results in false warning
>> message by checkpatch.pl.
> []
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> []
>> @@ -3051,7 +3051,10 @@ sub process {
>>  		}
>>  
>>  # check for repeated words separated by a single space
>> -		if ($rawline =~ /^\+/ || $in_commit_log) {
>> +# avoid false positive from list command eg, '-rw-r--r-- 1 root root'
>> +		if (($rawline =~ /^\+/ || $in_commit_log) &&
>> +		$rawline !~ /[bcCdDlMnpPs\?-][rwxsStT-]{9}/) {
> 
> Alignment and use \b before and after the regex please.

If we use \b either before or after or both it does not match patterns
such as:
+   -rw-r--r--. 1 root root 112K Mar 20 12:16
selinux-policy-3.14.4-48.fc31.noarch.rpm

This is happening probably because it is counting '-' for '\b'
I have not observed any negatives of using this though.

> 
> 		if (($rawline =~ /^\+/ || $in_commit_log) &&
> 		    $rawline !~ /\b[bcCdDlMnpPs\?-][rwxsStT-]{9}\b/) {
>> @@ -3065,6 +3068,34 @@ sub process {
>>  				next if ($first ne $second);
>>  				next if ($first eq 'long');
>>  
>> +				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> +				if ($first =~ /\b[0-9a-f]{2,}/) {
>> +					# if such sequence occurs more than 4, it is most probably part of some of code
>> +					next if ((scalar @hex_seq)>4);
>> +					# for hex occurrences which are less than 4
>> +					# get first hex word in the line
>> +					if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> +						my $post_hex_seq = $';
>> +
>> +						# set suffieciently high default values to avoid ignoring or counting in absence of another
>> +						my $non_hex_char_pos = 1000;
>> +						my $special_chars_pos = 500;
>> +
>> +						if ($post_hex_seq =~ /[g-z]+/) {
>> +							# first non hex character in post_hex_seq
>> +							$non_hex_char_pos = $-[0];
>> +						}
>> +						if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> +							# first occurrence of 2 or more special chars
>> +							$special_chars_pos = $-[0];
>> +						}
> 
> What does all this code actually avoid?
> 
> 

Sir, there are multiple variations of hex for which this warning is
occurring, for eg:
1) 00 c0 06 16 00 00 ff ff 00 93 1c 18 00 00 ff ff  ................
2) ffffffff ffffffff 00000000 c070058c
3)     f5a:       48 c7 44 24 78 ff ff    movq
$0xffffffffffffffff,0x78(%rsp)
4) +    fe fe
5) +    fe fe   - ? end marker ?
6) Code: ff ff 48 (...)

So I first check if the repeated word matches /\b[0-9a-f]{2,}/ . If it
does and occurs as a sequence of such repetitions more than 4(ie more
than or equal to 5), then it is most probably a part of hexadecimal
code. This is implemented here,

+				if ($first =~ /\b[0-9a-f]{2,}/) {
+					# if such sequence occurs more than 4, it is most probably part
of some of code
+					next if ((scalar @hex_seq)>4);

This addresses our issues for warning similar to example (1),(2) and (3).

But still we haven't detected 4,5,6. One can argue that we can modify:

+					next if ((scalar @hex_seq)>4);

with (scalar @hex_seq)>2 or (scalar @hex_seq)>3

but then, we'll not be able to account for warnings such as:

7) +	 * sets this to -1, the slack value will be calculated to be be
halfway
8) + * @seg: index of packet segment whose raw fields are to be be
extracted
9) The data in destination buffer is expected to be be parsed in big
10) +	 *   1. New session or device can'be be created - session sysfs
files

Here I observed that in hex codes, there are atleast 2 special
characters present before any non-hex character, for eg. in (5). Also
generally such occurrences are very rare in writing english, and it is
also helpful in our case.

This is implemented here:

>> +				# avoid repeating hex occurrences like 'ff ff fe 09 ...'
>> +				if ($first =~ /\b[0-9a-f]{2,}/) {
>> +					# if such sequence occurs more than 4, it is most probably
part of some of code
>> +					next if ((scalar @hex_seq)>4);
>> +					# for hex occurrences which are less than 4
>> +					# get first hex word in the line
>> +					if ($rawline =~ /\b[0-9a-f]{2,} /) {
>> +						my $post_hex_seq = $';
>> +
>> +						# set suffieciently high default values to avoid ignoring or
counting in absence of another
>> +						my $non_hex_char_pos = 1000;
>> +						my $special_chars_pos = 500;
>> +
>> +						if ($post_hex_seq =~ /[g-z]+/) {
>> +							# first non hex character in post_hex_seq
>> +							$non_hex_char_pos = $-[0];
>> +						}
>> +						if($post_hex_seq =~ /[^a-zA-Z0-9]{2,}/) {
>> +							# first occurrence of 2 or more special chars
>> +							$special_chars_pos = $-[0];
>> +						}

I have used these two lines for cases like example(4):
+						my $non_hex_char_pos = 1000;
+						my $special_chars_pos = 500;

Here, non-hex characters are missing, thus the default character helps
us to get desired result.
Also, I have set higher values such that if one of them occurs in a
line, the result remain unaffected, than with lower default values.


Thanks
Aditya