Rockbox

Tasklist

FS#10030 - latest nat sort fix still sorts wrong

Attached to Project: Rockbox
Opened by bryan vandyke (bryan) - Tuesday, 17 March 2009, 19:54 GMT
Last edited by Steve Bavin (pondlife) - Tuesday, 06 September 2011, 16:31 GMT
Task Type Bugs
Category User Interface
Status Unconfirmed   Reopened
Assigned To Thomas Martitz (kugel.)
Operating System All players
Severity Low
Priority Normal
Reported Version Version 3.1
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Votes 0
Private No

Details

In the never ending saga of strnatcmp the fix for 10029 sorts as
001
002
01
02

Attached is yet another attempt to get it right.
This task depends upon

Comment by Paul Louden (Llorean) - Tuesday, 17 March 2009, 20:08 GMT
Could you post it as a patch so we can see what is changed more readily without having to diff manually?
Comment by bryan vandyke (bryan) - Tuesday, 17 March 2009, 21:01 GMT
patch a is against 20339
patch b is against 20345

Comment by Thomas Martitz (kugel.) - Tuesday, 17 March 2009, 21:13 GMT
So you're counting the zeros? I'm thinking that could be done easier.

Anyway, I'm going to close again. We thought of special casing "real" leading zeros (those at the very beginning of the string, not those within the string when the string before is the same for both strings). But we decided on following the original implementation for now. It gives consistent results. It gives consistent behavior with other software using the algorithm.

And I think sorting
001
002
01
02
isn't a bug afterall. There's a good chance that this is intention. Also, imagine this are decimal numbers.
1.001
1.002
1.01
1.02
It would be even more correct for this case.
Or this case:
007_tomorrow-never-dies
24_S02E11
This sorting makes more sense to me too.

We've discussed this deeply, see discussion starting here: http://www.rockbox.org/irc/log-20090317#17:53:35
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 17:31 GMT
Due to the latest mailing list discussion, I've reopened this task.
As of now, the general consensus seems that this task is a bug, and should be fixed.

Thus, I'm attaching a patch which resolves this issue.

Before looping through the chars, strnatcmp will go through leading zeros, increasing the index. After that is finished, the index is decreased by 1 if there the current char is a non-digit, so that the leading zero counts for non-digits.

This leads to (strng -> value for sorting):
00 -> 0
0b -> 0b
01 -> 1
1 -> 1

which is the behavior of nautilus and windows explorer too.

It also fixes  FS#10031 , by using tolower instead of toupper for case-insenstive sorting. tolower is used by strcmp too, so that strnatcmp sorts chars between 'Z' and 'a' like strcmp does.
Comment by bryan vandyke (bryan) - Thursday, 19 March 2009, 18:02 GMT
Here's something I've been working on. It's based on the pre-revert version. It only deals with leading zeros. It does nothing with spaces.

I had to introduce a bias variable, since there's some kind of inconsistency on how the nat comp and strcmp handle the number of leading zeroes. For one file it favored more zeroes and in another it would favor less zeroes.

Also there's ifdef alternative code for a version that removes leading zeroes for number and letters. so [00, 1, 0b, 002] sorts as [00, 1, 002, 0b] instead of [00, 0b, 1, 002]

Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 18:05 GMT
Uhm, I think my version does pretty good, have you tried it?
Comment by bryan vandyke (bryan) - Thursday, 19 March 2009, 18:05 GMT
Oops bad line corrected version
Comment by bryan vandyke (bryan) - Thursday, 19 March 2009, 18:13 GMT
Your's works fine.

Try it on files named.

000.txt
00.txt
001.txt
01.txt

You should see the inconsistency I ran into;

sorts as

00.txt
000.txt
001.txt
01.txt
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 18:16 GMT
You're removing compare_left again, which should break fractional numbers like 1.001.
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 18:25 GMT
Well, this is due to the strcmp fallback.
Comment by bryan vandyke (bryan) - Thursday, 19 March 2009, 19:01 GMT
Sorry, I wasn't try to remove compare_left again or replace existing code. I was only trying to illustrate a different take. Basically, a run of numbers is compared as a integer, everything else is a separator. I used pre-revert code for this since it's closer to that assumption and easier to illustrate. I didn't need two functions since all number compares are treated the same.

For my code 1.001 is treated as 1[seperator]1 not One thousand and One or 1 point 1/1000th. Or whatever.

Personally, I think this is the easiest to implement and easiest to explain to non techie users.


As for the strcmp fallback, Yeah I know that's what causing it. I pointed that out in a previous comment. But, you just know somebody is going to file that as a bug, so I thought I'd give you a heads up. :D
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 19:05 GMT
Then try 1.002 and 1.01 as filenames.
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 19:30 GMT
So, your patch breaks decimal numbers (or any number after a non-digit separator).

The inconsitency you noted is only when the series of digits is the same. They sorted correctly if the series is different. (different by the means of: 001.txt and 01.txt are different, 000.txt and 00.txt are not).
I rather not break decimal numbers. And your inconsistency is at least consistently appearing, so my wishes to fix that are limited (and your version breaks more than it fixes).
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 19:54 GMT
Small update, no functional chage.
Comment by bryan vandyke (bryan) - Thursday, 19 March 2009, 19:56 GMT
I'm not sure what you're trying to get at. They sort differently as to be expected. Different assumptions, different code.

Mine
1.01 -> 1[sep]1
1.002 -> 1[sep]2

Yours and current
1.002 -> 1 and 2/1000
1.01 -> 1 and 10/1000

If your saying 1.002 should be sorted first because it less than 1.01 because it can be interpreted as a real number. Mine is wrong, current is right.

If those are version numbers both are right and both are wrong. .01 can be interpreted as a 1 or a 10. Depending on who you ask.

If your using another countries rules they could be interpreted as 1.01 and 1002. In which case mine is wrong, but sorts right for the wrong reasons and current code is wrong.

I apologize if I'm misunderstanding you. This is obviously an easy subject for that to happen. Just look at the irc and dev list. But trying to argue what 1.002 vs 1.01 'means' just leads down a path where nobody can agree and everybody is frustrated.

I'm not suggesting my patch be included in RB. I was trying to implement the minimal version of 'sort by number' and figured somebody would be interested in looking at it. Not deal with spaces, not guess what somebody meant by add leading zeros, underscore, or whatever. Just if it's 0-9 treat it as a number.
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 20:00 GMT
I interpret it as "real number". That's (IIUC) what the discussion said should be done. Anything serves as a separator, so 1a002 < 1a01 is correct too, so it's not locale dependent.

This is also wanted for discnumber.tracknumber. 1.1 < 1.10, as well as 1.01 < 1.10, that's how it should sort discumber.tracknumber filenames.
Comment by Alexander Levin (fml2) - Thursday, 19 March 2009, 22:03 GMT
>I interpret it as "real number"

I thought we didn't want to interpret real numbers because their interpretation depends on what char is used as a decimal point, which is country dependent. Hence I think that the logic to interpret "1.002" as "1 sep 2" is to be used.
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 22:06 GMT
And I repeatedly said, that this isn't dependent on locales, since any char acts as separator.

Maybe I missed it, but where did we agree on intentionally breaking decimal numbers? I heavily disagree on that.
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 22:09 GMT
Not to mention, that this would be against doing what the major filebrowser do again.
Comment by Daniel Stenberg (bagder) - Thursday, 19 March 2009, 22:11 GMT
I'm strongly against trying to interpret floats. The separator _is_ locale dependent and you cannot code your way around that. Accepting "any" character just makes it worse.

A number should be a sequence of digits. Period.
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 22:25 GMT
It's not about interpreting floats. It's about treating any number within the string (i.e. with non-digits before) as fractional (this of course includes floats also).

Anyway, we agreed on mimic'ing major browsers, and I just double checked windows explorer and nautilus. And they do NOT interpret floats, i.e. they sort: 1.1 < 1.03 < 1.4 < 1.007. So, I guess I'm ok with it.

edit: ls -lv though interprets floats.
Comment by Dominik Riebeling (bluebrother) - Thursday, 19 March 2009, 22:33 GMT
I might have missed something, but the discussion on the mailing list mostly was about treating leading zeros. If you consider any string a separator, how would files like "2_episode5", "2_episode_10" get sorted? I'm completely with Daniel here.
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 22:35 GMT
2_episode5" < "2_episode_10", because 5 comes before _.

Anyway, that's why I asked again. I may have missed it, because the discussion was almost entirely about leading zeros at the beginning of the string.

If we want to follow major browsers, we do not interpret fractional numbers.
Comment by Dominik Riebeling (bluebrother) - Thursday, 19 March 2009, 22:38 GMT
argh, unintentionally put a stray _ in my filename examples. Question was intended to be "2_episode5", "2_episode10". I think that should have been clear in the first place but well.

Handling stuff other than leading zeros wasn't discussed in detail, so I guess there is another discussion coming up. Though I guess it will be shorter.
Comment by Thomas Martitz (kugel.) - Thursday, 19 March 2009, 22:41 GMT
It would still sort 2_episode5 < 2_episode10. There's no float involved. strnatcmp identifies floats/fractional by a leading zero (leading for this particular series of digits, not for the whole string).
Comment by Jonathan Gordon (jdgordon) - Monday, 21 December 2009, 05:57 GMT
fix or close this...

Loading...