Rockbox

Tasklist

FS#10720 - diacritic positioning patch

Attached to Project: Rockbox
Opened by Phinitnun Chanasabaeng (crackerizer) - Sunday, 25 October 2009, 09:20 GMT
Last edited by Tomer Shalev (tomers) - Tuesday, 24 November 2009, 21:09 GMT
Task Type Patches
Category User Interface
Status Closed
Assigned To Tomer Shalev (tomers)
Operating System All players
Severity Low
Priority Normal
Reported Version Release 3.4
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No

Details

This patch will correct the position of diacritic characters in many languages (hopefully). It is based on my previous thai correction work. By cleaning up the code and adding more languages, this patch is done. It is tested with thai (as i don't know other languages) and worked good, no problem yet. Please test it with other languages and give me any comments.
This task depends upon

View Dependency Graph

Closed by  Tomer Shalev (tomers)
Tuesday, 24 November 2009, 21:09 GMT
Reason for closing:  Accepted
Additional comments about closing:  Committed in revision 23742.
Comment by Tomer Shalev (tomers) - Sunday, 25 October 2009, 09:37 GMT
Some coding style comments:
- Use spaces instead of tabs
- Use for loop instead of while in is_diacritic(). Also, no need for ret - just return 1 if found, or 0 at the end.

Comment by Tomer Shalev (tomers) - Sunday, 25 October 2009, 09:40 GMT
Also, consider renaming diac_t to diac_range_t, to emphasize it is a range.
Comment by Phinitnun Chanasabaeng (crackerizer) - Sunday, 25 October 2009, 11:53 GMT
Thank you, tomers.

This file is based on your comment. There is a question. Should we include unsupported languages?
Comment by Tomer Shalev (tomers) - Sunday, 25 October 2009, 11:59 GMT
> There is a question. Should we include unsupported languages?
I think we can include unsupported languages, or at least wrap them with '#if 0'.
The overhead for each language is minor, and I can't imagine any translator of a new language is going to review this code and update the array.
Comment by Phinitnun Chanasabaeng (crackerizer) - Sunday, 25 October 2009, 12:03 GMT
There is an incorrect code in the previous patch. This is the updated patch.
Comment by Frank Gevaerts (fg) - Sunday, 25 October 2009, 12:14 GMT
Some more comments:

- typedef struct. See docs/CONTRIBUTING
- instead of const int diac_entries = 108 I'd use sizeof(array)/sizeof(struct) to avoid errors when someone changes the table
- is looping through this array fast enough? Maybe it needs a binary search instead?
Comment by Phinitnun Chanasabaeng (crackerizer) - Sunday, 25 October 2009, 12:38 GMT
Thanks, fg. Here the latest update regarding to fg comment.
ps. i see some codes in svn that still use typedef struct such as in arabicjoin.h.
Comment by Phinitnun Chanasabaeng (crackerizer) - Sunday, 25 October 2009, 12:50 GMT
Update the clean up code.
Comment by Thomas Martitz (kugel.) - Sunday, 25 October 2009, 18:40 GMT
Why is the array exported? it's not used in other files. Make it static and remove it from diacritic.h.

We have the ARRAYLEN macro which does the same, maybe you can use this for calculating the length (very minor).
Comment by Antoine Cellerier (dionoea) - Monday, 26 October 2009, 15:09 GMT
Wouldn't it be possible to sort the diactric_range array and then have a lookup in log2(array length) instead of array length? (the ranges are disjoint right?)
And maybe even add another (small) array for really common ranges like the "normal" ASCII chars which don't follow diacritic rules so we don't loop through the whole array for those.

Edit: clarify second comment.
Comment by Phinitnun Chanasabaeng (crackerizer) - Tuesday, 27 October 2009, 07:23 GMT
I have been thinking about making the lookup based on the codepage configuration.
I'm investigating where and how to get the configuration value from the code.
Any suggestion would speed up this process. Thank you
Comment by Antoine Cellerier (dionoea) - Tuesday, 27 October 2009, 12:17 GMT
Isn't the codepage configuration specific to FAT directory and file names? (lots of other info like meta data, the interface or file contents could then be encoded with characters that aren't in that codepage)
Comment by Phinitnun Chanasabaeng (crackerizer) - Wednesday, 28 October 2009, 09:34 GMT
In this patch, The diacritic array is sorted. I also added a test for non-diacritic character to avoid searching.
Comment by Antoine Cellerier (dionoea) - Wednesday, 28 October 2009, 11:57 GMT
Nice. You could still speed up the lookup for diacritic characters using a dichotomy instead of just looping through all the values in the array.
In pseudo code it would be something like:
n = ceil(log2(array len))
n /= 2
i = n
while n != 0
__n /= 2
__if i >= array len || ch < diacritic[i].begin
____i -= n
__else if ch > diacritic[i].end
____i += n
__else
____return true
return false

Edit: "__" stand for spaces
Comment by Tomer Shalev (tomers) - Wednesday, 28 October 2009, 15:10 GMT
This algorithms does not take advantages of locality - the high probability that the next char will be of the same language.
I suggest having a static variable to hold a pointer / index of last 'language' used (i.e. the first range of the set of ranges that belongs to one language), and start the search from there. Combining this with binary search will be a bit more complicated, though.
Comment by Phinitnun Chanasabaeng (crackerizer) - Wednesday, 28 October 2009, 15:21 GMT
I still concern if this array is large enough to implement any complicated algorithm. Any suggestion?
Comment by Phinitnun Chanasabaeng (crackerizer) - Thursday, 29 October 2009, 08:33 GMT
By following tomers's idea, here is the latest patch. The search should be much faster for the same diacritic char set.
Comment by Tomer Shalev (tomers) - Thursday, 29 October 2009, 10:12 GMT
- The sizeof(diacritic)/sizeof(struct diac_range) should be replaced with ARRAYLEN macro (as kugel suggested above), and the macro should be outside the for loop to avoid consequent division operation.

- The prev_index should be renamed to prev_lang, and be updated on each found lang, not on found range. Therefore, a new field should be introduced in struct diac_range, which is a flag that marks 'first in lang'. Example:
+ {0x0591, 0x05a1, 1}, /* Hebrew */
+ {0x05a3, 0x05bd, 0},
+ {0x05bf, 0x05bf, 0},
+ {0x05c1, 0x05c2, 0},
+ {0x05c4, 0x05c4, 0},

- The logic that uses prev_index is totally flawed at the moment. Starting from prev_index to end of array, and then if failed calling the same function again with prev_index set to zero, thus starting from beginning to end of array is not correct and not elegant.

- I suggest having an array of ranges, which considers off index to hold a range of non-diacric characters, and even index to hold diacric characters:

struct diac_range
{
unsigned short end;
bool first_in_lang;
};


const struct diactric_t ranges[] = {
{0x100, 1}, /* 0x000-0x100: Non-diactric, mark new lang (e.g. English) */
{0x200, 0}, /* 0x101-0x200: Diactric (still English) */
{0x250, 0}, /* 0x201-0x250: Non-diactric (still English) */
{0x260, 1}, /* 0x251-0x260: Diactric, mark new lang (e.g. Hebrew) */
{0x280, 0}, /* 0x261-0x280: Non-diactric (still Hebrew) */
{0x285, 0}, /* 0x281-0x285: Diactric (still Hebrew) */
...
}

The loop should test (i % 2): If 0 - even, thus non-diactric, otherwise odd, thus diactric.
Comment by Frank Gevaerts (fg) - Thursday, 29 October 2009, 10:25 GMT
I don't think ARRAYLEN() needs to be moved out of the loop. This is a division of two constants, I'd expect the compiler to handle that properly.
Comment by Thomas Martitz (kugel.) - Thursday, 29 October 2009, 11:05 GMT
We could reconsider doing that for every single case; or we just do the ARRAYLEN() outside always which always works best.
Comment by Phinitnun Chanasabaeng (crackerizer) - Thursday, 29 October 2009, 14:58 GMT
Thank you for suggestions. What i tried to do at the first place, is to avoid messing up with the whole unicode table. But it's unavoidable now :).
I think updating prev_lang on found lang is more complicated than on found range. if updating on found lang is preferred, the array have to be changed to something like

const struct diacritic_t range[][] = {
{/* Lang 1 */
{0x100, 1},
{0x200, 0}
},
{/* Lang 2 */
{0x250, 0},
{0x300, 1}
},
...
}
Comment by Antoine Cellerier (dionoea) - Friday, 30 October 2009, 07:35 GMT
I still don't understand why you don't want a log2(array len) lookup :)
Comment by Thomas Martitz (kugel.) - Friday, 30 October 2009, 10:40 GMT
Do you mean binary search? Wouldn't that involve divisions? I'm thinking doing divisions is worse for this rather tiny array. But I assume it can be done with shifts too.
Comment by Antoine Cellerier (dionoea) - Friday, 30 October 2009, 14:22 GMT
Well binary lookup as mentioned in my previous post (Wednesday, 28 October 2009) only involves integer divisions by 2 which is >>1 so it costs nothing. The speedup would be quite noticeable without a cache. If a cache is included I don't know if it would really matter but it wouldn't hurt.
Comment by Tomer Shalev (tomers) - Monday, 16 November 2009, 22:31 GMT
Updated and synced patch.

- Make patch actually compile, by updating SOURCES file
- Move diacritic.h into firmware/include because it is used by font.c
- font_getstringsize() now does not take diacritic characters' width into account
- Added RTL support (this complicates the code even further)
- Fixed error in diacritic range for the Hebrew language.

Still need to be fixed:
- Proper MRU / binary search mechanism
- Diacritic ranges for all languages need to be reviewed
Comment by Phinitnun Chanasabaeng (crackerizer) - Wednesday, 18 November 2009, 14:11 GMT
Thank you tomer. This is a much complete code. I have just finish a full list of diacritic and non-diacritic array. This is its structure:

const unsigned short diacritic[] = { /* Odd index is diacritic character */
0x02ff, /* 0. Latin and its extended */
0x036f, /* 1. Combining Diacritical Marks */
0x0482, /* 2. */
0x0489, /* 3. Cyrillic */
...
...
};

Merge with your rtl patch, it would become

struct diac_range
{
unsigned short end;
bool is_rtl;
};

const struct diac_range diacritic[] = { /* Odd index is the set of diacritic character */
{0x02ff, 0}, /* 0. Latin and its extended */
{0x036f, 0}, /* 1. Combining Diacritical Marks */
{0x0482, 0}, /* 2. */
{0x0489, 0}, /* 3. Cyrillic */
...
...
};

Search algorithm is being coded. This might take time because i'm not available now.
Get back soon.
Comment by Tomer Shalev (tomers) - Saturday, 21 November 2009, 16:08 GMT
This patch contains the following:
- Implement MRU mechanism
- Synched diacritic database to The Unicode Standard, Version 5.2
- Move database and its API to a separate file

There nothing else to implement, TMHO.
Also the binary search is not needed any more, since the MRU list it enough.

I would appreciate if other developers can review the code and comment about it so this patch can be committed.
Comment by Thomas Martitz (kugel.) - Saturday, 21 November 2009, 16:15 GMT
2 questions:

Why do you put the big array in the .h file? that way each file which #includes the file will get its own copy.

We have a mru mechanism/library that is used for font glyph caching (lru.[ch]), can that be reused?
Comment by Tomer Shalev (tomers) - Sunday, 22 November 2009, 06:15 GMT
> Why do you put the big array in the .h file? that way each file which #includes the file will get its own copy.
My mistake. I'll revert this change.

> We have a mru mechanism/library that is used for font glyph caching (lru.[ch]), can that be reused?
I will try to use this mechanism and if appropriate (I guess it is - I've read its code), I'll post an updated patch.
Comment by Phinitnun Chanasabaeng (crackerizer) - Sunday, 22 November 2009, 07:46 GMT
An opinion, change the SVN header to credit your work. :)
Comment by Tomer Shalev (tomers) - Sunday, 22 November 2009, 18:00 GMT
- Revert moving database to a separate file (thanks, kugel)
- Updated copyright notice (thanks, crackerizer)

Still testing use of lru library.
Comment by Tomer Shalev (tomers) - Monday, 23 November 2009, 19:19 GMT
> We have a mru mechanism/library that is used for font glyph caching (lru.[ch]), can that be reused?
Using this mechanism is an will bloat the all thing away, increasing bin size, increase RAM usage, and reduce performance.
We'll need to implement a code whose scope is similar to font_cache.c. This is totally unnecessary.

I think it's ready to be committed as it is now.
Comment by Thomas Martitz (kugel.) - Monday, 23 November 2009, 20:22 GMT
Sometimes it's better to reuse known-to-work-efficiently code than to reimplement the thing always for different scopes. What would cause bloat and increase exactly? Maybe the lru implemenation isn't generic enough? Anyways, I was just wondering. Just saying that reusing code aids maintainability and (usually) keeps bloat down.
Comment by Tomer Shalev (tomers) - Tuesday, 24 November 2009, 06:57 GMT
> Sometimes it's better to reuse known-to-work-efficiently code than to reimplement the thing always for different scopes. What would cause bloat and increase exactly?
> Maybe the LRU implementation isn't generic enough? Anyways, I was just wondering. Just saying that reusing code aids maintainability and (usually) keeps bloat down.

I suggest committing this patch as it is, with its current MRU caching mechanism, and then try to replace it with the LRU library separately from this task, after research this matter a bit further.
Any objections?

Loading...