Rockbox

  • Status Closed
  • Percent Complete
    100%
  • Task Type Patches
  • Category User Interface
  • Assigned To
    tomers
  • Operating System All players
  • Severity Low
  • Priority Very Low
  • Reported Version Release 3.4
  • Due in Version Undecided
  • Due Date Undecided
  • Votes
  • Private
Attached to Project: Rockbox
Opened by crackerizer - 2009-10-25
Last edited by tomers - 2009-11-24

FS#10720 - diacritic positioning patch

This patch will correct the position of diacritic characters in many languages (hopefully). It is based on my previous thai correction work. By cleaning up the code and adding more languages, this patch is done. It is tested with thai (as i don’t know other languages) and worked good, no problem yet. Please test it with other languages and give me any comments.

The task blocks this from closing
ID Project Summary Priority Severity Assigned To Progress
8491 Rockbox  FS#8491 - Hindi dependent Vowel signs are not rendering properly  Very Low Low
100%
Closed by  tomers
2009-11-24 21:09
Reason for closing:  Accepted
Additional comments about closing:   Warning: Undefined array key "typography" in /home/rockbox/flyspray/plugins/dokuwiki/inc/parserutils.php on line 371 Warning: Undefined array key "camelcase" in /home/rockbox/flyspray/plugins/dokuwiki/inc/parserutils.php on line 407

Committed in revision 23742.

Some coding style comments:
- Use spaces instead of tabs
- Use for loop instead of while in is_diacritic(). Also, no need for ret - just return 1 if found, or 0 at the end.

Also, consider renaming diac_t to diac_range_t, to emphasize it is a range.

Thank you, tomers.

This file is based on your comment. There is a question. Should we include unsupported languages?

There is a question. Should we include unsupported languages?
I think we can include unsupported languages, or at least wrap them with ‘#if 0’.
The overhead for each language is minor, and I can’t imagine any translator of a new language is going to review this code and update the array.

There is an incorrect code in the previous patch. This is the updated patch.

Admin
fg commented on 2009-10-25 12:14

Some more comments:

- typedef struct. See docs/CONTRIBUTING
- instead of const int diac_entries = 108 I’d use sizeof(array)/sizeof(struct) to avoid errors when someone changes the table
- is looping through this array fast enough? Maybe it needs a binary search instead?

Thanks, fg. Here the latest update regarding to fg comment.
ps. i see some codes in svn that still use typedef struct such as in arabicjoin.h.

Update the clean up code.

Why is the array exported? it’s not used in other files. Make it static and remove it from diacritic.h.

We have the ARRAYLEN macro which does the same, maybe you can use this for calculating the length (very minor).

Wouldn’t it be possible to sort the diactric_range array and then have a lookup in log2(array length) instead of array length? (the ranges are disjoint right?)
And maybe even add another (small) array for really common ranges like the “normal” ASCII chars which don’t follow diacritic rules so we don’t loop through the whole array for those.

Edit: clarify second comment.

I have been thinking about making the lookup based on the codepage configuration.
I’m investigating where and how to get the configuration value from the code.
Any suggestion would speed up this process. Thank you

Isn’t the codepage configuration specific to FAT directory and file names? (lots of other info like meta data, the interface or file contents could then be encoded with characters that aren’t in that codepage)

In this patch, The diacritic array is sorted. I also added a test for non-diacritic character to avoid searching.

Nice. You could still speed up the lookup for diacritic characters using a dichotomy instead of just looping through all the values in the array.
In pseudo code it would be something like:
n = ceil(log2(array len))
n /= 2
i = n
while n != 0
n /= 2
if i >= array len || ch < diacritic[i].begin
i -= n
else if ch > diacritic[i].end
i += n
else
return true
return false

Edit: “__” stand for spaces

This algorithms does not take advantages of locality - the high probability that the next char will be of the same language.
I suggest having a static variable to hold a pointer / index of last ‘language’ used (i.e. the first range of the set of ranges that belongs to one language), and start the search from there. Combining this with binary search will be a bit more complicated, though.

I still concern if this array is large enough to implement any complicated algorithm. Any suggestion?

By following tomers’s idea, here is the latest patch. The search should be much faster for the same diacritic char set.

- The sizeof(diacritic)/sizeof(struct diac_range) should be replaced with ARRAYLEN macro (as kugel suggested above), and the macro should be outside the for loop to avoid consequent division operation.

- The prev_index should be renamed to prev_lang, and be updated on each found lang, not on found range. Therefore, a new field should be introduced in struct diac_range, which is a flag that marks ‘first in lang’. Example:
+ {0×0591, 0x05a1, 1}, /* Hebrew */
+ {0x05a3, 0x05bd, 0},
+ {0x05bf, 0x05bf, 0},
+ {0x05c1, 0x05c2, 0},
+ {0x05c4, 0x05c4, 0},

- The logic that uses prev_index is totally flawed at the moment. Starting from prev_index to end of array, and then if failed calling the same function again with prev_index set to zero, thus starting from beginning to end of array is not correct and not elegant.

- I suggest having an array of ranges, which considers off index to hold a range of non-diacric characters, and even index to hold diacric characters:

struct diac_range
{

  unsigned short end;
  bool first_in_lang;

};

const struct diactric_t ranges[] = {

 {0x100, 1}, /* 0x000-0x100: Non-diactric, mark new lang (e.g. English) */
 {0x200, 0}, /* 0x101-0x200: Diactric (still English) */
 {0x250, 0}, /* 0x201-0x250: Non-diactric (still English) */
 {0x260, 1}, /* 0x251-0x260: Diactric, mark new lang (e.g. Hebrew) */
 {0x280, 0}, /* 0x261-0x280: Non-diactric (still Hebrew) */
 {0x285, 0}, /* 0x281-0x285: Diactric (still Hebrew) */
 ...

}

The loop should test (i % 2): If 0 - even, thus non-diactric, otherwise odd, thus diactric.

Admin
fg commented on 2009-10-29 10:25

I don’t think ARRAYLEN() needs to be moved out of the loop. This is a division of two constants, I’d expect the compiler to handle that properly.

We could reconsider doing that for every single case; or we just do the ARRAYLEN() outside always which always works best.

Thank you for suggestions. What i tried to do at the first place, is to avoid messing up with the whole unicode table. But it’s unavoidable now :).
I think updating prev_lang on found lang is more complicated than on found range. if updating on found lang is preferred, the array have to be changed to something like

const struct diacritic_t range[][] = {
{/* Lang 1 */
{0×100, 1},
{0×200, 0}
},
{/* Lang 2 */
{0×250, 0},
{0×300, 1}
},
… }

I still don’t understand why you don’t want a log2(array len) lookup :)

Do you mean binary search? Wouldn’t that involve divisions? I’m thinking doing divisions is worse for this rather tiny array. But I assume it can be done with shifts too.

Well binary lookup as mentioned in my previous post (Wednesday, 28 October 2009) only involves integer divisions by 2 which is »1 so it costs nothing. The speedup would be quite noticeable without a cache. If a cache is included I don’t know if it would really matter but it wouldn’t hurt.

Updated and synced patch.

- Make patch actually compile, by updating SOURCES file
- Move diacritic.h into firmware/include because it is used by font.c
- font_getstringsize() now does not take diacritic characters’ width into account
- Added RTL support (this complicates the code even further)
- Fixed error in diacritic range for the Hebrew language.

Still need to be fixed:
- Proper MRU / binary search mechanism
- Diacritic ranges for all languages need to be reviewed

Thank you tomer. This is a much complete code. I have just finish a full list of diacritic and non-diacritic array. This is its structure:

const unsigned short diacritic[] = { /* Odd index is diacritic character */

  0x02ff, /* 0. Latin and its extended */
  0x036f, /* 1. Combining Diacritical Marks */
  0x0482, /* 2. */
  0x0489, /* 3. Cyrillic */
  ...
  ...

};

Merge with your rtl patch, it would become

struct diac_range
{

  unsigned short end;
  bool is_rtl;

};

const struct diac_range diacritic[] = { /* Odd index is the set of diacritic character */

  {0x02ff, 0}, /* 0. Latin and its extended */
  {0x036f, 0}, /* 1. Combining Diacritical Marks */
  {0x0482, 0}, /* 2. */
  {0x0489, 0}, /* 3. Cyrillic */
  ...
  ...

};

Search algorithm is being coded. This might take time because i’m not available now.
Get back soon.

This patch contains the following:
- Implement MRU mechanism
- Synched diacritic database to The Unicode Standard, Version 5.2
- Move database and its API to a separate file

There nothing else to implement, TMHO.
Also the binary search is not needed any more, since the MRU list it enough.

I would appreciate if other developers can review the code and comment about it so this patch can be committed.

2 questions:

Why do you put the big array in the .h file? that way each file which #includes the file will get its own copy.

We have a mru mechanism/library that is used for font glyph caching (lru.[ch]), can that be reused?

Why do you put the big array in the .h file? that way each file which #includes the file will get its own copy.
My mistake. I’ll revert this change.
We have a mru mechanism/library that is used for font glyph caching (lru.[ch]), can that be reused?
I will try to use this mechanism and if appropriate (I guess it is - I’ve read its code), I’ll post an updated patch.

An opinion, change the SVN header to credit your work. :)

- Revert moving database to a separate file (thanks, kugel)
- Updated copyright notice (thanks, crackerizer)

Still testing use of lru library.

We have a mru mechanism/library that is used for font glyph caching (lru.[ch]), can that be reused?
Using this mechanism is an will bloat the all thing away, increasing bin size, increase RAM usage, and reduce performance.
We’ll need to implement a code whose scope is similar to font_cache.c. This is totally unnecessary.

I think it’s ready to be committed as it is now.

Sometimes it’s better to reuse known-to-work-efficiently code than to reimplement the thing always for different scopes. What would cause bloat and increase exactly? Maybe the lru implemenation isn’t generic enough? Anyways, I was just wondering. Just saying that reusing code aids maintainability and (usually) keeps bloat down.

Sometimes it’s better to reuse known-to-work-efficiently code than to reimplement the thing always for different scopes. What would cause bloat and increase exactly?
> Maybe the LRU implementation isn’t generic enough? Anyways, I was just wondering. Just saying that reusing code aids maintainability and (usually) keeps bloat down.

I suggest committing this patch as it is, with its current MRU caching mechanism, and then try to replace it with the LRU library separately from this task, after research this matter a bit further.
Any objections?

Loading...

Available keyboard shortcuts

Tasklist

Task Details

Task Editing