This is the bug/patch tracker for Rockbox. Click here for more information.
Quick links: Bugs · Patches · Rockbox frontpage
FS#10720 - diacritic positioning patch
Attached to Project:
Rockbox
Opened by Phinitnun Chanasabaeng (crackerizer) - Sunday, 25 October 2009, 10:20 GMT+2
Last edited by Tomer Shalev (tomers) - Tuesday, 24 November 2009, 22:09 GMT+2
Opened by Phinitnun Chanasabaeng (crackerizer) - Sunday, 25 October 2009, 10:20 GMT+2
Last edited by Tomer Shalev (tomers) - Tuesday, 24 November 2009, 22:09 GMT+2
|
DetailsThis patch will correct the position of diacritic characters in many languages (hopefully). It is based on my previous thai correction work. By cleaning up the code and adding more languages, this patch is done. It is tested with thai (as i don't know other languages) and worked good, no problem yet. Please test it with other languages and give me any comments.
|
This task blocks these from closing
FS#8491 - Hindi dependent Vowel signs are not rendering properly
Closed by Tomer Shalev (tomers)
Tuesday, 24 November 2009, 22:09 GMT+2
Reason for closing: Accepted
Additional comments about closing: Committed in revision 23742.
Tuesday, 24 November 2009, 22:09 GMT+2
Reason for closing: Accepted
Additional comments about closing: Committed in revision 23742.
- Use spaces instead of tabs
- Use for loop instead of while in is_diacritic(). Also, no need for ret - just return 1 if found, or 0 at the end.
This file is based on your comment. There is a question. Should we include unsupported languages?
I think we can include unsupported languages, or at least wrap them with '#if 0'.
The overhead for each language is minor, and I can't imagine any translator of a new language is going to review this code and update the array.
- typedef struct. See docs/CONTRIBUTING
- instead of const int diac_entries = 108 I'd use sizeof(array)/sizeof(struct) to avoid errors when someone changes the table
- is looping through this array fast enough? Maybe it needs a binary search instead?
ps. i see some codes in svn that still use typedef struct such as in arabicjoin.h.
We have the ARRAYLEN macro which does the same, maybe you can use this for calculating the length (very minor).
And maybe even add another (small) array for really common ranges like the "normal" ASCII chars which don't follow diacritic rules so we don't loop through the whole array for those.
Edit: clarify second comment.
I'm investigating where and how to get the configuration value from the code.
Any suggestion would speed up this process. Thank you
In pseudo code it would be something like:
n = ceil(log2(array len))
n /= 2
i = n
while n != 0
__n /= 2
__if i >= array len || ch < diacritic[i].begin
____i -= n
__else if ch > diacritic[i].end
____i += n
__else
____return true
return false
Edit: "__" stand for spaces
I suggest having a static variable to hold a pointer / index of last 'language' used (i.e. the first range of the set of ranges that belongs to one language), and start the search from there. Combining this with binary search will be a bit more complicated, though.
- The prev_index should be renamed to prev_lang, and be updated on each found lang, not on found range. Therefore, a new field should be introduced in struct diac_range, which is a flag that marks 'first in lang'. Example:
+ {0x0591, 0x05a1, 1}, /* Hebrew */
+ {0x05a3, 0x05bd, 0},
+ {0x05bf, 0x05bf, 0},
+ {0x05c1, 0x05c2, 0},
+ {0x05c4, 0x05c4, 0},
- The logic that uses prev_index is totally flawed at the moment. Starting from prev_index to end of array, and then if failed calling the same function again with prev_index set to zero, thus starting from beginning to end of array is not correct and not elegant.
- I suggest having an array of ranges, which considers off index to hold a range of non-diacric characters, and even index to hold diacric characters:
struct diac_range
{
unsigned short end;
bool first_in_lang;
};
const struct diactric_t ranges[] = {
{0x100, 1}, /* 0x000-0x100: Non-diactric, mark new lang (e.g. English) */
{0x200, 0}, /* 0x101-0x200: Diactric (still English) */
{0x250, 0}, /* 0x201-0x250: Non-diactric (still English) */
{0x260, 1}, /* 0x251-0x260: Diactric, mark new lang (e.g. Hebrew) */
{0x280, 0}, /* 0x261-0x280: Non-diactric (still Hebrew) */
{0x285, 0}, /* 0x281-0x285: Diactric (still Hebrew) */
...
}
The loop should test (i % 2): If 0 - even, thus non-diactric, otherwise odd, thus diactric.
I think updating prev_lang on found lang is more complicated than on found range. if updating on found lang is preferred, the array have to be changed to something like
const struct diacritic_t range[][] = {
{/* Lang 1 */
{0x100, 1},
{0x200, 0}
},
{/* Lang 2 */
{0x250, 0},
{0x300, 1}
},
...
}
- Make patch actually compile, by updating SOURCES file
- Move diacritic.h into firmware/include because it is used by font.c
- font_getstringsize() now does not take diacritic characters' width into account
- Added RTL support (this complicates the code even further)
- Fixed error in diacritic range for the Hebrew language.
Still need to be fixed:
- Proper MRU / binary search mechanism
- Diacritic ranges for all languages need to be reviewed
const unsigned short diacritic[] = { /* Odd index is diacritic character */
0x02ff, /* 0. Latin and its extended */
0x036f, /* 1. Combining Diacritical Marks */
0x0482, /* 2. */
0x0489, /* 3. Cyrillic */
...
...
};
Merge with your rtl patch, it would become
struct diac_range
{
unsigned short end;
bool is_rtl;
};
const struct diac_range diacritic[] = { /* Odd index is the set of diacritic character */
{0x02ff, 0}, /* 0. Latin and its extended */
{0x036f, 0}, /* 1. Combining Diacritical Marks */
{0x0482, 0}, /* 2. */
{0x0489, 0}, /* 3. Cyrillic */
...
...
};
Search algorithm is being coded. This might take time because i'm not available now.
Get back soon.
- Implement MRU mechanism
- Synched diacritic database to The Unicode Standard, Version 5.2
- Move database and its API to a separate file
There nothing else to implement, TMHO.
Also the binary search is not needed any more, since the MRU list it enough.
I would appreciate if other developers can review the code and comment about it so this patch can be committed.
Why do you put the big array in the .h file? that way each file which #includes the file will get its own copy.
We have a mru mechanism/library that is used for font glyph caching (lru.[ch]), can that be reused?
My mistake. I'll revert this change.
> We have a mru mechanism/library that is used for font glyph caching (lru.[ch]), can that be reused?
I will try to use this mechanism and if appropriate (I guess it is - I've read its code), I'll post an updated patch.
- Updated copyright notice (thanks, crackerizer)
Still testing use of lru library.
Using this mechanism is an will bloat the all thing away, increasing bin size, increase RAM usage, and reduce performance.
We'll need to implement a code whose scope is similar to font_cache.c. This is totally unnecessary.
I think it's ready to be committed as it is now.
> Maybe the LRU implementation isn't generic enough? Anyways, I was just wondering. Just saying that reusing code aids maintainability and (usually) keeps bloat down.
I suggest committing this patch as it is, with its current MRU caching mechanism, and then try to replace it with the LRU library separately from this task, after research this matter a bit further.
Any objections?