FS#9830 - UTF-8 encoded CUE file is parsed wrongly
Opened by Prikolchik (prikolchik) - Monday, 26 January 2009, 03:01 GMT
Last edited by Jonathan Gordon (jdgordon) - Monday, 26 January 2009, 11:24 GMT
|
DetailsWhen using CUEsheet file that has been encoded with UTF-8 Rockbox does not parse it correctly. All of the information that is non-ASCII is displayed as gibberish (track names and performer names in my case).
I'm only referring to UTF-8 CUE file problem here. ASCII CUE file (with correct codepage settings "Cyrillic (CP 1252)" ) works without issues. I have looked at the Rockbox source code and my very basic knowledge of C tells me that method from /app/misc.c: [code] /** Open a UTF-8 file and set file descriptor to first byte after BOM. * If no BOM is present this behaves like open(). * If the file is opened for writing and O_TRUNC is set, write a BOM to * the opened file and leave the file pointer set after the BOM. */ #define BOM "\xef\xbb\xbf" #define BOM_SIZE 3 int open_utf8(const char* pathname, int flags) { int fd; unsigned char bom[BOM_SIZE]; fd = open(pathname, flags); if(fd < 0) return fd; if(flags & (O_TRUNC | O_WRONLY)) { write(fd, BOM, BOM_SIZE); } else { read(fd, bom, BOM_SIZE); /* check for BOM */ if(memcmp(bom, BOM, BOM_SIZE)) lseek(fd, 0, SEEK_SET); } return fd; }[/code] DOES NOT properly handle UTF-8 files. It just ignores UTF-8 BOM (\xef\xbb\xbf) and still treats file as ASCII (instead of UTF-8). This may be a desired behaviour (which I doubt), but even if it is, it is NOT handled correctly when parsing CUE sheet in the method from /apps/cuesheet.c (my comments start with ///*): [code]/* parse cuesheet "file" and store the information in "cue" */ bool parse_cuesheet(char *file, struct cuesheet *cue) { char line[MAX_PATH]; char *s; DEBUGF("cue parse\n"); int fd = open_utf8(file,O_RDONLY); ///* <===== here open_utf8() is called */ if (fd < 0) { /* couln't open the file */ return false; } /* Initialization */ memset(cue, 0, sizeof(struct cuesheet)); strcpy(cue->path, file); cue->curr_track = cue->tracks; while ( read_line(fd,line,MAX_PATH) && cue->track_count < MAX_TRACKS ) ///* <== here it reads one line at a time from cue file we opened with open_utf8() */ { ///*skipped some code */ } close(fd); ///* skipped more code */ return true; }[/code] So in the end, CUE file with Russian track names encoded in ASCII looks like this (right on the photo), and same CUE file encoded in UTF-8 gives gibberish (left on photo). And if UTF-8 was handled correctly UTF-8 CUE file would give image on right. The attached is photo I just talked about and both of CUE files I used. To reproduce bug you must have my cue named "test.cue" and any Wavpack file larger than 2 minutes named "test.wv" in same folder. Also enable CUE sheet support in Settings if it is not enabled. (you can use files other than wavpack by editing test.cue in Notepad and changing test.wv to test.flac, test.ape, etc). This seems like a serious bug to me and it would be great if a developer could have a look at it. If there is ANY clarification needed please ask. |
Monday, 26 January 2009, 11:24 GMT
Reason for closing: Fixed
Additional comments about closing: fixed in r19858.
It may be desirable to override the codepage setting if a BOM is found, although I seem to recall that UTF-8 is not allowed for cuesheets. I'll have a look later.
For the heck of I have tried changing both methods so that now if UTF8 BOM is found CUE is parsed in UTF8 (and not in default encoding) and if BOM is not found, it is parsed in default codepage. So basically UTF8 CUE files are displayed correctly (providing there is BOM) no matter what is default codepage. Exactly what I wanted!
I do not know CUE file standards, but to me it seems that changing codepage every time you play an album with CUE sheet in a different language is a little too much, when I could just save it to UTF8 and forget about it. It would be nice to see it as an extra option you can enable in settings.
btw, the changes at the end are not really related to the bug report but seem to be needed... its only using 1/3 of the available buffer currently.