FS#9830 - UTF-8 encoded CUE file is parsed wrongly

Attached to Project: Rockbox
Opened by Prikolchik (prikolchik) - Monday, 26 January 2009, 03:01 GMT
Last edited by Jonathan Gordon (jdgordon) - Monday, 26 January 2009, 11:24 GMT
Task Type Bugs
Category Font/charset
Status Closed
Assigned To No-one
Operating System All players
Severity Low
Priority Normal
Reported Version Version 3.1
Due in Version Undecided
Due Date Undecided
Percent Complete 100%
Votes 0
Private No


When using CUEsheet file that has been encoded with UTF-8 Rockbox does not parse it correctly. All of the information that is non-ASCII is displayed as gibberish (track names and performer names in my case).

I'm only referring to UTF-8 CUE file problem here. ASCII CUE file (with correct codepage settings "Cyrillic (CP 1252)" ) works without issues.

I have looked at the Rockbox source code and my very basic knowledge of C tells me that method from /app/misc.c:
/** Open a UTF-8 file and set file descriptor to first byte after BOM.
* If no BOM is present this behaves like open().
* If the file is opened for writing and O_TRUNC is set, write a BOM to
* the opened file and leave the file pointer set after the BOM.
#define BOM "\xef\xbb\xbf"
#define BOM_SIZE 3

int open_utf8(const char* pathname, int flags)
int fd;
unsigned char bom[BOM_SIZE];

fd = open(pathname, flags);
if(fd < 0)
return fd;

if(flags & (O_TRUNC | O_WRONLY))
write(fd, BOM, BOM_SIZE);
read(fd, bom, BOM_SIZE);
/* check for BOM */
if(memcmp(bom, BOM, BOM_SIZE))
lseek(fd, 0, SEEK_SET);
return fd;
DOES NOT properly handle UTF-8 files. It just ignores UTF-8 BOM (\xef\xbb\xbf) and still treats file as ASCII (instead of UTF-8).

This may be a desired behaviour (which I doubt), but even if it is, it is NOT handled correctly when parsing CUE sheet in the method from /apps/cuesheet.c (my comments start with ///*):

[code]/* parse cuesheet "file" and store the information in "cue" */
bool parse_cuesheet(char *file, struct cuesheet *cue)
char line[MAX_PATH];
char *s;

DEBUGF("cue parse\n");
int fd = open_utf8(file,O_RDONLY); ///* <===== here open_utf8() is called */
if (fd < 0)
/* couln't open the file */
return false;

/* Initialization */
memset(cue, 0, sizeof(struct cuesheet));
strcpy(cue->path, file);
cue->curr_track = cue->tracks;

while ( read_line(fd,line,MAX_PATH) && cue->track_count < MAX_TRACKS ) ///* <== here it reads one line at a time from cue file we opened with open_utf8() */

///*skipped some code */

///* skipped more code */
return true;

So in the end, CUE file with Russian track names encoded in ASCII looks like this (right on the photo), and same CUE file encoded in UTF-8 gives gibberish (left on photo). And if UTF-8 was handled correctly UTF-8 CUE file would give image on right.

The attached is photo I just talked about and both of CUE files I used.

To reproduce bug you must have my cue named "test.cue" and any Wavpack file larger than 2 minutes named "test.wv" in same folder. Also enable CUE sheet support in Settings if it is not enabled. (you can use files other than wavpack by editing test.cue in Notepad and changing test.wv to test.flac, test.ape, etc).

This seems like a serious bug to me and it would be great if a developer could have a look at it.

If there is ANY clarification needed please ask.
This task depends upon

Closed by  Jonathan Gordon (jdgordon)
Monday, 26 January 2009, 11:24 GMT
Reason for closing:  Fixed
Additional comments about closing:  fixed in r19858.
Comment by Thomas Martitz (kugel.) - Monday, 26 January 2009, 09:53 GMT
Did you try changing to default codepage in Rockbox to UTF8?
Comment by Robert Kukla (roolku) - Monday, 26 January 2009, 09:55 GMT
The cuesheet uses the default code page in Rockbox (Settings/General Settings/Display/Codepage). So if you set this to UTF-8 it should work. The quoted code (utf8_open) was introduced later as part of a global change to skip the BOM for all file reads (r18185).

It may be desirable to override the codepage setting if a BOM is found, although I seem to recall that UTF-8 is not allowed for cuesheets. I'll have a look later.
Comment by Prikolchik (prikolchik) - Monday, 26 January 2009, 10:16 GMT
Yup, I found that setting codepage to UTF8 works, but that messes up CUE files encoded in ASCII. It is a choice between the two.
For the heck of I have tried changing both methods so that now if UTF8 BOM is found CUE is parsed in UTF8 (and not in default encoding) and if BOM is not found, it is parsed in default codepage. So basically UTF8 CUE files are displayed correctly (providing there is BOM) no matter what is default codepage. Exactly what I wanted!

I do not know CUE file standards, but to me it seems that changing codepage every time you play an album with CUE sheet in a different language is a little too much, when I could just save it to UTF8 and forget about it. It would be nice to see it as an extra option you can enable in settings.
Comment by Jonathan Gordon (jdgordon) - Monday, 26 January 2009, 10:46 GMT
with a bit of help from linuxstb I have attached a patch which should fix the problem... roolku, I see you most recently fiddled with cue in this area so its up to you what to do...

btw, the changes at the end are not really related to the bug report but seem to be needed... its only using 1/3 of the available buffer currently.
Comment by Robert Kukla (roolku) - Monday, 26 January 2009, 11:13 GMT
Thanks, that is exactly what I would have done. I am at work now, but will commit tonight unless you want to do it now.