Dictionary
Note: this is still a work in progress, developers only.
Note2: If you can write an xxx2rdf script in Perl, please do so. The more sources the better.
Todo
| Name | Progress | Description |
| searching | done | Binary search in the dict plugin |
| rdf2binary | done | Conversion from the rockbox dictionary format to the internal binary format |
| wordnet2rdf | done | Conversion from the WordNet format to rockbox dictionary format |
| displaying | 50% | Wrapping lines is done, but we still need scrolling for long descriptions |
| interface | 10% | A better interface for the plugin, like a button for "new search" and things like that. (We now just have an exit button.) |
| xxx2rdf | 0% | Conversion from other formats to the rockbox dictionary format |
| fileformat | 0% | Improving the binary file format, more info below |
The most interesting xxx2rdf would be one for the Dict format, as there are a lot of free dictionaries available in that format.
Creating a dictionary file
1. Download the Prolog version of the WordNet dictionary here: http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz
2. Extract wn_g.pl and wn_s.pl from it.
3. Run "make" in the svn tools directory.
4. Put wn2rdf.pl and the rdf2binary tool from the tools directory in the same directory as wn_g.pl and wn_s.pl, then execute wn2rdf.pl.
5. Execute the rdf2binary tool. It will output dict.desc and dict.index.
6. Copy dict.desc and dict.index to /.rockbox/rocks/apps on the player.
The rockbox dictionary format
The input format for rdf2binary is very simple at the moment: one line per word, starting with the word, then a tab, then the description. The only things to be aware of when creating these files are that the lines must be in alphabetical order and all words must be in lowercase.
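The rules above can be checked mechanically before running rdf2binary. The following is a minimal sketch, not code from the Rockbox tools: the function names are hypothetical, "alphabetical order" is assumed to mean byte-wise order, and repeated words are assumed to be acceptable.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the word part (the text before the first tab).
 * Returns 0 if the line has no tab or the word is empty. */
size_t rdf_word_len(const char *line)
{
    assert(line != NULL);
    const char *tab = strchr(line, '\t');
    return tab ? (size_t)(tab - line) : 0;
}

/* A line is accepted if it has a non-empty word followed by a tab,
 * and the word contains no uppercase ASCII letters. */
int rdf_line_ok(const char *line)
{
    size_t n = rdf_word_len(line);
    if (n == 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (isupper((unsigned char)line[i]))
            return 0;
    return 1;
}

/* Nonzero if the word of `prev` does not sort after the word of `cur`.
 * ASSUMPTION: byte-wise comparison; equal words are allowed. */
int rdf_in_order(const char *prev, const char *cur)
{
    size_t a = rdf_word_len(prev), b = rdf_word_len(cur);
    size_t m = a < b ? a : b;
    int c = memcmp(prev, cur, m);
    if (c != 0)
        return c < 0;
    return a <= b;
}
```

A converter script could run such a check over consecutive lines and stop at the first one that fails.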
The binary format
The binary format used for the index is pretty simple. Each entry is a struct like this one:
struct {
    char word[WORDLEN];  /* the word */
    long offset;         /* where its description starts in dict.desc */
};
WORDLEN is a #define shared by the rdf2binary tool and the plugin. offset is the position in dict.desc where the description of the word is stored.
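Because the entries are sorted, a word can be found with a binary search over the index. The sketch below searches an index that has already been loaded into memory; it is an illustration only, not the plugin's code. The value of WORDLEN, the struct tag index_entry, and the function name find_offset are assumptions, and words are assumed to be NUL-padded inside the fixed-size field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: placeholder value; the real one is defined in
 * rdf2binary and the plugin. */
#define WORDLEN 32

struct index_entry {   /* hypothetical tag for the struct above */
    char word[WORDLEN];
    long offset;
};

/* Binary search over a sorted index.
 * Returns the offset into dict.desc, or -1 if the word is absent. */
long find_offset(const struct index_entry *index, size_t count,
                 const char *word)
{
    assert(word != NULL);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        /* strncmp stops at WORDLEN, so a word that fills the whole
         * field without a terminating NUL is still handled. */
        int c = strncmp(word, index[mid].word, WORDLEN);
        if (c == 0)
            return index[mid].offset;
        if (c < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;
}
```

Note that writing this struct to disk directly makes the file layout depend on the size of long, struct padding, and byte order of the machine that ran the tool.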
The improved binary format
This is still an idea under construction, but the new format would be just 1 file containing:
| Field | Size | Description |
| magic number | 4 bytes | A simple identifier to know if this is a valid file. |
| version | 4 bytes | A format version number, this way we can detect old files and return an error. |
| max_wordlen | 4 bytes | The maximum word length used in this file. |
| wordcount | 4 bytes | The word count. |
After that there should be the index data:
| Field | Size | Description |
| offset | 4 bytes | The offset from the beginning of the file to the description |
| word | max_wordlen | The word, with the length taken from the header. |
And then just plain text description data, one description per line.
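To make the proposed layout concrete, here is a hedged sketch of reading the 16-byte header from a buffer and computing where an index record would start. The byte order is not specified above, so little-endian is assumed; the struct and function names are hypothetical, and no magic value is defined because none has been chosen yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 16  /* four 4-byte fields, as in the table above */

struct dict_header {    /* hypothetical name */
    uint32_t magic;
    uint32_t version;
    uint32_t max_wordlen;
    uint32_t wordcount;
};

/* ASSUMPTION: fields are stored little-endian. */
uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Fills *out from the first 16 bytes of buf.
 * Returns 0 on success, -1 if the buffer is too short. */
int parse_header(const unsigned char *buf, size_t len,
                 struct dict_header *out)
{
    assert(buf != NULL && out != NULL);
    if (len < HEADER_SIZE)
        return -1;
    out->magic       = read_u32_le(buf);
    out->version     = read_u32_le(buf + 4);
    out->max_wordlen = read_u32_le(buf + 8);
    out->wordcount   = read_u32_le(buf + 12);
    return 0;
}

/* File position of index record i: each record is a 4-byte offset
 * followed by max_wordlen bytes of word. */
size_t index_record_pos(const struct dict_header *h, uint32_t i)
{
    return (size_t)HEADER_SIZE + (size_t)i * (4 + (size_t)h->max_wordlen);
}
```

Since every index record has the same size, the binary search used by the current format would still work on this layout.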
The hash binary format
Header:
| Field | Size | Description |
| magic number | 4 bytes | A simple identifier to know if this is a valid file. |
| version | 4 bytes | A format version number, this way we can detect old files and return an error. |
| wordcount | 4 bytes | The word count. |
Offset table:
| Field | Size | Description |
| offset | 4 bytes | The offset from the beginning of the file to a value in the hash table |
Hash table:
| Field | Size | Description |
| name | variable | The word |
| offset | 4 bytes | The offset from the beginning of the file to the description |
When searching for a word with hash X, the plugin looks up the offsets for X and X+1 in the offset table. It reads the data between those two offsets and looks for the word being searched for. It's just a hash table with chaining.
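The scan of one chain can be sketched as follows. This is an illustration under several assumptions that the description above leaves open: the variable-length name is taken to be NUL-terminated, the 4-byte offset is taken to be little-endian, and the function name bucket_find is hypothetical. The hash function itself is not specified, so the sketch starts from the bytes already read between the offsets for X and X+1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: offsets are stored little-endian. */
uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Scans one bucket (the bytes between the offsets for X and X+1).
 * ASSUMPTION: each entry is a NUL-terminated name followed by a
 * 4-byte description offset.
 * Returns the description offset, or -1 if the word is not in the
 * bucket or the bucket is malformed. */
int64_t bucket_find(const unsigned char *bucket, size_t len,
                    const char *word)
{
    assert(bucket != NULL && word != NULL);
    size_t pos = 0;
    while (pos < len) {
        const unsigned char *nul = memchr(bucket + pos, 0, len - pos);
        if (nul == NULL)
            return -1;                    /* name is not terminated */
        size_t off_pos = (size_t)(nul - bucket) + 1;
        if (len - off_pos < 4)
            return -1;                    /* offset field is cut off */
        if (strcmp((const char *)(bucket + pos), word) == 0)
            return (int64_t)read_u32_le(bucket + off_pos);
        pos = off_pos + 4;                /* move to the next entry */
    }
    return -1;
}
```

With a reasonable hash function each bucket stays short, so this linear scan touches only a few entries per lookup.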
Sources for dictionary files
Nearly everything needed for a German<->English single-word translator is available at http://dict.tu-chemnitz.de. They provide the word list under the GNU GPL Version 2 (whatever it means to have versions for licenses. That's totally new to me). See http://ftp.tu-chemnitz.de/pub/Local/urz/ding/de-en/ to download a 6 MB text file of words and words and words.
Those who don't have the tools for compiling their own dictionary files can download them from https://www.rockbox.dreamhosters.com/dict.zip (6.6MB). If you want to download the two parts separately, just get http://www.rockbox.dreamhosters.com/dict.desc (17MB) and http://www.rockbox.dreamhosters.com/dict.index (5.0MB).
--
PeterOlson - 12 Nov 2006
Copyright © by the contributing authors.