FS#4755 - Wikipedia

Attached to Project: Rockbox
Opened by Anonymous Submitter - Wednesday, 01 March 2006, 03:03 GMT+1
Last edited by Steve Bavin (pondlife) - Wednesday, 01 August 2007, 17:46 GMT+1
Task Type Patches
Category Plugins
Status Unconfirmed
Assigned To No-one
Player type All players
Severity Low
Priority Normal
Reported Version current build
Due in Version Undecided
Due Date Undecided
Percent Complete 0%
Private No

Details

As already discussed in a thread [1] at Misticriver, it would be great to have a version of Wikipedia on Rockbox.

Only a plugin to easily search articles would be needed.
Converted dumps are already available, for example the one from the iPodLinux project [2].

[1] http://www.misticriver.net/showthread.php?t=36924
[2] http://ipodlinux.org/Wikipedia

Comment by Alexandre Flament (flament) - Sunday, 05 March 2006, 11:26 GMT+1
Comment by Will Robertson (aliask) - Tuesday, 07 March 2006, 12:33 GMT+1
As far as I can tell, the script on the iPod linux page you linked just converts the files into plain text, which you could open in rockbox anyway - give it a shot.
Comment by Wiki River (wikiriver) - Tuesday, 26 September 2006, 19:49 GMT+1
If you look at the Misticriver thread and the "Rockipedia" wiki, you'll find information on a plugin of this kind being created.

Hope this helps.


Misticriver thread: http://misticriver.net/showthread.php?t=36924
Plugin wiki: http://rockipedia.techmight.com
Comment by Frederik (freqmod) - Friday, 27 October 2006, 00:23 GMT+1
I have created a plugin to look up Wikipedia articles. It takes about 5 seconds to find an article (exact match) on an iPod, and it uses B-trees for the index and gzip compression for the articles (about 2.5 GB, split into 1 GB files, for the English Wikipedia as retrieved a week ago).

I will release the source soon (before 5.nov) when it is a little more finished.
Comment by Izzy (bro2baseball) - Sunday, 29 October 2006, 04:31 GMT+1
wow! that sounds awesome freqmod! i can't wait!
Comment by Frederik (freqmod) - Saturday, 04 November 2006, 12:15 GMT+1
Haven't made any more progress, but it works. The archiving functions are split out, but the viewer in mww.c and the parser in wmconv.rb could use a rewrite.
Comment by Izzy (bro2baseball) - Sunday, 05 November 2006, 00:02 GMT+1
Sorry for being such a noob, but how do i use this?

is that just the source? will there be a patch soon?

great work, btw. i can't wait to use it

bro2
Comment by Izzy (bro2baseball) - Sunday, 05 November 2006, 00:06 GMT+1
woops... i found the readme file.

i'll try to figure it out from that. though i don't really understand most of it :(

thanks again!

bro2
Comment by Adam Gashlin (AdamGashlin) - Sunday, 05 November 2006, 15:10 GMT+1
Brilliant freqmod! I'm now testing it on an iPod photo, reading the article about Stephen Hawking while listening to MC Hawking. Are you intending to work on rendering the articles?
Comment by Frederik (freqmod) - Sunday, 05 November 2006, 15:58 GMT+1
I won't work on rendering now (if someone wants to I will certainly help), but maybe after a while. I second the release of connell's source so I (and others) have a startpoint for the renderer.
Comment by Joe (rockboxer1) - Sunday, 05 November 2006, 17:58 GMT+1
I patched the rockbox build fine, but I don't know how to dump the wiki.

Are the directions in your readme for linux users?

When it says in your Readme

---run ./Make.sh
run bzcat <wikimedia dump> | ruby mwxmldumpparse.rb /dev/stdin <prepared dump prefix>--- etc...

What does run mean? How do I run it? I have to install Ruby, GCC, and bzip2? Any way to make this a little bit more friendly to newer users :)?

Your work is awesome btw. I've been waiting for this for a long time!

rckbxr
Comment by Frederik (freqmod) - Sunday, 05 November 2006, 22:40 GMT+1
OK, if you don't want to install Unix, you could install Ruby from http://rubyforge.org/frs/download.php/12751/ruby185-21.exe and compile the C apps in Cygwin/MinGW, where you compile Rockbox (and upload the binaries here). The bzip2 file could be decompressed ( http://www.bzip.org/downloads.html ) before it is passed to Ruby, if you have enough disk space.
Comment by Adam Gashlin (AdamGashlin) - Monday, 06 November 2006, 18:56 GMT+1
I've rewritten the viewer, it is very basic but working (I was having a lot of trouble with stuff dropping out or cutting off with the earlier one). I'll be adding link following support later today if I can get to it.
   mww.c (9.3 KiB)
Comment by Adam Gashlin (AdamGashlin) - Monday, 06 November 2006, 19:43 GMT+1
The characters for italics are now ignored, and an issue with articles cutting off too soon has been fixed.
   mww.c (9.4 KiB)
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 07 November 2006, 03:09 GMT+1
Links are now supported: Select to enter link selection mode, scroll to select a link on screen, Menu to back out of this mode, Select to load the linked article. There were some bugs in gzip which prevented it from working twice; those are now (mostly?) fixed.
The biggest irritants to me right now are: 1) backwards scroll speed 2) case-insensitive matching which turns up the wrong pages (see FRance) 3) no "back" feature as one would expect on a browser.
   mww2.zip (13.2 KiB)
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 07 November 2006, 07:15 GMT+1
You can no longer scroll beyond the end of the article, and the link selector will stay on screen. Case-sensitive matching is now attempted before case-insensitive matching, so if there is an exact match we use it (France is now accessible). The article title at the start of each article is now skipped. Scrolling may be a little faster.
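The lookup order described in this comment (exact match first, then a case-insensitive fallback) can be sketched roughly as follows. This is only an illustration over an in-memory dictionary, not the plugin's B-tree code; `find_article` and the sample titles are hypothetical.

```python
def find_article(title, index):
    """Return the stored title matching `title`, preferring an exact match."""
    # 1. An exact (case-sensitive) match wins if it exists.
    if title in index:
        return title
    # 2. Otherwise fall back to the first case-insensitive match.
    wanted = title.casefold()
    for candidate in index:
        if candidate.casefold() == wanted:
            return candidate
    return None


# Hypothetical index mapping titles to article offsets.
index = {"FRance": 100, "France": 200}
print(find_article("France", index))  # exact match is preferred
print(find_article("FRANCE", index))  # only the fallback can answer this
```

With a fallback-only search, "France" could resolve to whichever differently cased title comes first, which is the symptom reported earlier.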
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 07 November 2006, 08:44 GMT+1
Fixed a careless file descriptor leak.
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 07 November 2006, 15:43 GMT+1
Scrolling backwards is now faster as the state of the last 50 lines is saved. There is a 10-deep history of pages, accessible by pressing Left. When pages are not found the viewer returns to the previous page (or exits if the start page was not found). Links that start before the first line on the screen now work properly. If no links are onscreen select will not enter link selection mode. The "searching..." display is a bit more consistent.

Does anyone care?
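For readers following along, the bounded page history described in this comment might look something like the sketch below. It assumes a simple "drop the oldest entry" policy and stores a scroll line per page; the `History` class and its method names are hypothetical and are not taken from mww.c.

```python
from collections import deque

HISTORY_DEPTH = 10  # the depth mentioned in the comment above


class History:
    def __init__(self, depth=HISTORY_DEPTH):
        # A bounded deque silently drops the oldest entry when it is full.
        self._pages = deque(maxlen=depth)

    def visit(self, title, line=0):
        """Record a newly opened page together with its scroll position."""
        self._pages.append({"title": title, "line": line})

    def back(self):
        """Drop the current page and return the previous one, or None."""
        if len(self._pages) < 2:
            return None
        self._pages.pop()
        return self._pages[-1]


history = History()
history.visit("Stephen Hawking")
history.visit("Black hole")
print(history.back()["title"])
```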
Comment by Frederik (freqmod) - Tuesday, 07 November 2006, 19:39 GMT+1
Wonderful job.

I have updated the MARKUP_* macros to match the values in wmconv.rb and added colored underlining to indicate underlined, italic and bold text. I also added a little more memory in inflate.c (to fix some problematic articles).

When it comes to case-(in)sensitive searching, I don't know whether a compare function that differs from the one used to create the B-tree will work. It will not give false positives, however, so it is no problem, except that it uses a little more disk access time.
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 07 November 2006, 20:49 GMT+1
A quick fix: insert
hist[curhist].curline=0;
at line 275 in mww.c.
curline was never being initialized.
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 07 November 2006, 22:26 GMT+1
That exposed another issue regarding line counting (go to an article, go back, scroll up... jumps back to line 0 because the previous fix clears curline and we lose track of how many lines to seek ahead for backwards scrolling).
Should be good now.
Comment by Adam Gashlin (AdamGashlin) - Wednesday, 08 November 2006, 17:44 GMT+1
I used the viewer for a few hours yesterday (much easier to use with a large font to reduce eye strain) and came up with a list of things I think need doing (written in the dark at varying angles in red pen on random scraps of paper on my desk...):

* Don't reallocate memory when loading new articles. Although I know the current system will produce the same addresses each time, the page history relies rather strongly on that fact, which could become a problem if things change.
* Case sensitive search only when the article title is produced internally, i.e. from a link (only case-insensitive search for user-entered text)
* Save the index of an article when found, so upon going back to it we don't need to search again.
* Save current status to a file, to come back to it at a later point. This could produce problems if not handled carefully: changing font between runs will change line breaks (easily solved by storing the character offset of the beginning of the first line displayed and recalculating breaks upon loading), the database itself might change (I think this is unlikely enough to not worry about how to handle it, save to ensure it doesn't crash), and other potential pitfalls, I'm sure.
* Menu - to use the save functionality, also to enter a new article title (should be treated as if it was a link followed so we can always go back), perhaps toggle emphasis highlighting (as I don't particularly care for those colored underlines) and other configurables. Should take a look at the text viewer for ideas here.
* Better display of article title (right now it is ignored)

Stuff for the preprocessor (note however that I have not so much as read a line of it, yet):
* Process basic templates - stuff like seealso and main should be handled, as they are quite often used in large articles.
* Images - just put the alt text for the image
* Offsite links - not sure how to handle these... ignorance should be a fine policy
* Cross-mediawiki links - could potentially be supported if the database for the given article is also present, I don't consider this too important, though
* Categories - Generating a page for each category, and processing the template on each article to link to that category, would aid navigation
* HTML Entities - should be replaced with equivalent unicode characters (what about &gt; and &lt; ?)
* Anchors - How to support? Simple half-solution is to just strip off the anchor when following a link.
* Do something nice with ==headings==, lists (on further thought this may not be necessary, both are presented reasonably enough in the current code)
* Do something (even if not nice) with tables
* Do something with the ref tags
* Redirect processing seems to have some problems

It would also be helpful to take a nice long look at how wikipodia works, although I don't intend to prettify things too much and I'd prefer to keep it fairly simple.

Just some ideas I'll be working on. Is there a better place for this kind of discussion than in the feature request?
Comment by Adam Gashlin (AdamGashlin) - Wednesday, 08 November 2006, 17:50 GMT+1
Sorry, by wikipodia I meant Encyclopodia, it had been a while since I looked thoroughly at that.
Comment by Frederik (freqmod) - Wednesday, 08 November 2006, 20:08 GMT+1
A system for patching the wikis (to save network bandwidth), which I will try to make:
* Compare each article in the old version against the new one and store diffs (gzip-, bzip2- or LZMA-compressed) or xdeltas of the uncompressed article text (I will see which is most efficient, and whether diff/patch produces _exactly_ equal files)
* Store a "recipe" (binary):
  * Copy article: <name>
  * Patch article: <patch offset> <name>
* Create an ANSI C program with no dependencies that compiles on Windows, Mac OS X and Linux and patches a database.

Things that would be nice:
A menu where you could access:
* Page history (a list, with page titles, and stored offsets (in database))
* Document structure (the parser should generate an index with byte offsets, local to the decompressed article)
* Search for new article (with the last article name entered)
* Exit
Comment by Adam Gashlin (AdamGashlin) - Wednesday, 08 November 2006, 20:33 GMT+1
I understand and agree with all your points but the Document structure menu item, I am not clear what you mean. Headings, anchors, sections? Yes, that makes sense, now that I think of it.
I've made a fix for some issues with multiple whitespace and wrapping. If there are two spaces at the end of a line, one is consumed for the break and the other ends up at the start of the next line, which I don't like; the new version ignores multiple whitespace (with the exception of multiple newlines). I'll wait to post it until I've done a bit more work today.
Comment by Giles (Giles) - Friday, 10 November 2006, 09:15 GMT+1
OK, so I understand all about the sources and compiling and all, but try as I might, Rockbox won't compile! I upgraded my make, arm-elf-gcc and gcc and it still gives me errors. Props to anyone who will compile me a build of Rockbox for the iPod video containing the latest media wiki viewer, that way I can show off my iPod and get all the chicks.

Please, I will be eternally grateful.
Comment by Frederik (freqmod) - Saturday, 11 November 2006, 21:38 GMT+1
1 Bug in btsearch.c:87 LOGF("K:%d,%s",utf8strcnmp(((const unsigned char*)key),((const unsigned char*)nd_key),rkeylen,keylen),nd_key); --> LOGF("K:%d,%s",utf8strcnmp(((const unsigned char*)key),((const unsigned char*)nd_key),rkeylen,keylen,casesense),nd_key);
2 my rockbox build http://freqmod.dyndns.org/upload/rockbox.11.11.06.patches.mww.zip (with mww, and a few other patches (some does not work as intended, but should not harm))
Comment by Adam Gashlin (AdamGashlin) - Sunday, 12 November 2006, 00:59 GMT+1
Rewrite of a substantial portion, now with: menu (save history, new article, exit, clear history), history save/restore, no need to search again when visiting pages in history.
Not too heavily tested yet, though I will surely be using it frequently.
Comment by Adam Gashlin (AdamGashlin) - Sunday, 12 November 2006, 01:17 GMT+1
Navigating to a new article through the menu didn't reset the scroll offset. Also made abort from keyboard behave properly (the code was there but commented out).
Comment by Izzy (bro2baseball) - Sunday, 12 November 2006, 05:11 GMT+1
Amazing work, Adam. Don't think you're not appreciated. I wish that I could figure out how to use this! I'm on windows and for some reason I can't figure it out. But I'm gonna keep trying; just wanted to let you know that you're doing great work!

bro2
Comment by Adam Gashlin (AdamGashlin) - Sunday, 12 November 2006, 06:15 GMT+1
Reinstated the "Loading..." message, renamed save files to .wws.
Izzy, if you can get in touch with me I can possibly walk you through the building, I'm often in #rockbox, my nick is hcs.
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 14 November 2006, 02:07 GMT+1
I'm in the process of rewriting the xml dump parser in C with expat, as a first step towards improved preprocessing. I'm undecided about what to do with templates so far, though. Some I think it would be best to inline (main, seealso), some could use some special processing (categories), and some I just don't know about (infobox).
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 14 November 2006, 13:42 GMT+1
A few additional features for the viewer (haven't gotten anything significant done with the preprocessor yet):
-seeking to targets (you know, the # things)
-navigation (the Navigate thing in the menu now displays an outline that you can scroll through, pressing select will jump to that part of the article)
-fixed a few bugs with saving
-history view (a list of the titles in the history)
Comment by Adam Gashlin (AdamGashlin) - Wednesday, 15 November 2006, 00:15 GMT+1
Faster backwards scrolling in outline navigation, additional option for entering a new article name from scratch, pressing select in the history view will jump back to the topmost displayed article.
Comment by Adam Gashlin (AdamGashlin) - Wednesday, 15 November 2006, 00:49 GMT+1
fix to make target seeking tolerant of spaces at the end of the heading
Comment by Izzy (bro2baseball) - Saturday, 02 December 2006, 23:23 GMT+1
Adam, is there anyway I can get the (already compiled) file to actually run the wikimedia dump? I did build rockbox with the mww viewer, but I can't figure out how to compile in VMware. I'm not exactly sure how the actual dump process works, but maybe you can make this a little easier for me (the windows user) :(

thanks for any help

bro2
Comment by Adam Gashlin (AdamGashlin) - Sunday, 10 December 2006, 16:27 GMT+1
bro2, what target do you have?
Comment by Izzy (bro2baseball) - Sunday, 10 December 2006, 17:37 GMT+1
Ipod 5g
Comment by Adam Gashlin (AdamGashlin) - Monday, 11 December 2006, 03:03 GMT+1
http://www.hcs64.com/wiki

Here you will find a somewhat out of date processed wikipedia dump and a build of the viewer from the latest CVS.
Comment by Izzy (bro2baseball) - Tuesday, 12 December 2006, 03:06 GMT+1
Thanks so much Adam! As I'm downloading these files, I just thought I'd ask where I should put them. They're(the .wwa files) all different parts of the wikipedia, correct? Am I able to search all these parts (the entire wikipedia) at once? Again, thanks a million. Your work is amazing!

bro2
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 12 December 2006, 03:31 GMT+1
The wwa files are the actual article text, wwi is the index. They can be anywhere in the filesystem but they should all be in the same directory. The plugin actually opens the wwi file, which is then used to access the others.
mww.rock goes in .rockbox/viewers/, viewers.config replaces the one in .rockbox/ (or you can manually add the line
wwi,viewers/mww, 55 55 55 55 55 55
to the existing file)

Also, credit where it is due, the archives are all generated by freqmod's conversion utilities, and the database backend of the viewer is still mostly his.
Comment by Frederik (freqmod) - Tuesday, 12 December 2006, 18:03 GMT+1
Cleaned up warnings (in all files except mww.c where Adam has done a great job).
Added playbackmenu to the menu.
Comment by Frederik (freqmod) - Tuesday, 12 December 2006, 18:30 GMT+1
Uploading specification of the wwi and wwa format.

I don't have very much time to work on this project, as I am preparing a program for showing videos on multiple projectors which must be finished before a performance in the last week of February (and for rehearsals). I will probably work more on the patch solution for this viewer before that, once I have made some more progress on that project.
Comment by Frederik (freqmod) - Tuesday, 12 December 2006, 18:47 GMT+1
(the specification was raw text, all numbers were in little endian)

all text lengths are in bytes

Input for btcreate (endianness matches the architecture btcreate runs on):
In-list:
uint32 data_lo
uint32 data_hi
uint32 title_length
char8*title_length title (in UTF8)
Redirect-list:
uint32 from_length
uint32 to_length
char8*from_length redirect from (in UTF8)
char8*to_length redirect to (in UTF8) (the datapointer is taken from this entry in the in-list)
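As a reading aid, the in-list record layout above (three uint32 fields followed by the UTF-8 title) could be written and read back along these lines. This is a sketch only: it assumes a little-endian host, and the helper names `pack_entry` and `iter_entries` are hypothetical, not part of btcreate.

```python
import struct

# data_lo, data_hi, title_length as three uint32 values.
# "<" assumes a little-endian host; the spec says btcreate uses its own byte order.
HEADER = struct.Struct("<III")


def pack_entry(data_lo, data_hi, title):
    """Serialize one in-list entry; title_length is counted in bytes."""
    raw = title.encode("utf-8")
    return HEADER.pack(data_lo, data_hi, len(raw)) + raw


def iter_entries(buf):
    """Yield (data_lo, data_hi, title) for each entry in a byte buffer."""
    pos = 0
    while pos < len(buf):
        data_lo, data_hi, length = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        title = buf[pos:pos + length].decode("utf-8")
        pos += length
        yield data_lo, data_hi, title


buf = pack_entry(4096, 0, "Zürich") + pack_entry(8192, 1, "Oslo")
for entry in iter_entries(buf):
    print(entry)
```

Because lengths are in bytes rather than characters, a title such as "Zürich" takes 7 bytes, not 6.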
Comment by emilie dancer (stardancer) - Monday, 01 January 2007, 01:17 GMT+1
Wow, just installed the viewer and made my own dumps from wikipedia. Thanks for all the amazing work! Can we change the status of this tracker to at least "Alpha"? So people know that it is in development and very functional?
Comment by Matthias Larisch (Matze88m) - Thursday, 18 January 2007, 13:20 GMT+1
I have problems with my own dump... dewiki becomes a 35 MB wwi + 1 GB wwa file, but the reader doesn't find ANY article! There are some problems during conversion with illegal links, but those are only about 50 or so... The rest of the conversion process finishes without any error.

When i download the enwiki dump from the link above it works without problems... When I use my own dump, it says "Searching..." and immediately (within same second, when not same 100th of the second) it says "Not found" and quits the reader.

I have to say that i use iriver h300 (with #define BUTTON_SCROLL_FWD BUTTON_UP and equal ones for SCROLL_DOWN and BUTTON_MENU) and the plugin works fine :) but only with the enwiki dump from above. The Ipodsimulator doesnt work with my dewikidump either.

Any ideas?

Thank you!
Comment by Matthias Larisch (Matze88m) - Thursday, 18 January 2007, 13:21 GMT+1
Oh if someone wants to look at my .wwi file: http://eow.ath.cx/other/tier/rockbox/dewiki20061130.wwi (currently uploading - finished in 30 minutes - 35MB)
Comment by Frederik (freqmod) - Thursday, 18 January 2007, 16:40 GMT+1
I have looked at the wwi file. If you could provide (a part of) the wwt file it would help. It seems like all the entries are in the .wwi index but all the values (pointers to the addresses in the wwa file) point to 0 0. If the wwt file also has 0 0 as the index, then there is a problem with the Ruby converter; otherwise there is a problem with btcreate.

If you want to investigate yourself open the file in an hex editor and use the file format documentation as a reference.
Comment by Matthias Larisch (Matze88m) - Thursday, 18 January 2007, 20:43 GMT+1
Thanks for your fast response. Oh yes, in wwt-File there is everything (except one byte) between the titles set to 0. The ruby converter of your second post (mediawikiviewer.tar.bz2) is the right one? I'll have a look at the source and try a bit around...
Well there is one step i did different than the README:
instead of using bzcat file | ruby converter /dev/stdin outputfile

i used directly ruby converter <my_decompressed_dewiki.xml> <outputfile> because my cygwin doesnt have a device /dev/stdin.

I'll try doing that directly on a real linux box with the readme command next night, after looking at the ruby script.
Comment by Matthias Larisch (Matze88m) - Friday, 19 January 2007, 08:16 GMT+1
It works :) The cygwin ruby reads only zero at $filepointer.len which is used to write the offset to wwt file... I tested this with a debug-print in there. On my real linux machine it works, the generated wwi/wwa files do work too :))
Really nice work!
Maybe I'll write a fast C converter when I have the time (but that would be windows only at first - maybe someone will port it some day)
Comment by Izzy (bro2baseball) - Saturday, 20 January 2007, 23:45 GMT+1
I'd love if you did that, Matthias. A converter would be amazing! :)
Comment by Matthias Larisch (Matze88m) - Sunday, 21 January 2007, 01:16 GMT+1
Well I began this morning... It is able to create the wwr file for now. wwt should be no problem but i have a general C/Pointer problem now :) I'm not very good in C. My code is verrrry dirty, but works and is much quicker than the original ruby code.
For now I have no parsing of any style - I expect a completely working alpha-version tomorrow evening :)
Comment by Izzy (bro2baseball) - Sunday, 21 January 2007, 05:16 GMT+1
Sweetness. Can't wait :)
Comment by Matthias Larisch (Matze88m) - Sunday, 21 January 2007, 09:22 GMT+1
But you know that u can also use the converter in freqmods second post? Well it is maybe a bit complicated, but mine wont be much easier ^^
Only problem is that it doesnt seem to run on a cygwin, dont know why. But you should try it with win32-ruby. Real linux should be best :)
My problem with it: pc is simply rebooting at output filesize ~720MB. Maybe that's my crappy PC, maybe a harddiskerror, dont know. Complete Output file is 1000MB in size. Thats the main reason why i'm coding a new program - so that i can use the whole wikipedia :)

What is your problem with freqmods converter?
Comment by Matthias Larisch (Matze88m) - Monday, 22 January 2007, 14:26 GMT+1
hmm will take a little bit longer, sorry :) Had to cope with a bufferoverflow which i didnt find:) Now everything is clear, work goes on and maybe finished today or tomorrow. But now i have to go to my real work for 5 hours :)

Ah btw a problem with the ruby converter: My PC always crashed at 2GB of the input file. this is what i found out. I didnt use directly decompression with bzcat but converter <wikidump.xml>. Maybe ruby has a problem with the 2GB limit?
Comment by Izzy (bro2baseball) - Monday, 22 January 2007, 22:37 GMT+1
Actually, I've never succeeded in dumping my own file. I merely used the kinda outdated files on Adam's website http://www.hcs64.com/wiki. I have a working cygwin installation, but the instructions included in freqmod's Readme weren't clear enough for me to finish. A n00b like me needs a step by step guide with everything included :-\. I was hoping for something that would be a bit easier :-)

bro2
Comment by Matthias Larisch (Matze88m) - Tuesday, 23 January 2007, 01:19 GMT+1
Okay :) converter is in a useful state now. But it doesnt do ANY layout/article processing. It only creates one big wwa file and the two wwr/wwt files.
Just if anyone needs it now... I would call it 0.1 pre-alpha :) because of the missing main functionality.

how to:
run compile (requires zlib & pcre devel packages)
start xmlconv with wikidump as first argument and output prefix as second argument.
(cygwin or reallinux!)

After that u have to use the btcreate from freqmod.

dont blame me for bad coding... I'm not that good at it ^^

conversion time: dewiki 2,5gb -> 1gb wwa file in exactly 45 minutes on amd athlon xp 1700+ with 512mb sd ram running debian 3.0
Comment by Matthias Larisch (Matze88m) - Wednesday, 24 January 2007, 20:43 GMT+1
okay here is a new version :)

Conversion time on 1,86GHz Pentium-M Laptop (slow HD...) was 13 minutes for de-wiki. I think thats quite amazing in contrast to original converter :)

limitations:
-doesnt really parse the XML, only wikipedia "style" will be parsed correctly
-Parsing of layout (bold, etc.) is done via char for char comparison... this speeds up and should be expandable later (regexps should work also)
-I dont know how mww expects Links (links work, but not links with other name than the target)
-gzip-compression really bad implemented... writes to temp.gz file on harddisk and rereads it... I couldnt find an easy to implement way of in-memory compression. This should give another speedup :)
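On the last limitation: one way to get gzip-framed output without a temp.gz round trip is to ask zlib for a gzip wrapper directly. The sketch below shows the idea in Python; the helper name `gzip_in_memory` is hypothetical. In C, the equivalent is probably `deflateInit2` with `windowBits` set to 15 + 16, but that mapping to xmlconv is an assumption and has not been tried against its code.

```python
import gzip
import zlib


def gzip_in_memory(data, level=9):
    """Compress bytes into a gzip stream without touching the disk."""
    # wbits=31 (16 + 15) makes zlib emit a gzip header and trailer
    # instead of the default zlib wrapper.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    return compressor.compress(data) + compressor.flush()


article = b"'''Rockbox''' is a free firmware replacement. " * 20
blob = gzip_in_memory(article)

# Any gzip reader should accept the result.
assert gzip.decompress(blob) == article
print(len(article), "->", len(blob), "bytes")
```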

I added precompiled binaries compiled on my cygwin machine.
Libz and pcre-development packages are needed for compiling.

Have fun with it :)


Comment by Frederik (freqmod) - Wednesday, 24 January 2007, 22:19 GMT+1
Links: A_START <link/description> A_END
or: A_START <link> G_MODE <description> A_END
From the wwi-specification:
Name ASCII-code:
A_START 007
A_END 008
G_MODE 015

I hope this helps you :)
Comment by Matthias Larisch (Matze88m) - Wednesday, 24 January 2007, 23:33 GMT+1
Ahh, G-MODE is the one i look for :) But it is not implemented in mww?

Was a very simple addition to the converter...
It now should work with same features as yours :)
I forgot to say: It automatically strips off html comments (<!-- -->) and converts &gt; and &lt; to > and <.
One weird thing: my article.wwa file was shrinked from 1000MB to 870MB after implementation of the whole conversion...
Okay, for ''''' it is 5:1 compression, but that shouldnt be sooo many savings... Same for html comments... I browsed through a few articles, they were allright.

--snap
found mistake :) I declared comment-endings as --&lt; but they are --&gt; :) so articles with comments should be missing



attachment: new xmlconv.c
-changes version number to 0.2
-fixes comment-end mistake
-added functionality: alternative link-texts are supported
(application/octet-stream)    xmlconv.c (16.4 KiB)
Comment by Matthias Larisch (Matze88m) - Thursday, 25 January 2007, 21:48 GMT+1
freqmod: there is a problem :) My links wont work now!
After looking to mww.c:
#define MARKUP_BAR 13
this is used for separating link from description!

In your converter u have PL_END as 13 and use that :)

i'm a little bit confused...
Comment by Frederik (freqmod) - Saturday, 27 January 2007, 22:11 GMT+1
I understand why you are confused. I don't understand the links myself.
My convertor:

variable declarations:

A_START=''<< 007
A_END=''<< 8
...
PL_END=''<< 013
G_MODE=''<< 015

links (with pipe):

arr[i]=Mwtags::A_START+tmp[0,pipeidx]+Mwtags::G_MODE+tmp[pipeidx+1,tmp.length-pipeidx-1]+Mwtags::A_END
(this line uses G_MODE)

The output makes a link with 13 (PL_END) as separator. I don't know why, it seems like a bug/unwanted feature in my code. However your convertor (Matthias) works just as well and is much more efficient and makes my ruby convertor obsolete. Therefore i have rewritten mww.c to use 15 as G_MODE.

This breaks compatibility, but this plugin is in no way stable so I think it would be better to change the behaviour than to change the "specification".
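A possible explanation for the 13-versus-15 confusion, offered here as an assumption rather than something confirmed in this thread: Ruby reads integer literals with a leading zero as octal, so `013` is decimal 11 and `015` is decimal 13, which happens to equal MARKUP_BAR in mww.c. The sketch below reproduces that arithmetic and splits a link span on a configurable separator; `split_link` is a hypothetical helper, not code from mww.c or the converters.

```python
# Ruby treats 007, 013 and 015 as octal literals; int(text, 8) mirrors that.
A_START = int("007", 8)  # 7
PL_END = int("013", 8)   # 11
G_MODE = int("015", 8)   # 13, the same value as MARKUP_BAR in mww.c
A_END = 8                # written without a leading zero in the Ruby code


def split_link(encoded, separator=G_MODE):
    """Split A_START target [separator description] A_END into its two parts."""
    if encoded[:1] != bytes([A_START]) or encoded[-1:] != bytes([A_END]):
        raise ValueError("not a link span")
    body = encoded[1:-1]
    target, sep, description = body.partition(bytes([separator]))
    # A link without a separator uses its target as the description.
    return target, (description if sep else target)


link = bytes([A_START]) + b"France" + bytes([G_MODE]) + b"the country" + bytes([A_END])
print(G_MODE, PL_END, split_link(link))
```

For files written with decimal 15 as the separator, as the updated mww.c expects, the same helper would be called with `separator=15`.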

Updated convertor, to work with xmlconv:
   mww16.tbz (23.4 KiB)
Comment by Frederik (freqmod) - Saturday, 27 January 2007, 22:12 GMT+1
ehh, updated mww
Comment by Adam Gashlin (AdamGashlin) - Thursday, 15 February 2007, 22:02 GMT+1
Updated xmlconv.c to support reading from stdin (by using - as the input file name), so I can pipe the output of bzcat directly into it without using up disk space on the uncompressed archive. Full enwiki takes about 1 hour 15 minutes. I'd like to avoid the creation of the temp.gz thing as well but I haven't gotten to it yet.
I hope to get some more work done on mww now that I'm nominally done with NSF and SPC.
   xmlconv.c (16.5 KiB)
Comment by Matthias Larisch (Matze88m) - Thursday, 15 February 2007, 22:55 GMT+1
nice work :)

i didnt get zlib working like expected (gzip headers etc.) without the temporary file. In-Memory compression without this file should give another speedup of about 10% I think... Did file splitting work correct? I never tested this ^^ dewiki is only 980MB compressed :)
Comment by Adam Gashlin (AdamGashlin) - Thursday, 15 February 2007, 23:17 GMT+1
Yes, files are split correctly, there are three.
Comment by Adam Gashlin (AdamGashlin) - Friday, 16 February 2007, 09:30 GMT+1
Updated mww.c to use up and down keys instead of scroll so it'll build for gigabeat, seems to work fine.
Since there are a few of us potentially working on this at the same time, should we set up a sourceforge project for it for version control until it is ready for rockbox?
   mww.c (39.1 KiB)
Comment by Adam Gashlin (AdamGashlin) - Saturday, 17 February 2007, 22:14 GMT+1
Fixed some problems with redirect processing (fseek was using the wrong file descriptor), more entity handling in xmlconv (quotes, especially), no use of temporary file for gzip but no speed improvement (perhaps need to directly use deflate).
Comment by Adam Gashlin (AdamGashlin) - Monday, 19 February 2007, 03:01 GMT+1
Here's the version of the conversion tools I'm currently working with. xmlconv would do odd things if there was wikicode inside a comment, I moved comment processing first in the parsing code to combat this.
I'd like to get rid of the ref tags, as well, I don't care too much about the references and they make things a lot less readable.
Packed everything together for ease of management, and also mww17 so one doesn't have to pick things up all over this task...
Comment by Adam Gashlin (AdamGashlin) - Monday, 19 February 2007, 03:06 GMT+1
reuploading converter04 as .tar.bz2
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 20 February 2007, 08:39 GMT+1
Reimplemented article parsing with a recursive-descent parser (mostly), should allow much more advanced parsing in the future on this framework. Newly supported things include images (well, a link to the page with info about the image shown as [I], and normal text processing on the caption, which may itself include links). The code is something of a preprocessing monstrosity, but I think it lends itself to being easier to tell what the parser is actually doing. Speed is pretty much the same as the old version.

(If anyone disapproves of my totally ripping apart xmlconv, sticking my name in it, and releasing with a new version number, let me know...)
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 20 February 2007, 08:56 GMT+1
Oh, and I remove ref tags.
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 20 February 2007, 17:36 GMT+1
Revised entity support to account for the fact that, for instance, &ndash; appears as &amp;ndash; in the dump. Added support for all HTML entities I've been able to find mention of. Despite the fairly large number and an inefficient linear search for them, they are the exception and do not seem to slow down processing at all.
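The double escaping mentioned here can be illustrated with two decoding passes. This sketch leans on Python's standard `html.unescape` instead of the entity table and linear search that xmlconv is described as using, and `decode_dump_text` is a hypothetical name.

```python
import html


def decode_dump_text(text):
    """Decode article text whose HTML entities were XML-escaped in the dump."""
    # First pass undoes the XML escaping of the dump (&amp;ndash; -> &ndash;),
    # second pass resolves the HTML entity itself (&ndash; -> U+2013).
    return html.unescape(html.unescape(text))


print(decode_dump_text("1914&amp;ndash;1918"))
```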
Comment by Adam Gashlin (AdamGashlin) - Thursday, 22 February 2007, 00:49 GMT+1
ref checking now part of general text processing (to catch it anywhere), added support for self terminating ref tags.
I'd like to once again redo this, with a stage for each level of translation done (1. xml 2. wikicode 3. html rendering) which I think will present a more logical flow for things and improve some bits that are sort of hackishly scattered in at the moment. Might be a while before I get to it, though.
Comment by Matthias Larisch (Matze88m) - Thursday, 22 February 2007, 07:24 GMT+1
You are doing great work, exactly what I hoped to see after my initial "new convertor". I do not have any problem with you ripping apart the whole program or with sticking your name in it. As I said earlier, my version is very dirty but it already worked :) If you manage to write a converter that does a much better job (which it already does) in nearly the same time, that is great! Personally I won't do much work on this project in the future because I have very little time and not the programming skills needed...
Comment by Adam Gashlin (AdamGashlin) - Thursday, 22 February 2007, 22:55 GMT+1
Those ref tags continue to be a pain, I forgot that html tags are case-insensitive.
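For illustration, a minimal sketch of case-insensitive ref removal that also copes with self-terminating tags. It assumes the text has already been unescaped so that tags appear as literal `<ref>`; the function names are hypothetical and this is not the converter's actual code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive prefix match (strncasecmp is not part of ISO C). */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Remove <ref ...>...</ref> and <ref ... /> from text, in place.
 * An opening tag with no matching close is left untouched. */
void strip_refs(char *text)
{
    char *out = text;
    const char *in = text;
    const char *gt, *p;

    assert(text != NULL);
    while (*in) {
        /* Require a delimiter after "<ref" so <references/> is not matched. */
        if (starts_with_ci(in, "<ref") &&
            (in[4] == '>' || in[4] == '/' || isspace((unsigned char)in[4]))) {
            gt = strchr(in, '>');
            if (gt) {
                if (gt[-1] == '/') {      /* self-terminating: <ref name="x" /> */
                    in = gt + 1;
                    continue;
                }
                for (p = gt + 1; *p && !starts_with_ci(p, "</ref>"); p++)
                    ;
                if (*p) {                 /* found the closing tag */
                    in = p + 6;           /* skip past "</ref>" */
                    continue;
                }
            }
        }
        *out++ = *in++;
    }
    *out = '\0';
}
```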
I'm wondering if it might be a good idea to try and use the static HTML dumps? That way we wouldn't have to worry about some of the more irritating to implement aspects of wikicode, like templates (though I've given them a lot of thought). The downside is that they are rather infrequently updated. I'm not considering this too seriously.
I think the multi-stage approach, firmly grounded in the structure of the grammars involved, should be fine. I've just got to get some structural aspects of it decided. I'm taking a compilers course now so I'm trying to apply what I'm learning about parsing to this. Wondering if it might be better to just see how mediawiki does it and rewrite in C, though...

Considering templates again, should they be expanded inline or kept as a separate thing? I'm thinking inline, but I worry it might make things a lot bigger than they need to be.
Comment by Adam Gashlin (AdamGashlin) - Wednesday, 09 May 2007, 18:34 GMT+1
sync'd, and formulated as a patch for easy application
Comment by Adam Gashlin (AdamGashlin) - Sunday, 17 June 2007, 20:57 GMT+1
synchronized
Comment by Timo Horstschäfer (x1jmp) - Saturday, 23 June 2007, 13:10 GMT+1
As this plugin and FS#6697 have almost the same purpose, it would be better if we could merge the projects or work together on one of them.

The advantages of this plugin are its support for links, the gzip compression and binary tree search (maybe I missed something, because I haven't had a deeper look at this plugin for some time...).
What I like about my dict plugin is that it isn't limited to accessing the Wikipedia (a lot of other dictionaries as well) and uses a simple fuzzy search.
I think it's also a better idea to create a plugin-independent document viewer for Rockbox.

So what are your opinions?
Comment by Adam Gashlin (AdamGashlin) - Wednesday, 27 June 2007, 07:44 GMT+1
While you have not had a deeper look for some time, I've never looked at FS #6697, so I really have no idea. I can barely find the time to sync this now and then, but I agree that the effort would probably be best spent on a plugin somewhere between these two. I'm just not able to do any of it myself at the moment.
Comment by Alistair Marshall (amar) - Tuesday, 24 July 2007, 21:29 GMT+1
should this task be changed to a patch rather than a feature request?
Comment by peter watkins (peterw) - Wednesday, 15 August 2007, 22:20 GMT+1
Frederik, Adam, et al. -- thanks for creating this! I'm enjoying having Wikipedia (EN) on my iPod.

Here are a few notes for other users from my experience with this and a fairly recent (week old) Subversion checkout:
- The English Wikipedia dump would uncompress to something like 11 GB. If you're processing the dump on a filesystem that can't handle files that big (say, the fat32 fs on your DAP), you'll want to use pipes to avoid the need to make an uncompressed file. On Linux, something like the following will allow you to make the article files:
bunzip2 -c enwiki-20070716-pages-articles.xml.bz2 | ./xmlconv /dev/stdin wikipedia
- How it works: this creates a "viewer" for .wwi files. To view/search a processed wikimedia article dump, use the Files browse menu, find the ".wwi" file (both the wwi and wwa files must be on your DAP, as noted above), and open it. If your Rockbox DAP is configured to show only supported files, you'll see the name of only the wwi file.
- Fonts and non-ASCII characters. Not all Rockbox fonts have good support for non-US-ASCII characters. If you find that letters with diacritical marks are not displayed properly, try a different font.

Timo -- I tried your dict plugin (was really looking forward to it), but kept running into errors (iPod 5.5g, 64mb). This wikimedia viewer seems more stable.

Personally, I'd also be interested in a PC app to convert dict files to wikimedia XML format so they could be viewed with this plugin.
Comment by Frank M. (framo) - Sunday, 19 August 2007, 14:30 GMT+1
Dewiki and enwiki are working fine on my iPod nano and video :)
The patch doesn't compile for Sansa e200 though.
Comment by Adam Gashlin (AdamGashlin) - Wednesday, 29 August 2007, 14:51 GMT+1
I'm looking at WikiFilter (wikifilter.sf.net) as a starting point for wikicode parsing, and I've downloaded a few dumps to start experimenting with updates.
And at some point I should look at that dict plugin...
Comment by Timo Horstschäfer (x1jmp) - Wednesday, 29 August 2007, 18:31 GMT+1
I suggest having a look at FlexBisonParse which is listed on http://meta.wikimedia.org/wiki/Alternative_parsers.
It looks quite easy to adapt to custom output.
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 04 September 2007, 16:24 GMT+1
Updated patch, I don't think there's anything new here but it'll apply to current SVN.
Comment by Simon Wenger (musician72) - Saturday, 08 September 2007, 21:24 GMT+1
Can someone point me to wikipedia dumps already prepared for this plugin? (The links above give me 404's)
Thanks, Simon
Comment by Adam Gashlin (AdamGashlin) - Sunday, 09 September 2007, 04:17 GMT+1
Indeed, that page is down. I don't have the capability to upload the 3.8 GB the most recent converted enwiki dump (2007/08/02) takes up, unfortunately, though I do have the web space. If someone else is interested in volunteering I can set up an FTP account to put it up in my space.
Comment by Simon Wenger (musician72) - Sunday, 09 September 2007, 08:51 GMT+1
Ok. I'm downloading and converting the dewiki (german) now. I might just as well upload it later, after testing. I'll contact you about an FTP account then.
And I'd like to write a short HOWTO on the patching (no details) and the downloading/converting of the dump. Should I just place it in a comment or could we put it in the top (Details section)? It's confusing to have the old converter method and dead links in the earlier posts.
Comment by Simon Wenger (musician72) - Monday, 10 September 2007, 09:07 GMT+1
The patch applies to the current revision, but doesn't compile.

MAKE in mww
CC mww.c
mww.c: In function ‘set_article_offset’:
mww.c:189: warning: implicit declaration of function ‘printf’
mww.c: In function ‘viewer_init’:
mww.c:936: error: ‘BUTTON_MENU’ undeclared (first use in this function)
mww.c:936: error: (Each undeclared identifier is reported only once
mww.c:936: error: for each function it appears in.)
make[3]: *** [/home/simon/Desktop/iaudio/bleeding/rockbox/build/apps/plugins/mww/mww.o] Error 1
make[2]: *** [mww] Error 2
make[1]: *** [rocks] Error 2
make: *** [build] Fehler 2

From the little I know about programming, I guess it's because of the target? I'm compiling for the iaudio X5 (sim), which does not have a menu button.
Am I right? Is there a way out of this?
Comment by Simon Wenger (musician72) - Monday, 10 September 2007, 09:28 GMT+1
Hmm, can't compile for H300 sim either... and this target knows a menu button (A-B).
Comment by Daniel Dalton (ddalton) - Monday, 10 September 2007, 09:35 GMT+1
Hmmm it looks like "BUTTON_MENU" hasn't been declared anywhere. You will have to find out what it is called. Maybe BUTTON_MENU is the menu button? (Ab on h300)
Comment by Adam Gashlin (AdamGashlin) - Monday, 10 September 2007, 10:26 GMT+1
Looking at the keymaps, it looks like BUTTON_MODE|BUTTON_REL is used for the menu function on the h300, so just drop that in there. I should probably find out how to get access to the keymap stuff so mww doesn't have to be too nastily specific to each target.
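One way the per-target substitution could look, as a sketch only. The stand-in definitions at the top exist so this compiles outside the Rockbox tree and their numeric values are placeholders; in a real plugin the keypad and button constants come from the target headers. `MWW_MENU` and `is_menu_press` are hypothetical names, and the Sansa mapping is the BUTTON_POWER substitution reported in this thread.

```c
/* Stand-ins so the sketch builds on a PC; the values are placeholders. */
#define IPOD_4G_PAD     1
#define IRIVER_H300_PAD 2
#define SANSA_E200_PAD  3

#ifndef CONFIG_KEYPAD
#define CONFIG_KEYPAD IRIVER_H300_PAD  /* pretend we are building for the h300 */
#endif

#define BUTTON_MENU   0x01
#define BUTTON_MODE   0x02
#define BUTTON_POWER  0x04
#define BUTTON_REL    0x80

/* Pick the menu key once per target instead of hard-coding BUTTON_MENU. */
#if CONFIG_KEYPAD == IPOD_4G_PAD
#define MWW_MENU BUTTON_MENU
#elif CONFIG_KEYPAD == IRIVER_H300_PAD
#define MWW_MENU (BUTTON_MODE | BUTTON_REL)
#elif CONFIG_KEYPAD == SANSA_E200_PAD
#define MWW_MENU BUTTON_POWER
#else
#error "mww: no menu key mapping for this keypad"
#endif

/* Returns nonzero if the button event should open the menu. */
int is_menu_press(int button)
{
    return button == MWW_MENU;
}
```

The `#error` branch makes an unmapped target fail with a clear message rather than an undeclared-identifier error deep inside the viewer code.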
Regarding making things easier, we're really in a horribly hacky incomplete state now, especially with the converter (though mww is a mess codewise, too). I'd rather it not be too easy to do things the wrong way, especially to avoid transitioning issues when a better way of keeping things updated comes along. This is, however, merely my position, and I'm not doing anything active for this project at the moment anyway (sorry for the grand plans a few messages back), so by all means make things happen if you wish.
Comment by Frank M. (framo) - Wednesday, 12 September 2007, 14:37 GMT+1
It compiled ok for my Sansa E280 when I replaced BUTTON_MENU with BUTTON_POWER in mww23.diff.
The article scrolling is a bit strange (using the up/down button instead of the wheel) but it works.
Comment by Simon Wenger (musician72) - Wednesday, 12 September 2007, 23:03 GMT+1
Yes, I did almost the same. I replaced BUTTON_MENU with BUTTON_MODE|BUTTON_REL, works perfectly, thanks! An impressive piece of software!
Adam, you could send me an email with the ftp login now, I will then upload the converted dump (german atm, maybe someone else can do others?), and, later a little howto. simon(?)jso.be
Comment by Xinlu Huang (polygonal) - Monday, 24 September 2007, 01:20 GMT+1
I encounter a stack overflow when using the xml converter:
Exception: STATUS_STACK_OVERFLOW at eip=00403333
eax=005F73BC ebx=00000000 ecx=00032CDC edx=0000004C esi=611001A0 edi=004037A0
ebp=0022CCE8 esp=0022CCD4 program=C:\cygwin\home\...\xmlconv.exe, pid 5788, thread main
cs=001B ds=0023 es=0023 fs=003B gs=0000 ss=0023
Stack trace:
Frame Function Args
0022CCE8 00403333 (00000003, 61169690, 01630090, 00000000)
0022CD98 61006198 (00000000, 0022CDD0, 61005510, 0022CDD0)
61005510 61004416 (0000009C, A02404C7, E8611001, FFFFFF48)
8 [main] xmlconv 5788 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)

I'm using cygwin on XP, if that helps. Any idea why this is happening?
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 25 September 2007, 01:00 GMT+1
I've never encountered that error, then again I've never tried using the converter in cygwin.
Exactly which mediawiki dump are you using?

In other news, Simon's conversion of dewiki (20070903) is now on my web site.
http://hcs64.com/rockbox/wikipedia/
Comment by Xinlu Huang (polygonal) - Tuesday, 25 September 2007, 04:49 GMT+1
I use enwiki-latest-pages-articles.xml.bz2 from http://download.wikimedia.org/enwiki/latest/

The link to the converted wiki is interesting, but I don't understand a word of German ;) I'll need to find some way to get the enwiki in the right format...
Comment by Adam Gashlin (AdamGashlin) - Tuesday, 25 September 2007, 08:36 GMT+1
Aaand, my mirror of dewiki is now down. Went through 250 GB in a few hours (I have 3.6 TB monthly but they won't let me use it all up in a day). I'd try something like a torrent but I don't have anything to host it from.
Comment by Xinlu Huang (polygonal) - Wednesday, 26 September 2007, 23:52 GMT+1
I got Ubuntu for this and xmlconv works pretty well on my USB stick ;) There are still quite a few tags not parsed, most noticeably (for me) some tags at the beginning of articles, the quote tags, and the math tags (I of course don't expect a nice picture of an integral or anything, but reading <math></math> is quite annoying, and in most cases simply removing the math tag itself would do). Are these unparsed tags expected?

Also I'm curious about the coding in the converter: is UTF-8 used?

As for the cygwin problem, I'm guessing that cygwin restricts memory usage, etc., since it is running in a hosted environment, so maybe that's where the stack overflow comes from.

It is so wonderful to have wikipedia on my ipod :D
Comment by Adam Gashlin (AdamGashlin) - Friday, 28 September 2007, 02:59 GMT+1
Yes, UTF-8 is used. Regarding the HTML tags, they're handled on a case-by-case basis, and many are not handled at all. The converter needs a rework. Sorry for the extended "proof of concept" phase.
Comment by Fed (Fed) - Saturday, 06 October 2007, 16:09 GMT+1
Is there a way of getting a pared-down set of wiki pages?
Comment by Adam Gashlin (AdamGashlin) - Saturday, 06 October 2007, 18:05 GMT+1
What do you mean? Like a selection of a smaller number of important pages? There are projects that do that but I don't think they provide XML dumps.
Comment by Fed (Fed) - Saturday, 06 October 2007, 18:17 GMT+1
Where am I supposed to put the xml file on the sansa?
Comment by Fed (Fed) - Saturday, 06 October 2007, 20:16 GMT+1
That's exactly what I mean. Can I save some pages, and make my own xml file and use that?
Comment by Simon Wenger (musician72) - Sunday, 07 October 2007, 00:10 GMT+1
Here's the very first version of a really simple HOWTO. It took me a while to figure out where to get the wiki dumps, how to convert them, and how to compile for targets other than the iPod. I hope others can find a shorter way.
Please feel free to send me any improvements. I would love to include them.
I suggest making the document available at the top of this page, as the thread has become very long and complex by now.
Thanks to Matze88m for the work on the converter and freqmod and AdamGashlin for this fantastic plugin, I've spent at least 20 hours in the past month, learning my head off....
Comment by Fed (Fed) - Sunday, 07 October 2007, 11:10 GMT+1
Thanks a lot for the howto!!!
Comment by Fed (Fed) - Sunday, 07 October 2007, 15:22 GMT+1
I get an error when I try to compile. It looks like pcre.h is missing.
Comment by Xinlu Huang (polygonal) - Sunday, 07 October 2007, 15:32 GMT+1
Your linux distro (or cygwin?) does not have the libpcre package. You have to install it.
Comment by Fed (Fed) - Sunday, 07 October 2007, 15:38 GMT+1
I am using a mac. Do you know what I have to install?
Comment by Fed (Fed) - Sunday, 07 October 2007, 15:49 GMT+1
I seem to have found a pcre.h for mac. Does this seem right to you? Where should I put it in the file tree?
   pcre.h (12.1 KiB)
Comment by Fed (Fed) - Monday, 08 October 2007, 17:13 GMT+1
I am sorry about all the posts. I have finally gotten everything to work.
The pcre for mac is available at http://pcre.darwinports.com/

I use a Sansa, and in addition to "replace BUTTON_MENU with BUTTON_POWER (Thanks framo!)", you should replace BUTTON_SCROLL_BACK with BUTTON_SCROLL_UP, and BUTTON_SCROLL_FWD with BUTTON_SCROLL_DOWN

I also added "rb->backlight_set_timeout(1);" after "rb = api;" as well as
"rb->backlight_set_timeout(rb->global_settings->backlight_timeout);" before "return PLUGIN_OK;" so that the backlight stays on while I read, and goes back to the user settings when the plugin closes. I think if this could be set up as an option it would be very useful.
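Put together, the two additions would sit roughly like this. The sketch uses minimal stand-ins for the plugin API so it can be compiled on a PC; only the two `backlight_set_timeout` calls are taken from the description above, and everything else (the mock, `read_article`, the struct layouts) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the Rockbox plugin API, just enough for this sketch. */
enum plugin_status { PLUGIN_OK = 0 };
struct user_settings { int backlight_timeout; };
struct plugin_api {
    void (*backlight_set_timeout)(int index);
    struct user_settings *global_settings;
};

static struct plugin_api *rb;

/* Mock backlight driver that records the last value it was given. */
static int current_timeout;
static int timeout_while_reading;

static void mock_backlight_set_timeout(int index)
{
    current_timeout = index;
}

/* Placeholder for the viewer main loop; records the setting in effect. */
static void read_article(void)
{
    timeout_while_reading = current_timeout;
}

enum plugin_status plugin_start(struct plugin_api *api, void *parameter)
{
    (void)parameter;
    assert(api != NULL);
    rb = api;

    /* Value quoted in the comment above: keep the backlight on while reading. */
    rb->backlight_set_timeout(1);

    read_article();

    /* Restore the user's own setting before leaving the plugin. */
    rb->backlight_set_timeout(rb->global_settings->backlight_timeout);
    return PLUGIN_OK;
}
```

If the plugin has more than one exit path, each of them would need the restore call, so routing every return through a single cleanup point is probably the safer layout.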

Would it be possible to have a function to copy text to a text file? (ie export a selection)

Also, could someone add a bookmarking function? I don't even know where to start for that.

Thanks again for such a great program!



Comment by Simon Wenger (musician72) - Monday, 08 October 2007, 19:06 GMT+1
Glad it worked! Bookmarking is implemented. Press rec (in my case) to get the menu, then save the history. Now the last article is reopened when restarted and you can view the history and jump back. As long as you don't resave the history it will remain unchanged.
The backlight thing would be really cool!
Comment by Fed (Fed) - Monday, 08 October 2007, 23:21 GMT+1
What about copying text. Any idea how that could be done?

The problem I find with the bookmarking as it is now is that you can't pick and choose what to keep. You have to keep it all.
Comment by Fed (Fed) - Tuesday, 09 October 2007, 00:47 GMT+1
I am trying to make a faster scroll (using the rec button), and I think I found an error in the code.
Should line 224 be:
advance_scrollback(1,(dorender && i==a-1));
it is currently
advance_scrollback(1,(dorender && i==-a-1));
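To show why the condition matters, here is a self-contained sketch. It assumes the call sits in a loop that runs `a` times with `i` counting up from 0, which is a guess since the surrounding mww.c code is not quoted here; the stub and the `scroll_lines` names are hypothetical.

```c
#include <assert.h>

static int render_count;

/* Hypothetical stand-in for advance_scrollback() in mww.c: moves by
 * `lines` and redraws only when `render` is nonzero. */
static void advance_scrollback(int lines, int render)
{
    (void)lines;
    if (render)
        render_count++;
}

/* Proposed condition: redraw only on the last of the `a` steps. */
void scroll_lines(int a, int dorender)
{
    int i;
    assert(a >= 0);
    for (i = 0; i < a; i++)
        advance_scrollback(1, (dorender && i == a - 1));
}

/* Current condition: for a > 0, -a-1 is negative and i never is,
 * so the redraw never happens. */
void scroll_lines_current(int a, int dorender)
{
    int i;
    assert(a >= 0);
    for (i = 0; i < a; i++)
        advance_scrollback(1, (dorender && i == -a - 1));
}
```

Under that assumption a multi-line forward scroll with the current condition moves the position but never redraws, while the proposed condition redraws exactly once at the end.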

Comment by Adam Gashlin (AdamGashlin) - Tuesday, 09 October 2007, 01:39 GMT+1
Bookmarking for individual articles would be nice.
Yes, that would appear to be a bug on line 224. Doesn't show itself because I never use multiline scroll except when scrolling backwards past the scrollback buffer. I guess I just never noticed when that failed to render as it only happens on the one line when the buffer must be regenerated, and I only ever tested that when scrolling back quickly so a single frame went unnoticed. I just tested this now and it is in fact a problem in the current build.
When you've enabled your faster scrolling and the backlight fix (as that had bothered me as well) could you post an updated patch? If not I could, I guess.
Comment by Fed (Fed) - Tuesday, 09 October 2007, 02:29 GMT+1
Here is the new mww.c
I haven't made the patch because I have a lot of other changes, and it is too confusing. Can you make the patch with this?
   mww.c (40.2 KiB)
Comment by Fed (Fed) - Tuesday, 09 October 2007, 15:38 GMT+1
I was thinking about how to save text.
How about using the right key to advance the text (ie in the 'normal mode') and at the same time append the 'current line' to a file named [title of the article].txt
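A rough sketch of the append step, written against standard C stdio so it can be tried on a PC; inside the plugin the file access would have to go through the Rockbox plugin API instead. The function name `append_line_to_export` is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Append one line of text to "<title>.txt" in the current directory.
 * Returns 0 on success, -1 on failure. */
int append_line_to_export(const char *title, const char *line)
{
    char path[256];
    FILE *f;
    int n;

    /* Article titles can contain '/', which would not be a valid file name. */
    if (title == NULL || line == NULL || strchr(title, '/') != NULL)
        return -1;

    n = snprintf(path, sizeof(path), "%s.txt", title);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                    /* title too long for the buffer */

    f = fopen(path, "a");             /* "a" creates the file or appends to it */
    if (f == NULL)
        return -1;

    if (fprintf(f, "%s\n", line) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

Opening and closing the file for every line is slow but keeps the export intact if the plugin exits unexpectedly; a real implementation might keep the file open while in the export mode.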
Comment by Fed