FS#4755 - Wikipedia
Attached to Project:
Rockbox
Opened by Anonymous Submitter - Wednesday, 01 March 2006, 03:03 GMT+1
Last edited by Steve Bavin (pondlife) - Wednesday, 01 August 2007, 17:46 GMT+1
Details
As already discussed in a thread [1] at misticriver, it would be great to have a version of Wikipedia on Rockbox.
Only a plugin to easily search articles would be needed. Converted dumps are already available, for example the one from the iPodLinux project [2].
[1] http://www.misticriver.net/showthread.php?t=36924
[2] http://ipodlinux.org/Wikipedia
http://encyclopodia.sourceforge.net/en/index.html
http://ipodlinux.org/forums/viewtopic.php?p=57434#57335
Hope this helps.
Misticriver thread: http://misticriver.net/showthread.php?t=36924
Plugin wiki: http://rockipedia.techmight.com
I will release the source soon (before 5.nov) when it is a little more finished.
is that just the source? will there be a patch soon?
great work, btw. i can't wait to use it
bro2
i'll try to figure it out from that. though i don't really understand most of it :(
thanks again!
bro2
Are the directions in your readme for linux users?
When it says in your Readme
---run ./Make.sh
run bzcat <wikimedia dump> | ruby mwxmldumpparse.rb /dev/stdin <prepared dump prefix>--- etc.
What does run mean? How do I run it? I have to install Ruby, GCC, and bzip2? Any way to make this a little bit more friendly to newer users :)?
Your work is awesome btw. I've been waiting for this for a long time!
rckbxr
The biggest irritants to me right now are: 1) backwards scroll speed 2) case-insensitive matching, which turns up the wrong pages (see FRance) 3) no "back" feature as one would expect in a browser.
Does anyone care?
I have updated the MARKUP_* macros to match the values in wmconv.rb and added colored underlining to indicate underlined, italic and bold text. I also added a little more memory in inflate.c (to fix some problematic articles).
When it comes to case-(in)sensitive searching, I don't know whether a compare function different from the one used to create the b-tree will work. It will not give false positives, however, so it is no problem, except that it uses a little more disk access time.
hist[curhist].curline=0;
at line 275 in mww.c.
curline was never being initialized.
Should be good now.
* Don't reallocate memory when loading new articles. I know the current system will produce the same addresses each time, but the page history relies rather strongly on that fact, which could become a problem if things change.
* Case sensitive search only when the article title is produced internally, i.e. from a link (only case-insensitive search for user-entered text)
* Save the index of an article when found, so upon going back to it we don't need to search again.
* Save current status to a file, to come back to it at a later point. This could produce problems if not handled carefully: changing font between runs will change line breaks (easily solved by storing the character offset of the beginning of the first line displayed and recalculating breaks upon loading), the database itself might change (I think this is unlikely enough to not worry about how to handle it, save to ensure it doesn't crash), and other potential pitfalls, I'm sure.
* Menu - to use the save functionality, also to enter a new article title (should be treated as if it was a link followed so we can always go back), perhaps toggle emphasis highlighting (as I don't particularly care for those colored underlines) and other configurables. Should take a look at the text viewer for ideas here.
* Better display of article title (right now it is ignored)
Stuff for the preprocessor (note however that I have not so much as read a line of it, yet):
* Process basic templates - stuff like seealso and main should be handled, as they are quite often used in large articles.
* Images - just put the alt text for the image
* Offsite links - not sure how to handle these... ignorance should be a fine policy
* Cross-mediawiki links - could potentially be supported if the database for the given article is also present, I don't consider this too important, though
* Categories - Generating a page for each category, and processing the template on each article to link to that category, would aid navigation
* HTML Entities - should be replaced with equivalent unicode characters (what about &gt; and &lt;?)
* Anchors - How to support? Simple half-solution is to just strip off the anchor when following a link.
* Do something nice with ==headings==, lists (on further thought this may not be necessary, both are presented reasonably enough in the current code)
* Do something (even if not nice) with tables
* Do something with the ref tags
* Redirect processing seems to have some problems
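The "strip off the anchor" half-solution from the list above could be sketched like this. The function name strip_anchor is hypothetical and the code assumes the link target is held in a writable, NUL-terminated buffer; it is only meant to show how little is needed, not how mww.c or the converter actually handles links.

```c
#include <assert.h>
#include <string.h>

/* Cut a link target at the first '#', so "Article#Section" becomes "Article".
   Returns a pointer to the anchor text (without the '#'), or NULL if the
   target had no anchor. Hypothetical helper for illustration. */
char *strip_anchor(char *target)
{
    char *hash;

    assert(target != NULL);
    hash = strchr(target, '#');
    if (hash == NULL)
        return NULL;
    *hash = '\0';        /* the title now ends where the anchor began */
    return hash + 1;     /* keep the anchor around in case seeking is added later */
}
```

Returning the anchor instead of throwing it away costs nothing and leaves the door open for real anchor support.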
It would also be helpful to take a nice long look at how wikipodia works, although I don't intend to prettify things too much and I'd prefer to keep it fairly simple.
Just some ideas I'll be working on. Is there a better place for this kind of discussion than in the feature request?
* Compare each article in the old version against the new one and store diffs (gz-/bz-/lzma-compressed) or xdeltas (I will see which is most efficient, and whether diff/patch produces _exactly_ equal files) of the uncompressed article text
* Store a "recipe" (binary):
* Copy article: <name>
* Patch article: <patch offset> <name>
* Create an ANSI C program with no dependencies that compiles on Windows, Mac OS X and Linux and patches a database.
Things that would be nice:
A menu where you could access:
* Page history (a list, with page titles, and stored offsets (in database))
* Document structure (the parser should generate an index with byte offsets, local to the decompressed article)
* Search for new article (with the last article name entered)
* Exit
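As a rough idea of the "document structure" index from the menu list above, the sketch below records the byte offset of every heading line in a decompressed article. It assumes headings still show up as lines starting with "==" in the article text, which may not match what the converter really emits, and the name index_headings is invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Record the byte offset of every line that begins with "==".
   Stores at most max_offsets entries and returns how many were stored.
   Assumption: headings survive conversion as "==Title==" lines. */
size_t index_headings(const char *text, size_t len,
                      size_t *offsets, size_t max_offsets)
{
    size_t i, count = 0;
    int at_line_start = 1;

    assert(text != NULL && offsets != NULL);
    for (i = 0; i < len && count < max_offsets; i++) {
        if (at_line_start && i + 1 < len &&
            text[i] == '=' && text[i + 1] == '=')
            offsets[count++] = i;
        at_line_start = (text[i] == '\n');
    }
    return count;
}
```

The viewer could then jump straight to offsets[n] when an outline entry is selected, instead of scanning the article again.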
I've made a fix for some issues with multiple whitespace and wrapping. If there are two spaces at the end of a line, one is consumed by the break and the other ends up at the start of the next line, which I don't like; the new version ignores multiple whitespace (with the exception of multiple newlines). I'll wait to post it until I've done a bit more work today.
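The whitespace handling described above can be illustrated with a small sketch. This is not the actual fix from the patch, just one way to get the same effect: runs of spaces or tabs collapse into a single space, while newlines are passed through so that blank lines survive. The function name collapse_spaces is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Collapse each run of spaces or tabs into one space, in place.
   Newlines are copied through untouched, so multiple newlines are kept.
   Returns the new length of the string. Illustrative helper only. */
size_t collapse_spaces(char *s)
{
    char *src, *dst;
    int prev_blank = 0;

    assert(s != NULL);
    for (src = dst = s; *src != '\0'; src++) {
        int blank = (*src == ' ' || *src == '\t');
        if (blank) {
            if (!prev_blank)
                *dst++ = ' ';   /* first blank of a run: keep exactly one space */
        } else {
            *dst++ = *src;
        }
        prev_blank = blank;
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
```

Doing this once per article, before line breaking, means the wrapping code never sees two spaces in a row.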
Please, I will be eternally grateful.
2 my rockbox build http://freqmod.dyndns.org/upload/rockbox.11.11.06.patches.mww.zip (with mww and a few other patches; some do not work as intended, but should do no harm)
Not too heavily tested yet, though I will surely be using it frequently.
bro2
Izzy, if you can get in touch with me I can possibly walk you through the building, I'm often in #rockbox, my nick is hcs.
-seeking to targets (you know, the # things)
-navigation (the Navigate thing in the menu now displays an outline that you can scroll through, pressing select will jump to that part of the article)
-fixed a few bugs with saving
-history view (a list of the titles in the history)
thanks for any help
bro2
Here you will find a somewhat out of date processed wikipedia dump and a build of the viewer from the latest CVS.
bro2
mww.rock goes in .rockbox/viewers/, viewers.config replaces the one in .rockbox/ (or you can manually add the line
wwi,viewers/mww, 55 55 55 55 55 55
to the existing file)
Also, credit where it is due, the archives are all generated by freqmod's conversion utilities, and the database backend of the viewer is still mostly his.
Added playbackmenu to the menu.
I don't have very much time to work on this project, as I am preparing a program for showing videos on multiple projectors, which must be finished before a performance in the last week of February (and for rehearsals). I will probably work more on the patch solution for this viewer before then, once I have made some more progress on that project.
all text lengths are in bytes
Input for btcreate (endianness is that of the architecture btcreate runs on):
In-list:
uint32 data_lo
uint32 data_hi
uint32 title_length
char8*title_length title (in UTF8)
Redirect-list:
uint32 from_length
uint32 to_length
char8*from_length redirect from (in UTF8)
char8*to_length redirect to (in UTF8) (the datapointer is taken from this entry in the in-list)
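To make the in-list layout above concrete, here is a sketch of decoding one record from a byte buffer. The struct and function names are invented for this example, and reading data_lo/data_hi as the low and high halves of the article's data pointer is a guess based on the field names. Values are read in host byte order, as the format description says.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One in-list record, as described in the format notes. */
struct inlist_entry {
    uint32_t data_lo;       /* assumed: low 32 bits of the data pointer */
    uint32_t data_hi;       /* assumed: high 32 bits of the data pointer */
    uint32_t title_length;  /* title length in bytes */
    const char *title;      /* UTF-8, points into buf, NOT NUL-terminated */
};

/* Decode one record from buf. Returns the number of bytes consumed,
   or 0 if the buffer is too short to hold the whole record. */
size_t parse_inlist_entry(const unsigned char *buf, size_t len,
                          struct inlist_entry *out)
{
    const size_t header = 3 * sizeof(uint32_t);

    assert(buf != NULL && out != NULL);
    if (len < header)
        return 0;
    /* memcpy avoids unaligned reads and keeps host byte order. */
    memcpy(&out->data_lo, buf, 4);
    memcpy(&out->data_hi, buf + 4, 4);
    memcpy(&out->title_length, buf + 8, 4);
    if (out->title_length > len - header)
        return 0;
    out->title = (const char *)(buf + header);
    return header + out->title_length;
}
```

Calling it in a loop and advancing by the return value walks the whole list; a return of 0 means the data is truncated or not in this format.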
When I download the enwiki dump from the link above, it works without problems... When I use my own dump, it says "Searching..." and immediately (within the same second, if not the same hundredth of a second) says "Not found" and quits the reader.
I should add that I use an iriver H300 (with #define BUTTON_SCROLL_FWD BUTTON_UP and equivalent defines for SCROLL_DOWN and BUTTON_MENU) and the plugin works fine :) but only with the enwiki dump from above. The iPod simulator doesn't work with my dewiki dump either.
Any ideas?
Thank you!
If you want to investigate yourself, open the file in a hex editor and use the file format documentation as a reference.
Well, there is one step I did differently from the README:
instead of using bzcat file | ruby converter /dev/stdin outputfile
I directly used ruby converter <my_decompressed_dewiki.xml> <outputfile>, because my Cygwin doesn't have a /dev/stdin device.
I'll try doing that directly on a real Linux box with the README command tomorrow night, after looking at the Ruby script.
Really nice work!
Maybe I'll write a fast C converter when I have the time (but that would be windows only at first - maybe someone will port it some day)
For now I have no parsing of any style - I expect a completely working alpha-version tomorrow evening :)
Only problem is that it doesn't seem to run on Cygwin; I don't know why. But you should try it with win32-ruby. Real Linux should be best :)
My problem with it: the PC simply reboots when the output file reaches ~720 MB. Maybe that's my crappy PC, maybe a hard disk error, I don't know. The complete output file is 1000 MB in size. That's the main reason why I'm coding a new program, so that I can use the whole Wikipedia :)
What is your problem with freqmod's converter?
Ah, by the way, a problem with the Ruby converter: my PC always crashed at 2 GB into the input file; that is what I found out. I didn't decompress directly with bzcat but ran the converter on <wikidump.xml>. Maybe Ruby has a problem with the 2 GB limit?
bro2
Just if anyone needs it now... I would call it 0.1 pre-alpha :) because of the missing main functionality.
how to:
run compile (requires the zlib and pcre devel packages)
start xmlconv with the wiki dump as the first argument and the output prefix as the second argument.
(Cygwin or real Linux!)
After that you have to use btcreate from freqmod.
Don't blame me for bad coding... I'm not that good at it ^^
conversion time: dewiki 2.5 GB -> 1 GB wwa file in exactly 45 minutes on an AMD Athlon XP 1700+ with 512 MB SDRAM running Debian 3.0
Conversion time on a 1.86 GHz Pentium M laptop (slow HD...) was 13 minutes for dewiki. I think that's quite amazing compared with the original converter :)
limitations:
-doesn't really parse the XML; only Wikipedia-"style" XML will be parsed correctly
-Parsing of layout (bold, etc.) is done via char-by-char comparison... this speeds things up and should be extensible later (regexps should also work)
-I don't know how mww expects links (links work, but not links whose name differs from the target)
-gzip compression is really badly implemented... it writes to a temp.gz file on the hard disk and rereads it... I couldn't find an easy way to implement in-memory compression. That should give another speedup :)
I added precompiled binaries compiled on my cygwin machine.
Libz and pcre-development packages are needed for compiling.
Have fun with it :)
or: A_START <link> G_MODE <description> A_END
From the wwi-specification:
Name ASCII-code:
A_START 007
A_END 008
G_MODE 015
I hope this helps you :)
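For anyone writing their own converter, the link encoding quoted above can be sketched as follows. The codes are taken as decimal values (7, 8, 15), which is how the spec excerpt reads given that "008" cannot be octal; the function name encode_link is invented and this is not code from xmlconv or wmconv.rb.

```c
#include <assert.h>
#include <string.h>

/* Markup codes from the wwi spec excerpt, read as decimal. */
enum { A_START = 7, A_END = 8, G_MODE = 15 };

/* Write  A_START <target> G_MODE <description> A_END  into out.
   Returns the number of bytes written (not counting the trailing NUL),
   or 0 if out is too small. Hypothetical helper for illustration. */
size_t encode_link(char *out, size_t outsize,
                   const char *target, const char *desc)
{
    size_t tlen, dlen, pos = 0;

    assert(out != NULL && target != NULL && desc != NULL);
    tlen = strlen(target);
    dlen = strlen(desc);
    if (outsize < tlen + dlen + 3 + 1)   /* 3 control bytes + NUL */
        return 0;
    out[pos++] = A_START;
    memcpy(out + pos, target, tlen);
    pos += tlen;
    out[pos++] = G_MODE;                 /* separates target from description */
    memcpy(out + pos, desc, dlen);
    pos += dlen;
    out[pos++] = A_END;
    out[pos] = '\0';
    return pos;
}
```

So a wikitext link such as [[France|French]] would come out as byte 7, "France", byte 15, "French", byte 8.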
Was a very simple addition to the converter...
It now should work with same features as yours :)
I forgot to say: it automatically strips off HTML comments (<!-- -->) and converts &gt; and &lt; to > and <.
One weird thing: my article.wwa file shrank from 1000 MB to 870 MB after the whole conversion was implemented...
Okay, for ''''' it is 5:1 compression, but that shouldn't account for such a large saving... Same for HTML comments... I browsed through a few articles and they were all right.
--snap
found mistake :) I declared comment-endings as --< but they are --> :) so articles with comments should be missing
attachment: new xmlconv.c
-changes version number to 0.2
-fixes comment-end mistake
-added functionality: alternative link-texts are supported
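The comment stripping mentioned above, including the corrected "-->" terminator, could look roughly like the sketch below. It is not the code from xmlconv.c; the function name is invented, it assumes the text has already been unescaped so that comments appear literally as <!-- ... -->, and dropping an unterminated comment through the end of the text is just one possible policy.

```c
#include <assert.h>
#include <string.h>

/* Remove every <!-- ... --> span from s, in place.
   An unterminated comment is dropped through the end of the string.
   Illustrative helper only. */
void strip_html_comments(char *s)
{
    char *src, *dst;

    assert(s != NULL);
    src = dst = s;
    while (*src != '\0') {
        if (strncmp(src, "<!--", 4) == 0) {
            char *end = strstr(src + 4, "-->");  /* note: "-->", not "--<" */
            if (end == NULL)
                break;                           /* no terminator: discard the rest */
            src = end + 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Searching for the wrong terminator, as in the 0.1 bug, makes the search fail on almost every comment, so everything after the comment start is treated as comment; that would fit the shrunken output file.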
After looking at mww.c:
#define MARKUP_BAR 13
this is used for separating the link from the description!
In your converter you have PL_END as 13 and use that :)
I'm a little bit confused...
My converter:
variable declarations:
A_START=''<< 007
A_END=''<< 8
...
PL_END=''<< 013
G_MODE=''<< 015
links (with pipe):
arr[i]=Mwtags::A_START+tmp[0,pipeidx]+Mwtags::G_MODE+tmp[pipeidx+1,tmp.length-pipeidx-1]+Mwtags::A_END
(this line uses G_MODE)
The output makes a link with 13 (PL_END) as the separator. I don't know why; it seems like a bug/unwanted feature in my code. However, your converter (Matthias) works just as well, is much more efficient, and makes my Ruby converter obsolete. Therefore I have rewritten mww.c to use 15 as G_MODE.
This breaks compatibility, but this plugin is in no way stable, so I think it is better to change the behaviour than to change the "specification".
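One possible explanation for the stray 13, which nobody in the thread confirms: Ruby, like C, reads an integer literal with a leading zero as octal. Under that reading 015 is decimal 13 and 013 is decimal 11, so a G_MODE written as 015 would emit the same byte as MARKUP_BAR 13 in mww.c. The C sketch below only demonstrates the arithmetic; the enum and function names are made up.

```c
#include <assert.h>

/* In C (and in Ruby) a leading zero makes an integer literal octal. */
enum {
    G_MODE_AS_WRITTEN = 015,  /* octal 15 = decimal 13 */
    PL_END_AS_WRITTEN = 013,  /* octal 13 = decimal 11 */
    G_MODE_INTENDED   = 15    /* decimal 15, as in the spec excerpt */
};

/* Returns 1 if the value written as 015 equals MARKUP_BAR from mww.c. */
int g_mode_collides_with_markup_bar(void)
{
    const int markup_bar = 13;          /* from "#define MARKUP_BAR 13" */

    assert(PL_END_AS_WRITTEN == 11);    /* 013 is not 13 either */
    return G_MODE_AS_WRITTEN == markup_bar;
}
```

If this is what happened, writing the constants without leading zeros (7, 8, 13, 15) in the converter would avoid the ambiguity.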
Updated converter, to work with xmlconv:
I hope to get some more work done on mww now that I'm nominally done with NSF and SPC.
I didn't get zlib working as expected (gzip headers etc.) without the temporary file. In-memory compression without this file should give another speedup of about 10%, I think... Did file splitting work correctly? I never tested this ^^ dewiki is only 980 MB compressed :)
Since there are a few of us potentially working on this at the same time, should we set up a SourceForge project for version control until it is ready for Rockbox?
I'd like to get rid of the ref tags, as well, I don't care too much about the references and they make things a lot less readable.
Packed everything together for ease of management, and also mww17 so one doesn't have to pick things up all over this task...
(If anyone disapproves of my totally ripping apart xmlconv, sticking my name in it, and releasing with a new version number, let me know...)
I'd like to once again redo this, with a stage for each level of translation done (1. xml 2. wikicode 3. html rendering) which I think will present a more logical flow for things and improve some bits that are sort of hackishly scattered in at the moment. Might be a while before I get to it, though.
I'm wondering if it might be a good idea to try and use the static HTML dumps? That way we wouldn't have to worry about some of the more irritating to implement aspects of wikicode, like templates (though I've given them a lot of thought). The downside is that they are rather infrequently updated. I'm not considering this too seriously.
I think the multi-stage approach, firmly grounded in the structure of the grammars involved, should be fine. I've just got to get some structural aspects of it decided. I'm taking a compilers course now so I'm trying to apply what I'm learning about parsing to this. Wondering if it might be better to just see how mediawiki does it and rewrite in C, though...
Considering templates again, should they be expanded inline or kept as a separate thing? I'm thinking inline, but I worry it might make things a lot bigger than they need to be.
The advantages of this plugin are its support for links, the gzip compression and the binary tree search (maybe I missed something, because I haven't had a deeper look at this plugin for some time...).
What I like about my dict plugin is that it isn't limited to accessing the Wikipedia (a lot of other dictionaries as well) and uses a simple fuzzy search.
I think it's also a better idea to create a plugin-independent document viewer for Rockbox.
So what are your opinions?
Here are a few notes for other users from my experience with this and a fairly recent (week old) Subversion checkout:
- The English Wikipedia dump would uncompress to something like 11 GB. If you're processing the dump on a filesystem that can't handle files that big (say, the FAT32 filesystem on your DAP), you'll want to use pipes to avoid creating an uncompressed file. On Linux, something like the following will let you make the article files.
bunzip2 -c enwiki-20070716-pages-articles.xml.bz2 | ./xmlconv /dev/stdin wikipedia
- How it works: this creates a "viewer" for .wwi files. To view/search a processed wikimedia article dump, use the Files browse menu, find the ".wwi" file (both the wwi and wwa files must be on your DAP, as noted above), and open it. If your Rockbox DAP is configured to show only supported files, you'll see the name of only the wwi file.
- Fonts and non-ASCII characters. Not all Rockbox fonts have good support for non-US-ASCII characters. If you find that letters with diacritical marks are not displayed properly, try a different font.
Timo -- I tried your dict plugin (was really looking forward to it), but kept running into errors (iPod 5.5g, 64mb). This wikimedia viewer seems more stable.
Personally, I'd also be interested in a PC app to convert dict files to wikimedia XML format so they could be viewed with this plugin.
The patch doesn't compile for Sansa e200 though.
And at some point I should look at that dict plugin...
It looks quite easy to adapt to custom output.
Thanks, Simon
And I'd like to write a short HOWTO on the patching (no details) and the downloading/converting of the dump. Should I just place it in a comment or could we put it in the top (Details section)? It's confusing to have the old converter method and dead links in the earlier posts.
MAKE in mww
CC mww.c
mww.c: In function ‘set_article_offset’:
mww.c:189: warning: implicit declaration of function ‘printf’
mww.c: In function ‘viewer_init’:
mww.c:936: error: ‘BUTTON_MENU’ undeclared (first use in this function)
mww.c:936: error: (Each undeclared identifier is reported only once
mww.c:936: error: for each function it appears in.)
make[3]: *** [/home/simon/Desktop/iaudio/bleeding/rockbox/build/apps/plugins/mww/mww.o] Error 1
make[2]: *** [mww] Error 2
make[1]: *** [rocks] Error 2
make: *** [build] Fehler 2
From the little I know about programming, I guess it's because of the target? I compile for the iAudio X5 (sim), which does not have a menu button.
Am I right? Is there a way out of this?
Regarding making things easier, we're really in a horribly hacky incomplete state now, especially with the converter (though mww is a mess codewise, too). I'd rather it not be too easy to do things the wrong way, especially to avoid transitioning issues when a better way of keeping things updated comes along. This is, however, merely my position, and I'm not doing anything active for this project at the moment anyway (sorry for the grand plans a few messages back), so by all means make things happen if you wish.
The article scrolling is a bit strange (using the up/down button instead of the wheel) but it works.
Adam, you could send me an email with the ftp login now, I will then upload the converted dump (german atm, maybe someone else can do others?), and, later a little howto. simon(?)jso.be
Exception: STATUS_STACK_OVERFLOW at eip=00403333
eax=005F73BC ebx=00000000 ecx=00032CDC edx=0000004C esi=611001A0 edi=004037A0
ebp=0022CCE8 esp=0022CCD4 program=C:\cygwin\home\...\xmlconv.exe, pid 5788, thread main
cs=001B ds=0023 es=0023 fs=003B gs=0000 ss=0023
Stack trace:
Frame Function Args
0022CCE8 00403333 (00000003, 61169690, 01630090, 00000000)
0022CD98 61006198 (00000000, 0022CDD0, 61005510, 0022CDD0)
61005510 61004416 (0000009C, A02404C7, E8611001, FFFFFF48)
8 [main] xmlconv 5788 _cygtls::handle_exceptions: Error while dumping state (probably corrupted stack)
I'm using cygwin on XP, if that helps. Any idea why this is happening?
Exactly which mediawiki dump are you using?
In other news, Simon's conversion of dewiki (20070903) is now on my web site.
http://hcs64.com/rockbox/wikipedia/
The link to the converted wiki is interesting, but I don't understand a word of German ;) I'll need to find some way to get the enwiki in the right format...
Also I'm curious about the coding in the converter: is UTF-8 used?
As for the Cygwin problem, I'm guessing that Cygwin restricts memory usage etc., since it is running in a hosted environment, so maybe that's where the stack overflow comes from.
It is so wonderful to have wikipedia on my ipod :D
Please feel free to send me any improvements. I will love to include them.
I suggest making the document available at the top of this page, as the thread has become very long and complex by now.
Thanks to Matze88m for the work on the converter and freqmod and AdamGashlin for this fantastic plugin, I've spent at least 20 hours in the past month, learning my head off....
The pcre for mac is available at http://pcre.darwinports.com/
I use a Sansa, and in addition to "replace BUTTON_MENU with BUTTON_POWER (Thanks framo!)", you should replace BUTTON_SCROLL_BACK with BUTTON_SCROLL_UP, and BUTTON_SCROLL_FWD with BUTTON_SCROLL_DOWN
I also added "rb->backlight_set_timeout(1);" after "rb = api;" as well as
"rb->backlight_set_timeout(rb->global_settings->backlight_timeout);" before "return PLUGIN_OK;" so that the backlight stays on while I read, and goes back to the user settings when the plugin closes. I think if this could be set up as an option it would be very useful.
Would it be possible to have a function to copy text to a text file? (ie export a selection)
Also, could someone add a bookmarking function? I don't even know where to start for that.
Thanks again for such a great program!
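The button substitutions mentioned above for targets without a menu button or scroll wheel could be collected in one place instead of being edited by hand. The sketch below shows the idea with plugin-local names; the MWW_* names, the mww_action enum and the numeric button codes are all invented so that the example compiles on its own. In a real build the BUTTON_* values come from the target's button header.

```c
#include <assert.h>

/* Stand-in button codes (hypothetical values) for a Sansa-like target. */
#define BUTTON_POWER       0x01
#define BUTTON_SCROLL_UP   0x02
#define BUTTON_SCROLL_DOWN 0x04

/* Use the iPod-style button when the target defines it,
   otherwise fall back to the substitutes suggested in this thread. */
#ifdef BUTTON_MENU
#define MWW_MENU BUTTON_MENU
#else
#define MWW_MENU BUTTON_POWER
#endif

#ifdef BUTTON_SCROLL_BACK
#define MWW_SCROLL_BACK BUTTON_SCROLL_BACK
#else
#define MWW_SCROLL_BACK BUTTON_SCROLL_UP
#endif

#ifdef BUTTON_SCROLL_FWD
#define MWW_SCROLL_FWD BUTTON_SCROLL_FWD
#else
#define MWW_SCROLL_FWD BUTTON_SCROLL_DOWN
#endif

static_assert(MWW_MENU != MWW_SCROLL_BACK && MWW_MENU != MWW_SCROLL_FWD &&
              MWW_SCROLL_BACK != MWW_SCROLL_FWD,
              "plugin buttons must map to distinct codes");

enum mww_action { MWW_ACT_NONE, MWW_ACT_MENU, MWW_ACT_BACK, MWW_ACT_FWD };

/* Translate a raw button code into a viewer action. */
enum mww_action mww_action_for(int button)
{
    switch (button) {
    case MWW_MENU:        return MWW_ACT_MENU;
    case MWW_SCROLL_BACK: return MWW_ACT_BACK;
    case MWW_SCROLL_FWD:  return MWW_ACT_FWD;
    default:              return MWW_ACT_NONE;
    }
}
```

The rest of the viewer would then refer only to the MWW_* names, so a new target needs one more fallback block rather than search-and-replace through mww.c.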
The backlight thing would be really cool!
The problem I find with the bookmarking as it is now is that you can't pick and choose what to keep. You have to keep it all.
Should line 224 be:
advance_scrollback(1,(dorender && i==a-1));
it is currently
advance_scrollback(1,(dorender && i==-a-1));
Yes, that would appear to be a bug on line 224. Doesn't show itself because I never use multiline scroll except when scrolling backwards past the scrollback buffer. I guess I just never noticed when that failed to render as it only happens on the one line when the buffer must be regenerated, and I only ever tested that when scrolling back quickly so a single frame went unnoticed. I just tested this now and it is in fact a problem in the current build.
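To see why the stray minus sign matters, here is a self-contained sketch of the loop shape. The real loop around line 224 of mww.c is not shown in this thread, so the for loop, the stand-in advance_one_line and the counter are all assumptions; only the two conditions are taken from the comments above.

```c
#include <assert.h>

static int render_count;  /* how many times the stand-in renderer ran */

/* Hypothetical stand-in for advance_scrollback(1, render). */
static void advance_one_line(int render)
{
    if (render)
        render_count++;
}

/* Scroll a lines, rendering only on the final step (i == a-1).
   Returns the number of renders that happened. */
int scroll_lines(int a, int dorender)
{
    int i;

    assert(a >= 0);
    render_count = 0;
    for (i = 0; i < a; i++)
        advance_one_line(dorender && i == a - 1);
    return render_count;
}

/* Same loop with the condition as reported on line 224 (i == -a-1).
   For a > 0 the right-hand side is negative, so it never matches. */
int scroll_lines_buggy(int a, int dorender)
{
    int i;

    assert(a >= 0);
    render_count = 0;
    for (i = 0; i < a; i++)
        advance_one_line(dorender && i == -a - 1);
    return render_count;
}
```

With the corrected condition a multi-line scroll renders exactly once, on its last line; with the original one it never renders, which matches the missing frame described above.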
When you've enabled your faster scrolling and the backlight fix (as that had bothered me as well) could you post an updated patch? If not I could, I guess.
I haven't made the patch because I have a lot of other changes, and it is too confusing. Can you make the patch with this?
How about using the right key to advance the text (ie in the 'normal mode') and at the same time append the 'current line' to a file named [title of the article].txt