[Yanel-dev] crawler
Michael Wechner
michael.wechner at wyona.com
Fri Mar 2 14:34:20 CET 2007
Josias Thöny wrote:
> Michael Wechner wrote:
>
>> Josias Thöny wrote:
>>
>>> Hi,
>>>
>>> I've had a look at the crawler of lenya 1.2, and it seems that a few
>>> features are missing:
>>>
>>> basic missing features:
>>> - download of images
>>> - download of css
>>> - download of scripts
>>> - link rewriting
>>> - limits for max level / max documents
>>>
>>> advanced missing features:
>>> - handling of frames / iframes
>>> - tidy html -> xhtml
>>> - extraction of body content
>>> - resolving of links in css (background images etc.)
>>>
>>> Or am I misunderstanding something...?
>>
>> no ;-)
>>
>>>
>>> IMHO some of these features are quite essential, because we want to
>>> use the crawler in yanel to import the complete pages with images
>>> and everything, not only text content.
>>>
>>> The question is now, does it make sense to implement the missing
>>> features into that crawler, or should we look for an alternative?
>>
>> sure, if there is an alternative :-) Is there?
>
>
> The lenya crawler uses websphinx for the robot exclusion, which is
> actually a complete crawler framework, and I think we could use it
> instead of the lenya crawler. It supports the basic features that I
> mentioned above.
> I wrote a class DumpingCrawler which is based on the websphinx
> crawler. Basically it should be able to create a complete dump of a
> website including images, css, etc. It also rewrites links in the html
> code.
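Just to make the link-rewriting idea concrete: I'd expect it to boil down to something like the sketch below. The mapping (scope prefix stripped, directory URLs going to index.html) is my own assumption for illustration, not necessarily what DumpingCrawler actually does:

```java
// Sketch only -- my assumption of how in-scope links could be rewritten
// to relative paths inside the dump directory; DumpingCrawler may differ.
public class LinkRewriterSketch {

    // Rewrite an absolute URL to a local relative path if it lies
    // within the crawl scope; leave out-of-scope links untouched.
    public static String rewrite(String url, String scopeURL) {
        if (!url.startsWith(scopeURL)) {
            return url; // out of scope: keep the original link
        }
        String path = url.substring(scopeURL.length());
        if (path.length() == 0 || path.endsWith("/")) {
            path = path + "index.html"; // directory URLs map to an index file
        }
        if (path.startsWith("/")) {
            path = path.substring(1); // make the path relative to the dump dir
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("http://wyona.org/about.html", "http://wyona.org"));
        System.out.println(rewrite("http://example.com/x", "http://wyona.org"));
    }
}
```

So a page like http://wyona.org/about.html would end up referenced as about.html relative to the dump dir.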
>
> The source code is at:
> https://svn.wyona.com/repos/public/crawler
>
> I also added the websphinx source code to our svn because I had to
> patch a few things.
I think it's important that we also add the patches separately in order
to know what has been patched.
> The license is apache-like, so it should be ok.
>
> The usage is shown in the following example:
>
> --------------------------------------------------
> String crawlStartURL = "http://wyona.org";
> String crawlScopeURL = "http://wyona.org";
> String dumpDir = "/tmp/dump";
>
> DumpingCrawler crawler =
>     new DumpingCrawler(crawlStartURL, crawlScopeURL, dumpDir);
>
> EventLog eventLog = new EventLog(System.out);
> crawler.addCrawlListener(eventLog);
> crawler.addLinkListener(eventLog);
>
> crawler.run();
> crawler.close();
> --------------------------------------------------
>
> Remarks:
> - the EventLog is optional (it creates some log output)
What is the EventLog good for?
> - the crawlScopeURL limits the scope of the retrieved pages, i.e. only
> URLs starting with the scope URL are downloaded.
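So if I understand correctly, the scope check is basically just a prefix test, roughly like this sketch (my assumption; the actual check might be more elaborate):

```java
// Sketch only -- my assumption of the prefix-based scope test described
// above, not necessarily how DumpingCrawler implements it.
public class ScopeCheckSketch {

    // A URL is in scope iff it starts with the configured scope URL.
    public static boolean inScope(String url, String crawlScopeURL) {
        return url.startsWith(crawlScopeURL);
    }

    public static void main(String[] args) {
        System.out.println(inScope("http://wyona.org/download/foo.css", "http://wyona.org")); // true
        System.out.println(inScope("http://apache.org/index.html", "http://wyona.org"));      // false
    }
}
```

Something like websphinx's shouldVisit() hook would presumably be the place to plug such a check in.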
>
> For more information, see
> http://www.cs.cmu.edu/~rcm/websphinx/doc/websphinx/Crawler.html
Sounds very good :-) Have you already uploaded the library to our Maven
repo?
Thanks
Michi
>
> Josias
>
>
>>
>> Thanks
>>
>> Michi
>>
>>>
>>> Josias
>>>
>>> _______________________________________________
>>> Yanel-development mailing list
>>> Yanel-development at wyona.com
>>> http://wyona.com/cgi-bin/mailman/listinfo/yanel-development
>>>
>>
>>
>
>
>
--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
michael.wechner at wyona.com michi at apache.org
+41 44 272 91 61