At the beginning of this term at university I signed up for a project about the Semantic Web and Service Sciences. A really interesting topic, indeed. But it is not the main topic of this article, even though the course led me to think about how to crawl the web with a Java application (because the project leader, in other words the professor, already had some code written in Java).
So I started looking around with my basic Java skills. Fortunately, from my experience with PHP and libcurl, I already knew a little about crawling and that regular expressions are key in this area.
Looking around the web, I (or should I say Google) found a good starting point at Osborne (a unit of McGraw-Hill), the publisher of the book "The Art of Java" by Herbert Schildt and James Holmes. So I started reading the article Crawling the Web with Java.

As you may have noticed, my article has the same name as the original article and also carries a number, indicating that this is the first part of a series.
Now I come back to what I mentioned at the beginning, my work on the Semantic Web. Ontologies are one of its central resources. But there are not that many of them out there, so you can start building your own ontology for a specific domain and fill it with content from the internet. This is where our crawler comes in. But before the application can support our aim, we need to modify it a little. That will be covered in the next part(s) of this series.
In the next step, we want to extend the crawler with an additional search method, so that it can search for regular expressions within a given web page.
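To give a rough idea of what such a search method could look like, here is a minimal sketch. It is not code from the original crawler or from the book: the class name RegexPageSearch and the method findMatches are hypothetical, and I simply assume the page content has already been downloaded into a String. It only uses java.util.regex from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original crawler.
final class RegexPageSearch {

    // Returns every substring of pageContent that matches regex,
    // in order of appearance. Matching ignores case (an assumption
    // made here, since HTML content is often mixed case).
    static List<String> findMatches(String pageContent, String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(pageContent);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }
}
```

Calling RegexPageSearch.findMatches(content, "[a-z]+@[a-z]+\\.[a-z]+") would, for example, collect simple e-mail-like strings from a page. Compiling the pattern once per call keeps the sketch short; a real crawler would probably compile it once and reuse it for every page.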
