solr-wikipedia

💡 This repository was created as part of a hackathon. It is not actively developed or used at the moment.

A collection of utilities for parsing WikiMedia XML dumps with the intent of indexing the content in Solr.

Quick-Start

1. Download a Wikipedia dump file (http://en.wikipedia.org/wiki/Wikipedia:Database_download)
2. Download Solr 4.9 and extract it (http://lucene.apache.org/solr/)
3. Configure environment variables

   Set SOLR_HOME to the "example" directory inside the location Solr was extracted to in Step 2, for example: export SOLR_HOME=/var/local/solr/example

   Set JAVA_HOME to the location of your JDK.
4. Clone and build the code

   git clone https://github.com/bbende/solr-wikipedia.git

   cd solr-wikipedia

   mvn clean package -Pshade
5. Configure and start Solr

   ./deploy-wikipedia-collection.sh (copies src/main/resources/solr/wikipediaCollection to $SOLR_HOME/solr/)

   src/main/resources/solr.sh start

   Check http://localhost:8983/solr in your browser
6. Ingest data (from the solr-wikipedia directory)

   java -jar target/solr-wikipedia-1.0-SNAPSHOT.jar http://localhost:8983/solr/wikipediaCollection /var/local/test-wiki-data.xml.bz2

Overview

There are three main concepts:

Handlers - Receive events related to the WikiMedia XML and produce objects based on those events. The DefaultPageHandler produces Page objects, but clients can implement a custom handler to produce another type of object.
Parser - A SAX parser for the WikiMedia XML. Clients pass in a Reader for the XML and a handler that acts on the parse events.
Iterator - An Iterator that uses StAX processing to produce objects based on the given handler.
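
The handler idea can be sketched with nothing but the JDK's own SAX API. This is an illustration of the pattern, not this project's implementation: TitleCollectorSketch, TitleHandler and parseTitles are hypothetical names, and the sketch only collects page titles rather than building Page objects.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: none of these names come from solr-wikipedia.
class TitleCollectorSketch {

    /** Collects the text of every <title> element reported by the SAX parser. */
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current; // non-null only while inside <title>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so append.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(current.toString());
                current = null;
            }
        }
    }

    /** Parses the XML from the reader and returns the collected titles. */
    static List<String> parseTitles(Reader xml) throws Exception {
        TitleHandler handler = new TitleHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(xml), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>Apache Solr</title></page>"
                + "<page><title>Lucene</title></page></mediawiki>";
        System.out.println(parseTitles(new StringReader(xml)));
    }
}
```

The parser pushes events into the handler, so the handler has to keep its own state (here, the buffer for the current title) between callbacks.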

An example of parsing a bzip2-compressed dump file:


    String testWikiXmlFile = "src/test/resources/test-wiki-data.xml.bz2";

    WikiMediaXMLParser wikiMediaXMLParser = new SAXWikiMediaParser<>();
    PageHandler handler = new DefaultPageHandler();

    // Decompress the bzip2 stream and decode it explicitly as UTF-8
    // (java.nio.charset.StandardCharsets) instead of the platform default.
    try (FileInputStream fileIn = new FileInputStream(testWikiXmlFile);
         BZip2CompressorInputStream bzipIn = new BZip2CompressorInputStream(fileIn);
         InputStreamReader reader = new InputStreamReader(bzipIn, StandardCharsets.UTF_8)) {

        wikiMediaXMLParser.parse(reader, handler);
        ...
    }

An example of iterating over a bzip2-compressed dump file:


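    String testWikiXmlFile = "src/test/resources/test-wiki-data.xml.bz2";

    // Decompress the bzip2 stream and decode it explicitly as UTF-8
    // (java.nio.charset.StandardCharsets) instead of the platform default.
    try (FileInputStream fileIn = new FileInputStream(testWikiXmlFile);
         BZip2CompressorInputStream bzipIn = new BZip2CompressorInputStream(fileIn);
         InputStreamReader reader = new InputStreamReader(bzipIn, StandardCharsets.UTF_8)) {

        PageHandler handler = new DefaultPageHandler();

        // Typed as Iterator<Page> so that next() returns a Page without a cast
        Iterator<Page> iterator = new WikiMediaIterator<>(
                reader, handler);

        while (iterator.hasNext()) {
            Page page = iterator.next();
        }
    }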
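
To show how pull-based StAX processing fits the Iterator contract, here is a minimal sketch built only on the JDK's javax.xml.stream API. It is not the WikiMediaIterator from this repository: StaxTitleIteratorSketch is a hypothetical name, and it yields page titles as strings instead of handler-produced objects.

```java
import java.io.Reader;
import java.util.Iterator;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only: not the WikiMediaIterator from this repository.
class StaxTitleIteratorSketch implements Iterator<String> {
    private final XMLStreamReader xml;
    private String next; // look-ahead buffer; null once the stream is exhausted

    StaxTitleIteratorSketch(Reader reader) throws XMLStreamException {
        this.xml = XMLInputFactory.newInstance().createXMLStreamReader(reader);
        advance();
    }

    /** Pulls events until the next <title> element or the end of the document. */
    private void advance() throws XMLStreamException {
        next = null;
        while (xml.hasNext()) {
            if (xml.next() == XMLStreamConstants.START_ELEMENT
                    && "title".equals(xml.getLocalName())) {
                next = xml.getElementText();
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        try {
            advance();
        } catch (XMLStreamException e) {
            // Iterator.next() cannot throw checked exceptions, so wrap it.
            throw new IllegalStateException("Malformed XML", e);
        }
        return result;
    }
}
```

Because the caller pulls one item at a time, only the current element is held in memory, which matters for multi-gigabyte dumps.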
