Indexing the documents stored in a database using Apache Solr and Apache Tika


  • Set up a MySQL database [1] containing documents (PDF/DOC/HTML etc.).
  • Set up Apache Solr and Apache Tika.
  • Import the documents just by hitting an import URL.

(NOTE: Also check the update note at the end of this post. )

These steps were done on my machine running Fedora 17. The commands can easily be adapted for other distributions.

Setup MySQL database with documents

Install MySQL Server:

# yum install mysql-server
# service mysqld start

Also install the Java library for connecting to MySQL (Solr will need it later):

# yum install -y mysql-connector-java

Set up a MySQL database [1] for storing binary files:

CREATE DATABASE binary_files;

CREATE TABLE tbl_files (
 id_files tinyint(3) unsigned NOT NULL auto_increment,
 bin_data longblob NOT NULL,
 description tinytext NOT NULL,
 filename varchar(255) NOT NULL,
 filesize integer NOT NULL,
 filetype varchar(255) NOT NULL,
 PRIMARY KEY (id_files)
);

GRANT ALL PRIVILEGES
 ON binary_files.*
 TO binary_user@localhost
 IDENTIFIED BY 'binary_password';

Now let's create a Ruby script to populate the database with documents. We need the Ruby-MySQL driver [2][3] and a MIME detection library [4]:

# yum install ruby-mysql
# yum install rubygem-mime-types

Here is the script:
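(My original insert-mysql.rb, which used the ruby-mysql and mime-types gems installed above, is not reproduced here. As a rough stand-in, here is a sketch using only the Ruby standard library: it prints an INSERT statement that can be piped into the mysql client. The extension-to-MIME table is a simplification of what the mime-types gem provides, and the description/filename values are not SQL-escaped, which is fine for the shell loop below.)

```ruby
#!/usr/bin/env ruby
# insert-mysql.rb (sketch): print an INSERT for tbl_files, e.g.
#   ruby insert-mysql.rb sample-docs/report.pdf report | \
#     mysql -u binary_user -pbinary_password binary_files

# Minimal extension-based MIME lookup (stand-in for the mime-types gem).
MIME_BY_EXT = {
  '.pdf'  => 'application/pdf',
  '.doc'  => 'application/msword',
  '.html' => 'text/html',
}.freeze

def mime_for(filename)
  MIME_BY_EXT.fetch(File.extname(filename).downcase, 'application/octet-stream')
end

def insert_sql(path, description)
  data = File.binread(path)
  # A MySQL hex literal (0x...) safely embeds binary data in the statement.
  hex  = '0x' + data.unpack1('H*')
  name = File.basename(path)
  "INSERT INTO tbl_files (bin_data, description, filename, filesize, filetype) " \
  "VALUES (#{hex}, '#{description}', '#{name}', #{data.bytesize}, '#{mime_for(name)}');"
end

puts insert_sql(ARGV[0], ARGV[1]) if ARGV.size == 2
```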

Let's add some documents to the database:

$ mkdir sample-docs
$ cd sample-docs
$ wget
$ wget
$ wget
$ cd ..
$ for f in sample-docs/* ; do DESC=`basename "$f" | tr ' ' '-' `; ruby insert-mysql.rb "$f" "$DESC"; done

If you get the error "Lost connection to MySQL server during query" when running MySQL insert/update queries, you might want to raise your MySQL server limits [6]. Update the limit in /etc/my.cnf:
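(The exact setting is not shown above; the usual culprit for this error when inserting large BLOBs is max_allowed_packet. The value below is an example, so size it to your largest document.)

```ini
[mysqld]
# Large BLOB inserts drop the connection when a packet exceeds this
# limit; raise it to cover your biggest document.
max_allowed_packet=32M
```

Restart the server afterwards with `service mysqld restart` so the new limit takes effect.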


Setup Apache Solr with Apache Tika integration

$ wget -c
$ tar zxf apache-solr-3.6.2.tgz
$ cd apache-solr-3.6.2
$ cd example/

Here you will see an example of data import from an HSQL database, but we want to work with MySQL, so we create a new configuration. (It will be easy to follow if you have read the README files in the Apache Solr package.)

$ cp -r example-DIH/ dih-mysql/
$ cd dih-mysql/
$ rm -rf hsqldb/

Remove everything except db/

$ cd solr/
$ rm -rf solr mail rss tika
$ ln -s /usr/share/java/mysql-connector-java.jar db/lib/

Now the directory structure should look something like this:

$ find  dih-mysql/ -type d

Let's now update the Solr configuration. It is necessary to make sure that the Tika content-parser libraries are referenced in the configuration file, as shown below.

Configuration file: dih-mysql/solr/db/conf/solrconfig.xml

We just add the libraries needed to parse the content (to avoid ClassNotFoundException errors):

  <lib dir="../../../../contrib/extraction/lib/" regex="tika-core-\d.*\.jar" />
  <lib dir="../../../../contrib/extraction/lib/" regex="tika-parsers-\d.*\.jar" />
  <lib dir="../../../../contrib/extraction/lib/" regex=".*\.jar" />

Configuration file: dih-mysql/solr/db/conf/schema.xml

Add the relevant fields which will be indexed along with the binary content ( PDF/DOC/HTML etc. )
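(The exact fields are not shown above; the sketch below is my own illustration of what the added entries might look like, mirroring the tbl_files columns. The field names and types are assumptions.)

```xml
<!-- Fields mirroring the tbl_files columns; "text" will hold the
     content that Tika extracts from bin_data. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="description" type="string" indexed="true" stored="true"/>
<field name="filename" type="string" indexed="true" stored="true"/>
<field name="filetype" type="string" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false"/>
```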

Configuration file: dih-mysql/solr/db/conf/db-data-config.xml

We configured the database/table/columns from which to fetch the content to be indexed.
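(The file itself is not reproduced above; as a rough sketch, a configuration for the binary_files schema could look like the following. The driver, URL, and processor details are assumptions based on the stock DataImportHandler examples: a JDBC data source fetches the rows, and a nested TikaEntityProcessor entity reads the BLOB column through a FieldStreamDataSource.)

```xml
<dataConfig>
  <dataSource name="db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/binary_files"
              user="binary_user" password="binary_password"/>
  <dataSource name="fieldReader" type="FieldStreamDataSource"/>
  <document>
    <entity name="files" dataSource="db"
            query="SELECT id_files, bin_data, description, filename, filetype FROM tbl_files">
      <field column="id_files" name="id"/>
      <field column="description" name="description"/>
      <field column="filename" name="filename"/>
      <field column="filetype" name="filetype"/>
      <!-- Stream the BLOB to Tika and index the extracted text. -->
      <entity name="doc" dataSource="fieldReader"
              processor="TikaEntityProcessor"
              dataField="files.bin_data" format="text">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```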

I recommend going through the official documentation [7].

Now we are all set with the configuration. It's time to index the documents.

Index the documents

Start the Solr server. Notice how we are specifying the configuration path:

$ cd apache-solr-3.6.2/example
$ java -jar -Dsolr.solr.home="./dih-mysql/solr/" start.jar

And invoke the indexer by hitting this url: http://localhost:8983/solr/db/dataimport?command=full-import

The steps described above worked just fine for me. I hope this helps resolve the issues faced by others [8][9].


UPDATE: I had to add an extra field and entity in schema.xml and db-data-config.xml respectively to make the indexing work. Perhaps there is some other problem with my configuration; I don't understand why this works (and why the officially documented configuration [7] doesn't), but this is the workaround I figured out.

Update in schema.xml

   <field name="text2" type="text" indexed="false" stored="false"/>

Update in db-data-config.xml

	<entity dataSource="fieldReader"
		dataField="root.bin_data" format="text">
          <field column="text2" name="body" />
	</entity>