Solr  Data Import Handler (DIH) provides a mechanism for importing content from a data store and indexing it.We can also configure multiple datastore and indexing it.

In addition to relational databases, DIH can index content from HTTP based data sources such as RSS and ATOM feeds, e-mail repositories, and structured XML where an XPath processor is used to generate fields.

The main advantage of this method of data importing is no need for additional software development and the rapid integration of the data source. This second advantage, however, requires skill and practice. In this entry We will discuss the basics of DIH integration with SQL data source.

Solrconfig.xml configuration

We need to add dataimport libraries path in solrconfig as below.

<lib dir="${solr.install.dir:../../../..}/contrib/dataimporthandler-extras/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-\d.*\.jar" />

We need to configure additional request handler as below in solrconfig.xml

 <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
   <str name="config">/user/configs/db-data-config.xml</str>

config parameter determines configuration file which provides the definition of data sources.

Configure Connector

Need to add required db connector library under lib folder of solr core directory.

Configuring data source

Add the tag ‘dataSource’ directly under the ‘dataConfig’ tag.

SQL server example

<dataSource name="jdbc" type="JdbcDataSource" driver="" url="jdbc:sqlserver://localhost\instancename;databaseName=solrworkDb" user="db_username" password="db_password"/>

SQL server jdbc driver to Download sql server connector.

Oracle example

<dataSource name="jdbc" type="JdbcDataSource" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//hostname:port/SID" user="db_username" password="db_password"/>

Oracle jdbc driver to download oracle connector.

MySql example

<dataSource type="JdbcDataSource" name="jdbc" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://db1-host/dbname" user="db_username" password="db_password"/>

MySql jdbc driver to download mysql connector.

Datasource Attributes

The attribute ‘type’ specifies the implementation class. It is optional. The default value is ‘JdbcDataSource’.
The attribute ‘name’ can be used if there are multiple datasources used by multiple entities.
All other attributes in the <dataSource> tag are specific to the particular dataSource implementation being configured.

Single dataSource db-data-config.xml

<dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/products" user="root" password="[email protected]" />
    <document name="products">
        <entity name="item" query="select * from item">
            <field column="ID" name="id" />
            <field column="NAME" name="name" />
            <field column="MANU" name="manu" />
            <field column="WEIGHT" name="weight" />
            <field column="PRICE" name="price" />
            <field column="POPULARITY" name="popularity" />
            <field column="INSTOCK" name="inStock" />
            <field column="INCLUDES" name="includes" />

            <entity name="feature" query="select description from feature where item_id='${item.ID}'">
                <field name="features" column="description" />

Multiple DataSources

It is possible to have more than one datasources for a configuration. To configure an extra datasource , just keep an another ‘dataSource’ tag . There is an implicit attribute “name” for a datasource. If there are more than one, each extra datasource must be identified by a unique name 'name="datasource-2"'.

Multiple DataSources example

<dataSource type="JdbcDataSource" name="jdbc-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://db1-host/dbname" user="db_username" password="db_password"/>
<dataSource type="JdbcDataSource" name="jdbc-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://db2-host/dbname" user="db_username" password="db_password"/>

Multiple DataSources db-data-config.xml

<dataSource name="jdbc-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/products" user="root" password="[email protected]" />
<dataSource name="jdbc-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/produ_features" user="root" password="[email protected]" />
    <document name="products">
        <entity name="item" dataSource="jdbc-1" query="select * from item">
            <field column="ID" name="id" />
            <field column="NAME" name="name" />
            <field column="MANU" name="manu" />
            <field column="WEIGHT" name="weight" />
            <field column="PRICE" name="price" />
            <field column="POPULARITY" name="popularity" />
            <field column="INSTOCK" name="inStock" />
            <field column="INCLUDES" name="includes" />

            <entity name="feature" dataSource="jdbc-2" query="select description from feature where item_id='${item.ID}'">
                <field name="features" column="description" />

After changes all these configuration create new one with these configuraiton or reload solr core if exists to reload all the updating configurations.

After reloading or creating solr core you are able to see dataimport request handle as below.



  • full-import

    : Full Import operation can be started by hitting the URL http://<host>:<port>/solr/dataimport?command=full-import

    • This operation will be started in a new thread and the status attribute in the response should be shown busy now.

    • The operation may take some time depending on size of dataset.
    • When full-import command is executed, it stores the start time of the operation in a file located at conf/ (this file is configurable)

    • This stored timestamp is used when a delta-import operation is executed.
    • Queries to Solr are not blocked during full-imports.
    • It takes in extra parameters:
      • entity : Name of an entity directly under the <document> tag. Use this to execute one or more entities selectively. Multiple ‘entity’ parameters can be passed on to run multiple entities at once. If nothing is passed, all entities are executed.

      • clean : (default ‘true’). Tells whether to clean up the index before the indexing is started.

      • commit : (default ‘true’). Tells whether to commit after the operation.

      • optimize : (default ‘true’ up to Solr 3.6, ‘false’ afterwards). Tells whether to optimize after the operation. Please note: this can be a very expensive operation and usually does not make sense for delta-imports.

      • debug : (default ‘false’). Runs in debug mode. It is used by the interactive development mode (see here).

        • Please note that in debug mode, documents are never committed automatically. If you want to run debug mode and commit the results too, add ‘commit=true’ as a request parameter.
  • delta-import

    : For incremental imports and change detection run the command http://<host>:<port>/solr/dataimport?command=delta-import . It supports the same clean, commit, optimize and debug parameters as full-import command.

  • status

    : To know the status of the current command, hit the URL http://<host>:<port>/solr/dataimport . It gives an elaborate statistics on no. of docs created, deleted, queries run, rows fetched, status etc.

  • reload-config

    : If the data-config is changed and you wish to reload the file without restarting Solr. Run the command http://<host>:<port>/solr/dataimport?command=reload-config .

  • abort

    : Abort an ongoing operation by hitting the URL http://<host>:<port>/solr/dataimport?command=abort .

Refer DataImportHandler , DataImport for more details.



Was this post helpful?
Let us know, if you liked the post. Only in this way, we can improve us.

Leave a Reply

Your email address will not be published. Required fields are marked *