Wednesday, November 9, 2011

Solr Delta Import - Delete Index data

Deleting index data is an essential requirement if you want incremental delta imports. The Solr wiki does not describe deletion in much detail, so I thought I would document the issues I faced and how to delete indexed data with a delta import.

schema.xml:
<fields>
<field name="userid" type="string" indexed="true" stored="true"
required="true"></field>
<field name="emailid" type="text" indexed="true" stored="true"></field>
<field name="name" type="text" indexed="true" stored="true"></field>
<field name="address" type="text" indexed="true" stored="true">
</field>
</fields>
<uniqueKey>emailid</uniqueKey>

data-config.xml:
<dataConfig>
<document>
<!-- deletedPkQuery must select the emailid column -->
<entity name="users" pk="emailid" query="SELECT * from users"
deletedPkQuery="SELECT emailid FROM users WHERE is_deleted = true and modification_date >
'${dataimporter.last_index_time}'">
<field column="userid" name="userid"></field>
<field column="emailid" name="emailid"></field>
<field column="name" name="name"></field>
<field column="address" name="address"></field>
</entity>
</document>
</dataConfig>

Index data is deleted by the uniqueKey defined in schema.xml, not by the pk defined at entity level in data-config.xml.
If the pk of the top-level entity in data-config.xml is not the same as the uniqueKey in schema.xml, the delete will not work, even though the log file reports a number of deleted documents. This is because the value fetched by deletedPkQuery is matched against the uniqueKey field.
Here is the method from SolrWriter that is actually called when documents are deleted using deletedPkQuery:

public void deleteDoc(Object id) { // here the id value must come from the uniqueKey field
    try {
        log.info("Deleting document: " + id);
        DeleteUpdateCommand delCmd = new DeleteUpdateCommand();
        delCmd.id = id.toString();
        delCmd.fromPending = true;
        delCmd.fromCommitted = true;
        processor.processDelete(delCmd);
    } catch (IOException e) {
        log.error("Exception while deleteing: " + id, e);
    }
}
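
To make the rule concrete, here is a small self-contained sketch of why a delete "succeeds" in the log but removes nothing when the value passed in is not a uniqueKey value. It is not Solr code: the class DeleteByUniqueKeyDemo, its in-memory map, and the sample users are all made up for illustration, and the only assumption carried over from the article is that emailid is the uniqueKey.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model, not part of Solr: documents are stored under
// their uniqueKey value (emailid), as in the schema.xml above.
class DeleteByUniqueKeyDemo {
    static final Map<String, Map<String, String>> index = new LinkedHashMap<>();

    // Add a document; the uniqueKey (emailid) is the lookup key.
    static void add(String userid, String emailid) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("userid", userid);
        doc.put("emailid", emailid);
        index.put(emailid, doc);
    }

    // Delete by id: the value is compared with the uniqueKey only,
    // so any other column value (for example userid) matches nothing.
    static boolean deleteById(String id) {
        return index.remove(id) != null;
    }

    public static void main(String[] args) {
        add("42", "alice@example.com");
        add("43", "bob@example.com");

        // A deletedPkQuery that returned userid would hand over "42".
        System.out.println(deleteById("42"));                // false
        // Returning the uniqueKey column removes the document.
        System.out.println(deleteById("alice@example.com")); // true
        System.out.println(index.size());                    // 1
    }
}
```

In this sketch the first call still runs without error, which mirrors the situation where the log reports deletions but the documents remain in the index.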

3 comments:

Anonymous said...

Thanks, nice tip. Very useful to me

Goldwynn said...

Do you know why we need to specify query="SELECT * from users" ?

<entity name="users" pk="emailid" query="SELECT * from users"

Thanks.

Goldwynn said...

Ignore my comments above, I was thinking something else. Yeah, thanks for the article, nice piece.