Indexing CSV data


How to index CSV data in Solr?

Indexing in Solr involves 2 important steps:

1) Describe the fields of the document to be indexed in the schema.xml file

2) Publish/post the data to Solr for indexing

We will use books.csv, a sample file that ships with Solr, to demonstrate indexing.

So in our case, books.csv is our source of data.

Step 1
Describe the fields of books.csv that should be indexed in schema.xml

The path to this file is as below
solr-6.2.0\server\solr\MyCore\conf\schema.xml

Open this file and add the following content after the <uniqueKey>id</uniqueKey> tag

<!-- Fields added for indexing books.csv file-->
 <field name="cat" type="text_general" indexed="true" stored="true"/>
 <field name="name" type="text_general" indexed="true" stored="true"/>
 <field name="price" type="tdouble" indexed="true" stored="true"/>
 <field name="inStock" type="boolean" indexed="true" stored="true"/>
 <field name="author" type="text_general" indexed="true" stored="true"/>

Now if we look at the books.csv file available inside the solr-6.2.0\example\exampledocs folder, it has 9 fields in total, but we have defined only 5 of them above.

What happens to the other fields? Will they be indexed?

Yes, the other fields will also be indexed, but how?

The id field in books.csv is covered by the id field that the uniqueKey element of schema.xml refers to
<uniqueKey>id</uniqueKey>

The other 3 fields will be indexed through the dynamicField tags in schema.xml

Observe the dynamicField tags below in the schema.xml file

  <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <dynamicField name="*_l" type="long" indexed="true" stored="true"/>
  <dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
  <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
  <dynamicField name="*_f" type="float" indexed="true" stored="true"/>
  <dynamicField name="*_d" type="double" indexed="true" stored="true"/>

Many dynamic field patterns are defined in the schema.xml file

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>

The above tag says that any field whose name ends with _i will be indexed as an int

<dynamicField name="*_s" type="string" indexed="true" stored="true"/>

The above tag says that any field whose name ends with _s will be indexed as a string

The * is a wildcard, so the field name can start with anything.

The other tags work the same way, so the remaining 3 fields in books.csv (ending with _t, _i and _s) will be indexed through the above dynamicField tags.
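The matching described above can be sketched in a few lines of Python. This is only a simplified illustration and not how Solr resolves fields internally; the three suffixed column names are assumed from the sample books.csv, so check them against your copy of the file.

```python
from fnmatch import fnmatchcase

# Explicit <field> names from Step 1, plus the id field referenced by <uniqueKey>.
EXPLICIT_FIELDS = {"id", "cat", "name", "price", "inStock", "author"}

# A subset of the <dynamicField> patterns listed above: pattern -> field type.
DYNAMIC_FIELDS = {
    "*_i": "int",
    "*_s": "string",
    "*_t": "text_general",
}


def resolve(column):
    """Return how a CSV column would be handled, or None if nothing matches."""
    if column in EXPLICIT_FIELDS:
        return "explicit field"
    for pattern, field_type in DYNAMIC_FIELDS.items():
        if fnmatchcase(column, pattern):
            return f"dynamicField {pattern} ({field_type})"
    return None


# Header columns; the last three names are an assumption about the sample file.
header = ["id", "cat", "name", "price", "inStock", "author",
          "series_t", "sequence_i", "genre_s"]

for column in header:
    print(column, "->", resolve(column))
```

A column that matches neither an explicit field nor a pattern returns None here, which is a hint that it needs its own entry in schema.xml.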

We have set the indexed attribute to true in the schema.xml file.
It means the field is added to the index and can be searched.

If indexed is set to false, the field can still be stored, but it cannot be used in a search query.

In simple words, if we want to search on a field, set indexed to true.

We have also set the stored attribute to true in the schema.xml file.

It means the original value of the field is stored and can be returned in the output.

If we set it to false, the field can still be indexed and searched, but its value will not be returned in the result.
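To make the difference between the two attributes concrete, here is a toy model in Python. It is not Solr code: the ToyField class is hypothetical, it splits text on whitespace only, and it ignores other Solr features that can also affect what is returned.

```python
from collections import defaultdict


class ToyField:
    """Toy model of one field: 'indexed' feeds search, 'stored' feeds results."""

    def __init__(self, indexed, stored):
        self.indexed = indexed
        self.stored = stored
        self.inverted_index = defaultdict(set)  # term -> ids of documents containing it
        self.stored_values = {}                 # id -> original value

    def add(self, doc_id, value):
        if self.indexed:
            for term in value.lower().split():
                self.inverted_index[term].add(doc_id)
        if self.stored:
            self.stored_values[doc_id] = value

    def search(self, term):
        """Return ids of documents containing the term (needs indexed=True)."""
        return sorted(self.inverted_index.get(term.lower(), set()))

    def fetch(self, doc_id):
        """Return the original value (needs stored=True), else None."""
        return self.stored_values.get(doc_id)


search_only = ToyField(indexed=True, stored=False)
search_only.add("1", "A Storm of Swords")
print(search_only.search("storm"))  # ['1']  -> found by a query
print(search_only.fetch("1"))       # None   -> but no value to return

display_only = ToyField(indexed=False, stored=True)
display_only.add("1", "A Storm of Swords")
print(display_only.search("storm"))  # []    -> cannot be searched
print(display_only.fetch("1"))       # A Storm of Swords
```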

We have now given Solr enough information to index our document.

Since we have altered the configuration file, stop and start the Solr server

Navigate to the below path in command prompt
solr-6.2.0\bin
Stop the server using the below command
solr stop -all
Start the server using the below command
solr start

Now we have completed Step 1, let's proceed to Step 2

Step 2
Post the data to Solr to perform indexing.

For this, Solr provides a standalone Java program called SimplePostTool, which is packaged in post.jar and available inside the below folder
solr-6.2.0\example\exampledocs

From the solr-6.2.0 folder, we can use the below command to check the options available for this tool
java -jar example/exampledocs/post.jar -h

We can post different types of data to Solr for indexing, such as CSV, JSON and XML.

Let’s see how we can post books.csv data for indexing


Open and have a close look at the file books.csv available under below folder
solr-6.2.0\example\exampledocs

Let’s navigate to the below path in command prompt
solr-6.2.0\example\exampledocs

Let’s run the below command to post the books.csv file to Solr
java -Dtype=text/csv -Durl=http://localhost:8983/solr/MyCore/update -jar post.jar books.csv

Since it’s a java command, we can pass Java system properties using -D

We are passing 2 system properties here

-Dtype -> Specifies the content type of the file, such as text/csv for CSV
-Durl -> URL of the update handler of the Solr core in which indexing has to happen
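Under the hood, posting a file is an HTTP POST to the update URL with the matching content type. The sketch below shows roughly the same request built with Python's standard library. It is not the SimplePostTool source; the function names are made up for this example, and it assumes Solr is running locally with a core named MyCore.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_csv_update_request(csv_bytes, update_url):
    """Build a POST request that sends CSV data and asks Solr to commit."""
    query = urlencode({"commit": "true"})
    return Request(
        f"{update_url}?{query}",
        data=csv_bytes,
        headers={"Content-Type": "text/csv; charset=utf-8"},
        method="POST",
    )


def post_csv(path, update_url="http://localhost:8983/solr/MyCore/update"):
    """Read a CSV file and send it to Solr; returns the HTTP status code."""
    with open(path, "rb") as handle:
        request = build_csv_update_request(handle.read(), update_url)
    with urlopen(request) as response:
        return response.status


# Usage (needs a running Solr server, so it is left as a comment):
# print(post_csv("books.csv"))
```

The content type plays the role of -Dtype and the update URL plays the role of -Durl in the java command above.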

Solr indexes the file and commits the indexed data in MyCore, and the tool reports its progress in the command prompt

Access the below url now and check the statistics of the indexed data
http://localhost:8983/solr/#/MyCore

[Screenshot: books.csv indexed statistics]

We can observe that Num Docs displays the number of records which are indexed.
Since we have 10 records in the books.csv file, all these records are indexed and hence Num Docs displays 10.
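If we want to cross-check Num Docs against the source file, we can count the records in the CSV ourselves. A minimal sketch follows; the inline sample is made-up data used only to show the call, not the contents of books.csv.

```python
import csv
import io


def count_records(handle):
    """Count data rows in a CSV file; the header line is not counted."""
    return sum(1 for _ in csv.DictReader(handle))


# Made-up sample with a header and two records.
sample = "id,name,price\nbook-1,First Example,5.99\nbook-2,Second Example,6.99\n"
print(count_records(io.StringIO(sample)))  # 2

# For the real file (run from the exampledocs folder):
# with open("books.csv", newline="", encoding="utf-8") as handle:
#     print(count_records(handle))
```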

Access indexed data

We can access it directly in the Admin console of Solr, or we can use the REST API.

Access the below url now
http://localhost:8983/solr/#/MyCore

Select MyCore and click on the Query option

[Screenshot: Solr core Query option in admin console]

Now click on Execute Query

[Screenshot: Solr core Execute Query in admin console]

We can see that the result contains all 10 rows of indexed data.

[Screenshot: Solr core query result for books.csv data]


Let’s try to access the indexed data based on some parameters

Solr provides a REST API to access the data, and we can pass different parameters to retrieve it.

Search by Name of the Book

Access the below url in the browser, which searches the indexed records in Solr whose name has A storm in it. The double quotes keep both words tied to the name field; without them, only A would be searched in name and storm would go to the default search field.

http://localhost:8983/solr/MyCore/select?q=name:"A storm"

We get the below response, which returns one record whose name has A storm in it.

[Screenshot: Solr books.csv search by name result]
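Browsers usually encode spaces and quotes for us, but a program calling the REST API has to do it itself. The sketch below builds the same kind of select URL with Python's standard library; select_url is a helper name invented for this example, and the host and core name are the ones used in this article.

```python
from urllib.parse import urlencode


def select_url(core, **params):
    """Build a select URL with properly encoded query parameters."""
    base = f"http://localhost:8983/solr/{core}/select"
    return f"{base}?{urlencode(params)}"


# The quotes make it a phrase query on the name field.
url = select_url("MyCore", q='name:"A storm"')
print(url)
# http://localhost:8983/solr/MyCore/select?q=name%3A%22A+storm%22
```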


Search using wildcard

Solr supports wildcard search, where we can search based on a few letters of a word.

Let’s search for the books whose author name contains George.

If we look at our CSV file, we can see that 3 records have an author name containing George.

Access the below url in the browser
http://localhost:8983/solr/MyCore/select?q=author:*George*

We can see the below output which contains the 3 records

[Screenshot: Solr books.csv search by wildcard]


Search using Range

We can also pass a range to Solr to fetch the records.

A range is written as [low TO high], and the fq (filter query) parameter restricts the results to the documents that fall in it.

Let’s search for the books whose price is between 6 and 7.

Access the below url. Here q=*:* matches every document and fq narrows the result down to the price range.
http://localhost:8983/solr/MyCore/select?q=*:*&fq=price:[6 TO 7]

We can see the below output which contains the 3 records whose price is between 6 and 7

[Screenshot: Solr books.csv search by range]
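Square brackets in a range include both end values, while curly brackets such as {6 TO 7} exclude them. The small Python sketch below mimics that behaviour on made-up prices, just to show which values would pass the filter; it does not talk to Solr.

```python
def in_range(value, low, high, inclusive=True):
    """Mimic [low TO high] (inclusive) or {low TO high} (exclusive)."""
    if inclusive:
        return low <= value <= high
    return low < value < high


# Hypothetical ids and prices, not taken from books.csv.
prices = {"book-a": 5.99, "book-b": 6.99, "book-c": 7.0, "book-d": 7.99}

inclusive_matches = [book for book, price in prices.items() if in_range(price, 6, 7)]
exclusive_matches = [book for book, price in prices.items()
                     if in_range(price, 6, 7, inclusive=False)]

print(inclusive_matches)  # ['book-b', 'book-c']
print(exclusive_matches)  # ['book-b']
```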


Search using multiple conditions

Let’s say we want to search for all the books whose price is between 6 and 7 and whose author name starts with G.

Access the below url

http://localhost:8983/solr/MyCore/select?q=author:G*&fq=price:[6 TO 7]

There is only 1 such record in our CSV.

We can see the below output which contains 1 record whose price is between 6 and 7 and whose author name starts with G.

[Screenshot: Solr books.csv search with multiple conditions]
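When we call the same URL from a program, adding wt=json asks Solr for a JSON response, which has a response object holding numFound and the list of docs. The sketch below reads those two values; the sample text is a hand-written stand-in with made-up values, not real output from the query above.

```python
import json

# Hand-written stand-in for a Solr JSON response (values are made up).
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {"id": "example-1", "price": 6.5}
    ]
  }
}
"""


def summarize(response_text):
    """Return the number of matches and the ids of the returned documents."""
    body = json.loads(response_text)["response"]
    return body["numFound"], [doc["id"] for doc in body["docs"]]


count, ids = summarize(SAMPLE_RESPONSE)
print(count, ids)  # 1 ['example-1']
```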

Note: If we modify the data in the books.csv file, Solr does not pick up the change automatically.

We need to post the file again so that it is re-indexed.

Any modification to the original data should be followed by indexing in Solr.

About the Author

Karibasappa G C (KB)
Founder of javainsimpleway.com
I love Java and open source technologies and am very passionate about software development.
I like to share my knowledge with others, especially on technology 🙂
I have kept all the examples as simple as possible for beginners to understand.
All the code posted on my blog is developed, compiled and tested in my development environment.
If you find any mistakes or bugs, please drop an email to kb.knowledge.sharing@gmail.com
