Indexing CSV data
How to index CSV data in Solr?
Indexing in Solr involves 2 important steps
1) Describe the fields of the documents to be indexed inside the schema.xml file
2) Post the data to Solr for indexing
We will use books.csv, a sample file provided by Solr itself, to demonstrate the indexing.
So in our case, books.csv is our source of data.
Step 1
Describe the fields of books.csv that should be indexed in schema.xml
The file is located at the path below
solr-6.2.0\server\solr\MyCore\conf\schema.xml
Open this file and add the following content after the <uniqueKey>id</uniqueKey> tag
<!-- Fields added for indexing books.csv file-->
<field name="cat" type="text_general" indexed="true" stored="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="price" type="tdouble" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
If we look at the books.csv file inside the solr-6.2.0\example\exampledocs folder, it has 9 fields in total, but we have defined only 5 of them above.
What happens to the other fields? Will they be indexed?
Yes, the other fields will also be indexed. But how?
The id field in books.csv is already covered: the default schema.xml defines an id field and declares it as the unique key through the uniqueKey element
<uniqueKey>id</uniqueKey>
The remaining 3 fields are indexed through the dynamicField tags in schema.xml
Observe the dynamicField tags below in the schema.xml file
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
Many dynamic field patterns are defined in the schema.xml file. For example:
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
The above tag says that every field whose name ends with _i is indexed as an int.
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
The above tag says that every field whose name ends with _s is indexed as a string.
The * is a wildcard: the field name may start with anything.
The other tags work the same way, so the remaining 3 fields of books.csv (whose names end with _t, _i and _s) are indexed through the dynamicField tags above.
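We can sketch this matching in a few lines of Python. This is only an illustration of the idea, not Solr's own code: resolve_field is a hypothetical helper, the field names and patterns are copied from the schema snippets above, and the header row is assumed to be the one found in books.csv.

```python
from fnmatch import fnmatchcase

# Explicit fields: id comes from the default schema, the rest were added in Step 1.
EXPLICIT_FIELDS = {"id", "cat", "name", "price", "inStock", "author"}

# Dynamic field patterns and their types, copied from schema.xml.
DYNAMIC_FIELDS = [
    ("*_i", "int"),
    ("*_s", "string"),
    ("*_l", "long"),
    ("*_t", "text_general"),
    ("*_b", "boolean"),
    ("*_f", "float"),
    ("*_d", "double"),
]


def resolve_field(name):
    """Return how a column name would be matched (hypothetical helper)."""
    if name in EXPLICIT_FIELDS:
        return "explicit field"
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return f"dynamicField {pattern} ({field_type})"
    return "no match"


# Assumed header row of books.csv.
header = "id,cat,name,price,inStock,author,series_t,sequence_i,genre_s".split(",")
for column in header:
    print(f"{column:12} -> {resolve_field(column)}")
```

A column that matches neither an explicit field nor a pattern (for example a hypothetical publisher column) ends up as "no match" here; in schema mode Solr would reject such a document with an unknown field error.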
We have set the indexed attribute to true in the schema.xml file.
It means the field is added to the index and can be used in search queries.
If it is set to false, the field can still be stored but is not indexed, so queries cannot search on it.
In simple words, if we want to search on a field, set indexed to true.
We have also set the stored attribute to true in the schema.xml file.
It means the original value of the field is stored and can be returned in the search results.
If we set it to false, the field can still be indexed and searched, but its value is not returned in the results.
We have given enough information to Solr about indexing our document.
Since we have altered the configuration file, stop and start the Solr server
Navigate to the path below in the command prompt
solr-6.2.0\bin
Stop the server using the command below
solr stop -all
Start the server again using the command below
solr start
Now that we have completed Step 1, let's proceed to Step 2
Step 2
Post the data to Solr to perform indexing.
For this, Solr provides a standalone Java program called SimplePostTool, which is packaged in a jar (post.jar) and available in the folder below
solr-6.2.0\example\exampledocs
We can use the command below, run from the solr-6.2.0 folder, to check the options available for this tool
java -jar example/exampledocs/post.jar -h
We can post different types of data to Solr for indexing, such as CSV, JSON and XML.
Let’s see how we can post books.csv data for indexing
Open the books.csv file in the folder below and take a close look at it
solr-6.2.0\example\exampledocs
Let’s navigate to the same path in the command prompt
solr-6.2.0\example\exampledocs
Let’s run the command below to post the books.csv file to Solr
java -Dtype=text/csv -Durl=http://localhost:8983/solr/MyCore/update -jar post.jar books.csv
Since it is a java command, we can pass JVM system properties using -D
We are passing 2 system properties here
-Dtype -> Specifies the content type of the file, such as text/csv, application/xml or application/json
-Durl -> URL of the update handler of the Solr core in which the indexing has to happen
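What post.jar does here boils down to an HTTP POST of the file to the update handler. Below is a minimal Python sketch of the same request. It assumes the core name and URL used above, and it adds commit=true so that the documents become visible right away. build_update_request and post_csv are hypothetical helpers, not part of Solr.

```python
import urllib.request

# Same update handler URL as the -Durl argument above.
UPDATE_URL = "http://localhost:8983/solr/MyCore/update"


def build_update_request(csv_bytes, url=UPDATE_URL):
    """Build (but do not send) a POST request carrying CSV data."""
    return urllib.request.Request(
        url + "?commit=true",                  # ask Solr to commit after the update
        data=csv_bytes,                        # a request with a body is sent as POST
        headers={"Content-Type": "text/csv"},  # same value as -Dtype
    )


def post_csv(path):
    """Send the file to Solr; needs a running server with the MyCore core."""
    with open(path, "rb") as f:
        request = build_update_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling post_csv("books.csv") from the exampledocs folder would then send the file, provided the server is running.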
Solr indexes the file and commits the indexed data in MyCore, and the tool reports this in the command prompt
Access the URL below and check the statistics of the indexed data
http://localhost:8983/solr/#/MyCore
Num Docs displays the number of records that are indexed.
Since we have 10 records in the books.csv file and all of them are indexed, Num Docs displays 10.
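The same number can also be read without the Admin console. The Python sketch below assumes the select URL of MyCore and a JSON response; num_found and count_documents are hypothetical helpers, and the sample string is hand-written in the shape of a Solr response, not real output.

```python
import json
import urllib.request

SELECT_URL = "http://localhost:8983/solr/MyCore/select"


def num_found(response_text):
    """Extract the number of matching documents from a JSON select response."""
    return json.loads(response_text)["response"]["numFound"]


def count_documents(url=SELECT_URL):
    """Ask a running Solr server how many documents the core holds."""
    # q=*:* matches every document; rows=0 skips the documents themselves.
    with urllib.request.urlopen(url + "?q=*:*&rows=0&wt=json") as response:
        return num_found(response.read().decode("utf-8"))


# Hand-written sample in the shape of a Solr JSON response (not real output).
sample = '{"responseHeader": {"status": 0}, "response": {"numFound": 10, "start": 0, "docs": []}}'
print(num_found(sample))
```

With the server running, count_documents() should return the same value that Num Docs shows.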
Access indexed data
We can access it directly in the Admin console of Solr, or we can use the REST API.
Access the URL below
http://localhost:8983/solr/#/MyCore
Select MyCore and click on the Query option
Now click on Execute Query
The result contains all 10 rows of indexed data.
Let’s try to access the indexed data based on some parameters
Solr provides a REST API to access the data, and we can pass different parameters to retrieve it.
Search by Name of the Book
Access the URL below in the browser. It searches the indexed records whose name contains the phrase A storm; the double quotes keep both words tied to the name field.
http://localhost:8983/solr/MyCore/select?q=name:"A storm"
We get a response with one record whose name has A storm in it.
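The browser encodes the space in this URL for us, but a script has to do it itself. The Python sketch below builds the encoded URL; select_url is a hypothetical helper and the base URL is the one assumed throughout this article. It also wraps the two words in double quotes, which in Solr's query syntax makes them a phrase matched against the name field.

```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8983/solr/MyCore/select"


def select_url(query):
    """Return a select URL with the query string safely encoded."""
    return SELECT_URL + "?" + urlencode({"q": query})


# The double quotes make "A storm" a phrase query on the name field.
print(select_url('name:"A storm"'))
# http://localhost:8983/solr/MyCore/select?q=name%3A%22A+storm%22
```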
Search using wildcard
Solr supports wildcard search, where we can search based on a few letters of a word.
Let’s search for the books whose author name contains George.
If we look at our CSV file, we can see 3 records whose author name contains George.
Access the URL below in the browser
http://localhost:8983/solr/MyCore/select?q=author:*George*
The output contains the 3 records
Search using Range
We can also pass a range to Solr to fetch the records.
A range is written as [min TO max], and the fq (filter query) parameter restricts the results of the main query to the records that match it.
Let’s search for the books whose price is between 6 and 7.
Access the URL below
http://localhost:8983/solr/MyCore/select?q=*:*&fq=price:[6 TO 7]
The output contains the 3 records whose price is between 6 and 7
Search using multiple conditions
Let’s say we want to search for all the books whose price is between 6 and 7 and whose author name starts with G.
Access the URL below
http://localhost:8983/solr/MyCore/select?q=author:G*&fq=price:[6 TO 7]
There is only 1 such record in our CSV.
The output contains 1 record whose price is between 6 and 7 and whose author name starts with G.
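The combination of a main query and filters can be put together in a script as well. The Python sketch below assumes the same MyCore URL and a JSON response; build_query_url and book_names are hypothetical helpers, and the sample data in the usage is made up for illustration.

```python
import json
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8983/solr/MyCore/select"


def build_query_url(query, filters=()):
    """Build a select URL with one q parameter and any number of fq parameters."""
    params = [("q", query)] + [("fq", f) for f in filters] + [("wt", "json")]
    return SELECT_URL + "?" + urlencode(params)


def book_names(response_text):
    """Return the name field of every document in a JSON select response."""
    docs = json.loads(response_text)["response"]["docs"]
    return [doc.get("name") for doc in docs]


print(build_query_url("author:G*", ["price:[6 TO 7]"]))
# http://localhost:8983/solr/MyCore/select?q=author%3AG%2A&fq=price%3A%5B6+TO+7%5D&wt=json
```

Each extra fq narrows the result further, and in Solr's range syntax the square brackets include both end values. The text of a response fetched from the printed URL can be passed to book_names to list the matching titles.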
Note: If we modify the data in the books.csv file, we need to let the Solr server know about the modification.
For that we need to index the data again.
Any modification to the original data should be followed by re-indexing in Solr.
Hi,
If we want to index product data in an e-commerce application, do we need to index all product fields (including price etc.) or only the searchable fields?
Thanks in advance.
You need to index all the fields which you want as a result from Solr.
That includes searchable fields, sort fields, facets and the display information shown in Solr search results, like image, price etc.
Hi,
One more configuration step is missing, in solrconfig.xml. To enable schema mode, disable the AddSchemaFieldsUpdateProcessorFactory processor.
http://stackoverflow.com/questions/31719955/solr-error-this-indexschema-is-not-mutable
I got the "IndexSchema is not mutable" error while posting the data, and I had to disable this processor to post the books.csv file.
Hi, I am using Solr 5.3 and I cannot find schema.xml at the location below.
G:\Hybris\solr-5.5.3\server\solr\MYCORE\conf
Kindly help me find it.
Thank you.
Hi Manish,
schema.xml file is not available by default in Solr.
Solr by default uses Schemaless mode for indexing.
We need to create a new core. A managed-schema file will be created inside the conf folder of the new core, and we need to rename it to schema.xml to achieve schema mode.
Please check Create Core in Solr article for more details on how to do it.
Thank you!!