February 22, 2023
If you’re considering SearchStax Studio or have recently purchased a site search solution, you’re probably thinking through data ingestion. Data ingestion is often a major challenge in getting site search up and running, and developing a sound approach to data import is critical as it directly impacts the quality and relevance of search results.
In the context of site search, data ingestion is the process of importing and loading data from one or more data sources and making it available in a structured format that can be indexed and searched by a search engine. The data may include website content, product information, user behavior data, documents and more. Data ingestion also involves repeatedly pulling in data on a real-time or regular batched basis.
The goal of data ingestion for site search is to ensure that the search engine can quickly and accurately retrieve relevant information for a user’s search query and improve the overall user experience.
With SearchStax Studio, data ingestion means getting data into a Solr-based search index so it can be accessed via a search request on a website or in a custom application. This post looks at the various ways to load data into Studio, identifies the sources and types of data we support and provides recommendations for best practices.
There are three main ways to load data into SearchStax Studio:

1. CMS integration modules (the Sitecore and Drupal connectors)
2. The SearchStax Data Ingest API
3. The SearchStax web crawler
Let’s take a look at each one of these options in more detail.
If you use Sitecore or Drupal for your content management system (CMS), SearchStax has integration modules that automate the data indexing process and accelerate the implementation process.
The SearchStax Studio Connector for Sitecore is available for Sitecore versions 9 through 10.3. The Connector integrates with the Sitecore Indexing Manager and automatically indexes all Sitecore content items out-of-the-box. Additional information can be found in the Sitecore Connector product documentation.
The SearchStax Studio Connector for Drupal automatically tracks all search results known to the Drupal Search API. Once the Drupal Connector is installed and configured, it automatically indexes any new or updated content in the Drupal environment. The module adds search functionality while requiring virtually no changes to the Drupal website. The Drupal integration was developed by Thomas Seidl (drunken monkey), the creator and maintainer of the Drupal Search API, and follows all Drupal open source code guidelines. Additional information can be found in the Drupal Connector product documentation or from the Drupal Connector module page at Drupal.org.
The SearchStax Data Ingest API is a service that allows you to index and search structured data in your SearchStax search service. The API enables you to send data to your search service in real-time, making it immediately searchable by users. Customers can also use the SearchStax Ingest API to load documents into their Studio application. On the Settings page, the Ingest endpoint is the /update endpoint and uses the “Read-Write” Search API credentials.
The Ingest APIs simplify the data ingestion process by enabling a customer or an implementation partner to create a small piece of code to get data from any source and push it into SearchStax Studio. You can index individual JSON documents, multiple JSON documents or a JSON file with an array of JSON objects. You can also index XML documents by sending one or multiple tags. Additional information on using the Ingest APIs can be found in the Studio product documentation.
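As a concrete illustration, here is a minimal sketch of the kind of small ingestion script described above, using only the Python standard library. The endpoint URL, API key, field names, and the `Token` authorization scheme are all placeholders and assumptions; copy the real `/update` endpoint and "Read-Write" Search API credentials from your Studio Settings page, and confirm the header format in the Studio product documentation.

```python
import json
import urllib.request

# Hypothetical values -- replace with the /update endpoint and "Read-Write"
# Search API key shown on your SearchStax Studio Settings page.
INGEST_URL = "https://searchcloud-example.searchstax.com/solr/myapp/update"
API_KEY = "your-read-write-key"

def build_update_request(url, api_key, docs, commit=True):
    """Build a POST request that sends a JSON array of documents for indexing."""
    if commit:
        url += "?commit=true"  # make the documents searchable right away
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme -- verify against the Studio documentation.
            "Authorization": "Token " + api_key,
        },
        method="POST",
    )

# Illustrative documents with made-up field names.
docs = [
    {"id": "page-1", "title": "About Us", "body": "Company history..."},
    {"id": "page-2", "title": "Contact", "body": "How to reach us..."},
]
req = build_update_request(INGEST_URL, API_KEY, docs)
# urllib.request.urlopen(req)  # uncomment to actually send the request
```

Because the request body is just a JSON array, the same pattern works whether the data comes from a database export, a CMS dump, or a file on disk.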
SearchStax also has a web crawler that can crawl the data on any website. The crawler identifies itself to the website it is crawling and then issues a high volume of requests to gather the metadata needed for the Solr index. To control the SearchStax crawler, a number of variables are passed so it knows where to start, the types of pages to crawl and what pages to exclude, if any.
A brilliant feature of the SearchStax Crawler is that it can crawl single-page applications. For example, if you build a website using Salesforce (like hub.nashville.gov, which our Crawler crawls for the City of Nashville), all of the content on a page is loaded dynamically when you navigate to it. Our Crawler is smart enough to detect these dynamic pages and crawl them.
The SearchStax Crawler is available for a separate one-time setup charge and an ongoing monthly charge. It is limited to 100,000 pages, and the data cannot be uploaded to the Solr index until the entire crawl has been completed. Contact SearchStax to learn more about the SearchStax Crawler and pricing.