New SearchStax Managed Solr users sometimes encounter a flurry of 502 Bad Gateway errors. This page presents the usual causes of these errors and their potential solutions.
In this context, a 502 Bad Gateway error usually indicates a gateway timeout. The error is issued by the cluster’s load balancer (called an “Application Gateway” in the Azure world, and an ELB, NLB, CLB, or ALB in the AWS world), or by the Nginx service on a stand-alone node. It means that you gave Solr a task that took longer than the gateway timeout (90 seconds by default) to complete.
What tasks might those be?
Common Causes of Timeout Errors
- Expensive updates: New users sometimes overload Solr indexing by demanding a commit after every add (commit=true) or within a second of every add (commitWithin=1000). Frequent commits are expensive and can overload the system to the point of putting replicas into recovery mode. See Timeouts during ingestion: Too many commits!
- Expensive queries: Another way to cause a gateway timeout is to send Solr a stream of expensive queries. New users often request a million response items (rows=1000000) when ten would do. Queries using “deep pagination” can take a long time to complete. Logically complex queries (with many AND/OR clauses) can also be very slow. See Solr Out of Memory (OOM): Causes and Solutions.
- Very large index: Rebuilding a very large index can require hours. Users often require increased gateway timeout settings and added memory.
- Sitecore issue: There is a known issue in some versions of Sitecore where the Indexing Manager returns a 502 Error. See Sitecore mentions 10.x.x.x:8983.
- Clean restart: If you stop all nodes of a cluster and restart them one by one (as opposed to doing a rolling restart), Solr can issue 502 Bad Gateway errors until ZooKeeper takes control of the replicas again.
- Recovery mode: A variety of issues can push a replica into recovery mode. While the replica is recovering, queries involving it can return 502 Bad Gateway timeouts. For an example, see Is 100% CPU a bad thing?
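
The first two causes above (a commit on every add, and very large or deeply paginated result sets) can often be avoided on the client side. The following Python sketch shows one possible approach, not an official SearchStax client. It groups documents into batches, sends each batch with a relaxed commitWithin value, and builds query parameters that use Solr's cursorMark paging with a small rows value. The endpoint URL, the collection name, the batch size, the 60-second commitWithin value, and the helper names are all hypothetical and chosen only for illustration. Authentication is omitted.

```python
import json
import urllib.request
from urllib.parse import urlencode


def batched(docs, batch_size):
    """Yield consecutive slices of docs, each at most batch_size long."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def build_update_requests(base_url, collection, docs,
                          batch_size=500, commit_within_ms=60000):
    """Return (url, json_body) pairs, one per batch of documents.

    Each request asks Solr to commit within commit_within_ms milliseconds
    instead of forcing a commit per add (commit=true). The default values
    here are illustrative assumptions, not recommended settings.
    """
    params = urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url}/{collection}/update?{params}"
    return [(url, json.dumps(batch)) for batch in batched(docs, batch_size)]


def post_update(url, body, timeout_s=90):
    """POST one JSON batch to Solr. Not called in this sketch."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status


def cursor_query_params(query, cursor_mark="*", rows=10, unique_key="id"):
    """Build parameters for one page of a cursorMark-based query.

    Solr requires the sort to include the uniqueKey field when cursorMark
    is used. The field name "id" is an assumption about the schema.
    """
    return {
        "q": query,
        "rows": rows,
        "sort": f"{unique_key} asc",
        "cursorMark": cursor_mark,
    }


# Hypothetical endpoint and collection, used only to show the request shape.
docs = [{"id": str(n), "title": f"Document {n}"} for n in range(1200)]
requests_to_send = build_update_requests(
    "https://example.invalid/solr", "mycollection", docs)
first_page = cursor_query_params("*:*")
```

When paging with a cursor, each Solr response includes a nextCursorMark value. Pass it as cursor_mark in the next request, and stop when the value returned equals the one that was sent.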