Production Incident Log

Some custom domains result in error 503

Time of incident: March 3rd, 2024 4:00 am EST

During a 4-hour period, some custom domains experienced difficulties accessing microsites from specific locations.

What happened? We deployed a fix to address Celero form visibility issues, but it did not fully propagate across all Lambda@Edge locations. As a result, some customers hit servers in an intermediate deployment state and received a 503 error.

Who was affected? Affected customers were notified, and only a limited number of end users experienced the issue, which occurred during a scheduled maintenance window.

What did we learn? To prevent similar incidents, we have enhanced our monitoring by adding monitors for all custom domains and implementing more comprehensive checks to detect related issues. We have also identified improvements to our deployment process to reduce the likelihood of a recurrence.
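A per-domain monitor like the one described above can be sketched roughly as follows. This is a minimal illustration assuming a simple poll-and-alert model; the `check_domain` and `is_healthy` helpers are hypothetical and not Celero's actual monitoring tooling.

```python
# Hypothetical sketch of a per-custom-domain health check; names, the
# HEAD-request approach, and the 5xx threshold are illustrative assumptions.
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def is_healthy(status):
    """Treat any 5xx (e.g. the 503s in this incident) or no response as unhealthy."""
    return status is not None and status < 500

def check_domain(url, timeout=5.0):
    """Fetch one custom domain and report its HTTP status (None if unreachable)."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            status = resp.status
    except HTTPError as exc:
        status = exc.code   # urlopen raises on 4xx/5xx; the status code is still useful
    except URLError:
        status = None       # DNS, TLS, or connection failure
    return {"url": url, "status": status, "healthy": is_healthy(status)}
```

In practice a monitor like this would run on a schedule against the full custom-domain list and page on-call when `healthy` is false.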

Some content was showing as a white screen

Time of incident: November 24th, 2023 4:49 am EST

For a period of 20 minutes starting at 4:49am EST, some users saw a white screen when accessing Celero content.

What happened? We deployed a routine minor version update that included several fixes. One of its dependencies was not deployed to production, so some of our servers returned an error that resulted in a white screen for some users.

Who was affected? Only some users were affected. Because we run on multiple servers, code is deployed gradually and verified immediately. As soon as we identified the issue, we began reverting to the previously deployed version, which took several minutes to complete. Once the rollback finished (20 minutes after the initial deployment), all services were back to normal.
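The gradual deploy-verify-revert flow described above follows a standard rolling-deployment pattern, which can be sketched as follows. The `deploy`, `verify`, and `rollback` hooks are placeholders, not our actual deployment tooling.

```python
# Illustrative sketch of rolling deployment with automatic rollback;
# the hook functions are assumptions standing in for real deploy tooling.
def rolling_deploy(servers, deploy, verify, rollback):
    """Deploy to servers one at a time; if any server fails verification,
    roll back every server updated so far and report failure."""
    updated = []
    for server in servers:
        deploy(server)
        updated.append(server)
        if not verify(server):
            for s in reversed(updated):  # undo in reverse order
                rollback(s)
            return False
    return True
```

The key property is that a bad release stops at the first failing server instead of reaching the whole fleet, which is why only a subset of users saw the white screen.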

What did we learn? We followed our internal root-cause analysis process so that issues like this don't repeat. As a result, we are making a few changes:

  1. We have changed our staging deployment process to ensure that all dependencies are deployed together.
  2. Unrelated to the issue, we found an opportunity to improve our monitoring and will be adding another monitor for more fine-grained internal alerting.
  3. The team followed the process as intended and did everything possible to minimize the impact. We are nonetheless taking extra precautions to prevent issues like this from recurring.
