Production Incident Log
Time of incident: March 3rd, 2024 4:00 am EST
During a 4-hour period, microsites on some custom domains were difficult to access from specific locations.
What happened? We deployed a fix for Celero form visibility issues, but the fix did not fully propagate to all Lambda@Edge locations. As a result, some customers hit an intermediate state and received an HTTP 503 error.
Who was affected? Only a limited number of end users experienced the issue, which occurred during a scheduled maintenance window. Affected customers were notified.
What did we learn? To prevent similar incidents, we have added monitors for all custom domains and implemented more comprehensive checks to detect related issues. We have also identified improvements to our deployment process that will reduce the likelihood of a recurrence.
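As an illustration of the per-domain monitoring described above, here is a minimal sketch of a check that flags any custom domain returning a server error. It is not our production monitoring code: the function names `fetch_status` and `failing_domains`, the `https://{domain}/` probe URL, and the 10-second timeout are all assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def failing_domains(
    domains: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, int]:
    """Map each unhealthy domain to the status that marked it unhealthy.

    A domain counts as unhealthy if it returns a 5xx status (such as 503)
    or does not respond at all (status 0). The probe URL is hypothetical.
    """
    failures: Dict[str, int] = {}
    for domain in domains:
        status = fetch(f"https://{domain}/")
        if status == 0 or status >= 500:
            failures[domain] = status
    return failures
```

Because this incident affected only specific locations, a check like this would need to run from several geographic regions to catch a partial propagation; a single vantage point could report every domain as healthy.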
Time of incident: November 24th, 2023 4:49 am EST
For a period of 20 minutes starting at 4:49 am EST, some users saw a white screen when accessing Celero content.
What happened? We deployed a routine minor version update that included several fixes. One of its dependencies was not deployed to production. As a result, some of our servers returned an error that caused a white screen for some users.
Who was affected? Only some users were affected. Because we run on multiple servers, the code is deployed gradually and verified immediately. As soon as we identified the issue, we started reverting to the previously deployed version, which took several minutes to complete. Once the revert completed (20 minutes after the initial deployment), all services were back to normal.
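The gradual deploy, verify, and revert flow described above can be sketched roughly as follows. This is a simplified illustration and not our actual deployment tooling: `rolling_deploy` and the `deploy`, `verify`, and `rollback` callbacks are hypothetical names standing in for whatever the real pipeline does on each server.

```python
from typing import Callable, List, Sequence


def rolling_deploy(
    servers: Sequence[str],
    deploy: Callable[[str], None],
    verify: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy to one server at a time and verify each before continuing.

    If verification fails on any server, every server updated so far is
    rolled back (most recent first) and the remaining servers are left
    untouched. Returns True only if all servers were updated and verified.
    """
    updated: List[str] = []
    for server in servers:
        deploy(server)
        updated.append(server)
        if not verify(server):
            for done in reversed(updated):
                rollback(done)
            return False
    return True
```

Verifying after each server limits exposure to the servers already updated, which is why only some users were affected. A verification step that also confirmed every required dependency was present would be one way to catch a missing dependency like the one in this incident.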
What did we learn? We followed our internal process for identifying root causes so that issues like this don't repeat. As a result, we will be making a few fixes: