Resolved
We have implemented safeguards to prevent a recurrence of yesterday's cluster issues. We are closing this incident and will keep an eye on the metrics.
Monitoring
The issue affecting our Aurora cluster is fixed and jobs have caught up. We are still investigating why the cluster behaved this way so that it does not happen again.
Identified
Our jobs are piling up due to an issue with our infrastructure. We are implementing countermeasures.
Investigating
We are investigating an issue affecting our application. We are seeing an increase in our error rate and in timeouts.