Due to an unexpected high load on the cluster, we were hitting a threshold limit on our backend compute instances, which inturn caused failed MLSkill deployment issue at AICenter side. We worked with our cloud provider team to diagnose and fix the issue at backend.
The deployments are working fine now and we are marking this incident resolved now. We will continue to keep the system under monitoring and work on preventing this issue to resurface. We are sorry for all the inconvenience caused and thank you for your patience and understanding in this.
Posted Dec 30, 2022 - 06:18 UTC
A new fix was implemented to mitigate the incident. The new ML Skill deployments are working fine now. We are performing a few tests and keeping the system under monitoring.
Posted Dec 29, 2022 - 15:17 UTC
After performing a few tests, we have experienced some failures in the ML Skills deployment again. We are looking further into it.
Posted Dec 29, 2022 - 09:00 UTC
We have applied a fix on the cluster and all ML Skills are getting deployed fine now. We are currently monitoring the results and testing a few different use cases as well.
Posted Dec 29, 2022 - 06:43 UTC
As of now cluster is in healthy state . Existing deployments will not be effected until if customers try to make any changes to existing deployment and also new customers having some intermittent issues with creating new deployments. Seems like there are some hardware issues while scheduling new instances on the nodes. We are actively working with MSFT team to resolve the issue and also we are actively working on creating completely new cluster in parallel to unblock. Since it is holiday time its taking more time than expected. Apologies for any inconvenience this has caused.
Posted Dec 28, 2022 - 17:28 UTC
The mitigation actions taken so far for this incident is not resolving the issue completely. We are continuing to work on this issue with our backend team. We apologise for all the inconvenience this incident has caused. Please be rest assured that this issue is being worked upon with high priority.
Posted Dec 28, 2022 - 06:03 UTC
We are making progress in resolving the issues with backend compute instances. Now, only few of the instances are pending to be in healthy state now. Once completed, we should see improvement in ML Skill deployment state.
Posted Dec 27, 2022 - 16:41 UTC
We are continuing to work on a fix for this issue with our cloud provider.
Posted Dec 27, 2022 - 10:33 UTC
The issue has been identified and a fix is being implemented.
Posted Dec 27, 2022 - 03:31 UTC
Backend infrastructure is partially in good state now and only some customers should see issues while creating deployments . We are still working to recover full cluster infrastructure.
Posted Dec 27, 2022 - 01:55 UTC
We are working with Microsoft team to investigate if there are any issues with the backend infrastructure
Posted Dec 26, 2022 - 18:42 UTC
The issue has been found with the backend compute instance. We are working with our backend team on this.
Posted Dec 26, 2022 - 15:18 UTC
ML Skill deployment in our Europe clusters are having some issue, due to which they are getting failed after sometime. We are actively looking into it.