Outage in Sao Paulo Region (MH1)
Incident Report for Maxihost
Postmortem

MH1 Outage

At 9:13 PM UTC on May 20, 2021, our operations team identified a major outage at MH1 affecting one of the floors of the site. MH1 is located in São Paulo and is one of three locations that Maxihost operates in the region. As soon as we identified the issue, we posted an incident on the status page and updated it every 30 minutes.

We discovered that the local power distribution company (ENEL) had planned maintenance on the electrical distribution network from 2:00 PM UTC to 9:00 PM UTC on May 20. In practice, ENEL's maintenance ran from 3:00 PM UTC to 10:00 PM UTC. By 1:00 PM UTC, our electrical team had manually started a power generator to carry the Datacenter workload. That generator supported the Datacenter load until 8:30 PM UTC, when we expected a second generator to come online automatically. Unfortunately, this did not happen.

A system of four redundant UPS battery units handled the Datacenter's workload until they were completely drained. In the meantime, the electrical team on-site was attempting to bring the second generator up manually. When the UPS units were used up, that generator was not yet operational, and the result was the outage.

Service reliability is a top priority for Maxihost, and we understand how critical our services are to your company. MH1 has been in operation since 2016 and has an uptime track record of over 99.99%. Our on-site teams and contracted vendors test the electrical systems regularly. We are deeply sorry for this outage and will make every effort to prevent it from ever happening again.

At present, our electrical engineers are combing through all our logs and will work closely with our vendors over the next few weeks to implement effective measures to identify and resolve the root cause. One of the measures that is being put in place right now is to set our generators to work in parallel so that, in the event of extended downtime from the power distribution company, we can avoid switching between generators.

As the CEO of Maxihost, I accept full responsibility for this failure. We are all keenly aware of the impact of this type of outage on our customers and I'm really sorry to have failed you. If you have any questions, or if we can help in any way, reach out to our support team, which has been addressing each report individually.

I also want to express my gratitude to everyone who has been patient and those who have expressed kind words of encouragement and support during and after the incident. There's nothing more important to us than your trust. We can and will do better.

Maxihost is committed to and deeply invested in improving our technology and field operations. We've learned a lot from this incident and we will not rest until all measures are taken to prevent a repetition of the experience. 

I thank you for your business and for trusting Maxihost as your Bare Metal Cloud provider.

Guilherme Soubihe Alberto

 CEO

Posted May 21, 2021 - 17:30 GMT-03:00

Resolved
The issue is now resolved. We are still working with a few clients who are facing some difficulties. If you need further support or information about the situation, please reach out to us via [email protected]. We will provide a statement about the occurrence soon.
Posted May 20, 2021 - 22:26 GMT-03:00
Update
All our cabinets and devices in the MH1 region are back online. Some services might still present issues while we restore their configuration and fine-tune them, mainly Maxihost's own services such as the website dashboard and the API. We will provide an update on these services shortly.
Posted May 20, 2021 - 21:03 GMT-03:00
Update
We continue working towards bringing up 100% of our services. Some clients might still experience issues, and all clients may have problems accessing the dashboard on our website. Be assured that all our engineers are focused on this situation. We will provide another update within the next 30 minutes.
Posted May 20, 2021 - 20:33 GMT-03:00
Update
We are close to having 100% of services re-established. However, some minor verifications are still pending in some cabinets, and we are working closely with all our teams to minimize the impact and have all services fully operational promptly. You might experience some difficulties accessing your dashboard on our website. We will provide another update within the next 30 minutes.
Posted May 20, 2021 - 20:06 GMT-03:00
Update
Some services were re-established in the past 30 minutes, but some of our cabinets are still unstable. Our team is working on it, and as soon as the service is 100% operational we will investigate and communicate the root cause. We will provide another update within the next 30 minutes.
Posted May 20, 2021 - 19:35 GMT-03:00
Identified
We have identified a problem involving one of our generators at MH1 in the São Paulo region. It affected Maxihost's infrastructure, causing a drop in traffic and bringing some devices down for a large share of our clients. Our team is working to bring these servers back online as fast as possible, using our redundant solutions to alleviate the situation. Our team is also replying through our support system with more information for the affected clients. We apologize for the inconvenience. If you need more information, please contact us via [email protected]. We will update this status with more details within the next 30 minutes.
Posted May 20, 2021 - 18:54 GMT-03:00
This incident affected: Regions (MH1 (Sao Paulo)).