Host crash
Resolved
Nov 17 at 06:02pm GMT
After an extended outage and exhaustive collaborative efforts between Assurestor Engineering and Zerto HPE Engineering, the underlying cause of the service disruption was identified late this afternoon. Within the core DR2Cloud VLAN at our new London data centre, we detected abnormal levels of packet loss and high-latency network transactions.
A detailed investigation of our core 100Gb switching fabric revealed that a 10GbE DAC was experiencing intermittent packet drops, which in turn caused an LACP trunk to flap unpredictably. This instability directly impacted traffic flows across the DR2Cloud platform. The faulty DAC has now been disabled and is scheduled for replacement during an emergency site visit tomorrow. In the meantime, one of our management servers will operate temporarily on its 1GbE interfaces until the 10GbE DAC is fully restored.
We believe the initial host crash that occurred at 12:40pm on Sunday the 16th was unrelated to this hardware fault. However, the subsequent recovery process generated a sudden increase in platform load and network traffic volumes. This surge exposed and amplified the existing issue on the compromised DAC, resulting in the widespread packet loss and degraded performance observed.
Following the isolation and removal of the faulty component, packet loss levels have dropped significantly. Multiple customer sites have begun reconnecting, and bitmap synchronisation processes are now progressing as protected virtual machines synchronise approximately the last 26 hours of changes.
Due to the intermittent and non-deterministic nature of the fault, root-cause identification took longer than anticipated. Throughout the incident, although new checkpoints were temporarily unavailable, all historic checkpoints and full recovery capabilities remained intact for all DR2Cloud customers.
Planned Improvements and Monitoring Enhancements
To strengthen platform resilience and reduce time-to-diagnosis for similar issues, we are progressing with the following enhancements:
Enhanced Real-Time Telemetry
Increasing the granularity of switch-level telemetry and flow analytics to provide earlier visibility into abnormal packet-loss patterns and LACP instability.
Automated Threshold-Based Alerting
Deploying automated alerts for micro-burst packet loss, CRC errors, and trunk-flap events to ensure hardware degradation is surfaced immediately.
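As a rough illustration of the threshold-based alerting described above, the sketch below flags interface counters that exceed per-interval limits. The metric names, threshold values, and interface name are illustrative assumptions, not our production configuration; a real deployment would poll the switch via SNMP or streaming telemetry rather than use a static sample.

```python
# Minimal sketch of threshold-based alerting on switch interface counters.
# All thresholds and counter values are illustrative assumptions.

THRESHOLDS = {
    "crc_errors": 10,    # CRC errors per polling interval
    "input_drops": 100,  # dropped packets per polling interval
    "lacp_flaps": 1,     # any trunk-flap event is alert-worthy
}

def check_interface(name, counters):
    """Return alert strings for any counter at or above its threshold."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = counters.get(metric, 0)
        if value >= limit:
            alerts.append(f"{name}: {metric}={value} (threshold {limit})")
    return alerts

# Example poll result for a degraded port (hypothetical numbers)
sample = {"crc_errors": 342, "input_drops": 15, "lacp_flaps": 2}
for alert in check_interface("Ethernet1/49", sample):
    print("ALERT:", alert)
```

The point of the sketch is that micro-burst degradation like an intermittent DAC fault shows up in counters long before it is visible as user-facing packet loss, so alerting on the counters directly shortens time-to-diagnosis.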
Affected services
Updated
Nov 17 at 04:51pm GMT
We are going to make an adjustment to the ZVM's database setup, which will bring the platform down for around 30 minutes. We will report back once everything is back online and confirm whether this alleviates the problem.
Updated
Nov 17 at 04:04pm GMT
Apologies for the lack of communication. We are actively collaborating with Zerto Engineering to identify the root cause of the issue. While we do not yet have definitive answers, a dedicated resource has now been assigned to expedite the resolution process.
Updated
Nov 17 at 09:42am GMT
Thank you to the clients who responded to our request. We have now performed the action requested by Zerto at a client site; unfortunately, this did not resolve the issue as hoped. The data has been passed back to Zerto Support for further feedback.
Updated
Nov 17 at 08:57am GMT
We have had some feedback from Zerto and are requesting that any impacted users raise a ticket and be available for a Teams call so one of our engineers can perform a remote session.
Updated
Nov 17 at 05:48am GMT
We have continued working with Zerto's global, follow-the-sun support teams since our last update. At this point we have no further updates to report until Zerto provides more feedback.
Updated
Nov 17 at 01:54am GMT
We continue to work with the Zerto L2 team towards a resolution. All collected logs are currently being parsed and analysed to identify the root issue. At this point in time, all connected client sites remain in a synchronisation state with the DR2Cloud ZVM but are not passing replication traffic to the cloud VRAs.
Updated
Nov 16 at 10:38pm GMT
Our joint investigation with Zerto Support is still ongoing. Unfortunately, we have not yet been able to identify the root cause of the issue impacting connectivity to the platform. Further updates will continue to follow via this incident report.
Updated
Nov 16 at 07:32pm GMT
The DR2Cloud platform remains in a degraded state. We are currently running through several processes in conjunction with Zerto L2 Support to get service resumed. During this period, VPGs will show various error states; however, recovery remains available if required.
Updated
Nov 16 at 02:42pm GMT
Although all hosts and Zerto components are operational and reporting healthy, we continue to see VPGs fall out of sync. A SEV1 case has been opened and an initial call completed with Zerto Support; detailed logs are currently being collected for analysis by Zerto.
Updated
Nov 16 at 01:21pm GMT
The host is back online after the reboot and root-cause analysis is underway. We expect bitmap syncs to trigger shortly to bring VPGs back within normal RPO tolerances.
Created
Nov 16 at 12:49pm GMT
One of the DR2Cloud ESX hosts crashed and is currently being rebooted to bring it back online. Some VPGs homed on the impacted host will report an error state; we expect this to clear within an hour and will post updates here.