Helios News
Information regarding Helios Network Outage
Author: Jeremy Herring
November 28th, 2012

We want to take this opportunity to send you a follow-up report on the Helios Network outage of November 17th and to outline our response to the incident. We apologize that the incident occurred; we were as surprised as you were by the magnitude of the situation.

What happened

As you may already know, the entire situation began with the local power utility, Indianapolis Power and Light (IPL), which provides utilities to our data center, LightBound (formerly IQuest). LightBound experienced a sudden and sustained power surge which overwhelmed the surge suppression capabilities of their equipment. Even after their circuit breakers disconnected them from the outside power grid and their standby power generators kicked in to power the data center, the damage to the surge protection equipment was already done. These devices are designed to sacrifice themselves in order to protect valuable electronic equipment but the circumstances of this particular incident were certainly unprecedented and therefore outside of the scope of normal mission-critical contingency planning. Since LightBound supplies the power protection equipment as part of the co-location services provided to Helios, their onsite staff immediately swapped these out to return the Helios Network to powered functionality.

Once the Helios servers were confirmed to be powered back up and our Network Administration Team was able to access these systems remotely, we expected the network services to resume normal functionality automatically. When the network services failed to become available over the internet, the team began remote diagnostics and discovered that the load balance servers, which are the first line of interface from the internet, had suffered damage as a result of the power surge. These are critical components of the Helios Network and are not supplied by LightBound; therefore, LightBound is restricted in how much action they can take on our behalf when a situation calls for direct interaction with Helios property. In order to remove these damaged servers from the network, one of the Helios Team members had to physically go to the data center, remove these devices from the mix, and reroute all of the incoming internet traffic to connect properly to the related servers (Version 12 Enterprise, Backoffice, credit card processing, Marketing Hub, etc.). Tracing each connection and then confirming that each one was properly rerouted took a considerable amount of time.

While all of this was taking place, the Helios Support Team was unexpectedly overwhelmed with calls and emails related to the outage. Weekend staffing levels are not as high as during weekdays and therefore the volume of calls overwhelmed the phone system at Helios / New Sunshine. Many callers were simply getting a busy signal once there were no more available incoming lines. The focus of each tech was simply on conveying to each caller the current status of the incident. The outage even affected the call center’s help desk application which further aggravated the condition since the team could not even log the calls for callbacks. Reinforcements were quickly summoned and they scrambled to create a makeshift shared log on the internal network that would suffice until they could restore the help desk application. Only once the Helios Network was restored to full functionality were they able to clear the call queue and begin callbacks.

There was also a brief subsequent network outage on November 20th, which was the result of a cascade failure. Without the load balance servers in place to equalize the workload, one server experienced an error that would be easily recoverable under normal circumstances, but in the temporary network configuration without load balancing it couldn’t be recovered without affecting all of the other services. One of the reasons we have two load balance servers in the network is specifically to provide redundancy in network load management, and the probability of both servers failing at the same time under any other circumstance is practically negligible.

What we learned

Obviously, the first thing we learned was that our first responders did exactly what they are trained to do. The duration of this incident was not a reflection of any shortcoming on the part of either LightBound or the Helios Network Administration Team. Certainly the circumstances were extraordinary, but the same response plans that were designed to address an ordinary or predictable situation proved to be equally effective in addressing this one. We apologize that the outage took as long as it did to resolve, but the time required to restore full functionality was simply beyond anyone’s control.

We also learned that our support team is only as good as the tools at their disposal. In this situation, their primary help desk application was affected, which limited their ability to cope adequately with the situation. The phone system is also insufficient to accommodate sudden and unpredictable surges in call volume. Even the ability to record a greeting on demand would have helped to curb the call queue.

There is also the outgoing communication factor. Since we use the New Sunshine Marketing Hub for our email communications and that service was also affected by the outage, we were unable to even broadcast an email with information about the situation. In frustration, many of you turned to Facebook and Twitter to inquire and to vent but we don’t always monitor these in real-time so the lack of update posts was interpreted as callousness on our part. For that we are truly sorry. By the time we were able to take our focus off of the crisis at hand and consider addressing the concerns being posted, most of our responses in a public forum would have been inadequate or inappropriate to the tenor of the discussion.

Finally, we have gained a measure of perspective on the degree to which you rely on us not simply as a business partner but as a lifeline for your business. In the wake of this entire situation we have had personal conversations with many who were affected by this outage. Several salon owners were actually reasonably satisfied with the Helios response to the circumstances even if they were not happy about the fact that it occurred in the first place. One common denominator among these reactions is the fact that they had taken steps on their end to build their own contingency plans, going so far as producing an emergency response manual with instructions on how to operate the business manually, copies of contracts and current price lists. In a technology-driven world this may seem like an unnecessary precaution, but the better part of success is simply preparation for the unexpected.

Steps we are taking

Meetings have already taken place with representatives at LightBound to address future contingency plans and also to discuss remediation for the loss of property and revenue.

Replacement load balance servers are ready and are tentatively scheduled to be installed this evening (Wednesday overnight/Thursday). The replacement of the load balance servers will return the Helios Network to full functionality, including the mobile dashboard application. Multiple load balance servers already represent a significant redundancy in the network.

The phone system at Helios and New Sunshine is already scheduled to be replaced with a new digital phone network. The benefits of this new system include the ability to remotely record and apply a greeting message for incoming calls. The number of incoming lines will likely not be increased because there is little evidence that the added expense for these lines could be justified.

The outgoing voice of Helios has been improved to provide more timely communications regarding situations which may affect you. One of the steps already taken has been to add more team members to the group able to send out these communications. Another step already taken has been to configure a redundant email provider in the event that the New Sunshine Marketing Hub is affected by a future incident. By utilizing an external portal, we can send out communications regardless of the status of the Marketing Hub.

Your voice matters

If you would like to share your thoughts, questions or concerns regarding this event, please email us at feedback@gohelios.com. Our management team will be reviewing any and all communications via this email address and will make contact with you personally if you so desire.

Again, we cannot apologize enough for the unfortunate events which took place and we can only hope that with your patience we can work on regaining your trust as a valued partner.

 