Most serious IT outages are not won or lost in the moment of failure. They are won or lost in the months before, in the quality of the backups, the clarity of the recovery plan, and the discipline of the maintenance work that nobody notices. What follows is what a competent response actually looks like during an incident, and the prevention work that decides whether the next outage is a half-day disruption or a multi-day crisis.
The first hour
The first hour of an outage sets the trajectory for the rest of the response. The instinct in the room is usually to act fast. The more useful instinct is to slow down by ninety seconds and define the problem with precision before anyone touches a keyboard. We have seen well-meaning attempts to "just try a restart" turn a recoverable incident into a much longer one because the underlying scope was never established.
Start with scope. Is a single workstation affected, or the whole office? Is email down on the desktop but working on phones? Is internet reachable but a specific application unresponsive? "Everything is broken" is almost never the accurate description, and it is almost always the description your provider gets in the first call. Two minutes spent narrowing the scope, before anyone is paged, will save thirty minutes of guesswork after.
Calling your provider
Call your provider as soon as the scope is clear. Give them the specifics in one pass: what stopped working, when it started, how many people are affected, and what changed recently, such as a patch window over the weekend, a vendor update, a power event, or new equipment in the rack. The faster they have those facts, the faster they get to the actual cause.
Pay attention to what happens next. In a real emergency, you should hear back inside fifteen to thirty minutes during business hours, with a human on the case and a working theory by the end of the first hour. If you are waiting hours for a callback while the business is down, the gap between the contract you signed and the service you are receiving has stopped being theoretical. The MSP Performance Scorecard covers what reasonable response standards look like in writing.
Communication during the outage
Two audiences need to hear from you during a serious incident: your team and your clients. Both will tolerate a problem and neither will tolerate silence. For your team, the message is what is happening, what is being worked on, and what they should do in the meantime. Without that, people will start their own troubleshooting, and well-intentioned fixes during an active incident are how small problems become bigger ones.
For clients, the message is shorter. "We are experiencing a technical issue and are working to resolve it. We expect to be back to normal by [time]. We will update you if that estimate changes." That sentence, sent proactively, is worth more goodwill than any apology email sent after the fact. Clients forgive problems. They do not forgive being kept in the dark while their work is stuck.
The post-mortem
Once systems are back, the work is not done. Schedule a post-mortem with your provider inside the following week, while the timeline is still fresh and nobody has had a chance to revise the story. A useful post-mortem covers the root cause in plain language, the timeline from detection to acknowledgment to resolution, the measurable impact on the business, and a specific written list of changes that will prevent a recurrence.
The quality of that conversation tells you a great deal about the provider. Vague answers about "a server issue" or "a glitch" are not a root cause. "We will keep an eye on it" is not a prevention plan. If the provider cannot explain what happened and what changes as a result, the same incident is going to happen again, and the cumulative cost of those repeat outages is usually larger than most owners realize until it is mapped out.
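The timeline from detection to acknowledgment to resolution is easier to discuss when the intervals are written down as numbers. The following is a minimal sketch in Python, not a standard tool: the function name `incident_intervals` and the timestamps in the usage note are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def incident_intervals(
    detected: datetime, acknowledged: datetime, resolved: datetime
) -> dict[str, timedelta]:
    """Return the three intervals a post-mortem timeline should state.

    Hypothetical helper: the key names are illustrative, not an industry standard.
    """
    # Refuse timelines that are out of order rather than report negative durations.
    if not detected <= acknowledged <= resolved:
        raise ValueError("expected detected <= acknowledged <= resolved")
    return {
        "time_to_acknowledge": acknowledged - detected,
        "time_to_resolve": resolved - acknowledged,
        "total_downtime": resolved - detected,
    }
```

As a usage example with made-up times, an outage detected at 09:00, acknowledged at 09:20, and resolved at 13:00 gives a 20-minute acknowledgment, a further 3 hours 40 minutes to resolve, and 4 hours of total downtime. Putting those three numbers in the written post-mortem makes it harder for the story to drift later.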
Backups that have actually been restored
A backup that has never been restored is just an expensive hope. The question to ask your provider is not whether you have backups, because the answer is always yes. The question is: when was the last time data was actually restored from those backups to verify the restore worked, and what was the result?
A competent provider can answer that with a date and a description. An evasive answer ("we monitor them daily," "the jobs are completing successfully") is the answer to a different question. Completion of a backup job confirms only that something was written. It does not confirm that what was written is usable, that the retention is correct, or that the restore window meets your tolerance for downtime. Backup verification is one of the security baseline items that should already be running on a documented cadence.
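To make the difference between "the job completed" and "the restore worked" concrete, here is a minimal sketch of one way a test restore could be checked: restore a sample folder to a scratch location and compare each file's checksum against the original. It assumes the source files have not changed since the backup was taken (a checksum manifest recorded at backup time would be a sturdier reference), and the function names `sha256_of` and `verify_restore` are hypothetical, not part of any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> dict[str, list[str]]:
    """Compare every file under source_dir with its restored counterpart.

    Hypothetical helper: assumes source_dir is unchanged since the backup ran.
    """
    report: dict[str, list[str]] = {"ok": [], "missing": [], "mismatch": []}
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            # The backup job may have "completed" without capturing this file.
            report["missing"].append(relative.as_posix())
        elif sha256_of(source_file) != sha256_of(restored_file):
            # The file came back, but its contents differ from the original.
            report["mismatch"].append(relative.as_posix())
        else:
            report["ok"].append(relative.as_posix())
    return report
```

A report with anything in `missing` or `mismatch` is the kind of result a provider should be able to describe when asked about the last restore test. A check like this says nothing about retention or how long a full restore takes, which still have to be measured separately.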
A recovery plan you can reach when systems are down
The most common failure mode we see in recovery plans is not that one does not exist. It is that the plan is stored on the file server that is currently offline. A useful recovery plan lives somewhere reachable during an outage, on paper or in a cloud document accessible from a phone, and it answers a small number of questions. Which systems come back first, and in what order. Who is responsible for each step. Who needs to be notified, with phone numbers that are current. What the step-by-step procedure is for the failure scenarios that are realistic for your environment.
The plan does not have to be elaborate. It has to stay specific and current, and it has to be reachable from outside the affected systems. Most small businesses can fit theirs on two pages.
Redundancy where it pays for itself
Redundancy across every system is not realistic for a small business, and providers who suggest otherwise are usually selling the wrong product. The work is to identify the handful of systems the business genuinely cannot operate without, and protect those specifically. For most small businesses, the short list is internet connectivity, email, and one or two line-of-business applications. A secondary internet circuit on a different carrier, cloud-based email with a tested failover, and documented recovery procedures for the core applications cover the majority of realistic failure scenarios.
Everything else can tolerate a few hours of downtime. The discipline is being honest about which systems are which, and resisting the urge to protect things that do not need that level of protection.
Maintenance is what prevention actually looks like
The IT emergencies we see during assessments are rarely caused by exotic threats. They are caused by hardware that should have been replaced two years ago, by patches that were deferred and then forgotten, by monitoring alerts that fired into an inbox nobody read. Prevention is the discipline of doing the boring work on schedule, every month, for years, and no product can substitute for that.
A proactive provider catches these issues before they become outages: lifecycle planning for aging hardware, a documented patching cadence with reporting, monitoring that someone actually responds to, and quarterly conversations about what is approaching end-of-life. If your provider only appears when something breaks, the emergencies you are surviving are mostly the ones that proactive work would have prevented in the first place. The MSP Frustration Quiz is a quick way to surface whether the relationship you have is proactive in fact or only in marketing.
Find out where you stand before the next outage tests it.
The Technology Confidence Assessment is an independent review of your backup posture, recovery planning, and infrastructure resilience. You get a written read on where the real exposure is, and what to address first.
Request the Assessment
If you recognize gaps as you read this, that is useful information rather than a verdict. The point of seeing the gap is to close it before an incident closes it for you.