In the news we have this report about a huge failure of Netflix’s internal computer systems.
In brief, the video-rental firm was unable to ship customer orders for two and a half days. Netflix announced rebates to affected customers that will cost millions of dollars. The hit to their reputation — and the possible defection of customers and loss of new customers — will cost millions more.
Netflix is trying to keep it quiet. They’ve not publicly stated what caused the outage, how it was fixed, or what they’ll do to prevent a recurrence.
Far from hiding anything, this story tells us volumes about Netflix’s internal workings — none of it good. We now know that Netflix has:
1. Poor internal IT infrastructure
2. Weak IT leadership in key positions
3. Inadequate risk reporting to (and by) senior management
4. Little or no effective Board supervision of technology risks and results
Poor internal IT infrastructure
First, Netflix doesn’t have a mature IT infrastructure. If it did, its key systems would have had working failover capability and the outage would have lasted no more than four to six hours. (Really good failover setups can switch over in a few seconds.) Grown-up companies that rely on internal computer systems for crucial operations will have well-planned and frequently rehearsed failover procedures, and even the ability to continue operations temporarily through manual systems.
Hospitals are stuffed with computer systems, and they also have manual procedures that workers are trained to use when a system goes down. US hospitals have to practice these manual procedures as part of their state health department accreditation. Key hospital systems have failover capability as well.
Failover plans have to be rehearsed, and are remarkably easy to set up — I once managed a program of 24 simultaneous projects to create failover plans for the key systems of a large insurance firm, and we had them in place and rehearsed in less than a year.
1. Adopt ITIL in IT Operations
2. Create failover plans for all crucial systems, and manual procedures for non-crucial systems
3. Test all failover procedures quarterly — and live-test the failover of crucial systems
4. Invest in automated testing and regular reliability rewrites of workhorse software — no more “stepchild” product lines
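To make the failover idea concrete, here is a minimal Python sketch of the pattern described above: try the primary system, fall back to a standby, and if everything automated is down, record the work for manual processing. Every name in it (`ship_with_failover`, `primary`, `standby`, `manual_queue`) is hypothetical and invented for illustration; it says nothing about how Netflix's systems are actually built.

```python
from typing import Callable, List


class SystemDown(Exception):
    """Raised by a handler that cannot process the order."""


def ship_with_failover(order_id: str,
                       handlers: List[Callable[[str], str]],
                       manual_queue: List[str]) -> str:
    """Try each system in priority order; fall back to a manual queue."""
    for handler in handlers:
        try:
            return handler(order_id)
        except SystemDown:
            continue  # this system is down, so try the next one
    # Every automated system failed: record the order for manual processing.
    manual_queue.append(order_id)
    return "queued for manual processing"


def primary(order_id: str) -> str:
    # Hypothetical primary system, simulated here as being down.
    raise SystemDown("primary shipping system unavailable")


def standby(order_id: str) -> str:
    # Hypothetical standby system that is still healthy.
    return f"shipped {order_id} via standby"


manual_queue: List[str] = []

# Normal failover: the primary is down, the standby takes over.
result = ship_with_failover("order-1", [primary, standby], manual_queue)

# Rehearsal of the worst case: no standby available, so the manual path is used.
drill = ship_with_failover("order-2", [primary], manual_queue)
```

The second call is the kind of thing a quarterly drill would exercise on purpose: it confirms that the manual path really does catch the work when nothing automated is left.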
Weak IT leadership in key positions
Nobody in IT today is unaware of the need for reliability in key operational systems. Yet Netflix still went down for 2.5 days, costing the firm millions. The primary ingredient in building reliability into crucial computer systems is not technological; it is leadership.
Strong IT leaders understand the business implications of downtime. They make the business case upwards to their bosses for the resources needed to deliver reliability, and they make the case downwards to their teams for what must be done to make the systems and their operations reliable. A strong IT leader will set clear goals, assign people to achieve them, and hold them accountable for results. This obviously is not happening at Netflix.
Looking at Netflix’s history, we see a similar one-day outage in March of this year and an 18-hour outage in 2007. Each time, the outage was “resolved” without the public or stockholders ever being told the cause or how it was fixed. This sort of secretiveness is a very bad sign.
Successful firms will publicly describe their problems, causes and resolutions because they are interested in what is right, not who is right — that is, they are interested in fixing the future, not fixing blame for the past. (Electric utilities have very good systems of reporting and responding to outages.) When a company like Netflix is secretive about outages, it’s a clear indicator of a culture of fear, low accountability, blame, and politics. This can easily be verified or tested by talking with front-line workers in Netflix’s IT Operations department — I invite them to write me directly to confirm or correct my hypotheses about their culture.
Additional evidence of a secretive culture that doesn’t admit mistakes openly or address them courageously: Netflix’s corporate PR web site lists dozens of press releases — none of which address the March or August 2008 service outages. I don’t see anything on the 2007 outage either.
1. Train all IT managers on key Leadership behaviors (admitting errors; listening; managing performance; effective praise)
2. Adopt nameless-rankless After Action Reviews to identify root causes and harvest lessons
3. Eliminate naming, blaming and shaming from the IT culture
4. Nurture an IT culture of continuous improvement based on fearlessly confronting reality — being hard on issues and soft on people
Inadequate risk reporting to (and by) senior management
The Netflix 2007 Annual Report flatly admits that the firm’s internal systems are not “entirely redundant” — yet goes on to offer no assessment of the state of the firm’s investment in proprietary software, the need for continual reinvestment in that core code, and the like.
Worse, the two 10-Q quarterly reports immediately following the March 2008 outage say nothing about that outage: “There have been no material changes from risk factors as previously disclosed…”
To anyone familiar with in-house software development, it’s obvious that the “sexy” software projects at Netflix — the ones that get adequate resources and direct attention from senior management, and attract the best internal people — are the front end web site, the proprietary recommendation algorithms, and the new Video on Demand product. The workhorse shipping-and-return software — the likely culprit in both 2008 outages — is almost certainly being treated internally as a stepchild product line, with inadequate investment, no quality-enhancement rewrites, no serious automated testing, and an array of “B” talent assigned to it.
The reports of “rude” customer service phone reps give evidence that even internally, Netflix workers are being left in the dark regarding service reliability. When workers are left uninformed, they get unhappy and frustrated. It’s up to the IT leadership to provide that internal and external communication, building trust and keeping people informed. They didn’t.
1. Senior management should get and review a portfolio report on the status and maturity of all IT systems, and should spend some time on each, not just the sexy or new ones.
2. Senior management needs to model the behavior it needs to see from IT leadership — namely, honest and open communication, especially about errors, problems, mistakes, and downtime.
Little or no effective Board supervision of technology risks and results
The low visibility in the official documents of so critical a business issue as IT infrastructure reliability, combined with the lack of any evidence of an IT culture of honesty, courage, results orientation and accountability, tells us ultimately that the Board is out of touch with this aspect of the business.
Anyone who has served on a few boards can tell you that it’s simple for a board to track any metric or issue that it considers crucial to the business: the Board tells the corporate officers to report on it, and the Board then discloses the results to shareholders and stakeholders through formal documents.
Shareholders might want to invite board members to address this in future. If they don’t, future outages will inevitably occur.
1. The Board should hire an outside auditor to assess the IT culture, the adequacy of reports on IT systems being given to senior management, and the real level of technical risk in the workhorse operational systems
2. The Board should begin reporting to shareholders explicitly on operational risks including IT system down time
3. The Board should assign — or have the CEO assign — direct responsibility to one officer for improving IT systems and operations, and institute clear metrics for measuring progress.
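One simple metric a board could ask for is annual availability. The Python sketch below is only an illustration: it uses the two 2008 outage durations mentioned in this post (about one day in March, two and a half days in August), and it assumes the shipping systems are meant to be available around the clock, which is my assumption and not something Netflix has stated.

```python
HOURS_IN_2008 = 366 * 24  # 2008 was a leap year

# Outage durations as described in this post, in hours.
outages_2008 = {"March": 24.0, "August": 2.5 * 24}


def availability(downtime_hours: float, period_hours: float) -> float:
    """Fraction of the period during which the system was up."""
    return 1.0 - downtime_hours / period_hours


downtime = sum(outages_2008.values())  # total hours down in 2008
uptime_pct = 100 * availability(downtime, HOURS_IN_2008)

print(f"Downtime: {downtime:.0f} h, availability: {uptime_pct:.2f}%")
# prints "Downtime: 84 h, availability: 99.04%"
```

Under those assumptions the figure comes out at roughly 99 percent. A number like this, reported every quarter alongside a target, is the kind of clear metric the third recommendation has in mind.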
Update 8/29/2008: A more condensed and focused version of the ideas in this blog posting is in this Portland Business Journal column.