View Poll Results: Is RIM fully at fault for the outage?

Voters: 41
  • Yes absolutely: 27 votes (65.85%)
  • No, they didn't make the equipment that failed: 10 votes (24.39%)
  • RIM should make their own switches using QNX! 4 votes (9.76%)
  1. dentynefire's Avatar
    Just wanted to point out that the BB outage wasn't the result of something RIM made. If a switch fails and its backup fails too (which I don't understand how that happens), the blame lies at least in part with the manufacturer of the faulty equipment.

    RIM still has to take responsibility for this even though it wasn't their own product. So do you think they deserve all the blame?
    10-13-11 07:24 AM
  2. Mystic205's Avatar
    You do know the failures (plural) were in multiple countries over a period of a week or more?... This was not a case of a plug pulled out by a cleaning lady...
    10-13-11 07:30 AM
  3. dentynefire's Avatar
    I'm aware of that. One NOC serves multiple countries, so a major piece of equipment failing can cause a domino effect. They say their backup system failed too, which I find strange.

    Not to mention the backlog of messages that had to clear. Facebook had something similar when they went down (a backlog of messages), if I'm not mistaken. Load balancing across these huge servers isn't something I want to start thinking about.
    10-13-11 07:33 AM
  4. PDSchofield's Avatar
    The switch that failed was not made by RIM; it was probably from Cisco or Juniper or someone like that. But these switches have excellent high-availability features in their operating software, and those were configured by RIM network engineers. RIM can only pass a portion of the blame on to the manufacturer if it is proven that the issue was with the product, not just a misconfiguration.
    10-13-11 07:34 AM
  5. lnichols's Avatar
    Just wanted to point out that the BB outage wasn't a result of something RIM made. If a switch fails and a backup fails (which I don't understand how) the blame in some way is the manufacturer of the faulty equipment.

    RIM still has to take responsibility for this even though it wasn't their own product. So do you think they deserve all the blame?
    I don't make any of the network hardware used in the network I design and support either, but I'm responsible for the engineering of the network and for testing out the backup capabilities. Also, if a single switch fails and the hot standby doesn't kick in, then there should be a way either to force the transition or to reprogram things to get service working again in a short amount of time. I expect equipment to fail, so I back up existing equipment, and if it fails I install the replacement and load the backup configs.

    Part of what I'm wondering now is how many of the recent layoffs happened in the NOC/datacenters, and whether the top-level talent that knew what was going on was let go so that cheaper, less knowledgeable people could perform the same roles (which didn't happen). Anyway, I'm not buying the single-core-switch explanation, simply because the outage was way too long for that to have been the issue. If it was indeed the case, then RIM is in deep trouble, because it would imply that they have severely neglected the infrastructure to the point of not having replacement equipment on hand and knowledgeable people on site to address critical issues.
    10-13-11 07:38 AM
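The active/standby behaviour the last couple of posts describe (automatic election of the standby, plus a way to force the transition by hand) can be sketched in a few lines. This is a toy Python model for illustration only; real switches implement this in protocols like HSRP/VRRP, and all the names below are made up.

```python
# Toy model of an active/standby pair with automatic election and a
# manual "force the transition" override. Illustrative only.

class FailoverPair:
    """One active node, one hot standby, and a manual override."""

    def __init__(self):
        self.active = "primary"
        self.healthy = {"primary": True, "standby": True}

    def report_failure(self, node):
        self.healthy[node] = False
        self._elect()

    def force_transition(self, node):
        # Manual override: promote a node even if automatic election
        # did not, as long as it is actually healthy.
        if self.healthy[node]:
            self.active = node

    def _elect(self):
        # Automatic election: if the active node is down, promote the
        # standby -- provided the standby itself is healthy.
        if not self.healthy[self.active]:
            if self.healthy["standby"]:
                self.active = "standby"
            # If both are down, nothing can take over: this is the
            # "backup also failed" scenario from the outage.

pair = FailoverPair()
pair.report_failure("primary")
print(pair.active)  # -> standby
```

The interesting case is when the standby has silently failed (or was misconfigured) before the primary dies: election then has nowhere to go, which is why the posts above stress testing the failover path, not just having one.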
  6. dentynefire's Avatar
    Part of what I'm wondering now is how many of the recent layoffs happened in the NOC/Datacenters and if possibly the top level talent that knew what was going on was let go so that cheaper less knowledgeable people could perform the same roles (which didn't happen)? Anyway I'm not buying the single core switch explanation simply because the outage was way too long for that to have been the issue. If it indeed was the case then RIM is in deep trouble because it would imply that they have severely neglected the infrastructure to the point of not having replacement equipment on hand and knowledgeable people on site to address critical issues.
    I hope that isn't the case with the layoffs. There was that ex-employee who went to the papers and talked about the problems with the legacy NOC code. I'd believe that to be a possible reason after the initial failure. Either way, some lessons learned, I'm sure!
    10-13-11 07:56 AM
  7. Caymancroc's Avatar
    RIM making their own switches?

    Don't they have their hands full just trying to keep up with the phone business? On top of that, they are dealing with a failed tablet launch and a potential glimpse into how QNX is going to be received into the market. Now you want them to make switches? Dios mio amigo!
    10-13-11 08:00 AM
  8. rdkempt's Avatar
    I couldn't care less if Apple and Google made the switches for RIM; they were not properly configured or tested for failover. These switches have redundancy and failover options but were obviously misconfigured by RIM engineers. It's 100% their fault, no need to make excuses for them here.
    10-13-11 08:09 AM
  9. dentynefire's Avatar
    RIM making their own switches?

    Don't they have their hands full just trying to keep up with the phone business? On top of that, they are dealing with a failed tablet launch and a potential glimpse into how QNX is going to be received into the market. Now you want them to make switches? Dios mio amigo!
    Well, I had to put in a third option.
    I've been a fan of QNX since even before RIM purchased them. It's a really solid base OS. I think the market generally likes QNX. Poor PlayBook reviews primarily diss the lack of apps (esp. PIM) or the power button, lol.
    10-13-11 08:25 AM
  10. d3adcrab's Avatar
    I've done Disaster Recovery consulting work for a number of global multinational organizations, including some big-name financial and telecoms institutions. One thing I have learnt in this time is that no one can guarantee 100% availability. You can spend hundreds of millions of dollars on DR infrastructure and redundant failover equipment, but you cannot guarantee a 100% successful switch-over to DR facilities in the event of an incident. I have lost count of the number of times simulated switch-overs to DR facilities have failed during periodic DR testing, or infrastructure fell over just days after a successful test... it's life, it happens. Expecting it not to is just plain unrealistic.

    Incidentally, here is an interesting article that talks a bit on RIM's infrastructure and why seemingly small issues affect it in a big way: HowStuffWorks "Reasons Behind BlackBerry Service Outages"
    dentynefire and Superfly_FR like this.
    10-13-11 08:27 AM
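There is simple math behind the point that redundancy still can't guarantee 100% availability: doubling up on equipment multiplies availability only if failures are independent. A back-of-the-envelope sketch (the 99.9% figure is an illustrative assumption, not RIM's real number):

```python
# Back-of-the-envelope availability math for redundant equipment.
# Figures are illustrative assumptions, not RIM's actual numbers.

def combined_availability(a, n):
    """Availability of n independent redundant units, each with
    availability a: the system is down only if all n are down."""
    return 1 - (1 - a) ** n

single = 0.999                       # one switch: 99.9% available
pair = combined_availability(single, 2)
print(round(pair, 6))                # 0.999999 -- "six nines" on paper

# The catch described above: the units are rarely independent. A shared
# misconfiguration or an untested failover path means the standby fails
# exactly when the primary does, and the paper math collapses back
# toward the single-unit figure.
```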
  11. Caymancroc's Avatar
    Oh. Cool. Never made a poll. Now I know.
    10-13-11 08:28 AM
  12. anon(3678875)'s Avatar
    It doesn't matter if the equipment was made by RIM or Cisco or whoever. If they do not have strong change control, they can simply overlook an ACL change, for example, on their Tier 1 devices, which could cause cases like this, believe me.

    In these cases, it takes time to determine the root cause, especially if they do not track their activities on these switches/routers etc.
    10-13-11 08:47 AM
  13. dentynefire's Avatar
    My experience is with electrical work, so I have had equipment fail at times too. You scramble to troubleshoot the fault and repair it. Most times it's easy, sometimes not so much. For RIM to get all the blame tossed their way is understandable; it is their system that people pay to use and depend on. As some have pointed out already, technology isn't perfect. We can do our best to prevent downtime, but chances are that with complicated systems **** will hit the fan at some point.

    @d3adcrab
    That was an interesting read. Perhaps if they invested in another NOC, that would limit any future problems. But then again, who wants to pay more? 99.9% is good enough for me.
    10-13-11 09:38 AM
  14. JeepBB's Avatar
    Telecoms switches are very reliable beasties, but they will break, and it'll always be at the worst time.

    That's why competent network operators build redundancy, failover, and protection routines into their networks. Who made the switch and why it failed is irrelevant. RIM's cascading failure was to not have redundancy, to have a failover solution that didn't work, and then to take three days to recover whilst maintaining a stony silence as the world and his wife screamed at them and the press had a field day writing "failing company fails!" stories!

    That's a failure on so many levels that a lesser company would be incapable of achieving all of it.
    10-13-11 09:54 AM
  15. T
    Perhaps if they did invest in another NOC that would limit any future problem. But then again who want to pay more? 99.9% is good enough for me
    Yeah, people cry here all the time about the cost of BIS like it's some big deal. I think it's fine the way it is. A 99% uptime (after this outage) is still reasonable as far as I'm concerned.

    As for the matter of layoffs, Mike just denied at the conference that they had anything to do with the outage. Though I don't know the details, I find it hard to believe RIM laid off vital in-the-know people rather than simply expendables ...
    10-13-11 09:54 AM
  16. DenverRalphy's Avatar
    In the field of Security & Disaster Recovery, 99% uptime is nowhere near as important as the duration of any one downtime.
    Buzz_Dengue and Laura Knotek like this.
    10-13-11 10:01 AM
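To put numbers on the uptime percentages being thrown around in the last few posts, here's a quick conversion from an availability figure to the downtime it permits per year. This is plain arithmetic, with no assumptions about RIM's actual service levels:

```python
# Convert an uptime percentage into permitted downtime per year.
# Note it says nothing about whether the downtime comes as one long
# outage or many short blips -- which is the point made above.

HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours(uptime_pct):
    """Hours of downtime per year allowed by a given uptime %."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% uptime -> {downtime_hours(pct):.1f} h/year down")
```

So 99% still permits roughly 87.6 hours (over three and a half days) of downtime per year; a single multi-day outage can fit inside a "good" percentage, which is why duration matters more than the headline number.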
  17. Altarocks's Avatar
    How about substituting the word "responsible" in place of "at fault"?
    10-13-11 10:11 AM
  18. Kg810's Avatar
    Just wanted to point out that the BB outage wasn't a result of something RIM made. If a switch fails and a backup fails (which I don't understand how) the blame in some way is the manufacturer of the faulty equipment.

    RIM still has to take responsibility for this even though it wasn't their own product. So do you think they deserve all the blame?
    You are so delusional and in such denial of RIM's constant failures.

    Here's what I imagine you saying during the BP oil spill:

    "Just wanted to point out that the explosion at the BP gulf oil rig and the oil spill wasn't a result of something BP made. If something failed and caused an explosion which then backup systems failed (which I don't understand how) the blame in some way is the manufacturer of the faulty machinery and equipment."

    "BP still has to take responsibility for this even though it wasn't their own product. So do you think they deserve all the blame?"

    /facepalm ... get your head out of RIM's *******
    10-13-11 10:15 AM
  19. dentynefire's Avatar
    You are so delusional and in such denial of RIM's constant failures.

    Here's what I imagined you saying during the BP oil spill

    "Just wanted to point out that the explosion at the BP gulf oil rig and the oil spill wasn't a result of something BP made. If something failed and caused an explosion which then backup systems failed (which I don't understand how) the blame in some way is the manufacturer of the faulty machinery and equipment."

    BP still has to take responsibility for this even though it wasn't their own product. So do you think they deserve all the blame?"

    /facepalm ... get your head out of RIM's *******
    wow!
    I don't know the reasons why this happened at all. Regarding BP, I think it's clear what happened. I'm giving RIM the benefit of the doubt; I'm on the fence about it. I created the thread and poll so that people more knowledgeable than me can discuss a topic that is far over my head.

    I'm not sure what world you live in where everything is perfect 100% of the time. You obviously never had one of these:
    car or truck part fail
    computer part fail
    electricity fail
    internet/phone network fail

    Nooooooo these things NEVER happen.

    What you are saying is that every street light should have battery backup, lol. Would be nice. Why don't you get on that? I'm sure your municipality will agree, lol.
    Last edited by dentynefire; 10-13-11 at 11:01 AM.
    10-13-11 10:56 AM
  20. ichat's Avatar
    wow!
    I don't know the reasons why this happened at all. Regarding BP. I think its clear what happened. I'm giving RIM the benefit of the doubt, I'm on the fence about it. I created the thread and poll so that people more knowledgeable than me can discuss a topic which is far over my head.

    I'm not sure what world you live in where everything is perfect, 100% of the time. You obviously never had one of these:
    car or truck part fail
    computer part fail
    electricity fail
    internet/phone network fail

    Nooooooo these things NEVER happen.

    What you are saying is that every street light should have battery backup lol would be nice.
    I'm with you. Hey Kg810 (sorry if I spelled it wrong), tell me you never had a problem before and I will agree. Everyone has problems, big and small. We will forget soon. You will see......

    Posted from my CrackBerry at wapforums.crackberry.com
    10-13-11 11:01 AM
  21. Kg810's Avatar
    wow!
    I don't know the reasons why this happened at all. Regarding BP. I think its clear what happened. I'm giving RIM the benefit of the doubt, I'm on the fence about it. I created the thread and poll so that people more knowledgeable than me can discuss a topic which is far over my head.

    I'm not sure what world you live in where everything is perfect, 100% of the time. You obviously never had one of these:
    car or truck part fail
    computer part fail
    electricity fail
    internet/phone network fail

    Nooooooo these things NEVER happen.

    What you are saying is that every street light should have battery backup lol would be nice. Why don't you get on that. I sure your municipality will agree lol
    Who said anything about being perfect? Don't go off on a tangent, please.

    As I look at the title of your thread and your original post... you are hardly looking for a discussion; you are clearly defending RIM and trying to pawn off the blame on someone else.

    I think you fail to realize the magnitude of their fck up. This wasn't some physical part of a product failing; this was their entire BIS failing. Your example is like my Shaw modem breaking vs. Shaw's entire TV, phone, and internet services going out for all their customers in Canada.
    10-13-11 12:37 PM
  22. SharpieFiend's Avatar
    If the switch that failed was made by Cisco (very likely) then it *was* running QNX. Cisco has used QNX in iOS for a very, very long time...

    However, when they say the switch failed, we don't know if it was a hardware or provisioning issue. If it was provisioning, then it certainly was RIM's fault, because their network engineers configured it.
    10-13-11 08:37 PM
  23. dentynefire's Avatar
    If the switch that failed was made by Cisco (very likely) then it *was* running QNX. Cisco has used QNX in iOS for a very, very long time...

    However, when they say that the switch failed we don't know if it was a hardware or provisioning issue. If it was provisioning then it certainly was RIM's fault because their network engineers configured it.
    I would agree with you if you could convince me that IOS uses QNX. Ooops, see how easy it is to make a mistake? With your clear knowledge of the matter, you should apply to RIM to clean up their apparent mess.
    10-14-11 06:43 AM