It is quite normal that a complex IT system like an SAP ERP is occasionally subject to disruptions. But does so much time really always have to pass before the system and its users are freed from the pain?
This is an appeal to remember the real goal.
Recently I was able to observe it again: a small error in a transport request with a big impact. It happened like this: at the customer's site, there is a control table that governs which in-house function modules are called from user exits. Each record in this table names a function module and carries an "active" flag. Before a function module is called, the exit checks whether the flag is set; only then is the module called.
Now it turned out that the change to the control table had overtaken the development of the corresponding function module: the transport request with the "active" flag arrived in the production system before the function module itself. It happened as it had to: within a very short time, important transactions in the logistics area terminated with runtime error CALL_FUNCTION_NOT_FOUND. Not only about 800 dialog users, but also several interfaces were affected. The system was busy writing short dumps, and new reports about this error were piling up in the ticket system.
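To make the mechanism concrete, here is a minimal sketch in Python rather than ABAP. The function module names, the table layout and the exception class are hypothetical stand-ins and not the customer's actual objects; the sketch only illustrates how an "active" record without a matching function module brings down every call that passes through the exit.

```python
class CallFunctionNotFound(Exception):
    """Stand-in for the ABAP runtime error CALL_FUNCTION_NOT_FOUND."""


# Function modules that actually exist in the system (hypothetical names).
function_modules = {
    "Z_EXIT_CHECK_DELIVERY": lambda document: f"checked {document}",
}

# Control table: one record per function module, each with an "active" flag.
control_table = [
    {"function_module": "Z_EXIT_CHECK_DELIVERY", "active": True},
    # This record arrived in production before its function module did.
    {"function_module": "Z_EXIT_NEW_LOGIC", "active": True},
]


def run_user_exit(document, table, modules):
    """Call every active function module listed in the control table."""
    results = []
    for record in table:
        if not record["active"]:
            continue  # inactive exits are skipped
        name = record["function_module"]
        if name not in modules:
            # In the real system this is where the short dump is written.
            raise CallFunctionNotFound(name)
        results.append(modules[name](document))
    return results
```

In this sketch, every call to `run_user_exit` fails for as long as the second record stays active. Setting its flag to `False` (or deleting the record) is enough to make the calls succeed again, which is exactly the one-field fix discussed in the following paragraphs.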
Within a relatively short time, the cause was clear: the above-mentioned new record in the control table. The malfunction in the production system would now have been very easy to fix: just delete the record or remove the "active" flag, and the failures would have stopped. Well, easier said than done. The control table cannot be maintained directly in the production system; its contents are delivered there exclusively via transport requests. So the detour via the development system had to be taken, where the "active" flag was removed and this change was written to a new transport request. The transport request had to be accompanied by documentation, and then an "emergency transport" process had to be run to import it into the production system. As part of that process, a standardized e-mail is sent out for approval. From the first occurrence of the error, it took 1:45 hours to fix it in the production system. During this time, the system suffered almost 10,000 transaction failures, each with a short dump. And for the same period the ticket system received more and more fault reports, all related to this error.
In my opinion, the reason for this waste of time is that the repair of the malfunction in the production system had to take an unnecessary detour via the development system. If the "active" flag had been removed directly in the production system as soon as the cause was identified, the duration of the malfunction would have been reduced to about 20 minutes. Of course I don't mean that every malfunction should be fixed directly in the production system. But in this case, where the cause was clear and the fix (the removal of the "active" flag) did not even have to be checked in a quality or test system, an unnecessary amount of time was lost, during which the system, its users, the interfaces, and ultimately the company and its business processes suffered. It was painful to watch, knowing that immediate troubleshooting would actually have been quite easy. With our product "Shortcut for SAP systems", the "active" flag or the whole record could have been removed within 1 minute.
By the way, the customer values one hour of system downtime at over €200,000. If we assume that the malfunction could have been fixed within 20 minutes, the 85 extra minutes until the actual fix after 1:45 hours caused additional costs of about €280,000. OK, the system was not really "down", but working with it was possible only to a very limited extent (and not at all in the affected logistics transactions). So a considerable financial loss still remains.
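The estimate can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the downtime cost scales linearly with time and uses the lower bound of €200,000 per hour; the function and parameter names are made up for illustration.

```python
def additional_cost(actual_minutes, possible_minutes, cost_per_hour):
    """Cost of the time lost between the possible and the actual fix."""
    extra_hours = (actual_minutes - possible_minutes) / 60
    return extra_hours * cost_per_hour


# 1:45 hours = 105 minutes until the actual fix, 20 minutes assumed possible.
extra = additional_cost(actual_minutes=105, possible_minutes=20, cost_per_hour=200_000)
print(f"{extra:,.0f} EUR")  # prints "283,333 EUR"
```

The result of roughly €283,000 matches the rounded figure of about €280,000 given above.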
Hence my appeal to the many brave employees in IT support and especially to the superiors responsible for IT support:
The highest goal must be to get the production system running smoothly again as soon as possible.
Adjustments in the development system can be carried out in peace after the malfunction in the production system has been eliminated - that is only secondary. Equally secondary are process descriptions, documentation, approvals etc. These can also be done after the fault has been resolved.
So, here are my thoughts and my recommendation:
- Do not rigidly adhere to processes and approval paths that might delay fast and efficient troubleshooting. Don't blindly trust the voices of auditing companies telling you that every change must be documented, approved and tested before it reaches the production system. Also, be careful when security experts try to convince you of a "minimum principle" of authorizations for each and every user working in the system - it is likely that, at least for the people who have to fix a problem, the limited authorizations won't be sufficient in an emergency.
Neither auditors nor security experts are your customers! Your customers are those who can expect you to deliver the goods on time. These customers should be the focus of the troubleshooting - not the consultants whose customer you yourself are.
Process descriptions describe the normal, everyday situation. In an emergency, acting purposefully and rationally should be allowed to beat any process description! Do not allow rigid process descriptions to paralyze you or your company and leave no room for appropriate, reasonable action.
- Set up a "SWAT" team to troubleshoot your production systems and provide them with the necessary authorizations, tools and, most importantly, the necessary confidence.
This can be a really small team. And these people don't have to be module experts, but employees who have a good knowledge of the SAP environment and who can act as an extended arm of the module experts in the systems if necessary. Think of this team as part or as an extension of your regular support team.
It is not even necessary to recruit new staff. The members of this team may well have other full-time tasks; their working time does not have to be reserved exclusively for possible emergencies. But if your system - and therefore your company! - gets into a precarious situation, it is important that at least one member of this team is available and can act without first having to overcome bureaucratic or formalistic hurdles.
If you have a "minimum authorization" strategy, exempt this team from it and allow them to rescue your company. Give these employees the confidence they need - don't let them, their commitment and their motivation fail and despair because of missing authorizations and missing tools.
It pays off - by the satisfaction of your customers with your company's reliability.