How to prevent and deal with the five major risks of data center infrastructure operation and maintenance
Time of issue: 2022-10-11
For data center operation and maintenance personnel, the biggest headache is the "growing pains" a data center goes through as its years in service add up. Dealing with the resulting risks has become the most important part of operation and maintenance work.
Risk 1: Flaws in the construction process lead to congenital defects
The construction process of a data center aims to achieve "what you need is what you build, what you build is what you get, and what you get is what you use". "What you need is what you build, what you build is what you get" applies before the data center is put into production, while "what you get is what you use" shows up mainly in the operation and maintenance stage.
Before a data center enters normal operation, it usually goes through several important links: planning, scheme design, equipment selection, project implementation, and commissioning and acceptance. A flaw in any one of these links may lead directly to congenital defects in the data center infrastructure that is finally delivered. Achieving "what you need is what you build, what you build is what you get" therefore requires highly professional collaboration across many departments and links.
The key to handling this kind of risk is to avoid it early. To guide customers toward sound early avoidance, however, the service team must have rich experience in planning, design, construction, and operation and maintenance all at once.
Risk 2: Changes in the operating environment of the data center
As a user's business develops rapidly, the scale of its IT equipment gradually increases, and the load on the data center comes to exceed the level it was planned for. The capacity and configuration of infrastructure such as power supply and thermal management struggle to keep up with business needs, and the reliability of data center operation faces a major safety risk.
To address these changes in the operating environment, the main solution is to monitor data center resources continuously and comprehensively and to deploy a professional resource management system. This gives operation and maintenance personnel a real-time, comprehensive view of how existing resources are being used, so that problems can be discovered and handled promptly and systematically.
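As a rough illustration of what such resource monitoring might check, the sketch below compares measured load with planned capacity and flags resources that are running close to their limit. The resource names, the figures, and the 80% warning threshold are all assumptions made for this example; they do not come from any particular resource management system.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """One capacity-limited infrastructure resource (all values hypothetical)."""
    name: str
    capacity: float  # planned (design) capacity
    load: float      # currently measured load, same unit as capacity

    @property
    def utilization(self) -> float:
        # Fraction of planned capacity currently in use.
        return self.load / self.capacity


def over_threshold(resources, threshold=0.8):
    """Return the resources whose utilization is at or above the threshold."""
    return [r for r in resources if r.utilization >= threshold]


# Hypothetical readings; a real system would collect these from sensors.
readings = [
    Resource("UPS power (kW)", capacity=400.0, load=352.0),
    Resource("Cooling (kW)", capacity=500.0, load=300.0),
    Resource("Rack space (U)", capacity=4200.0, load=3990.0),
]

for r in over_threshold(readings):
    print(f"{r.name}: {r.utilization:.0%} of planned capacity")
```

With these sample figures, the UPS power and rack space entries are flagged, while cooling still has headroom. In practice the threshold would be set per resource according to the redundancy design of the site.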
Risk 3: Aging and performance degradation of data center equipment
After long-term operation, the performance of data center equipment degrades through wear and other causes, and equipment may even stop suddenly, which poses a serious hidden danger to data center operation. From a maintenance perspective, the components of data center equipment and systems can be divided into wear parts, maintenance parts, and maintenance-free parts. Wear parts usually include mechanical moving parts and electronic components that have a short design life and must be replaced regularly; common examples are UPS capacitors and the fans of air conditioner outdoor units. Maintenance parts are less prone to aging and damage, but they still need regular upkeep; if routine maintenance is inadequate or incorrect, aging accelerates and the parts end up needing replacement. Typical maintenance parts include water valves and some piping.
Data center operation and maintenance teams should keep a management record of wear parts so that they always know which equipment is operating beyond its design life. For maintenance parts and parts that are not replaced on a schedule, regular testing and evaluation should be carried out in addition to the necessary routine maintenance, so that hidden dangers from equipment aging are discovered in time.
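A management record of this kind can be very simple. The following is a minimal sketch, assuming each record stores only a part name, an installation date, and a design life in years; the part names, dates, and design-life figures are invented for illustration and are not manufacturer data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class WearPart:
    """One entry in a wear-parts register (all fields are illustrative)."""
    name: str
    installed: date
    design_life_years: float

    def age_years(self, today: date) -> float:
        # Approximate age in years from the number of elapsed days.
        return (today - self.installed).days / 365.25

    def is_overdue(self, today: date) -> bool:
        # True once the part has run past its design life.
        return self.age_years(today) > self.design_life_years


def overdue_parts(register, today):
    """Return the parts operating beyond their design life, oldest first."""
    overdue = [p for p in register if p.is_overdue(today)]
    return sorted(overdue, key=lambda p: p.installed)


# Hypothetical register entries.
register = [
    WearPart("UPS-1 DC capacitor bank", date(2015, 3, 1), design_life_years=5),
    WearPart("CRAC-2 outdoor unit fan", date(2019, 6, 15), design_life_years=5),
    WearPart("UPS-2 DC capacitor bank", date(2012, 9, 1), design_life_years=7),
]

today = date(2022, 10, 11)  # fixed reference date so the example is repeatable
for part in overdue_parts(register, today):
    print(f"{part.name}: {part.age_years(today):.1f} years in service, "
          f"design life {part.design_life_years} years")
```

Running the check on a schedule, rather than only during inspections, is one way to keep the list of overdue parts current. Design life is only a first-order indicator, so the regular testing and evaluation described above is still needed.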
Risk 4: Improper operation and maintenance habits
The main cause of this risk is insufficient professional skill among operation and maintenance personnel. Operation and maintenance work also needs a complete system and management process to support and standardize it. When the structure and organization of that system is missing or unreasonable, the work may slip out of control, and its quality and soundness cannot be evaluated effectively. Of these factors, raising the technical level of operation and maintenance personnel is an important determinant of how effective operation and maintenance will be.
Users are advised to view operation and maintenance from the perspective of the data center's full life cycle and to handle several key links well: third-party acceptance, the handover from construction to operation and maintenance, the operation and maintenance process, and technical training for operation and maintenance personnel.
Risk 5: Emergencies that exceed expectations
Data centers are planned, designed, and constructed on the basis of assumed environmental boundaries. These assumptions usually draw on relevant national regulations, inferences from historical local measurement data, and construction and operating experience from other regions. Over the long life cycle of a data center, however, events that exceed these assumptions may occur and pose major hidden dangers to its operation.
Data center users should adopt strategies in advance that match the possible scope of impact of such events. Examples include off-site disaster recovery facilities, real-time or asynchronous backup mechanisms, emergency handling procedures and drills, backup technical support systems, and spare parts warehouses of appropriate scale.