System Maintenance in BSC and mcBSC

GSM/EDGE BSS, Rel. GSM 17, Operating Documentation, Issue 03, Change Delivery 01

System Maintenance

DN0975145
Issue 2
Approval Date 2016-08-12

The information in this document applies solely to the hardware/software product (“Product”) specified herein, and only as specified herein. Reference to “Nokia” later in this document shall mean the respective company within Nokia Group of Companies with whom you have entered into the Agreement (as defined below).

This document is intended for use by Nokia's customers (“You”) only, and it may not be used except for the purposes defined in the agreement between You and Nokia (“Agreement”) under which this document is distributed. No part of this document may be used, copied, reproduced, modified or transmitted in any form or means without the prior written permission of Nokia. If You have not entered into an Agreement applicable to the Product, or if that Agreement has expired or has been terminated, You may not use this document in any manner and You are obliged to return it to Nokia and destroy or delete any copies thereof.

The document has been prepared to be used by professional and properly trained personnel, and You assume full responsibility when using it. Nokia welcomes your comments as part of the process of continuous development and improvement of the documentation.

This document and its contents are provided as a convenience to You. Any information or statements concerning the suitability, capacity, fitness for purpose or performance of the Product are given solely on an “as is” and “as available” basis in this document, and Nokia reserves the right to change any such information and statements without notice. Nokia has made all reasonable efforts to ensure that the content of this document is adequate and free of material errors and omissions, and Nokia will correct errors that You identify in this document.
Nokia's total liability for any errors in the document is strictly limited to the correction of such error(s). Nokia does not warrant that the use of the software in the Product will be uninterrupted or error-free.

NO WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY OF AVAILABILITY, ACCURACY, RELIABILITY, TITLE, NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, IS MADE IN RELATION TO THE CONTENT OF THIS DOCUMENT. IN NO EVENT WILL NOKIA BE LIABLE FOR ANY DAMAGES, INCLUDING BUT NOT LIMITED TO SPECIAL, DIRECT, INDIRECT, INCIDENTAL OR CONSEQUENTIAL OR ANY LOSSES, SUCH AS BUT NOT LIMITED TO LOSS OF PROFIT, REVENUE, BUSINESS INTERRUPTION, BUSINESS OPPORTUNITY OR DATA THAT MAY ARISE FROM THE USE OF THIS DOCUMENT OR THE INFORMATION IN IT, EVEN IN THE CASE OF ERRORS IN OR OMISSIONS FROM THIS DOCUMENT OR ITS CONTENT.

This document is Nokia proprietary and confidential information, which may not be distributed or disclosed to any third parties without the prior written consent of Nokia.

Nokia is a registered trademark of Nokia Corporation. Other product names mentioned in this document may be trademarks of their respective owners.

Copyright © 2017 Nokia. All rights reserved.

Important Notice on Product Safety

This product may present safety risks due to laser, electricity, heat, and other sources of danger. Only trained and qualified personnel may install, operate, maintain or otherwise handle this product and only after having carefully read the safety information applicable to this product. The safety information is provided in the Safety Information section in the “Legal, Safety and Environmental Information” part of this document or documentation set.

Nokia is continually striving to reduce the adverse environmental effects of its products and services. We would like to encourage you as our customers and users to join us in working towards a cleaner, safer environment.
Please recycle product packaging and follow the recommendations for power use and proper disposal of our products and their components. If you should have questions regarding our Environmental Policy or any of the environmental services we offer, please contact us at Nokia for any additional information.

Table of Contents

This document has 67 pages

Summary of changes
1 Introduction
2 System maintenance
2.1 System maintenance overview
2.1.1 Maintenance tasks
2.1.2 System maintenance concept
2.2 System supervision
2.2.1 Hardware supervision
2.2.2 Software supervision
2.2.3 Time supervision
2.3 Alarm system
2.3.1 Alarm functions
2.3.1.1 Collection of alarms
2.3.1.2 Saving of alarms
2.3.1.3 Output of alarms
2.3.1.4 Control of alarm outputs indicating general alarm situation
2.3.1.5 Startup of recovery functions
2.3.1.6 Informing about alarms
2.3.2 Implementation of alarm functions
2.3.3 Structure of the alarm system
2.3.4 Classification of alarms
2.3.5 Structure of the alarm printout
2.4 Recovery system
2.4.1 Functional unit working states
2.4.2 Functional unit hierarchy
2.4.3 Redundancy models
2.4.4 State transitions
2.4.5 Working state administration
2.4.6 Implementation of recovery functions
2.5 Diagnostics
2.5.1 Diagnostics procedures in normal and special failure situations
2.5.2 Changes in unit states during a diagnosis
2.5.3 Initial conditions for diagnoses
2.5.4 Fault situations of the diagnostic system
2.5.5 Total and partial unit tests
2.6 General maintenance procedures
2.6.1 Actions before a system restart
2.6.2 CCS7 30 Minutes freezing done
2 Appendix: Terminologies
3 BCSU/BCXU Recovery and Alarms 690, 691 and 1001
4 Expiry of licenses
5 PCU2 serial port usage
6 PCU plug-in units without unique MAC HW address
7 Recommendations for PCU2 Black Box Saving State

List of Figures

Figure 1 Fault management principles
Figure 2 Implementation of the alarm system in classic NE
Figure 3 Implementation of the alarm system in ATCA
Figure 4 Implementation of the alarm system in BCN
Figure 5 Alarm printout fields
Figure 6 STMU - hierarchical unit
Figure 7 ATCA hierarchical unit
Figure 8 Example of the BCN hierarchical unit
Figure 9 Time slot based units (units that have no redundancy or complementary N+1 redundancy units)
Figure 10 Other backed-up units (2N or replaceable redundant N+1 units)
Figure 11 State transitions of I/O devices
Figure 12 State changes in computer units with no redundancy
Figure 13 State changes in units with 2N redundancy
Figure 14 State changes in computer units with replaceable N+1 redundancy
Figure 15 State changes in functional units with complementary N+1 redundancy
Figure 16 Structure of the recovery function
Figure 17 Changes in unit states
Figure 18 Interdependencies between partial diagnoses in processor diagnostics
Figure 19 Interdependencies between partial diagnoses in switching network diagnostics
Figure 20 Daily maintenance routines
Figure 21 Weekly maintenance routines
Figure 22 Monthly maintenance routines
Figure 23 Six-monthly maintenance routines
Figure 24 Yearly maintenance routines
Figure 25 External equipment maintenance routines

List of Tables

Table 1 Alarm numbering
Table 2 Reserved number for possible external alarms
Table 3 Working states of the functional units
Table 4 Relationship between unit working states and the interface setting
Table 5 Relationship between unit working states and wired switch over control
Table 6 Relationship between unit working states and blocking of time-slots
Table 7 Relationship between unit working states and logical addresses
Table 8 Switching network diagnostics
Table 9 Processor diagnostics
Table 10 Validity
Table 11 Validity
Table 12 Terminologies in classic DX 200, ATCA and BCN
Table 13
Table 14 Validity

Summary of changes

Changes between document issues are cumulative. Therefore, the latest document issue contains all changes made to previous issues.
Changes between issues 2 (2016/08/12, GSM16) and 3 (2017/04/26)

The following technical support notes are added:
• PCU2 serial port usage
• PCU plug-in units without unique MAC HW address
• Recommendations for PCU2 Black Box Saving State

Changes between issues 1–3 (2011/11/18, RG20(BSS)) and 2 (2016/08/12, GSM16)

The following sections are added:
• Actions before a system restart
• CCS7 30 Minutes freezing done

Changes made between issues 1–3 (2011/11/18, RG20(BSS)) and 1–2 (2011/10/25, RG20(BSS))

The following chapters have been renamed:
• Fault management to Alarm system
• Diagnostics and testing to Diagnostics

The following chapter has been removed:
• Recovery and unit working state

The following chapters have been added:
• System supervision
• Recovery system
• General maintenance procedures

Changes made between issues 1–2 (2011/10/25, RG20(BSS)) and 1–1 (2011/09/13, RG20(BSS))

• Removed alarm 71087 from the list of mcTC internal supervision alarms mapped to BSC alarm 3594.

Changes made between issues 1–1 (2011/09/13, RG20(BSS)) and 1–0 (2011/08/05, RG20(BSS))

• Title changed to System Maintenance in BSC and mcTC due to the introduction of the multicontroller transcoder (mcTC).
• Chapter Introduction added to describe the purpose of the document and how to use it.
• Added section System Maintenance in mcTC to describe the procedures used in maintaining mcTC.

1 Introduction

This document provides information on the different mechanisms implemented in the BSC which increase the system's tolerance in fault situations.

Section System maintenance in BSC and mcBSC describes the main areas of system maintenance in the BSC: system supervision, alarm system, recovery system, and fault location system. It provides details of alarm classification and structure.
The different redundancy models and state information of the different network elements are also described. Details of system diagnostics and testing, which are used to test the system and diagnose faults, are provided as well.

2 System maintenance

2.1 System maintenance overview

System maintenance consists of different mechanisms which increase the system’s fault tolerance in order to guarantee high availability and required performance in fault situations.

System maintenance activities can usually be performed remotely, except when the hardware needs to be replaced. To be able to receive all the information you need to carry out maintenance activities remotely, you may have to change some logical file outputs, for example, the diagnostic and observation outputs.

2.1.1 Maintenance tasks

System maintenance consists of the following tasks:
• preventive maintenance
• fault control
• fault correction

The need for preventive maintenance actions is limited and is mainly associated with peripherals in a digital system. Fault control is an automatic function and is described in detail in section Fault location system. Maintenance tasks are based on alarm and diagnosis reports generated by the system.

The following tasks cannot be done in a remote session:
• replacing failed or suspected blades and other hardware units
• performing any O&M commands in case of a severe OMU failure or a failure in the data transmission network
• all control operations done by using blade switches, which require special on-site aids
• on-site maintenance actions on the terminals, if there are any, such as printers
• any physical operations for installing expansions
• performing some service terminal commands, like running programs step by step

The majority of maintenance terminal commands and all MML commands can be executed in a remote session.

Note: Service terminal is mostly used for software troubleshooting and debugging.
System maintenance operation can be controlled with MML commands.

2.1.2 System maintenance concept

The concept of system maintenance is presented in the figure below.

Figure 1 Fault management principles
[Figure: the supervision system, alarm system, recovery system, and fault location system in sequence. Labels: failure, fault detection, alarm message, alarm printouts, updating of failure information, activation of recovery, fault elimination, activation of fault location, fault location.]

System supervision

The system supervision is responsible for fault detection and uses both hardware and software for this purpose. Supervision functions have been dedicated to certain supervised objects:
• Hardware supervision relies on the equipment database and routine tests executed on hardware components. It also relies on hardware configuration data stored in the configuration database.
• Software supervision detects loss of control in software. The supervision is based on watchdog timers and supervision messages.
• Real-time supervision keeps the units of a system, and the systems in the O&M network, set to the same time.

When the supervision system detects a fault, it issues a fault or disturbance observation.

Alarm system

The alarm system handles the fault and disturbance observations that occur in the network element and in the remote objects controlled by the network element.

The alarm system
• analyses fault information received from different sources. These are hardware alarms and fault observations from program blocks and preprocessors.
• makes decisions on the basis of predefined rule bases, and starts up automatic recovery functions when necessary.
• stores alarm data in an alarm log file.
• informs the user by means of alarm printouts and alarm LEDs.
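The four responsibilities listed above can be pictured with a small sketch. It is only an illustration under stated assumptions: the class and field names, the unit names, the alarm numbers, and the shape of the rule base are all hypothetical and do not describe the real alarm system program blocks.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    unit: str      # functional unit that reported the observation (hypothetical name)
    kind: str      # "fault" or "disturbance"
    number: int    # alarm number (placeholder value, not a real alarm)


@dataclass
class AlarmSystem:
    # Hypothetical rule base: alarm numbers for which recovery is started.
    recovery_rules: set = field(default_factory=set)
    log: list = field(default_factory=list)                # stands in for the alarm log file
    printouts: list = field(default_factory=list)          # stands in for alarm printouts
    recovery_requests: list = field(default_factory=list)  # units handed over to recovery

    def handle(self, obs: Observation) -> None:
        # Store the alarm data.
        self.log.append(obs)
        # Inform the user.
        self.printouts.append(f"{obs.unit} {obs.kind.upper()} {obs.number}")
        # Decide on the basis of the rule base whether recovery is needed.
        if obs.kind == "fault" and obs.number in self.recovery_rules:
            self.recovery_requests.append(obs.unit)


alarm_system = AlarmSystem(recovery_rules={9001})
alarm_system.handle(Observation("UNIT-A", "fault", 9001))
alarm_system.handle(Observation("UNIT-B", "disturbance", 9002))
```

In this sketch both observations are logged and printed, but only the fault that matches a rule leads to a recovery request.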
The system has the following alarm functions:
• collection of alarm data
• storage of alarms
• output of alarms
• control of alarm outputs
• activation of recovery functions when a unit fails
• informing about alarms

Alarm functions have a user interface with which you can set alarm parameters, examine the alarm situation and alarm history, and define handling rules for new events.

Recovery system

The recovery system
• eliminates the effects of faults by utilizing the redundancy of the hardware. Processor and preprocessor restarts are also used.
• controls the restarting of the system so that restarts are carried out in the correct order quickly and reliably.
• starts up automatic fault location.

For a description of redundancy models, see Fault management.

Recovery includes the following functions:
• restarting program blocks, computers, preprocessors, and the whole system
• automatic recovery from faults
• a user interface, for example, for manual recovery

Recovery starts diagnostic programs automatically. Diagnostic programs are executed on functional units which are in the test state (TE-EX). The same programs can be activated with the commands of the UD MML command group.

Fault diagnosis (Diagnostics) relies on the hardware configuration data which is stored in the configuration database. Nonconformity between the database and the actual hardware situation usually leads to a diagnostic report. In this case, the fault localization functions may not be able to complete their task properly.

Fault location system

The fault location system
• locates faults in the system hardware.
• complies with the ITU-T requirement for the average active repair time of 30 minutes.
• uses hardware configuration data to determine which diagnostic programs should be executed and which configuration parameters should be used.
The data is also used for determining which blades should be suspected as faulty when a fault has been observed.
• informs the user by means of alarm printouts and alarm LEDs.

2.2 System supervision

System supervision functions

System supervision is a function group which consists of the following elements:
• hardware supervision
• software supervision
• time supervision

2.2.1 Hardware supervision

Hardware supervision is based on routine tests and on continuous supervision executed as a background process. Hardware supervision is divided into subfunctions on the basis of hardware types.
• supervision of microcomputers
Executed in all computer units as a background process so that the normal operation of the unit is not disturbed.
• supervision of preprocessors
In some preprocessor blades, supervision is executed by a control program belonging to the SUVSEB (System Supervision) service block. This supervision typically includes testing a blade’s read-only memory and the checksum supervision of the blade software. Depending on the blade, the program memory and working order of the signal processors can also be supervised.
• supervision of switching networks
The supervision of all switching networks is executed as a background process, without disturbing the call traffic and by using the testing properties integrated into the switching network.
• supervision of time-slot-based units
Detects possible hardware faults in the units and in the switching network between the unit and the maintenance computer. An alarm is generated when a fault is detected. If all the tested time slots of the unit are detected as faulty (it is always possible to test at least two time slots), an alarm is set about a failure in the whole unit.
• supervision of MFST unit
Covers the time slot connections configured to the MFST units and the operation of the signal-handling functions implemented by the blades.
The supervision is run as a constant background process, with tests depending on the configuration of the blade.

2.2.2 Software supervision

Software supervision reveals fault conditions in which control over the software is lost. The program block supervision is based on watchdog timers and special supervision messages. Every control processor and preprocessor in the system must set its own watchdog timer at regular intervals; otherwise the hardware restarts the processor.

Program block supervision requires fast reaction capability, and thus the message-based supervision of the program block has been separately implemented for each unit as a part of the recovery function. If a program block does not respond to the supervision message during the given time, it is restarted.

The control computers supervise each other according to the supervision hierarchy. Each preprocessor is supervised by its master computer unit. These supervisions also form part of the recovery function.

2.2.3 Time supervision

Time supervision can be used in three different ways:

Time supervision using Operation and Maintenance network

Executed in a hierarchical manner so that the network element functioning as the main maintenance center on the Operation and Maintenance (O&M) network supervises the time in the other systems of the network. The OMU supervises the time in the units that the system uses.

By using a delivery-specific parameter, you define whether the network element supervises the times of the other network elements and whether it automatically corrects the time differences in those network elements which cause too wide a diversity in the times of the network.

If the times of the main clock units differ from each other by more than 100 ms, or if the main clock units do not reply to time inquiries, supervision is not carried out. Instead, the operating personnel are called in to check the situation.
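The precondition in the last paragraph can be written as a small decision function. This is a minimal sketch, not the network element's implementation: the function name, the millisecond integer representation, the use of `None` for a missing reply, and the return labels are assumptions. Only the 100 ms limit and the two conditions come from the text.

```python
from typing import Optional

MAX_MAIN_CLOCK_DIFF_MS = 100  # limit stated in the text


def time_supervision_action(clock_a_ms: Optional[int], clock_b_ms: Optional[int]) -> str:
    """Decide whether time supervision can run, given the two main clock readings.

    A reading of None models a main clock unit that did not reply to the
    time inquiry (hypothetical representation).
    """
    if clock_a_ms is None or clock_b_ms is None:
        return "call_personnel"  # no reply: supervision is not carried out
    if abs(clock_a_ms - clock_b_ms) > MAX_MAIN_CLOCK_DIFF_MS:
        return "call_personnel"  # main clocks disagree by more than 100 ms
    return "supervise"
```

A difference of exactly 100 ms still allows supervision in this sketch, because the text says "more than 100 ms".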
Time supervision using NetAct or NMS

NetAct/NMS can be configured to supervise the calendar time of every network element in an operator’s network. This is an alternative to using the O&M network for this purpose. NetAct/NMS can be configured either a) to inform only, or b) to inform and correct the calendar time deviation in the network element. In other respects, the internal supervision of the calendar time is similar to that of the O&M network.

Simple Network Time Protocol (SNTP)-based calendar time management

It is possible to configure a network element to use an NTP or SNTP server as the calendar time reference. In this case, the network element operates as an SNTP client. The SNTP client maintains the calendar time of the network element’s main clock. The address of the calendar time reference, the supervision period, and the minimum corrected deviation are configurable. The SNTP client in the network element informs the user if it cannot connect to the calendar time reference. In other respects, the internal supervision of the calendar time is similar to that of the O&M network.

By using a delivery-specific parameter, you define whether the network element is an SNTP client only or an SNTP client and server.

2.3 Alarm system

The alarm system handles the fault and disturbance observations that occur in the network element and in the remote objects controlled by the network element. It is part of the network element’s maintenance system.

Both the supervision system and normal application programs detect the faults in their operational environment. When they do, they issue a fault or disturbance observation. Alarms can be caused by both hardware and software.

If possible, the alarm system specifies the functional unit in which the fault or disturbance has occurred, and the recovery of the unit can be started. The user is informed of the fault situation through alarm printouts.
2.3.1 Alarm functions

2.3.1.1 Collection of alarms

Properties

The following alarm data is collected in different ways:
• hardware alarms
• fault observations given by the program blocks
• fault observations given by the preprocessors
• base station alarms

Hardware alarms

Supervision logic has been integrated into different types of hardware. The supervision logic can inform the alarm system about malfunctions in the hardware. These devices include:
• power sources
• blades which need basic timing signals for their timing

In addition, the following are supervised by the hardware:
• the fuses and adapters of the power supply of the cabinet
• the premises

In a system implemented by using shelf construction, the hardware alarms are wired to the AAL, EAL, or SSAI blades. The alarm situation is read from the blades by software and transmitted to the alarm system for normal processing.

In a system implemented by using shelf construction, the hardware alarms of a shelf are wired to the CLAC shelf in each cabinet, from which they are read by the HWAT or CLAB blade. From the HWAT and CLAB, the alarms are transmitted by software to the alarm system for normal processing.

The alarms wired in the remote subscriber stage are collected in the RSAI blade, and then transmitted by software to the alarm system for handling.

Fault observations of the program blocks

The observations can be transient disturbances detected in the system, ON/OFF-type fault conditions, or updates of the error ratio counters. The error ratio counters are used to indicate statistical errors. They can be used to observe errors greater than 0.002 percent (1/65535).
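The 0.002 percent figure follows directly from the ratio 1/65535, as the short calculation below shows. The helper names are hypothetical, and treating 65535 as the largest value of a 16-bit counter is an inference from the number itself, not something the text states.

```python
COUNTER_MAX = 65535  # assumed to be the largest value of a 16-bit counter


def error_ratio_percent(errors: int, events: int) -> float:
    """Return the error ratio as a percentage of all counted events."""
    if events <= 0:
        raise ValueError("events must be positive")
    return 100.0 * errors / events


def exceeds_alarm_limit(errors: int, events: int, limit_percent: float) -> bool:
    """Compare the measured ratio with a configurable alarm limit (hypothetical helper)."""
    return error_ratio_percent(errors, events) > limit_percent


# Smallest non-zero ratio that can be expressed: one error in 65535 events.
smallest_ratio = error_ratio_percent(1, COUNTER_MAX)  # about 0.0015 %, quoted as 0.002 %
```

One error in 65535 events is about 0.0015 percent, which the document rounds to 0.002 percent.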
The alarm limit of the error ratio is not visible to the application program block.

Capacity

Collecting alarm data has been synchronized in such a way that the distributed part of the alarm system functioning in each functional unit can send one alarm event (confirmation or cancellation of a hypothesis) at a time to the centralized part of the alarm system.

User interface

Alarms are usually transferred to the NMS systems (for example, NetAct) via the Q3 interface, and the NMS systems provide alarm handling functions similar to those on the OMU.

In a system implemented by using the shelf construction, the user can define hardware alarms on the basis of the equipment. Commands are used to add and remove alarm interface blades, to open and close alarm inputs, and to define new external alarms.

New external alarms can be defined with a user command. The alarm limit of the error ratio counters can be changed by using the error ratio counter handling commands.

Fault conditions

The system sets an alarm if unknown fault observations or file errors in the alarm system are detected.

2.3.1.2 Saving of alarms

Properties

All the alarms in a network element are saved in binary-coded format in the alarm system’s log file. The log file is a ring buffer file stored on the disk. The amount of data to be saved can be affected by changing the size of the file on the disk.

Capacity

The storage capacity depends only on the size of the log file on the disk. Observations are not updated on the disk one by one; they are first buffered in the maintenance computer.

User interface

The alarm history handling commands are used to examine the alarms generated by the system and their cancellations, as well as the current alarm situation. The parameters given in the commands are the object unit of the alarm, the alarm number or the urgency level of the alarm, and the time.
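The ring buffer behaviour of the log file and the filtering by unit, alarm number, urgency level, and time can be illustrated in memory. This is a sketch only: the record fields, the integer encoding of time and urgency, and the class names are assumptions, and the real log is a binary-coded file on disk rather than a Python `deque`.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AlarmRecord:
    time: int      # timestamp in arbitrary units (hypothetical encoding)
    unit: str      # object unit of the alarm
    number: int    # alarm number (placeholder values below)
    urgency: int   # urgency level (hypothetical integer encoding)


class AlarmLog:
    """In-memory stand-in for the ring buffer log file."""

    def __init__(self, capacity: int) -> None:
        # With maxlen set, the oldest record is dropped when the buffer is full.
        self._records = deque(maxlen=capacity)

    def save(self, record: AlarmRecord) -> None:
        self._records.append(record)

    def query(self, unit: Optional[str] = None, number: Optional[int] = None,
              urgency: Optional[int] = None, since: Optional[int] = None) -> List[AlarmRecord]:
        # Each filter is applied only when the corresponding parameter is given.
        return [
            r for r in self._records
            if (unit is None or r.unit == unit)
            and (number is None or r.number == number)
            and (urgency is None or r.urgency == urgency)
            and (since is None or r.time >= since)
        ]


log = AlarmLog(capacity=3)
for t, unit, number in [(1, "UNIT-A", 100), (2, "UNIT-B", 200),
                        (3, "UNIT-A", 300), (4, "UNIT-A", 100)]:
    log.save(AlarmRecord(time=t, unit=unit, number=number, urgency=1))
```

With a capacity of three, the fourth record overwrites the oldest one, which mirrors how the amount of retained history depends only on the size of the log file.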
Fault conditions

If the creation or disk update of the log file fails, an alarm is generated.

2.3.1.3 Output of alarms

Properties

When the system detects a fault, an alarm is immediately printed out. The alarm system itself does not see the output device, but rather directs the alarm output to the predefined logical output files. The logical output files are directed to the physical output devices in the IOESEB (I/O Services) service block, or alternatively, to another logical output file.

The layout of the alarm printouts is presented in the Structure of the alarm printout section.

Alarms can also be output later from the alarm system’s log file by using the commands of the AH command group.

The system automatically outputs the alarms in the ON state at times defined by the operating personnel, at most three times per 24 hours.

Capacity

Only one alarm output task can be sent to the Text Forming Service (TXFSEB) service block for formatting at a time. For this reason, the alarm system must buffer its output tasks.

User interface

By using the general commands of the IOESEB, the user can link the logical files used by the alarm system to the desired output devices.

The administration commands for the alarm system are used to prevent or to allow the output of alarms entirely on the basis of the alarm number. The output of alarms and their sending to the Q3 core can be prevented and allowed on the basis of the alarm class or alarm number. In addition, the commands can be used to change the times when the currently active alarms are output again.

Fault conditions

An alarm is generated when file errors occur.

2.3.1.4 Control of alarm outputs indicating general alarm situation

Properties

In the ATCA system, there are alarm outputs from the Operation and Maintenance Unit (OMU) and from the Message Switch (MSW) in the network element.
The overall alarm situation in the network element is set into the OMU’s alarm outputs. The number for an alarm output, which is controlled on the basis of an individual alarm, is determined on the basis of the device type and urgency level of the alarm. The alarm outputs of the MSW and OMC show the overall alarm situation in the whole network beneath them, and the overall alarm situation in the network elements themselves. An alarm thus controls the same alarm output in the MSW and OMC as in the lower network element. The overall alarm situation in the network element, excluding base station alarms, can be set in the OMU’s alarm outputs. Depending on the delivery involved, part of the alarm outputs can be reserved for other controls.

Capacity

The hardware allows the use of 16 alarm outputs.

Parameters

By using a delivery-specific parameter, the alarm outputs can be divided into those that are controlled dynamically by the ALARMP, Alarm System Program Block (Centralized Part) on the basis of the alarm situation, and into the permanent ones which are set by the commands of the AL command group. Normally, all alarm outputs are used by the ALARMP. A delivery-specific parameter can be used to define whether the lamp panel buzzer becomes active only when the first alarm controlling some alarm output occurs, or every time a new alarm controlling some alarm output occurs.

User interface

The lamp panel handling commands are used to modify and output the state of the lamp panel (that is, of alarm outputs) and its control data. The control data determines which of the alarm outputs in the alarm interface blade equipped in the OMU is updated on the basis of the alarm situation in the network element.
In addition to the local alarm outputs, the update can be directed to the alarm outputs of other systems. Normally, the state of the lamp panel is updated in the system’s own lamp panel and also in the lamp panel of the operation and maintenance center (generally the MSW or OMC). The state of the RSU lamp panel can be updated to the parent exchange. The alarm parameter handling commands are used to modify alarm outputs controlled by alarms.

2.3.1.5 Startup of recovery functions

Properties

When the alarm system detects a fault in the hardware or software on the basis of fault and disturbance observations, it reports the fault to the operating personnel and, if necessary, to the Recovery service block (RCYSEB). The RCYSEB is informed when the fault can be localized into a particular functional unit and the fault disturbs the normal operation of the unit. When the RCYSEB receives the notice, it removes the unit from use. The operating personnel and the RCYSEB are also informed when the fault situation is over.

User interface

Whether the recovery functions are started on the basis of a given observation depends on the supporting amount of the observation. If necessary, change the supporting amount with the ARM command.

2.3.1.6 Informing about alarms

Properties

The alarm system provides a service which allows the application programs to be informed of the desired alarms and of their cancellations. An application program can request that an informing rule concerning a given alarm be added to the informing rule base of the alarm system. The alarm system then informs the program block defined in the rule about the settings and cancellations of the alarm in question. The application program can also request that a rule it has previously added be removed from the rule base. By using this service, an alarm can be connected to start up an MML command sequence defined in the command calendar.
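The informing service described above behaves like a subscription list keyed by alarm number. The following is a minimal sketch under stated assumptions: the class InformingRuleBase, its method names, and the "SET"/"CANCEL" event strings are hypothetical and are not the interface of the alarm system itself.

```python
from collections import defaultdict


class InformingRuleBase:
    """Hypothetical rule base: alarm number -> program blocks to inform."""

    def __init__(self):
        self._rules = defaultdict(list)

    def add_rule(self, alarm_number, subscriber):
        """Ask to be informed about settings and cancellations of an alarm."""
        if subscriber not in self._rules[alarm_number]:
            self._rules[alarm_number].append(subscriber)

    def remove_rule(self, alarm_number, subscriber):
        """Remove a rule that was added earlier; unknown rules are ignored."""
        subscribers = self._rules.get(alarm_number, [])
        if subscriber in subscribers:
            subscribers.remove(subscriber)

    def inform(self, alarm_number, event):
        """Inform every subscriber of a setting or a cancellation."""
        for subscriber in list(self._rules.get(alarm_number, [])):
            subscriber(alarm_number, event)


received = []


def calendar_trigger(alarm_number, event):
    # Stand-in for a program block that would start a command sequence.
    received.append((alarm_number, event))


rules = InformingRuleBase()
rules.add_rule(2692, calendar_trigger)
rules.inform(2692, "SET")
rules.inform(3438, "SET")       # no rule for this alarm, nobody is informed
rules.inform(2692, "CANCEL")
rules.remove_rule(2692, calendar_trigger)
rules.inform(2692, "SET")       # rule removed, nobody is informed
print(received)
```

Only the two events that occurred while the rule was in the rule base reach the subscriber.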
User interface

The commands of the IC command group are used to connect an MML command sequence to an alarm.

2.3.2 Implementation of alarm functions

The figures below show the implementation of the alarm system on classic NE, ATCA and BCN.

Alarm functions in classic NE

Figure 2 Implementation of the alarm system in classic NE
Labels in the figure: OMU, NMSSEB, files in centralized part, AOPINH, FLSSEB, ABLHAN, ALARMP, ACAHAN, handling of MML commands, IOESEB, AHIHAN, ALPHAN, RCYSEB, APAHAN, APRHAN, TXFSEB, ARBHAN, handling of alarms, LAMP PANEL, HWALCO, MMSSEB, ERCHAN, preprocessor and unit, STM1, ETIP, AMSSEB, LinDXUnit, ASYLIB, LAPSEB, DPALAR, QALARM, lalarm, ETPx, QALIBR, liblnx_lalarm.

Alarm functions in ATCA

Figure 3 Implementation of the alarm system in ATCA
Labels in the figure: OMU, NMSSEB, files in centralized part, AOPINH, FLSSEB, ABLHAN, ALARMP, ACAHAN, handling of MML commands, IOESEB, AHIHAN, ALPHAN, RCYSEB, APAHAN, APRHAN, TXFSEB, ARBHAN, handling of alarms, HMZSEB, HW0ALA, MMSSEB, ERCHAN, LinDXUnit, AMSSEB, DPALAR, ASYLIB, lalarm, liblnx_lalarm.

Alarm functions in BCN

Figure 4 Implementation of the alarm system in BCN
Labels in the figure: OMU, NMSSEB, files in centralized part, AOPINH, FLSSEB, ABLHAN, ALARMP, ACAHAN, handling of MML commands, IOESEB, AHIHAN, ALPHAN, RCYSEB, APAHAN, APRHAN, TXFSEB, ARBHAN, handling of alarms, HW0ALA, HMZSEB, MMSSEB, RWALAR, ERCHAN, AMSSEB, DPALAR, ASYLIB, LinuxUnit (e.g. PCUM, ETME).

Alarm system (AMSSEB)

The alarm system receives inputs (fault and disturbance observations) from application program blocks belonging to different service blocks, and handles these on the basis of the parameter files and the dynamic state of the system. The observations are handled at two levels.
The Alarm System Program Block, Distributed Part (DPALAR) first handles the observations at the unit level, including any other fault and disturbance observations which have been previously set in the unit concerned and which are still valid. The unit-level handling takes place in the unit where the application program block that set the observation is located; when setting the observation, the services provided by the ASYLIB library are used. For observations sent by the preprocessors, the unit-level handling is implemented in the control computer of the preprocessor unit concerned.

The Alarm System Program Block, Centralized Part (ALARMP) handles the observations sent by the DPALAR at system level, taking into account the fault situation of the entire system (that is, of all units).

As a result of the handling, the alarm system generates an alarm of a detected fault to the user, activates recovery actions for the faulty functional unit, and if necessary, updates the alarm outputs. The alarm events are stored in a log file maintained in the centralized part of the system. The log file is a ring buffer file on the disk.

The user interface (the alarm system’s MML programs) provides an interface for the output of the alarm situation and alarm history, and for examining and modifying the parameters and handling rules of the alarm system.

File services (FLSSEB)

The file services are used for disk updates, distribution and loading of files.

I/O Services (IOESEB)

The services of the IOESEB are used in operations aimed at the Alarm System Log File (ALHIST) located on the disk.

Recovery (RCYSEB)

The alarm system notifies the recovery about a functional unit that has been detected faulty on the basis of fault and disturbance observations. Based on this information, the recovery performs recovery actions for the functional unit.
The recovery notifies the alarm system that the maintenance computer has changed. Based on this information, the Alarm System Program Block, Centralized Part (ALARMP) performs the actions required when changing the maintenance computer.

O&M Network Management (NMSSEB)

The alarm output update messages are sent in the O&M network by using the services provided by the NMSSEB service block.

Text Forming Service (TXFSEB)

The text forming service is used to form and output alarms on the local alarm output device and to form alarms to be sent to the O&M Centre through the Q3 interface.

MMI System (MMSSEB)

The interface to the MMI system informs the Command Calendar Program Block (COKALE), which belongs to the MMSSEB service block, about the occurrence of certain alarms. The COKALE activates a command calendar sequence on the basis of these alarms.

PCM-based Internal Data Transfer (ACOSEB)

The alarm messages received from the SMU preprocessor units and the acknowledgement messages sent to them are transmitted by using the ACOSEB service block.

LAPD Protocols

The alarm messages received from the preprocessor units connected to the network element through a D-channel connection, and the acknowledgement messages to be sent to them, are transmitted by using a service provided by the LAPSEB service block.

Alarm handling subsystem on Linux-based computer units (lalarm)

Applications running on Linux-based computer units, like ETIP1-A, can set/reset alarms by using library liblnx-lalarm. The alarms are handled by subsystem lalarm.

Alarm handling subsystem on Chorus-based computer units (QALARM)

Applications running on Chorus-based computer units, like STM-1/OC3, can set/reset alarms by using library QALIBR. The alarms are handled by subsystem QALARM.

2.3.3 Structure of the alarm system

The alarm system consists of a distributed and a centralized part.
The distributed part of the system handles, on the unit level, the fault and disturbance observations set by the application program blocks, and the cancellations of fault observations. The distributed part sends the hypotheses that have become certain on the unit level, and their cancellations, to the alarm system’s centralized part located on the maintenance computer of the system. The centralized part post-processes the data received. The distributed part also transmits the notices set by the application program blocks to the centralized part.

The centralized part processes hypotheses, cancellations of hypotheses, and notices received from the distributed part. The centralized part provides information for the user by producing alarm printouts and activating lamp panel controls. If necessary, it also starts the automatic recovery actions. Moreover, the centralized part saves the alarm history data.

2.3.4 Classification of alarms

Severity

The user is responsible for repairing some of the faults. Such faults involve, for instance, changing a faulty blade. The asterisks in the alarm printout indicate the alarm urgency level and whether an alarm requires user actions. The alarm urgency level is displayed for all alarm printouts, except for notices (NOTICE).

***  Alarms marked with three asterisks require immediate actions from the user. An alarm like this is set when the system has become faulty to the extent that the functionality important to the operator has stopped, or is in danger of stopping. The maintenance personnel must take immediate action.

**   An alarm marked with two asterisks does not threaten the operation of the system, but if the fault occurs during working hours, it must be corrected at once. If the fault occurs outside working hours, it can be repaired the next day.

*    A transient disturbance or an alarm marked with one asterisk does not usually require user actions.
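The urgency rules above can be summarized as a small lookup. This is only an illustrative sketch: the function name required_action and the returned strings are assumptions, not part of the alarm system.

```python
def required_action(urgency, during_working_hours=True):
    """Map an urgency marker from an alarm printout to the expected response."""
    if urgency == "***":
        # Important functionality has stopped or is in danger of stopping.
        return "act immediately"
    if urgency == "**":
        # Correct at once in working hours, otherwise the next day is enough.
        return "act immediately" if during_working_hours else "repair the next day"
    if urgency == "*":
        return "no action normally required"
    raise ValueError(f"unknown urgency marker: {urgency!r}")


print(required_action("***"))
print(required_action("**", during_working_hours=False))
print(required_action("*"))
```

Notices carry no urgency marker, so they are deliberately not handled by this helper.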
Numbering

The alarms have been divided into four groups. An alarm description always contains the following parts:

Alarm number
   Decimal number on the upper corner of the page.

Alarm text
   The text in the alarm printout.

Meaning
   Verbal explanation of the reason for the alarm and of its impact.

Supplementary information fields
   Interpretation of supplementary information fields, if there are any. The supplementary fields are indexed from 1 to 32 from left to right.

Instructions
   There are instructions for all alarms with urgency level ** or ***. Instructions can also be given for alarms or disturbances with the urgency level *, but normally these require no actions from the operating personnel. The instructions given here are in written form. If the instructions are given in the form of an operating sequence, they are presented in a document of their own. The electronic documentation gives the operating sequence here.

Cancelling
   Information on whether the operating personnel should cancel the alarm or whether the system does it after the fault situation is over.

The descriptions of diagnosis reports contain the same parts as the alarm descriptions. However, a diagnosis report is never cancelled.

The alarms are numbered in ascending order as follows:

Table 1 Alarm numbering

Alarm                             Number ranges
Notices (NOTICE)                  1 - 999
Disturbance printouts (DISTUR)    1000 - 1999
Failure printouts (ALARM)         2000 - 3999, 14001 - 14999, 16000 - 16999

The following numbers have been reserved for possible external alarms:

Table 2 Reserved numbers for possible external alarms

Equipment                  Number range
switching equipment        4000 - 4799
O&M equipment              4800 - 4899
transmission equipment     4900 - 4999
power equipment            5000 - 5499
external equipment         5500 - 5999

User defined alarm system parameters

You can change the following alarm-specific parameters:
• alarm class, also known as alarm urgency level
• alarm output number
• informing delay
• cancelling delay
• live time

Changing the alarm parameters requires system expertise, because it influences the operation of the system.

2.3.5 Structure of the alarm printout

Figure 5 Alarm printout fields
1 2 3 4 -5- :5: 6 7 8 9 10 11 12 13 14 15 16 17 (17) 18 (optional) 19 (optional) (19)

Note: In the figure above, the fields marked with 'optional' contain user-defined data. If no such data has been defined for an alarm, the line is not printed so that there are no empty lines in the alarm printout. According to the same principle, if an alarm does not have any supplementary information fields, the line containing supplementary information fields is not printed. The second line reserved for supplementary information fields (17) and alarm operating instructions (19) is used only if the data in question does not fit into one line.

1. Type of alarm printout
             Standard alarm printout
   <UPDT>    Alarm update printout (when printing out all live alarms at a defined time of day)
   <HIST>    Alarm history printout

2. Name of the network element

3. Computer sending the alarm

4. Alarm equipment type
   SWITCH    switching equipment
   O&M       operation and maintenance equipment
   TRANSM    transmission equipment
   POWER     power equipment
   EXTERN    external equipment
   Unknown equipment type is printed as ??????

5. Date and time
   Start or termination time of the alarm.

6. Urgency level
   ***       requires immediate actions
   **        requires actions during normal working hours
   *         normally no actions required
   The urgency level is output in all alarm printouts except notices (NOTICE).
   The urgency levels of terminated alarms are indicated by dots (.) instead of asterisks (*).

7. Printout type
   ALARM     fault situation
   CANCEL    fault terminated
   DISTUR    disturbance
   NOTICE    notice

8. Alarm object
   The functional unit which is the object of the alarm. If the alarm is not targeted to any particular object, the field displays eight dots.

9. Position coordinates of alarm object
   For Classic, position coordinates are expressed in the form RTK-V. For ATCA, position coordinates are expressed in the form RTK-V-X or RTK-V-Y-Z.
   R (1...64)       is the cabinet row
   T (A...Z)        is the cabinet
   K (001...255)    is the vertical position of the shelf
   V (00...99)      is the horizontal position of the shelf
   M                is the module type
   N                is the module number
   RTK-V is for identifying the physical location on the cabinets, cartridges and blades. If a cabinet is the object of an alarm, only RT is displayed. RTK-M-N is for indicating the location of hardware components either in a cartridge backplane or in a cartridge track.
   An unknown position coordinate is printed as ??????-??.

10. Alarm issuer
   The program block issuing the alarm. If the name of the program block issuing the alarm is not available, the family identifier of the program block is output in hexadecimal form instead of the name. If the alarm is set in a preprocessor blade, the blade name and index are output in this field. In this case, the alarm concerns the functioning of the blade in question.

11. Trial information
   If the network element has been divided into a traffic transmitting part and a trial part, this field displays the text TRIAL if the alarm was issued in the trial side.

12. Recovery information
   When recovery is informed of the alarm in order to start the automatic recovery actions, this field displays *RECOV*.

13. Processing information
   If the alarm is set before the start-up of the distributed part of the alarm system, this field displays LIB.
   Note that this kind of alarm does not stay as an active alarm, and thus there will be no cancel printout for it.

14. Consecutive number
   Failure printouts (***, **, *) are numbered in ascending order. With the help of the number, the operating personnel can follow the update and cancel printouts of the original failure printout.

15. Alarm number
   The alarm number is an unambiguous identifier for an alarm. It is also a search index for the description of the alarm.

16. Alarm text
   The alarm text is a short description of the alarm.

17. Supplementary information fields
   A maximum of 32 fields which are separated from one another by one or several spaces. The possible values of a field are:
   - a hexadecimal number (for example, 0120)
   - a decimal number (for example, 288d)
   - a BCD (binary-coded decimal) number (for example, 0288). If the number is a BCD, it is mentioned in the alarm reference manual in the explanation of the field in question.
   - a string of characters (for example, ALHISTGX)
   - a functional unit (for example, OMU)
   - a blade (for example, ACPI4-A)
   - a working state of a unit (for example, WO-EX)
   - a date (for example, 2010-02-17)
   - a time (for example, 10:21:42.19)
   - an IPv4 address (for example, 131.255.0.12)
   - an IPv6 address (for example, 1080:0:FE:0:0:255:34:FFFF)
   The value of a certain field (for example, the index of a functional unit or the index of a blade) can be displayed as two dots (..). This refers to a case where there is no single value to be given to the field according to its meaning. If the amount of supplementary information data does not match the formatting information, a question mark (?) is printed at the end of the fields.

18. Supplementary text
   A more detailed text printed out in some alarms.

19. Alarm operating instructions
   The user defines an operating instruction for an alarm with the AOA MML command. If the instruction has been defined, it is displayed in the alarm printout.
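The alarm number in field 15 can be matched against the number ranges of Table 1 and Table 2. The lookup below is a minimal sketch of that matching; the function name classify_alarm_number and the label strings are assumptions, and numbers outside the listed ranges are simply reported as unclassified.

```python
# (low, high, label) triples taken from Table 1 and Table 2.
NUMBER_RANGES = [
    (1, 999, "NOTICE"),
    (1000, 1999, "DISTUR"),
    (2000, 3999, "ALARM"),
    (4000, 4799, "external: switching equipment"),
    (4800, 4899, "external: O&M equipment"),
    (4900, 4999, "external: transmission equipment"),
    (5000, 5499, "external: power equipment"),
    (5500, 5999, "external: external equipment"),
    (14001, 14999, "ALARM"),
    (16000, 16999, "ALARM"),
]


def classify_alarm_number(number):
    """Return the group of an alarm number, or None if no range matches."""
    for low, high, label in NUMBER_RANGES:
        if low <= number <= high:
            return label
    return None


for number in (2692, 1500, 4850, 14000):
    print(number, classify_alarm_number(number))
```

Keeping the ranges in one table makes it easy to compare the code against Table 1 and Table 2 line by line.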
Example of an alarm printout

Example 1:

   Inter OMU-1 SWITCH 2009-10-26 05:27:28.61
   ** ALARM SWU-61 1A001-00-9 RCXPRO
   (0001) 2692 INCORRECT WORKING STATE
   SE-OU

Example 2:

   HORNET OMU-1 SWITCH 2009-06-15 10:45:53.90
   * ALARM SHMU-2 1A002-00-SHM-1 HW0ALA
   (1153) 3438 MINOR TEMPERATURE DEVIATION DETECTED
   ARMGR_A 01 41 FF 03 01 02
   Local Temp

Note: Example 2 exists only in ATCA and BCN network elements.

An example from the alarm documentation:

CLAB FAILURE                                                        2761

MEANING

The CLAB blade is faulty. Either the blade does not respond to the supervision messages sent to it by the control computer, or the software of the CLAB has detected a fault in its interface, in the RAM of the blade, in the phase-locked loop, in the power supply, in the changeover signal of the blade or in the basic timing bus interface of the CLAB, or the check sum of the program memory of the CLAB has changed, or the value of the cabling delay measured by the CLAB has changed. When the alarm is on, the CLAB does not function at all or its function is unreliable.

SUPPLEMENTARY INFORMATION FIELDS

1  failure specification:
   00  CLAB does not respond to supervision messages
   01  failure detected in interface testing
   02  check sum of program memory changed
   03  fault in RAM
   04  configuration contradiction (HCLTBL - SCDFLE)
   05  fault detected in phase-locked loop
   06  fault in 5V power supply
   07  fault detected by hardware
   08  fault in field changeover trial
   09  fault in the basic timing bus interface of the CLAB
   0A  cabling delay measured by CLAB has changed

2  number of SBUS; in use only when the value of supplementary information field 1 is 00
   00  CLAB is not responding to supervision messages on SBUS-0
   01  CLAB is not responding to supervision messages on SBUS-1
   02  CLAB is not responding to supervision messages in either bus

INSTRUCTIONS

Check the cabling of the cartridge, see Equipment list for cables, Site documents.
If there is nothing wrong with the cabling, the fault is likely to be in the blade. Replace the faulty blade, see instructions for Replacing hardware units. If, according to the first supplementary information field, the fault is detected in the basic timing bus interface of the CLAB and replacing the CLAB, which is the object unit of the alarm, does not remove the alarm, replace the standby CLAB.

As a result of the alarm, the recovery system starts the final diagnosis program which generates a diagnosis report in which a description of the fault related to the alarm is output. The alarm supports the activation of the automatic recovery for the object unit. You can check the supporting amount of the alarm with the MML command ARO.

CANCELLING

Do not cancel the alarm. The system cancels the alarm automatically when the fault has been corrected.

2.4 Recovery system

The recovery block controls the operating state of the functional units. The recovery functions are:
• elimination of the effects of faults
• restart control
• user interface

Faults are eliminated by using the hardware redundancy and restarts of the functional units. At a functional unit level, processes and preprocessors can also be used.

Recovery has real-time data on the states of the functional units. By using this data, it controls the restarts of the system and functional units, so that the restarts are carried out quickly and reliably in the correct order.

Redundancies and working states of the functional units are hidden from the application program blocks by using addressing. When the state data on the functional units is updated in real time, the recovery maintains a table on the basis of which the operating system is able to direct the logically addressed messages to the correct physical units.

The recovery system consists of a centralized and a distributed part.
The centralized part is situated in the OMU, or in the case of an active OMU failure, on the OMU’s spare side (in the central memory unit if the OMU is not redundant). The distributed part is located in each computer unit. The centralized part controls the recovery of the functional units as a whole, and the distributed part is responsible for the actions at a unit level.

Capacity

The recovery system is implemented so that several recovery actions can be executed simultaneously in the system.

Parameters

The unit types differ with respect to state handling. Recovery is parameterized in such a way that new unit types can be implemented in a flexible manner. These parameters cannot be changed by using commands.

Fault conditions

Since recovery operates when there is a fault condition in the system, special attention has been paid to its fault tolerance. Fault tolerance is reached, for example, by using both message buses and by repeating the attempted actions. If a recovery action fails, an alarm output is given about the situation. At regular intervals, recovery makes repeated attempts at a recovery measure on the faulty units in operation.

Recovery user interface

Use the commands of the US MML command group to
• change the state of a functional unit
• interrogate the state of a functional unit
• output the functional units in a given state
• restart a functional unit
• warm up a functional unit
• restart the system

2.4.1 Functional unit working states

The system recovers from faults most reliably if its configuration is complete. Therefore, the states have been divided into correct and incorrect ones. A ‘correct’ state is either the unit’s normal working state or a spare state from which the unit can be taken into use at any time. The correct states are WO-EX and SP-EX. As far as a complete configuration is concerned, all other states are ‘incorrect’ states.
If a unit is permanently in an incorrect state, an alarm output is given at regular intervals.

Table 3 Working states of the functional units

Main state    Sub-state    Name
WO            WO-EX        working, executing
              WO-RE        working, restart
BL            BL-EX        blocked, executing
              BL-ID        blocked, idle
              BL-RE        blocked, restart
SP            SP-EX        spare, executing
              SP-UP        spare, warm up
              SP-RE        spare, restart
TE            TE-EX        test state
SE            SE-OU        separated, out-of-use
              SE-NH        separated, no hardware

The following table lists the main states of the functional units.

Abbreviation    State
WO              working
BL              blocked
SP              spare
TE              test
SE              separated

Units are normally in the working state. A blocked unit does not accept any new tasks and, for example, the time-slot-based units are blocked from hunting. A unit in the spare state is a spare unit ready for use. A unit in the testing state is being tested and it is not performing its normal functions. A unit in the separated state is totally separated from the rest of the system.

The main states are further divided into sub-states. The state of a unit is determined by its main state and its sub-state. Two-letter abbreviations symbolize the main state and the sub-state in the interface between you and the system. When the system outputs the state, the main state and the sub-state are separated from each other with a hyphen (-).

The sub-states of the working state (WO) are:
• executing (EX)
• restart (RE)

The WO-EX state is the normal state of a unit in the working state. A unit in operation is in the WO-RE restart state when it is being restarted.

The sub-states of the blocked state (BL) are:
• executing (EX)
• idle (ID)
• restart (RE)

When a unit is in the BL-EX state, it still performs the operations not completed at the moment of blocking.
In the idle state BL-ID, all the operations which were uncompleted at the moment of blocking have been executed. In the BL-RE restart state, blocked units are restarted.

The sub-states of the spare state (SP) are:
• executing (EX)
• warm-up (UP)
• restart (RE)

The SP-EX working state is the normal state of a spare unit. It can be changed into the WO-EX state at any time. In the SP-UP warm-up state, the spare unit warms up its data to match the data of the unit in operation. In the SP-RE restart state, the spare unit is restarted.

The sub-state of the test state (TE) is:
• executing (EX)

A unit in the TE-EX state is ready for testing.

The sub-states of the separated state (SE) are:
• out-of-use (OU)
• no hardware (NH)

Recovery takes a unit to the separated, out-of-use state (SE-OU) if the fault diagnosis finds the unit faulty. A unit in the separated, no hardware state SE-NH has all the tables reserved that are needed in the system, but the hardware is not installed.

2.4.2 Functional unit hierarchy

The hardware in some network elements includes blades that contain several processors. To be able to handle these, the functional unit concept is expanded so that the functional units can contain other functional units. A group of nested functional units is called a functional unit hierarchy.

All functional units, regardless of their place in the hierarchy, are uniquely identified by the functional unit type and index. This means that unit indexes (of a single unit type) are system-wide. As a result, an application can identify any functional unit even if it does not know its place in the hierarchy, or does not know whether the hierarchy exists (unless it explicitly needs to know about the hierarchy).

When creating functional units, the entire hierarchy ‘tree’ is created with only a couple of commands. The hierarchy relations are determined at the time of the configuration and then stored by Hardware Configuration Management and Recovery System.
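The identification rule above, where every unit is addressed by its type and a system-wide index regardless of nesting, can be sketched as follows. The class UnitRegistry and the unit type names BLADE and CPU are hypothetical placeholders, not unit types of the product.

```python
class UnitRegistry:
    """Hypothetical store of functional units keyed by (type, index)."""

    def __init__(self):
        self._parent = {}  # (unit_type, index) -> parent key, or None at top level

    def create(self, unit_type, index, parent=None):
        """Create a unit; the (type, index) pair must be unique system-wide."""
        key = (unit_type, index)
        if key in self._parent:
            raise ValueError(f"unit {key} already exists")
        if parent is not None and parent not in self._parent:
            raise KeyError(f"parent {parent} does not exist")
        self._parent[key] = parent
        return key

    def exists(self, unit_type, index):
        """Look a unit up without knowing anything about the hierarchy."""
        return (unit_type, index) in self._parent

    def path(self, unit_type, index):
        """Return the chain of units from the top level down to this unit."""
        key = (unit_type, index)
        chain = []
        while key is not None:
            chain.append(key)
            key = self._parent[key]
        return list(reversed(chain))


registry = UnitRegistry()
blade = registry.create("BLADE", 0)
registry.create("CPU", 0, parent=blade)
registry.create("CPU", 1, parent=blade)

print(registry.exists("CPU", 1))
print(registry.path("CPU", 1))
```

An application that only needs to address a unit calls exists with the type and index; only code that explicitly cares about nesting needs path.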
The figures below show specific examples of hierarchical units.

Figure 6 STMU - hierarchical unit

Figure 7 ATCA hierarchical unit
Labels in the figure: VMU, virtualized unit.

Figure 8 Example of the BCN hierarchical unit
Labels in the figure: MCBC, PCUM, ETME, ETMA, BCXU, OMU, MCMU.

2.4.3 Redundancy models

Redundancy is a method of providing the system with redundant equipment and using it to improve the fault tolerance of the system. It is achieved by using backup units of functional units.

2N active/standby (duplicated)

A redundancy scheme in which two units are used to complete a task for which one is enough at any given time. One unit is always active, that is, in the working state (WO), while the other unit is kept in the hot standby state, that is, in the spare state (SP).

N+1

A redundancy scheme in which a unit type consists of several active units and one spare unit. If one of the active units gets broken or has to be separated from traffic (for example, because the hardware must be replaced), the spare unit takes over. An idle spare unit can replace any of the active units of the same unit type.

Complementary N+1 and SN+ load sharing

A redundancy scheme in which one or more extra computer units are reserved for use so that the system can bear the failure of a unit.

2.4.4 State transitions

The CPU blades are hot-swappable and it is technically possible to remove these units without changing their working state to SE-NH. It is recommended that the state transitions are issued manually (with MML commands) also for the hot-swappable units, because it is a more controlled way to handle the transitions than an automatic swap and has a smaller effect on the traffic.
The state transition sequences for the units are the following:

From WO-EX into SE-NH
   WO-EX --> SP-EX
   SP-EX --> TE-EX
   TE-EX --> SE-OU
   SE-OU --> SE-NH

From SE-NH into WO-EX
   SE-NH --> SE-OU
   SE-OU --> TE-EX
   TE-EX --> SP-EX
   SP-EX --> WO-EX

Before turning a unit back into the working state after hardware replacements, run diagnostics on it first to check that it is healthy. When the diagnostic report shows that the unit is healthy, turn it into the WO-EX state to see that it is in working order and handles the switchover smoothly.

Keep a number of ready-initialized hard disks available on the site in case of failure. The initialization of a new disk takes a considerable amount of time.

Functional units can be controlled with commands when they have been configured for the system. In general, you can change the working state of a unit to some other state only according to the figure below. The sub-state is not defined in the modification command.

Note: Not all units have BL state.

The possible working state transitions are the following:

Figure 9 Time slot based units (units that have no redundancy or complementary N+1 redundancy units)
States shown: WO, BL, TE, SE.

Figure 10 Other backed-up units (2N or replaceable redundant N+1 units)

The figure below illustrates the state transitions of I/O devices.

Figure 11 State transitions of I/O devices
States shown: MI-SY, MI-US, BL-US, BL-SY, WO-ID, TE-ID, WO-BL.

The state changes for the functional units are presented in the following figures. Some state changes are performed only by the system, while some can only be done with forced control commands.

Figure 12 State changes in computer units with no redundancy
States shown: SE-NH, SE-OU, TE-EX, WO-RE, WO-EX.

The above figure presents the possible states and state changes in computer units which have no redundancy in the system.
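The two manual sequences listed earlier in this section (from WO-EX into SE-NH and back) are each other's reverse, so they can be derived from a single ordered list. The sketch below does this; the function name transition_steps is an assumption, and treating any two states on the chain as valid start and end points is also an assumption, since only the two full sequences are documented.

```python
# Order of states when taking a unit out of use, as listed in the text.
REMOVAL_SEQUENCE = ["WO-EX", "SP-EX", "TE-EX", "SE-OU", "SE-NH"]


def transition_steps(current, target):
    """Return the (from, to) steps between two states on the chain."""
    start = REMOVAL_SEQUENCE.index(current)   # raises ValueError if unknown
    end = REMOVAL_SEQUENCE.index(target)
    step = 1 if end >= start else -1
    states = [REMOVAL_SEQUENCE[i] for i in range(start, end + step, step)]
    return list(zip(states, states[1:]))


for source, destination in transition_steps("WO-EX", "SE-NH"):
    print(f"{source} --> {destination}")

for source, destination in transition_steps("SE-NH", "WO-EX"):
    print(f"{source} --> {destination}")
```

The helper only plans the order of the steps; each step still has to be issued as a separate state change.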
Figure 13 State changes in units with 2N redundancy (states in figure: SP-UP, SE-NH, SP-RE, SP-EX, TE-EX, SE-OU, WO-RE, WO-EX)

Figure 13 presents the possible states and state changes in functional units that have 2N redundancy, that can be restarted by using a command of the US MML command group, and that have a warm-up of the spare unit. Only computer units have the SP-UP warm-up state. Units that have no warm-up cannot be in the SP-UP state. Units that have no redundancy cannot be in the SP state.

It is possible to change from all SP states into the TE-EX state and from all SP states into the WO-RE state by using forced control. In addition, when a 2N redundant unit pair is created in the system for the first time, the unit that is selected as the active one can be taken into use from the SE-OU or TE-EX state directly to the WO state, because the SP state is not possible in that situation.

Figure 14 State changes in computer units with replaceable N+1 redundancy (states in figure: SE-NH, SP-RE, SP-EX, SP-UP, TE-EX, SE-OU, WO-RE, WO-EX)

Figure 14 presents the possible states and state changes in computer units with replaceable N+1 redundancy. It is possible to change from all SP states into the TE-EX state and from all SP states into the WO-RE state by using forced control.

Figure 15 State changes in functional units with complementary N+1 redundancy (states in figure: SE-NH, SE-OU, TE-EX, WO-RE, WO-EX, BL-EX, BL-RE, BL-ID)

Figure 15 presents the possible states and state changes of functional units that have complementary N+1 redundancy and that can be in the BL state. For some functional units which cannot be restarted, the state change from TE to WO does not occur through WO-RE, but directly to the WO-EX state.
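The manual transition sequences listed at the start of this section form a simple linear chain for backed-up units. The following Python sketch shows how the intermediate steps between two states could be derived from that chain. The helper name `transition_path` and the list representation are illustrative assumptions and are not part of any MML tooling; units without redundancy have no SP state, so the chain does not apply to them as written.

```python
# Linear chain taken from the transition sequences in section 2.4.4.
# The list and the helper below are illustrative assumptions only.
CHAIN = ["WO-EX", "SP-EX", "TE-EX", "SE-OU", "SE-NH"]


def transition_path(current: str, target: str) -> list[tuple[str, str]]:
    """Return the single-step transitions needed to go from current to target."""
    start, end = CHAIN.index(current), CHAIN.index(target)
    step = 1 if end >= start else -1
    # Walk the chain index by index so that both directions are handled.
    states = [CHAIN[i] for i in range(start, end + step, step)]
    return list(zip(states, states[1:]))


if __name__ == "__main__":
    for src, dst in transition_path("WO-EX", "SE-NH"):
        print(f"{src} --> {dst}")
```

Each returned pair corresponds to one state change command that would be issued manually; the sketch does not check sub-states or forced control.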
2.4.5 Working state administration

Recovery

Recovery is used in error conditions and system startups. Generally, it is performed automatically by the maintenance software. Recovery commands are given only in exceptional situations: to restart a functional unit or the system, in commissioning, when making software or hardware changes, or to change the working states and the additional information of the functional units.

The commands of the recovery and unit working state administration fall into the following functional entities:
• restarts
• warming up the spare unit
• management of the units’ working states
• management of the units’ additional information

In addition to the commands given from the user terminal (the US command group), the front panel control switches of the blades can be regarded as the user interface for the recovery system. You can operate the system with these switches if you cannot use the recovery and unit working state administration commands. The following switches are available:
• switches on the MCU blade
• reset push button for the CPU blade

Unit warm-up

The recovery system controls the warm-up of redundant computer units. You can warm up only N+1 units by command (2N units are warmed up automatically). It is also possible to select a spare unit for N+1 and warm it up before executing a controlled unit switch over to make the switch over faster. With a warmed-up spare unit, faulty active units also cause fewer disturbances to the traffic.

You can set a spare unit as the permanent pair for an active unit if the purpose is to make the traffic disturbance in error conditions as short as possible, but only if there are several spare units in the configuration, because this prevents the use of this spare unit as a spare unit for other WO units.
The spare unit of the N+1 redundant computers is known as the cold spare unit, which means that its files and program blocks must be updated to the level of the active unit and a warm-up must be performed. Warming up the N+1 redundant unit may take several minutes, depending, for example, on the type of unit (if the unit has a number of programs to be warmed up) and on the loading situation of the source. In the testing phase, it is easier to use the USW warming command to investigate the warm-up times and how the N+1 redundant units stay in the input synchronism than to use the USC switch over command alone.

The warm-up of the 2N redundant computer spare unit is always activated automatically at the end of the restart of the spare unit, but the warm-up of the N+1 redundant computer spare unit is activated at the end of the restart of the spare unit only if the spare unit has been set for a warm-up. The state of the warm-up phase is SP-UP. You will receive a message when the unit is hot (SP-UP -> SP-EX). At the conclusion of the warm-up, the switch over of the unit is started immediately.

The warm-up pair can be kept for as long as you want, but during that time this spare unit cannot be used as a spare unit for other WO units. If you output the state of the unit with the USI command, you can distinguish the hot and cold N+1 redundant SP-EX units from each other by the additional information IDLE, which is only set for cold units.

When the spare unit is warmed up with the USW command, the SP-UP state is updated as the spare unit state and the SI X additional information is set for the active and spare unit. For more information, see Additional information for the functional units in the Recovery and Unit Working State Administration document.

After the warm-up, the spare unit stays hot and you may, for example, perform a switch over with the USC switch over command. In this case, the switch over does not include any warm-up.
When the switch over is completed, the warm-up of the spare unit can be terminated with the USW command. The system then restarts the spare unit in order to restore the unit as a cold spare unit.

If the warm-up is not finished, you can perform a forced switch over, but only in exceptional circumstances. You can use the MML terminal to find out which program block is currently being warmed up by repeating the command to start the warm-up. In this case, the error messages 804 WARMUP FAILURE and 821 REWARMING IN PROGRESS will be output on the MML terminal, as well as information on which program block in the unit was warmed up.

You can monitor the results of warming up the spare unit in the printouts of the service terminal (see Printouts of starting phases in DMX units) and in the alarm printouts.

Note: A forced switch over should only be used in exceptional situations when issuing the command is otherwise not possible. A forced transition always causes the spare unit to restart.

You can also interrupt the warming up of the spare unit with the USW command during the warm-up phase. The spare unit then restarts and goes to the SP-EX state.

Unit restarts

A functional unit (computer units or some preprocessors, for example ET) can be restarted with the restart command. When a functional unit is restarted spontaneously or with an MML command given without parameters, code is not usually loaded unless the check sums of the programs have changed. However, all files are always loaded.

Restarting a unit with an MML command affects the object unit and the interface computers, if they are connected to object units. With the interface unit type parameter value (for example, ET and STMU), the interface computers that are connected to the object unit can be restarted. The unit restarts into the same state in which it was before the command was given.
When the WO unit of a 2N redundant unit is restarted, the SP unit starts as well.

The restart of a unit caused by the use of the reset push button or commands given by the service terminal takes place spontaneously from the point of view of the recovery system, and therefore entails recovery operations. The recovery operations to be carried out depend on the state in which the object unit was before the restart and on whether a repeated startup is in question, in which case it is interpreted as a fault. In the case of duplicated units, for example, an immediate switchover is performed after the unit in the WO-EX state has restarted if the spare unit is in the SP-EX state.

The 2N-redundant OMU is an exception to this if you give the ZAU service terminal command. In this case, the recovery system interprets the restart as a command given by you, and the OMU starts into the same state in which the unit was before the command was given.

System restarts

The system restart caused by an MML command restarts all computer units at the same time. By using command parameters, the active operation and maintenance computer and interface units (for example, ET and STMU) can be included in the system restart. By default, they are not restarted in system restart.

By using parameters, you can also enter the loading mode for the initially loadable code (loading all codes or only the code the check sums of which have changed) and files (immediate loading or virtual loading). In system restart, in general, no code loading takes place unless the check sums in the memory have changed. By default, files are always loaded virtually.
2.4.6 Implementation of recovery functions

The structure of the recovery function is presented below:

Figure 16 Structure of the recovery function (levels in figure: centralized part, unit level, preprocessor level; program blocks in figure: RYEPRO, RCXPRO, RYSSIX, RIAMAN, RXSPRB, SVATOR, RUMMAN, FUZNLM, USAPRO, RXFPRB, QXFPRB, QIESER, QXUPRB, RIESEN, PS0PRB, PSAPRO, COUNTERPART)

Program blocks

Note: Chorus-based computer unit is only supported in classic BSC.

COUNTERPART   Counterpart Program Block of PS0PRB
FUZNLM        FUNLIB Library Manager Program Block
PSAPRO        Preprocessor Unit State Administration Program Block
PS0PRB        Preprocessor Supervisor
QIESER        Recovery Event Announcing Service for Chorus Units
QXFPRB        Functional Unit Level Restart Controller Program Block in Chorus Unit
QXUPRB        Functional Unit Level State Manager in Chorus Unit
RCXPRO        Recovery Program Block
RIAMAN        Recovery Event Announcing Service (centralised part)
RIESEN        Recovery Event Announcing Service (unit level)
RUMMAN        Unit Supervision Management Program Block (used only for interface computers)
RXFPRB        Functional Unit Level Restart Controller Program Block (used only for interface computers)
RXSPRB        System Level Restart Controller Program Block (used only for interface computers)
RYEPRO        Recovery Action Execution Program Block
RYSSIX        Recovery Interface Converter Program Block
SVATOR        Actor Supervisor in Chorus Computer Units
USAPRO        Unit State Administration Program Block

Centralized part of recovery

The tasks of the centralized part of recovery are:
• selecting recovery actions on the basis of the system’s overall situation
• prioritizing recovery actions so that overlapping recovery actions are done in the correct order
• centralized administration of recovery actions
• executing user interface commands

The centralized part (RCXPRO) has been backed up in the following way:
• The centralized part is normally active in the
Operations and Maintenance Unit (OMU).
• If the OMU becomes faulty (or if its state is other than WO), the centralized part becomes active in the CM unit (spare side OMU if the OMU is backed up).

Recovery system unit level

The distributed part of the recovery system (USAPRO) handles the states of its units. This includes, for example, the following functions:
• controlling unit activation
• controlling state transitions (for example, unit switch over)
• supervising software
• handling and supervising certain preprocessors in the unit

During startups, the distributed part ensures that startup operations, such as loading files and activating applications, are performed in the correct order. During transitions, the distributed part controls the state transitions in the applications.

Software supervision is implemented by supervision messages sent at regular intervals, to which the application program blocks reply.

Preprocessor handling is responsible for controlling the state changes and startups in the preprocessors. Specific handling program blocks handle some of the preprocessors.

All computers have a distributed part. Some computer units have preprocessors which are not functional units and which run the DMX or Chorus operating system. Such preprocessors are supervised by PS0PRB in those computer units.
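Supervision by periodic messages, as described above, follows the general heartbeat pattern. The Python sketch below illustrates that pattern only. The class name, the block names, the timeout value, and the use of explicit timestamps are all assumptions made for the example; it does not describe how USAPRO or PS0PRB are implemented.

```python
class SupervisionSketch:
    """Track which supervised program blocks have replied recently (illustrative)."""

    def __init__(self, blocks, timeout_s):
        self.timeout_s = timeout_s
        # Assumption: supervision starts at time 0.0, before any reply is seen.
        self.last_reply = {name: 0.0 for name in blocks}

    def record_reply(self, block, now):
        """Store the time at which a block answered a supervision message."""
        self.last_reply[block] = now

    def silent_blocks(self, now):
        """Return the blocks whose last reply is older than the timeout."""
        return sorted(
            name for name, seen in self.last_reply.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    sup = SupervisionSketch(["APP_A", "APP_B"], timeout_s=5.0)  # hypothetical names
    sup.record_reply("APP_A", now=8.0)
    print(sup.silent_blocks(now=10.0))
```

In a real system, a block reported as silent would be a trigger for recovery actions; which action is chosen is outside the scope of this sketch.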
Recovery system preprocessor level

The preprocessor part of the recovery system (PSAPRO or STWPRO) has the following tasks:
• controlling preprocessor activation
• controlling state transitions
• supervising software

Unit state handling mechanisms

The recovery system controls the system configuration by using the following unit state handling mechanisms:
• setting the state of the message bus media interface (MBIF or Ethernet interface)
• wired switch over control
• blocking of time-slots
• handling of logical addressing
• control messages

On the basis of the functional unit classification, the recovery system knows the control type of each functional unit (the classification data on the functional units is given in section Additional information for network element functional units in the Recovery and Unit Working State Administration document).

The state of the message bus media interface (MBIF or Ethernet interface) is set by using a specific command (only in computer units). The interface has two states: in use and separated. By using the command, the unit can also be restarted. The relationship between the unit working states and the settings of the interface is as follows:

Table 4 Relationship between unit working states and the interface setting

Unit state   Interface state
WO           in use
SP           in use
BL           in use
TE           in use
SE           separated

By using the wired switch over control, the doubled computer units are controlled so that one of the units is active and the other is passive. Doubled preprocessor units (for example, the synchronization CLS unit) independently determine which unit is active and which is passive.

Table 5 Relationship between unit working states and wired switch over control

Unit state   Wired switch over control
WO           active
SP           passive
TE           passive
SE           passive

The blocking of time-slots is used in the state handling of time-slot-based units.
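Tables 4 and 5 above can be read together as a lookup from the main working state to the settings that the recovery system applies. A minimal Python sketch of that lookup follows; the dictionary and function names are illustrative assumptions, and a state that a table does not list (BL in Table 5) is returned as `None` instead of being guessed.

```python
# Values copied from Table 4 (interface setting) and Table 5 (wired switch over).
INTERFACE_STATE = {
    "WO": "in use", "SP": "in use", "BL": "in use", "TE": "in use",
    "SE": "separated",
}
SWITCHOVER_CONTROL = {
    "WO": "active", "SP": "passive", "TE": "passive", "SE": "passive",
}


def settings_for(unit_state):
    """Map a state such as 'SE-OU' or 'WO' to the settings of Tables 4 and 5."""
    main_state = unit_state.split("-")[0]  # drop the sub-state, if present
    return {
        "interface": INTERFACE_STATE.get(main_state),
        "wired_switchover": SWITCHOVER_CONTROL.get(main_state),
    }


if __name__ == "__main__":
    print(settings_for("SE-OU"))
```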
The following table shows the relationship between the blocking of time-slots and the unit working states.

Table 6 Relationship between unit working states and blocking of time-slots

Unit state   Time-slots      ET time-slots
WO           not separated   not blocked
BL           separated       blocked
TE           separated       blocked
SE           separated       blocked

In case of the ET, blocked time-slots are also barred.

The management of logical addressing depends on
• whether the unit’s logical address can be found in some general control file (for example, PCMCON), or
• whether the unit’s logical address is known to an application controlling the functional unit (for example, the logical unit has active connections or channels).

The recovery system checks the above-mentioned information from the general control files, or the applications provide the necessary information to the recovery system. By changing the mapping of logical addressing, the recovery system determines which physical units are in use and which are spare.

Table 7 Relationship between unit working states and logical addresses

Unit state   Logical address in control file   Logical address known by application
WO           yes                               yes
SP           no                                no
TE           no                                no
SE           no                                no

Control messages are used to:
• inform a functional unit about its state transitions
• distribute the working states
• control the mapping of logical addressing

All functional units (a computer unit or preprocessor) with state-management software (the distributed part or preprocessor part of the recovery system) can be managed by using control messages.

2.5 Diagnostics

The diagnostic and testing system locates hardware failures in the network element and verifies that it functions properly. If errors occur, the system informs you of the necessary actions by producing test diagnostics with fixed headers. The software automatically detects failures and produces a diagnostic report or an alarm printout.
The diagnostic work done by the system is directed to the functional units of the system (total diagnoses) or to the functional entities of the system as seen from the point of view of the diagnostics (partial diagnoses). A total diagnosis is divided into partial diagnoses and the partial diagnoses further into diagnostic programs.

The diagnostics system can be divided into a control part and an execution part. The control part is located in the Operation and Maintenance Unit (OMU) (or when there is only one OMU which is under testing, in a logical Central Memory (CM) type unit), whereas the execution part is distributed among all functional units in the network element.

To repair the fault, you must replace the blade that the report shows to be faulty, or you must determine the location of the failure according to the instructions that you can find in the following sources:
• the outputs of the network element
• alarm descriptions and instructions
• troubleshooting instructions
• the instructions of Replacing hardware units in Software Platforms Troubleshooting and Maintenance

When you have repaired the fault, you have to verify that the network element is in working order by issuing a test command which activates the tests of the repaired unit. After this, you must transfer the repaired unit into its normal working state so that it can perform the tasks assigned to it.

For detailed descriptions of the specific commands, see Diagnostics Handling, UD Command Group.

2.5.1 Diagnostics procedures in normal and special failure situations

The diagnostic system function is activated when a diagnostic report with fixed headers is received. It aims at locating failures to the accuracy of one blade. When this is not possible, the system produces a list of suspected blades along with the faulty one. These are listed in the order of probability.
The diagnostic report is generated after the system has carried out a total diagnosis of the functional unit. At the same time, the system transfers the functional unit into the out-of-use state (SE-OU). This makes it safe to replace blades.

The diagnostic report contains the following important fields:

LOCATION       shows where a faulty blade shelf is located
ROW            shows the cabinet row in question
CABINET        shows the location of the cabinet in the cabinet row (A is the first cabinet from the left)
HEIGHT         shows the height of the shelf. The height can be seen at the bottom end of the shelf and the shelf height on a scale that you can find on the side of the cabinet.
DISPLACEMENT   shows the shelf's displacement from the left-hand side of the shelf
TRACK          shows the track number of the faulty blade in a shelf

If the system is unable to perform a diagnosis for some reason, a text diagnosis is printed out on the alarm printer. For example, a loop test could fail because the loop cannot be established. In such a case, the diagnostic system cannot decide whether the tested unit is in working order or not.

Like alarms, text diagnoses have a number and fixed headings, and they describe the observed failure in words. They often also contain instructions on what to do in those failure cases, and they can contain a list of the suspected blades and a description of maintenance activities. The repair actions that the alarms and the text diagnoses call for are described in the alarm descriptions.

If you cannot locate the failure by using diagnostic commands, see the alarm printouts and alarm descriptions. For instructions on printing out alarms, see Alarm administration overview in Alarm Administration.

If you need to locate a fault on the basis of alarms, see Unit-specific hardware alarms and alarms from peripheral devices.
Failures in a power supply unit can be located so accurately that you can repair the fault immediately by replacing the faulty blade. Instructions are given in the document Replacing Plug-in Units and other Hardware Units in SGSN.

Alarm descriptions also contains instructions for cases when a unit has totally failed (for example, diagnostic report 3726).

2.5.2 Changes in unit states during a diagnosis

The figure Changes in unit states explains the changes in unit states that are associated with automatically executed diagnostics.

Figure 17 Changes in unit states (labels in figure: Test begins; Fault located; Total diagnostics does not find a fault; TE; TE in test; TE faulty; SE faulty; Plug-in unit replaced; * COMMAND)

1. The object of the diagnostics is in the TE state.
2. At the start of the diagnostics, the IN TEST state data of the unit is placed in the state file maintained by the recovery system, or the peripheral device is transferred from the TE-ID state into the TE-AC state.
3. If a fault is located, the unit state changes to the SE-OU state.
4. If the total test does not find a fault, the faulty (FLTY) flag of the unit is cleared (if the system has set it on earlier) and the unit state changes to the SP-EX state.
5. Finally, the control of the unit is returned to the recovery system.

* Partial diagnosis does not find a fault and the recovery has previously registered the unit faulty.

2.5.3 Initial conditions for diagnoses

A failure can only be diagnosed in a unit when it is in the test state (TE). The recovery system must completely transfer the control of the unit to the diagnostic system for the duration of the diagnosis.

In order to know the exact object of the test, the diagnostic system must have access to the hardware description in the equipment database. The hardware description shows the details of the unit down to each single plug-in unit.
The database is supplied, with all the necessary data entered into it, together with the software. If you modify the hardware (for example, by adding a plug-in unit), you must update the hardware description accordingly. You can update the hardware description by using the MML programs for hardware description handling. You can use the WTP command to add a plug-in unit into the equipment database, and the WTI command to list the description of the units.

If the hardware description of a functional unit in the database is faulty, the results of the diagnostic system are inaccurate and even misleading.

2.5.4 Fault situations of the diagnostic system

If the diagnostic system functions incorrectly or executes a diagnostic task the results of which you do not need at the moment, you can take over the control and decision making in the diagnostic situation with the UDS interrupt command.

For instructions, see Performing diagnostics for units.

2.5.5 Total and partial unit tests

You can activate the total diagnosis of a unit by using the UDU command. The total diagnosis consists of the consecutive execution of all diagnostic programs in the unit until a fault is located or the unit is found operational.

You can activate the partial test of a unit by using the UDU command and the partial diagnosis parameter. A partial diagnosis consists of the consecutive execution of all diagnostic programs included in the partial diagnosis in question (until a fault is located).

You can assign priority to a diagnosis by using the priority parameter with the UDU command.

Functional description of total and partial unit tests

If the diagnostic program is idle when it receives a request to perform a diagnostic job, it prepares the execution of the job and informs you that the job has been started. Up to ten tests can be active at the same time. However, only one test per unit can be active at the same time.
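The two limits just stated (ten active tests in total, one active test per unit) can be pictured as an admission check. The Python sketch below covers only these two rules. The function name, its arguments, and the unit names in the usage example are hypothetical, and the sketch does not model the queueing or priority handling of the diagnostic system.

```python
MAX_ACTIVE_TESTS = 10  # from the text: ten tests can be active at the same time


def can_start(unit, active_units):
    """Decide whether a test for `unit` may start.

    `active_units` is the set of units that currently have an active test.
    """
    if unit in active_units:
        # Only one test per unit can be active at the same time.
        return False
    return len(active_units) < MAX_ACTIVE_TESTS


if __name__ == "__main__":
    running = {"ET-2", "ET-5"}  # hypothetical unit names
    print(can_start("ET-1", running))  # a new unit while fewer than ten tests run
    print(can_start("ET-2", running))  # a unit that is already under test
```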
The tests of the OMU (if there is only one OMU) and the Message Bus (MB) or the Ethernet Message Bus (EMB) have been prioritized so that no other tests can be active while these are being executed.

The printouts of the diagnostic programs (the diagnostic report and the intermediate printout on the execution of partial diagnoses) are directed to a logical file called DIAGNOS. The diagnosis results can be checked using the UDH MML command.

If the diagnostic program queue is already busy executing a diagnostic job, or if a problem arises at the preparation phase, the diagnostic process informs you of the error. However, by using the UDU command and the priority parameter, you can place a diagnostic job at the head of a full queue and start the job.

General instructions on total and partial unit tests

Use the total diagnosis command to verify the fault repair actions when:
• the list of blades to be tested is short
• the execution time for the total diagnosis is short
• the unit has only one partial diagnosis

The partial diagnosis command is best for verifying repair actions when, for example, the fault has been detected in the computer controlling the switching network or in the switching network itself, and when the list of suspected plug-in units is lengthy. The name of the partial diagnosis that has detected the fault is shown in the diagnostic report.

Note: The diagnostic system does not regard a unit as operational until all the diagnostic programs (the total diagnosis) have been executed successfully. After you have performed a partial diagnosis, start the total diagnosis by using the UDU command.

The use of partial tests when replacing blades

When the list of suspected blades is long, you can speed up the repair operation by starting the partial test (instead of the total unit test) that has detected the fault.
The name of the partial test is shown in the PARTIAL DIAGNOSIS NAME field of the diagnostic report. If the diagnosis shows that the unit is in working order, it is likely that the repair action has been successful. However, the system leaves the unit in the faulty (FLTY) functional state, from which you can transfer it into the operational state by starting the total diagnosis for the unit with the UDU command.

If replacing a blade does not help and the same diagnosis recurs, reinstall the old blade where it was originally, and replace the next blade on the list of the suspected blades.

When the total diagnosis of a unit has been run and it has not ended in a diagnostic report or in an alarm, in other words, the TOTAL DIAGNOSIS EXECUTED - UNIT OK message has been received, you can bring the unit from the test state (TE) back to its normal working state (SP or WO) by using the USC command.

Partial diagnoses

The system includes the following partial diagnoses and diagnostic programs, grouped by sub functions:
• power test
• processor diagnostics
• diagnostics for peripheral devices
• MB or Ethernet-based Message Bus (EMB) diagnostics

Partial diagnoses and tested plug-in units are presented in the following tables.
Table 8 Switching network diagnostics

Partial diagnosis                                                            Diagnostic program                                     Plug-in unit types tested by the partial diagnosis
WAT (Wired alarm partial diagnosis)                                          Wired alarm test                                       CLB, TG, TGFP, MBIF_U, MBIF_T of M or CAC
GSW (GSW partial diagnosis, SWI partial diagnosis)                           Accurate test for the network                          SWCOP, SWCSM, SWSPS
SPLRT (Partial diagnosis for serial-to-parallel converter line receivers)    Test for serial-to-parallel converter line receivers   SWSPS, SWCOP, internal PCMs

Table 9 Processor diagnostics

Partial diagnosis   Diagnostic program        Plug-in unit types tested by the partial diagnosis
CPU                 CPU test                  CPU
RAM                 RAM test                  CPU
ETHER               Ethernet test             CPU
SYSB                PCIe test                 CPU
ETHAM               ETHAM test                CPU
PROC                All above test programs   CPU

Each time slot-based unit has one partial diagnosis, which is named according to the name of the equipment. Each preprocessor has one partial diagnosis, which is named after the blade.

The peripherals (HDU, LPT, VDU) have no partial diagnoses. They are always subject to total diagnosis.

The diagnostics of the MB or Ethernet-based Message Bus (EMB) consists of the MB/EMB partial diagnosis.

Objectives of partial unit tests

The objectives of a partial diagnosis are:
• to accelerate repair actions
• to make it possible to test an object which is smaller than the functional unit
• to give the user an idea of the scope of the test and of the tested unit parts as the test progresses

Partial diagnoses are thus not intended for screening faults, that is, for executing various partial diagnoses and trying to deduce the location of the fault on the basis of the results. This kind of screening is mostly included in the diagnostic programs as a built-in function where the analysis part of the programs locates the fault.
Screening of faults comes into question in two cases:
• You can sometimes get further confirmation of the location of the fault by activating a partial diagnosis that is not executed in the total test. This is because the diagnostic system usually executes partial diagnoses and diagnostic programs up until the first failure. However, any result achieved in this manner is usually of little value.
• Screening of faults with partial diagnosis commands comes into question when a fault arises in the switching network interface of units connected to the switching network, such as ET. In such cases, you can apply the SPLRT partial diagnosis for the switching network. When you are using the SPLRT partial diagnosis, the units connected to the switching network must be in the normal working state (WO).

Execution order of partial tests

One of the principles in the diagnostic system is to test the remaining parts on the basis of those parts that have previously been found operational. Therefore, the execution order of partial diagnoses within a total diagnosis bears certain significance.

Figures Interdependencies between partial diagnoses in processor diagnostics and Interdependencies between partial diagnoses in switching network diagnostics illustrate the dependency of the execution order on partial diagnoses according to sub functions. At the bottom of each figure, you can find the basis on which the execution of the upper tests is built.

Processor diagnostics

Figure 18 Interdependencies between partial diagnoses in processor diagnostics (labels in figure: SYSB test, RAM test, CPU test, PROC test, Basic defaults)

The figure shows, for example, that the partial diagnosis PROC corresponds to the CPU, RAM, and SYSB partial diagnoses. The default for processor diagnostics is that the unit can execute its own diagnostic programs. The tests shown in this figure must be successfully executed before the switching network tests can be done.
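The principle described above (run the partial diagnoses in dependency order and stop at the first failure) can be sketched in Python as follows. The order CPU, RAM, SYSB is an assumed bottom-up reading of Figure 18, and the function name and the `run_test` callback are hypothetical; they stand in for whatever actually executes a partial diagnosis.

```python
# Assumed bottom-up order of the processor partial diagnoses in Figure 18.
PROC_ORDER = ["CPU", "RAM", "SYSB"]


def run_until_first_failure(order, run_test):
    """Run partial diagnoses in order and stop at the first one that fails.

    `run_test(name)` must return True when the named diagnosis passes.
    Returns (passed, failed), where `failed` is None if every test passed.
    """
    passed = []
    for name in order:
        if not run_test(name):
            return passed, name
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    # Simulated outcome in which the RAM test fails.
    passed, failed = run_until_first_failure(PROC_ORDER, lambda name: name != "RAM")
    print("passed:", passed, "failed:", failed)
```

Because later tests rely on the parts already found operational, the tests after the failing one are not executed in this sketch, which mirrors the behaviour the text attributes to the total diagnosis.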
Switching network diagnosis

Figure 19 Interdependencies between partial diagnoses in switching network diagnostics (labels in figure: SPLRT test; GSW0 test, GSW1 test, GSW2 test, GSW3 test; GSW test; CME test; SMI test; WAT test)

This figure shows the diagnostics for the switching network separately from the processor test. The figure should, however, be interpreted as if it were completely above the processor diagnostics.

The test of the serial-to-parallel converter line receiver in the switching network (SPLRT) may find a fault not only in the blades, but also in the internal PCM circuits (cables) and devices connected to them. It is capable of detecting a large part of the failures in the switching network interface of a device connected to the switching network by an internal PCM circuit. There are special instructions for failure cases of this kind in Alarm descriptions, where you can find the detailed description of each text diagnosis.

2.6 General maintenance procedures

The recommended maintenance routines can be carried out by any regular personnel of the network element, as these do not require special training or disassembling of equipment. They can be carried out during normal working hours unless stated otherwise.

It is recommended that the customer keeps a network-element-specific diary. The diary can also be stored in the Operation and Maintenance Centre if the network element is not usually manned. The network element diary should be filled out from the time the network is being set up and installed. The following events are recommended to be recorded in the diary:
• hardware changes
• software and hardware updates (change notes, correction deliveries, etc.)
• essential modifications to the configuration or routing in the network element
• safecopying
• operational failures
• any other relevant information

Entries must include the date, time and the maintenance personnel's name.

Performing daily maintenance routines

During normal hours, alarms must be investigated as they are reported.

Figure 20 Daily maintenance routines
• Output: alarm history (previous 24 hours); current alarms (AHO) (if there is a large number of alarms)
• Check: working states of computer units (USI) (abnormal working states are accounted for); working states of I/O devices (ISI)
• Inspect: printers (paper enough for the next 24 hours; operating correctly)
• Update: FB build (WKS) (update the FB build*; check in the log file that backup was successful (WKP))
* The FB build needs to have been created already by using the mode=full parameter (WKS).

Performing weekly maintenance routines

Maintenance tasks to be carried out weekly.

Figure 21 Weekly maintenance routines
• Printing and saving unit states: print unit states (USI); save the printout as an accurate record of the unit states
• Check: blocked alarms (ABO) (look for any blocked alarms that cannot be accounted for); date and time of the network element (DCD); printer ribbons (clean the printers if required)

Performing monthly maintenance routines

Maintenance tasks to be carried out at one-month intervals.

Figure 22 Monthly maintenance routines
• Check: hardware alarms and alarm inputs (WAE)
• Clean the Visual Display Unit (VDU): wipe the screen, the case, and the keyboard; adjust brightness and contrast of the display if necessary
• Clean the printer: disconnect from main power supply; open the cover; vacuum clean the inside of the printer; change the ribbon if necessary; close the cover; connect to main power supply

Performing six-monthly maintenance procedures

Maintenance tasks to be carried out at six-month intervals.

Note: Routines should be carried out in low traffic periods.
Figure 23 Six-monthly maintenance routines
• Running fault diagnoses and changing unit states: change the unit state to TE (USC); run the fault diagnosis (UDU); if any of the fault diagnoses fail, investigate and correct the fault; change the unit states of the diagnosed units to WO-EX (USC)

Performing yearly maintenance routines

Maintenance tasks to be carried out yearly.

Figure 24 Yearly maintenance routines
• Checking voltages: measure the supply voltage at the cabinet-specific power supply units; check and measure the voltages of the vertical power busses; measure the voltage difference between the earth (DOV) and + lead (0V); check the earthing connections visually. If the measurement result is still not within limits, note this down in the network element diary.
• Cleaning the floppy disk unit: moisten the cleaning disk with the cleaning agent; insert the disk into the drive; enter the command IWI; make the red LED indicator on the front panel glow for 30 seconds; remove the cleaning disk from the drive

Performing maintenance routines on external equipment

Here are a few general maintenance recommendations for external equipment.

Figure 25 External equipment maintenance routines
• General maintenance recommendations: check at regular intervals that the standby power supplies and their alarm systems are in working order; service the environmental equipment regularly and test the alarm systems; test the equipment room fire alarm system regularly; note down the results in the network element diary

Marking unused blocks on a hard disk

Hard disk drives need no service. However, with time there will be read and write errors on the disk due to corrupted blocks. These blocks can be marked as 'bad blocks', that is, unused blocks (IWB).

Actions before a system restart

If the system is restarted before all the information is saved to the disks, the databases will most probably be corrupted. Especially after a massive download of parameters from the Netact, the steps described in the instructions section should be taken before giving a system restart command.
Before restarting the system, the following checks should be done:
1. Make sure that a fallback package is available in the BSC.
2. Copy the databases to the disk:
   ZDBC:BSDATA,0;
   ZDBC:OEDATA,0;
   ZDBC:EQUIPM,0;
   ZDBC:ILDATA,0;
   ZDBC:PDDATA,0;
3. Check the states of the databases with the commands:
   ZDBS:BSDATA,0;
   ZDBS:OEDATA,0;
   ZDBS:EQUIPM,0;
   ZDBS:ILDATA,0;
   ZDBS:PDDATA,0;
   The states of the databases have to be NORMAL.
4. Check the database consistency on both disks (W0 and W1) with the command ZDBD:OMU;
5. Make sure that there is no disk updating in the queue with the command ZDUQ; The values of SIMQUE and SEQQUE have to be zero (0).

2.6.1 Actions before a system restart

Purpose
This document contains generic information about products. These can be instructions that explain problem situations in the field, instructions on how to prevent or how to recover from problem situations, announcements about changes or preliminary information as requirements for new features or releases.

Validity

Table 10 Validity
Product      SR20  SR40  S16_2  BSC16  SRAN16.2
BSC3i        -     -     X      X      -
Flexi BSC    X     X     X      X      -
TCSM2A/E     -     -     X      X      -
TCSM3i       -     -     X      X      -
mcBSC        -     -     X      X      -
mcTC         -     -     -      -      -
Single RAN   -     -     -      -      X

Compatibility / Dependencies to other products
None

Keywords
System restart, database

Executive summary
It has been noticed that a system restart right after modifying some of the BSC databases can cause problems in the system. When data is added to the databases, the system first saves the information to memory and then, after a short while, copies it to the databases on the Winchester disks.

Detailed description
If the system is restarted before all the information is saved to the disks, the databases will most probably be corrupted. Especially after a massive download of parameters from the Netact, the steps described in the instructions section should be taken before giving a system restart command.
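The pre-restart checks can be sketched as a small helper that lists the MML commands in checklist order and evaluates values the operator has read from the printouts. This is only an illustrative sketch: the function names and the way results are passed in are assumptions, the code does not connect to a BSC, and it does not parse real printouts, whose exact layout is not shown here.

```python
# Database names as they appear in the checklist.
DATABASES = ["BSDATA", "OEDATA", "EQUIPM", "ILDATA", "PDDATA"]


def pre_restart_commands(databases=DATABASES):
    """Return the checklist's MML commands in the order they are given."""
    copy_cmds = [f"ZDBC:{db},0;" for db in databases]   # step 2: copy to disk
    state_cmds = [f"ZDBS:{db},0;" for db in databases]  # step 3: check states
    return copy_cmds + state_cmds + ["ZDBD:OMU;", "ZDUQ;"]  # steps 4 and 5


def restart_is_safe(db_states, simque, seqque, fallback_available):
    """Evaluate values transcribed by the operator (hypothetical inputs).

    db_states maps a database name to the state read from the ZDBS printout;
    simque and seqque are the values read from the ZDUQ printout.
    The disk consistency check (ZDBD:OMU;) is not modelled here.
    """
    all_normal = all(db_states.get(db) == "NORMAL" for db in DATABASES)
    queues_empty = simque == 0 and seqque == 0
    return fallback_available and all_normal and queues_empty


if __name__ == "__main__":
    for command in pre_restart_commands():
        print(command)
```

A restart would be treated as safe only when every database reports NORMAL, both queue values are zero and a fallback package exists; any other combination means the checklist should be repeated.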
Solution/Instructions
Before restarting the system, the following checks should be done:
1. Make sure that a fallback package is available in the BSC.
2. Copy the databases to the disk:
   ZDBC:BSDATA,0;
   ZDBC:OEDATA,0;
   ZDBC:EQUIPM,0;
   ZDBC:ILDATA,0;
3. Check the states of the databases with the commands:
   ZDBS:BSDATA,0;
   ZDBS:OEDATA,0;
   ZDBS:EQUIPM,0;
   ZDBS:ILDATA,0;
   ZDBS:PDDATA,0;
   The states of the databases have to be NORMAL.
4. Check the database consistency on both disks (W0 and W1) using the ZDBD:OMU; command.
5. Make sure that there is no disk updating in the queue using the ZDUQ; command. The values of SIMQUE and SEQQUE have to be zero.

Note
In the worst case, if the system is restarted before all the information is saved to the Winchester disks, the only way to get the BSC working again is to activate the fallback package.

Reference
None

2.6.2 CCS7 30 Minutes freezing done

Purpose
This section contains generic information about products. These can be instructions that explain problem situations in the field, instructions on how to prevent or how to recover from problem situations, announcements about changes or preliminary information as requirements for new features or releases.

Validity

Table 11 Validity
Product      SR20  SR40  S16_2  BSC16  SRAN16.2
BSC3i        -     -     X      X      -
Flexi BSC    X     X     X      X      -
TCSM2A/E     -     -     X      X      -
TCSM3i       -     -     X      X      -
mcBSC        -     -     X      X      -
mcTC         -     -     -      -      -
Single RAN   -     -     -      -      X

Compatibility / Dependencies to other products
None

Keywords
CCS7, ZNMM

Executive summary
This section gives instructions for preventing the CCS7 30 MINUTES FREEZING DONE printout with the ZNMM command.

Detailed description
The notice about the freezing of counters, CCS7 30 MINUTES FREEZING DONE, is printed every 30 minutes. This printout is only for information, and it is possible to prevent it.

Solution/Instructions
The notice CCS7 30 MINUTES FREEZING DONE is prevented with the MML command ZNMM:F:F6=P;
Note
None

Reference
None

Appendix: Terminologies

The terminology used in the ATCA hardware-based software platform user interface has not been changed yet to comply with the ATCA terminology. This part lists the terms used in classic DX documentation and the comparable terms used in ATCA and BCN hardware documentation.

Table 12 Terminologies in classic DX 200, ATCA and BCN
Classic DX 200 ATCA BCN plug-in unit blade add-in card cartridge shelf cartridge chassis shelf chassis subrack shelf subrack rack cabinet rack

3 BCSU/BCXU Recovery and Alarms 690, 691 and 1001

Purpose
This document contains generic information about products. These can be instructions that explain problem situations in the field, instructions on how to prevent or how to recover from problem situations, announcements about changes or preliminary information as requirements for new features or releases.

Validity
Product SR20 SR40 S16_2 BSC16 SRAN16.2
BSC3i X X
Flexi BSC X X X X
TCSM2A/E X X
TCSM3i X X
mcBSC X X
Single RAN X

Compatibility / Dependencies to other products
None

Keywords
BCSU, BCXU, recovery, alarm, 0690, 0691, 1001

Executive summary
Radio Network Recovery and State Management uses alarms 0690, 0691 and 1001 during BCSU/BCXU switchover and restart for recovery purposes. These alarms must not be blocked from the BCSU/BCXU.

Detailed description
Radio Network Recovery and State Management uses alarms 0690, 0691 and 1001 during BCSU/BCXU switchover and restart for recovery purposes. These alarms must not be blocked from the BCSU/BCXU. If alarm 0690, 0691 or 1001 has been blocked, controlled BCSU/BCXU switchover, forced BCSU/BCXU switchover and BCSU/BCXU restart will not work properly.

Solution/Instructions
The Blocked Alarms Handling MML is used to find out which alarms have been blocked.
Print out the blocked alarms:
   ZABO;
If alarm 0690, 0691 or 1001 has been blocked from the BCSU/BCXU or from all the units, unblock it with the command:
   ZABU:<alarm_number>;

Note
Alarms 0690, 0691 and 1001 may be blocked from units other than the BCSU/BCXU. For example, alarm 0690 may be blocked from the VTP to avoid unnecessary notices from Netact connections (ZABB:690:VTP;).

Reference
None

4 Expiry of licenses

Purpose
This document contains generic information about products. These can be instructions that explain problem situations in the field, instructions on how to prevent or how to recover from problem situations, announcements about changes or preliminary information as requirements for new features or releases.

Validity

Table 14 Validity
Product      SR20  SR40  S16_2  BSC16  SRAN16.2
BSC3i        -     -     X      X      -
Flexi BSC    X     X     X      X      -
TCSM2A/E     -     -     -      -      -
TCSM3i       -     -     -      -      -
mcBSC        -     -     X      X      -
Single RAN   -     -     -      -      X

Compatibility / Dependencies to other products
Implementation of this technical support note has no effect on, and sets no requirements for, the other network elements.

Keywords
License, power, break, default, calendar, clock, boot, time

Executive summary
A power break of a BSC can lead to a reset of the clock / calendar time, which can then cause the expiry of license-based features and an overload of the MCMU.

Detailed description
A power break of the BSC can lead to a reset of the clock / calendar time. If the clock / calendar time is reset by a power break, the default time (1989-01-01 00:00:00) is taken into use at the network element. This causes the licenses to expire 24 hours after the default time activation. All licenses go to the CONF state and license-based features are deactivated.
Sometimes a power break or a faulty CPU in the OMU may cause an invalid clock / calendar time (2013-1?-21 06:38:01.17), leading to MCMU overload (alarm 1071 PROCESSOR TIME SHORTAGE).

Alarm example:
   FlxieBSC21 MCMU-1 SWITCH 2009-01-01 00:30:00.41
   ** ALARM MCMU-1 1A003-06 LMNPRO
   (0110) 3269 LICENCE EXPIRATION WARNING
   TIMRES3 0001

The first supplementary field of the alarm gives the TIMRES3 indication if the default time has been activated. The second field gives the time before the licences will expire. In the case of default time, the field value is one day.

   <HIST> Kobra-BSC3i MCMU-1 SWITCH 2013-12-21 07:26:45.04
   *DISTUR MCMU-1 1A002-03 CPUPRO
   1071 PROCESSOR TIME SHORTAGE
   00000001 04 02A0 0000 40 0676 0011 40

Solutions/Instructions
After a power break of the whole BSC, the OMU or the MCMUs, the calendar / clock time of the BSC needs to be checked and corrected if it is not up to date or if the information is invalid (for example, contains non-numeric characters).

• In the case of default time activation, the calendar / clock time needs to be corrected manually through the DC - Clock Handling MML.

Example: The default time active on a NE
   ZDCD;
   LOADING PROGRAM VERSION 7.5-0
   TIME 2009-01-01 00:00:17
   TIME ZONE 0, SUMMER TIME IS ON
   COMMAND EXECUTED

Example: Time updated by the DCS MML
   ZDCS:2010-12-01,11-53-00:ST=OFF:;
   TIME UPDATED IN ALL COMPUTER UNITS
   NEW TIME 2010-12-01 11:53:00
   SUMMER TIME IS OFF
   COMMAND EXECUTED

• The status of the licences needs to be checked and, if needed, the licences re-activated with the W7 - Licence and Feature Handling MML.
Example: Licence status check by the W7I MML
   ZW7I:FEA,FULL:FSTATE=CONF;
   LOADING PROGRAM VERSION 2.13-0
   BSC3i LANCS641 213737 2010-12-01 14:41:33
   FEATURE INFORMATION:
   ----------------------------------------------
   FEATURE CODE:..............72
   FEATURE NAME:..............AMR FR
   FEATURE STATE:.............CONF
   FEATURE CAPACITY:..........660
   COMMAND EXECUTED

Example: Licence state modification by the W7M MML
   ZW7M:FEA=72:ON;
   LOADING PROGRAM VERSION 2.13-0
   BSC3i LANCS641 213737 2010-12-01 14:42:33
   CHANGING STATE OF FEATURES:
   FEATURE 72 CHANGED TO ON
   COMMAND EXECUTED

Notes
The time and date should be checked after a restart or power break. There are 24 hours in which to correct the BSC time before the licences expire.

References
NED operating documentation:
• DC - Clock Handling
• W7 - License and Feature Handling
PR NA04796701: Follow-Up Case for Emergency Case NA04794699 - Calls are failing in BSC.

5 PCU2 serial port usage

This technical note recommends not using the PCU2 serial port while the BCSU is in the restart phase.

Description
This document contains generic information about products. These can be instructions that explain problem situations in the field, instructions on how to prevent or how to recover from problem situations, announcements about changes or preliminary information as requirements for new features or releases.

Detailed Description

Problem
When the PCU2 boots up during the BCSU restart phase, it performs a specific handshake procedure to establish communication with the CPU of the BCSU. If the BCSU load is high, this procedure may need to be repeated multiple times.

PCU2-D
In this case the PCU2 provides a 10 s period for the operator to activate the Built in Self Test (BIST) mode from the serial port.
If a CR character (ASCII code 13) is received from the serial port during that time, the PCU2 enters BIST mode and discontinues the boot procedure. As a consequence, the BCSU fails to restart.

PCU2-E
On the PCU2-E, the user must enter the bist command from the serial port during the 10 s period to get the PCU2 into BIST mode. Thus a single character entry does not disturb the normal boot procedure.

Solutions/Instructions
If any equipment is connected to the serial port of the PCU2 during the restart phase, it must be ensured that no unsolicited characters are transmitted. Before entering the username or any commands, wait until the boot procedure is completed (see the example output in the chapter Boot up phases and traces of the PCU2 Service Terminal Commands manual, included in the BSC NED product documentation set).

If a service terminal port (for example, a reverse telnet server) is connected to the serial port, local echo must not be enabled on the server port.

This problem does not occur on Telnet sessions or when the remote service terminal session is established from an MML session. Trying to establish either type of session, even during the boot procedure, does not cause this problem.

Note
The BIST mode is intended to be used in production testing and when investigating faulty plug-in units on a test bench. It is not to be used in a live BSC.

Reference
PCU2 Service Terminal Commands manual DN0479352.

6 PCU plug-in units without unique MAC HW address

The MAC HW address has been found to be identical between two separate PCU plug-in units (PIUs), but not between the internal ports IFETH0 and IFETH1 of one PCU PIU.

Description
This document contains generic information about products.
These can be instructions that explain problem situations in the field, instructions on how to prevent or how to recover from problem situations, announcements about changes or preliminary information as requirements for new features or releases.

Detailed Description
Some delivered PCU plug-in units do not have a unique MAC HW address preconfigured. The address is used for L2 layer routing of Gb/IP traffic. If PCUs with an identical MAC HW address are installed in the same BSC where the Gb over IP feature is used, the IP traffic does not work properly. Failures in IP traffic in the PCU LAN due to the identical MAC HW address interfere with BSSGP layer signaling and also with end-user data transactions.

Solution/Instructions
To verify that the BSC does not have PCU plug-in units with the same MAC HW address, it is recommended to check the ifeth interfaces of every PCU with the ZQRS command. The command syntax for ZQRS is:
   ZQRS:BCSU,<BCSU_ID>:<PCU_TYPE>,<PCU_ID>:INS,<INTERFACE>:SYM=NO;
Both interfaces of the PCU, IFETH0 and IFETH1, have to be checked. For example, for PCUB 3 in BCSU 3:
   ZQRS:BCSU,3:PCUB,3:INS,IFETH0:SYM=NO;
   ZQRS:BCSU,3:PCUB,3:INS,IFETH1:SYM=NO;

Example printout from a BSC that has PCU-B plug-in units with an identical MAC HW address:
   ZQRS:BCSU,3:PCUB,3:INS,IFETH0:SYM=NO;
   LOADING PROGRAM VERSION 10.13-0
   BSC3i BSC53 2007-07-03 17:18:28
   UNIT: BCSU-3
   ifeth0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
   inet6 fe80:1::240:43ff:feea:ffc prefixlen 64 duplicated
   inet 10.177.155.70 netmask 0xffffffe0 broadcast 10.177.155.95
   ether 00:40:43:ea:0f:fc

If there is another PCU plug-in unit in the BSC with the same MAC HW address, the printout shows a duplicated warning. If PCU plug-in units with an identical MAC HW address are found in the BSC, the PCU plug-in units must be changed so that each PCU in the same L2 network (usually the same BSC, or BSCs on the same site) has a unique MAC HW address.
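When many PCUs have to be checked, the comparison of the collected ZQRS printouts can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the way printouts are grouped per PCU are assumptions, the parsing relies on the "ether" line and the "duplicated" keyword appearing as in the example printout above, and collecting the printouts from the BSC is left to the operator.

```python
import re
from collections import defaultdict

# Matches the "ether xx:xx:xx:xx:xx:xx" line of an interface printout.
ETHER_RE = re.compile(r"\bether\s+((?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})")

# Interface part of the example printout in this section.
SAMPLE = """ifeth0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
inet6 fe80:1::240:43ff:feea:ffc prefixlen 64 duplicated
inet 10.177.155.70 netmask 0xffffffe0 broadcast 10.177.155.95
ether 00:40:43:ea:0f:fc"""


def parse_ifeth(printout):
    """Return (mac, flagged) for one interface printout.

    mac is None if no ether line is found; flagged is True if the
    printout contains the word "duplicated".
    """
    match = ETHER_RE.search(printout)
    mac = match.group(1).lower() if match else None
    return mac, "duplicated" in printout


def find_shared_macs(printouts_by_pcu):
    """Map each MAC seen on more than one PCU to the PCUs that carry it.

    printouts_by_pcu maps an operator-chosen PCU label to the list of
    interface printouts (IFETH0 and IFETH1) collected for that PCU.
    """
    owners = defaultdict(set)
    for pcu, printouts in printouts_by_pcu.items():
        for text in printouts:
            mac, _ = parse_ifeth(text)
            if mac:
                owners[mac].add(pcu)
    return {mac: sorted(pcus) for mac, pcus in owners.items() if len(pcus) > 1}


if __name__ == "__main__":
    print(parse_ifeth(SAMPLE))
    print(find_shared_macs({"BCSU-3/PCUB-3": [SAMPLE], "BCSU-4/PCUB-3": [SAMPLE]}))
```

Grouping by PCU rather than by interface means that a MAC address is reported only when it appears on two separate plug-in units, which is the case described in this section.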
PCUs with identical MAC addresses can be used if there is an L3 router between them. The change can be done either with a spare plug-in unit or by swapping plug-in units between BSCs, because failures happen only if PCU plug-in units with the same MAC HW address are installed in the same L2 network (usually the same BSC, or BSCs on the same site).

Notes
An identical MAC HW address interferes only with the Gb over IP feature. If the Gb over IP feature is not used, failures are not seen.

7 Recommendations for PCU2 Black Box Saving State

This is a request to enable black box information saving in PCU2 cards. Saved black box information is invaluable when investigating events leading to alarm 1178 PREPROCESSOR UNIT DISTURBANCE.

Description
This document contains generic information about products. These can be instructions that explain problem situations in the field, instructions on how to prevent or how to recover from problem situations, announcements about changes or preliminary information as requirements for new features or releases.

Detailed Description
The PCU2 has a mechanism for saving some SW information to a black box when a software exception leads to a PCU2 SW crash and to alarms 1178 PREPROCESSOR UNIT DISTURBANCE and 2770 PREPROCESSOR UNIT FAILURE. This information is stored in the PCU2 flash memory so that it is available after the PCU2 restart for finding the root cause of the software failure.

PCU2 black box saving to flash memory is disabled by default in PCU2 card revisions older than PCU2-D: C108407.C3A. It is, however, required to enable PCU2 black box saving in all PCU2s to help investigate possible PCU unit failures (alarms 1178 and 2770).
It is also required to disable PCU2 black box saving during a BSC release or CD upgrade to avoid a minor risk of corruption of the PCU2 software in flash. Once black box saving has been disabled or enabled, it stays in that state unless changed with a PCU2 service terminal command. A PCU2 unit restart does not reset the state.

In PCU2 card revisions PCU2-D: C108407.C3A or newer, black box saving is enabled by default. No actions related to the black box saving state should be done with these cards. Black box saving can also be enabled during a BSC release or CD upgrade.

Solution/Instructions
S14, S15 and onwards for PCU2-D:
• The PCU2 black box saving state can be checked with the PCU2 service terminal command mbb status.
• PCU2 black box saving can be enabled with the PCU2 service terminal command mbb enable.
• PCU2 black box saving can be disabled with the PCU2 service terminal command mbb disable.

The PCU2 service terminal can be connected with the following commands:
   ZDDS;
   ZRS:3<BCSU_index>,<PCU_terminal_index+1>0BE
or
   ZRS:3<BCSU_index>,<PCU_terminal_index+1>0BF

After the operation in the PCU2 service terminal, the session should be ended with the commands:
   exit: ZE;

The black box saving state change has to be done separately for all PCU2-D units.
