In software engineering and hardware engineering, serviceability (also known as supportability,) is one of the -ilities or aspects (from IBM's RAS(U) (Reliability, Availability, Serviceability, and Usability)). It refers to the ability of technical support personnel to install, configure, and monitor computer products, identify exceptions or faults, debug or isolate faults to root cause analysis, and provide hardware or software maintenance in pursuit of solving a problem and restoring the product into service. Incorporating serviceability facilitating features typically results in more efficient product maintenance and reduces operational costs and maintains business continuity.
Examples of features that facilitate serviceability include:Help desk notification of exceptional events (e.g., by electronic mail or by sending text to a pager)
Event logging / Tracing (software)
Logging of program state, such as
Execution path and/or local and global variables
Procedure entry and exit, optionally with incoming and return variable values (see: subroutine)
Exception block entry, optionally with local state (see: exception handling)
Graceful degradation, where the product is designed to allow recovery from exceptional events without intervention by technical support staff
Hardware replacement or upgrade planning, where the product is designed to allow efficient hardware upgrades with minimal computer system downtime (e.g., hotswap components.)
Serviceability engineering may also incorporate some routine system maintenance related features (see: Operations, Administration and Maintenance (OA&M.))
A service tool is defined as a facility or feature, closely tied to a product, that provides capabilities and data so as to service (analyze, monitor, debug, repair, etc.) that product. Service tools can provide broad ranges of capabilities. Regarding diagnosis, a proposed taxonomy of service tools is as follows:Level 1: Service tool that indicates if a product is functional or not functional. Describing computer servers, the states are often referred to as ‘up’ or ‘down’. This is a binary value.
Level 2: Service tool that provides some detailed diagnostic data. Often the diagnostic data is referred to as a problem ‘signature’, a representation of key values such as system environment, running program name, etc. This level of data is used to compare one problem’s signature to another problem’s signature: the ability to match the new problem to an old one allows one to use the solution already created for the prior problem. The ability to screen problems is valuable when a problem does match a pre-existing problem, but it is not sufficient to debug a new problem.
Level 3: Provides detailed diagnostic data sufficient to debug a new and unique problem.
As a rough rule of thumb for these taxonomies, there are multiple ‘orders of magnitude’ of diagnostic data in level 1 vs. level 2 vs. level 3 service tools.
Additional characteristics and capabilities that have been observed in service tools:Time of data collection: some tools can collect data immediately, as soon as problem occurs, others are delayed in collecting data.
Pre-analyzed, or not-yet-analyzed data: some tools collect ‘external’ data, while others collect ‘internal’ data. This is seen when comparing system messages (natural language-like statements in the user’s native language) vs. ‘binary’ storage dumps.
Partial or full set of system state data: some tools collect a complete system state vs. a partial system state (user or partial ‘binary’ storage dump vs. complete system dump).
Raw or analyzed data: some tools display raw data, while others analyze it (examples storage dump formatters that format data, vs. ‘intelligent’ data formatters (“ANALYZE” is a common verb) that combine product knowledge with analysis of state variables to indicate the ‘meaning’ of the data.
Programmable tools vs. ‘fixed function’ tools. Some tools can be altered to get varying amounts of data, at varying times. Some tools have only a fixed function.
Automatic or manual? Some tools are built into a product, to automatically collect data when a fault or failure occurs. Other tools have to be specifically invoked to start the data collection process.
Repair or non-repair? Some tools collect data as a fore-runner to an automatic repair process (self-healing/fault tolerant). These tools have the challenge of quickly obtaining unaltered data before the desired repair process starts.