A software architecture for developing distributed fault-tolerant systems independent of the underlying hardware architecture and operating system. Systems built using the architecture components are scalable and allow a set of computer applications to operate in fault-tolerant / high-availability mode, distributed processing mode, or many possible combinations of distributed and fault-tolerant modes in the same system without any modification to the architecture components. The software architecture defines system components that are modular and address problems in present systems. The architecture uses a
System Controller, which controls system activation, initial load distribution, fault recovery, load redistribution, and system topology, and implements system maintenance procedures. An Application Distributed Fault-Tolerant / High-Availability Support Module (ADSM) enables one or more applications to operate in various distributed fault-tolerant
modes. The System Controller uses ADSM's well-defined API to control the state of the application in these modes. The Router architecture component provides transparent communication between applications during fault recovery and topology changes. An Application
Load Distribution Module (ALDM) component distributes incoming external events to the distributed application. The architecture allows for a Load Manager, which monitors load on the various copies of the application and maximizes hardware usage by providing dynamic load balancing. The architecture also allows for a Fault Manager, which performs fault detection, fault location, and fault isolation, and uses the System Controller's API to initiate fault recovery. These architecture components can be used to achieve a variety of distributed processing high-availability system configurations, reducing cost and development time.
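
The control relationships described above can be illustrated with a minimal sketch. It assumes a simple active/standby configuration with two application copies. Every class, method, state name, and node name below is hypothetical and chosen only for illustration; the abstract does not specify the actual ADSM or System Controller API. The sketch shows the System Controller changing application state only through ADSM calls, and the Fault Manager initiating recovery through the System Controller.

```python
from enum import Enum


class State(Enum):
    # Hypothetical application states; the source does not enumerate them.
    OUT_OF_SERVICE = "out_of_service"
    STANDBY = "standby"
    ACTIVE = "active"


class ADSM:
    """Hypothetical stand-in for the support module of one application copy."""

    def __init__(self, node):
        self.node = node
        self.state = State.OUT_OF_SERVICE

    def set_state(self, state):
        self.state = state


class SystemController:
    """Hypothetical controller that drives state only through ADSM calls."""

    def __init__(self, copies):
        self.copies = {copy.node: copy for copy in copies}

    def activate(self, active_node):
        # System activation: one copy becomes active, the others stand by.
        for node, copy in self.copies.items():
            copy.set_state(State.ACTIVE if node == active_node else State.STANDBY)

    def recover(self, failed_node):
        # Fault recovery: take the failed copy out of service, then promote
        # the first standby copy if the failed copy was the active one.
        failed = self.copies[failed_node]
        was_active = failed.state is State.ACTIVE
        failed.set_state(State.OUT_OF_SERVICE)
        if not was_active:
            return None
        for node, copy in self.copies.items():
            if copy.state is State.STANDBY:
                copy.set_state(State.ACTIVE)
                return node
        return None


class FaultManager:
    """Hypothetical fault manager that hands a located fault to the controller."""

    def __init__(self, controller):
        self.controller = controller

    def report_fault(self, node):
        return self.controller.recover(node)


copies = [ADSM("node-a"), ADSM("node-b")]
controller = SystemController(copies)
controller.activate("node-a")
promoted = FaultManager(controller).report_fault("node-a")
```

In this sketch the application copies never change their own state; all transitions pass through the controller, which mirrors the separation of responsibilities the abstract describes. Routing, load distribution, and load balancing are omitted.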