A cluster
recovery process is implemented across a set of distributed archives, where each individual archive is a storage cluster of preferably symmetric nodes. Each node of a cluster typically executes an instance of an application that provides object-based storage of fixed content data and associated
metadata. According to the storage method, an association or “link” between a first cluster and a second cluster is first established to facilitate replication. The first cluster is sometimes referred to as a “primary” whereas the “second” cluster is sometimes referred to as a “replica.” Once the link is made, the first cluster's fixed content data and
metadata are then replicated from the first cluster to the second cluster, preferably in a continuous manner. Upon a failure of the first cluster, however, a
failover operation occurs, and clients of the first cluster are redirected to the second cluster. Upon repair or replacement of the first cluster (a “restore”), the repaired or replaced first cluster resumes authority for servicing the clients of the first cluster. This restore operation preferably occurs in two stages: a “
fast recovery” stage that involves preferably “bulk” transfer of the first cluster
metadata, following by a “fail back” stage that involves the transfer of the fixed content data. Upon
receipt of the metadata from the second cluster, the repaired or replaced first cluster resumes authority for the clients irrespective of whether the fail back stage has completed or even begun.