Description

Book Synopsis
Computational clusters have long provided a mechanism for accelerating high performance computing (HPC) applications. With today's supercomputers now exceeding the petaflop scale, however, they are also becoming increasingly heterogeneous. This heterogeneity spans a range of technologies, from multiple operating systems to hardware accelerators and novel architectures. Because some of these heterogeneous architectures provide exceptional acceleration, they are being embraced as viable tools for HPC applications. Given the scale of today's supercomputers, it is clear that scientists must consider fault tolerance in their applications. This is particularly true as computational clusters with hundreds or thousands of processors become ubiquitous in large-scale scientific computing. More processors mean shorter mean times to failure, so systems must be able to cope with arbitrary and unexpected node failures.

In this book the authors address the issue of fault tolerance via checkpointing. They discuss existing strategies for providing rollback recovery to applications, both through MPI at the user level and through application-level techniques. Checkpointing itself has been studied extensively in the literature, including in the authors' own work. Here they give a general overview of checkpointing and how it is implemented. More importantly, they describe strategies to improve the performance of checkpointing, particularly in distributed systems.

Computation Checkpointing & Migration

£185.99

Includes FREE delivery

RRP £247.99 – you save £62.00 (25%)

A Hardback by Vipin Chaudhary, Hai Jiang, John Paul N Walters


    Publisher: Nova Science Publishers Inc
    Publication Date: 01/07/2010
    ISBN-13: 978-1607418405
    ISBN-10: 1607418401
    Also in: Supercomputers

    © 2025 Book Curl
