Paper Title
Spot-on: A Checkpointing Framework for Fault-Tolerant Long-running Workloads on Cloud Spot Instances
Paper Authors
Paper Abstract
Spot instances offer a cost-effective solution for applications running in the cloud computing environment. However, running long jobs on spot instances is challenging because the instances are subject to unpredictable evictions. Here, we present Spot-on, a generic software framework that supports fault-tolerant long-running workloads on spot instances through checkpoint and restart. Spot-on leverages existing checkpointing packages and is compatible with the major cloud vendors. Using a genomics application as a test case, we demonstrated that Spot-on supports both application-specific and transparent checkpointing methods. Compared with running applications on on-demand instances, it allows these workloads to complete at a significantly lower computing cost. Compared with applications that rely on application-specific checkpoint mechanisms, applications protected by transparent checkpointing reduce runtime by up to 40%, leading to further cost savings of up to 86%.
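To make the checkpoint-and-restart idea concrete, the following is a minimal sketch of application-specific checkpointing in Python. It is not Spot-on's implementation or API: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON checkpoint file, and the `evict_at` parameter used to simulate an eviction are all hypothetical and chosen only for illustration. The workload here is a simple running sum, standing in for a long-running job.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so an eviction mid-write cannot corrupt it."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def run(items, path, checkpoint_every=10, evict_at=None):
    """Sum the items, checkpointing periodically and resuming if possible.

    evict_at is a hypothetical hook that simulates a spot eviction by
    raising an exception before the item at that index is processed.
    """
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if evict_at is not None and i == evict_at:
            raise RuntimeError("simulated eviction")
        state["total"] += items[i]
        state["next_item"] = i + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

When a run is interrupted, calling `run` again with the same checkpoint path resumes from the last saved position, so only the work done since the most recent checkpoint is repeated. Transparent checkpointing, as described in the abstract, would instead capture the process state from outside the application without this kind of code change.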