Paper Title
RB2: Robotic Manipulation Benchmarking with a Twist
Paper Authors
Paper Abstract
Benchmarks offer a scientific way to compare algorithms using objective performance metrics. Good benchmarks have two features: (a) they should be widely useful to many research groups; and (b) they should produce reproducible findings. In robotic manipulation research, there is a trade-off between reproducibility and broad accessibility. If the benchmark is kept restrictive (fixed hardware, objects), the numbers are reproducible but the setup becomes less general. On the other hand, a benchmark could be a loose set of protocols (e.g., object sets), but the underlying variation in setups makes the results non-reproducible. In this paper, we re-imagine benchmarking for robotic manipulation as state-of-the-art algorithmic implementations alongside the usual set of tasks and experimental protocols. The added baseline implementations provide a way to easily recreate SOTA numbers in a new local robotic setup, thus providing credible relative rankings between existing approaches and new work. However, these local rankings could vary between different setups. To resolve this issue, we build a mechanism for pooling experimental data across labs, and thus establish a single global ranking for existing (and proposed) SOTA algorithms. Our benchmark, called the Ranking-Based Robotics Benchmark (RB2), is evaluated on tasks inspired by the clinically validated Southampton Hand Assessment Procedure. Our benchmark was run across two different labs and reveals several surprising findings. For example, extremely simple baselines such as open-loop behavior cloning outperform more complicated models (e.g., closed-loop policies, RNNs, offline RL) that are preferred by the field. We hope our fellow researchers will use RB2 to improve the quality and rigor of their research.
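To make the pooling idea concrete, below is a minimal Python sketch of how per-lab results could be aggregated into a single global ranking. The data format, the algorithm names, the placeholder success rates, and the aggregation rule (mean rank across labs) are assumptions for illustration only, not RB2's published protocol or actual results.

```python
# Hypothetical illustration: pool each lab's local ranking into a global one
# by averaging each algorithm's rank across labs (lower mean rank = better).
from collections import defaultdict

# Placeholder success rates per algorithm reported by each lab (not real data).
lab_results = {
    "lab_A": {"open_loop_BC": 0.62, "closed_loop_BC": 0.55, "offline_RL": 0.48},
    "lab_B": {"open_loop_BC": 0.58, "closed_loop_BC": 0.60, "offline_RL": 0.41},
}

def local_ranking(scores):
    """Rank algorithms within one lab (1 = best) by success rate."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {alg: rank + 1 for rank, alg in enumerate(ordered)}

def global_ranking(results):
    """Pool local rankings by averaging each algorithm's rank across labs."""
    ranks = defaultdict(list)
    for lab_scores in results.values():
        for alg, rank in local_ranking(lab_scores).items():
            ranks[alg].append(rank)
    mean_rank = {alg: sum(r) / len(r) for alg, r in ranks.items()}
    return sorted(mean_rank, key=mean_rank.get)

print(global_ranking(lab_results))
# e.g. ['open_loop_BC', 'closed_loop_BC', 'offline_RL']
```

The key property this sketch captures is that individual labs may disagree on local orderings (as in the example above), while the pooled statistic still yields one global ordering that new work can be compared against.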