Title
Scaling pair count to next galaxy surveys
Authors
Abstract
Counting pairs of galaxies or stars according to their distance is at the core of the real-space correlation analyses performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure the properties of billions of galaxies, challenging our ability to perform such counting on the minute-scale timescales needed when working with simulations. The problem is limited only by efficient access to the data and hence belongs to the big-data category. We use the popular Apache Spark framework to address it and design an efficient high-throughput algorithm that handles hundreds of millions to billions of input data points. To optimize it, we revisit the question of nonhierarchical sphere pixelization based on cube symmetries and develop a new scheme, dubbed the "Similar Radius Sphere Pixelization" (SARSPix), whose pixels are very close to square. It provides the best-suited indexing over the sphere for all distance-related computations. Using LSST-like fast simulations, we compute autocorrelation functions on tomographic bins containing between one hundred million and one billion data points. In each case we construct a standard pair-distance histogram in about 2 minutes, using a simple algorithm that is shown to scale over a moderate number of nodes (16 to 64). This illustrates the potential of these new techniques in the field of astronomy, where data access is becoming the main bottleneck. They can easily be adapted to other use cases such as nearest-neighbor search, catalog cross-matching, or cluster finding. The software is publicly available from https://github.com/astrolabsoftware/SparkCorr.
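For readers unfamiliar with the term, the following is a minimal brute-force sketch of what a pair-distance histogram computes. It is not the Spark algorithm or the SARSPix indexing described above: it compares every pair directly, which costs O(N²) and is only practical for tiny inputs. The function names `unit_vector` and `pair_histogram`, the (RA, Dec) input in degrees, and the half-open binning are all assumptions made for illustration.

```python
import math
from itertools import combinations


def unit_vector(ra_deg, dec_deg):
    """Convert sky coordinates in degrees to a 3D unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def pair_histogram(points, edges_deg):
    """Count unordered pairs per angular-separation bin.

    points    : list of (ra_deg, dec_deg) tuples (hypothetical input format)
    edges_deg : increasing bin edges in degrees; bin i is [edges[i], edges[i+1])
    """
    vecs = [unit_vector(ra, dec) for ra, dec in points]
    counts = [0] * (len(edges_deg) - 1)
    for a, b in combinations(vecs, 2):
        # Clamp the dot product to guard acos against rounding error.
        dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
        sep = math.degrees(math.acos(dot))
        for i in range(len(counts)):
            if edges_deg[i] <= sep < edges_deg[i + 1]:
                counts[i] += 1
                break
    return counts


# Three points with separations of about 1, 2 and 2.24 degrees.
demo_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
demo_counts = pair_histogram(demo_points, [0.5, 1.5, 2.5, 3.5])
print(demo_counts)
```

A spatial index over the sphere is what lets a scalable method avoid this all-pairs loop, since only points in nearby pixels need to be compared.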