Paper title
Machine learning to assess relatedness: the advantage of using firm-level data
Authors
Abstract
The relatedness between a country or a firm and a product is a measure of the feasibility of that economic activity. As such, it is a driver of investment at both the private and the institutional level. Traditionally, relatedness is measured using networks derived from country-level co-occurrences of product pairs, that is, by counting how many countries export both products. In this work, we compare networks and machine learning algorithms trained not only on country-level data but also on firm-level data, a setting that has been little studied because of the low availability of such data. We quantitatively compare the different measures of relatedness by using them to forecast exports at the country and firm level, assuming that more related products are more likely to be exported in the future. Our results show that relatedness is scale-dependent: the best assessments are obtained by using machine learning on the same type of data one wants to predict. Moreover, we find that while relatedness measures based on country data are not suitable for firms, firm-level data are also very informative for the development of countries. In this sense, models built on firm data provide a better assessment of relatedness. We also discuss the effect of using parameter optimization and community detection algorithms to identify clusters of related companies and products, finding that a partition into a higher number of blocks decreases the computational time while maintaining a prediction performance well above the network-based benchmarks.
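The network-based baseline mentioned in the abstract, counting how many countries export both products of a pair, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the toy export matrix, and the normalisation of the scores are assumptions made only for illustration.

```python
import numpy as np

def cooccurrence(M):
    """Count, for each product pair, how many countries export both."""
    M = np.asarray(M, dtype=int)
    C = M.T @ M                  # C[p, q] = number of countries exporting p and q
    np.fill_diagonal(C, 0)       # ignore a product's co-occurrence with itself
    return C

def relatedness_scores(M, C):
    """Score each (country, product) pair by the co-occurrence weight that
    links the product to the country's current export basket.
    Dividing by each product's total weight is an illustrative assumption."""
    M = np.asarray(M, dtype=float)
    totals = C.sum(axis=0)
    safe = np.where(totals > 0, totals, 1)   # avoid division by zero
    return (M @ C) / safe

# Hypothetical toy data: 3 countries (rows) x 4 products (columns); 1 = exported.
M_toy = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])
C_toy = cooccurrence(M_toy)
scores_toy = relatedness_scores(M_toy, C_toy)
```

In this toy case, products 0 and 1 are exported together by two countries, so a country exporting one of them receives a high score for the other. Under the abstract's assumption that more related products are more likely to be exported in the future, such scores could then be ranked and compared against later export data.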