Preliminary implementation of event-based GPU acceleration in NECP-MCX

    Abstract:
    Background: Monte Carlo simulation of radiation shielding is computationally inefficient. Problem-specific variance reduction techniques are one way to accelerate it; a more general approach is to raise simulation speed at the hardware level through massively parallel computing. Driven by the enormous demand for computing power from artificial intelligence, major supercomputing platforms have steadily improved their support for large-scale GPU parallel architectures. Monte Carlo transport algorithms suited to GPUs are therefore needed to adapt to current and future supercomputing platforms.
    Purpose: This study aims to accelerate the fixed-source calculation of the NECP-MCX Monte Carlo particle transport code using GPU parallelism, and thereby to speed up radiation shielding transport simulations.
    Method: The characteristics of the event-based GPU parallel algorithm in fixed-source mode were analyzed. The algorithm was then preliminarily implemented in NECP-MCX and tested on a simple fixed-source problem.
    Results: The simulation speed is positively correlated with the maximum number of events simulated simultaneously. Sorting particle information speeds up the simulation by 28%, and the GPU parallel run is 25 times as fast as a single-core CPU run.
    Conclusions: The preliminary GPU implementation shows significant acceleration potential; however, further research is needed to fully exploit its capability and optimize overall performance.
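
    The control flow of an event-based scheme can be illustrated with a small CPU-only sketch. This is not NECP-MCX code: the function name run_event_based, the one-dimensional slab, the cross-section values, and the choice of position as the sort key are all assumptions made for illustration, since the abstract does not specify them. Each per-event loop stands in for what would be one GPU kernel launch over the whole particle bank, and max_in_flight plays the role of the maximum number of simultaneously simulated events.

```python
import math
import random


def run_event_based(n_particles, max_in_flight, sort_bank=True, seed=1):
    """Toy fixed-source problem in a 1D slab (hypothetical, illustrative values).

    Returns (tally, iterations), where iterations counts passes over the bank
    and is only a crude stand-in for the number of kernel launches per event.
    """
    rng = random.Random(seed)
    thickness, sigma_t, p_absorb = 5.0, 1.0, 0.3  # assumed, not from the paper
    tally = {"transmitted": 0, "reflected": 0, "absorbed": 0}
    bank = []          # particles in flight, each stored as [x, mu]
    started = 0
    iterations = 0

    while started < n_particles or bank:
        # Source event: refill the bank up to the in-flight limit.
        while len(bank) < max_in_flight and started < n_particles:
            bank.append([0.0, 1.0])  # born at x = 0, moving in +x
            started += 1

        # Optional sort so neighbouring entries hold similar particle data
        # (sorting by position is an assumption for this sketch).
        if sort_bank:
            bank.sort(key=lambda p: p[0])

        # Flight event: move every banked particle to its next collision site.
        for p in bank:
            distance = -math.log(1.0 - rng.random()) / sigma_t
            p[0] += p[1] * distance

        # Leakage check: particles that left the slab are tallied and dropped.
        inside = []
        for p in bank:
            if p[0] >= thickness:
                tally["transmitted"] += 1
            elif p[0] < 0.0:
                tally["reflected"] += 1
            else:
                inside.append(p)

        # Collision event: absorb, or scatter isotropically (new mu in [-1, 1]).
        bank = []
        for p in inside:
            if rng.random() < p_absorb:
                tally["absorbed"] += 1
            else:
                p[1] = rng.uniform(-1.0, 1.0)
                bank.append(p)

        iterations += 1

    return tally, iterations


if __name__ == "__main__":
    for limit in (1, 10, 100):
        tally, iterations = run_event_based(1000, limit)
        print(limit, iterations, tally)
```

    In this toy model a larger max_in_flight means fewer passes over the bank for the same number of histories, which mirrors the qualitative trend reported in the Results. The sketch cannot reproduce the 28% gain from sorting or the 25-fold speedup, both of which depend on GPU memory-access behaviour that plain Python does not model.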

     
