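This article introduces the abstraction structure of CuTe Tiled Copy and the corresponding memory-copy flow. Contents:

1. CuTe Copy paradigm
    1.1 CopyOperation
    1.2 Copy_Traits
    1.3 Copy_Atom
    1.4 TiledCopy
    1.5 ThrCopy
2. CuTe Copy example

1. CuTe Copy paradigm

The abstraction structure of CUTLASS Tiled Copy is shown below.

1.1 Copy_Op

A Copy_Op wraps a raw PTX instruction. In the earlier article 《Tensor-003 TensorCore架构》 (TensorCore Architecture) we covered several memory-copy instructions, including ldmatrix, cp.async, and Hopper's TMA. The corresponding implementations live under include/cute/arch. For example, ldmatrix:

    struct SM75_U16x8_LDSM_T
    {
      // Source operand: one 128-bit value in shared memory
      using SRegisters = uint128_t[1];
      // Destination operands: four 32-bit registers
      using DRegisters = uint32_t[4];

      CUTE_HOST_DEVICE static void
      copy(uint128_t const& smem_src,
           uint32_t& dst0, uint32_t& dst1, uint32_t& dst2, uint32_t& dst3)
      {
        // Convert the shared-memory address into the 32-bit form expected by PTX
        uint32_t smem_int_ptr = cast_smem_ptr_to_uint(&smem_src);
        asm volatile ("ldmatrix.sync.aligned.x4.trans.m8n8.shared.b16 {%0, %1, %2, %3}, [%4];\n"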
………………………………