
Commit 48b6dd7

Add documentation for how buffers are managed in the runtime.
---------------------------------------------------------------------------
It specifically documents:
1. how buffers are used
2. how SYCL uses them
3. some caveats
4. how memory is allocated
5. device address encoding
6. the implementation flow of every function responsible for allocation and memory transfer
7. some corner cases

1 parent 1dd236b commit 48b6dd7

File tree: docs/buffers.md — 1 file changed, +137 −0 lines
# Buffers

A buffer takes care of automatic memory migration between host and device, or between devices, when a kernel needs to access the memory. This spares the user the inconvenience of having to copy memory from host to device manually.

A synopsis of buffer usage is as follows:

```cpp
#define GMEM_DATA_VEC 8

cl_platform_id platform;
cl_device_id device;
cl_int err;
// General setup
err = clGetPlatformIDs(1, &platform, NULL);
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
// aocx_size is the size of the aocx file
// aocx_contents is the content of the aocx file
cl_int binary_status;
cl_program program = clCreateProgramWithBinary(context, 1, &device, &aocx_size, (const unsigned char **)&aocx_contents, &binary_status, &err);
cl_kernel kernel = clCreateKernel(program, "kernel_name", &err);

// Host-side data that contains the info needed for kernel execution
unsigned int *host_data = (unsigned int *)acl_aligned_malloc(global_work_size * sizeof(unsigned int) * GMEM_DATA_VEC);

// Set up the expectation for a buffer, but do not actually allocate memory yet
cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE, global_work_size * sizeof(unsigned int) * GMEM_DATA_VEC, NULL, &err);

// Enqueue a memory transfer operation
cl_event event = {};
err = clEnqueueWriteBuffer(queue, mem, CL_FALSE, 0, global_work_size * sizeof(unsigned int) * GMEM_DATA_VEC, host_data, 0, NULL, &event);

// Tell the kernel that one of its arguments is the buffer
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &mem);

// Submit kernel execution to the queue
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, &global_work_size_2d[0], &local_work_size_2d[0], 0, NULL, &event);

... // Block on kernel finish and clean up resources
```

## Summary

1. Buffers are not actually allocated on the device until the kernel executes, `clEnqueueWriteBuffer` is called, or the `CL_MEM_COPY_HOST_PTR` flag is passed to `clCreateBuffer`.
2. Buffers take care of memory migration between host and device automatically.
3. The SYCL runtime typically calls OpenCL functions in a different order than OpenCL prefers. Specifically, the SYCL runtime calls functions in the order `clCreateBuffer` > `clEnqueueWriteBuffer` > `clSetKernelArg` > `clEnqueueNDRangeKernel`, whereas the optimal order in OpenCL is `clCreateBuffer` > `clSetKernelArg` > `clEnqueueWriteBuffer` > `clEnqueueNDRangeKernel`. This sometimes leads to unexpected results, but those can be avoided as long as `clSetKernelArg` does not modify `cl_mem` attributes.
4. The runtime keeps track of which memory addresses are occupied and uses that to decide where the next allocation should go. Specifically, it uses a first-fit allocation algorithm.
5. The actual memory allocation is made with MMD calls, by passing the device address and size.
6. Device addresses have a special encoding; this encoding is the same between buffers and USM pointers.
7. Even if the memory is not explicitly transferred before launching the kernel, it is still migrated right before the kernel executes (i.e. `clEnqueueWriteBuffer` is not strictly necessary).

## The flow

The SYCL runtime calls the memory operations in the following order. Each of them is explained in a subsection below.

1. clCreateBuffer
2. clEnqueueWriteBuffer
3. clSetKernelArg
4. clEnqueueNDRangeKernel / clEnqueueTask

This is not the preferred order from the point of view of the OpenCL runtime, so you may ask: why couldn't SYCL change it? Because SYCL needs to take care of other vendors too; not all vendors can get away with setting a kernel arg before enqueuing the write buffer (as discussed in one of the [issues](https://github.com/intel/llvm/discussions/4627)).

### clCreateBuffer

When [clCreateBuffer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L408-L953) is called, a host pointer needs to be provided if this buffer is supposed to move data from the host to other places.

The memory can be allocated in a different global memory, or in a different bank within the same global memory.

When `CL_MEM_COPY_HOST_PTR` is specified, the runtime does not yet know which device the memory is going to, so it always allocates it on the first device in the context and submits the copy to the auto queue. Since a context can contain multiple devices, keep in mind that specifying this flag in a multi-device context may cause a bug. For more information about the other flags available, see the [OpenCL clCreateBuffer Spec](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html).
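
A minimal sketch contrasting the two allocation paths; `buf_size` and `host_data` here stand in for the size expression and host pointer from the synopsis above:

```cpp
// Deferred: no device memory is reserved yet; allocation happens at the
// first clEnqueueWriteBuffer or right before kernel launch.
cl_mem deferred = clCreateBuffer(context, CL_MEM_READ_WRITE, buf_size, NULL, &err);

// Eager: CL_MEM_COPY_HOST_PTR forces an allocation plus a copy right away,
// on the first device in the context -- a potential pitfall when the
// context holds multiple devices.
cl_mem eager = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                              buf_size, host_data, &err);
```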

### clEnqueueWriteBuffer

[clEnqueueWriteBuffer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L3475-L3510) allocates memory space for the buffer through [acl_bind_buffer_to_device](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L357-L406) and then enqueues a memory transfer to actually copy the memory through [l_enqueue_mem_transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174).

#### Allocate space

[acl_bind_buffer_to_device](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L357-L406) is responsible for finalizing the buffer allocation; it is only called if the allocation was deferred.

1. It first calls [acl_do_physical_buffer_allocation](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L256-L316) to set the target global memory to the [default device global memory](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L277) (as indicated in board_spec.xml) if none was specified.

Note: simulation does not know the memory interfaces of any device until an AOCX is loaded, which usually happens after SYCL calls clEnqueueWriteBuffer.

2. It then reserves the memory through [acl_allocate_block](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4310-L4565). It first tries to allocate memory on the preferred bank; if that fails (i.e. the bank's memory is full), it tries the whole device global memory.
3. `acl_allocate_block` first needs to decide the range of memory it can allocate from, based on user-provided info about the device, global memory, and memory bank. The range is returned in the form [pointer to begin address, pointer to end address] (achieved through [l_get_working_range](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4253-L4308)).
4. On the terminology `memory bank`: a single device's global memory can be partitioned into multiple banks (the partitioning can be interleaved or separate, with interleaved being the default). Interleaved memory provides more load balancing between memory banks; the user can query which specific bank to access through runtime calls. For more information on memory banks, see [Global Memory Accesses Optimization](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/optimize-your-design/throughput-1/memory-accesses/global-memory-accesses-optimization.html)
5. Once the range of candidate memory is set, it also loops through the set of already allocated blocks (already occupied memory) and identifies any gap in between that meets the size requirement. The allocation prioritizes the preferred bank; if all of that bank's memory is occupied, it looks elsewhere. Once the address of the allocation is decided, the memory object's current `block_allocation` is set to that address range. The specifics of how memory is reserved are described in the section `Memory allocation algorithm`.
6. Once the address range is set, it returns a device address. The device address is different from its surface representation in the runtime; specifically, it is the bitwise OR of the device id and the device pointer, as formatted [here](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/include/acl.h#L264-L274). A hypothetical sketch of this encoding follows the list.
7. Once the 2D list is ready and the current block allocation is set, it [enqueues a memory transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174) from the context's unwrapped_host_mem to the buffer for enqueued writes, as described in the next subsection.
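
An illustration of the device address encoding from step 6, assuming the device id occupies the high bits of a 64-bit address; the shift width below is a made-up constant for illustration only, and the real bit layout lives in the acl.h link above:

```cpp
#include <cstdint>

// Hypothetical shift width; the actual value is defined in acl.h.
constexpr unsigned kDeviceIdShift = 48;

// Bitwise OR of the (shifted) device id and the per-device pointer,
// mirroring the encoding described in step 6.
uint64_t encode_device_address(uint64_t device_id, uint64_t device_ptr) {
  return (device_id << kDeviceIdShift) | device_ptr;
}

// Mask off the device id bits to recover the raw device pointer.
uint64_t decode_device_ptr(uint64_t encoded) {
  return encoded & ((uint64_t{1} << kDeviceIdShift) - 1);
}
```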

##### Memory allocation algorithm

The memory allocation algorithm is first-fit. It starts from the beginning of the requested global memory, then searches for the next available space (a gap, or the end of allocated memory) that satisfies the size requirement. If the user requests a specific memory bank, the memory has to be non-interleaved. When the user specifies a bank id, the first-fit search starts at the address ((bank_id - 1) % num_banks) * bank_size + the start of the target global memory. The implication of specifying a bank id is that consecutive memory allocations may not be adjacent to each other. On the other hand, if the user never specifies a bank id, consecutive memory allocations should be adjacent, assuming there were no deallocations.
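
A minimal first-fit sketch, assuming a list of occupied [begin, end) blocks sorted by address; the names and types here are illustrative, not the runtime's actual data structures:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct Block { uint64_t begin, end; };  // occupied range, [begin, end)

// First-fit: walk the occupied blocks in address order and return the first
// gap (or the tail) large enough for `size`. `start` is the beginning of the
// requested global memory or, when a bank id is given,
// ((bank_id - 1) % num_banks) * bank_size + global_mem_start.
std::optional<uint64_t> first_fit(const std::vector<Block> &occupied,
                                  uint64_t start, uint64_t limit,
                                  uint64_t size) {
  uint64_t cursor = start;
  for (const Block &b : occupied) {
    if (b.end <= cursor) continue;        // block lies before our window
    if (b.begin >= cursor + size) break;  // gap before this block is big enough
    cursor = b.end;                       // otherwise skip past this block
  }
  if (cursor + size <= limit) return cursor;
  return std::nullopt;  // no room: caller falls back to the whole global memory
}
```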

#### Transfer Memory

The second part of enqueue read/write buffer is to enqueue a memory transfer between host and device, as implemented in [l_enqueue_mem_transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174).

The whole enqueue process is as follows:

Upon updating the command queues, the memory transfer is first submitted to the device operation queue by calling [acl_submit_mem_transfer_device_op](https://github.com/intel/fpga-runtime-for-opencl/blob/950f21dd079dfd55a473ba4122a4a9dca450e36f/src/acl_command.cpp#L343)
([definition](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5313-L5392)). When the device operation executes, [acl_mem_transfer_buffer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5395-L5409) is called, which in turn calls [l_mem_transfer_buffer_explicitly](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5791-L6246).

`l_mem_transfer_buffer_explicitly` first creates a pointer-to-pointer mapping between the source and destination buffers, then copies the memory by invoking an MMD function for each mapped pointer pair.

It relies on one of four MMD functions (a dispatch sketch follows the list):
1. [copy_hostmem_to_hostmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1680-L1694) - uses a plain `memcpy`.
2. [copy_hostmem_to_globalmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1696-L1716) - calls the MMD function [`aocl_mmd_write`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879).
3. [copy_globalmem_to_hostmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1718-L1739) - calls the MMD function [`aocl_mmd_read`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879).
4. [copy_globalmem_to_globalmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1763-L1873) - if the source and destination are on the same device, directly calls [`aocl_mmd_copy`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L881-891); otherwise uses both [`aocl_mmd_read`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879) and [`aocl_mmd_write`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879) to copy from the source device to the host, then from the host to the destination device. All operations are blocking: the runtime keeps calling the MMD's yield (sleep) function until the read and write are done.
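
A condensed sketch of the dispatch over those four paths, assuming each side of the transfer is classified as host or global memory; the signatures are simplified, and the real functions in acl_hal_mmd.cpp take device handles, offsets, and sizes:

```cpp
enum class MemKind { Host, Global };

// Simplified dispatch mirroring the list above; returns which copy path
// the runtime would take rather than performing the copy.
const char *pick_copy_path(MemKind src, MemKind dst, bool same_device) {
  if (src == MemKind::Host && dst == MemKind::Host)
    return "copy_hostmem_to_hostmem (memcpy)";
  if (src == MemKind::Host && dst == MemKind::Global)
    return "copy_hostmem_to_globalmem (aocl_mmd_write)";
  if (src == MemKind::Global && dst == MemKind::Host)
    return "copy_globalmem_to_hostmem (aocl_mmd_read)";
  // global -> global: stage through the host unless both ends share a device
  return same_device ? "aocl_mmd_copy"
                     : "aocl_mmd_read to host, then aocl_mmd_write";
}
```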

### clSetKernelArg: What if clEnqueueWriteBuffer is not called?

When `clEnqueueWriteBuffer` is not called, the memory transfer automatically happens before launching the kernel that uses the buffer. There is an enqueued memory-transfer device operation before every kernel-launch device operation. The only difference between calling `clEnqueueWriteBuffer` or not is whether that enqueued memory transfer actually copies the memory.
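
A minimal illustration of the write-free path, assuming the same setup as the synopsis; `buf_size` stands in for the size expression used there, and the initial data rides along via `CL_MEM_COPY_HOST_PTR`:

```cpp
// No clEnqueueWriteBuffer anywhere: CL_MEM_COPY_HOST_PTR captures host_data
// at creation, and the runtime migrates it to the right device global
// memory just before the kernel runs.
cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            buf_size, host_data, &err);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &mem);
err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, &global_work_size_2d[0],
                             &local_work_size_2d[0], 0, NULL, &event);
```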

To make sure the memory is transferred to the right place, [`clSetKernelArg`](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/src/acl_kernel.cpp#L314-L725) plays a crucial role. It is responsible for:
1. Telling the kernel that one of its arguments is that specific buffer.
2. Setting the correct buffer attributes (e.g. global memory id) according to the kernel argument's attributes.
3. Creating and binding the host channel if the kernel arg is a host pipe.

Important: since SYCL calls `clSetKernelArg` after `clEnqueueWriteBuffer`, make sure to never change buffer properties inside `clSetKernelArg`.

### Enqueue Kernel / Task

During [kernel enqueue](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/src/acl_kernel.cpp#L1644-L2313), the runtime calls [l_copy_and_adjust_arguments_for_device](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/src/acl_kernel.cpp#L2730-L2983) to (a sketch of the migration bookkeeping follows the list):
1. Create a temporary buffer object that is an aligned copy of the buffer arg value.
2. Get the kernel's required buffer location.
3. Reserve space in the required device global memory if not already reserved.
4. Copy the reserved address into the kernel invocation image.
5. Prepare memory migration information containing this temporary buffer as the source and the destination device id as the target. Note that at this point the temporary buffer could already have gone through a memory transfer. This is taken care of when actually migrating the buffer ([`acl_mem_migrate_buffer`](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5412-L5665)), i.e. the function knows whether it should be moving from host to device or device to device (device-to-device is generally faster).
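
A hypothetical shape for the migration bookkeeping described in step 5; the runtime's actual type differs, so treat every field here as illustrative:

```cpp
#include <CL/cl.h>

// Illustrative only -- not the runtime's real struct. It captures the three
// facts step 5 needs: what to move, and where it must end up.
struct MigrationEntry {
  cl_mem src_buffer;            // temporary aligned copy of the kernel arg
  unsigned dest_device_id;      // device the kernel will run on
  unsigned dest_global_mem_id;  // global memory required by the kernel arg
};
```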

Some other notes about memory operations during enqueue:
1. A device local pointer is 4 bytes; a device global pointer is always the device's address width in bits divided by 8 (e.g. 8 bytes for a 64-bit device).

Before [submitting the kernel](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/src/acl_kernel.cpp#L2982-L3093) to the device queue, the runtime first checks whether the device is programmed; if not, it [queues a reprogram device operation](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/src/acl_kernel.cpp#L3034) to do so. Then it [arranges memory migration](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/src/acl_kernel.cpp#L3043) for each kernel memory argument.

Different from the device operations resulting from enqueue read/write, this memory migration calls [`acl_mem_migrate_buffer`](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5412-L5665) (i.e. memory transfer and memory migration behave differently).

#### Memory Migration

1. It takes the memory object that was passed into clSetKernelArg, along with the destination device and global memory id.
2. It checks whether the buffer's 2D list has a memory object at the destination global memory.
3. If so, it checks whether the buffer's current allocation, `block_allocation`, is the same as the one located at the destination global memory in the 2D list. If it is, the memory is already in the right place and no copy happens. If not, it calls the same MMD function as a memory transfer would to write from host memory to device memory ([copy_hostmem_to_globalmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1696-L1716) - calls the MMD function [`aocl_mmd_write`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879)). A decision sketch follows the list.
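
A condensed sketch of that decision; the type and field names are hypothetical, and the real logic in `acl_mem_migrate_buffer` handles many more cases:

```cpp
#include <cstdint>

// Hypothetical stand-in for the runtime's block allocation record.
struct BlockAllocation { uint64_t range_begin, range_end; };

bool needs_copy(const BlockAllocation *current,
                const BlockAllocation *at_destination) {
  // Step 3: if the buffer's current block_allocation is already the one
  // recorded at the destination global memory in the 2D list, nothing to do.
  if (at_destination && current == at_destination)
    return false;  // already resident at the destination
  return true;     // otherwise, copy (e.g. via copy_hostmem_to_globalmem)
}
```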

You may wonder what the difference between memory migration and memory transfer is. Memory migration's functionality is almost a subset of the memory transfer operation, because memory transfer additionally handles offsets and all the checks on image buffers.
