# Buffers

Unlike USM pointers, buffers take care of memory migration automatically.

## Summary

1. Buffers are not actually allocated on the device until the kernel executes, unless `clEnqueueWriteBuffer` is called or the `CL_MEM_COPY_HOST_PTR` flag is specified.
2. Buffers take care of memory migration between host and device automatically.
3. SYCL typically calls the OpenCL functions in a different order than OpenCL prefers. SYCL issues clCreateBuffer -> enqueue write buffer -> set kernel arg -> create kernel & enqueue kernel, whereas the optimal order for the OpenCL runtime is: create buffer -> set arg -> enqueue write buffer (optional) -> enqueue kernel.
4. The runtime keeps track of which memory addresses are occupied, and uses that to decide where the next allocation should be.
5. The actual memory allocation is made with MMD calls, by passing the device address and size.
6. Device addresses have a special encoding; this encoding is the same between buffers and USM pointers.
7. Even if the memory is not explicitly transferred before launching the kernel, it will still be migrated right before the kernel executes (i.e. `clEnqueueWriteBuffer` is not necessary).
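
The device-address encoding from point 6 can be sketched as follows. The runtime combines the device id and the raw device pointer with a bitwise OR (see the acl.h link later in this document); the exact bit position of the device-id field below is an assumption for illustration, not the runtime's actual constant.

```python
DEVICE_ID_SHIFT = 48  # hypothetical position of the device-id field


def encode_device_address(device_id: int, device_ptr: int) -> int:
    """Combine a device id and a raw device pointer into one encoded address."""
    return (device_id << DEVICE_ID_SHIFT) | device_ptr


def decode_device_address(addr: int):
    """Split an encoded address back into (device_id, device_ptr)."""
    return addr >> DEVICE_ID_SHIFT, addr & ((1 << DEVICE_ID_SHIFT) - 1)


addr = encode_device_address(2, 0x1000)
assert decode_device_address(addr) == (2, 0x1000)
```

Because the same encoding is shared between buffers and USM pointers, any code path that only sees an encoded address can still recover which device the allocation lives on.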

## The flow

The SYCL runtime calls the memory operations in the following order (each explained in the subsections below):
1. clCreateBufferWithPropertiesINTEL
2. clEnqueueWriteBufferIntelFPGA
3. clSetKernelArg
4. clEnqueueKernel / Task

This is not the preferred order from the point of view of the OpenCL runtime, so why couldn't SYCL change it? Because SYCL needs to accommodate other vendors too: not all vendors can get away with setting a kernel arg before enqueuing the write buffer (as discussed in one of the [issues](https://github.com/intel/llvm/discussions/4627)).

### clCreateBufferWithPropertiesINTEL
When [clCreateBufferWithPropertiesINTEL](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L408-L953) is called, a host pointer needs to be provided if this buffer is supposed to move data from the host to other places.

The memory can be allocated in different global memories, and in different banks within the same global memory.

When CL_MEM_COPY_HOST_PTR is specified, the runtime does not yet know which device the memory is going to, so it always allocates it on the first device in the context and submits the copy to the auto queue. A context can contain multiple devices; keep in mind that specifying this flag in that case may cause a bug.

The actual allocation and transfer of the memory is typically deferred until we know which device the buffer should be bound to. In this case, the provided host pointer is copied into the cl_mem object's host memory (to keep track of the data and use it in later calls when transferring to the device), which is an extra memory copy on the host side.

Note: For CV SoC, an allocation on shared memory goes to bank #1 if there are two banks and bank #0 if there is only one bank, always with an alignment of 1024 bytes.
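
The 1024-byte alignment in the note is the usual power-of-two round-up. A minimal sketch (`align_up` is a hypothetical helper name, not a runtime function):

```python
ALIGNMENT = 1024  # alignment for CV SoC shared-memory allocations, per the note above


def align_up(addr: int, alignment: int = ALIGNMENT) -> int:
    """Round addr up to the next multiple of alignment (must be a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


assert align_up(0) == 0
assert align_up(1) == 1024
assert align_up(1024) == 1024
assert align_up(1500) == 2048
```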

### clEnqueueWriteBufferIntelFPGA
[clEnqueueWriteBufferIntelFPGA](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L3475-L3510) allocates memory space for the buffer through [acl_bind_buffer_to_device](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L357-L406) and then enqueues a memory transfer to actually copy the memory through [l_enqueue_mem_transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174).

#### Allocate space
[acl_bind_buffer_to_device](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L357-L406) is responsible for finalizing the buffer allocation; it is only called if the allocation was deferred.

1. It first calls [acl_do_physical_buffer_allocation](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L256-L316), which checks whether the location of the memory has been set by checking if the buffer's location mem_id is 0. If it is zero, it is set to the [default device global memory](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L277) (as indicated in board_spec.xml).

Note: simulation does not know the memory interfaces of any device until an AOCX is loaded, which usually happens after SYCL calls clEnqueueWriteBuffer.

2. The buffer uses a 2D list (device, global_mem) to keep track of the allocation for each device. Only the devices actually used have their entries sized to the number of global memories.
3. There is a field (`block_allocation`) in the [buffer object](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/include/acl_types.h#L729-L878) that keeps track of the current block allocation. If the corresponding global memory on the given device is already set, the previous block allocation is deleted and `block_allocation` is set to what is recorded for that device's global memory. If the corresponding memory is not set in the 2D list, it calls [acl_allocate_block](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4310-L4565).
4. `acl_allocate_block` tries to allocate on the target global memory of the device. To do this, it first tries to allocate on the target DIMM, and then on the entire memory range.
5. It first needs to decide the range of memory it can allocate from, based on user-provided info about the device, global memory, and memory bank. The range is returned in the form [pointer to begin address, pointer to end address] (achieved through [l_get_working_range](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4253-L4308)).
6. The actual device address is different from its surface representation in the runtime; specifically, it is a bitwise OR of the device id and the device pointer, as calculated [here](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/include/acl.h#L264-L274).
7. A single device's global memory can be partitioned into multiple banks (the partition can be interleaved or separate, with interleaved being the default). Interleaved memory provides more load balancing between memory banks; the user can query which specific bank to access through runtime calls. The implication of interleaving is that consecutive memory allocations are not guaranteed to be adjacent in the memory address space. For more information on memory banks, see [Global Memory Accesses Optimization](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/optimize-your-design/throughput-1/memory-accesses/global-memory-accesses-optimization.html).
8. Once the range of candidate memory is set, the runtime also loops through the set of already-allocated blocks (already-occupied memory) and identifies any gaps in between that meet the size requirement. The allocation prioritizes the preferred bank; if all of that bank's memory is occupied, it looks elsewhere. Once the address of the allocation is decided, the mem object's current `block_allocation` is set to that address range.
9. Once the 2D list is ready and the current block allocation is set, for enqueued writes it [enqueues a memory transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174) from the context's unwrapped_host_mem to the buffer, as described in the next subsection.
10. You may wonder why the transfer comes from the context's unwrapped_host_mem. It is an implementation detail that allows read/write/copy to be treated uniformly: all read/write commands are given a pointer to host memory, and `unwrapped_host_mem` is a max-size host memory buffer used to wrap these pointers.
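
The gap-finding step (item 8) can be sketched as a first-fit scan over the working range. This is an illustrative model, not the runtime's code; the function name and the flat list of occupied blocks are assumptions, and the bank-preference retry from item 8 is reduced here to a single range scan.

```python
def find_gap(range_begin, range_end, occupied, size):
    """Return the start address of the first free gap of at least `size`
    bytes inside [range_begin, range_end), or None if nothing fits.

    `occupied` is a list of (start, end) half-open intervals for blocks
    that are already allocated.
    """
    cursor = range_begin
    for start, end in sorted(occupied):
        if start - cursor >= size:   # the gap before this block fits
            return cursor
        cursor = max(cursor, end)    # skip past the occupied block
    if range_end - cursor >= size:   # trailing gap after the last block
        return cursor
    return None


assert find_gap(0, 100, [(10, 20), (30, 90)], 10) == 0
assert find_gap(0, 100, [(0, 20), (30, 90)], 10) == 20
assert find_gap(0, 100, [(0, 95)], 10) is None
```

In the real runtime the scan would first be restricted to the preferred bank's sub-range and only fall back to the whole memory when that bank is full.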

#### Transfer Memory
The second part of enqueue read/write buffer is to enqueue a memory transfer between host and device, as implemented in [l_enqueue_mem_transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174).

The whole enqueue process is as follows:

Upon updating the command queues, the memory transfer is first submitted to the device operation queue by calling [acl_submit_mem_transfer_device_op](https://github.com/intel/fpga-runtime-for-opencl/blob/950f21dd079dfd55a473ba4122a4a9dca450e36f/src/acl_command.cpp#L343) ([definition](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5313-L5392)). When the device operation is executed, [acl_mem_transfer_buffer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5395-L5409) is called, which in turn calls [l_mem_transfer_buffer_explicitly](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5791-L6246).

`l_mem_transfer_buffer_explicitly` first creates a pointer-to-pointer mapping between the source and destination buffers, and then copies the memory through the MMD functions below, one mapped pointer pair at a time.

It relies on one of the four MMD functions:
1. [copy_hostmem_to_hostmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1680-L1694) - Uses `memcpy`.
2. [copy_hostmem_to_globalmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1696-L1716) - Calls the MMD function [`aocl_mmd_write`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879).
3. [copy_globalmem_to_hostmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1718-L1739) - Calls the MMD function [`aocl_mmd_read`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879).
4. [copy_globalmem_to_globalmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1763-L1873) - If the source and destination are on the same device, directly calls [`aocl_mmd_copy`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L881-891); otherwise uses both [`aocl_mmd_read`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879) and [`aocl_mmd_write`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879) to copy from the source device to the host, then from the host to the destination device. All operations are blocking; the runtime keeps calling the MMD's yield (sleep) function until the read and write are done.
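
The same-device versus cross-device decision in function 4 can be sketched as follows. The `mmd_*` functions here are stand-ins that record which MMD entry point would be invoked; they are illustrative, not the real MMD API.

```python
def mmd_copy(dev, src, dst, size):
    return [("copy", dev, src, dst, size)]


def mmd_read(dev, src, host, size):
    return [("read", dev, src, host, size)]


def mmd_write(dev, host, dst, size):
    return [("write", dev, host, dst, size)]


def copy_globalmem_to_globalmem(src_dev, dst_dev, src, dst, size, host_buf=0):
    """Return the list of MMD operations issued for a global-to-global copy."""
    if src_dev == dst_dev:
        # Same device: one direct on-device copy.
        return mmd_copy(src_dev, src, dst, size)
    # Different devices: stage the data through host memory.
    return mmd_read(src_dev, src, host_buf, size) + mmd_write(dst_dev, host_buf, dst, size)


assert copy_globalmem_to_globalmem(0, 0, 1, 2, 8) == [("copy", 0, 1, 2, 8)]
assert [op for op, *_ in copy_globalmem_to_globalmem(0, 1, 1, 2, 8)] == ["read", "write"]
```

The cross-device path is why copies between two devices are the slow case: the data travels over the host twice, with the runtime blocking on each MMD call.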


### clSetKernelArg: What if clEnqueueWriteBuffer is not called?

When `clEnqueueWriteBuffer` is not called, the memory transfer automatically happens before launching the kernel that uses the buffer: there is an enqueued memory-transfer device operation before every kernel-launch device operation. The only difference between calling `clEnqueueWriteBuffer` or not is whether that enqueued memory transfer actually copies the memory.

To make sure the memory is transferred to the right place, [`clSetKernelArg`](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/src/acl_kernel.cpp#L314-L725) plays a crucial role. It is responsible for:
1. Telling the kernel that one of its arguments is that specific buffer.
2. Setting the correct buffer attributes (e.g. global memory id) according to the kernel argument's attributes.
3. Creating and binding the host channel if the kernel argument is a host pipe.

### Enqueue Kernel / Task

During [kernel enqueue](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/src/acl_kernel.cpp#L1644-L2313), the runtime calls [l_copy_and_adjust_arguments_for_device](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/src/acl_kernel.cpp#L2730-L2983) to:
1. Create a temporary buffer object that is an aligned copy of the buffer argument value.
2. Get the kernel's required buffer location.
3. Reserve space at the required device global memory if not already reserved.
4. Copy the reserved address into the kernel invocation image.
5. Prepare memory migration information with this temporary buffer as the source and the destination device id as the target. Note that at this point, the temporary buffer could have already gone through a memory transfer. This is taken care of when actually migrating the buffer ([`acl_mem_migrate_buffer`](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5412-L5665)), i.e. the function knows whether it should be moving data from host to device or device to device (generally a device-to-device operation is faster).


Some other notes about memory operations during enqueue:
1. The device-local pointer size is 4 bytes; the device-global pointer size is always the device's address bits integer-divided by 8.
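
The pointer-size rule above in code form (a direct restatement of the note, with illustrative names):

```python
LOCAL_POINTER_SIZE = 4  # bytes, fixed for device-local pointers


def global_pointer_size(address_bits: int) -> int:
    """Size in bytes of a device-global pointer: address bits integer-divided by 8."""
    return address_bits // 8


assert global_pointer_size(64) == 8
assert global_pointer_size(32) == 4
```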

Before [submitting the kernel](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/src/acl_kernel.cpp#L2982-L3093) to the device queue, the runtime first checks if the device is programmed; if not, it [queues a reprogram device operation](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/src/acl_kernel.cpp#L3034) to do so. Then it [arranges memory migration](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/src/acl_kernel.cpp#L3043) for each kernel memory argument.

Unlike the device operations resulting from enqueue read/write, this memory migration calls [`acl_mem_migrate_buffer`](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L5412-L5665) (i.e. memory transfer and memory migration behave differently).

In memory migration:
1. The runtime takes the memory object that was passed into clSetKernelArg, along with the destination device and global memory id.
2. It checks if the buffer's 2D list has a memory object at the destination global memory.
3. If so, it checks whether the buffer's current allocation (`block_allocation`) is the same as the one recorded at the destination global memory in the 2D list. If true, the memory is already in the right place, and there is no copy operation. If not, it calls the same MMD function as memory transfer to write from host memory to device memory ([copy_hostmem_to_globalmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1696-L1716) - calls the MMD function [`aocl_mmd_write`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879)).
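
The migration check in steps 1-3 can be sketched as follows. This is a simplified model: the dictionary-based bookkeeping and function name are illustrative stand-ins for the runtime's 2D list and `acl_mem_migrate_buffer`.

```python
def migrate_buffer(buf, dst_device, dst_mem_id):
    """Decide whether a copy is needed before the kernel can run.

    `buf` models a cl_mem object: `block_allocation` is its current
    allocation, and `allocations` maps (device, global_mem_id) to the
    allocation recorded in the 2D list.
    """
    dest_alloc = buf["allocations"].get((dst_device, dst_mem_id))
    if dest_alloc is not None and dest_alloc == buf["block_allocation"]:
        return "no-op"                # already in the right place, no copy
    return "host-to-device copy"      # same MMD path as copy_hostmem_to_globalmem


buf = {"block_allocation": 0x1000, "allocations": {(0, 0): 0x1000}}
assert migrate_buffer(buf, 0, 0) == "no-op"
assert migrate_buffer(buf, 1, 0) == "host-to-device copy"
```

This check is what makes a preceding `clEnqueueWriteBuffer` optional: if the earlier transfer already placed the data at the destination, the pre-kernel migration degenerates into the no-op branch.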

You may wonder what the difference between memory migration and memory transfer is. Memory migration's functionality is almost a subset of the memory transfer operation, because memory transfer also handles the case where there are offsets, plus all the checks on image buffers.