Add motivation and example usage of buffer at beginning, remove some fine grained details of the implementation

sherry-yuan · sherry-yuan · commit e4d42af8ea14 · 2022-03-01T11:10:17.000-08:00
diff --git a/docs/buffers.md b/docs/buffers.md
@@ -1,60 +1,89 @@
 # Buffers
 
-Unlike USM pointers, buffer takes care of memory migration automatically.
+Buffer takes care of automatic memory migration between host and device or between device when kernel needs to access the memory. This provides the user convinience of not having to copy memory from host to device manually.
+
+An synopsis of buffer usage is as follows
+
+```cpp
+#define GMEM_DATA_VEC 8
+
+cl_platform_id platform;
+cl_int err;
+// General setups
+err = clGetPlatformIDs(1, &platform, NULL);
+err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
+cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
+cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
+// aocx_size is the size of aocx file
+// aocx_contents is the content of aocx file
+cl_program program = clCreateProgramWithBinary(context, 1, &device, &aocx_size, (const unsigned char **)&aocx_contents, &binary_status, &err);
+cl_kernel kernel = clCreateKernel(program, "kernel_name", &err);
+
+// Host side data that contains the info needed for kernel execution
+unsigned int *host_data = (unsigned int *)acl_aligned_malloc(global_work_size*sizeof(unsigned int)*GMEM_DATA_VEC);
+
+// Setup the expectation for a buffer, but not actually allocating memory yet
+cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE, global_work_size*sizeof(unsigned int)*GMEM_DATA_VEC, NULL, &err);
+
+// Enqueue a memory transfer operation
+cl_event event = {};
+err = clEnqueueWriteBuffer(queue, mem, CL_FALSE, 0, global_work_size*sizeof(unsigned int)*GMEM_DATA_VEC, host_data, 0, NULL, &event);
+
+// Tells the kernel one of its argument is the buffer
+err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &mem);
+
+// Submit kernel execution to queue
+err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, &global_work_size_2d[0], &local_work_size_2d[0], 0, NULL, &event);
+
+... // Block on kernel finish and clean up resources
+```
 
 ## Summary
 
-1. Buffers are not actually allocated on the device until the kernel executes, unless clEnqueueWriteBuffer are called or CL_MEM_COPY_HOST_PTR flag is defined.
+1. Buffers are not actually allocated on the device until the kernel executes or `clEnqueueWriteBuffer` are called or `CL_MEM_COPY_HOST_PTR` flag is defined when calling `clCreateBuffer`.
 2. Buffer takes care of memory migration between host and device automatically.
-3. Sycl typically calls opencl function in different order. clCreateBuffer -> euqueue write buffer -> set kernel arg -> create kenel & enqueue kernel. But in opencl, the optimal calls is: create buffer -> set arg -> enqueue write buffer (opt) -> enqueue kernel.
-4. The runtime keeps track of which memory address are occupied, and use that to decide where the next allocation should be.
+3. SYCL runtime typically calls opencl function in different order compare to OpenCL prefered order. Specifically. SYCL runtime calls functions in the order of `clCreateBuffer` > `clEnqueueWriteBuffer` > `clSetKernelArg` > `clEnqueueNDRangeKernel`. In opencl, the optimal calls is: `clCreateBuffer` > `clSetKernelArg` > `clEnqueueWriteBuffer` >  `clEnqueueNDRangeKernel`. This sometimes lead to unexpected result, but can be avoided if `clSetKernelArg` do not modify `cl_mem` attributes.
+4. The runtime keeps track of which memory address are occupied, and use that to decide where the next allocation should be. Specifically, it uses first-fit allocation algorithm.
 5. The actual memory allocation is made with MMD calls, by passing the device address, size.
 6. Device address has a special encoding, this encoding is the same between buffer and USM pointers.
 7. Even if the memory is not explicitly transferred before launching the kernel, it will still be migrated right before the kernel executes. (i.e clEnqueueWriteBuffer are not necessary).
 
 ## The flow
 
-SYCL runtime calls the memory operation in the following order (each explained in the subsection below)
-1. clCreateBufferWithPropertiesINTEL
-2. clEnqueueWriteBufferIntelFPGA
+SYCL runtime calls the memory operation in the following order. Each of them are explained in the subsection below.
+1. clCreateBuffer
+2. clEnqueueWriteBuffer
 3. clSetKernelArg
 4. clEnqueueKernel / Task
 
 This is not the preferred order from point of view of OpenCL Runtime, but you may ask why couldn't SYCL change it? It is because SYCL needs to take care of other vendors too. Not all vendors can get away with setting kernel arg first before enqueue write buffer (As discussed in one of the [issues](https://github.com/intel/llvm/discussions/4627)).
 
-### clCreateBufferWithPropertiesINTEL
-When [clCreateBufferWithPropertiesINTEL](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L408-L953) is called, a host pointer need to be provided if this buffer is supposed to move data from the host to other places.
+### clCreateBuffer
+When [clCreateBuffer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L408-L953) is called, a host pointer need to be provided if this buffer is supposed to move data from the host to other places.
 
-The memory can be allocated in different global memory and different bank within the same global memory
+The memory can be allocated in different global memory and different bank within the same global memory.
 
-When CL_MEM_COPY_HOST_PTR is specified, it does not know which device is the memory going to, so it will always allocate it in the first device in context and submitted to auto_queue. Context can contain multiple devices, and if this flag is specified in this case keep in mind this may cause a bug.
+When CL_MEM_COPY_HOST_PTR is specified, it does not know which device is the memory going to, so it will always allocate it in the first device in context and submitted to auto_queue. Context can contain multiple devices, and if this flag is specified in this case keep in mind this may cause a bug. For more information about the other flags available, see [OpenCL clCreateBuffer Spec](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clCreateBuffer.html).
 
-The actual allocation of the memory and transfer of the memory is typically deferred until we know which device the buffer should be bounded to. In this case, there is a copy from provided host pointer to cl_mem object's host mem (to keep track and use in later calls when transferring to device), this is an extra mem copy on the host side.
-
-Note: If the allocation is on shared memory for CV SoC is bank #1 if there are two banks and bank #0 if there is only one bank, always with alignment of 1024 bytes.
-
-### clEnqueueWriteBufferIntelFPGA
-[clEnqueueWriteBufferIntelFPGA](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L3475-L3510) will allocate memory space for buffer through [acl_bind_buffer_to_device](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L357-L406) and then enqueue a mem transfer to actually copy the memory through [l_enqueue_mem_transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174)
+### clEnqueueWriteBuffer
+[clEnqueueWriteBuffer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L3475-L3510) will allocate memory space for buffer through [acl_bind_buffer_to_device](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L357-L406) and then enqueue a memory transfer to actually copy the memory through [l_enqueue_mem_transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174)
 
 #### Allocate space
 [acl_bind_buffer_to_device](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L357-L406) is responsible for finalizing buffer allocation, it is only called if the allocation is deferred.
 
-1. It first calls on [acl_do_physical_buffer_allocation](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L256-L316) keep track of whether the location of memory is set by checking if the buffer location mem_id is 0. if it is zero, then it will be set to [default device global memory](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L277) (as indicated in board_spec.xml)
+1. It first calls on [acl_do_physical_buffer_allocation](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L256-L316) to set the target global memory to [default device global memory](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L277) (as indicated in board_spec.xml) if not specified.
 
 Note: simulation does not know the memory interfaces of any device until an AOCX is loaded, which usually happens after SYCL calls clEnqueueWriteBuffer.
 
-2. Buffer uses a 2D list (device, global_mem) to keep track of the allocation for each device. Only the devices used are sized to the number of global_mem.
-3. There is a field (`block_allocation`) in [buffer object](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/include/acl_types.h#L729-L878) that keeps track of current block allocation. If the corresponding global memory on the given device is already set, then delete the previous block allocation, and set it to what is set in the given device global mem. If the corresponding memory is not set in the 2D list, then it will call [acl_allocate_block](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4310-L4565)
-4. `acl_allocate_block` tries to allocate on target global memory on the device. To do this, it first tries allocate on target DIMM, and then it tries on the entire memory range.
-5. It first needs to decide on the range of memory it can allocate based on user provided info regarding which device, global memory, memory bank. Return range in the form of [pointer to begin address, pointer to end address] (achieved through [l_get_working_range](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4253-L4308)).
-6. The actual device address is different from the surface representation in runtime, specifically, they are bit or of device id and device pointer as calculated [here](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/include/acl.h#L264-L274).
-7. A single device's global memory can be partitioned into multiple banks (the partition can be interleaving or separate, with interleaving being the default). Interleaving memory provide more load balancing between memory banks, user can query which specific bank to access through runtime calls. For more information on memory banks, see [Global Memory Accesses Optimization](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/optimize-your-design/throughput-1/memory-accesses/global-memory-accesses-optimization.html)
-8. Once the range of candidate memory is set, also loop through the set of already allocated blocks (already occupied memory), and identify any gaps in between that met size requirement. The allocation will prioritize on the preferred bank, if all bank's mem occupied, then find other places. Once the address of allocation is decided, set mem object's current `block_allocation` to that address range.
-9. Once the 2D list is ready and the current block allocation is set, it will [enqueue a memory transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174) from context's unwrapped_host_mem to the buffer's for enqueue writes. Described in the next subsection.
-10. You may wonder why is it from each context's unwrapped_host_mem? It is an implementation detail that allows us to treat read/write/copy in a uniform way. All read/write commands are given a pointer to host memory, and `unwrapped_host_mem` is a max size host memory buffer used to wrap these pointers.
+2. It will then reserve the memory through [acl_allocate_block](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4310-L4565). It first tries to allocate memory on the preferred bank, if it fails (i.e bank's memory are full), then it will try on the whole device global memory.
+3. `acl_allocate_block` first needs to decide on the range of memory it can allocate based on user provided info regarding which device, global memory, memory bank. Return range in the form of [pointer to begin address, pointer to end address] (achieved through [l_get_working_range](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4253-L4308)).
+4. The terminology `memory bank` was used. A single device's global memory can be partitioned into multiple banks (the partition can be interleaving or separate, with interleaving being the default). Interleaving memory provide more load balancing between memory banks, user can query which specific bank to access through runtime calls. For more information on memory banks, see [Global Memory Accesses Optimization](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/optimize-your-design/throughput-1/memory-accesses/global-memory-accesses-optimization.html)
+5. Once the range of candidate memory is set, also loop through the set of already allocated blocks (already occupied memory), and identify any gaps in between that met size requirement. The allocation will prioritize on the preferred bank, if all bank's memory occupied, then find other places. Once the address of allocation is decided, set memory object's current `block_allocation` to that address range. The specifics of how memory are reserved is described in Section `Memory allocation algorithm`
+6. Once the address range is set, it will return a device address. The device address is different from the surface representation in runtime, specifically, they are bit or of device id and device pointer as formated [here](https://github.com/intel/fpga-runtime-for-opencl/blob/1264543c0361530f5883e35dc0c9d48ac0fd3653/include/acl.h#L264-L274).
+7. Once the 2D list is ready and the current block allocation is set, it will [enqueue a memory transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174) from context's unwrapped_host_mem to the buffer's for enqueue writes. Described in the next subsection.
 
 ##### Memory allocation algorithm
-The memory allocation algorithm is first fit allocation. It start from the beginning of requested global memory, then search for the next available space (gaps or at the end) that satisfy size requirement. If the user request for specific memory bank, then the memory has to be non-interleaving. When user specified bank id, the first fit allocation will start at the address of ((bank_id -1) % num_banks )* bank_size + the start of target global_mem. The implication of specifying bank id is: consecutive memory allocation may not be adjacent to each other. On the other hand, if user never specified bank id, then the consecutive memory allocation should be adjacent.
+The memory allocation algorithm is first fit allocation. It start from the beginning of requested global memory, then search for the next available space (gaps or at the end) that satisfy size requirement. If the user request for specific memory bank, then the memory has to be non-interleaving. When user specified bank id, the first fit allocation will start at the address of ((bank_id -1) % num_banks )* bank_size + the start of target global_mem. The implication of specifying bank id is: consecutive memory allocation may not be adjacent to each other. On the other hand, if user never specified bank id, then the consecutive memory allocation should be adjacent assuming there was no deallocation.
 
 #### Transfer Memory
 The second part of enqueue read/write buffer is to enqueue a memory transfer between host and device, as implemented in [l_enqueue_mem_transfer](https://github.com/intel/fpga-runtime-for-opencl/blob/b08e0af97351718ce0368a9ee507242b35f4929e/src/acl_mem.cpp#L4726-L5174).
@@ -72,10 +101,9 @@ It relies on one of the four MMD functions:
 3. [copy_globalmem_to_hostmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1718-L1739) - calls MMD [`aocl_mmd_read`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879)
 4. [copy_globalmem_to_globalmem](https://github.com/intel/fpga-runtime-for-opencl/blob/fc99b92704a466f7dc4d84bd45d465d64d03dbb0/src/acl_hal_mmd.cpp#L1763-L1873) - If the source and destination are on the same device, then directly call [`aocl_mmd_copy`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L881-891), otherwise use both [`aocl_mmd_read`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879) and [`aocl_mmd_write`](https://gitlab.devtools.intel.com/OPAE/opencl-bsp/-/blob/master/agilex_f_dk/source/host/ccip_mmd.cpp#L870-879) to copy from source device to host, then host to destination device. All operation are blocking, the runtime will keep calling MMD's yield (sleep function) until read and write are done.
 
-
 ### clSetKernelArg: What if clEnqueueWriteBuffer are not called?
 
-When `clEnqueueWriteBuffer` is not called, the memory transfer will automatically happen before launching the kernel that uses it. There is an enqueued mem transfer device operation before every kernel launch device operation. The only difference between calling enqueueWriteBuffer or not is whether the enqueue mem transfer will actually copy the memory. 
+When `clEnqueueWriteBuffer` is not called, the memory transfer will automatically happen before launching the kernel that uses it. There is an enqueued memory transfer device operation before every kernel launch device operation. The only difference between calling enqueueWriteBuffer or not is whether the enqueue memory transfer will actually copy the memory. 
 
 In order to make sure the memory will be transferred to the right place, [`clSetKernelArg`](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/src/acl_kernel.cpp#L314-L725) plays a crucial role.  [`clSetKernelArg`](https://github.com/intel/fpga-runtime-for-opencl/blob/3f7a228133f92c63be5b04e222f3fc8ff72310e6/src/acl_kernel.cpp#L314-L725) is responsible for:
 1. Telling the kernel that one of its arguments is that specific buffer