Optimal runtime selection of CPU / GPU BLAS (with an obligatory reference to The Fifth Element).
BLAS (1979) is a set of basic linear algebra operations, such as vector dot products and matrix multiplication. BLAS is used extensively in high-performance computing, with machine-optimised implementations available for specific CPU architectures (e.g. from Intel, AMD and Apple, and from open source projects such as ATLAS and OpenBLAS).
More recently, BLAS has been implemented for GPUs by NVIDIA (cuBLAS) and AMD (clBLAS, using the OpenCL standard). Such implementations give dramatic speedups for larger arrays, but are much slower for small to medium sized arrays (fewer than ~100,000 entries).
Clearly, small to medium arrays are best served by the machine-optimised CPU implementations, and larger arrays by the GPU.
MultiBLAS is a proof-of-concept that delegates between two implementations of BLAS, using dlopen/dlsym on UNIX systems and LoadLibrary/GetProcAddress on Windows systems.
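As a rough sketch of the UNIX side of this mechanism (the library paths below are illustrative assumptions, not the actual MultiBLAS layout), the same CBLAS symbol can be resolved from two shared libraries at startup:

```c
#include <dlfcn.h>  /* dlopen, dlsym; Windows would use LoadLibrary/GetProcAddress */
#include <stdio.h>
#include <stdlib.h>

/* Function pointer type matching the CBLAS ddot signature. */
typedef double (*ddot_fn)(const int N,
                          const double *X, const int incX,
                          const double *Y, const int incY);

/* Resolve cblas_ddot from a given shared library, aborting on failure. */
static ddot_fn load_ddot(const char *libpath) {
    void *handle = dlopen(libpath, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        fprintf(stderr, "dlopen(%s): %s\n", libpath, dlerror());
        exit(EXIT_FAILURE);
    }
    ddot_fn fn = (ddot_fn) dlsym(handle, "cblas_ddot");
    if (!fn) {
        fprintf(stderr, "dlsym(cblas_ddot): %s\n", dlerror());
        exit(EXIT_FAILURE);
    }
    return fn;
}

int main(void) {
    /* Illustrative paths: any CPU and GPU BLAS exposing the CBLAS API. */
    ddot_fn cpu_ddot = load_ddot("libblas_cpu.so");
    ddot_fn gpu_ddot = load_ddot("libblas_gpu.so");

    double x[] = {1.0, 2.0, 3.0};
    double y[] = {4.0, 5.0, 6.0};

    /* Both delegates expose the same API; the choice is made at runtime. */
    printf("cpu: %f\n", cpu_ddot(3, x, 1, y, 1));
    printf("gpu: %f\n", gpu_ddot(3, x, 1, y, 1));
    return 0;
}
```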
The First Milestone of the project is to implement delegation and machine-specific performance tuning, and to demonstrate performance results for DDOT and DGEMM on OS X, Linux and Windows.
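The delegation itself might then look something like the following sketch: a shim that exports the standard cblas_ddot symbol and forwards to whichever delegate is expected to be faster. The threshold value and the delegate variables here are assumptions for illustration; in practice the crossover point would come from the machine-specific tuning step.

```c
/* Hypothetical shim, compiled as a shared library. The delegates below
 * would be resolved at startup via dlopen/dlsym (or LoadLibrary/
 * GetProcAddress on Windows), as sketched above. */

typedef double (*ddot_fn)(const int N,
                          const double *X, const int incX,
                          const double *Y, const int incY);

static ddot_fn cpu_ddot;  /* machine-optimised CPU implementation */
static ddot_fn gpu_ddot;  /* GPU implementation */

/* Crossover point found by benchmarking on this machine; the
 * ~100,000-entry figure above is only a rough starting guess. */
static const int DDOT_GPU_THRESHOLD = 100000;

/* Exported with the standard CBLAS name, so callers are unchanged. */
double cblas_ddot(const int N,
                  const double *X, const int incX,
                  const double *Y, const int incY) {
    if (N < DDOT_GPU_THRESHOLD)
        return cpu_ddot(N, X, incX, Y, incY);
    return gpu_ddot(N, X, incX, Y, incY);
}
```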
It is very important that the BLAS and CBLAS APIs are preserved: decades of middleware have been developed against the BLAS API, and it has been exposed to other languages, e.g. by netlib-java.
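For reference, the two Milestone One routines have the following CBLAS prototypes (as in the reference netlib cblas.h); a delegating shim must export exactly these signatures so that existing middleware continues to work unchanged:

```c
#include <cblas.h>  /* for the CBLAS_ORDER and CBLAS_TRANSPOSE enums */

/* Dot product of two double-precision vectors. */
double cblas_ddot(const int N,
                  const double *X, const int incX,
                  const double *Y, const int incY);

/* Double-precision general matrix multiply: C = alpha*op(A)*op(B) + beta*C. */
void cblas_dgemm(const enum CBLAS_ORDER Order,
                 const enum CBLAS_TRANSPOSE TransA,
                 const enum CBLAS_TRANSPOSE TransB,
                 const int M, const int N, const int K,
                 const double alpha,
                 const double *A, const int lda,
                 const double *B, const int ldb,
                 const double beta,
                 double *C, const int ldc);
```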
The Second Milestone (which requires funding) will cover the full BLAS API.
Future milestones will deal with issues such as dynamic load balancing, minimising cold startup time, auto-batching, and beyond.
Please consider supporting this open source project with a donation.
Contributors are encouraged to fork this repository and submit pull requests. Contributors implicitly agree to grant an unrestricted licence to Sam Halliday, but retain the copyright of their code (this means we both have the freedom to update the licence for those contributions).