Performance-portable code is hard to produce because state-of-the-art hardware platforms are diverse and heterogeneous. Optimizing Artificial Neural Networks (ANNs) for multiple hardware platforms is harder still. Manual optimization is expensive, while modern automated tools either support only a narrow set of platforms or fail to exploit the individual strengths of each platform to the fullest.
The functional data-parallel language Lift has been shown to be performance-portable: the OpenCL code it generates performs on par with or better than highly tuned platform-specific libraries. This project aims to extend the method to the domain of ANNs by integrating domain-specific optimizations into Lift's rewrite-rule-based compiler. Candidate optimizations include:
- Exploration of the parallel mapping space
- Memory tiling
- Memory coalescing
- Approximate computations
- Float quantization
- Neuron pruning
- Training batch size autotuning
- Varying precision across layers and neurons
- Convolution kernel decomposition
- Sharing 32-bit registers
- OpenCL kernel fusion
- Expression simplification
- Use of proprietary instruction sets
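
To make one of the listed items concrete, float quantization can be illustrated with a minimal Python sketch. It assumes a uniform affine scheme that maps each weight to an unsigned integer code. The function names `quantize` and `dequantize`, the 8-bit default, and the sample weights are hypothetical choices for illustration only and do not describe how the Lift compiler implements this optimization.

```python
def quantize(weights, num_bits=8):
    """Map floats to unsigned integer codes (assumed uniform affine scheme)."""
    lo, hi = min(weights), max(weights)
    levels = (1 << num_bits) - 1  # highest integer code, e.g. 255 for 8 bits
    # Guard against a constant input, where hi == lo would give a zero scale.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


# Hypothetical layer weights, used only to exercise the sketch.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, offset = quantize(weights)
restored = dequantize(codes, scale, offset)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

The trade-off shown here is the one a compiler would have to weigh: fewer bits per weight reduce memory traffic, at the cost of a reconstruction error of at most half a quantization step per weight.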