Scaling AI Workloads with Machine Learning on Kubernetes

Authors

  • Naresh Dulam, Vice President, Sr. Lead Software Engineer, JP Morgan Chase, USA

Keywords:

Kubernetes, Machine Learning, Scaling AI Workloads, Containerization

Abstract

Machine learning (ML) is changing industries such as healthcare, banking, and marketing, but as AI workloads grow, more scalable and efficient infrastructure is required. Kubernetes, an open-source container orchestration platform, provides the flexibility needed to manage and scale machine learning systems. Organizations can use it to coordinate containerized workloads, optimize resource usage, and enable distributed computing. Kubernetes streamlines large-scale model deployment through automation, self-healing, and smooth integration with machine learning frameworks such as TensorFlow, PyTorch, and Apache MXNet. Its benefits include increased availability, more efficient scaling, and simplified model maintenance. However, challenges such as resource optimization and the management of complex workflows persist. Despite this, Kubernetes remains a viable option for organizations looking to optimize ML model deployment in cloud-native systems. Businesses that adopt Kubernetes may see improved performance, scalability, and operational efficiency, making it a critical tool for the future of AI-powered infrastructure.
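The automated scaling and self-healing behavior described above can be sketched as a minimal Kubernetes configuration: a Deployment that serves a containerized model alongside a Horizontal Pod Autoscaler that adds or removes replicas based on CPU load. The deployment name, container image, and thresholds below are illustrative assumptions, not taken from the article:

```yaml
# Hypothetical Deployment serving a containerized ML model
# (the image name and resource figures are illustrative).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-model-server
  template:
    metadata:
      labels:
        app: ml-model-server
    spec:
      containers:
      - name: model-server
        image: example.com/tf-serving-model:latest  # assumed image
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "2Gi"
---
# Horizontal Pod Autoscaler: Kubernetes adds replicas when average
# CPU utilization across pods exceeds the target, and removes them
# when load falls, keeping the replica count between min and max.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-server
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
```

Self-healing comes from the Deployment controller itself: if a model-serving pod crashes or its node fails, Kubernetes reschedules a replacement to restore the declared replica count without operator intervention.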


Published

01-09-2016

How to Cite

[1]
Naresh Dulam, “Scaling AI Workloads with Machine Learning on Kubernetes”, Distrib. Learn. Broad Appl. Sci. Res., vol. 2, pp. 50–70, Sep. 2016, Accessed: Mar. 14, 2025. [Online]. Available: https://dlbasr.org/index.php/publication/article/view/78