Cloud-Based Data Pipelines: Design, Operations, and Model Based Approach

Authors

  • Sairamesh Konidala, Vice President at JPMorgan & Chase, USA

Keywords:

Cloud computing, ETL (Extract, Transform, Load), data pipelines, fault tolerance

Abstract

Modern data-driven companies depend on cloud-based data pipelines to manage vast amounts of information. These pipelines make effective data processing and analytics at scale possible by enabling the smooth collection, transformation, and movement of data across many cloud services and systems. Establishing a strong cloud-based data pipeline calls for knowledge of the company's needs, the characteristics of the data, and the available cloud services, such as Microsoft Azure, Google Cloud Platform, and Amazon Web Services (AWS). Implementation frequently requires a combination of ingestion tools, transformation techniques, orchestration services, and storage options working together. The design must address fundamental properties such as scalability, fault tolerance, latency, and security to ensure continuous data flow, even during system failures or heavy loads. An illustrative scenario is a company that gathers data from IoT devices, performs real-time analytics on it, and stores it in a cloud data warehouse for machine learning and future reporting.
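
The IoT scenario above can be made concrete with a small sketch. It is not taken from the paper: the names `Reading`, `InMemoryWarehouse`, `with_retries`, and `run_pipeline` are hypothetical, the "warehouse" is an in-memory list standing in for a cloud data warehouse, and the rolling average stands in for real-time analytics. The retry wrapper illustrates one simple form of fault tolerance, assuming failures are transient.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Iterable, List


@dataclass
class Reading:
    """One IoT measurement (hypothetical schema)."""
    device_id: str
    temperature_c: float


def with_retries(action: Callable[[], None], attempts: int = 3,
                 base_delay_s: float = 0.0) -> None:
    """Run `action`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            action()
            return
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay_s * (2 ** attempt))


class InMemoryWarehouse:
    """Stand-in sink; fails the first `failures` writes to simulate an outage."""

    def __init__(self, failures: int = 0) -> None:
        self.rows: List[Dict[str, object]] = []
        self._failures_left = failures

    def write(self, row: Dict[str, object]) -> None:
        if self._failures_left > 0:
            self._failures_left -= 1
            raise ConnectionError("simulated transient outage")
        self.rows.append(row)


def run_pipeline(readings: Iterable[Reading], warehouse: InMemoryWarehouse,
                 window: int = 3) -> None:
    """Ingest readings, compute a per-device rolling average, load each row."""
    recent: Dict[str, Deque[float]] = {}
    for reading in readings:
        # Transform: keep the last `window` values per device.
        buf = recent.setdefault(reading.device_id, deque(maxlen=window))
        buf.append(reading.temperature_c)
        row = {
            "device_id": reading.device_id,
            "temperature_c": reading.temperature_c,
            "rolling_avg_c": sum(buf) / len(buf),
        }
        # Load: tolerate transient sink failures.
        with_retries(lambda row=row: warehouse.write(row))


readings = [
    Reading("sensor-a", 20.0),
    Reading("sensor-a", 22.0),
    Reading("sensor-b", 30.0),
    Reading("sensor-a", 24.0),
]
warehouse = InMemoryWarehouse(failures=2)
run_pipeline(readings, warehouse)
```

In a real deployment each stage would typically be a managed service (a message queue for ingestion, a stream processor for analytics, a warehouse for storage), and retries would usually be paired with idempotent writes so that a repeated attempt does not duplicate rows.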

Published

15-05-2019

How to Cite

[1] Sairamesh Konidala, “Cloud-Based Data Pipelines: Design, Operations, and Model Based Approach”, Distrib. Learn. Broad Appl. Sci. Res., vol. 5, pp. 1586–1603, May 2019, Accessed: Mar. 14, 2025. [Online]. Available: https://dlbasr.org/index.php/publication/article/view/60