Modern cloud architectures distinguish data lakes from data warehouses in terms of choice of technology for your data pipelines.

Authors

  • Sairamesh Konidala, Vice President at JPMorgan & Chase, USA
  • Guruprasad Nookala, Software Engineer III at JP Morgan Chase LTD, USA
  • Vishnu Vardhan Reddy Boda, Sr. Software Engineer at Optum Services Inc, USA

Keywords:

Data Lake, Cloud Storage, Cost Optimization, Data Pipeline

Abstract

Effective data storage and analytics solutions are increasingly important for companies that make decisions in today's data-centric world. Data lakes and data warehouses are two commonly used alternatives that serve distinct purposes in modern cloud systems. The design, volume, and applications of your data pipelines largely determine which of the two is the suitable choice. Analyzing transactional data and creating reports from predefined queries is best done with a data warehouse, which is distinguished by its structured, schema-driven approach. It supports business intelligence (BI) tools by providing consistent and trustworthy information to decision-makers. Data lakes, on the other hand, offer a more flexible and affordable option for handling large volumes of raw data, whether unstructured, semi-structured, or structured. They allow data to be kept in its native form, which lets analysts, engineers, and data scientists explore it using a variety of processing architectures.

Published

16-07-2020

How to Cite

[1] Sairamesh Konidala, Guruprasad Nookala, and Vishnu Vardhan Reddy Boda, “Modern cloud architectures distinguish data lakes from data warehouses in terms of choice of technology for your data pipelines,” Distrib. Learn. Broad Appl. Sci. Res., vol. 6, pp. 1045–1065, Jul. 2020, Accessed: Mar. 14, 2025. [Online]. Available: https://dlbasr.org/index.php/publication/article/view/61