What is a modern data pipeline and why is it essential?
Keywords:
Data Pipeline, Modern Data Pipelines, ETL, ELT

Abstract
A modern data pipeline is a system designed to efficiently manage the movement, transformation, and analysis of data, helping businesses turn raw data into valuable insights. Unlike traditional pipelines, modern data pipelines are built to handle the growing volume, velocity, and variety of data generated in today's digital environment. Often leveraging cloud technologies, real-time processing, and automation, these pipelines comprise tools for data collection, cleansing, transformation, storage, and analysis. Modern data pipelines are essential because they maximize the flow of information and ensure that data is reliable, available, and ready for use in decision-making. To stay competitive, improve customer experiences, and drive innovation, companies rely on accurate and timely data. Modern pipelines minimize manual work by automating data flow and processing, reducing errors and speeding up the creation of insights. They allow companies to handle multiple data sources, including transaction records, customer databases, sensor data, and social media feeds. A properly built data pipeline helps companies in data-centric environments react to rapid change. It lets teams focus on analysis and planning rather than being held back by the complexities of data management. Ultimately, modern data pipelines provide a continuous flow of data, helping businesses use information consistently and efficiently.
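
The stages described above (collection, cleansing, transformation, storage, analysis) can be illustrated with a minimal extract-transform-load sketch in Python. This is an assumption-laden illustration rather than a prescribed implementation: the source file transactions.csv, the column names customer_id and amount, and the SQLite target warehouse.db are all hypothetical stand-ins for real sources and storage layers.

# Minimal ETL sketch: collect raw records, cleanse and reshape them,
# then load the result into a storage layer ready for analysis.
# File names, column names, and the SQLite target are hypothetical.
import csv
import sqlite3


def extract(path: str) -> list[dict]:
    """Collection: read raw records from a CSV source (e.g., transaction exports)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows: list[dict]) -> list[tuple]:
    """Cleansing and transformation: drop incomplete rows, normalize types."""
    cleaned = []
    for row in rows:
        if not row.get("customer_id") or not row.get("amount"):
            continue  # skip records missing required fields
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip records with unparseable amounts
        cleaned.append((row["customer_id"].strip(), amount))
    return cleaned


def load(records: list[tuple], db_path: str = "warehouse.db") -> None:
    """Storage: persist transformed records so they are ready for analysis."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)


if __name__ == "__main__":
    load(transform(extract("transactions.csv")))

In a production-grade modern pipeline, each of these stages would typically be handled by managed cloud services, streaming ingestion, and workflow orchestration rather than a single script, but the stage boundaries remain the same.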
License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.