Which are the fundamental ideas, guidelines for design for data pipelines and the highest standards of data orchestration?

Sairamesh Konidala

Authors

Sairamesh Konidala, Vice President at JPMorgan & Chase, USA

Keywords:

Data Pipelines, Data Orchestration, Data Workflow

Abstract

Data orchestration coordinates pipeline tasks so that they run efficiently and in the correct sequence, honoring dependencies and simplifying processes. Sound orchestration practice includes automating repetitive tasks, scheduling jobs to balance load, and building monitoring and alerting systems that detect errors promptly. Retries, checkpoints, and idempotent operations help build fault-tolerant systems that keep disruption to a minimum during failures. Keeping pipeline code and parameters under version control promotes consistency across deployments and makes changes traceable. A human-centric approach to data pipelines rests on explicit documentation and on collaboration among data engineers, analysts, and business stakeholders, which helps ensure that pipelines deliver meaningful outcomes and match company goals. Continuous testing and data quality checks at every pipeline stage lower the likelihood of downstream errors. Finally, an emphasis on security and data governance (appropriate access controls, encryption, and compliance with privacy rules) helps maintain trust and integrity throughout the data lifecycle. Following these guidelines and best practices can help businesses create robust, adaptable data pipelines that support innovation and growth.
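As an illustration of the fault-tolerance ideas named in the abstract, the following minimal Python sketch runs tasks in dependency order, retries a failing task, and records completed tasks as checkpoints so that a re-run is idempotent. The function `run_pipeline`, the task names, and the in-memory `done` set are hypothetical and chosen only for illustration; a production orchestrator would typically persist checkpoints and add backoff between retries.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, Set


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
    done: Set[str],
    max_attempts: int = 3,
) -> None:
    """Run tasks in dependency order, skipping finished ones and retrying failures."""
    # static_order() yields each task only after all of its dependencies.
    for name in TopologicalSorter(depends_on).static_order():
        if name in done:
            # Checkpoint hit: skipping makes a re-run of the pipeline idempotent.
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                done.add(name)  # record the checkpoint only after success
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # retries exhausted: surface the error for alerting


# Hypothetical three-step pipeline; "transform" fails once, then succeeds.
log = []
failures = {"transform": 1}


def step(name: str) -> Callable[[], None]:
    def run() -> None:
        if failures.get(name, 0) > 0:
            failures[name] -= 1
            raise RuntimeError(f"transient failure in {name}")
        log.append(name)
    return run


tasks = {n: step(n) for n in ("extract", "transform", "load")}
depends_on = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
done: Set[str] = set()

run_pipeline(tasks, depends_on, done)
run_pipeline(tasks, depends_on, done)  # second run does nothing new
```

Because the checkpoint is written only after a task succeeds, a failed run can be restarted without repeating work that already completed.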

