by Colin Thomas
Data Engineering is the discipline of designing, building, and maintaining a robust infrastructure for collecting, transforming, storing, and serving data for use in machine learning, analytic reporting, and decision management.
Managing a successful data engineering project requires balancing three aspects: pragmatism, principles, and practice.
- The pragmatic aspect considers the constraints of a project, such as budget, existing tools, and minimum viable solutions.
- The principled aspect focuses on fully understanding the problem and developing robust solutions that follow best practices.
- The practiced aspect draws on expertise derived from repeated application of a canonical set of techniques to enable efficient solution deployments.
These three aspects drive related engineering archetypes, as shown in Figure 1:
- The Designer balances pragmatism and principles to prepare robust solutions that conform to the constraints of a project.
- The Architect draws on principles and practice to develop solutions that follow general best practices and are likely to meet the needs of a project.
- The Coder combines practice with pragmatism to formulate quick solutions that use existing technologies to provide instant value.
The Data Engineer balances all three aspects to avoid the most common data engineering pitfalls.
Data engineering projects often look alike in their structure, as shown in Figure 2: the client has source data that needs to be extracted, transformed, and loaded (ETL) into an analytic data store. This analytic data store is used to inform data models whose results are then presented visually to users.
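The extract, transform, and load steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production design: the sample CSV data, the `orders` table, and the function names `extract`, `transform`, `load`, and `run_pipeline` are all hypothetical, and an in-memory SQLite database stands in for the analytic data store.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in a real project this would come from the
# client's files, databases, or other source systems.
SOURCE_CSV = """order_id,region,amount
1, east ,100.50
2,WEST,20
3,east,
4,West,79.50
"""


def extract(text):
    """Extract: read raw rows from CSV text as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize region names and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        region = row["region"].strip().lower()
        cleaned.append((int(row["order_id"]), region, float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write cleaned rows into the (hypothetical) analytic store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Run all three steps, then query the store as a report might."""
    load(transform(extract(text)), conn)
    query = (
        "SELECT region, SUM(amount) FROM orders "
        "GROUP BY region ORDER BY region"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for the analytic data store
totals = run_pipeline(SOURCE_CSV, conn)
print(totals)
```

Keeping the three steps as separate functions means each one can be swapped out when the client environment changes, for example replacing `extract` with a database reader, without touching the other two.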
However, while the structure of data engineering projects rarely changes, the details often do. Client environments can range from entirely cloud-based platforms such as AWS or Azure to bespoke on-premises data centers, or some mix of the two. Source data stores can include file storage systems with documents in dozens of formats, SQL databases from multiple vendors, and specialized NoSQL solutions. The client might have existing software licenses that they would like to use, or existing models and ETL processes that must be integrated. The data engineer must be ready to handle any combination of these components by being pragmatic, principled, and practiced:
- Clients often have problems that require quick solutions developed on a limited budget. The data engineer will need to work within the constraints of time, budget, and environment to pragmatically develop a solution that meets the client’s needs. Often, this requires combining existing tools and components while replicating solutions that have worked in similar situations.
- Deployed analytic solutions tend to stick around longer than originally envisioned. The data engineer must consider the entire client ecosystem to develop strong, maintainable solutions based on sound principles. With proper planning, the deployed solution will be easier to manage and adapt to changing needs.
- In spite of the uniqueness of each client environment, data engineering projects share many similar components. Working on a variety of projects allows the data engineer to build a toolbox of practiced techniques, such as database query optimization or data cleaning. Gaining experience with projects using a particular ecosystem, such as AWS, familiarizes the data engineer with the tools available and potential pitfalls of that environment.
Data that is not properly organized and processed leads to long delays and failed analytics projects. The successful data engineer learns to balance the strengths and weaknesses of pragmatism, principle, and practice across all phases of the data pipeline to deliver successful solutions that enable end users to make informed business decisions.
Originally published at https://www.elderresearch.com on March 12, 2021.