The makings of an exceptional big data engineer

As a discipline, data engineering is effectively the sibling of data science: like data scientists, data engineers are highly analytical, able to write complex code and produce detailed data visualizations. However, unlike data scientists, data engineers manage complete data pipelines by building tools, infrastructure, frameworks and services. Their most crucial role is to design, develop, construct, install, test and maintain entire data management and processing systems.

As such, we can think of the field as a superset of business intelligence and data warehousing that incorporates elements of software engineering. Big data engineering adds further components: distributed systems, the Hadoop ecosystem, stream processing and computation at scale. Here, we examine this skill set in more detail, discussing the responsibilities of a data engineer and evolving approaches to the discipline.

Overview of the role and responsibilities of a data engineer and their team

As introduced, a data engineer is responsible for managing the entire data pipeline. The exact parameters of these tasks will depend on the scale of the organisation. In smaller environments, big data engineering roles may cover setting up and operating data infrastructure, including implementing and managing platforms such as Hadoop, Hive, HBase and Spark. In addition, they’ll often use hosted services offered by the likes of AWS, Azure, or Databricks.

In larger environments, the need for a data infrastructure team grows. The data engineering team can be considered a “centre of excellence” through defining standards, best practices and certification processes for data objects. The role of automating some big data engineering processes falls under the remit of both the data engineering and data infrastructure teams, which tend to collaborate to solve higher level problems. As the engineering aspect of the role is growing in scope, tasks like developing and maintaining portfolios of reports and dashboards are not a data engineer’s main focus.

The data engineering team in large organisations may also lead an education programme where they’ll share their core competencies in order to help other team members become better citizens of the data warehouse. For instance, Facebook has a “data camp” education program and Airbnb is developing a similar “data university”, where data engineers lead sessions about data proficiency.

Zoom-in on key functions and responsibilities

Data warehousing

Data engineering is a rapidly evolving field; however, the data warehouse is just as relevant as it ever was. Thus, a key responsibility of a data engineer is to oversee its construction and operation. In essence, the data warehouse is the data engineer’s focal point, and the role revolves around it.

To drill deeper into the functions of a well-constructed data warehouse, it is useful to reflect on the definition coined by Bill Inmon, who is widely regarded as the father of data warehousing. Inmon defines the data warehouse as “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.” To define what makes an effective data warehouse, let’s break down exactly what this means:

The well-constructed data warehouse

Subject-oriented: A data warehouse can be used to analyse a particular subject area. For example, “sales” could be a particular category.
Integrated: A data warehouse integrates data from multiple sources. For example, source A and source B may have different ways of identifying a product, but in the data warehouse, there will be only a single identification method.
Time-variant: A data warehouse is a storage space for historic data. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, which often only holds recent data. Whereas a transaction system may hold the most recent address of a customer, a data warehouse can hold all addresses associated with that client.
Non-volatile: Once data are in the data warehouse, they won’t change. Therefore, historical data in a data warehouse should never be altered.
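
As a rough illustration of the integrated, time-variant and non-volatile properties, the following Python sketch uses an in-memory SQLite database. The table names, column names and sample identifiers are hypothetical and chosen purely for illustration; a real warehouse would be modelled around the organisation's own sources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Integrated: one warehouse key per product, whatever each source calls it.
    CREATE TABLE product_key_map (
        source_system        TEXT NOT NULL,
        source_product_id    TEXT NOT NULL,
        warehouse_product_id INTEGER NOT NULL,
        PRIMARY KEY (source_system, source_product_id)
    );
    -- Time-variant: every address a customer has had, not just the latest.
    CREATE TABLE customer_address_history (
        customer_id INTEGER NOT NULL,
        address     TEXT NOT NULL,
        valid_from  TEXT NOT NULL
    );
""")

# Source A and source B identify the same product differently.
conn.executemany(
    "INSERT INTO product_key_map VALUES (?, ?, ?)",
    [("source_a", "SKU-001", 1), ("source_b", "P/1", 1)],
)


def record_address(customer_id, address, valid_from):
    """Non-volatile: append a new row; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO customer_address_history VALUES (?, ?, ?)",
        (customer_id, address, valid_from),
    )


def warehouse_product_id(source_system, source_product_id):
    """Resolve a source-specific product identifier to the single warehouse key."""
    row = conn.execute(
        "SELECT warehouse_product_id FROM product_key_map "
        "WHERE source_system = ? AND source_product_id = ?",
        (source_system, source_product_id),
    ).fetchone()
    return row[0] if row else None


def address_history(customer_id):
    """Return all addresses recorded for a customer, oldest first."""
    rows = conn.execute(
        "SELECT address FROM customer_address_history "
        "WHERE customer_id = ? ORDER BY valid_from",
        (customer_id,),
    ).fetchall()
    return [r[0] for r in rows]


record_address(42, "1 Old Street", "2019-01-01")
record_address(42, "2 New Avenue", "2021-06-15")
```

Here, both source identifiers resolve to the same warehouse key, and a change of address adds a row rather than overwriting the previous one.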

As data warehousing pioneer Ralph Kimball specifies, “a data warehouse is a copy of transaction data specifically structured for query and analysis”, and it is these features that make a data warehouse highly searchable, and thus, functional. Consequently, a data engineer is responsible for cataloging and organising metadata and defining the processes for extracting data from the warehouse. In a fast-growing, rapidly evolving data ecosystem, metadata management and tooling have become a vital component of a modern data platform.

ETL vs. programmatic workflows

Data engineers are shifting away from drag-and-drop ETL (Extract, Transform and Load) tools towards a more programmatic approach. Platforms like Informatica, Cognos, Ab Initio or Microsoft SSIS are no longer common amongst modern data engineers. Instead, they are being replaced by more generic platforms like Airflow, Oozie, Azkaban or Luigi. Therefore, a competent data engineer should be familiar with these contemporary approaches to data workflows.
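
To give a feel for what "programmatic" means here, the sketch below defines a small workflow as plain Python functions plus a dependency graph, and runs the tasks in dependency order. It deliberately uses only the standard library (`graphlib`, Python 3.9+) and does not reproduce the API of Airflow, Luigi or any other scheduler; the task names and sample rows are assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def extract():
    # Hypothetical raw records, as they might arrive from a source system.
    return [{"product": "a", "amount": "10"}, {"product": "b", "amount": "5"}]


def transform(rows):
    # Cast the amount field from text to an integer.
    return [{**row, "amount": int(row["amount"])} for row in rows]


def load(rows):
    # Stand-in for writing to a warehouse: return the total that was loaded.
    return sum(row["amount"] for row in rows)


# Each task is a callable plus the names of the tasks whose output it consumes.
tasks = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_pipeline(tasks):
    """Run every task after its dependencies and collect the results by name."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for name in order:
        func, deps = tasks[name]
        results[name] = func(*(results[dep] for dep in deps))
    return order, results


order, results = run_pipeline(tasks)
```

Because the workflow is ordinary code, it can be version-controlled, reviewed and tested like any other software, which is a large part of the appeal over drag-and-drop tools.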

Data modelling

Today, traditional data modelling techniques for analytics workloads, typically associated with data warehouses, are not as relevant as they once were. As such, a good data engineer needs to be familiar with recent trends in data modelling. For instance, storage and compute cycles are cheaper than ever. With the advent of distributed databases that scale out linearly, the use of keys and dimension attributes in fact tables is becoming more common.

Equally, support for encoding and compression in serialization formats like Parquet or ORC addresses most of the performance loss that would normally be associated with denormalization. As such, talented data engineers should explore innovative, creative solutions to enhance database read performance.
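
The trade-off can be sketched with a toy example: the same aggregate computed once through a join against a dimension table, and once from a fact table that carries the dimension attribute directly. This uses an in-memory SQLite database rather than a distributed store or a columnar format, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised layout: the category lives only in the dimension table.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);

    -- Denormalised layout: the category is repeated on every fact row.
    CREATE TABLE fact_sales_denorm (product_id INTEGER, category TEXT, amount REAL);

    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 20.0);
    INSERT INTO fact_sales_denorm
        SELECT f.product_id, d.category, f.amount
        FROM fact_sales AS f
        JOIN dim_product AS d ON f.product_id = d.product_id;
""")

# Sales per category via a join between fact and dimension tables.
normalised = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON f.product_id = d.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

# The same question answered from the denormalised table, with no join.
denormalised = conn.execute("""
    SELECT category, SUM(amount)
    FROM fact_sales_denorm
    GROUP BY category
    ORDER BY category
""").fetchall()
```

Both queries return the same totals; the denormalised version avoids the join at the cost of repeating the category value on every row, which is the kind of redundancy that columnar encoding and compression handle well.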

Furthermore, modern databases have growing support for Binary Large Objects (BLOBs), such as images, sound files or other multimedia objects, through native types and functions. This allows fact tables to store multiple grains at once. Plus, since the advent of MapReduce and the growing popularity of document stores alongside support for BLOBs in databases, it’s becoming easier to evolve database schemas without executing DML.

Finally, considering the recent commoditization of compute cycles, there is less need to precompute and store results in the warehouse. For instance, a data engineer can create a complex Spark job that can compute detailed analysis on-demand only, instead of scheduling the output to be part of the warehouse.

In summary, a competent, up-to-date data engineer will be au fait with all the new tools and techniques that make database performance more streamlined and effective.

Performance optimization

As implied in the preceding section, performance optimization is a key competency of a good data engineer. As such, data engineers are required to focus on performance tuning and optimization of data processing and storage. In essence, optimization consists of linearising exponential growth in both resource utilization and costs; therefore, the data engineer needs to build infrastructure that scales with the company, whilst being resource conscious.

Data integration

Data integration is another key responsibility of a data engineer. This process consists of integrating businesses and systems via the exchange of data. Of late, Software as a Service (SaaS) has become the new standard, and consequently, synchronising data across all systems has become critical. Even if a SaaS product includes its own analytics offering, the data it generates needs to move to the data warehouse seamlessly so that it can be analysed alongside the rest of the data.

Further services and competencies of a big data engineer

The responsibilities of a data engineer are by no means limited to the functions described above. Further competencies include:

Data ingestion, which includes tasks such as scraping databases, loading logs, fetching data from external stores or APIs.
Anomaly detection, which consists of automating data consumption to generate alerts tied to anomalous events.
Experimentation, such as A/B testing and multivariate testing, which forms a critical piece of a company’s analytics strategy and in which data engineering plays a significant role.
Instrumentation, including capturing events and related attributes, thus ensuring the capture of quality data.
Automation, as data engineers should constantly look to automate their workloads and build abstraction. While the potential for automation differs depending on the environment, the need for automation is common across the board.
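
As one example from the list above, anomaly detection can be approximated with a very simple statistical rule. The sketch below flags new values that lie more than a chosen number of standard deviations from the historical mean; the metric (daily row counts), the sample numbers and the threshold are all assumptions, and production systems typically use more robust methods.

```python
from statistics import mean, stdev


def find_anomalies(history, new_values, threshold=3.0):
    """Return the new values whose z-score against `history` exceeds `threshold`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly constant history: anything different is anomalous.
        return [value for value in new_values if value != mu]
    return [value for value in new_values if abs(value - mu) / sigma > threshold]


# Hypothetical daily row counts loaded by a pipeline over the past week.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

# Values that would trigger an alert among today's candidate observations.
alerts = find_anomalies(daily_row_counts, [1003, 120, 998])
```

In practice, the returned values would feed an alerting channel rather than a list, but the principle of automating the check against incoming data is the same.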

Essential skills and knowledge of a big data engineer

As a data engineer has several far-reaching responsibilities, they need to have a broad-ranging, comprehensive skill set. Moreover, considering the pace of development in the field, a good data engineer needs to demonstrate they have the adaptability and aptitude to move with the times. As it stands, data engineers should demonstrate the following competencies:

An in-depth knowledge of RDBMS. Various RDBMSs are used in the industry, such as Oracle DB, MySQL, Microsoft SQL Server, SQLite and IBM DB2. Data engineers must know at least one such database well.
Knowledge of a structured query language is also a must, as it is essential to structuring, manipulating and managing data in the RDBMS. As data engineers work closely with relational databases, they need to have a strong command of SQL.
They must also have extensive knowledge of ETL and data warehousing. Data warehousing is crucial to business intelligence and very important when it comes to managing a huge amount of data from heterogeneous sources. Here, the data engineer will need to know how to apply ETL.
As the requirements of organisations grew beyond structured data, NoSQL databases were introduced to provide a solution. Thus, a good data engineer should be familiar with NoSQL databases. These databases store large volumes of structured, semi-structured and unstructured data, and their flexible structures can be altered quickly as application requirements change.
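
The SQL and ETL skills described above can be illustrated with a minimal extract, transform and load sketch. It reads a small CSV string, cleans it and loads it into an in-memory SQLite table; the sample data, the cleaning rules and the `orders` table are hypothetical and stand in for real sources and a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent casing, stray spaces and a bad value.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,not_a_number
3,alice,4.50
"""


def extract(text):
    """Extract: parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise customer names and drop rows with unparseable amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Once loaded, the data can be queried with plain SQL.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```

The row with the invalid amount is dropped during the transform step, so only the two valid orders reach the table.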

Commonly used NoSQL databases include:

HBase: a column-oriented NoSQL database, which is good for scalable and distributed big data stores. It is also well suited to applications that need optimized reads and range-based scans, and it provides consistency and partition tolerance.
Cassandra: a highly scalable database with incremental scalability. Its main advantages are minimal administration and no single point of failure. It is good for applications with fast and random reads and writes.
MongoDB: a document-oriented NoSQL database that is schema-free, which means that your schema can evolve as your application grows. It offers full index support for high performance, and replication for fault tolerance. It has a master-slave type of architecture and provides CP (consistency and partition tolerance, out of the three CAP properties). MongoDB is frequently used for web apps and semi-structured data handling.
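
The idea of a schema-free document store can be pictured with plain Python data structures. The sketch below is only an analogy: it uses a list of dictionaries as a stand-in for a collection and does not use MongoDB or its driver API; the field names are hypothetical.

```python
import json

# A list of JSON-like documents standing in for a document collection.
collection = []


def insert(document):
    """Store a copy of the document; the JSON round trip mimics serialised storage."""
    collection.append(json.loads(json.dumps(document)))


def find(field, value):
    """Return every document whose top-level `field` equals `value`."""
    return [doc for doc in collection if doc.get(field) == value]


def cities():
    """Read a nested field, tolerating documents that were stored without it."""
    return [doc.get("address", {}).get("city") for doc in collection]


# The first document has only a name.
insert({"_id": 1, "name": "Alice"})

# A later document adds new fields without any schema migration.
insert({"_id": 2, "name": "Bob", "tags": ["vip"], "address": {"city": "Paris"}})
```

Older documents simply lack the newer fields, so the reading code, rather than the database schema, has to cope with their absence.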

With the rise of big data in the early 21st century, a new data framework was born: Hadoop. Not only does it store big data in a distributed manner, but it also processes the data in parallel. There are several tools in the Hadoop ecosystem that cater to different purposes and to professionals with different backgrounds. For a big data engineer, mastering these Hadoop tools is a definite must. Data engineers should have a knowledge of the following Hadoop tools:
- HDFS (Hadoop Distributed File System), which stores data in a distributed cluster.
- YARN, which performs resource management by allocating resources to different applications and scheduling jobs. YARN was introduced in Hadoop 2.x, making the framework highly flexible, efficient and scalable.
- MapReduce, which is a parallel processing paradigm that allows data to be processed in parallel on top of HDFS.
- Hive, which is a data warehousing tool used on top of HDFS. This tool caters to professionals from SQL backgrounds, allowing them to perform analytics on top of HDFS.
- Apache Pig, which is a high-level scripting language used for data transformation in tandem with Hadoop.
- Flume, which is a tool used to import unstructured data to HDFS.
- Sqoop, which imports and exports structured data between an RDBMS and HDFS.
- ZooKeeper, which acts as a coordinator amongst the distributed services running in the Hadoop environment. This tool assists with configuration management and synchronising services.
- Oozie, which is a scheduler that binds multiple logical jobs together and helps accomplish a complete task.

Data engineers should also be familiar with real-time processing frameworks such as Apache Spark. These distributed frameworks are used across industries to detect anomalies and make recommendations, and they integrate easily with Hadoop by leveraging HDFS.
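
The MapReduce paradigm listed among the Hadoop tools above can be sketched in a few lines of single-process Python using the classic word-count example. This is not Hadoop code: the function names are illustrative, and a real cluster would run the map and reduce steps in parallel across machines with the shuffle handled by the framework.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big pipelines", "data data"])
```

Because each line is mapped independently and each key is reduced independently, both phases can be distributed across many workers, which is what makes the paradigm suitable for data stored in HDFS.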

Today, data are crucial to meeting business needs. As every industry becomes increasingly technologically-driven, data are critical to delivering quality service and meeting client expectations. Moreover, the quantity and nature of these data are constantly expanding and transforming. Thus, it is essential that companies have the appropriate expertise and infrastructure to capture, analyse and store this information. With a data pipeline constructed and managed by experienced data engineers, companies can begin to leverage data that puts them one step ahead of the competition.

Mourad Touzani

A Data Science and AI Consultant with over five years of experience in retail, consumer packaged goods, finance, and government. I leverage Big Data and Machine Learning to optimize processes, maximize efficiency, improve customer experience and increase profitability.