Data engineering

Preface

Prerequisites

Learning ethics

Introduction

What is Data engineering?

Data engineering involves developing processes that transform raw data into high-quality information. These processes, or jobs, encompass a lifecycle that begins with data generation and includes steps such as ingestion, transformation, and serving. This lifecycle concludes with downstream use cases such as analytics, machine learning, and reverse ETL.

Ingestion, transformation, and serving are typically conducted using ETL or ELT methodologies, short for extract-transform-load and extract-load-transform. The two involve the same operations but sequence them differently to meet different objectives: ETL transforms data before loading it into a curated target such as a data mart, while ELT loads raw data into a single repository first and leaves transformation to later processing by end users.
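
The difference is easiest to see in code. Below is a minimal sketch in Python; extract, transform, and load are hypothetical placeholders standing in for real connectors and warehouses, not a library API.

    def extract(source: str) -> list[dict]:
        # Pull raw records from an upstream source (API, database, files).
        return [{"user_id": 1, "amount_cents": 1999}]

    def transform(records: list[dict]) -> list[dict]:
        # Clean and reshape records into the target schema.
        return [{"user_id": r["user_id"], "amount": r["amount_cents"] / 100}
                for r in records]

    def load(records: list[dict], target: str) -> None:
        # Write records to a destination table (stubbed as a print here).
        print(f"loaded {len(records)} records into {target}")

    # ETL: transform before loading; the data mart only sees curated data.
    raw = extract("orders_api")
    load(transform(raw), target="mart.orders")

    # ELT: load raw data first; transformation is deferred to end users,
    # typically as SQL run later inside the repository itself.
    raw = extract("orders_api")
    load(raw, target="lake.raw_orders")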

In the field of data engineering, processes are referred to as jobs, and the systems that manage these jobs are known as data orchestrators. A job is conceptualized as a directed acyclic graph (DAG), where nodes represent tasks and edges indicate dependencies between those tasks. Each job begins with an upstream data source that triggers the ingestion process, and each data source produces data with its own volume and cadence. Sensors act as the initial nodes, triggering downstream tasks when data arrives; the data may arrive as a large batch on a cron schedule, as a single event, as a sequence of events, or as a real-time signal.
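
To make the DAG model concrete, here is a dependency-free sketch using only the Python standard library; real orchestrators such as Airflow or Dagster supply this machinery, and the task names below are illustrative.

    from graphlib import TopologicalSorter  # stdlib, Python 3.9+

    # A job as a directed acyclic graph: each task maps to the set of
    # tasks it depends on. The sensor is the initial node; ingestion,
    # transformation, and serving run downstream of it.
    dag = {
        "sensor": set(),            # fires when upstream data arrives
        "ingest": {"sensor"},
        "transform": {"ingest"},
        "serve": {"transform"},
    }

    def run(task: str) -> None:
        print(f"running {task}")

    # The orchestrator's core loop: execute tasks in dependency order.
    for task in TopologicalSorter(dag).static_order():
        run(task)
    # -> sensor, ingest, transform, serve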

Each data source has its own connection protocol, so engineers rely on a variety of extraction techniques, such as web scraping, API calls, and database queries, to establish connectivity and wrangle the data.
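
As one example of API-based extraction, the sketch below pulls paginated JSON records over HTTP with the requests library. The endpoint and its pagination parameters are hypothetical; real APIs document their own conventions.

    import requests

    def extract_records(base_url: str, page_size: int = 100) -> list[dict]:
        """Pull all records from a hypothetical paginated JSON API."""
        records, page = [], 1
        while True:
            resp = requests.get(
                base_url,
                params={"page": page, "per_page": page_size},
                timeout=30,
            )
            resp.raise_for_status()      # fail loudly on HTTP errors
            batch = resp.json()
            if not batch:                # an empty page marks the end
                break
            records.extend(batch)
            page += 1
        return records

    # Usage (illustrative endpoint):
    # rows = extract_records("https://api.example.com/v1/orders")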

Undercurrents

Security

Data Management

DataOps

Data Architecture

Orchestration

Software engineering

Design decisions about batch versus stream ingestion

Why does Data engineering matter to you?

Research

Ecosystem

Standards, jobs, industry, roles, …

Story

FAQ

Worked examples

Data Architecture

What is Data architecture?

Principles

Concepts

Patterns

Design decisions

Exercises

Projects

Summary

FAQ

Reference Notes

Data Lifecycle

Data generation

Files

APIs

Application Databases (OLTP Systems)

Message queues and Event-streaming platforms

Storage

Memory hierarchy

CPU

Magnetic Disk Drive

Random Access Memory

Solid-State

Serialization

A common problem in data engineering, especially when working with external parties, is identifying file types. An easy but unreliable method is to trust the file extension; extensions can be wrong or missing, which leads to misleading results during parsing. A better method is to inspect the file's binary signature (its magic bytes) and contents to determine its type.
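
A minimal sketch of signature-based detection: read the first bytes of the file and compare them against known magic numbers. The signatures below are well-documented constants; the function name is our own.

    # Detect file type from its binary signature (magic bytes) rather
    # than trusting the extension.
    MAGIC_BYTES = {
        b"PAR1": "parquet",       # Parquet files begin and end with PAR1
        b"\x1f\x8b": "gzip",
        b"PK\x03\x04": "zip",     # also xlsx/docx, which are ZIP containers
        b"\x89PNG": "png",
    }

    def sniff_file_type(path: str) -> str:
        with open(path, "rb") as f:
            header = f.read(8)
        for signature, kind in MAGIC_BYTES.items():
            if header.startswith(signature):
                return kind
        return "unknown"

    # A file named data.csv that sniffs as "zip" is probably a mislabeled
    # spreadsheet export, and a CSV parser would fail on it.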

Plain Text

Text-to-binary encoding

Binary encoding

CSV. No.

XML

JSON and JSONL

CUE and Jsonnet

Avro

uuencode or Base64

Protobuf

Office files

Parquet

ORC

Hudi

Iceberg

Most languages offer their own way to serialize native constructs, e.g., Pickle for Python or Boost.Serialization for C++.
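
As a quick illustration, the snippet below round-trips a Python object through the standard library's pickle module. Pickle is Python-specific and unsafe on untrusted input, which is one reason cross-language formats such as Avro or Protobuf dominate data interchange.

    import pickle

    # Serialize an arbitrary Python object to bytes and back.
    job_config = {"name": "daily_orders", "retries": 3, "tags": ("etl",)}

    blob = pickle.dumps(job_config)    # Python-specific binary format
    restored = pickle.loads(blob)      # never unpickle untrusted data
    assert restored == job_config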

Java

https://isocpp.org/wiki/faq/serialization

Networking

Caching

Object Storage

File storage

Block storage

Separation of computation, storage, code, and environment

Ingestion

Queries, Modeling, and Transformation

Spark

Hadoop

Serving Data for Analytics, Machine Learning and Reverse ETL

Data orchestration

DataOps

Data versioning

Tracking and observability

ML Serving

Summary

FAQ

Reference Notes

Security, Privacy

Summary

FAQ

Reference Notes

Next steps

The future of Data engineering

References

Reis, J., & Housley, M. (2022). Fundamentals of Data Engineering. O'Reilly Media.

TODO