Data engineering
Preface
Prerequisites
Learning ethics
Introduction
What is Data engineering?
Data engineering involves developing processes that transform raw data into high-quality information. These processes, or jobs, encompass a lifecycle that begins with data generation and includes steps such as ingestion, transformation, and serving. This lifecycle concludes with downstream use cases such as analytics, machine learning, and reverse ETL.
Ingestion, transformation, and serving are typically carried out using ETL (extract, transform, load) or ELT (extract, load, transform) methodologies. Both involve the same operations, but the sequence differs to meet specific objectives: ETL transforms data before loading it into a curated target such as a data mart, while ELT loads raw data into a single repository first and defers transformation to end users.
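To make the ordering concrete, here is a minimal sketch contrasting the two approaches; the extract, transform, and load functions are hypothetical placeholders, not a real pipeline.

```python
# Hypothetical placeholders illustrating ETL vs. ELT ordering.

def extract():
    """Pull raw records from a source system (placeholder data)."""
    return [{"id": 1, "amount": "19.99"}, {"id": 2, "amount": "5.00"}]

def transform(records):
    """Clean and type-cast records."""
    return [{**r, "amount": float(r["amount"])} for r in records]

def load(records, target):
    """Write records into a target repository (here, a plain dict)."""
    target.setdefault("records", []).extend(records)

# ETL: transform first, then load curated data into the target (e.g., a data mart).
mart = {}
load(transform(extract()), mart)

# ELT: load raw data into a single repository first; transform later, on demand.
lake = {}
load(extract(), lake)
curated = transform(lake["records"])
```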
In the field of data engineering, processes are referred to as jobs, and the systems that manage these jobs are known as data orchestrators. A job is conceptualized as a directed acyclic graph (DAG), where nodes represent tasks and edges indicate dependencies between those tasks. Each job begins with an upstream data source that triggers the ingestion process, and each data source produces data at its own volume and cadence. Sensors act as the initial nodes, triggering downstream tasks based on an incoming stream, which may be a large batch scheduled with cron, a single event, a sequence of events, or a real-time signal.
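The DAG model can be expressed directly in code. The sketch below uses Python's standard-library graphlib to order hypothetical tasks; real orchestrators such as Airflow or Dagster offer the same abstraction with scheduling, retries, and sensors built in.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A job as a DAG: each key is a task, each value is the set of tasks it
# depends on. Task names are hypothetical.
job = {
    "sensor":    set(),            # initial node: waits for upstream data
    "ingest":    {"sensor"},
    "transform": {"ingest"},
    "serve":     {"transform"},
}

# An orchestrator executes tasks in an order that respects every dependency edge.
for task in TopologicalSorter(job).static_order():
    print(f"running task: {task}")
# -> sensor, ingest, transform, serve
```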
Each data source has its own connection protocol, necessitating various data extraction techniques, such as web scraping, API calls, and database queries, to establish connectivity and wrangle the data.
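As an example of one such technique, the sketch below extracts records from a hypothetical paginated JSON API using the requests library; the base URL, endpoint, and page parameter are assumptions for illustration.

```python
import requests  # third-party HTTP client commonly used for API extraction

def extract_from_api(base_url: str, page: int = 1) -> list:
    """Fetch one page of records from a hypothetical paginated JSON API."""
    response = requests.get(f"{base_url}/records", params={"page": page}, timeout=30)
    response.raise_for_status()  # fail fast on HTTP errors instead of parsing bad data
    return response.json()

records = extract_from_api("https://api.example.com/v1")
```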
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Design decisions about batch versus stream ingestion
Why does Data engineering matter to you?
Research
Ecosystem
Standards, jobs, industry, roles, …
Story
FAQ
Worked examples
Data Architecture
What is Data architecture?
Principles
Concepts
Patterns
Design decisions
Exercises
Projects
Summary
FAQ
Reference Notes
Data Lifecycle
Data generation
Files
APIs
Application Databases (OLTP Systems)
Message queues and Event-streaming platforms
Storage
Memory hierarchy
CPU
Magnetic Disk Drive
Random Access Memory
Solid-State Drive
Serialization
A common problem in data engineering, especially when working with external parties, is identifying file types. An easy but unreliable method is to rely on the file extension; extensions can be wrong or missing, which leads to failures or silently incorrect results during parsing. A more robust method is to inspect the file's binary signature and content to determine its type.
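A minimal sketch of signature-based detection is shown below; the table covers only a few formats, and production systems typically rely on a fuller detector such as libmagic.

```python
# Identify a file by its leading magic bytes instead of trusting the extension.
MAGIC_SIGNATURES = {
    b"PAR1": "parquet",        # Parquet files begin (and end) with PAR1
    b"Obj\x01": "avro",        # Avro object container files
    b"PK\x03\x04": "zip",      # also covers xlsx/docx, which are zip containers
    b"\x1f\x8b": "gzip",
    b"%PDF": "pdf",
}

def sniff_file_type(path: str) -> str:
    with open(path, "rb") as f:
        header = f.read(8)     # long enough for every signature in the table
    for magic, kind in MAGIC_SIGNATURES.items():
        if header.startswith(magic):
            return kind
    return "unknown"           # fall back to content inspection or the extension
```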
Plain Text
Text-to-binary encoding
Binary encoding
CSV. No.
XML
JSON and JSONL
CUE and Jsonnet
Avro
uuencode or Base64
Protobuf
Office files
Parquet
ORC
Hudi
Iceberg
Many languages offer their own way to serialize native language constructs, e.g., Pickle for Python or Boost.Serialization for C++.
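For instance, a minimal Pickle round trip looks like the sketch below; note that unpickling untrusted bytes can execute arbitrary code, so it is unsuited to exchanging data with external parties.

```python
import pickle

record = {"job_id": 42, "tasks": ["ingest", "transform", "serve"]}

blob = pickle.dumps(record)    # serialize the Python object to bytes
restored = pickle.loads(blob)  # deserialize back into an equivalent object
assert restored == record      # Pickle is Python-specific, unlike Avro or Parquet
```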
Java
https://isocpp.org/wiki/faq/serialization
Networking
Caching
Object Storage
File storage
Block storage
Separation of computation, storage, code and environment
Ingestion
Queries, Modeling, and Transformation
Spark
Hadoop
Serving Data for Analytics, Machine Learning, and Reverse ETL
Data orchestration
DataOps
Data versioning
Tracking and observability
ML Serving
Summary
FAQ
Reference Notes
Security and Privacy
Summary
FAQ
Reference Notes
Next steps
The future of Data engineering
References
Reis, J., & Housley, M. (2022). Fundamentals of Data Engineering. O'Reilly Media.