📚

Database, datasets and spreadsheets

Introduction

1.1 Conceptos básicos de bases de datos
1.2 Elementos y roles que intervienen en los sistemas de bases de datos
1.3 Tipos y formatos de datos
1.4 Retos actuales en el manejo de bases de datos
1.5 Sistemas manejadores de bases de datos

Why do you need a database?

Memory vs Storage

How to think about databases. (2016, February 09). Retrieved from https://nolanlawson.com/2016/02/08/how-to-think-about-databases

Information, data, and database

Concepts

NameTermsExamples
Storage
Noise
DataAssumed factsAtomic datumCharactersQuantitiesSymbols
Databasecoherent dataorganized datarecorded datastorage medium
Datasetunstructured information
InformationAggregated datasetsHuman-readableReportsSynthesized databasecontextual database
KnowledgePredict future from information


Database

💡
Database means organized and atomic collections of the data from the types, structures, and constraints of his storage medium (which usually is electronic), his domain, and his management system.

Archaic. Data bank is an obsolete term for databases.

Types

Data model

  1. Structure.
  1. Operations.
  1. Constraints.

Types of databases

2.1 Modelo jerárquico
2.2 Modelo de redes
2.3 Modelo relacional
2.4 Modelo orientado a objetos
2.5 Modelos no convencionales (multidimensionales, deductivos, transaccionales

https://twitter.com/bytebytego/status/1676828359475744769/photo/1
https://twitter.com/AmigosCode/status/1682758269004050432/photo/1

Online transation processing (OLTP) and online transaction processing (OLAP).

Story

Key questions

How to choose a database for your project?

Are spreadsheets -Excel-, Access, Notion,... a database?

Yes, they are. But they won't either scale very well or fit with your needs. Remember they are not real relational databases.

How can I visualize the data?

Why we should create an electronic database?

This is an organizational decision.

Is digital more secure than paper?

Yes, it is.

How to keep documents, images, and other blobs?

You shouldn't use your database to keep blobs; Use a file system or object storage.

The data is an organizational asset since at least Egyptian times.

🔥
Financial industry

Y2K Problem

Database management system

A database management system (DBMS) is a software system for defining, constructing, and sharing databases among various clients. It abstracts storage medium.

Good database model to deliver good information

Quickly, reliably, and consistently. It also won't be useful if it is stolen, lost, or corrupted when the system crashes.

User Needs and Requirements

This is no book about software engineering, so we’re not able to talk about how to extract business rules -requirements-. Here we’re assuming them, that is, you’re going to learn about database paradigms, their drawbacks, their advantages, and building a Data Model such that it fits with requirements.

Dilbert y las Hojas de Cálculo - Una divertida visión del "Excel Hell". (2023, March 20). Retrieved from https://outsourceando.blogspot.com/2012/08/dilbert-y-las-hojas-de-calculo-una.html?mc_cid=385764652e&mc_eid=541175b307

Relational databases and SQL

If you don't have a special constraint or requirement, you should use a relational database.

Data-interchange format

XML

JSON

http://www.json.org/xml.html

YAML

Binary special formats

Spreadsheets

Storage

Flat Files

JSON

INI Files

XML (XML)

YAML

Facts with prolog (facts = data)

https://stackoverflow.com/questions/2117651/comparing-sql-and-prolog

Worked examples

package.json

timeline

Of Prolog, T. P. (2022, April 03). A Tour of Prolog. Youtube. Retrieved from https://www.youtube.com/watch?v=8XUutFBbUrg&ab_channel=ThePowerofProlog

Spreadsheets

https://datascience.stackexchange.com/questions/5443/do-data-scientists-use-excel

Hierarchical Databases

Geographical Information System Database

Non-relational databases

1.1 Evolució n de las bases de datos
1.1.1 Surgimiento de las computadoras
1.1.2 Bases de datos relacionales
1.1.3 Bases de datos no-relacionales
1.2 Almacenamiento de datos no-textuales
1.3 Nuevos requerimientos de accesibilidad
1.3.1 Acceso global
1.3.2 Disponibilidad continua
1.3.3 Datos protegidos

2.1 Explorando Medios sociales
2.1.1 Tó pico tendencia (Twitter)
2.1.2 Conexiones sociales (Facebook)
2.1.3 Perfiles laborales (LinkedIn)
2.2 Documentos de texto
2.3 Datos Web
2.3.1 Correos electró nicos
2.3.2 Redes de intereses (GitHub)
2.3.3 Páginas Web
2.3.4 Microformatos
2.4 Otros tipos de datos no-estructurados

3.1 Evolució n de las bases de datos
3.1.1 Surgimiento de las computadoras
3.2 Conceptos NoSQL
3.2.1 Teorema CAP
3.2.2 Consistencia
3.2.3 Concurrencia
3.2.4 MapReduce
3.3 Tecnología NoSQL
3.3.1 Manejadores clave-valor
3.3.1.1 Redis, Memcached, Oracle Coherence.
3.3.2 Wide-column
3.3.2.1 Apache Hadoop, HBase, Apache Kudu, Apache Druid.
3.3.3 Manejadores de documentos
3.3.3.1 MongoDB, CouchDB.
3.3.4 Manejadores de grafos
3.3.4.1 InterSystems Caché, db4o.
3.3.5 Manejadores de XML
3.3.5.1 BaseX, eXist.
3.3.6 Manejadores de multimedia
3.3.6.1 Apache Cassandra

4.1 Introducció n a lagos de datos (LD)
4.1.1 Entendiendo que son los LD
4.1.2 Antes y después de los LD
4.1.3 Retos para implementar LD
4.1.4 Cuando implementar LD
4.2 Arquitectura de LD
4.2.1 Componentes de un arquitectura de LD
4.2.2 Capas de la arquitectura (gobernanza y seguridad, informació n y metadatos)
4.2.3 Niveles de la arquitectura (entrada, administració n y consumo de datos)
4.3 Gobernanza de datos
4.3.1 Entendiendo el concepto de gobernanza de datos (GD)
4.3.2 Macro y micro GD
4.3.3 Clasificació n y organizació n de datos
4.3.4 Catálogo de datos y metadatos
4.3.5 Administració n de calidad de los datos
4.3.6 Administració n de acceso a los datos
4.3.7 Auditoría de datos
4.3.8 Protecció n de datos

Carpenter, J. & Hewitt, E. (2020). Cassandra The Definitive Guide: Distributed Data at Web Scale. O’Reilly.

Eryurek, E., Gilad, U., Lakshmanan, V., Kibunguchy, A. & Ashdown, J. (2021). Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. O'Reilly Media

Hills, T. (2016). NoSQL and SQL Data Modeling: Bringing Together Data, Semantics, and Software. Technics Publications. [Clasica].

Pasupuleti, P. & Purra, B. S. (2015) Data Lake Development with Big Data. Packt Publishing [Clásica].

Harrison, G. (2015). Next Generation Databases: NoSQLand Big Data. Apress.

Russell, M. (2019). Mining the Social Web (3rd ed.). O’Reilly Media, Inc. Singh, A. & Ahmad, S. (2021). Data Modeling with NoSQL Database. Independently Published.

Key-value store, Redis case

JSON-like document store, MongoDB case

Wide-column store, Cassandra case

Graph database, Neo4j case

https://neo4j.com/

SPARQL

Object database

Tabular

Tuple store

Triple/quad store (RDF) database

Multivalue database

Multimodel database

Vector embeddings database

Time series

A time serie is the data generated by the observations indexed by time. A more formal defintion can be found on

As you might imagine, many processes are indexed by time. For instance, examples include stock prices, CPU usage, sensor data, light curves, and ECG readings.

In this book, our main focus is on how to store time series data reliably and quickly, while minimizing the caveats for real-world applications.

We have various alternatives for managing this kind of data, including Prometheus, Graphite, InfluxDB, TimeScaleDB, and OpenTSDB. While it's possible to use different database systems, they may not be optimized for real-time processing or analytics. Although everyone has, at some point, added a date column to indicate when an operation occurs in the database, with typical examples being 'at_update' and 'at_create', this basic approach doesn't scale well and isn't suitable for time series processing applications.

As demonstrated by Timescale and others, a reliable method for persisting time series data is using PostgreSQL and its extension, TimescaleDB. Therefore, we’re going to use this as our example to illustrate a concrete case.

https://www.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/

https://www.timescale.com/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/

https://www.youtube.com/channel/UCPmHSkid9IOYbdN1Psh24lg

Big data

Hadoop

Spark

Mainframe

Supercomputers

Data Warehouse

Big Query

ETL

OLAP

Schemas

Data mart

dataWarehouse.com | Article. (2000, July 18). Retrieved from https://web.archive.org/web/20110420134556/http://csis.bits-pilani.ac.in/faculty/goel/Data Warehousing/Articles/Data Marts/dataWarehouse_com Article_DM VS DW.htm

Snowflake schema

Star schema

Activity schema

References

Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)

1.1. La economía de los datos
1.2. Industria 4.0 y las tecnologías digitales
1.3. Sectores pioneros en datos masivos
1.4. Evolución de los sistemas de gestión de datos

2.1. Fundamentos a datos masivos
2.1.1. Verdades y mitos
2.1.2. Dimensionalidad y características:
2.1.2.1 Sistemas distribuidos
2.2. Tipos de datos
2.2.1 Datos estructurados
2.2.2 Bases de datos NoSQL
2.3. Casos de estudio (industria, empresa, educación, etc.)
2.4. Estado del arte en datos masivos

3.1. Arquitectura de datos masivos
3.1.1. Infraestructura para almacenamiento de datos
3.1.2. Infraestructura para gestión
3.1.3. Aplicaciones para modelado y análisis de datos
3.2. Componentes tecnológicos
3.2.1. Acopio de datos estructurados y no estructurados (Hive, Drill, Impala, HBase, entre otros).
3.2.2. Sistemas de archivo y bases de datos distribuidas
3.2.2.1 Gestión de datos distribuidos (HDFS)
3.2.3. Plataformas de cómputo en la nube
3.2.4. Sistemas de almacenamiento escalable
3.2.5. Procesamiento masivo paralelo de datos
3.2.5.1 MapReduce
3.3. Gobernabilidad de los datos masivos
3.4 Roles: negocios, científico de datos, IT, analista, etc.

4.1. Modelo de datos
4.1.1 Key-value, documentos, wide-column, gráficos
4.2. Herramientas de extracción, transformación y carga (ETL)
4.2.1. Elementos del proceso de ETL
4.2.2. Marco de trabajo para la implementación del proceso ETL
4.2.2.1 Arquitectura
4.2.2.2 Implementación
4.2.2.3 Integración
4.2.2.4 API’s para la manipulación de datos integrados
4.2.2.4.1 Pig, Spark, Tableau

5.1. Tipos de análisis: descriptivo, diagnóstico, predictivo y prescriptivo.
5.2. Proceso de descubrimiento de conocimiento en base de datos (KDD)
5.3. Tecnologías emergentes de analítica
5.3.1 Amazon AWS, Microsoft Azure, Google BigQuery, entre otros.
5.4. Principales métodos de minado de datos
5.5. Bibliotecas de consulta para el manejo de datos masivos
5.6. Bibliotecas de minería de datos y aprendizaje automático
5.7. Visualización
5.8. Interpretación de resultados

References

  1. Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. John Wiley & Sons. [Clásica]. Available at: Wiley Online Library.
  1. Hurwitz, J., Nugent, A., Halper, F., & Kaufman, M. (2013). Big Data for Dummies. John Wiley & Sons Inc. [Clásica].
  1. Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. Available at: O'Reilly Media.
  1. Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. [Clásica].
  1. Marz, N., & Warren, J. (2015). Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Manning Publications Co. [Clásica].
  1. O'Reilly Radar Team. (2014). Big Data Now: Current Perspectives from O'Reilly Radar. O'Reilly Media, Incorporated. [Clásica].
  1. Wessler, M. (2016). Big Data Analytics for Dummies. John Wiley & Sons. [Clásica].
  1. White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media, Incorporated. [Clásica].
  1. Atzori, L., Iera, A., & Morabito, G. (2010). The Internet of Things: A Survey. Computer Networks, 54(15), 2787-2805.
  1. Li, S., Da Xu, L., & Zhao, S. (2015). The Internet of Things: A Survey. Information Systems Frontiers, 17(2), 243-259.
  1. Marwedel, P. (2021). Embedded System Design: Embedded Systems Foundations of Cyber-Physical Systems, and the Internet of Things. Springer Nature. Available at: UABC Library.

Visualization and user-friendly reports

PowerBI

Grafana

Information Retrieval

Natural language search

Elastic search

Database as product

Supabase

Next steps

Data engineering

https://supabase.com/

https://www.cockroachlabs.com/

https://www.youtube.com/watch?v=wUaGyUpyE2E&t=187s&ab_channel=PeladoNerd

References

Database Internals: A Deep Dive Into How Distributed Data Systems Work https://www.databass.dev/

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. by Eric C Redmond, Jim L. Wilson, and Luc Perkins

Ullman, J. and Widom, J. 2014. A first Course in Database Systems: Pearson New International Edition. Pearson Education Limited.


Sumathi, S. and Esakkirajan, S. 2010. Fundamentals of relational database management systems. Springer.


Stephens, R. 2009. Beginning database design solutions. Wiley Pub.


Garcia-Molina, H., Ullman, J. and Widom, J. 2009. Database systems. Prentice Hall.


Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. 3rd ed. McGraw-Hill, 2002. ISBN: 9780072465631.


Database System Concepts by Henry Korth

Hellerstein, Joseph, and Michael Stonebraker. Readings in Database Systems (The Red Book). 4th ed. MIT Press, 2005. ISBN: 9780262693141. https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-830-database-systems-fall-2010/readings/ http://www.redbook.io/index.html


2021. Database Systems: The Complete Book -complement-.
Stanford.edu. Retrieved June 29, 2021 from http://infolab.stanford.edu/~ullman/dscb.html#slides

2021. CS 186 - Introduction to Database Systems.
Berkeley.edu. Retrieved June 29, 2021 from https://inst.eecs.berkeley.edu/~cs186/sp08/index.html