Database, datasets and spreadsheets
Introduction
1.1 Conceptos básicos de bases de datos
1.2 Elementos y roles que intervienen en los sistemas de bases de datos
1.3 Tipos y formatos de datos
1.4 Retos actuales en el manejo de bases de datos
1.5 Sistemas manejadores de bases de datos
Why do you need a database?
Memory vs Storage
How to think about databases. (2016, February 09). Retrieved from https://nolanlawson.com/2016/02/08/how-to-think-about-databases
Information, data, and database
Concepts
Name | Terms | Examples |
---|---|---|
Storage | ||
Noise | ||
Data | Assumed factsAtomic datumCharactersQuantitiesSymbols | |
Database | coherent dataorganized datarecorded datastorage medium | |
Dataset | unstructured information | |
Information | Aggregated datasetsHuman-readableReportsSynthesized databasecontextual database | |
Knowledge | Predict future from information |
Database
Archaic. Data bank is an obsolete term for databases.
Types
Data model
- Structure.
- Operations.
- Constraints.
Types of databases
2.1 Modelo jerárquico
2.2 Modelo de redes
2.3 Modelo relacional
2.4 Modelo orientado a objetos
2.5 Modelos no convencionales (multidimensionales, deductivos, transaccionales
Online transation processing (OLTP) and online transaction processing (OLAP).
Story
Key questions
How to choose a database for your project?
Are spreadsheets -Excel-, Access, Notion,... a database?
Yes, they are. But they won't either scale very well or fit with your needs. Remember they are not real relational databases.
How can I visualize the data?
Why we should create an electronic database?
This is an organizational decision.
Is digital more secure than paper?
Yes, it is.
How to keep documents, images, and other blobs?
You shouldn't use your database to keep blobs; Use a file system or object storage.
The data is an organizational asset since at least Egyptian times.
Y2K Problem
Database management system
A database management system (DBMS) is a software system for defining, constructing, and sharing databases among various clients. It abstracts storage medium.
- Defining involves specifying data types and structures, and constraints in the database by a data definition language. Descriptive information in the form of a database catalog or dictionary, which is called meta-data.
- Construing is the process the storing the data by DML – Data Manipulation Language.
Good database model to deliver good information
Quickly, reliably, and consistently. It also won't be useful if it is stolen, lost, or corrupted when the system crashes.
- CRUD feature.
- Retrieval
- Controlling redundancy
- Persistent.
- Persistence and Backups.
- Atomicity
- Consistency
- Durability
- Isolation
- Low Cost and Extensibility
- Sharing
User Needs and Requirements
This is no book about software engineering, so we’re not able to talk about how to extract business rules -requirements-. Here we’re assuming them, that is, you’re going to learn about database paradigms, their drawbacks, their advantages, and building a Data Model such that it fits with requirements.
Dilbert y las Hojas de Cálculo - Una divertida visión del "Excel Hell". (2023, March 20). Retrieved from https://outsourceando.blogspot.com/2012/08/dilbert-y-las-hojas-de-calculo-una.html?mc_cid=385764652e&mc_eid=541175b307
Relational databases and SQL
If you don't have a special constraint or requirement, you should use a relational database.
Data-interchange format
XML
JSON
YAML
Binary special formats
Spreadsheets
Storage
Flat Files
JSON
INI Files
XML (XML)
YAML
Facts with prolog (facts = data)
https://stackoverflow.com/questions/2117651/comparing-sql-and-prolog
Worked examples
package.json
Of Prolog, T. P. (2022, April 03). A Tour of Prolog. Youtube. Retrieved from https://www.youtube.com/watch?v=8XUutFBbUrg&ab_channel=ThePowerofProlog
Spreadsheets
https://datascience.stackexchange.com/questions/5443/do-data-scientists-use-excel
Hierarchical Databases
Geographical Information System Database
Non-relational databases
1.1 Evolució n de las bases de datos
1.1.1 Surgimiento de las computadoras
1.1.2 Bases de datos relacionales
1.1.3 Bases de datos no-relacionales
1.2 Almacenamiento de datos no-textuales
1.3 Nuevos requerimientos de accesibilidad
1.3.1 Acceso global
1.3.2 Disponibilidad continua
1.3.3 Datos protegidos
2.1 Explorando Medios sociales
2.1.1 Tó pico tendencia (Twitter)
2.1.2 Conexiones sociales (Facebook)
2.1.3 Perfiles laborales (LinkedIn)
2.2 Documentos de texto
2.3 Datos Web
2.3.1 Correos electró nicos
2.3.2 Redes de intereses (GitHub)
2.3.3 Páginas Web
2.3.4 Microformatos
2.4 Otros tipos de datos no-estructurados
3.1 Evolució n de las bases de datos
3.1.1 Surgimiento de las computadoras
3.2 Conceptos NoSQL
3.2.1 Teorema CAP
3.2.2 Consistencia
3.2.3 Concurrencia
3.2.4 MapReduce
3.3 Tecnología NoSQL
3.3.1 Manejadores clave-valor
3.3.1.1 Redis, Memcached, Oracle Coherence.
3.3.2 Wide-column
3.3.2.1 Apache Hadoop, HBase, Apache Kudu, Apache Druid.
3.3.3 Manejadores de documentos
3.3.3.1 MongoDB, CouchDB.
3.3.4 Manejadores de grafos
3.3.4.1 InterSystems Caché, db4o.
3.3.5 Manejadores de XML
3.3.5.1 BaseX, eXist.
3.3.6 Manejadores de multimedia
3.3.6.1 Apache Cassandra
4.1 Introducció n a lagos de datos (LD)
4.1.1 Entendiendo que son los LD
4.1.2 Antes y después de los LD
4.1.3 Retos para implementar LD
4.1.4 Cuando implementar LD
4.2 Arquitectura de LD
4.2.1 Componentes de un arquitectura de LD
4.2.2 Capas de la arquitectura (gobernanza y seguridad, informació n y metadatos)
4.2.3 Niveles de la arquitectura (entrada, administració n y consumo de datos)
4.3 Gobernanza de datos
4.3.1 Entendiendo el concepto de gobernanza de datos (GD)
4.3.2 Macro y micro GD
4.3.3 Clasificació n y organizació n de datos
4.3.4 Catálogo de datos y metadatos
4.3.5 Administració n de calidad de los datos
4.3.6 Administració n de acceso a los datos
4.3.7 Auditoría de datos
4.3.8 Protecció n de datos
Carpenter, J. & Hewitt, E. (2020). Cassandra The Definitive Guide: Distributed Data at Web Scale. O’Reilly.
Eryurek, E., Gilad, U., Lakshmanan, V., Kibunguchy, A. & Ashdown, J. (2021). Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. O'Reilly Media
Hills, T. (2016). NoSQL and SQL Data Modeling: Bringing Together Data, Semantics, and Software. Technics Publications. [Clasica].
Pasupuleti, P. & Purra, B. S. (2015) Data Lake Development with Big Data. Packt Publishing [Clásica].
Harrison, G. (2015). Next Generation Databases: NoSQLand Big Data. Apress.
Russell, M. (2019). Mining the Social Web (3rd ed.). O’Reilly Media, Inc. Singh, A. & Ahmad, S. (2021). Data Modeling with NoSQL Database. Independently Published.
Key-value store, Redis case
JSON-like document store, MongoDB case
Wide-column store, Cassandra case
Graph database, Neo4j case
SPARQL
Object database
Tabular
Tuple store
Triple/quad store (RDF) database
Multivalue database
Multimodel database
Vector embeddings database
Time series
A time serie is the data generated by the observations indexed by time. A more formal defintion can be found on
As you might imagine, many processes are indexed by time. For instance, examples include stock prices, CPU usage, sensor data, light curves, and ECG readings.
In this book, our main focus is on how to store time series data reliably and quickly, while minimizing the caveats for real-world applications.
We have various alternatives for managing this kind of data, including Prometheus, Graphite, InfluxDB, TimeScaleDB, and OpenTSDB. While it's possible to use different database systems, they may not be optimized for real-time processing or analytics. Although everyone has, at some point, added a date column to indicate when an operation occurs in the database, with typical examples being 'at_update' and 'at_create', this basic approach doesn't scale well and isn't suitable for time series processing applications.
As demonstrated by Timescale and others, a reliable method for persisting time series data is using PostgreSQL and its extension, TimescaleDB. Therefore, we’re going to use this as our example to illustrate a concrete case.
https://www.youtube.com/channel/UCPmHSkid9IOYbdN1Psh24lg
Big data
Hadoop
Spark
Mainframe
Supercomputers
Data Warehouse
Big Query
ETL
OLAP
Schemas
Data mart
dataWarehouse.com | Article. (2000, July 18). Retrieved from https://web.archive.org/web/20110420134556/http://csis.bits-pilani.ac.in/faculty/goel/Data Warehousing/Articles/Data Marts/dataWarehouse_com Article_DM VS DW.htm
Snowflake schema
Star schema
Activity schema
References
Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)
1.1. La economía de los datos
1.2. Industria 4.0 y las tecnologías digitales
1.3. Sectores pioneros en datos masivos
1.4. Evolución de los sistemas de gestión de datos
2.1. Fundamentos a datos masivos
2.1.1. Verdades y mitos
2.1.2. Dimensionalidad y características:
2.1.2.1 Sistemas distribuidos
2.2. Tipos de datos
2.2.1 Datos estructurados
2.2.2 Bases de datos NoSQL
2.3. Casos de estudio (industria, empresa, educación, etc.)
2.4. Estado del arte en datos masivos
3.1. Arquitectura de datos masivos
3.1.1. Infraestructura para almacenamiento de datos
3.1.2. Infraestructura para gestión
3.1.3. Aplicaciones para modelado y análisis de datos
3.2. Componentes tecnológicos
3.2.1. Acopio de datos estructurados y no estructurados (Hive, Drill, Impala, HBase, entre otros).
3.2.2. Sistemas de archivo y bases de datos distribuidas
3.2.2.1 Gestión de datos distribuidos (HDFS)
3.2.3. Plataformas de cómputo en la nube
3.2.4. Sistemas de almacenamiento escalable
3.2.5. Procesamiento masivo paralelo de datos
3.2.5.1 MapReduce
3.3. Gobernabilidad de los datos masivos
3.4 Roles: negocios, científico de datos, IT, analista, etc.
4.1. Modelo de datos
4.1.1 Key-value, documentos, wide-column, gráficos
4.2. Herramientas de extracción, transformación y carga (ETL)
4.2.1. Elementos del proceso de ETL
4.2.2. Marco de trabajo para la implementación del proceso ETL
4.2.2.1 Arquitectura
4.2.2.2 Implementación
4.2.2.3 Integración
4.2.2.4 API’s para la manipulación de datos integrados
4.2.2.4.1 Pig, Spark, Tableau
5.1. Tipos de análisis: descriptivo, diagnóstico, predictivo y prescriptivo.
5.2. Proceso de descubrimiento de conocimiento en base de datos (KDD)
5.3. Tecnologías emergentes de analítica
5.3.1 Amazon AWS, Microsoft Azure, Google BigQuery, entre otros.
5.4. Principales métodos de minado de datos
5.5. Bibliotecas de consulta para el manejo de datos masivos
5.6. Bibliotecas de minería de datos y aprendizaje automático
5.7. Visualización
5.8. Interpretación de resultados
References
- Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. John Wiley & Sons. [Clásica]. Available at: Wiley Online Library.
- Hurwitz, J., Nugent, A., Halper, F., & Kaufman, M. (2013). Big Data for Dummies. John Wiley & Sons Inc. [Clásica].
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. Available at: O'Reilly Media.
- Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. [Clásica].
- Marz, N., & Warren, J. (2015). Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Manning Publications Co. [Clásica].
- O'Reilly Radar Team. (2014). Big Data Now: Current Perspectives from O'Reilly Radar. O'Reilly Media, Incorporated. [Clásica].
- Wessler, M. (2016). Big Data Analytics for Dummies. John Wiley & Sons. [Clásica].
- White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media, Incorporated. [Clásica].
- Atzori, L., Iera, A., & Morabito, G. (2010). The Internet of Things: A Survey. Computer Networks, 54(15), 2787-2805.
- Li, S., Da Xu, L., & Zhao, S. (2015). The Internet of Things: A Survey. Information Systems Frontiers, 17(2), 243-259.
- Marwedel, P. (2021). Embedded System Design: Embedded Systems Foundations of Cyber-Physical Systems, and the Internet of Things. Springer Nature. Available at: UABC Library.
Visualization and user-friendly reports
PowerBI
Grafana
Information Retrieval
Natural language search
Elastic search
Database as product
- Backup
- Visualization
- Authorization and authentication
- API REST
- User Interface
Supabase
Next steps
Data engineering
https://www.cockroachlabs.com/
https://www.youtube.com/watch?v=wUaGyUpyE2E&t=187s&ab_channel=PeladoNerd
References
Database Internals: A Deep Dive Into How Distributed Data Systems Work https://www.databass.dev/
Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. by Eric C Redmond, Jim L. Wilson, and Luc Perkins
Ullman, J. and Widom, J. 2014. A first Course in Database Systems: Pearson New International Edition. Pearson Education Limited.
Sumathi, S. and Esakkirajan, S. 2010. Fundamentals of relational database management systems. Springer.
Stephens, R. 2009. Beginning database design solutions. Wiley Pub.
Garcia-Molina, H., Ullman, J. and Widom, J. 2009. Database systems. Prentice Hall.
Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. 3rd ed. McGraw-Hill, 2002. ISBN: 9780072465631.
Database System Concepts by Henry Korth
Hellerstein, Joseph, and Michael Stonebraker. Readings in Database Systems (The Red Book). 4th ed. MIT Press, 2005. ISBN: 9780262693141. https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-830-database-systems-fall-2010/readings/ http://www.redbook.io/index.html
2021. Database Systems: The Complete Book -complement-.
Stanford.edu. Retrieved June 29, 2021 from http://infolab.stanford.edu/~ullman/dscb.html#slides
2021. CS 186 - Introduction to Database Systems.
Berkeley.edu. Retrieved June 29, 2021 from https://inst.eecs.berkeley.edu/~cs186/sp08/index.html