📚

Data stores, Storage and Repositories

AI summaryThis document covers various data storage, databases, and data management topics. It discusses database management systems, good database models, data models, relational databases, non-relational databases, big data, data warehousing, visualization and user-friendly reports, storage options, information filtering and retrieval, and more. The document also includes references to additional resources for further exploration.

Introduction

1.1 Conceptos básicos de bases de datos
1.2 Elementos y roles que intervienen en los sistemas de bases de datos
1.3 Tipos y formatos de datos
1.4 Retos actuales en el manejo de bases de datos
1.5 Sistemas manejadores de bases de datos

Memory hierarchy

Single source of truth, blackboard architecture, and repositories

Why do you need a database?

Memory vs Storage

How to think about databases. (2016, February 09). Retrieved from https://nolanlawson.com/2016/02/08/how-to-think-about-databases

Information, data, and database

Concepts

NameTermsExamples
Storage
Noise
DataAssumed factsAtomic datumCharactersQuantitiesSymbols
Databasecoherent dataorganized datarecorded datastorage medium
Datasetunstructured information
InformationAggregated datasetsHuman-readableReportsSynthesized databasecontextual database
KnowledgeSkillscompetenceinferencelearningpredictions
Wisdom/IntelligenceComplex systemsManage uncertainty
Blob
Memory


Database

💡
Database means organized and atomic collections of the data from the types, structures, and constraints of his storage medium (which usually is electronic), his domain, and his management system.

Archaic. Data bank is an obsolete term for databases.

Types

Data model

In the space of data modeling, we have different data models to represent data consistently. The entity-relationship model, object orientation, and graph models.

There are two primary paradigms in the graph model space: one based on the semantic web, where the graph consists of the subject-predicate-objects in RFD language, often referred to as “triple-store”, and the other based on labeled property graphs, which extend the directed graph paradigm with the concept of properties located on nodes and potentially on relationships.

https://www.youtube.com/watch?v=PEyW-MfxaEA

https://www.youtube.com/watch?v=OVweE--RJqM

https://www.youtube.com/watch?v=-jrOWXkgfcU

https://www.dataversity.net/the-atoms-and-molecules-of-data-models/

https://terminusdb.com/blog/graph-database-fundamentals/

GraphQL

Gremlin

Cypher

SPARQL

AQL

https://www.bbntimes.com/technology/why-graphql-will-rewrite-the-semantic-web

The resource description framework (in short RFD) is a W3C standard for exchanging proposed in 1999 for exchanging data on the web.

Document data model embed data hierarchy into a single object.

  1. Structure.
  1. Operations.
  1. Constraints.

Types of databases

2.1 Modelo jerárquico
2.2 Modelo de redes
2.3 Modelo relacional
2.4 Modelo orientado a objetos
2.5 Modelos no convencionales (multidimensionales, deductivos, transaccionales

https://twitter.com/bytebytego/status/1676828359475744769/photo/1
https://twitter.com/AmigosCode/status/1682758269004050432/photo/1

Online transaction processing (OLTP) and online transaction processing (OLAP).

Story

Key questions

How to choose a database for your project?

Are spreadsheets -Excel-, Access, Notion,... a database?

Yes, they are. But they will not be able to scale very well or fit with your needs. Remember they are not real relational databases.

How can I visualize the data?

Why should we create an electronic database?

This is an organizational decision.

Is digital more secure than paper?

Yes, it is.

How do you save and retrieve documents, images, and other blobs?

Different approaches have been addressed to save and retrieve data in databases, filesystems, object systems, and others.

https://en.wikipedia.org/wiki/Object_storage

https://wiki.postgresql.org/wiki/BinaryFilesInDB

https://www.microsoft.com/en-us/research/wp-content/uploads/2006/04/tr-2006-45.pdf

Y2K Problem

Database management system

A database management system (DBMS) is a software system for defining, constructing, and sharing databases among various clients. It abstracts storage medium.

Good database model to deliver good information

Quickly, reliably, and consistently. It also won't be useful if it is stolen, lost, or corrupted when the system crashes. ACID and constraints

https://brandur.org/acid

Transaction

Distributed Transaction

Atomicity

Isolation

Consistency

Durability

Eventual Consistency

User Needs and Requirements

This is no book about software engineering, so we’re not able to talk about how to extract business rules -requirements-. Here we’re assuming them, that is, you’re going to learn about database paradigms, their drawbacks, their advantages, and building a Data Model such that it fits with requirements.

Dilbert y las Hojas de Cálculo - Una divertida visión del "Excel Hell". (2023, March 20). Retrieved from https://outsourceando.blogspot.com/2012/08/dilbert-y-las-hojas-de-calculo-una.html?mc_cid=385764652e&mc_eid=541175b307

Data models

Normal forms

Money and other high school numeric types

Decimal data type

The standard IEEE 754

Cache

i18n

Time and timezones

11:59 UTC−12:00 Anywhere on Earth

Geometry

Generated columns

Indexing

By default, relational and documented-based databases scan the entire table sequentially to find matches. This is known as a full table scan or sequential scan. As the number of rows increases, the query becomes slower. The solution to this is to employ indexing which performs well even as the table size grows. Indexing consists of collecting, parsing, and storing data to facilitate fast and accurate exact information retrieval. So, the indexing process relies on a data structure that improves the speed of retrieval operations in a database table, at the cost of additional writes and storage space to maintain the index data structures.

💡
Hash-based databases such as Redix may execute the queries in O(1)O(1) depending on how model our problem.
SELECT first_name FROM people WHERE last_name = 'Smith';

Primary keys

Partitions

Horizontal scaling

Hash Tables and Consistent Transactions

Can NULLs improve your database query performance?

Relational databases and SQL

Performance checklist

Prepared statements

Indexing

Data-interchange format

CSV

XML

JSON

http://www.json.org/xml.html

YAML

Arrow

Parquet

Spreadsheets

Storage

Flat Files

JSON

INI Files

XML (XML)

YAML

Facts with prolog (facts = data)

https://stackoverflow.com/questions/2117651/comparing-sql-and-prolog

Worked examples

package.json

timeline

Of Prolog, T. P. (2022, April 03). A Tour of Prolog. Youtube. Retrieved from https://www.youtube.com/watch?v=8XUutFBbUrg&ab_channel=ThePowerofProlog

Spreadsheets

https://datascience.stackexchange.com/questions/5443/do-data-scientists-use-excel

Hierarchical Databases

Geographical Information System Database

Embedded databases

An embedded database is an in-process database management system that’s tightly integrated with the application layer.

Duck DB

KùzuDB

LanceDB

The current main drawback for embedded databases is they don’t allow collaboration by default.

You can explore OLTP

Non-relational databases

1.1 Evolució n de las bases de datos
1.1.1 Surgimiento de las computadoras
1.1.2 Bases de datos relacionales
1.1.3 Bases de datos no-relacionales
1.2 Almacenamiento de datos no-textuales
1.3 Nuevos requerimientos de accesibilidad
1.3.1 Acceso global
1.3.2 Disponibilidad continua
1.3.3 Datos protegidos

2.1 Explorando Medios sociales
2.1.1 Tó pico tendencia (Twitter)
2.1.2 Conexiones sociales (Facebook)
2.1.3 Perfiles laborales (LinkedIn)
2.2 Documentos de texto
2.3 Datos Web
2.3.1 Correos electró nicos
2.3.2 Redes de intereses (GitHub)
2.3.3 Páginas Web
2.3.4 Microformatos
2.4 Otros tipos de datos no-estructurados

3.1 Evolució n de las bases de datos
3.1.1 Surgimiento de las computadoras
3.2 Conceptos NoSQL
3.2.1 Teorema CAP
3.2.2 Consistencia
3.2.3 Concurrencia
3.2.4 MapReduce
3.3 Tecnología NoSQL
3.3.1 Manejadores clave-valor
3.3.1.1 Redis, Memcached, Oracle Coherence.
3.3.2 Wide-column
3.3.2.1 Apache Hadoop, HBase, Apache Kudu, Apache Druid.
3.3.3 Manejadores de documentos
3.3.3.1 MongoDB, CouchDB.
3.3.4 Manejadores de grafos
3.3.4.1 InterSystems Caché, db4o.
3.3.5 Manejadores de XML
3.3.5.1 BaseX, eXist.
3.3.6 Manejadores de multimedia
3.3.6.1 Apache Cassandra

4.1 Introducció n a lagos de datos (LD)
4.1.1 Entendiendo que son los LD
4.1.2 Antes y después de los LD
4.1.3 Retos para implementar LD
4.1.4 Cuando implementar LD
4.2 Arquitectura de LD
4.2.1 Componentes de un arquitectura de LD
4.2.2 Capas de la arquitectura (gobernanza y seguridad, informació n y metadatos)
4.2.3 Niveles de la arquitectura (entrada, administració n y consumo de datos)
4.3 Gobernanza de datos
4.3.1 Entendiendo el concepto de gobernanza de datos (GD)
4.3.2 Macro y micro GD
4.3.3 Clasificació n y organizació n de datos
4.3.4 Catálogo de datos y metadatos
4.3.5 Administració n de calidad de los datos
4.3.6 Administració n de acceso a los datos
4.3.7 Auditoría de datos
4.3.8 Protecció n de datos

Carpenter, J. & Hewitt, E. (2020). Cassandra The Definitive Guide: Distributed Data at Web Scale. O’Reilly.

Eryurek, E., Gilad, U., Lakshmanan, V., Kibunguchy, A. & Ashdown, J. (2021). Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. O'Reilly Media

Hills, T. (2016). NoSQL and SQL Data Modeling: Bringing Together Data, Semantics, and Software. Technics Publications. [Clasica].

Pasupuleti, P. & Purra, B. S. (2015) Data Lake Development with Big Data. Packt Publishing [Clásica].

Harrison, G. (2015). Next Generation Databases: NoSQLand Big Data. Apress.

Russell, M. (2019). Mining the Social Web (3rd ed.). O’Reilly Media, Inc. Singh, A. & Ahmad, S. (2021). Data Modeling with NoSQL Database. Independently Published.

Key-value store, Redis case

JSON-like document store, MongoDB case

MongoDB is a document database which become popular with its schamaless way of saving data.

Wide-column store, Cassandra case

Graph database, Neo4j case

https://neo4j.com/

SPARQL

Object database

Tabular

Tuple store

Triple/quad store (RDF) database

Multivalue database

Multimodel database

Vector embeddings database

Time series

A time series is the data generated by the observations indexed by time. A more formal definition can be found on

As you might imagine, many processes are indexed by time. For instance, examples include stock prices, CPU usage, sensor data, light curves, and ECG readings.

In this book, our main focus is on how to store time series data reliably and quickly, while minimizing the caveats for real-world applications.

We have various alternatives for managing this kind of data, including Prometheus, Graphite, InfluxDB, TimeScaleDB, and OpenTSDB. While it's possible to use different database systems, they may not be optimized for real-time processing or analytics. Although everyone has, at some point, added a date column to indicate when an operation occurs in the database, with typical examples being 'at_update' and 'at_create', this basic approach doesn't scale well and isn't suitable for time series processing applications.

As demonstrated by Timescale and others, a reliable method for persisting time series data is using PostgreSQL and its extension, TimescaleDB. Therefore, we’re going to use this as our example to illustrate a concrete case.

https://www.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/

https://www.timescale.com/blog/how-to-store-time-series-data-mongodb-vs-timescaledb-postgresql-a73939734016/

https://www.youtube.com/channel/UCPmHSkid9IOYbdN1Psh24lg

Big data

Hadoop

Spark

Mainframe

Supercomputers

Data Warehouse

Big Query

ETL

OLAP

Schemas

Data mart

dataWarehouse.com | Article. (2000, July 18). Retrieved from https://web.archive.org/web/20110420134556/http://csis.bits-pilani.ac.in/faculty/goel/Data Warehousing/Articles/Data Marts/dataWarehouse_com Article_DM VS DW.htm

Snowflake schema

Star schema

Activity schema

References

Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)

1.1. La economía de los datos
1.2. Industria 4.0 y las tecnologías digitales
1.3. Sectores pioneros en datos masivos
1.4. Evolución de los sistemas de gestión de datos

2.1. Fundamentos a datos masivos
2.1.1. Verdades y mitos
2.1.2. Dimensionalidad y características:
2.1.2.1 Sistemas distribuidos
2.2. Tipos de datos
2.2.1 Datos estructurados
2.2.2 Bases de datos NoSQL
2.3. Casos de estudio (industria, empresa, educación, etc.)
2.4. Estado del arte en datos masivos

3.1. Arquitectura de datos masivos
3.1.1. Infraestructura para almacenamiento de datos
3.1.2. Infraestructura para gestión
3.1.3. Aplicaciones para modelado y análisis de datos
3.2. Componentes tecnológicos
3.2.1. Acopio de datos estructurados y no estructurados (Hive, Drill, Impala, HBase, entre otros).
3.2.2. Sistemas de archivo y bases de datos distribuidas
3.2.2.1 Gestión de datos distribuidos (HDFS)
3.2.3. Plataformas de cómputo en la nube
3.2.4. Sistemas de almacenamiento escalable
3.2.5. Procesamiento masivo paralelo de datos
3.2.5.1 MapReduce
3.3. Gobernabilidad de los datos masivos
3.4 Roles: negocios, científico de datos, IT, analista, etc.

4.1. Modelo de datos
4.1.1 Key-value, documentos, wide-column, gráficos
4.2. Herramientas de extracción, transformación y carga (ETL)
4.2.1. Elementos del proceso de ETL
4.2.2. Marco de trabajo para la implementación del proceso ETL
4.2.2.1 Arquitectura
4.2.2.2 Implementación
4.2.2.3 Integración
4.2.2.4 API’s para la manipulación de datos integrados
4.2.2.4.1 Pig, Spark, Tableau

5.1. Tipos de análisis: descriptivo, diagnóstico, predictivo y prescriptivo.
5.2. Proceso de descubrimiento de conocimiento en base de datos (KDD)
5.3. Tecnologías emergentes de analítica
5.3.1 Amazon AWS, Microsoft Azure, Google BigQuery, entre otros.
5.4. Principales métodos de minado de datos
5.5. Bibliotecas de consulta para el manejo de datos masivos
5.6. Bibliotecas de minería de datos y aprendizaje automático
5.7. Visualización
5.8. Interpretación de resultados

References

  1. Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. John Wiley & Sons. [Clásica]. Available at: Wiley Online Library.
  1. Hurwitz, J., Nugent, A., Halper, F., & Kaufman, M. (2013). Big Data for Dummies. John Wiley & Sons Inc. [Clásica].
  1. Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. Available at: O'Reilly Media.
  1. Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. [Clásica].
  1. Marz, N., & Warren, J. (2015). Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Manning Publications Co. [Clásica].
  1. O'Reilly Radar Team. (2014). Big Data Now: Current Perspectives from O'Reilly Radar. O'Reilly Media, Incorporated. [Clásica].
  1. Wessler, M. (2016). Big Data Analytics for Dummies. John Wiley & Sons. [Clásica].
  1. White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media, Incorporated. [Clásica].
  1. Atzori, L., Iera, A., & Morabito, G. (2010). The Internet of Things: A Survey. Computer Networks, 54(15), 2787-2805.
  1. Li, S., Da Xu, L., & Zhao, S. (2015). The Internet of Things: A Survey. Information Systems Frontiers, 17(2), 243-259.
  1. Marwedel, P. (2021). Embedded System Design: Embedded Systems Foundations of Cyber-Physical Systems, and the Internet of Things. Springer Nature. Available at: UABC Library.

Visualization and user-friendly reports

PowerBI

Grafana

Storage

File storage

https://www.cs.cmu.edu/~fp/courses/15213-s07/lectures/15-filesys/index.html

https://wiki.archlinux.org/title/FUSE

FUSE-based file system

NFS

S3FS

Block storage

Object storage

NAS

HDFS

IPFS

DVC

Vault

How can you get and retrieve the secrets?

Information Filtering and Information Retrieval

Database as product

Supabase

Discussions

Filesystem as a source of truth

Is QUIC a Good Protocol for Databases?

How Spotify switched from UUID as the Primary key type

Why Uber Engineering switched from Postgres to MySQL

Optimistic vs Pessimistic currency control

Data with API REST support

Write amplification explained in backend apps, database systems, and SSDs

Optimistic vs Pessimistic Concurrency control

Locks

Database comparison

Indexing in PostgreSQL vs MySQL

Next steps

Data Engineering

https://supabase.com/

https://www.cockroachlabs.com/

https://www.youtube.com/watch?v=wUaGyUpyE2E&t=187s&ab_channel=PeladoNerd

References

Database Internals: A Deep Dive Into How Distributed Data Systems Work https://www.databass.dev/

Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. by Eric C Redmond, Jim L. Wilson, and Luc Perkins

Ullman, J. and Widom, J. 2014. A first Course in Database Systems: Pearson New International Edition. Pearson Education Limited.


Sumathi, S. and Esakkirajan, S. 2010. Fundamentals of relational database management systems. Springer.


Stephens, R. 2009. Beginning database design solutions. Wiley Pub.


Garcia-Molina, H., Ullman, J. and Widom, J. 2009. Database systems. Prentice Hall.


Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. 3rd ed. McGraw-Hill, 2002. ISBN: 9780072465631.


Database System Concepts by Henry Korth

Hellerstein, Joseph, and Michael Stonebraker. Readings in Database Systems (The Red Book). 4th ed. MIT Press, 2005. ISBN: 9780262693141. https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-830-database-systems-fall-2010/readings/ http://www.redbook.io/index.html


2021. Database Systems: The Complete Book -complement-.
Stanford.edu. Retrieved June 29, 2021 from http://infolab.stanford.edu/~ullman/dscb.html#slides

2021. CS 186 - Introduction to Database Systems.
Berkeley.edu. Retrieved June 29, 2021 from https://inst.eecs.berkeley.edu/~cs186/sp08/index.html



https://www.udemy.com/course/database-engines-crash-course

TODO

GenBank and sequence-similarity searches