Data stores, Storage and Repositories
AI summary | This document covers various data storage, databases, and data management topics. It discusses database management systems, good database models, data models, relational databases, non-relational databases, big data, data warehousing, visualization and user-friendly reports, storage options, information filtering and retrieval, and more. The document also includes references to additional resources for further exploration. |
---|
Introduction
1.1 Conceptos básicos de bases de datos
1.2 Elementos y roles que intervienen en los sistemas de bases de datos
1.3 Tipos y formatos de datos
1.4 Retos actuales en el manejo de bases de datos
1.5 Sistemas manejadores de bases de datos
Memory hierarchy
Single source of truth, blackboard architecture, and repositories
Why do you need a database?
Memory vs Storage
How to think about databases. (2016, February 09). Retrieved from https://nolanlawson.com/2016/02/08/how-to-think-about-databases
Information, data, and database
Concepts
Name | Terms | Examples |
---|---|---|
Storage | ||
Noise | ||
Data | Assumed factsAtomic datumCharactersQuantitiesSymbols | |
Database | coherent dataorganized datarecorded datastorage medium | |
Dataset | unstructured information | |
Information | Aggregated datasetsHuman-readableReportsSynthesized databasecontextual database | |
Knowledge | Skillscompetenceinferencelearningpredictions | |
Wisdom/Intelligence | Complex systemsManage uncertainty | |
Blob | ||
Memory |
Database
Archaic. Data bank is an obsolete term for databases.
Types
Data model
In the space of data modeling, we have different data models to represent data consistently. The entity-relationship model, object orientation, and graph models.
There are two primary paradigms in the graph model space: one based on the semantic web, where the graph consists of the subject-predicate-objects in RFD language, often referred to as “triple-store”, and the other based on labeled property graphs, which extend the directed graph paradigm with the concept of properties located on nodes and potentially on relationships.
https://www.youtube.com/watch?v=PEyW-MfxaEA
https://www.youtube.com/watch?v=OVweE--RJqM
https://www.youtube.com/watch?v=-jrOWXkgfcU
https://www.dataversity.net/the-atoms-and-molecules-of-data-models/
https://terminusdb.com/blog/graph-database-fundamentals/
GraphQL
Gremlin
Cypher
SPARQL
AQL
https://www.bbntimes.com/technology/why-graphql-will-rewrite-the-semantic-web
The resource description framework (in short RFD) is a W3C standard for exchanging proposed in 1999 for exchanging data on the web.
Document data model embed data hierarchy into a single object.
- Structure.
- Operations.
- Constraints.
Types of databases
2.1 Modelo jerárquico
2.2 Modelo de redes
2.3 Modelo relacional
2.4 Modelo orientado a objetos
2.5 Modelos no convencionales (multidimensionales, deductivos, transaccionales
Online transaction processing (OLTP) and online transaction processing (OLAP).
Story
Key questions
How to choose a database for your project?
Are spreadsheets -Excel-, Access, Notion,... a database?
Yes, they are. But they will not be able to scale very well or fit with your needs. Remember they are not real relational databases.
How can I visualize the data?
Why should we create an electronic database?
This is an organizational decision.
Is digital more secure than paper?
Yes, it is.
How do you save and retrieve documents, images, and other blobs?
Different approaches have been addressed to save and retrieve data in databases, filesystems, object systems, and others.
https://en.wikipedia.org/wiki/Object_storage
https://wiki.postgresql.org/wiki/BinaryFilesInDB
https://www.microsoft.com/en-us/research/wp-content/uploads/2006/04/tr-2006-45.pdf
Y2K Problem
Database management system
A database management system (DBMS) is a software system for defining, constructing, and sharing databases among various clients. It abstracts storage medium.
- Defining involves specifying data types structures, and constraints in the database by a data definition language. Descriptive information is in the form of a database catalog or dictionary, which is called meta-data.
- Construing is the process the storing the data by DML – Data Manipulation Language.
Good database model to deliver good information
Quickly, reliably, and consistently. It also won't be useful if it is stolen, lost, or corrupted when the system crashes. ACID and constraints
Transaction
Distributed Transaction
Atomicity
Isolation
Consistency
Durability
Eventual Consistency
- CRUD feature.
- Retrieval
- Controlling redundancy
- Persistent.
- Persistence and Backups.
- Atomicity
- Consistency
- Durability
- Isolation
- Low Cost and Extensibility
- Sharing
User Needs and Requirements
This is no book about software engineering, so we’re not able to talk about how to extract business rules -requirements-. Here we’re assuming them, that is, you’re going to learn about database paradigms, their drawbacks, their advantages, and building a Data Model such that it fits with requirements.
Dilbert y las Hojas de Cálculo - Una divertida visión del "Excel Hell". (2023, March 20). Retrieved from https://outsourceando.blogspot.com/2012/08/dilbert-y-las-hojas-de-calculo-una.html?mc_cid=385764652e&mc_eid=541175b307
Data models
Normal forms
Money and other high school numeric types
Decimal data type
The standard IEEE 754
Cache
i18n
Time and timezones
11:59 UTC−12:00 Anywhere on Earth
Geometry
Generated columns
Indexing
By default, relational and documented-based databases scan the entire table sequentially to find matches. This is known as a full table scan or sequential scan. As the number of rows increases, the query becomes slower. The solution to this is to employ indexing which performs well even as the table size grows. Indexing consists of collecting, parsing, and storing data to facilitate fast and accurate exact information retrieval. So, the indexing process relies on a data structure that improves the speed of retrieval operations in a database table, at the cost of additional writes and storage space to maintain the index data structures.
SELECT first_name FROM people WHERE last_name = 'Smith';
Primary keys
Partitions
Horizontal scaling
Hash Tables and Consistent Transactions
Can NULLs improve your database query performance?
Relational databases and SQL
Performance checklist
Prepared statements
Indexing
Data-interchange format
CSV
XML
JSON
YAML
Arrow
Parquet
Spreadsheets
Storage
Flat Files
JSON
INI Files
XML (XML)
YAML
Facts with prolog (facts = data)
https://stackoverflow.com/questions/2117651/comparing-sql-and-prolog
Worked examples
package.json
Of Prolog, T. P. (2022, April 03). A Tour of Prolog. Youtube. Retrieved from https://www.youtube.com/watch?v=8XUutFBbUrg&ab_channel=ThePowerofProlog
Spreadsheets
https://datascience.stackexchange.com/questions/5443/do-data-scientists-use-excel
Hierarchical Databases
Geographical Information System Database
Embedded databases
An embedded database is an in-process database management system that’s tightly integrated with the application layer.
Duck DB
LanceDB
The current main drawback for embedded databases is they don’t allow collaboration by default.
You can explore OLTP
Non-relational databases
1.1 Evolució n de las bases de datos
1.1.1 Surgimiento de las computadoras
1.1.2 Bases de datos relacionales
1.1.3 Bases de datos no-relacionales
1.2 Almacenamiento de datos no-textuales
1.3 Nuevos requerimientos de accesibilidad
1.3.1 Acceso global
1.3.2 Disponibilidad continua
1.3.3 Datos protegidos
2.1 Explorando Medios sociales
2.1.1 Tó pico tendencia (Twitter)
2.1.2 Conexiones sociales (Facebook)
2.1.3 Perfiles laborales (LinkedIn)
2.2 Documentos de texto
2.3 Datos Web
2.3.1 Correos electró nicos
2.3.2 Redes de intereses (GitHub)
2.3.3 Páginas Web
2.3.4 Microformatos
2.4 Otros tipos de datos no-estructurados
3.1 Evolució n de las bases de datos
3.1.1 Surgimiento de las computadoras
3.2 Conceptos NoSQL
3.2.1 Teorema CAP
3.2.2 Consistencia
3.2.3 Concurrencia
3.2.4 MapReduce
3.3 Tecnología NoSQL
3.3.1 Manejadores clave-valor
3.3.1.1 Redis, Memcached, Oracle Coherence.
3.3.2 Wide-column
3.3.2.1 Apache Hadoop, HBase, Apache Kudu, Apache Druid.
3.3.3 Manejadores de documentos
3.3.3.1 MongoDB, CouchDB.
3.3.4 Manejadores de grafos
3.3.4.1 InterSystems Caché, db4o.
3.3.5 Manejadores de XML
3.3.5.1 BaseX, eXist.
3.3.6 Manejadores de multimedia
3.3.6.1 Apache Cassandra
4.1 Introducció n a lagos de datos (LD)
4.1.1 Entendiendo que son los LD
4.1.2 Antes y después de los LD
4.1.3 Retos para implementar LD
4.1.4 Cuando implementar LD
4.2 Arquitectura de LD
4.2.1 Componentes de un arquitectura de LD
4.2.2 Capas de la arquitectura (gobernanza y seguridad, informació n y metadatos)
4.2.3 Niveles de la arquitectura (entrada, administració n y consumo de datos)
4.3 Gobernanza de datos
4.3.1 Entendiendo el concepto de gobernanza de datos (GD)
4.3.2 Macro y micro GD
4.3.3 Clasificació n y organizació n de datos
4.3.4 Catálogo de datos y metadatos
4.3.5 Administració n de calidad de los datos
4.3.6 Administració n de acceso a los datos
4.3.7 Auditoría de datos
4.3.8 Protecció n de datos
Carpenter, J. & Hewitt, E. (2020). Cassandra The Definitive Guide: Distributed Data at Web Scale. O’Reilly.
Eryurek, E., Gilad, U., Lakshmanan, V., Kibunguchy, A. & Ashdown, J. (2021). Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness. O'Reilly Media
Hills, T. (2016). NoSQL and SQL Data Modeling: Bringing Together Data, Semantics, and Software. Technics Publications. [Clasica].
Pasupuleti, P. & Purra, B. S. (2015) Data Lake Development with Big Data. Packt Publishing [Clásica].
Harrison, G. (2015). Next Generation Databases: NoSQLand Big Data. Apress.
Russell, M. (2019). Mining the Social Web (3rd ed.). O’Reilly Media, Inc. Singh, A. & Ahmad, S. (2021). Data Modeling with NoSQL Database. Independently Published.
Key-value store, Redis case
JSON-like document store, MongoDB case
MongoDB is a document database which become popular with its schamaless way of saving data.
Wide-column store, Cassandra case
Graph database, Neo4j case
SPARQL
Object database
Tabular
Tuple store
Triple/quad store (RDF) database
Multivalue database
Multimodel database
Vector embeddings database
Time series
A time series is the data generated by the observations indexed by time. A more formal definition can be found on
As you might imagine, many processes are indexed by time. For instance, examples include stock prices, CPU usage, sensor data, light curves, and ECG readings.
In this book, our main focus is on how to store time series data reliably and quickly, while minimizing the caveats for real-world applications.
We have various alternatives for managing this kind of data, including Prometheus, Graphite, InfluxDB, TimeScaleDB, and OpenTSDB. While it's possible to use different database systems, they may not be optimized for real-time processing or analytics. Although everyone has, at some point, added a date column to indicate when an operation occurs in the database, with typical examples being 'at_update' and 'at_create', this basic approach doesn't scale well and isn't suitable for time series processing applications.
As demonstrated by Timescale and others, a reliable method for persisting time series data is using PostgreSQL and its extension, TimescaleDB. Therefore, we’re going to use this as our example to illustrate a concrete case.
https://www.youtube.com/channel/UCPmHSkid9IOYbdN1Psh24lg
Big data
Hadoop
Spark
Mainframe
Supercomputers
Data Warehouse
Big Query
ETL
OLAP
Schemas
Data mart
dataWarehouse.com | Article. (2000, July 18). Retrieved from https://web.archive.org/web/20110420134556/http://csis.bits-pilani.ac.in/faculty/goel/Data Warehousing/Articles/Data Marts/dataWarehouse_com Article_DM VS DW.htm
Snowflake schema
Star schema
Activity schema
References
Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition)
1.1. La economía de los datos
1.2. Industria 4.0 y las tecnologías digitales
1.3. Sectores pioneros en datos masivos
1.4. Evolución de los sistemas de gestión de datos
2.1. Fundamentos a datos masivos
2.1.1. Verdades y mitos
2.1.2. Dimensionalidad y características:
2.1.2.1 Sistemas distribuidos
2.2. Tipos de datos
2.2.1 Datos estructurados
2.2.2 Bases de datos NoSQL
2.3. Casos de estudio (industria, empresa, educación, etc.)
2.4. Estado del arte en datos masivos
3.1. Arquitectura de datos masivos
3.1.1. Infraestructura para almacenamiento de datos
3.1.2. Infraestructura para gestión
3.1.3. Aplicaciones para modelado y análisis de datos
3.2. Componentes tecnológicos
3.2.1. Acopio de datos estructurados y no estructurados (Hive, Drill, Impala, HBase, entre otros).
3.2.2. Sistemas de archivo y bases de datos distribuidas
3.2.2.1 Gestión de datos distribuidos (HDFS)
3.2.3. Plataformas de cómputo en la nube
3.2.4. Sistemas de almacenamiento escalable
3.2.5. Procesamiento masivo paralelo de datos
3.2.5.1 MapReduce
3.3. Gobernabilidad de los datos masivos
3.4 Roles: negocios, científico de datos, IT, analista, etc.
4.1. Modelo de datos
4.1.1 Key-value, documentos, wide-column, gráficos
4.2. Herramientas de extracción, transformación y carga (ETL)
4.2.1. Elementos del proceso de ETL
4.2.2. Marco de trabajo para la implementación del proceso ETL
4.2.2.1 Arquitectura
4.2.2.2 Implementación
4.2.2.3 Integración
4.2.2.4 API’s para la manipulación de datos integrados
4.2.2.4.1 Pig, Spark, Tableau
5.1. Tipos de análisis: descriptivo, diagnóstico, predictivo y prescriptivo.
5.2. Proceso de descubrimiento de conocimiento en base de datos (KDD)
5.3. Tecnologías emergentes de analítica
5.3.1 Amazon AWS, Microsoft Azure, Google BigQuery, entre otros.
5.4. Principales métodos de minado de datos
5.5. Bibliotecas de consulta para el manejo de datos masivos
5.6. Bibliotecas de minería de datos y aprendizaje automático
5.7. Visualización
5.8. Interpretación de resultados
References
- Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners. John Wiley & Sons. [Clásica]. Available at: Wiley Online Library.
- Hurwitz, J., Nugent, A., Halper, F., & Kaufman, M. (2013). Big Data for Dummies. John Wiley & Sons Inc. [Clásica].
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. Available at: O'Reilly Media.
- Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. [Clásica].
- Marz, N., & Warren, J. (2015). Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Manning Publications Co. [Clásica].
- O'Reilly Radar Team. (2014). Big Data Now: Current Perspectives from O'Reilly Radar. O'Reilly Media, Incorporated. [Clásica].
- Wessler, M. (2016). Big Data Analytics for Dummies. John Wiley & Sons. [Clásica].
- White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media, Incorporated. [Clásica].
- Atzori, L., Iera, A., & Morabito, G. (2010). The Internet of Things: A Survey. Computer Networks, 54(15), 2787-2805.
- Li, S., Da Xu, L., & Zhao, S. (2015). The Internet of Things: A Survey. Information Systems Frontiers, 17(2), 243-259.
- Marwedel, P. (2021). Embedded System Design: Embedded Systems Foundations of Cyber-Physical Systems, and the Internet of Things. Springer Nature. Available at: UABC Library.
Visualization and user-friendly reports
PowerBI
Grafana
Storage
File storage
https://www.cs.cmu.edu/~fp/courses/15213-s07/lectures/15-filesys/index.html
https://wiki.archlinux.org/title/FUSE
FUSE-based file system
NFS
S3FS
Block storage
Object storage
NAS
HDFS
IPFS
DVC
Vault
How can you get and retrieve the secrets?
Information Filtering and Information Retrieval
Database as product
- Backup
- Visualization
- Authorization and authentication
- API REST
- User Interface
Supabase
Discussions
Filesystem as a source of truth
Is QUIC a Good Protocol for Databases?
How Spotify switched from UUID as the Primary key type
Why Uber Engineering switched from Postgres to MySQL
Optimistic vs Pessimistic currency control
Data with API REST support
Write amplification explained in backend apps, database systems, and SSDs
Optimistic vs Pessimistic Concurrency control
Locks
Database comparison
Indexing in PostgreSQL vs MySQL
Next steps
Data Engineering
https://www.cockroachlabs.com/
https://www.youtube.com/watch?v=wUaGyUpyE2E&t=187s&ab_channel=PeladoNerd
References
Database Internals: A Deep Dive Into How Distributed Data Systems Work https://www.databass.dev/
Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. by Eric C Redmond, Jim L. Wilson, and Luc Perkins
Ullman, J. and Widom, J. 2014. A first Course in Database Systems: Pearson New International Edition. Pearson Education Limited.
Sumathi, S. and Esakkirajan, S. 2010. Fundamentals of relational database management systems. Springer.
Stephens, R. 2009. Beginning database design solutions. Wiley Pub.
Garcia-Molina, H., Ullman, J. and Widom, J. 2009. Database systems. Prentice Hall.
Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. 3rd ed. McGraw-Hill, 2002. ISBN: 9780072465631.
Database System Concepts by Henry Korth
Hellerstein, Joseph, and Michael Stonebraker. Readings in Database Systems (The Red Book). 4th ed. MIT Press, 2005. ISBN: 9780262693141. https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-830-database-systems-fall-2010/readings/ http://www.redbook.io/index.html
2021. Database Systems: The Complete Book -complement-.
Stanford.edu. Retrieved June 29, 2021 from http://infolab.stanford.edu/~ullman/dscb.html#slides
2021. CS 186 - Introduction to Database Systems.
Berkeley.edu. Retrieved June 29, 2021 from https://inst.eecs.berkeley.edu/~cs186/sp08/index.html
https://www.udemy.com/course/database-engines-crash-course
TODO
GenBank and sequence-similarity searches