🌐

Distributed Systems

AI summaryThis document covers various topics related to distributed systems, including the MSIS problem in graphs, Halin graphs, self-stabilizing algorithms, process resilience, reliable client-server communication, distributed commit, capacity planning, CAP theorem, monitoring and observability, security, data centers and supercomputing, autonomic computing, tutorials and assignments, and references. It also mentions simulations, electronic voting, storage systems, network file systems, cloud services, blockchain and smart contracts, Kubernetes and Docker, queue messaging, and secure multi-party computation.

Preface

Prerequisites

Learning ethics

Introduction

What is Distributed Computing?

A distributed computer system consists of multiple software components that runs concurrently but run as a single transparent system. Real distribution is achieveed when multiple nodes uses different processing units refers to CPU, GPU, and TPU via networking. However, we can apply algorithms, architechtures in local environments. What happen with async?

"Concurrency vs Parallelism" is a useless dichotomy especially in the world of coroutines, fibers, and intelligent schedulers.

There’s a crisp mathematical distinction. Concurrency means independent interacting computations. Parallelism means independent isolated computations. This is completely orthogonal to hardware/software boundaries. You can write this down in many process calculi. For example, in the rho calculus the process P1 | … | Pn is parallel if for all 1 <= i, j <= n FN( Pi ) \cap FN( Pj ) = \empty; and concurrent otherwise. It’s sad that so little of concurrency theory has penetrated into the communities that need it most.

https://twitter.com/leithaus/status/1699568865121104098

https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

What sets distributed systems engineering apart is the probability of failure and, worse, the probability of partial failure. If a well-formed mutex unlock fails with an error, we can assume the process is unstable and crash it. But the failure of a distributed mutex’s unlock must be built into the lock protocol. https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Single clock and scheduler

pervasive computing

Multiprocessor Systems

Multi-core Processors

Graphics Processing Unit (GPU) Computing

Characteristics.

Transparency.

Global Scheduler.

Design. Interfaces. Open-close principle.

High-performance distributed computing.

Pervasive systems (Supranet)

Federated systems

Servless, Blokchain

Cloud

Descentralized

The model

A misconception is that distributed systems are microservices, indeed the latter is an implementation of the former.

Types of distribution transparency

Access

Location

Relocation

Migration

Replication

Concurrency

Failure.

Pitfalls

Types of distributed systems

Grid computing.

Cloud computing.

Cluster computing.

Distributed information systems

Enterprise Application Integration

Primitives.

ACID properties.

Pervasive systems

Mobile systems

Embedded Devices

Sensor Networks

Why does Distributed Computing matter to you?

If you have performance issues where the process is parallelizable, you have intensive IO operations, or you need a reliable system you must check out distributed computing.

Research

Ecosystem

Standards, jobs, industry, roles, …

Clouds

Hadoop

Docker

Story

Case of study

Google File System (GFS)

Summary

FAQ

Worked examples

Concurrent and Parallel Principles

Parallel and concurrent process

Primitives of concurrent programs

CPU-bound operations and IO-bound operations

Coroutines make sense for IO-bound operations because X. On the another hand, threads make sense for CPU-bound operations. Process makes sense when you would like to implement the Actor Model.

Threads

Kernel Threads and User threads are suitable for CPU-bound tasks.

Processes and Actor model

Coroutines

Futures, promises and the async/await pattern

Cooperative multitasking

Yield

Event driven and callbacks

Reactive programming

Asynchronous and synchronous Programming

Concurrency

Parallelism

Bernstein conditions for detection of parallelism

Deadlocks

Reactive and Deliberative Systems

Logging and Distributed Management

Push-Pull vs Subscribe

Dealing with Time

Reactive programming

Functional reactive programming

Throughput

Overlay Network

Latency

Shared Memory vs Message Passing

Async/await, Futures and promises, Coroutines, Channels

Idempotent

Formally, we said the operationoperation to be idempotent if operation(x,x)=xoperation(x,x)=x for all xXx \in X.

For example, suppose the operation consults the money in my bank account you could ask for it a lot of times but always you have the same amount, so it is idempotent. However, transfer $3145 from my bank account is not idempotent because you’re changing the state.

ModelPropertiesExample
Multi-threadingParallelism, blocking system callsWorker in the browsers

Single-threaded processNo parallelism, blocking system callsTypical synchronous function
Finite-state machineParallelism, nonblocking system calls, Asynchronous operationEvent loop in the browser.

Starvation

This refers to a situation in which a task is prevented from accessing a resource that it needs, due to the actions of other tasks.

Synchronization

Dining philosophers problem

Fork–join model

Producer–consumer problem

Readers–writers problem

Mutual exclusion, locks or mutex

Race conditions

Semaphore

Global interpreter lock

Green thread

Giant lock

Symmetric multiprocessing

Actor approach

Distributed Computing Paradigms

https://arxiv.org/pdf/1311.3070.pdf

Cloud

Grid

Cluster

SCOOP (Scalable COncurrent Operations in Python)

Threads, workers, processes, tasks, jobs 🟢

Concurrency patterns

https://en.wikipedia.org/wiki/Concurrency_pattern

Active Object

Balking pattern

Twitter Color Emoji

Barrier

Double-checked locking

Guarded suspension

Leaders/followers pattern

Monitor Object

Nuclear reaction

Reactor pattern

Read write lock pattern

Scheduler pattern

Pool pattern

Given a problem that needs nn  workers (or a pool) and one shared resource, we have two approaches based on either Queue or Mutex.

Thread-local storage

Case of study. Unix and Bash.

Case of study. JavaScript.

Event loop.

Stack, microtasks, macrotasks, promises, callbacks, html elements, and custom events, subscribers, streams

Case of study. Python.

Coroutines

Threads

Multiprocessing library

JobLib

Luigi

Dask

Ray

Case of study. Go.

Coroutines

Case of study. C and C++.

Message Passing Interface (MPI)

http://condor.cc.ku.edu/~grobe/docs/intro-MPI-C.shtm

Case of study. Java.

https://replit.com/@sanchezcarlosjr/distributed-systems-2023-1

Case of study. Haskell.

We classify programming languages as stateful and stateless to abstract the concurrent and parallel principles, even though people have named the in different ways over the years. Stateful languages are those that encourage us to assign variables, on another hand, stateless languages are those that don’t encourage us to assign variables. JavaScript, Go, and C is stateful languages meanwhile Haskell and Erlang are stateless. Immutability. Mutability.

Semigroups and monoids

Group theory in cryptography

Case of study. Erlang and Elixir.

https://www.youtube.com/watch?v=cLno3Wf720c&ab_channel=rvirding

Case of study. Scala.

Case of study. Blockchain smart contracts.

Coroutines

The Little Book of Semaphores

Foundations

Timed Input/Output Automaton (TIOA)

https://github.com/online-ml/river

Distributed algorithms

Models

https://en.wikipedia.org/wiki/Concurrency_(computer_science)

TLA+ Language

PlusCal

https://lamport.azurewebsites.net/tla/book.html

Process calculus

Process Algebra

Algebraic Theory of Processes

π-calculus

References

Distributed Algorithms

Use of Formal Methods at Amazon Web Services
Chris Newcombe, Tim Rath, Fan Zhang, Bogdan Munteanu, Marc Brooker, Michael Deardeuff

Amazon.com

https://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf

Pierce, Benjamin (1996-12-21). "Foundational Calculi for Programming Languages". The Computer Science and Engineering Handbook. CRC Press. pp. 2190–2207. ISBN 0-8493-2909-4.

MIT OpenCourseWare. (2023, January 27). MIT OpenCourseWare. Retrieved from https://ocw.mit.edu/courses/6-046j-design-and-analysis-of-algorithms-spring-2015/resources/lecture-20-asynchronous-distributed-algorithms-shortest-paths-spanning-trees

http://profesores.elo.utfsm.cl/~agv//elo330/

Architectures

Think the architecture of the systems as linear combination.

System architecture



Design Patterns for Cloud Native Applications by Kasun Indrasiri, Sriskandarajah Suhothayan

Data stores

Summary of system design principles

Architectural styles for processing

Architectural styles have two main views the software and hardware organization such that the style are a set of components connected to each other and how these are configured into a system. A component is a replaceable modular unit with interfaces [OMG, 2004b] where “replaceable” means during that system operation you can change the implementation or interface to another one, but you must consider changing interfaces is a pain in the ass. You connect them by “connector”, a mechanism that mediates communication, coordination, or cooperation among them. Some of them are remote procedure calls, message passing, or streaming data.

The needs of architectural styles depends on the style of data processing, these distinguish three different of systems based on data:

Online Transactional Processing

RPC

API

MVC is type of client/server system.

https://www.youtube.com/watch?v=nH4qjmP2KEE&ab_channel=ByteByteGo

TODO: MV* software architectures fits well with client/server systems, right?

Data stream, Spark streaming

Now, we focus on software architecture, that is, the logical organization into software components of which the most important architectural styles are layered architecture, object-based architecture, and resource-centered architecture.

Application layering. N-tier architecture.

Definition

Topological constraints

Applicable problems

Resilience to change

Negative behaviours

Supported NFPs

Inhibited NFPs

Comparison with other architectures

Layered.pdf

Offline-first software

OLTP databases with offline-first approaches

NameDeveloperTypeSync CentralSync P2PDescriptionLicense
Couchbase LiteCouchbaseJSON DocumentYesYesEmbedded/portable database, can synchronize with multiple stationary database and/or mobile devices.Apache 2.0 License
InterBaseEmbarcadero TechnologiesRelationalDependentDependentIoT Award-winning embedded/portable database, can synchronize with multiple stationary database and/or mobile devices using patent pending Change ViewsProprietary
ObjectBoxObjectBox Inc.Object DatabaseDependent?Fast Edge Database for local data persistence, 10x faster than SQLiteApache 2.0 / MIT
RealmMongoDB Inc.Object DatabaseDependentNoPortable local database, has a synchronized mode that synchronizes (real-time) with stationary databaseCore Apache 2.0 License, Sync Proprietary
SQL AnywhereSybase iAnywhereRelationalDependentNoEmbedded/portable database, can synchronize with stationary databaseProprietary
DB2 EveryplaceIBMRelationalDependentNoPortable, can synchronize with stationary databaseProprietary EULA
SQL Server CompactMicrosoftRelationalNoNoSmall-footprint embedded/portable database for Microsoft Windows mobile devices and desktops, supports synchronization with Microsoft SQL ServerProprietary
SQL Server ExpressMicrosoftRelationalNoNoEmbedded database, free downloadProprietary
Oracle Database LiteOracle CorporationRelationalNoNoPortable, can synchronize with stationary databaseProprietary
SQLiteD. Richard HippRelationalNoNoC programming libraryPublic domain
SQLBaseGupta Technologies LLC of Redwood Shores, CaliforniaNoNoProprietary
Sparksee (graph database)Sparsity TechnologiesGraph DatabaseNoNoGraph Database. Written in C++98.Proprietary
RxDBRxDBJSON DocumentYesYesOffline-first, reactive database for JavaScript applicationsApache 2.0 License
PouchDBApache Software FoundationJSON DocumentYesYesOffline-first database designed to sync with CouchDB and compatible serversApache 2.0 License


Mobile database

https://github.com/pazguille/offline-first

https://rxdb.info/offline-first.html

https://twitter.com/paulgb/status/1707101873252057481

https://twitter.com/paulgb/status/1707410383567200734/photo/1

Object-based and service-oriented architectures

Monoliths and microservices

Microservices

https://twitter.com/bytebytego/status/1677925395382042625/photo/1

Resources-based architectures

Middleware organization

Sharding and replication mechanisms

Ranged/dynamic sharding

Algorithmic/hashed sharding

Directory sharding

Geography-based sharding

https://www.youtube.com/watch?v=PmLPbrf-x9g&t=1582s

Sharding

Database Federation

Decentralized organizations

Peer2Peer.pdf

P2P systems

Overlay network

Bootstrapping node

LibP2P

Example

Batch processing

A job or batch refers to a process that is executed offline, meaning it requires no user interaction, but it will run under certain conditions or events, often based on time, though this is not always necessary. We execute jobs when we need to handle a large amount of data for the task at hand. For instance, typical batch operations include DataOps tasks, data governance, and machine learning training, which ingest large amounts of data for downstream consumption.

The hardware matters with you do Deep Learning, namely, you should choose wisely CPU, GPU, TPU.

Distributed massive processing. The map-reduce algorithm.

How can I fit big massive datasets?

Batch Size

Cluster Architecture

Machine Learning, Statistics

StatisticsCPU

Pipe-filter architecture

Dags

Unix philosophy, HDFS (Hadoop Distributed File System). All-reduce algorithm. The two main problems that bash processsing programmers need to solve are: Partitioning and fault tolerance. Collective operation. GFS.

OLAP

Message Passing Interface (MPI).

Google Map-Reduce → Hadooop → Spark

OLAP

Data engineering. v. DataOps.

Kafka, Flink, Spark, Pulsar, etc

Amazon Kinesis, Google Cloud Pub/Sub, Google Cloud Dataflow

Workflow orchestrators and Scheduler

Pipes and Dags

Apache AirFlow

Luigi

Dagster

Apache Oozie

Conductor

https://www.youtube.com/watch?v=MF5OaQEOF2E

Cron Job

Queue architecture

When you prefer a simple approach, a queue architecture can be a good fit. Either a user or another job can dispatch a job to a queue, which could be managed by Redis, Beanstalkd, Amazon's Simple Queue Service (SQS), or even a database table. Meanwhile, your application employs a queue worker, which processes and filters jobs in the background based on certain criteria such as priority. The queue worker can be a new thread, a new process, or an event if you're managing an event loop in a single-thread language.

💡
Hook

Quees with delays.

Stream processing

Event.pdf

Event-driven architecture

Event loop

Micro task and Macro task

Callbacks

Async

Event Queue

Inbox and outbox pattern

https://www.activepieces.com/

System design patterns

Blackboard Architecture

ClientServer.pdf

Server-driven ui

Command.pdf

Composite.pdf

Facade.pdf

Decorator.pdf

MobileCode.pdf

Observer.pdf

PipeFilter.pdf and DAGS

PlugIn and dependencies

https://cs.uwaterloo.ca/~m2nagapp/courses/CS446/1195/Arch_Design_Activity/PlugIn.pdf

https://www.figma.com/blog/how-we-built-the-figma-plugin-system/#implementing-the-api-using-realms-securely

https://tldp.org/LDP/tlk/modules/modules.html

https://dl.acm.org/doi/abs/10.1145/3365137.3365402

https://antonz.org/writing-package-manager/

https://developer.mozilla.org/en-US/docs/Learn/Tools_and_testing/Understanding_client-side_tools/Package_management

https://dl.acm.org/doi/10.1016/j.infsof.2012.09.002

Plugins have an life cycle, events and commonn API.

Dependency maanger

https://gitpkg.vercel.app/about

Examples are Linux Kernel Modules

Proxy.pdf

PublishSubscribe.pdf

State.pdf

Strategy.pdf

Visitor.pdf

Algorithms

https://twitter.com/bytebytego

Case of studies for system design

Astrolabe

Robbert Van Renesse, Kenneth P. Birman, and Werner Vogels. 2003. Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining. ACM Trans. Comput. Syst. 21, 2 (May 2003), 164–206. https://doi.org/10.1145/762483.762485

Globule

Globule – Scalable Web application hosting. (2023, February 09). Retrieved from http://www.globule.org

Jade

Jade Site | Java Agent DEvelopment Framework. (2023, February 09). Retrieved from https://jade.tilab.com

https://dxos.org/ provides developers with everything they need to build real-time, collaborative apps which run entirely on the client, and communicate peer-to-peer, without servers.

https://jitsi.org/

https://www.youtube.com/watch?v=PmLPbrf-x9g&t=429s&ab_channel=Ceph

WebSocket, WebRTC, SFU (MediaSoup), SIP (Kamailio), RTP

Summary

Worked examples

Processes

Threads in distributed systems

The most important thread property is allowing blocking system calls but not blocking the entire process in which the thread is running because it makes it much easier to express multiple logical connections at the same time.

Virtualization

Clients

If you must establish a high degree of distribution transparency on the client side, that is, you should hide the communication latencies, then the most common is to initiate the communication right off the bat, display something quickly to users, and not block the user interface. All together can be done via threads.

Replicated calls from a client

A replicate call is a remote procedure call (RPC) between a node and the replicas, that is, the other nodes in the system that holds a copy of the data or chunk being requested, where a response may or may not be received due to system design or network problems.

Typically, a replicated call refers to a situation where multiple replicas of a server-side application are involved in processing a client's request.

Web browser

Servers

Rates Limits

Window X

Window X designates a user’s terminal as hosting the server, while the application is known as the client. It makes sense because on one hand there is a limited resource and $n$ applications wanting to access them, so the Windows X server coordinates the clients and resource interruptions.

Proxies

Reverse proxies and loading balancers

traefik, ngrok, …

API Gateways

Authentication

Capacity

Queue Theory

https://www.youtube.com/watch?v=m64SWl9bfvk

Cache, materialization and Content Delivery Network

An alternative solution to Caching stratagies

Filters are probabilistic data structures

Webhook

Code migration and mobility

Remote execution and Code-on demand

Mobile Agent

Cache

Checkpoint and restore

Actor model

https://arxiv.org/pdf/1008.1459.pdf

Queue system analysis

https://www.youtube.com/watch?v=z611FLLR3_8&ab_channel=GeoffreyMessier

https://www.youtube.com/watch?v=Hda5tMrLJqc&ab_channel=USENIX

https://www.youtube.com/watch?v=oQGreeij-OE&ab_channel=8thLightInc.

Job scheduler

Summary

Exercises

Communication

Rate Limiting

Protocols

Communication protocols supported by browers

Server-Sent Events (SSE)

Websockets and WebSocket Secure

HTTP (1,2,3) and HTTPS

FTP, SFTP FISH

File URI scheme

Marionette Remote Protocol, WebDriver, and BiDi

WebRTC

Data URI Scheme

Magnet Links

Push API

TODO: Media Source Extensions (MSE)

WebSocket-to-TCP Proxy

Routing algorithms

Deadlock-free packet switching

Wave and Traversal Algorithms

Election algorithms

Termination detection

Anonymous Networks

Snapshots

Sense of direction and orientation

DNS

rapid_api

Synchrony in networks

Remote procedure call (RPC)

A remote procedure call is a transparent function or method call that internally sends and receives data from remote processes. It must hide the message-passing nature of network communication. You can find a lot of RPC examples such as REST, SOAP, gRPC, Ajax in Web Browsers, graphQL, CORBA, and Trihft.

Webhook

Message-oriented communication

Multicast communication

Worked examples

Summary

Naming

FingerprintJS

UUID, Snowflake IDs, ULIDs, or even NanoIDs

Should be the naming either descentralized or centralized?

Fingerprinting

https://fingerprint.com/blog/device-fingerprinting-android/

url

https://docs.python.org/3/library/urllib.parse.html

scheme://netloc/path;parameters?query#fragment

Service Discovery

Identificadores bibliográficos (ISBN, ISSN, DOI, URI): PURL

https://biblioguias.unex.es/c.php?g=572103&p=3944904

Flat naming

Chord ring

It employs random m-bit identifier spaceDistributed hash table (key, value) key is a m-bit identifier space, the entities can be files, processes, etc.finger_table
node_set
__succ(key)
find(key)
set(key, value)
What happens when a node joins and leaves the ring?An entity with key k falls under the jurisdiction of the node with the smallest identifier idkid \ge kfinger_table: index → succ(node_id+2^(index-1))
node-set is the sorted set of nodes known by a specific Chord node.The values are shared among the nodes instead of relaying in one node.The node set is the reference to a subset of nodes.
The keys and node identifiers are in the same space.The pred(p) is the node's previous to p in the ring.

Structured naming

Attribute-based naming

Worked examples

Summary

Coordination

Consistent Hashing

Clock synchronization

Time, Clocks, and the Ordering of Events in a Distributed System

Leslie Lamport

Communications of the ACM

July 1978, Volume 21, Number 7, pages 558-565

Download

https://lamport.azurewebsites.net/pubs/time-clocks.pdf

Location

Geography systems

Choosing location

Location tracking

location-aware authentication

Geohashing

Quadtrees

Geo-blocking

Summary

Worked example

Process 1 and Process 2 behave same, however, it depends both are fullfilled and they ocurrs in different order. Let S0,S1,S2 those states that are in PiP_i we definite transitions the global state of system as follow:

 P1 P2      P1 P2
[S0,S1] -> [S1,S2]
[S1,S2] -> [S2,S0]
[S2,S0] -> [S0,S1]

Questions

Consistency and replication

Stateless and stateful

Atomic objects

Fault tolerance

Fault Tolerance in Distributed systems

Case of study

VMware FT

Circuit Breaker Design Pattern

Questions

https://link.springer.com/article/10.1007/s13177-019-00184-3

Fault tolerance

Single point of failure

Glitch

Replication

SLA, SLO, SLI

Fault tolerance in Asynchronous systems

Fault tolerance in Synchronous systems

Failure detection

Disaster recovery

Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

Circuit breaker

Stabilization

Self-stabilization is a desirable property in distributed algorithms. It refers to the ability of a system to recover from illegitimate states to legitimate states without external intervention. This property can be particularly useful in systems where there may be faults, or in systems that start in an illegitimate state.

The Maximal Strongly Independent Set (MSIS) problem in graphs is a well-known NP-hard problem in graph theory and computer science. The MSIS is a subset of nodes such that no node is adjacent to all nodes in the MSIS, an none of them can be added to the independent set without violating this property.

Halin graphs are a particular class of graphs, named after the mathematician Rudolf Halin, which are constructed from a rooted planar tree by embedding a cycle through all the leaves.

A self-stabilizing algorithm for the MSIS problem in geometric Halin graphs could be developed iteratively, where the nodes of the graph update their status based on the states of their neighbors until the system reaches a stable state. A key consideration here would be ensuring that the algorithm is efficient, i.e., it converges quickly and uses a minimal amount of resources.

Process resilience

Reliable client-server communication

Distributed commit

Capacity Planning

CAP theorem

Consistency

Summary

Critical systems

Aircraft systems

https://www.youtube.com/watch?v=BAbevxL_D1E

Ship systems

https://www.youtube.com/watch?v=3Hx0172DSjU

Nuclear systems

Security

Monitoring and observability

Observability is a technique used to observe the state of a system based on metrics, logs, and traces in a distributed system. On the other hand, monitoring consists of checking processes and analyzing metrics over a period of time in a monolithic. systems. Therefore, monitoring tasks are a subset of Observability tasks.

Logs

Event logs

syslogs

netflow

packet captures

firewall logs

Tools

OpenTelemetry

OpenMetrics

Nagios

Prometheus

Grafana

New relic

logz.io

Instana

Splunk

AppDynamics

Elastic

Dynatrace

Datadog

OpenCensus

Operating system journals

https://github.com/louislam/uptime-kuma

Grafapha and Prometheus

References

R. E. Kalman, “On the General Theory of Control Systems,” Proceedings of the First International Congress on Automatic Control, Butterworth, London, 1960, pp. 481-493.

Observability at Twitter https://blog.twitter.com/engineering/en_us/a/2013/observability-at-twitter, https://blog.twitter.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-i

Practical Monitoring by Mike Julian

The Art of Monitoring by James Turnbull

https://riemann.io/

Security

Data centers, supercomputing and HPC

Computer cluster

SLURM

Lustre, Ceph

job scheduler . Workload Manager

Cockpit Project

https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html

https://researchcomputing.rice.edu/rice-supercomputing-nots

IPICYT Mexico

Autonomic computing

cellular automaton

Von Neumann universal constructor

Logical universaility

Codd's cellular automaton

Constructability

Poslad, Nami, Sharifi, IBM, Von Neunmann proposes that autonomic computing system has the properties:

References

Von Neumann, J., & Burks, A. W. (1966). Theory of self-reproducing automata. IEEE Transactions on Neural Networks5(1), 3-14.

Kephart, J.O.; Chess, D.M. (2003), "The vision of autonomic computing", Computer36: 41–52, CiteSeerX 10.1.1.70.613doi:10.1109/MC.2003.1160055

Poslad, Stefan (2009). Autonomous systems and Artificial Life, In: Ubiquitous Computing Smart Devices, Smart Environments and Smart Interaction. Wiley. pp. 317–341. ISBN 978-0-470-03560-3. Archived from the original on 2014-12-10. Retrieved 2015-03-17.

Nami, M.R.; Sharifi, M. (2007). "A survey of autonomic computing systems". Intelligent Information Processing III. Third International Conference on Autonomic and Autonomous Systems (ICAS'07). IFIP International Federation for Information Processing. Vol. 228. pp. 26–30. doi:10.1007/978-0-387-44641-7_11ISBN 978-0-387-44639-4S2CID 6974127.

Tutorials and assignments

We will explore some alternatives to perform distributed computing, both locally and in a network, starting from the most basic approaches and progressing to the use of the most feature-rich libraries.

The “&” operator

(sleep 1 && echo "abc") & (sleep 2 && echo "def") & (echo "efg")

GNU parallel

In my opinion, GNU Parallel works well for complex tasks when run locally, but it can be difficult to use on a cluster. You can see for yourself using https://www.msi.umn.edu/support/faq/how-can-i-use-gnu-parallel-run-lot-commands-parallel.

Examples

cd /tmp/ && touch {1..5}.my_random_file | find . -type f -name '*.my_random_file' | parallel gzip --best
parallel echo ::: 😀 😃 😄 😁 😆 😅 😂 🤣 🥲 done!
touch images.txt && parallel printf 'https://picsum.photos/200/300\\n%.0s' ::: {1..1000} | parallel -j+0 -k "echo {} | tee -a images.txt" && cat images.txt | parallel wget
touch images.txt && parallel -j+0 "printf 'https://picsum.photos/200/300\\n%.0s'" ::: {1..2} | parallel -j+0 "echo {} >> images.txt" && parallel -a images.txt -j+0 wget

xargs

cat images.txt | xargs -n 1 -P 10 curl -s | gawk '{print $1}'

JobLib

Dask

Ray

On the other hand, Ray performs well for complex tasks on an on-premise cluster, but when used locally, it can be overwhelming.

https://www.ray.io/

A distributed Hello World program

Network File System

Hadoop and Spark

Open MPI

Low-level alternative

LibP2P

Location-addressing approach vs content addressing

Content identifier

Peer. A participant in a decentralized network that is equally privileged.

Luigi

Cloud services

Backend-as-service

Firebase

Supabase

Herok

Postgresql

Blockchain and smart contracts

Kubernetes and docker

Queue messaging

https://mqtt.org/

MQTT

Message broker

Quality of service

OASIS (organization)

ZeroMQ

Apache Kafka

WebSphere MQ

https://zeromq.org/

Quality of service

Message-oriented middleware

Message queuing service

Ope

Projects

Your own map-reducer

https://pdos.csail.mit.edu/6.824/labs/lab-mr.html

URL Shortener

Chat app

Video Streaming app

Transportation app

Other Alternatives

Object storage

1. Network Attached Storage (NAS) - a file-level storage architecture that allows multiple users to access shared files over a network.
2. Distributed File System (DFS) - a protocol that allows users to access files on multiple servers as if they were on a single shared drive.
3. iSCSI (Internet Small Computer System Interface) - a protocol used for connecting storage devices over a network as if they were directly attached to a computer.
4. Remote Direct Memory Access (RDMA) - a protocol used for direct memory access between servers over a network, often used for high-performance computing and data-intensive applications.
6. AFS (Andrew File System) - a distributed file system designed for large-scale collaboration and data sharing built by Carnegie Mellon University
.
8. GlusterFS - a distributed file system designed for scalability and high availability, often used for cloud computing and big data applications.
9. Lustre - a parallel distributed file system designed for high-performance computing and scientific applications.

10. File Transfer Protocol (FTP) and Secure File Transfer Protocol (SFTP) - a protocol used for transferring files over a network, often used for website hosting.

version: '3'

services:
  sftp:
    image: atmoz/sftp
    ports:
      - "2222:22"
    environment:
      SFTP_USERS: sftpuser:pass:1001:/data
    volumes:
      - /path/to/sftpdata:/data

11. WebNative FileSystem (WNFS)

  1. MinIO - an open-source, self-hosted cloud-based object storage service that can be deployed on-premises or in a cloud environment.
  1. S3
  1. Ceph - an open-source, software-defined storage system that can be used to create a distributed object store.
  1. OpenStack Swift - an open-source object storage system that can be used to create a scalable and highly available object-store.
  1. Apache Cassandra - a distributed NoSQL database system that can be used to store and retrieve large amounts of data.
  1. GlusterFS - an open-source, distributed file system that can be used to create a scalable and highly available object store.
  1. Riak KV - an open-source, distributed NoSQL database system that can be used to store and retrieve large amounts of data.
  1. OpenIO - an open source, object storage, and data management platform that can be used to store and retrieve large amounts of data.
  1. Scality Ring - an open-source, distributed storage system that can be used to create a highly available and scalable object store.
  1. SwiftStack - an open-source, software-defined storage system that can be used to create a distributed object store.
  1. Nextcloud Object Storage - an open-source, self-hosted cloud storage system that includes an object storage component that can be used to store and retrieve large amounts of data.
  1. SSH and RSYNC over SSH

https://hub.docker.com/r/linuxserver/openssh-server

---
version: "2.1"
services:
  openssh-server:
    image: lscr.io/linuxserver/openssh-server:latest
    container_name: openssh-server
    hostname: openssh-server #optional
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=Etc/UTC
    volumes:
      - ./config:/config
    ports:
      - 2222:2222
    restart: unless-stopped
  1. Network File System
  1. Server Message Block (SMB) and CIFS/SMB
  1. BitTorrent Sync
  1. Atom (web standard)
  1. Content Management Interoperability Services
  1. Linked Data Platform
  1. Unison File Synchronizer https://github.com/bcpierce00/unison
  1. netcat (nc)
nc -lk 1234 < ~/shared

Storage Area Network

Simulations

Simgrid

OverSim

Electronic voting

Next steps

References

Tanenbaum, Andrew S.; Steen, Maarten van (2002). Distributed systems: principles and paradigms. Upper Saddle River, NJ: Pearson Prentice Hall. ISBN 0-13-088893-1. Archived from the original on 2020-08-12. Retrieved 2020-08-28.

6.5840 Home Page: Spring 2023. (2023, January 19). Retrieved from https://pdos.csail.mit.edu/6.824

"Distributed Programs". Texts in Computer Science. London: Springer London. 2010. pp. 373–406. doi:10.1007/978-1-84882-745-5_11. ISBN 978-1-84882-744-8. ISSN 1868-0941.

Lynch, Nancy. Distributed Algorithms. Burlington, MA: Morgan Kaufmann, 1996. ISBN:9781558603486.

Attiya, Hagit, and Jennifer Welch. Distributed Computing: Fundamentals, Simulations, and Advanced Topics. 2nd ed. New York, NY: Wiley-Interscience, 2004. ISBN: 9780471453246.

Herlihy, Maurice, and Nir Shavit. The Art of Multiprocessor Programming. Burlington, MA: Morgan Kaufmann, 2008. ISBN: 9780123705914.

Guerraoui, Rachid, and Michal Kapalka. Transactional Memory: The Theory. San Rafael, CA: Morgan and Claypool, 2010. ISBN: 9781608450114.

Dolev, Shlomi. Self-Stabilization. Cambridge, MA: MIT Press, 2000. ISBN:9780262041782.

Kaynar, Disun, Nancy Lynch, Roberto Segala, and Frits Vaandrager. The Theory of Timed I/O Automata. 2nd ed. San Rafael, CA: Morgan and Claypool, 2010. ISBN:9781608450022. https://groups.csail.mit.edu/tds/papers/Lynch/Monograph-second-edition.pdf

MIT 6.033

MIT 6.5840

MIT 6.824. (2023, January 27). MIT OpenCourseWare. Retrieved from https://ocw.mit.edu/courses/6-824-distributed-computer-systems-engineering-spring-2006

Systems, Mit 6. 824: Distributed. "Lecture 1: Introduction." YouTube, 6 Feb. 2020, www.youtube.com/watch?v=cQP8WApzIQQ&list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB&ab_channel=MIT6.824%3ADistributedSystems.

Saltzer and Kaashoek's Principles of Computer System Design: An Introduction (Morgan Kaufmann
2009).

Nancy Lynch, Michael Merritt, William Weihl, and Alan Fekete. Atomic Transactions. Morgan Kaufmann Publishers, 1994.

MIT OpenCourseWare. (2023, January 27). MIT OpenCourseWare. Retrieved from https://ocw.mit.edu/courses/6-852j-distributed-algorithms-fall-2009

cs6234-16-pds https://www.comp.nus.edu.sg/~rahul/allfiles/cs6234-16-pds.pdf

Stevens, W. Richard, Bill Fenner, and Andrew M. Rudoff. UNIX Network Programming, Vol. 1: The Sockets Networking API. 3rd ed. Reading, MA: Addison-Wesley Professional, 2003. ISBN: 9780131411555.

McKusick, Marshall Kirk, Keith Bostic, Michael J. Karels, and John S. Quarterman. The Design and Implementation of the 4.4 BSD Operating System. Reading, MA: Addison-Wesley Professional, 1996. ISBN: 9780201549799.

Stevens, W. Richard, and Stephen Rago. Advanced Programming in the UNIX Environment. 2nd ed. Reading, MA: Addison-Wesley Professional, 2005. ISBN: 9780201433074.

Indrasiri, K., & Suhothayan, S. (2021). Design Patterns for Cloud Native Applications. " O'Reilly Media, Inc.".

. "Distributed Systems lecture series." YouTube, 31 Jan. 2023, www.youtube.com/playlist?list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB.

https://ki.pwr.edu.pl/lemiesz/info/Tel.pdf

http://bafybeiemcaemivgxkzkuqq2kvvfmyo32joipgp6pivk2ea32la2stnmkwm.ipfs.localhost:8080/?filename=Maarten van Steen%2C Andrew S. Tanenbaum - Distributed Systems_ Principles and Paradigms-Maarten van Steen (2023).pdf

https://www.youtube.com/@DistributedSystems

https://www.youtube.com/@lindseykuperwithasharpie

https://www.youtube.com/@rbdiwate626

IPFS and libp2p

TODO

Secure multi-party computation

proxies

reverse-proxy