
databases

links

best practices

  • use purpose built dbs instead of general purpose
    • refrain from having a single, shared general purpose db
    • purpose built DBs excel in a specific domain with unmatched performance relative to general purpose dbs
    • deploy multiple DB engines for specific needs and run analytics across a data warehouse
    • match the data store to the business need and the type of transactions it needs to support
  • model your data stores around transactional vs query needs, using concepts from CQRS to design for the type of work the db needs to do (see the sketch after this list)
  • using multiple dbs requires
    • managing distributed transactions/partial execution failures with business logic/step functions
    • source of truth is scattered across data stores and must be shared with other domains
    • embrace eventual consistency
    • choose the right ETL pattern
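
A minimal sketch of the CQRS split referenced above, assuming hypothetical in-memory stores: commands hit a transactional write model, and queries hit a denormalized read view that is updated from emitted events. All names (`OrderWriteModel`, `OrderReadView`, `place_order`) are illustrative, not from any particular library.

```python
# Minimal CQRS sketch: commands mutate a transactional (write) store,
# queries hit a denormalized read view updated from emitted events.
# All names here are hypothetical/illustrative.
from dataclasses import dataclass, field


@dataclass
class OrderWriteModel:
    """Stands in for the OLTP / transactional data store."""
    orders: dict = field(default_factory=dict)

    def place_order(self, order_id: str, customer: str, total: float) -> dict:
        event = {"order_id": order_id, "customer": customer, "total": total}
        self.orders[order_id] = event   # the command side owns the source of truth
        return event                    # emit an event for the read side to consume


@dataclass
class OrderReadView:
    """Stands in for a query-optimized (eventually consistent) projection."""
    totals_by_customer: dict = field(default_factory=dict)

    def apply(self, event: dict) -> None:
        # project the event into a shape that matches how it will be queried
        customer = event["customer"]
        self.totals_by_customer[customer] = (
            self.totals_by_customer.get(customer, 0.0) + event["total"]
        )


write_model, read_view = OrderWriteModel(), OrderReadView()
read_view.apply(write_model.place_order("o-1", "alice", 40.0))
read_view.apply(write_model.place_order("o-2", "alice", 10.0))
print(read_view.totals_by_customer)  # {'alice': 50.0}
```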

basics

  • OLTP workloads: online transactional processing
    • focus on recording Update, Insertion, and Deletion data transactions
  • OLAP workloads: online analytical processing focusing on read operations, e.g. a data warehouse
    • store historical data that has been input by OLTP
    • can extract information from a large database and analyze it for decision-making
  • command query responsibility segregation: aka polyglot persistence
    • having a single big db instance that's queried by analytics services to provide views into the data for specific microservices/consumers
  • consistency patterns: all about ensuring previous writes are reflected in future reads
    • strongly consistent: i.e. read after write; all writes are always reflected
    • eventually consistent: previous writes may NOT be reflected
    • always design around eventual consistency
      • strong consistency only guarantees that the read you just made reflects all previous writes; data you already read can still become stale as soon as subsequent writes land
  • sharding: enables horizontal scaling to serve more requests
    • distribute your data volume across multiple db instances for storage flexibility
    • the entire dataset is partitioned amongst multiple db instances
    • the partition an item is stored in is determined by a hash of its partition key (see the sketch after this list)
  • no/newsql
    • all about minimizing CPU time relative to data storage and i/o
    • data storage is now much cheaper than cpu time, hence the popularity of nosql dbs
    • does have higher i/o costs as generally you have to retrieve duplicated data at each trip
    • provides greater flexibility in structuring your data, since you're not concerned with normalization
    • OLTP with known request patterns
      • live/interactive applications
      • hot/active data
      • smaller documents
      • known query patterns
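
A minimal sketch of the hash-based sharding described above, assuming a fixed number of in-memory dicts standing in for db instances; the key scheme and shard count are made up.

```python
# Hash-based sharding sketch: route each item to a shard based on a
# stable hash of its partition key. Shard count and keys are illustrative.
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for a db instance


def shard_for(partition_key: str) -> int:
    # use a stable hash (not Python's randomized hash()) so routing is consistent
    digest = hashlib.sha256(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def put(partition_key: str, value: dict) -> None:
    shards[shard_for(partition_key)][partition_key] = value


def get(partition_key: str):
    return shards[shard_for(partition_key)].get(partition_key)


put("user#42", {"name": "alice"})
put("user#99", {"name": "bob"})
print(get("user#42"), "lives in shard", shard_for("user#42"))
```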

purpose built databases

  • data stores that solve specific data needs
    • READ heavy
    • WRITE heavy
    • scalability, fault-tolerance, availability
    • ACID, CAP theorem
    • data structure

data sources

  • structured: stored as a series of data values in related tables for complex queries and analysis
  • unstructured: stored as files (e.g. media, video, documents, etc), thus querying requires specialized tools to catalog and query
  • semistructured: stored in json/xml files that are loaded at runtime; the structure is highly flexible enabling some querying and analysis but not at the level of structured data

relational

  • structured, normalized schemas
  • all about minimizing data storage via normalization
  • this strict requirement on deduplication causes some queries to be cumbersome/cpu intensive
  • either OLTP or OLAP, anything with unknown query patterns
    • only normalized data can efficiently respond to unknown queries
  • Normalized relational or dimensional data warehouse
  • Optimized for storage
  • best scaled vertically
  • referential integrity, ACID transactions, schema-on-write

Non Relational

  • NoSQL: un/semistructured, lacking the ACID mechanisms of relational dbs
  • NewSQL: un/semistructured, gaining traction with implementing ACID mechanisms similar to that of relational DBs
  • Denormalized document, wide column, or key-value
  • Optimized for compute
  • best scaled horizontally
  • OLTP web/mobile apps

document

  • store semi/unstructured data as some type of file

graph

  • purpose-built for traversing multi-layered relationships and highly connected datasets across semi/un/structured data
  • In a graph data model, relationships are first-class citizens, and the data is modeled as nodes (vertices) and links (edges)
  • Nodes: are usually a person, place, or thing
  • links: how nodes are all connected
  • graph workloads: tracking relationships between entities.
    • complex domain model: large data set with many different entities
    • variable schema: similar entities may have different properties
    • variable structure: highly connected entities in many different ways
    • connected queries: navigate connected structures taking into account strength, weight, or quality of relationships

Types

directed property graphs
  • Vertices: aka nodes; things
  • edges: connections
  • properties: both vertices and edges can have properties (effectively, key-value pairs) associated with them.
  • directed: every edge has a direction, out from one vertex and into another
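
A small sketch of a directed property graph under the definitions above: vertices and edges both carry key-value properties, and every edge points out of one vertex and into another. The `Vertex`/`Edge` classes and the sample data are illustrative.

```python
# Directed property graph sketch: vertices and edges both carry
# key-value properties, and every edge has a direction.
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    label: str                      # e.g. "person", "city"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    out_v: str                      # edge direction: out from one vertex...
    in_v: str                       # ...into another
    label: str                      # e.g. "lives_in"
    properties: dict = field(default_factory=dict)


vertices = {
    "v1": Vertex("v1", "person", {"name": "Sarah"}),
    "v2": Vertex("v2", "city", {"name": "Seattle"}),
}
edges = [Edge("v1", "v2", "lives_in", {"since": 2019})]

# traverse: who lives where?
for e in edges:
    if e.label == "lives_in":
        print(vertices[e.out_v].properties["name"],
              "lives in", vertices[e.in_v].properties["name"],
              "since", e.properties["since"])
```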

use cases

Identity Graphs
  • a single unified view of customers and prospects by linking multiple identifiers such as cookies, device identifiers, IP addresses, email IDs, and internal enterprise IDs to a known person or anonymous profile using privacy-compliant methods
  • captures customer behavior and preferences across devices and marketing channels.
  • acts as a central hub and drives targeted advertising, personalization of customer experiences, and measurement of marketing effectiveness.
knowledge graphs
  • a means of structuring and organizing information for improved access and understanding
  • build
    • systems that can recommend people to projects, connect related projects, or centralize access to avoid duplicate efforts.
    • context-aware systems that can derive answers based on queries and a vast knowledge base
  • use machine learning services with knowledge graphs for better decision making and knowledge discovery.
Fraud Detection
  • stores the relationships between the transactions, actors, and other relevant information to help customers find common patterns in the data and build applications that can detect fraudulent activities.
  • people can collude to commit fraudulent transactions, creating fraud rings.
    • can have hundreds of members, which makes it challenging to find the bad actors and detect fraud.
Social Networking
  • store human interactions between people, track skills, and roles at organizations.
  • understand social interactions, likes, and preferences
  • find beneficial connections across companies, fill occupational needs that may not be obvious from job title alone, and understand how information spreads through a population
Network/IT operations
  • store resources, configurations, and access patterns across the technology landscape
  • Storing this topology in a graph gives better insight to the critical resources in the production environment that are essential to protect and better understand whether the changes observed daily in an environment are critical, expected, or anomalous.
Recommendation Engines
  • provide predictive capability based on existing connections in a graph
  • exploit the fact that similarity breeds connections.
  • make recommendations based on a well-understood phenomenon called triadic closure (see the sketch after this list).
    • If a connection between Bill and Sarah exists and a connection between Bill and Terry exists, there is a tendency for a connection between Terry and Sarah to form.
Life Sciences
  • store connections ranging from relationships between biological compounds and biochemical reactions to those connecting multiple datasets.
  • used for applications such as data integration, management of research publications, drug discovery, precision medicine, and cancer research.
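
A toy sketch of the triadic-closure idea from the recommendation engine use case above: recommend a connection between two people who share a neighbor but are not yet connected. The graph data is made up.

```python
# Triadic-closure sketch: if Bill-Sarah and Bill-Terry are connected,
# suggest the Sarah-Terry connection. Graph data is illustrative.
from itertools import combinations

connections = {
    "Bill": {"Sarah", "Terry"},
    "Sarah": {"Bill"},
    "Terry": {"Bill"},
}


def triadic_suggestions(graph: dict) -> set:
    suggestions = set()
    for person, friends in graph.items():
        # any two friends of the same person who aren't yet connected themselves
        for a, b in combinations(sorted(friends), 2):
            if b not in graph.get(a, set()):
                suggestions.add((a, b))
    return suggestions


print(triadic_suggestions(connections))  # {('Sarah', 'Terry')}
```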

Converting other models to graph

  • it is best to design an application graph data model and graph queries by working backwards from a set of application use cases, and using the model as a target for any subsequent data ingest from other systems.
  • however, in some cases you have to start from an existing model (e.g. relational) and convert it to a graph
    • you can apply a number of 'mechanical' transformations that yield a naive graph model.
    • this naive model provides the basis for exploration and iterative development of an application graph data model.
relational to graph
  • in general
    • each row in a table can be converted to a vertex in a property graph, or a set of statements with a common subject in an RDF graph.
    • Foreign key relations can be modelled as edges.
    • Each row in a join table with two foreign keys can be converted to an edge with edge properties.
    • If a table contains more than two foreign keys, convert each row to an intermediate node, and convert the foreign keys into edges connecting the intermediate node to other nodes.
  • to property graph (see the sketch after this list)
    • tables
      • convert the column names to property keys.
      • Use the table name as the label for each vertex.
      • Concatenate primary key values to generate each vertex ID.
    • foreign keys
      • Choose an edge direction and label that best express the domain semantics of the relationship.
      • Concatenate primary and foreign key values to generate the edge ID.
    • join tables with two foreign keys
      • Use the join table name for the edge label.
      • Concatenate foreign keys to create the edge ID.
      • Convert the remaining columns to edge properties.
  • to RDF graph
    • tables
      • Column names and values become predicates and object literals.
      • The table name becomes an object literal value of an rdf:type predicate.
      • Concatenate primary key values to generate each resource ID.
    • foreign keys
      • Represent the relationship using a triple whose subject and object are URIs identifying the resources to be connected.
      • Choose a direction and predicate value that best express the domain semantics of the relationship.
    • join tables with two foreign keys
      • you will have to introduce an intermediate node:
      • Use the join table name to type the node.
      • Concatenate foreign keys to create the subject URI.
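
A sketch of the 'mechanical' relational-to-property-graph transformation above, applied to made-up person/city tables joined by a foreign key: each row becomes a vertex labelled with its table name, column names become property keys, primary keys become vertex IDs, and the foreign key becomes a directed, labelled edge.

```python
# Naive relational -> property graph conversion sketch.
# Tables, rows, and the foreign key below are illustrative.
person_rows = [{"id": 1, "name": "Sarah", "city_id": 10}]
city_rows = [{"id": 10, "name": "Seattle"}]

vertices, edges = {}, []

# tables: one vertex per row, label = table name, vertex id from primary key,
# column names become property keys
for row in person_rows:
    vertices[f"person:{row['id']}"] = {"label": "person", "name": row["name"]}
for row in city_rows:
    vertices[f"city:{row['id']}"] = {"label": "city", "name": row["name"]}

# foreign keys: one directed edge per reference, with a label that expresses
# the domain semantics and an id concatenated from primary + foreign key
for row in person_rows:
    edges.append({
        "id": f"person:{row['id']}-city:{row['city_id']}",
        "out_v": f"person:{row['id']}",
        "in_v": f"city:{row['city_id']}",
        "label": "lives_in",
    })

print(vertices)
print(edges)
```
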
document oriented to graph
key value to graph

key value

  • optimized to store and retrieve unstructured non-relational data in key-value pairs in large volumes and in milliseconds
  • high throughput, low latency reads/writes, endless scale

in memory

  • used for read-heavy and compute intensive applications that require low latency access to semi/structured data
  • performance is improved because data is retrieved from in-memory data stores instead of waiting on databases and disk I/O.
  • cache types
    • built in: cache integrated into the database engine itself
    • application: local/in-process cache embedded in the application
    • centralized: stores data externally from the database in a remote non-relational key-value database
  • cache strategies
    • lazy: reactive. Data is put into the cache the first time it is requested
    • write through: proactive. Data is put into the cache at the same time it is put into the database.
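
A minimal sketch of the two cache strategies above (lazy/cache-aside vs write-through), using plain dicts to stand in for the cache and the backing database; all names and data are illustrative.

```python
# Cache strategy sketch: lazy (cache-aside) vs write-through.
# The dicts below stand in for a remote cache and a backing database.
cache, database = {}, {"user:1": {"name": "alice"}}


def lazy_read(key: str):
    """Lazy/reactive: populate the cache only on the first read (cache miss)."""
    if key in cache:
        return cache[key]            # cache hit
    value = database.get(key)        # miss: fall through to the database
    if value is not None:
        cache[key] = value           # fill the cache for subsequent reads
    return value


def write_through(key: str, value) -> None:
    """Proactive: write to the cache and the database at the same time."""
    database[key] = value
    cache[key] = value


print(lazy_read("user:1"))           # first read misses, then fills the cache
write_through("user:2", {"name": "bob"})
print(cache)                         # both entries now cached
```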

Vector

time series

  • collect, store, and process data as sequences ordered by time

ledger

  • complete, immutable and verifiable history of data changes

data warehouses

  • designed and used as repositories for analytical data
  • store and maintain aggregate values generated from other databases.
  • data is stored in columns instead of rows
    • and is indexed in a way that matches the way analytical queries are written.
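
A small sketch contrasting row-oriented and column-oriented layouts, to show why an analytical aggregate over one column touches less data in a warehouse-style store; the sales data is made up.

```python
# Row-oriented vs column-oriented layout sketch with made-up sales data.
rows = [
    {"order_id": 1, "region": "us", "amount": 40.0},
    {"order_id": 2, "region": "eu", "amount": 10.0},
    {"order_id": 3, "region": "us", "amount": 25.0},
]

# column-oriented: each column stored (and scannable) on its own
columns = {
    "order_id": [1, 2, 3],
    "region": ["us", "eu", "us"],
    "amount": [40.0, 10.0, 25.0],
}

# analytical query: total sales amount
# row store: every field of every row is read to reach the amounts
total_from_rows = sum(row["amount"] for row in rows)
# column store: only the single "amount" column is read
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)  # 75.0 75.0
```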

data lakes

  • a centralized repository that allows you to migrate, store, and manage all structured and unstructured data at an unlimited scale.
  • Once the data is centralized, you can extract value and gain insights from your data through analytics and machine learning.
  • data catalog: provides a query-able interface of all assets stored
  • primary goals: data cataloguing and processing
    • keeping track of all of the raw assets
    • keeping track of all of the new data assets and versions created by data transformation, data processing, and analytics.
  • use cases
    • systems that generate so much disparate data that the primary goal is simply to store all of it for later use and analysis, without any immediate purpose in mind
    • makes the data and the analytics tools available to more of your users, across more lines of business, enabling them to get the business insights they need, whenever they need them.

migration

  • homogenous migrations: migrating from/to the same db engine
  • heterogeneous migrations: migration from/to different db engines
  • back-fill: porting data from one place to another, often used in db migrations to port data between two timestamps (see the sketch after this list)
  • use cases
    • modernization: legacy dbs to modern db engines
    • platform switching
    • replication: frequently copying data to share with other users or environments
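
A sketch of a back-fill between two timestamps as mentioned above: copy only the source records whose timestamp falls inside the window into the target, upserting by primary key. The record shape and the stores are illustrative.

```python
# Back-fill sketch: port records whose timestamp falls between two bounds
# from a source store to a target store. Data and stores are illustrative.
from datetime import datetime, timezone

source = [
    {"id": 1, "updated_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 2, 10, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 3, 20, tzinfo=timezone.utc)},
]
target: dict = {}


def backfill(start: datetime, end: datetime) -> int:
    copied = 0
    for record in source:
        if start <= record["updated_at"] < end:   # only the window being back-filled
            target[record["id"]] = record          # idempotent upsert by primary key
            copied += 1
    return copied


n = backfill(datetime(2024, 1, 1, tzinfo=timezone.utc),
             datetime(2024, 3, 1, tzinfo=timezone.utc))
print(f"back-filled {n} records", sorted(target))
```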

migration planning

  • not all steps are applicable to each migration

envisioning and assessment

  • understand the scope of work required based on your database schema, data volumes, data types, resources, and stakeholders.
    • assess your current environment, evaluate any known risks, and create a business case for the migration.
  • determine availability of integrated tools that support the project plan and automate the migration as much as possible.

db schema conversion

  • convert your database objects from the source engine to the target engine.
  • includes converting your tables, indexes, constraints, foreign keys, triggers, and stored procedures.
    • but NOT migrating the actual data records in your database.
  • for features not supported by the target db
    • you will need to recode the objects using design patterns based on the target engine’s capabilities.

application conversion and remediation

  • process of porting application code, written in languages such as Java or C, to your new target database.
  • arguably the most complex aspect of a migration process.
    • applications running against your database may have been developed long ago by people who are no longer available.
    • These applications are often mission critical and may be difficult to rewrite without extensive research and testing.

scripts conversion

  • looks at batch scripts used for:
    • extract, transform, and load (ETL) processes;
    • database maintenance;
    • disaster recovery;
    • etc
  • scripts may not directly relate to the applications using the database, but require analysis to ensure that they work on the new database engine.

integration with third party services

  • when you have to support databases for third-party applications, or have third-party applications access their database.
  • applications like third-party business intelligence and ETL tools
  • identify these third-party applications, and validate that they continue to work post-migration
  • process may involve upgrading the third-party tools, or changing adapters or APIs to connect to your new databases.
  • Other third-party applications might be tightly coupled to a third-party database.
    • consider whether you will maintain a legacy database for these applications, or whether you want to migrate from them.

data migration

  • the process of moving data records from the source to the target.
  • challenging if you are dealing with large data volumes and have to keep the source and target systems in sync until you can “cut over” your applications to the target system.
  • if you’re changing the type of your target database, then you will have to translate most (if not all) data values to conform to the target system requirements.

functional system tests

  • confirm that the migration went as planned
  • ensure that all applications interacting with the database perform as before, from a functional perspective.
  • typically involves business stakeholders and analysts who understand user-facing applications and can drive test cases that exercise system boundaries.

performance testing

  • confirm that the migrated system meets its performance requirements
  • applications have specified performance criteria to meet as part of functional testing
  • sometimes occurs in parallel with functional testing. This activity involves both business stakeholders and technical personnel.
  • When a performance issue is discovered, each system level is checked for bottlenecks:
    • the user-facing application,
    • the SQL statements prepared by the application,
    • the database engine
    • associated storage layer

integration and deployment

  • process of cutting over to your new database system
  • typically involves a series of steps detailing how applications will be cut over to the new database system.
  • Depending on business needs, the deployment may require:
    • minimized downtime.
    • specified time window
    • be performed in phases, where individual applications are cut over one by one.
  • it is important to plan for what to do in case you need to roll back changes
    • test this plan in your preproduction environment as well, so the team is ready for any issues that occur during the rollout.

training and knowledge transfer

  • Training is a critical aspect in deploying any new system.
  • ensure that db/system/application changes are documented, and share that knowledge with team members who support and maintain the application.
  • the move to new technologies also changes the relative status of people within the team.
    • introducing undercurrents of organizational politics into the process.
  • develop features to manage the new environment, such as monitoring and paging.
    • Many of the features you used previously may not be available for your new database.

documentation and version control

  • documentation is one of the most important tasks prior to putting a system into production.
  • document all changes that have been made to the system, and how the new system operates.
  • this is a good opportunity to automate all manual steps.
    • consider treating all of these artifacts as code.

post-production support

  • Once the migrated application is running, you will need at least some support
  • automate backups and other support tasks, but it is a good idea to plan for any support your application might need
  • Ensure that automated tasks are occasionally checked, and that you have personnel for tasks that are not automated.

copypasta from somewhere else

  • database technology predates the web, dating back to the 1960s
  • SQL databases
    • are relational, storing data in one/more tables that relate to each other in formally prescribed ways
    • DDL: data definition language (see the sketch after this list)
      • any statement using CREATE, DROP, or ALTER to create, drop, and modify table structures
    • DML: data manipulation language
      • any statement using SELECT, INSERT, UPDATE, and DELETE for CRUDing records
  • NoSQL databases
    • sacrifice the strict data integrity requirements of SQL databases to achieve greater scalability
    • often schemaless, allowing you to add fields to new records without having to upgrade any data structures
  • distributed caches
    • in-memory databases that load data from disk and store it in cache
    • caching refers to the process of storing a copy of data in an easily retrievable form to speed up responding to requests for that data
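
A quick sketch of the DDL vs DML split above, using Python's standard-library sqlite3 module; the users table and its rows are made up.

```python
# DDL vs DML sketch using the standard-library sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: statements that define/alter table structures
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# DML: statements that create, read, update, and delete records
conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("alice", "a@example.com"))
conn.execute("UPDATE users SET email = ? WHERE name = ?", ("alice@example.com", "alice"))
rows = conn.execute("SELECT id, name, email FROM users").fetchall()
conn.execute("DELETE FROM users WHERE name = ?", ("alice",))

print(rows)  # [(1, 'alice', 'alice@example.com')]
conn.close()
```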

data fabric

data mesh

  • enables collection, integration and analysis of data from disparate systems concurrently in a single location

data lake

  • allows storing yuuuge amounts of raw, structured, and/or unstructured data in a single repository enabling comprehensive analysis from a single location
    • i.e. you push any and everything into a data lake, whether or not the data has a purpose

data warehouse

  • allows storing yuuuge amounts of structured, filtered data that has already been processed for a specific purpose (like data already in use by app/biz)

    • i.e. you push filtered data into a warehouse, for later analysis

Databases

Relational

  • relational, structured
  • best for online analytical processing

NewSql

nosql

  • hierarchical, unstructured
  • best for online transaction processing at scale
  • scales out
  • main distinctions are data model driven: e.g. rdbms vs nosql vs wide-column etc
  • secondary distinction is according to the CAP theorem: usually a choice between C and A, as all are susceptible to P (failures)
    • consistency: will every read receive the most recent write
    • availability: will every request receive a non-error response (but doesn't have to be the most recent write)
    • partition tolerance: will the system operate in the face of network failures/msg loss
key/val
  • data model
    • key value pairs: each key has only one value
    • fast queries, no need for a query language
document store
  • data model
    • data stored in documents of tagged elements (like a row in rdbms)
column (oriented)
  • long list
  • data model
  • characteristics
    • whereas SQL dbs store record by record (row by row), column oriented dbs store column by column
    • improves storage and retrieval performance
wide column store
  • long list
  • data model
    • store data in columns
    • related columns are grouped into tables (column families)
  • characteristics
    • supports large numbers of dynamic columns
    • aka 2 dimensional key value stores
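
A tiny sketch of the "2 dimensional key value store" view of a wide-column table just above: a row key maps to its own, possibly different, set of columns. The column family and data are illustrative.

```python
# Wide-column sketch: a 2-dimensional key-value store where each row key
# maps to its own (possibly different) set of columns. Data is illustrative.
column_family = {
    "user#1": {"name": "alice", "email": "a@example.com"},
    "user#2": {"name": "bob", "last_login": "2024-05-01"},   # different columns per row
}


def get_cell(row_key: str, column: str):
    # lookup is by (row key, column name) - the two "dimensions"
    return column_family.get(row_key, {}).get(column)


print(get_cell("user#1", "email"))       # a@example.com
print(get_cell("user#2", "email"))       # None - that column simply doesn't exist here
```
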
graph
  • long list
  • data model
    • use nodes & edges to store data
multi-model
  • go for native multi-model dbs, and not ones enhanced with extensions/plugins/etc (timescale is dope tho)