Fundamentals of Database Engineering

Key concepts and internals of database systems, including ACID properties, indexing, partitioning, and replication.

Acid

What is a transaction? 🔁

Life cycle of a transaction

mermaid
flowchart TD
    start[DB Starting State] -->
    begin[BEGIN Transaction] --> work[One unit of work]
    work --> commit[COMMIT Transaction]
    work --> rollback[ROLLBACK Transaction]
    rollback --> start
  • Begin,
    Tell the DB that you are starting a transaction.

  • Commit,
    Persist the changes to the database.

  • Rollback,
    Undo the changes to the database.
    Rollback due to db crashes,
    commit crashes.

Properties of a transaction

  • A transaction is a collection of queries
    that is run as a single unit of work.

  • A transaction is used to modify / update / add data.

  • Sometimes a transaction can be read-only.
    use case:
    getting consistent snapshots of data before a transaction starts.

  • Some transitions are user defined,
    but most updates are implicitly wrapped in a transaction.

Example

account_id
balance

1

1000

2

500

sql
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;
python
balance - 100 > 0;
sql
UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;
sql
COMMIT TRANSACTION;

Atomicity

  • All queries in a transaction must succeed,
    or the transaction must be rolled back.

  • Even if 1/100 query fails,
    the entire transaction must be rolled back.

  • It the database went down prior to a commit of a transaction,
    the transaction must be rolled back.

mermaid
flowchart TD
    start[DB Starting Snapshot] -->
    begin[BEGIN Transaction] --> update_1[Queries 5/100]
    update_1 --> db_crash{DB CRASH}
    db_crash -->|yes| cleanup[DB Cleanup / Rollback]
    db_crash -->|no| COMMIT[COMMIT Transaction]
    cleanup --> start

Isolation

No title

Can my inflight transaction see the changes made by other transactions?

Read phenomena's,

  • Dirty reads
    A read of uncommitted data,
    the data can be rolled back / committed.
    The data needs to be fully flushed to disk (committed to disk)

    mermaid
    flowchart TD
      initial_state[DB Starting State] -->
      transaction_1[BEGIN transaction_1] --> update_1[UPDATE Data]
      initial_state ---> transaction_2[BEGIN transaction_2] --> |dirty read| read_1[SELECT Data]
      update_1 --> |roolback| initial_state
  • Non-repeatable reads
    Two different reads of the same data, return different results.

    mermaid
    flowchart TD
      initial_state[DB Starting State] -->
      transaction_1[BEGIN transaction_1] ---> update_1[UPDATE Data]  -->
      commit[commit]
      transaction_2[BEGIN transaction_2] --> read_1[SELECT Data] --->
      |same query diff result| read_2[SELECT Data]
      initial_state --> transaction_2

    Avoid it by starting a repeatable read transaction,

    postgres
    begin transaction isolation level repeatable read;
  • Phantom reads
    A read of a set of rows,
    where the set of rows changes between reads.

    mermaid
    flowchart TD
      initial_state[DB Starting State] -->
      transaction_1[BEGIN transaction_1] --> read_1[SELECT Data]
      transaction_2[BEGIN transaction_2] --> update_1[INSERT Data] --> commit
      read_1 ---> |same query has more results| read_2[SELECT Data]
      initial_state --> transaction_2

    On postgres you can use the following to prevent phantom reads,

    sql
    begin transaction isolation level repeatable read;
    # or
    begin transaction isolation level serializable;

    mysql,

    sql
    begin transaction isolation level serializable;
  • Lost updates
    Two transactions update the same data,
    and one overwrites the other.
    The read of the first transaction is lost.

    mermaid
    flowchart TD
      initial_state[SUM 10] -->
      transaction_1[BEGIN transaction_1] --> read_1[+20] --> read_12[30] --> commit
    
      initial_state -->
      transaction_2[BEGIN transaction_2] --> update_1[10] --> read_2[20] --->|lost update -10| commit

Isolation levels

  • Read uncommitted
    Dirty reads are allowed.

  • Read committed
    Dirty reads are not allowed.
    Non-repeatable reads are allowed,
    committed data will be picked up mid transaction

  • Repeatable reads
    Read query results in same transactions are always consistent.

  • Snapshot
    Only changes committed up to the start of the transaction are visible.

  • Serializable
    All transactions are executed in order.

*Each DBMS system implements isolation levels differently

isolation level
dirty reads
non-repeatable reads
phantom reads
lost updates

read uncommitted

read committed

repeatable read

snapshot

serializable

Serialisable is slower than snapshots

Approaches to isolation

  • Pessimist approach
    Data (rows / tables / page) are locked until they are committed.
    Expensive and slow ( transaction wait for locks to be released ).
    Needs to track locked data.

  • Optimestic approach
    No locks,
    If there is a conflict, the transaction fails.

  • Repeatable read locks
    Locks the data that is read,
    locks are released when the transaction is committed.

  • Searliability
    Optimistic concurrency control, no locks until conflicts during commit.
    Pessimistic concurrency control, locks are used to prevent conflicts.

Consistency

  • Consistency in data is user defined (DBA team)
    Referential integrity (through foreign keys) data is inconsistent across tables.
    Atomicity, isolation affects consistency.

    Data consistency between tables,

    id
    blob
    likes

    1

    a

    1

    2

    b

    2

    users
    id

    Mate

    1

    Mark

    1

    Jack

    2

  • Consistency in reads,

    mermaid
    flowchart LR
     data --> |update| db[(DB)] --> |read| data

    If a transaction update a row,
    will the change be immediately visible during a read?
    .

    Both SQL and NoSQL databases suffer from this problem.

  • Eventual consistency

    mermaid
    flowchart LR
     data_v1 --> |update| db[(DB)] -->
     |read| data_v2
     db --> |read| data_vn
     db --> |read| data_v1

    You will eventually get the latest version of the data.

    mermaid
    flowchart LR
      master_db[(Master - X)]
      slave_1_db[(Follower - Z)]
      slave_2_db[(Follower - Z)]
      data -->|updated to x| master_db
      master_db --> slave_1_db
      master_db --> slave_2_db
      slave_1_db -->|Inconsistent data| read
    mermaid
    flowchart LR
      master_db[(Master - X)]
      slave_1_db[(Follower - X)]
      slave_2_db[(Follower - X)]
      data --> master_db
      master_db -->|replicate| slave_1_db
      master_db -->|replicate| slave_2_db
      slave_1_db -->|Consistent data| read

    When we introduce caching or we introduce horizontal scaling,
    you are eventually consistent.

    If you do not have atomicity,
    or have inconsistent data,
    you do not have consistency at all.

Durability

  • Process of persisting data to disk (non-volatile storage).

  • Durability slows write down,
    write to disk is slower than write to memory

  • Durability techniques

    • WAL - Write ahead log / AOF - Append only file

      Any change to the database is written to a log file, the log file is flushed to disk.

      mermaid
      flowchart LR
       data --> |update| db[(DB)] --> |write| log[(Log)] --> |flush| disk[(Disk)]
      mermaid
      flowchart LR
       db[(DB)]
       log[(Log)]
       disk[(Disk)]
       log --> db
       disk --> db
       db --> reconstructed_db[(DB reconstructed)]

      Writing all the changes to disk is expensive (indexes, data, files, rows...)
      The log file is a compressed version of the data stored in the database.

      mermaid
      flowchart LR
          db[(DB)] --> |flush request| os -->
          os_cache[(OS Cache)] -->
          |batched flushes| disk[(Disk)]
      
          db --> |fSync| disk

      fSync forces the OS to flush the data to disk.
      This prevents data loss in case of OS crash.

    • Async snapshots

      mermaid
      flowchart LR
       data --> |update| db[(DB)] --> |async| snapshot[(Snapshot)] --> |flush| disk[(Disk)]
       db --> |sync| snapshot[(Snapshot)]

      Asynchronously write the data to disk

Serializability vs Repeatable reads

Repeatable reads

mermaid
flowchart LR
    transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit
    transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit -->
    result[a, a, b, b]

both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].

When the transactions are committed only modified rows are updated,
[a, a, b, b]

Serializability

mermaid
flowchart LR
    transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit --> result[a, a, b, b]
    transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit_error[Could not serialize transaction]

Database Internals

How tables and indexes are stored on disk?

Storage concepts

  • Table representation:

    id
    name
    age

    1

    Mate

    30

    bits and bytes of data,
    stored on disk

  • row_id
    in mysql, the row_id is the primary key.
    in postgres, the row_id is a hidden column, generated by the database.

  • Page
    Fixed size rows of table, stored in a memory location.
    Databases **reads and writes data in pages **, it cant read a single row.
    Each page has a fixed size (8kb in postgres, 16kb in mySQL),

  • IO
    Operations that read and write data from disk.
    IO can fetch one or more pages at a time, depending on the disk partitions and other factors.
    IO's are very expensive, you want to minimize the number of IO's you do.
    Postgres relies on OS cache to minimize the number of IO's it does.

  • Heap data structure
    Heap data structure is used to store the pages on disk.
    Traversing a heap is expensive, indexes tell us identify sections of the heap that we are interested in.

  • Index
    You can create indexes on one or more columns.
    Index is also stored on disk as pages.
    An index is an another data structure that stores the references to the pages on heap.
    An index tells you where to find the page on the heap, instead of traversing the heap.
    B-Tree is the most common index data structure.
    Indexes need to be small to reduce the number of IO's.

mermaid
flowchart TD
    heap[(Heap)]
    index[(Index)]
    indexed_query --> |select|index --> |page,row|heap --> |row|indexed_query

In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.

Row vs Column oriented databases

Row vise

row_id
name
age
  • Tables are stored as rows on disk.

  • A single IO fetches (1 page) multiple rows with their columns.

  • More IO's are required to fetch a single row.

  • Once you find the row, you have all the columns without extra computation.

  • A row can span multiple pages, row size is limited by the page size.

  • You need to go through a lot of pages to find all the values for a column. (IO intensive)

mermaid
flowchart TD
    page_1[1-200,name,age]
    page_2[200-400,name,age]
    page_3[400-500,name,age]
    sum[sum age] --> |fetch| page_1 --> |fetch| page_2 --> |fetch| page_3

Column vise

col_id

name

age

  • Tables are stored as columns on disk.

  • A single IO can have one or more columns (depends on row size).

  • Less IO's are required to fetch more values in an column.

mermaid
flowchart TD
    page_1[age: 1:row_id-500:row_id]
    sum[sum age] --> |fetch| page_1
  • To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)

Row based
Column based

simple

complex

writes are fast

slow

optimal for transactions

optimal for ETL's (one column-row's)

low compression

high

aggregate queries slow

fast

efficient multi column queries

less efficient

Joining row based tables and column based tables are messy AF

Primary key vs Secondary key

There is no order maintained by default in the table,

Clustering, order enforced by the database on a table,
through a primary key or secondary key.

Postgres Clustered table / heap organised table, mySQL Index ordered table

Primary key is order inforced on the table itself,
Secondary key is order inforced through an index table on the actual table.

Adding rows in a secondary key enforced table is slow,
since you need to update the secondary index table.

Buffer pool

mermaid
flowchart LR
  heap[(Heap)]
  index[(Index)]
  buffer_pool[(Buffer pool)]
  wal[(WAL)]
  read_query --> |select|index --> |select|heap --> |page,row|buffer_pool --> |row|wal
  update_query --> buffer_pool --> wal

To reduce the number of IO's,
all row reads are stored in a buffer pool temporarily,
this allows for subsequent reads / writes to be served from the buffer pool,
thus reducing the number of IO's.

Database Indexing

Creating table with random vaues.

sql
insert into temp (t) select random()*100 from generate_series(0,100000)

Explaining the explain command

Returns statistics about the query plan,

sql
explain select * from temp where t > 50;
Seq Scan on temp  (cost=0.00..2087.00 rows=100000 width=4)
  • Scan types ( Seq vs Index )

  • Cost Time to start executing the query,
    Time to finish executing the query

  • Rows
    Number of rows return

  • Width
    Avg size of returned rows in bytes

Creating an index

sql
create index temp_t_idx on temp(t);

Without index scan,

text
Seq Scan on temp  (cost=0.00..1443.01 rows=100001 width=4)

With index scan,

text
Index Only Scan using temp_temp_idx on temp  (cost=0.29..1258.40 rows=59549 width=4)

If info is available in the index, the database will not do a heap lookup.
Dont ask for more info than you need, ask for stuff in the index if possible.

With an index search,

  • Time to prepare increases 0.00 vs 0.29

  • Time to execute decreases

  • Number of rows returned visited decreases

Seq table scan vs Index scan vs Bitmap index scan (postgres)

No title

Even when a index is present, the database might choose to do a sequential scan,
or a bitmap index scan, based on the no of rows that needs to be scanned.

Sequential

If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.

sql
explain select id from t_records where t > 10;

 Seq Scan on t_records  (cost=0.00..15233.01 rows=898039 width=4)
   Filter: (t > 10)
 (2 rows)
mermaid
flowchart LR
  index[(Index)]
  heap[(Heap)]
  select --> heap --> resolve

Index

sql
explain select id from t_records where t > 5000;

 Index Scan using t_records_t_idx on t_records  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (t > 5000)
 (2 rows)
mermaid
flowchart LR
  index[(Index)]
  heap[(Heap)]
  select -->| Index Scan | index --> heap --> resolve

Bitmap

If the no of rows to be scanned is too high, and the

mermaid
flowchart LR
  index[(Index)]
  bitmap[(Bitmap)]
  heap[(Heap)]
  recheck[Condition Recheck]
  index --> bitmap
  select --> bitmap --> recheck --> heap

bitmap is a bit array of the index values, pointing to the page on heap.

python
pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the condition

Postgres would only scan the pages that have rows that satisfy the condition.

sql
explain select id from t_records where t > 4000;

 Bitmap Heap Scan on t_records  (cost=1942.75..8103.70 rows=174236 width=4)
   Recheck Cond: (t > 4000)
   ->  Bitmap Index Scan on t_records_t_idx  (cost=0.00..1899.19 rows=174236 width=0)
         Index Cond: (t > 4000)
 (4 rows)

Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.

Bitmap And / Or

mermaid
flowchart LR
  index[(Index)]
  bitmap[(Bitmap)]
  heap[(Heap)]
  recheck[Condition Recheck]
  index --> bitmap
  where_clause_1 --> bitmap --> recheck --> heap
  where_clause_2 --> bitmap
python
condition_1 = [0, 0, 1, 0, 0, 1, ...]
condition_2 = [0, 0, 1, 0, 1, 0, ...]

bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditions
sql
explain select id from t_records where t > 5000 and t < 1000;

 Bitmap Heap Scan on t_records  (cost=62.55..4302.44 rows=4500 width=4)
   Recheck Cond: ((t > 5000) AND (t < 1000))
   ->  Bitmap Index Scan on t_records_t_idx  (cost=0.00..61.42 rows=4500 width=0)
         Index Cond: ((t > 5000) AND (t < 1000))
 (4 rows)
sql
explain select id from t_records where t > 5000 and t < 1000;

Vaccum

The vaccum command is a clean up function, that's auto run by postgres, to reclaim disk space.

sql
vacuum t_records;

The vacuum command is recommended to be run after a large number of CRED operations.

Analyze

Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.

Non-Key Indexes / Covering Indexes

Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,

sql
create index a_idx on temp(a) include (b,c);
mermaid
flowchart LR
  index[(Index)]
  select -->| Index only Scan | index --> resolve

Adding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).

When searching for b / c, postgres cannot use a_idx, and must go to the heap.

Composite / Combined Indexes

sql
create index a_b_c_combined_idx on temp(a,b,c);

Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.

a
b
c

1

2

3

1

2

4

1

3

3

1

3

4

2

2

3

Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.

Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.

sql
select * from temp where a > 1 and b < 3 and c > 5;
mermaid
flowchart LR
  index[(a,b,c)]
  select --> |index only scan|index --> resolve

When should I use combined indexes?

When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map intersections between data sets.

When should I not use combined indexes?

  • When not including the first column in the where clause. (just use a covering index)

  • When not using the second and up columns in the where clause, and just selecting them.
    (consider using a covering index, this avoids creating b-trees that are not used)

Clustered Indexes vs Non-Clustered Indexes

Clustered Indexes
Non-Clustered Indexes

Data is ordered

Data is not ordered

Reads are faster

Slower

Edits are Slower

Faster

In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.

Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.

Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id

How Database Optimizers Decide to Use Indexes

  • If we get too many rows from the index, it will be faster to just scan the heap.

  • If two indexes are used, one index returns more rows than the other,
    the database will use the index that returns the least rows.

Concurrent Index Creation's

  • Waits for all transactions to finish, before creating the index.

  • Does not lock the table read's or writes.

  • Takes longer time to execute.

  • Can crash if inserting duplicate values, while creating the index.

sql
create index concurrently a_idx on temp(a);

Bloom Filters

mermaid
flowchart LR
  client[Client]
  server[Server]
  bloom_filter[(Bloom Filter)]
  db[(DB)]
  client --> |request| server <--> |get / set| bloom_filter
  server --> |request| db

A bloom filter is a probabilistic data structure, that can be used to check if an element is in a set.

py
# bloom filter
filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
filter_posstion = hash("req") % len(filter)

if filter[filter_posstion] == 1:
  # maybe in set
  return request_db()

# set bit
filter[filter_posstion] = 1

Working with Billion row tables

Why do we need a billion row table? We have many ways to solve a problem.

  • Avoid having a billion row.

If thats not possible,

  • Can you index it?

  • Can you partition it in the same disk.

  • Can you shard horizontally on different disks, on multiple DB's.

  • Map Reduce (not suitabale for transactions).

B and B+ Trees

Full Table Scan

  • Reading the entire table, to find the rows that satisfy the query.

  • Reading pages from disk, that are not in memory.

  • Requires a lot of I/O to read the pages. (8KB - 16KB per page)

  • DBMS's implement multi threading to read pages in parallel.

B-Tree (Original)

  • B trees are not binary trees (each node can have more than 2 children),
    B trees stands for balanced tree.(distance from root to any leaf node is the same)

  • Each node in a B tree contains multiple keys that form a page.

mermaid
graph TD
  root[2:702, 4:704]
  internal_1[1:701]
  internal_2[3:703]
  internal_3[5:705]
  root --> |page|internal_1
  root --> |page|internal_2
  root --> |page|internal_3

The above example has 2 keys, and 3 children.

  • Each key in the node points to a value on the heap (tuple),
    or the primary key.

  • Max no of degrees, no of keys per node.

  • The parent must have (n - 1) keys,
    where n is the number of children.

  • The nodes are ordered from left to right.

  • You cannot see this in the diagram,
    each key in the node has a before and after pointer.

  • Before pointer is used to find the next node,
    that has a value less than the following key.

Limitations of B-Trees

  • Because each node stores both keys and values,
    the number of keys per node is limited.

  • The values are not used until the node is found.

  • This increases the number of I/O's required to find the node
    (reading each node / page is an IO).

  • Range queries are slow, since we need to traverse for all the keys in the range.

B+ Tree

  • Nodes only store keys in internal and root nodes.

  • The values are only stored in the leaf nodes.

  • This allows us to store more keys per node.

  • Leaf nodes are linked, eacn node has a pointer to the next node linked list.

  • Range queries are much faster.

mermaid
graph TD
    root[5] --> root_1[3]
    root --> root_2[7 9]
    root_1 --> root_3[2]
    root_1 --> root_4[4]
    root_2 --> root_5[6]
    root_2 --> root_6[8]
    root_2 --> root_7[10]
    root_3 --> leaf_1[1:row_id]
    root_3 --> leaf_2[2]
    root_4 --> leaf_3[3]
    root_4 --> leaf_4[4]
    root_5 --> leaf_5[5]
    root_5 --> leaf_6[6]
    root_6 --> leaf_7[7]
    root_6 --> leaf_8[8]
    root_7 --> leaf_9[9]
    root_7 --> leaf_10[10,11]
    subgraph linked list
      leaf_1
      leaf_2
      leaf_3
      leaf_4
      leaf_5
      leaf_6
      leaf_7
      leaf_8
      leaf_9
      leaf_10
    end

Each node usually store 100's of thousands of keys, above diagram is just an example.

  • Leaf nodes have duplicate keys.

  • Leaf pointers are cheap to traverse.

  • 1 Node = 1 DBMS Page

  • Internal nodes fit in memory.

  • Leaf nodes can be stored in the heap.

  • Most DBMS uses B+ trees.

Storage cost in Postgres vs mySQL

  • Postgres points to the heap. mySQL points to the primary key.

  • Write amplification is higher in Postgres. (any update to the heap, needs to be replicated in the index).

  • If primary keys are large (UUID), inserts in mySQL will be slower.
    (Since UUID's don't have a pattern.)

  • Leaf nodes in mySQL contains all the data (clustered index).

Database Partitioning

Partitioning types.

  • Range partitioning By date, customer id, etc.

  • By List By location, zip codes, etc.

  • Hash partitioning By hash of the primary key, etc. (consistent hashing)

Partitioning vs Sharding

  • In partitioning, the data is stored in the same DB. The client is agnostic.

  • Each shard sits in a seperate DB, the client needs to know which shard to query.

  • In partitioning, the table name changes.

  • In sharding, the table name is the same, but the DB ip changes.

Creating a partition

sql
create table grades (id serial not null, g not null) partition by range (g);

create table g0035 (like grades including indexes);
create table g3570 (like grades including indexes);
create table g7080 (like grades including indexes);
create table g80100 (like grades including indexes);

alter table attach partition g0035 for values from (0) to (35);
alter table attach partition g3570 for values from (35) to (70);
alter table attach partition g7080 for values from (70) to (80);
alter table attach partition g80100 for values from (80) to (100);
sql
insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
create index on grades (g);

or

sql
insert into grades (g) select g from another_table;
sql
select max(g) from g0035; -- 34
select max(g) from g3570; -- 69
select max(g) from g7080; -- 79
select max(g) from g80100; -- 99

Make sure ENABLE_PARTITION_PRUNING is set to on in postgresql.conf.

use the show command to check.

sql
show enable_partition_pruning;
sql
set enable_partition_pruning = on;

Checking partition size

sql
select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;

 pg_relation_size |                    relname
------------------+-----------------------------------------------
        126935040 | g3570
        126754816 | g0035
         72572928 | g7090
         36241408 | g90100
         24305664 | g3570_g_idx
         24272896 | g0035_g_idx
         13910016 | g7090_g_idx
          6955008 | g90100_g_idx

Pros of partitioning

  • Increases query performanace, when the query is limited to a partition.

  • DB decides to use Seq scan's instead of Scattered index scan (multi index search).

  • Faster bulk loads (attach / detach partitions).

  • Archiving old data to a cheaper storage (HDD drive).

Cons of partitioning

  • Row updates are a lot slower (the row needs to be moved to a different partition).

  • If Query is not limited to a partition, it will be slower.

  • Schema changes are harder (need to change all partitions), DBMS dependent.

Sharding

Pros of sharding

  • Horizontal scaling (Data / Memory / CPU).

  • Security (Access control).

  • Smaller Indexes.

Cons of sharding

  • Complex client side code. (need to know which shard to query)

  • Atomic transactions across shards are impossible.

  • Rollbacks are complicated AF.

  • Schema changes are harder (need to change all shards).

  • Joins are harder (need to query all shards).

  • Sharding key should be known.

When should we shard?

  • Sharding should be your last resort.

  • Partition first, then shard.

  • TCP connections are reaching the limit?
    Try replication first.
    Writes only to primary server.

  • Two way replication, two availabilty zones (multiple primary databases).

  • You cannot maintain ACID complaiance accross shards. (eventual consistency).

  • Sharding is an overkill for most applications.

  • Vitesse is a good sharding library for go.

Concurrency control

Exclusive lock's vs Shared lock's

Exclusive lock's

  • Other queries cannot read or write to the locked row, until the lock is released.

  • Concurrent attemps resolve with an error.

Shared lock's

  • Only updates are locked.

  • Reads are allowed.

You cannot have both shared and exclusive locks on the same row.

Account
Balance

Alice

100

Bob

200

Charlie

300

mermaid
gantt
    title Timeline
    section Alice
        Exclusive lock  :a1, 2020-01-10, 42h
    section Bob
        Shared lock     :b1, 2020-01-11, 24h
    section Charlie to Bob
        Shared lock on Bob failed  :crit, c1, 2020-01-11, 1h
        Update bob                 :cirt c2, 2020-01-12, 24h

Dead locks

mermaid
graph LR
  A[Transaction 1]
  B[T2]
  l1{{Release lock 1 event}}
  l2{{R2}}
  A --> l1 --> B
  B --> l2 --> A
mermaid
flowchart TD
    t1[Transaction 1]
    t2[Transaction 2]
    a1[Insert 10]
    a2[Insert 20]
    b1[Insert 20]
    b2[Insert 10]
    t1 --> a1 --> a2 --> wait --> commit
    t2 --> b1 --> b2 --> rollback[rollback due to deadlock]
  • Most DBMS's have a deadlock detection mechanism,
    They will fail one of the transactions.

Two phase locking

The double booking phenomenon.

mermaid
graph TD
    t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
    t2[Transaction 2] --> s2[id=10, isBooked=false] --> u2[id=10, isBooked=true, name=Steve] --> commit -->
    status[id=10, isBooked=true, name=Steve]

Both transactions read the same data, and updated it.

Solution, create a two phase locking mechanism.

mermaid
graph LR
    t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
    t2[Transaction 2] --> hang --> s2[id=10, isBooked=true]
sql
begin transaction;
select * from seats where id = 10 for update; -- creates an exclusive lock
update seats set isBooked = true, name = 'Bob' where id = 10;
commit;
sql
begin transaction;
select * from seats where id = 10 for update;
-- hang
-- updated data is returned

for UPDATE creates an exclusive lock on the row indefinitely
set lock_timeout = 1000; will timeout the lock after 1 second.

Alternate solution (not ideal)

mermaid
flowchart LR
    t1[Read committed transaction 1] --> s1[Update when clause] ---> commit
    t2[Read committed transaction 2] --> s2[Update when clause] --> hang --> rollback
text
update seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;

This solution is imperfect, You are at the mercy of the DBMS implementation.
Postgres refreshes the row after the lock is released, the same cannot be said for other DBMS's.
Always rely on explicit locks instead of relying on implicit locks.

mermaid
flowchart TD
    t1[update clause] --> |reading row on heap for lock info|heap[(Heap)] --> wait{{Wait for lock to be released}}
    wait --> |refresh row|heap
    wait --> resolve

SQL pagination with offset is very slow

This is slow, because, DBMS needs to iterate through the first 10,00,010 rows,
and then filter out the last 10 rows.

sql
select * from users order by id limit 10 offset 1000000;

instead, make user query for more data using the last-seen-id.

sql
select * from users where id > 1000000 order by id limit 10;

Connection pooling.

  • For every connection, the DBMS needs to allocate resources, this takes time.

  • If we create a connection for every request, the latency and throughput will be very bad.

  • If the no of connections increase, the DBMS will overload and crash.

  • Connection pooling is a way to reuse connections.

  • The connections are kept alive, and reused.

  • Each connection is leased to a request, and returned to the pool after the request is complete.

mermaid
graph LR
    A[Request 1]
    B[Request 2]
    C((Connection pool))
    C --> |lease|A
    A --> |release|C
    C --> |lease|B
    B --> |release|C

    C --> D[(DB ip:port)]

Default connection pool size is 10, in postgres.

mermaid
graph LR
  db[(DB ip_address:port)]
  instance_1((instance 1))
  instance_2((instance 2))
  instance_n((instance n))
  lb[Load balancer]
  request --> lb --> instance_1
  lb --> instance_2
  lb --> instance_n
  instance_1 --> |10 connection pool|db
  instance_2 --> |10 connection pool|db
  instance_n --> |10 connection pool|db
  • Each instance keeps alive 10 connections to the DB.

  • Each instance can send upto 10 concurrent requests to the DB.

  • Each function invocation leases a connection from the pool.

  • Once the function is executed, the connection is returned to the pool.

js
const { Pool } = require('pg');
const pool = new Pool({
	max: 20,
	idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
	connectionTimeoutMillis: 2000, // timeout after 2 seconds
});

await pool.query('SELECT something FROM sometable');

We cannot begin a transaction on a pooled connection, transactions need a dedicated connection.

js
// asking the pool to give us a dedicated connection
const client = await pool.connect();
try {
	await client.query('BEGIN');
	await client.query('COMMIT');
} catch (e) {
	await client.query('ROLLBACK');
} finally {
	// release the connection to the pool
	client.release();
}

Database replication

Master slave replication

mermaid
graph TD
  query[C-ED queries] --> master[(Master)]
  master --> |replication| slave_1[(replica 1)]
  master --> |replication| slave_2[(replica 2)]
  master --> |replication| slave_n[(replica n)]
  read_query[R queries] -.-> slave_1
  read_query -.-> slave_2
  read_query -.-> slave_n

Multi-Master replication

mermaid
graph LR
  query[C-ED queries]
  master_1[(Master 1)]
  master_2[(Master 2)]
  slave_1[(replica 1)]
  slave_2[(replica 2)]
  slave_3[(replica 3)]

  query --> master_1
  query --> master_2
  master_1 --> |replication| master_2
  master_2 --> |replication| slave_1
  master_2 --> |replication| slave_2
  master_1 --> |replication| slave_3

  read_query[R queries] -.-> slave_1
  read_query -.-> slave_2
  read_query -.-> slave_3
  • Difficult to implement.

  • Need to deal with conflicts.

  • Not usually preferred over master-slave replication.

Synchronous replication

mermaid
graph LR
  query[C-ED queries] --> master[(Master)]
  master[(Master)] --> |replication| slave_1[(replica 1)]
  master --> |replication| slave_2[(replica 2)]
  master --> |replication| slave_n[(replica 3)]
  slave_1 -.-> |ack| master
  slave_2 -.-> |ack| master
  master -.-> |resolve| query
  • Replication must be successful before the query is resolved.

  • If the replication fails, the query fails.

  • Can specify the number of replicas that must acknowledge the replication.
    (first 2 / first 1 / any)

Asynchronous replication

mermaid
graph LR
  query[C-ED queries] --> master[(Master)]
  master[(Master)] -..-> |Async replication| slave_1[(replica 1)]
  master -..-> |Async replication| slave_2[(replica 2)]
  master -..-> |Async replication| slave_3[(replica 3)]
  master --> |resolve| query

Pros and cons of replication

Pros

  • Horizontal scaling

  • Availability zones

Cons

  • Eventual consistency

  • Slow writes in sync mode

  • Multi-master is hard to implement

Twitter system design

  • Post tweet 140 chars

  • Follow users

  • Timeline

mermaid
flowchart LR
    classDef cache fill:#FFFF00
    classDef msg_queue fill:#f96
    classDef purple fill:#800080

    db[(Postgres)]
    client_1[Request]
    pub((Publisher))
    sub((Subscriber))
    lb((L Balancer))
    message_queue[[Message queue / Kafka]]:::msg_queue
    client_1 --> lb --> pub --> message_queue --> sub -->|pool| db
  • L4, LB (transparent proxy)
    no TLS encryption handling.

  • L7, LB
    handles TLS encryption.

ID
Name
Pic

Extremely large table.

ID*
Following ID*

1

2

URL shortener system design

  • Create a shortened URL.

  • Redirects to the original URL.

mermaid
flowchart LR
    classDef cache fill:#FFFF00
    classDef msg_queue fill:#f96
    classDef purple fill:#800080

    db_replica[(PG replica)]
    db[(Postgres)]
    db_replica_1[(PG replica)]
    client_1[Request]
    web_server_1((Server))
    reverse_proxy((Rev proxy))

    db -.-> |replication| db_replica
    db -.-> |replication| db_replica_1
    client_1 --> reverse_proxy --> |POST: url| web_server_1 <-->|UPDATE| db

    web_server_1 <--> |SELECT| db_replica
    web_server_1 <--> |SELECT| db_replica_1
url
short_url

primary key

  • Only shard when you are not able to handle writes

py
base64(sha(url, salt)[:8]) # short url

Database Engines / Storage Engines / Embedded Databases

  • A library that takes care of CRED operations on the actual disk.

  • Databases can build on top of existing database engines.

  • MariaDB / MySQL allows us to switch between DB engines.

  • Postgres does not allow us to switch between DB engines.

LSM Trees ( watch?v=I6jB0nM9SKU )

mermaid
flowchart LR
    classDef cache fill:#FFFF00
    classDef msg_queue fill:#f96
    classDef purple fill:#800080

    memtable[(Memtable / B tree)]:::cache
    subgraph SSTable
        sstable_1[(SSTable)]
        sstable_2[(SSTable 1)]
        sstable_3[(SSTable ...)]
    end

    writes --> memtable -->|flush to disk| sstable_1

    read -->|1| memtable
    read -->|2| sstable_1
    read -->|3| sstable_2
    read -->|4| sstable_3
  • Log structured merge trees.

  • Each change to an existing key/value pairs are appended to the Memtable.

  • When the Memtable is full, it is flushed to disk as a SSTable.

  • SSTable is immutable, and is stored in the disk.

  • SSTables are have keys sorted in order.

  • Any update / delete to a key, is appended to the Memtable.

  • Memtable is flushed to a new SSTable, when the Memtable is full.

  • Bloomfilters are used to check if a key exists in the SSTable before reading it.

  • Compaction is used to merge multiple SSTables into one. (routine background process, for reclaiming disk space)

  • Duplicate and stale keys are removed during compaction.

Database Cursors

Pros

  • Cursors create a temporary table in memory.

  • Cursors are used to iterate over a large number of rows. (without loading all the rows into memory)

  • Allows data to be streamed.

  • Can cancel the cursor at any time.

  • Paging can be done using cursors.

Cons

  • Cursors are stateful, if the connection is lost, the cursor is lost.

  • Cannot horizontally scale cursors.

  • Long running cursors can cause locking issues.

Database Security

Steps to enable SSL

  • Run pg admin

    bash
    sudo docker run -e PGADMIN_DEFAULT_EMAIL="santhosh" -e PGADMIN_DEFAULT_PASSWORD="passwd" -p 80:80 -p 5555:80  --name pgadmin dpage/pgadmin4
  • Install vim in postgres docker container

    bash
    docker exec -it pg bash;
    apt-get update;
    apt-get install vim;
  • Generate SSH key

    bash
    openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365
  • Update postgresql.conf

    bash
    vim var/lib/postgresql/data/postgresql.conf
    ssl = on
    ssl_cert_file = 'cert.pem'
    ssl_key_file = 'key.pem'
  • Connect to the db using pg admin

What is the largest sql query we can send before it fails?

  • Create table if not exists.

  • Different users for CRED operations and Table owners.

  • Have different privileges for different users.

  • Normal users should not be able to DROP tables.

  • Different pools for different routes (routes that require read access only will use a different pool, logged in as normal user).

Best practices when working with databases

Separate read and write pools.

  • This adds additional security to the database.

  • An compromised read only user, cannot drop tables.

Use Pools

  • All read operations should use a read only pool.

  • A database pool is a collection of TCP connections.

  • Each connection is leased to a request, and returned to the pool after the request is complete.

  • This way we can reuse connections, and avoid creating new connections for every request.

  • Less overhead to create and close connections for the database.

Don't use unbounded queries

  • select * from users instead of select * from users limit 10 offset 1000000.

Use template literals

  • select * from users where id = ${id} instead of select * from users where id = hardcoded_id.

Don't show database errors to the client

  • Show generic error messages to the client instead.

Homomorphic Encryption

What is encryption?

  • Symmetric encryption

    • Same key is used to encrypt and decrypt.

    • AES, DES, 3DES, RC4, RC5, RC6, Blowfish, Twofish, etc.

  • Asymmetric encryption

    • Different keys are used to encrypt and decrypt.

    • Public key and private key pair.

Why can't we always encrypt data?

  • Queries cannot be executed on encrypted data.

  • Analysis, indexing, sorting, etc cannot be done on encrypted data.

  • Applications must read the data, decrypt it, and then perform the operation.

  • TLS termination at layer 7 load balancer requires the data to be decrypted. (needs to know the request, payload, etc)

What is homomorphic encryption?

  • Allows us to perform arithmetic operations on encrypted data.

  • No need to decrypt the data to perform operations.

  • Its damn slow, at the moment.

  • Databases can index and optimize without decrypting the data.

  • True end to end encryption.

Answering your questions

Why does posgres do a bit map heap scan, instead of an index only scan?

sql
create index g_index on grades(g);
explain analyze select count(*) from grades where g = 1;
  • Postgres needs to visit the heap to check if the row is visible to all transactions.

  • The visibility col is a hidden col in the heap for each row.

  • Visibility map is used to check if the row is visible to all transactions.

sql
vaccum grades;
  • vaccum command creates a visibility map.

What is cost in explain analyze?

  • Cost is the estimated cost of the query.

  • Its just a number that postgres uses.

  • Higher the cost, slower the query.

All Isolation levels

Repeatable read

  • The rows that we read, will remain the same, until the transaction is complete.

  • Obtains the lock on the things we read.

  • Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )

  • New rows that where inserted during the transaction, will not be visible to the transaction.

sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10

-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 11

-- same query returns different results, within the same transaction

Snapshot isolation level

  • Each transaction has a version.

  • xmin is the transaction version of the transaction that created the row.

  • xmax is the transaction version of the transaction that deleted the row.

  • xmin and xmax are stored in the heap.

  • When I start a transaction, only the rows that are visible to the transaction are returned.

  • Any new rows that are inserted during the transaction, will not be visible to the transaction.

  • Any rows that are deleted during the transaction, will be visible to the transaction.

  • xmin is used to filter out rows that where inserted after the transaction started.

sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10

-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10

Repeatable read vs Snapshot isolation level

  • Depends on the database.

  • In Posgres Repeatable read is Repeatable read.

Select for update. (Pessimistic locking)

sql
select * from users where id = 1 for update;
  • Obtains an exclusive lock on the rows that we read.

  • Other transactions cannot read / update the rows that where locked.

  • They have to wait for the previous transaction to release the lock.

Serializable isolation level. (Optimistic locking)

  • Uses MVCC to check if the rows that we read are modified by other transactions.

  • Detects conflicts, and fails the transaction.

Duplicate indexes in a row

  • Adding an index to a column that has less unique values, is not useful. (Index selectivity) Index is not selective enough.

  • Index is bloated and takes up more space, (same value is repeated multiple times, in the b-tree)

  • Postgres 13 has a de-duplication feature, that removes duplicate values from the b-tree index. node - row_id[]

Why use SELECT FOR UPDATE?

  • Each transaction obtains a exclusive lock on the rows that it needs to read / update.

  • Other transactions cannot read / update the rows that where locked.

  • You killed concurrency, your throughput is low.

  • Your transactions will never fail due to conflicts.

Serialisable isolation level is better.

  • Detects conflicts and fails the transaction, if there is a serialisable error.

  • Your throughput is high, and you have concurrency.

  • Its a optimistic locking mechanism.

  • Your transactions can fail, due to conflicts.

  • You need to handle retry logics in your application.

Why use a connection pool?

  • If you are using one database connection for every request, you are doing it wrong.

  • TCP blocking could happen if a lot of queries need to be executed at the same time.

mermaid
flowchart LR
    client[Client1]
    client_2[Client2]
    server[Server]
    db[DB]
    client --> |request| server --> |get / set| db
    client_2 --> |faster request| server --> |get / set| db
mermaid
flowchart RL
    classDef green fill:#ccffcc,stroke:#000,stroke-width:2px;
    classDef red fill:#ffcccc,stroke:#000,stroke-width:2px;

    client[Client1]:::green
    client_2[Client2]:::red
    server[Server]
    db[DB]

    db --> |fast response| server --> |fast response| client
    db --> |slow response| server --> |slow response| client_2
  • There is a chance of response misdirection.

  • Depends on the DB implementation, if the protocol used by the database does not use query tagging, the response could be misdirected.

But in case of a connection pool,

  • Each connection is leased to a request, and returned to the pool after the request is complete.

  • No parallel requests are made on the same connection.

What is the value of a bitmap index scan?

  • Row can be in multiple pages.

  • Creates a bit map on the first condition, if index is present for the same value. This creates a bit map of pages that satisfy the first condition.

  • Updates a bit map on the second condition.

  • Goes to the heap pulls those pages.

  • Filters out the page for the query condition

See, "Bitmap And / Or" for more info this file

  • If the query planner decides it has to scan a lot of rows, based on the query condition, it will decide to do an bitmap scan vs an index scan, to reduce IO.

Does composite index order matter?

  • Write amplifications, composite indexes touches a lot of cols, the index needs to be updated on row update.

  • Composite indexes are created from left to right.

  • Minimum filter first. (filter that results in a larger dataset first)

    sql
    select a,b,c where a=10 and b=10 and c=20;
    -- a is minimum filter

In an index only scan, pg still needs to visit the heap?

  • To check if the table is visible to the transaction.

  • To get old versions (snapshot) of the row, when the transaction started.

  • To avoid visits to the heap, use VACUUM command.

  • Avoid long running transactions.

Persistence vs Durability

  • Persistence is the ability to store data on disk and retrieve it at a later time after a crash.

  • My transaction is durable is only successful, when the data is stored on disk. Postgres and Redis use WAL's to flush data to disk periodically.

Can we create index when row's are being inserted / updated?

  • Most DB's to create index obtains a update / insert / delete lock on the entire table, until the index is created.

  • Postgres allows you to create index concurrently, without locking the table.

Graph databases

  • Database persists and retrieves data from disk.

  • IO, fetching pages from disk / heap is heavy.

  • More IO, more latency.

  • Storage engine, classic storage engines use tables and rows.

Postgres, does not store large strings in-line in the row, it compresses to an external TOAST table, and points to it.

  • Meta data can be stored in an external data-structure.

WAL, REDO, UNDO logs

  • When a transaction is committed, data is expected to be persistent even after system termination.

  • A transaction does changes to everything in memory.

  • On commit, all changes are flushed to disk. ( this is slow )

  • If databases crashes mid commit, what should happen to data?.

  • A log of all changes are persisted to disk. (WAL write ahead log).

  • Dirty pages are kept in memory.

  • Dirty pages (changed pages) are flushed to WAL.

  • DB comes back up, detects that WAL is ahead of the data on disk.

  • No one reads from the WAL, it's only used for recovery.

  • Page is loaded in memory, and the WAL is replayed ( WAL is also called as redo log ).

  • Page is flushed to disk, and WAL is marked as flushed.

  • How often should the WAL be flushed to disk? (checkpointing)

    • When the wall is full, we need to flush it to disk, before we can write to the wall again.

    • Checkpointing is a IO intensive operation, we need to pause everything until the checkpoint is done.

    • DB warns you before the checkpoint is initiated. (causes spikes in CPU usage)

    • If the WAL is small, the checkpoint is faster and more frequent.

  • OS buffers the writes in memory, and flushes it to disk when it's ready ( to reduce IO and wear on the disk ).

  • OS waits for the write cache to be full, before flushing it to disk.

  • This is a problem, because the WAL is not flushed to disk, and the OS crashes, the WAL is lost.

  • fsync() system call is used to flush WAL is flushed to disk directly bypassing the OS cache.

UNDO logs?

  • Not all databases have UNDO logs.

  • Rollback Transactions: If a transaction fails or is explicitly rolled back by the user, the UNDO log records are used to revert the changes made by that transaction.

  • Recovery: In the event of a system crash, UNDO logs are used during the recovery process to ensure that any incomplete transactions are undone, maintaining the consistency of the database.

  • Long running transactions, can cause the UNDO logs to grow, and for any new transactions to see the old data, the UNDO logs need to be replayed to get to the old state.

  • Undo logs are applied after wall changes are applied when DB is recovering from a crash.

Can Null values improve database performance?

  • Default values will take up space, eg (int 32bit will take up 32bits in the row even if the value like 0 does not take up 32 bits))

  • PostGres uses a null bitmap, to store null values. Every row has a null bitmap, that stores the null values in the row. Every bit corresponds to a column in the row.

  • If you have more than 8 columns, the null bitmap will increment by 8 bytes. | 8cols | 8 bits | | 9cols | 16 bits | | 17cols | 24 bits |

  • Nulls are better than default values, in terms of space.

Don't rely on nulls all the time, there are some gotcha's

  • select count(field); ignores nulls based on db

  • select count(*); counts nulls

  • T is null or T is not null, is fine.

  • Not all databases support indexes on cols with nulls in them. (PG allows it, but oracle doesn't)

  • Postgres limits each page size to 8kb.

  • We need to try to fit as many rows as possible in a page.

  • More IO's if the row is split across pages.

  • If most of the cols are null, the no of rows that can fit in a page can be increased.

  • There is a fixed cost of saving null's

  • Select user_id from users where user_id is null; will not use the index on user_id.

Postgres Architecture

  • Postmaster process

  • Everything in postgres is append only mutli version concurrency control.

  • Process vs Threads, (threads are faster, but postgres uses processes)

well established mind physical body, mind, emotions, life energies

Distributed transaction ( a transaction spanning across multiple databases / services / processes )

  • Each microservice has its own database.

  • Making a transaction across multiple microservices is hard.

    mermaid
    flowchart LR
        A[Order service] -->|success| B[Payment service]
        B -->|failed| C[Order service needs to revert]
        C -->|transactions cannot be reverted| D[Problem]
  • Ongoing transactions where stored in memory, instead of disk

  • But for distributed transactions this is not the case

  • Google atomic clocks to sequence of transactions

  • Compensatory transactions, revert the previous transaction

  • Eventing model, Kafka / Rabbit MQ,

    mermaid
    graph TD
        A[Order queue] --> B[Payment service]
        B --> C[Payment queue]
        C --> D[Shipping service]
        B --> E[Payment fail queue]
        E --> F[...]
  • Go back to the old days, have a macro service.

  • Debugging is hard in a distributed micro services.

Why did Uber migrate from postgres to mysql

  • ctid is a internal parameter used by postgres to identify for physical row identification, it's not meant to be used as a primary key.

  • ctid's change with every row update and are not stable.

  • When a row is updated, PostgreSQL often creates a new version of the row with a different ctid. The index is updated to point to the new ctid. This is why indexes might get bloated over time with dead tuples, necessitating regular maintenance tasks like VACUUM or REINDEX.

  • This is called Write amplification.

  • WAL's are used to keep track of these changes.

  • WAL's are pushed to standby servers, to keep them in sync.

  • SSD's are not good for updates, they are good for inserts (RocksDB and LevelsDB, optimised for SSD's).

    Adding too much indexes are a bad idea, it will cause write amplifications.

  • Uber had databases across multiple data centers, and they had to keep them in sync.

  • Push WAL updates across multiple data centers, is expensive (network wise).