Acid
What is a transaction? 🔁
Life cycle of a transaction
flowchart TD
start[DB Starting State] -->
begin[BEGIN Transaction] --> work[One unit of work]
work --> commit[COMMIT Transaction]
work --> rollback[ROLLBACK Transaction]
rollback --> startBegin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;balance - 100 > 0;UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;COMMIT TRANSACTION;Atomicity
All queries in a transaction must succeed,
or the transaction must be rolled back.
Even if 1/100 query fails,
the entire transaction must be rolled back.
It the database went down prior to a commit of a transaction,
the transaction must be rolled back.
flowchart TD
start[DB Starting Snapshot] -->
begin[BEGIN Transaction] --> update_1[Queries 5/100]
update_1 --> db_crash{DB CRASH}
db_crash -->|yes| cleanup[DB Cleanup / Rollback]
db_crash -->|no| COMMIT[COMMIT Transaction]
cleanup --> startIsolation
No title
Can my inflight transaction see the changes made by other transactions?
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] --> update_1[UPDATE Data]
initial_state ---> transaction_2[BEGIN transaction_2] --> |dirty read| read_1[SELECT Data]
update_1 --> |roolback| initial_stateNon-repeatable reads
Two different reads of the same data, return different results.
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] ---> update_1[UPDATE Data] -->
commit[commit]
transaction_2[BEGIN transaction_2] --> read_1[SELECT Data] --->
|same query diff result| read_2[SELECT Data]
initial_state --> transaction_2Avoid it by starting a repeatable read transaction,
begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] --> read_1[SELECT Data]
transaction_2[BEGIN transaction_2] --> update_1[INSERT Data] --> commit
read_1 ---> |same query has more results| read_2[SELECT Data]
initial_state --> transaction_2On postgres you can use the following to prevent phantom reads,
begin transaction isolation level repeatable read;
# or
begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
flowchart TD
initial_state[SUM 10] -->
transaction_1[BEGIN transaction_1] --> read_1[+20] --> read_12[30] --> commit
initial_state -->
transaction_2[BEGIN transaction_2] --> update_1[10] --> read_2[20] --->|lost update -10| commitIsolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Consistency
Consistency in data is user defined (DBA team)
Referential integrity (through foreign keys) data is
inconsistent across tables.
Atomicity, isolation affects consistency.
Data consistency between tables,
1
a
1
2
b
2
Mate
1
Mark
1
Jack
2
Consistency in reads,
flowchart LR
data --> |update| db[(DB)] --> |read| dataIf a transaction update a row,
will the change be immediately visible during a read?.
Both SQL and NoSQL databases suffer from this problem.
Eventual consistency
flowchart LR
data_v1 --> |update| db[(DB)] -->
|read| data_v2
db --> |read| data_vn
db --> |read| data_v1You will eventually get the latest version of the data.
flowchart LR
master_db[(Master - X)]
slave_1_db[(Follower - Z)]
slave_2_db[(Follower - Z)]
data -->|updated to x| master_db
master_db --> slave_1_db
master_db --> slave_2_db
slave_1_db -->|Inconsistent data| readflowchart LR
master_db[(Master - X)]
slave_1_db[(Follower - X)]
slave_2_db[(Follower - X)]
data --> master_db
master_db -->|replicate| slave_1_db
master_db -->|replicate| slave_2_db
slave_1_db -->|Consistent data| readWhen we introduce caching or we introduce horizontal scaling,
you are eventually consistent.
If you do not have atomicity,
or have inconsistent data,
you do not have consistency at all.
Durability
Process of persisting data to disk (non-volatile storage).
Durability slows write down,
write to disk is slower than write to memory
Durability techniques
WAL - Write ahead log / AOF - Append only file
Any change to the database is written to a log file, the log file is flushed to disk.
flowchart LR
data --> |update| db[(DB)] --> |write| log[(Log)] --> |flush| disk[(Disk)]flowchart LR
db[(DB)]
log[(Log)]
disk[(Disk)]
log --> db
disk --> db
db --> reconstructed_db[(DB reconstructed)]Writing all the changes to disk is expensive (indexes, data, files,
rows...)
The log file is a compressed version of the data stored in the database.
flowchart LR
db[(DB)] --> |flush request| os -->
os_cache[(OS Cache)] -->
|batched flushes| disk[(Disk)]
db --> |fSync| diskfSync forces the OS to flush the data to disk.
This prevents data loss in case of OS crash.
Async snapshots
flowchart LR
data --> |update| db[(DB)] --> |async| snapshot[(Snapshot)] --> |flush| disk[(Disk)]
db --> |sync| snapshot[(Snapshot)]Asynchronously write the data to disk
Serializability vs Repeatable reads
Repeatable reads
flowchart LR
transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit
transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit -->
result[a, a, b, b]both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Serializability
flowchart LR
transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit --> result[a, a, b, b]
transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit_error[Could not serialize transaction]Database Internals
How tables and indexes are stored on disk?
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
flowchart TD
heap[(Heap)]
index[(Index)]
indexed_query --> |select|index --> |page,row|heap --> |row|indexed_queryIn postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Row vs Column oriented databases
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
flowchart TD
page_1[1-200,name,age]
page_2[200-400,name,age]
page_3[400-500,name,age]
sum[sum age] --> |fetch| page_1 --> |fetch| page_2 --> |fetch| page_3Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
flowchart TD
page_1[age: 1:row_id-500:row_id]
sum[sum age] --> |fetch| page_1To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Primary key vs Secondary key
There is no order maintained by default in the table,
Clustering, order enforced by the database on a table,
through a primary key or secondary key.
Postgres Clustered table / heap organised table, mySQL Index ordered table
Primary key is order inforced on the table itself,
Secondary key is order inforced through an index table on the actual
table.
Adding rows in a secondary key enforced table is slow,
since you need to update the secondary index table.
Buffer pool
flowchart LR
heap[(Heap)]
index[(Index)]
buffer_pool[(Buffer pool)]
wal[(WAL)]
read_query --> |select|index --> |select|heap --> |page,row|buffer_pool --> |row|wal
update_query --> buffer_pool --> walTo reduce the number of IO's,
all row reads are stored in a buffer pool temporarily,
this allows for subsequent reads / writes to be served from the
buffer pool,
thus reducing the number of IO's.
Database Indexing
Creating table with random vaues.
insert into temp (t) select random()*100 from generate_series(0,100000)Explaining the explain command
Returns statistics about the query plan,
explain select * from temp where t > 50;
Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Scan types ( Seq vs Index )
Cost Time to start executing the query,
Time to finish executing the query
Rows
Number of rows return
Width
Avg size of returned rows in bytes
Creating an index
create index temp_t_idx on temp(t);Without index scan,
Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)With index scan,
Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)If info is available in the index, the database will not do a heap lookup.
Dont ask for more info than you need, ask for stuff in the index if possible.
With an index search,
Time to prepare increases 0.00 vs 0.29
Time to execute decreases
Number of rows returned visited decreases
Seq table scan vs Index scan vs Bitmap index scan (postgres)
No title
Even when a index is present, the database might choose to do a sequential scan,
or a bitmap index scan, based on the no of rows that needs to be scanned.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
explain select id from t_records where t > 10;
Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
Filter: (t > 10)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select --> heap --> resolveIndex
explain select id from t_records where t > 5000;
Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
Index Cond: (t > 5000)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select -->| Index Scan | index --> heap --> resolveBitmap
If the no of rows to be scanned is too high, and the
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
select --> bitmap --> recheck --> heapbitmap is a bit array of the index values, pointing to the page on heap.
pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
explain select id from t_records where t > 4000;
Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
Recheck Cond: (t > 4000)
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
Index Cond: (t > 4000)
(4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Bitmap And / Or
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
where_clause_1 --> bitmap --> recheck --> heap
where_clause_2 --> bitmapcondition_1 = [0, 0, 1, 0, 0, 1, ...]
condition_2 = [0, 0, 1, 0, 1, 0, ...]
bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
create index a_idx on temp(a) include (b,c);flowchart LR
index[(Index)]
select -->| Index only Scan | index --> resolveAdding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
Composite / Combined Indexes
create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
select * from temp where a > 1 and b < 3 and c > 5;flowchart LR
index[(a,b,c)]
select --> |index only scan|index --> resolveWhen should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Concurrent Index Creation's
Waits for all transactions to finish, before creating the index.
Does not lock the table read's or writes.
Takes longer time to execute.
Can crash if inserting duplicate values, while creating the index.
create index concurrently a_idx on temp(a);Bloom Filters
flowchart LR
client[Client]
server[Server]
bloom_filter[(Bloom Filter)]
db[(DB)]
client --> |request| server <--> |get / set| bloom_filter
server --> |request| dbA bloom filter is a probabilistic data structure, that can be used to check if an element is in a set.
# bloom filter
filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
filter_posstion = hash("req") % len(filter)
if filter[filter_posstion] == 1:
# maybe in set
return request_db()
# set bit
filter[filter_posstion] = 1Working with Billion row tables
Why do we need a billion row table? We have many ways to solve a problem.
Avoid having a billion row.
If thats not possible,
Can you index it?
Can you partition it in the same disk.
Can you shard horizontally on different disks, on multiple DB's.
Map Reduce (not suitabale for transactions).
B and B+ Trees
Full Table Scan
Reading the entire table, to find the rows that satisfy the query.
Reading pages from disk, that are not in memory.
Requires a lot of I/O to read the pages. (8KB - 16KB per page)
DBMS's implement multi threading to read pages in parallel.
B-Tree (Original)
B trees are not binary trees (each node can have more than 2 children),
B trees stands for balanced tree.(distance from root to any leaf node is
the same)
Each node in a B tree contains multiple keys that form a page.
graph TD
root[2:702, 4:704]
internal_1[1:701]
internal_2[3:703]
internal_3[5:705]
root --> |page|internal_1
root --> |page|internal_2
root --> |page|internal_3The above example has 2 keys, and 3 children.
Each key in the node points to a value on the heap (tuple),
or the primary key.
Max no of degrees, no of keys per node.
The parent must have (n - 1) keys,
where n is the number of children.
The nodes are ordered from left to right.
You cannot see this in the diagram,
each key in the node has a before and after pointer.
Before pointer is used to find the next node,
that has a value less than the following key.
Limitations of B-Trees
Because each node stores both keys and values,
the number of keys per node is limited.
The values are not used until the node is found.
This increases the number of I/O's required to find the node
(reading each node / page is an IO).
Range queries are slow, since we need to traverse for all the keys in the range.
B+ Tree
Nodes only store keys in internal and root nodes.
The values are only stored in the leaf nodes.
This allows us to store more keys per node.
Leaf nodes are linked, eacn node has a pointer to the next node linked list.
Range queries are much faster.
graph TD
root[5] --> root_1[3]
root --> root_2[7 9]
root_1 --> root_3[2]
root_1 --> root_4[4]
root_2 --> root_5[6]
root_2 --> root_6[8]
root_2 --> root_7[10]
root_3 --> leaf_1[1:row_id]
root_3 --> leaf_2[2]
root_4 --> leaf_3[3]
root_4 --> leaf_4[4]
root_5 --> leaf_5[5]
root_5 --> leaf_6[6]
root_6 --> leaf_7[7]
root_6 --> leaf_8[8]
root_7 --> leaf_9[9]
root_7 --> leaf_10[10,11]
subgraph linked list
leaf_1
leaf_2
leaf_3
leaf_4
leaf_5
leaf_6
leaf_7
leaf_8
leaf_9
leaf_10
endEach node usually store 100's of thousands of keys, above diagram is just an example.
Leaf nodes have duplicate keys.
Leaf pointers are cheap to traverse.
1 Node = 1 DBMS Page
Internal nodes fit in memory.
Leaf nodes can be stored in the heap.
Most DBMS uses B+ trees.
Storage cost in Postgres vs mySQL
Postgres points to the heap. mySQL points to the primary key.
Write amplification is higher in Postgres. (any update to the heap, needs to be replicated in the index).
If primary keys are large (UUID), inserts in mySQL will be slower.
(Since UUID's don't have a pattern.)
Leaf nodes in mySQL contains all the data (clustered index).
Database Partitioning
Partitioning types.
Range partitioning By date, customer id, etc.
By List By location, zip codes, etc.
Hash partitioning By hash of the primary key, etc. (consistent hashing)
Partitioning vs Sharding
In partitioning, the data is stored in the same DB. The client is agnostic.
Each shard sits in a seperate DB, the client needs to know which shard to query.
In partitioning, the table name changes.
In sharding, the table name is the same, but the DB ip changes.
Creating a partition
create table grades (id serial not null, g not null) partition by range (g);
create table g0035 (like grades including indexes);
create table g3570 (like grades including indexes);
create table g7080 (like grades including indexes);
create table g80100 (like grades including indexes);
alter table attach partition g0035 for values from (0) to (35);
alter table attach partition g3570 for values from (35) to (70);
alter table attach partition g7080 for values from (70) to (80);
alter table attach partition g80100 for values from (80) to (100);insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
create index on grades (g);or
insert into grades (g) select g from another_table;select max(g) from g0035; -- 34
select max(g) from g3570; -- 69
select max(g) from g7080; -- 79
select max(g) from g80100; -- 99Make sure ENABLE_PARTITION_PRUNING is set to on in
postgresql.conf.
use the show command to check.
show enable_partition_pruning;set enable_partition_pruning = on;Checking partition size
select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
pg_relation_size | relname
------------------+-----------------------------------------------
126935040 | g3570
126754816 | g0035
72572928 | g7090
36241408 | g90100
24305664 | g3570_g_idx
24272896 | g0035_g_idx
13910016 | g7090_g_idx
6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
Sharding
Pros of sharding
Horizontal scaling (Data / Memory / CPU).
Security (Access control).
Smaller Indexes.
Cons of sharding
Complex client side code. (need to know which shard to query)
Atomic transactions across shards are impossible.
Rollbacks are complicated AF.
Schema changes are harder (need to change all shards).
Joins are harder (need to query all shards).
Sharding key should be known.
When should we shard?
Sharding should be your last resort.
Partition first, then shard.
TCP connections are reaching the limit?
Try replication first.
Writes only to primary server.
Two way replication, two availabilty zones (multiple primary databases).
You cannot maintain ACID complaiance accross shards. (eventual consistency).
Sharding is an overkill for most applications.
Vitesse is a good sharding library for go.
Concurrency control
Exclusive lock's vs Shared lock's
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
gantt
title Timeline
section Alice
Exclusive lock :a1, 2020-01-10, 42h
section Bob
Shared lock :b1, 2020-01-11, 24h
section Charlie to Bob
Shared lock on Bob failed :crit, c1, 2020-01-11, 1h
Update bob :cirt c2, 2020-01-12, 24hDead locks
graph LR
A[Transaction 1]
B[T2]
l1{{Release lock 1 event}}
l2{{R2}}
A --> l1 --> B
B --> l2 --> Aflowchart TD
t1[Transaction 1]
t2[Transaction 2]
a1[Insert 10]
a2[Insert 20]
b1[Insert 20]
b2[Insert 10]
t1 --> a1 --> a2 --> wait --> commit
t2 --> b1 --> b2 --> rollback[rollback due to deadlock]Most DBMS's have a deadlock detection mechanism,
They will fail one of the transactions.
Two phase locking
The double booking phenomenon.
graph TD
t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
t2[Transaction 2] --> s2[id=10, isBooked=false] --> u2[id=10, isBooked=true, name=Steve] --> commit -->
status[id=10, isBooked=true, name=Steve]Both transactions read the same data, and updated it.
Solution, create a two phase locking mechanism.
graph LR
t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
t2[Transaction 2] --> hang --> s2[id=10, isBooked=true]begin transaction;
select * from seats where id = 10 for update; -- creates an exclusive lock
update seats set isBooked = true, name = 'Bob' where id = 10;
commit;begin transaction;
select * from seats where id = 10 for update;
-- hang
-- updated data is returnedfor UPDATE creates an exclusive lock on the row indefinitelyset lock_timeout = 1000; will timeout the lock after 1 second.
Alternate solution (not ideal)
flowchart LR
t1[Read committed transaction 1] --> s1[Update when clause] ---> commit
t2[Read committed transaction 2] --> s2[Update when clause] --> hang --> rollbackupdate seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;This solution is imperfect, You are at the mercy of the DBMS
implementation.
Postgres refreshes the row after the lock is released, the same cannot be said
for other DBMS's.
Always rely on explicit locks instead of relying on implicit locks.
flowchart TD
t1[update clause] --> |reading row on heap for lock info|heap[(Heap)] --> wait{{Wait for lock to be released}}
wait --> |refresh row|heap
wait --> resolveSQL pagination with offset is very slow
This is slow, because, DBMS needs to iterate through the first 10,00,010 rows,
and then filter out the last 10 rows.
select * from users order by id limit 10 offset 1000000;instead, make user query for more data using the last-seen-id.
select * from users where id > 1000000 order by id limit 10;Connection pooling.
For every connection, the DBMS needs to allocate resources, this takes time.
If we create a connection for every request, the latency and throughput will be very bad.
If the no of connections increase, the DBMS will overload and crash.
Connection pooling is a way to reuse connections.
The connections are kept alive, and reused.
Each connection is leased to a request, and returned to the pool after the request is complete.
graph LR
A[Request 1]
B[Request 2]
C((Connection pool))
C --> |lease|A
A --> |release|C
C --> |lease|B
B --> |release|C
C --> D[(DB ip:port)]Default connection pool size is 10, in postgres.
graph LR
db[(DB ip_address:port)]
instance_1((instance 1))
instance_2((instance 2))
instance_n((instance n))
lb[Load balancer]
request --> lb --> instance_1
lb --> instance_2
lb --> instance_n
instance_1 --> |10 connection pool|db
instance_2 --> |10 connection pool|db
instance_n --> |10 connection pool|dbEach instance keeps alive 10 connections to the DB.
Each instance can send upto 10 concurrent requests to the DB.
Each function invocation leases a connection from the pool.
Once the function is executed, the connection is returned to the pool.
const { Pool } = require('pg');
const pool = new Pool({
max: 20,
idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
connectionTimeoutMillis: 2000, // timeout after 2 seconds
});
await pool.query('SELECT something FROM sometable');We cannot begin a transaction on a pooled connection, transactions need a dedicated connection.
// asking the pool to give us a dedicated connection
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query('COMMIT');
} catch (e) {
await client.query('ROLLBACK');
} finally {
// release the connection to the pool
client.release();
}Database replication
Master slave replication
graph TD
query[C-ED queries] --> master[(Master)]
master --> |replication| slave_1[(replica 1)]
master --> |replication| slave_2[(replica 2)]
master --> |replication| slave_n[(replica n)]
read_query[R queries] -.-> slave_1
read_query -.-> slave_2
read_query -.-> slave_nMulti-Master replication
graph LR
query[C-ED queries]
master_1[(Master 1)]
master_2[(Master 2)]
slave_1[(replica 1)]
slave_2[(replica 2)]
slave_3[(replica 3)]
query --> master_1
query --> master_2
master_1 --> |replication| master_2
master_2 --> |replication| slave_1
master_2 --> |replication| slave_2
master_1 --> |replication| slave_3
read_query[R queries] -.-> slave_1
read_query -.-> slave_2
read_query -.-> slave_3Difficult to implement.
Need to deal with conflicts.
Not usually preferred over master-slave replication.
Synchronous replication
graph LR
query[C-ED queries] --> master[(Master)]
master[(Master)] --> |replication| slave_1[(replica 1)]
master --> |replication| slave_2[(replica 2)]
master --> |replication| slave_n[(replica 3)]
slave_1 -.-> |ack| master
slave_2 -.-> |ack| master
master -.-> |resolve| queryReplication must be successful before the query is resolved.
If the replication fails, the query fails.
Can specify the number of replicas that must acknowledge the replication.
(first 2 / first 1 / any)
Asynchronous replication
graph LR
query[C-ED queries] --> master[(Master)]
master[(Master)] -..-> |Async replication| slave_1[(replica 1)]
master -..-> |Async replication| slave_2[(replica 2)]
master -..-> |Async replication| slave_3[(replica 3)]
master --> |resolve| queryPros and cons of replication
Pros
Horizontal scaling
Availability zones
Cons
Eventual consistency
Slow writes in sync mode
Multi-master is hard to implement
Twitter system design
Post tweet 140 chars
Follow users
Timeline
flowchart LR
classDef cache fill:#FFFF00
classDef msg_queue fill:#f96
classDef purple fill:#800080
db[(Postgres)]
client_1[Request]
pub((Publisher))
sub((Subscriber))
lb((L Balancer))
message_queue[[Message queue / Kafka]]:::msg_queue
client_1 --> lb --> pub --> message_queue --> sub -->|pool| dbL4, LB (transparent proxy)
no TLS encryption handling.
L7, LB
handles TLS encryption.
Extremely large table.
1
2
URL shortener system design
Create a shortened URL.
Redirects to the original URL.
flowchart LR
classDef cache fill:#FFFF00
classDef msg_queue fill:#f96
classDef purple fill:#800080
db_replica[(PG replica)]
db[(Postgres)]
db_replica_1[(PG replica)]
client_1[Request]
web_server_1((Server))
reverse_proxy((Rev proxy))
db -.-> |replication| db_replica
db -.-> |replication| db_replica_1
client_1 --> reverse_proxy --> |POST: url| web_server_1 <-->|UPDATE| db
web_server_1 <--> |SELECT| db_replica
web_server_1 <--> |SELECT| db_replica_1primary key
Only shard when you are not able to handle writes
base64(sha(url, salt)[:8]) # short urlDatabase Engines / Storage Engines / Embedded Databases
A library that takes care of CRED operations on the actual disk.
Databases can build on top of existing database engines.
MariaDB / MySQL allows us to switch between DB engines.
Postgres does not allow us to switch between DB engines.
LSM Trees ( watch?v=I6jB0nM9SKU )
flowchart LR
classDef cache fill:#FFFF00
classDef msg_queue fill:#f96
classDef purple fill:#800080
memtable[(Memtable / B tree)]:::cache
subgraph SSTable
sstable_1[(SSTable)]
sstable_2[(SSTable 1)]
sstable_3[(SSTable ...)]
end
writes --> memtable -->|flush to disk| sstable_1
read -->|1| memtable
read -->|2| sstable_1
read -->|3| sstable_2
read -->|4| sstable_3Log structured merge trees.
Each change to an existing key/value pairs are appended to the Memtable.
When the Memtable is full, it is flushed to disk as a SSTable.
SSTable is immutable, and is stored in the disk.
SSTables are have keys sorted in order.
Any update / delete to a key, is appended to the Memtable.
Memtable is flushed to a new SSTable, when the Memtable is full.
Bloomfilters are used to check if a key exists in the SSTable before reading it.
Compaction is used to merge multiple SSTables into one. (routine background process, for reclaiming disk space)
Duplicate and stale keys are removed during compaction.
Database Cursors
Pros
Cursors create a temporary table in memory.
Cursors are used to iterate over a large number of rows. (without loading all the rows into memory)
Allows data to be streamed.
Can cancel the cursor at any time.
Paging can be done using cursors.
Cons
Cursors are stateful, if the connection is lost, the cursor is lost.
Cannot horizontally scale cursors.
Long running cursors can cause locking issues.
Database Security
Steps to enable SSL
Run pg admin
sudo docker run -e PGADMIN_DEFAULT_EMAIL="santhosh" -e PGADMIN_DEFAULT_PASSWORD="passwd" -p 80:80 -p 5555:80 --name pgadmin dpage/pgadmin4Install vim in postgres docker container
docker exec -it pg bash;
apt-get update;
apt-get install vim;Generate SSH key
openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365Update postgresql.conf
vim var/lib/postgresql/data/postgresql.conf
ssl = on
ssl_cert_file = 'cert.pem'
ssl_key_file = 'key.pem'Connect to the db using pg admin
What is the largest sql query we can send before it fails?
Create table if not exists.
Different users for CRED operations and Table owners.
Have different privileges for different users.
Normal users should not be able to DROP tables.
Different pools for different routes (routes that require read access only will use a different pool, logged in as normal user).
Best practices when working with databases
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Homomorphic Encryption
What is encryption?
Symmetric encryption
Same key is used to encrypt and decrypt.
AES, DES, 3DES, RC4, RC5, RC6, Blowfish, Twofish, etc.
Asymmetric encryption
Different keys are used to encrypt and decrypt.
Public key and private key pair.
Why can't we always encrypt data?
Queries cannot be executed on encrypted data.
Analysis, indexing, sorting, etc cannot be done on encrypted data.
Applications must read the data, decrypt it, and then perform the operation.
TLS termination at layer 7 load balancer requires the data to be decrypted. (needs to know the request, payload, etc)
What is homomorphic encryption?
Allows us to perform arithmetic operations on encrypted data.
No need to decrypt the data to perform operations.
Its damn slow, at the moment.
Databases can index and optimize without decrypting the data.
True end to end encryption.
Answering your questions
Why does posgres do a bit map heap scan, instead of an index only scan?
create index g_index on grades(g);
explain analyze select count(*) from grades where g = 1;Postgres needs to visit the heap to check if the row is visible to all transactions.
The visibility col is a hidden col in the heap for each row.
Visibility map is used to check if the row is visible to all transactions.
vaccum grades;vaccum command creates a visibility map.
What is cost in explain analyze?
Cost is the estimated cost of the query.
Its just a number that postgres uses.
Higher the cost, slower the query.
All Isolation levels
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 11
-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Duplicate indexes in a row
Adding an index to a column that has less unique values, is not useful. (Index selectivity) Index is not selective enough.
Index is bloated and takes up more space, (same value is repeated multiple times, in the b-tree)
Postgres 13 has a de-duplication feature, that removes duplicate values from the b-tree index. node - row_id[]
Why use SELECT FOR UPDATE?
Each transaction obtains a exclusive lock on the rows that it needs to read / update.
Other transactions cannot read / update the rows that where locked.
You killed concurrency, your throughput is low.
Your transactions will never fail due to conflicts.
Serialisable isolation level is better.
Detects conflicts and fails the transaction, if there is a serialisable error.
Your throughput is high, and you have concurrency.
Its a optimistic locking mechanism.
Your transactions can fail, due to conflicts.
You need to handle retry logics in your application.
Why use a connection pool?
If you are using one database connection for every request, you are doing it wrong.
TCP blocking could happen if a lot of queries need to be executed at the same time.
flowchart LR
client[Client1]
client_2[Client2]
server[Server]
db[DB]
client --> |request| server --> |get / set| db
client_2 --> |faster request| server --> |get / set| dbflowchart RL
classDef green fill:#ccffcc,stroke:#000,stroke-width:2px;
classDef red fill:#ffcccc,stroke:#000,stroke-width:2px;
client[Client1]:::green
client_2[Client2]:::red
server[Server]
db[DB]
db --> |fast response| server --> |fast response| client
db --> |slow response| server --> |slow response| client_2There is a chance of response misdirection.
Depends on the DB implementation, if the protocol used by the database does not use query tagging, the response could be misdirected.
But in case of a connection pool,
Each connection is leased to a request, and returned to the pool after the request is complete.
No parallel requests are made on the same connection.
What is the value of a bitmap index scan?
Row can be in multiple pages.
Creates a bit map on the first condition, if index is present for the same value. This creates a bit map of pages that satisfy the first condition.
Updates a bit map on the second condition.
Goes to the heap pulls those pages.
Filters out the page for the query condition
See, "Bitmap And / Or" for more info this file
If the query planner decides it has to scan a lot of rows, based on the query condition, it will decide to do an bitmap scan vs an index scan, to reduce IO.
Does composite index order matter?
Write amplifications, composite indexes touches a lot of cols, the index needs to be updated on row update.
Composite indexes are created from left to right.
Minimum filter first. (filter that results in a larger dataset first)
select a,b,c where a=10 and b=10 and c=20;
-- a is minimum filterIn an index only scan, pg still needs to visit the heap?
To check if the table is visible to the transaction.
To get old versions (snapshot) of the row, when the transaction started.
To avoid visits to the heap, use VACUUM command.
Avoid long running transactions.
Persistence vs Durability
Persistence is the ability to store data on disk and retrieve it at a later time after a crash.
My transaction is durable is only successful, when the data is stored on disk. Postgres and Redis use WAL's to flush data to disk periodically.
Can we create index when row's are being inserted / updated?
Most DB's to create index obtains a update / insert / delete lock on the entire table, until the index is created.
Postgres allows you to create index concurrently, without locking the table.
Graph databases
Database persists and retrieves data from disk.
IO, fetching pages from disk / heap is heavy.
More IO, more latency.
Storage engine, classic storage engines use tables and rows.
Postgres, does not store large strings in-line in the row, it compresses to an external TOAST table, and points to it.
Meta data can be stored in an external data-structure.
WAL, REDO, UNDO logs
When a transaction is committed, data is expected to be persistent even after system termination.
A transaction does changes to everything in memory.
On commit, all changes are flushed to disk. ( this is slow )
If databases crashes mid commit, what should happen to data?.
A log of all changes are persisted to disk. (WAL write ahead log).
Dirty pages are kept in memory.
Dirty pages (changed pages) are flushed to WAL.
DB comes back up, detects that WAL is ahead of the data on disk.
No one reads from the WAL, it's only used for recovery.
Page is loaded in memory, and the WAL is replayed ( WAL is also called as redo log ).
Page is flushed to disk, and WAL is marked as flushed.
How often should the WAL be flushed to disk? (checkpointing)
When the wall is full, we need to flush it to disk, before we can write to the wall again.
Checkpointing is a IO intensive operation, we need to pause everything until the checkpoint is done.
DB warns you before the checkpoint is initiated. (causes spikes in CPU usage)
If the WAL is small, the checkpoint is faster and more frequent.
OS buffers the writes in memory, and flushes it to disk when it's ready ( to reduce IO and wear on the disk ).
OS waits for the write cache to be full, before flushing it to disk.
This is a problem, because the WAL is not flushed to disk, and the OS crashes, the WAL is lost.
fsync() system call is used to flush WAL is flushed to disk directly bypassing the OS cache.
UNDO logs?
Not all databases have UNDO logs.
Rollback Transactions: If a transaction fails or is explicitly rolled back by the user, the UNDO log records are used to revert the changes made by that transaction.
Recovery: In the event of a system crash, UNDO logs are used during the recovery process to ensure that any incomplete transactions are undone, maintaining the consistency of the database.
Long running transactions, can cause the UNDO logs to grow, and for any new transactions to see the old data, the UNDO logs need to be replayed to get to the old state.
Undo logs are applied after wall changes are applied when DB is recovering from a crash.
Can Null values improve database performance?
Default values will take up space, eg (int 32bit will take up 32bits in the row even if the value like 0 does not take up 32 bits))
PostGres uses a null bitmap, to store null values. Every row has a null bitmap, that stores the null values in the row. Every bit corresponds to a column in the row.
If you have more than 8 columns, the null bitmap will increment by 8 bytes. | 8cols | 8 bits | | 9cols | 16 bits | | 17cols | 24 bits |
Nulls are better than default values, in terms of space.
Don't rely on nulls all the time, there are some gotcha's
select count(field);
ignores nulls based on db
select count(*);
counts nulls
T is null or T is not null, is fine.
Not all databases support indexes on cols with nulls in them. (PG allows it, but oracle doesn't)
Postgres limits each page size to 8kb.
We need to try to fit as many rows as possible in a page.
More IO's if the row is split across pages.
If most of the cols are null, the no of rows that can fit in a page can be increased.
There is a fixed cost of saving null's
Select user_id from users where user_id is null; will not use the index on user_id.
Postgres Architecture
Postmaster process
Everything in postgres is append only mutli version concurrency control.
Process vs Threads, (threads are faster, but postgres uses processes)
well established mind physical body, mind, emotions, life energies
Distributed transaction ( a transaction spanning across multiple databases / services / processes )
Each microservice has its own database.
Making a transaction across multiple microservices is hard.
flowchart LR
A[Order service] -->|success| B[Payment service]
B -->|failed| C[Order service needs to revert]
C -->|transactions cannot be reverted| D[Problem]Ongoing transactions where stored in memory, instead of disk
But for distributed transactions this is not the case
Google atomic clocks to sequence of transactions
Compensatory transactions, revert the previous transaction
Eventing model, Kafka / Rabbit MQ,
graph TD
A[Order queue] --> B[Payment service]
B --> C[Payment queue]
C --> D[Shipping service]
B --> E[Payment fail queue]
E --> F[...]Go back to the old days, have a macro service.
Debugging is hard in a distributed micro services.
Why did Uber migrate from postgres to mysql
ctid is a internal parameter used by postgres to identify for physical row identification,
it's not meant to be used as a primary key.
ctid's change with every row update and are not stable.
When a row is updated, PostgreSQL often creates a new version of the row with a different ctid. The index is updated to point to the new ctid. This is why indexes might get bloated over time with dead tuples, necessitating regular maintenance tasks like VACUUM or REINDEX.
This is called Write amplification.
WAL's are used to keep track of these changes.
WAL's are pushed to standby servers, to keep them in sync.
SSD's are not good for updates, they are good for inserts (RocksDB and LevelsDB, optimised for SSD's).
Adding too much indexes are a bad idea, it will cause write amplifications.
Uber had databases across multiple data centers, and they had to keep them in sync.
Push WAL updates across multiple data centers, is expensive (network wise).
Acid
What is a transaction? 🔁
Life cycle of a transaction
flowchart TD
start[DB Starting State] -->
begin[BEGIN Transaction] --> work[One unit of work]
work --> commit[COMMIT Transaction]
work --> rollback[ROLLBACK Transaction]
rollback --> startBegin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;balance - 100 > 0;UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;COMMIT TRANSACTION;Atomicity
All queries in a transaction must succeed,
or the transaction must be rolled back.
Even if 1/100 query fails,
the entire transaction must be rolled back.
It the database went down prior to a commit of a transaction,
the transaction must be rolled back.
flowchart TD
start[DB Starting Snapshot] -->
begin[BEGIN Transaction] --> update_1[Queries 5/100]
update_1 --> db_crash{DB CRASH}
db_crash -->|yes| cleanup[DB Cleanup / Rollback]
db_crash -->|no| COMMIT[COMMIT Transaction]
cleanup --> startIsolation
No title
Can my inflight transaction see the changes made by other transactions?
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] --> update_1[UPDATE Data]
initial_state ---> transaction_2[BEGIN transaction_2] --> |dirty read| read_1[SELECT Data]
update_1 --> |roolback| initial_stateNon-repeatable reads
Two different reads of the same data, return different results.
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] ---> update_1[UPDATE Data] -->
commit[commit]
transaction_2[BEGIN transaction_2] --> read_1[SELECT Data] --->
|same query diff result| read_2[SELECT Data]
initial_state --> transaction_2Avoid it by starting a repeatable read transaction,
begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] --> read_1[SELECT Data]
transaction_2[BEGIN transaction_2] --> update_1[INSERT Data] --> commit
read_1 ---> |same query has more results| read_2[SELECT Data]
initial_state --> transaction_2On postgres you can use the following to prevent phantom reads,
begin transaction isolation level repeatable read;
# or
begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
flowchart TD
initial_state[SUM 10] -->
transaction_1[BEGIN transaction_1] --> read_1[+20] --> read_12[30] --> commit
initial_state -->
transaction_2[BEGIN transaction_2] --> update_1[10] --> read_2[20] --->|lost update -10| commitIsolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Consistency
Consistency in data is user defined (DBA team)
Referential integrity (through foreign keys) data is
inconsistent across tables.
Atomicity, isolation affects consistency.
Data consistency between tables,
1
a
1
2
b
2
Mate
1
Mark
1
Jack
2
Consistency in reads,
flowchart LR
data --> |update| db[(DB)] --> |read| dataIf a transaction update a row,
will the change be immediately visible during a read?.
Both SQL and NoSQL databases suffer from this problem.
Eventual consistency
flowchart LR
data_v1 --> |update| db[(DB)] -->
|read| data_v2
db --> |read| data_vn
db --> |read| data_v1You will eventually get the latest version of the data.
flowchart LR
master_db[(Master - X)]
slave_1_db[(Follower - Z)]
slave_2_db[(Follower - Z)]
data -->|updated to x| master_db
master_db --> slave_1_db
master_db --> slave_2_db
slave_1_db -->|Inconsistent data| readflowchart LR
master_db[(Master - X)]
slave_1_db[(Follower - X)]
slave_2_db[(Follower - X)]
data --> master_db
master_db -->|replicate| slave_1_db
master_db -->|replicate| slave_2_db
slave_1_db -->|Consistent data| readWhen we introduce caching or we introduce horizontal scaling,
you are eventually consistent.
If you do not have atomicity,
or have inconsistent data,
you do not have consistency at all.
Durability
Process of persisting data to disk (non-volatile storage).
Durability slows write down,
write to disk is slower than write to memory
Durability techniques
WAL - Write ahead log / AOF - Append only file
Any change to the database is written to a log file, the log file is flushed to disk.
flowchart LR
data --> |update| db[(DB)] --> |write| log[(Log)] --> |flush| disk[(Disk)]flowchart LR
db[(DB)]
log[(Log)]
disk[(Disk)]
log --> db
disk --> db
db --> reconstructed_db[(DB reconstructed)]Writing all the changes to disk is expensive (indexes, data, files,
rows...)
The log file is a compressed version of the data stored in the database.
flowchart LR
db[(DB)] --> |flush request| os -->
os_cache[(OS Cache)] -->
|batched flushes| disk[(Disk)]
db --> |fSync| diskfSync forces the OS to flush the data to disk.
This prevents data loss in case of OS crash.
Async snapshots
flowchart LR
data --> |update| db[(DB)] --> |async| snapshot[(Snapshot)] --> |flush| disk[(Disk)]
db --> |sync| snapshot[(Snapshot)]Asynchronously write the data to disk
Serializability vs Repeatable reads
Repeatable reads
flowchart LR
transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit
transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit -->
result[a, a, b, b]both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Serializability
flowchart LR
transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit --> result[a, a, b, b]
transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit_error[Could not serialize transaction]Database Internals
How tables and indexes are stored on disk?
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
flowchart TD
heap[(Heap)]
index[(Index)]
indexed_query --> |select|index --> |page,row|heap --> |row|indexed_queryIn postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Row vs Column oriented databases
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
flowchart TD
page_1[1-200,name,age]
page_2[200-400,name,age]
page_3[400-500,name,age]
sum[sum age] --> |fetch| page_1 --> |fetch| page_2 --> |fetch| page_3Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
flowchart TD
page_1[age: 1:row_id-500:row_id]
sum[sum age] --> |fetch| page_1To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Primary key vs Secondary key
There is no order maintained by default in the table,
Clustering, order enforced by the database on a table,
through a primary key or secondary key.
Postgres Clustered table / heap organised table, mySQL Index ordered table
Primary key is order inforced on the table itself,
Secondary key is order inforced through an index table on the actual
table.
Adding rows in a secondary key enforced table is slow,
since you need to update the secondary index table.
Buffer pool
flowchart LR
heap[(Heap)]
index[(Index)]
buffer_pool[(Buffer pool)]
wal[(WAL)]
read_query --> |select|index --> |select|heap --> |page,row|buffer_pool --> |row|wal
update_query --> buffer_pool --> walTo reduce the number of IO's,
all row reads are stored in a buffer pool temporarily,
this allows for subsequent reads / writes to be served from the
buffer pool,
thus reducing the number of IO's.
Database Indexing
Creating table with random vaues.
insert into temp (t) select random()*100 from generate_series(0,100000)Explaining the explain command
Returns statistics about the query plan,
explain select * from temp where t > 50;
Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Scan types ( Seq vs Index )
Cost Time to start executing the query,
Time to finish executing the query
Rows
Number of rows return
Width
Avg size of returned rows in bytes
Creating an index
create index temp_t_idx on temp(t);Without index scan,
Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)With index scan,
Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)If info is available in the index, the database will not do a heap lookup.
Dont ask for more info than you need, ask for stuff in the index if possible.
With an index search,
Time to prepare increases 0.00 vs 0.29
Time to execute decreases
Number of rows returned visited decreases
Seq table scan vs Index scan vs Bitmap index scan (postgres)
No title
Even when a index is present, the database might choose to do a sequential scan,
or a bitmap index scan, based on the no of rows that needs to be scanned.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
explain select id from t_records where t > 10;
Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
Filter: (t > 10)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select --> heap --> resolveIndex
explain select id from t_records where t > 5000;
Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
Index Cond: (t > 5000)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select -->| Index Scan | index --> heap --> resolveBitmap
If the no of rows to be scanned is too high, and the
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
select --> bitmap --> recheck --> heapbitmap is a bit array of the index values, pointing to the page on heap.
pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
explain select id from t_records where t > 4000;
Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
Recheck Cond: (t > 4000)
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
Index Cond: (t > 4000)
(4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Bitmap And / Or
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
where_clause_1 --> bitmap --> recheck --> heap
where_clause_2 --> bitmapcondition_1 = [0, 0, 1, 0, 0, 1, ...]
condition_2 = [0, 0, 1, 0, 1, 0, ...]
bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
create index a_idx on temp(a) include (b,c);flowchart LR
index[(Index)]
select -->| Index only Scan | index --> resolveAdding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
Composite / Combined Indexes
create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
select * from temp where a > 1 and b < 3 and c > 5;flowchart LR
index[(a,b,c)]
select --> |index only scan|index --> resolveWhen should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Concurrent Index Creation's
Waits for all transactions to finish, before creating the index.
Does not lock the table read's or writes.
Takes longer time to execute.
Can crash if inserting duplicate values, while creating the index.
create index concurrently a_idx on temp(a);Bloom Filters
flowchart LR
client[Client]
server[Server]
bloom_filter[(Bloom Filter)]
db[(DB)]
client --> |request| server <--> |get / set| bloom_filter
server --> |request| dbA bloom filter is a probabilistic data structure, that can be used to check if an element is in a set.
# bloom filter
filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
filter_posstion = hash("req") % len(filter)
if filter[filter_posstion] == 1:
# maybe in set
return request_db()
# set bit
filter[filter_posstion] = 1Working with Billion row tables
Why do we need a billion row table? We have many ways to solve a problem.
Avoid having a billion row.
If thats not possible,
Can you index it?
Can you partition it in the same disk.
Can you shard horizontally on different disks, on multiple DB's.
Map Reduce (not suitabale for transactions).
B and B+ Trees
Full Table Scan
Reading the entire table, to find the rows that satisfy the query.
Reading pages from disk, that are not in memory.
Requires a lot of I/O to read the pages. (8KB - 16KB per page)
DBMS's implement multi threading to read pages in parallel.
B-Tree (Original)
B trees are not binary trees (each node can have more than 2 children),
B trees stands for balanced tree.(distance from root to any leaf node is
the same)
Each node in a B tree contains multiple keys that form a page.
graph TD
root[2:702, 4:704]
internal_1[1:701]
internal_2[3:703]
internal_3[5:705]
root --> |page|internal_1
root --> |page|internal_2
root --> |page|internal_3The above example has 2 keys, and 3 children.
Each key in the node points to a value on the heap (tuple),
or the primary key.
Max no of degrees, no of keys per node.
The parent must have (n - 1) keys,
where n is the number of children.
The nodes are ordered from left to right.
You cannot see this in the diagram,
each key in the node has a before and after pointer.
Before pointer is used to find the next node,
that has a value less than the following key.
Limitations of B-Trees
Because each node stores both keys and values,
the number of keys per node is limited.
The values are not used until the node is found.
This increases the number of I/O's required to find the node
(reading each node / page is an IO).
Range queries are slow, since we need to traverse for all the keys in the range.
B+ Tree
Nodes only store keys in internal and root nodes.
The values are only stored in the leaf nodes.
This allows us to store more keys per node.
Leaf nodes are linked, eacn node has a pointer to the next node linked list.
Range queries are much faster.
graph TD
root[5] --> root_1[3]
root --> root_2[7 9]
root_1 --> root_3[2]
root_1 --> root_4[4]
root_2 --> root_5[6]
root_2 --> root_6[8]
root_2 --> root_7[10]
root_3 --> leaf_1[1:row_id]
root_3 --> leaf_2[2]
root_4 --> leaf_3[3]
root_4 --> leaf_4[4]
root_5 --> leaf_5[5]
root_5 --> leaf_6[6]
root_6 --> leaf_7[7]
root_6 --> leaf_8[8]
root_7 --> leaf_9[9]
root_7 --> leaf_10[10,11]
subgraph linked list
leaf_1
leaf_2
leaf_3
leaf_4
leaf_5
leaf_6
leaf_7
leaf_8
leaf_9
leaf_10
endEach node usually store 100's of thousands of keys, above diagram is just an example.
Leaf nodes have duplicate keys.
Leaf pointers are cheap to traverse.
1 Node = 1 DBMS Page
Internal nodes fit in memory.
Leaf nodes can be stored in the heap.
Most DBMS uses B+ trees.
Storage cost in Postgres vs mySQL
Postgres points to the heap. mySQL points to the primary key.
Write amplification is higher in Postgres. (any update to the heap, needs to be replicated in the index).
If primary keys are large (UUID), inserts in mySQL will be slower.
(Since UUID's don't have a pattern.)
Leaf nodes in mySQL contains all the data (clustered index).
Database Partitioning
Partitioning types.
Range partitioning By date, customer id, etc.
By List By location, zip codes, etc.
Hash partitioning By hash of the primary key, etc. (consistent hashing)
Partitioning vs Sharding
In partitioning, the data is stored in the same DB. The client is agnostic.
Each shard sits in a seperate DB, the client needs to know which shard to query.
In partitioning, the table name changes.
In sharding, the table name is the same, but the DB ip changes.
Creating a partition
create table grades (id serial not null, g not null) partition by range (g);
create table g0035 (like grades including indexes);
create table g3570 (like grades including indexes);
create table g7080 (like grades including indexes);
create table g80100 (like grades including indexes);
alter table attach partition g0035 for values from (0) to (35);
alter table attach partition g3570 for values from (35) to (70);
alter table attach partition g7080 for values from (70) to (80);
alter table attach partition g80100 for values from (80) to (100);insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
create index on grades (g);or
insert into grades (g) select g from another_table;select max(g) from g0035; -- 34
select max(g) from g3570; -- 69
select max(g) from g7080; -- 79
select max(g) from g80100; -- 99Make sure ENABLE_PARTITION_PRUNING is set to on in
postgresql.conf.
use the show command to check.
show enable_partition_pruning;set enable_partition_pruning = on;Checking partition size
select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
pg_relation_size | relname
------------------+-----------------------------------------------
126935040 | g3570
126754816 | g0035
72572928 | g7090
36241408 | g90100
24305664 | g3570_g_idx
24272896 | g0035_g_idx
13910016 | g7090_g_idx
6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
Sharding
Pros of sharding
Horizontal scaling (Data / Memory / CPU).
Security (Access control).
Smaller Indexes.
Cons of sharding
Complex client side code. (need to know which shard to query)
Atomic transactions across shards are impossible.
Rollbacks are complicated AF.
Schema changes are harder (need to change all shards).
Joins are harder (need to query all shards).
Sharding key should be known.
When should we shard?
Sharding should be your last resort.
Partition first, then shard.
TCP connections are reaching the limit?
Try replication first.
Writes only to primary server.
Two way replication, two availabilty zones (multiple primary databases).
You cannot maintain ACID complaiance accross shards. (eventual consistency).
Sharding is an overkill for most applications.
Vitesse is a good sharding library for go.
Concurrency control
Exclusive lock's vs Shared lock's
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
gantt
title Timeline
section Alice
Exclusive lock :a1, 2020-01-10, 42h
section Bob
Shared lock :b1, 2020-01-11, 24h
section Charlie to Bob
Shared lock on Bob failed :crit, c1, 2020-01-11, 1h
Update bob :cirt c2, 2020-01-12, 24hDead locks
graph LR
A[Transaction 1]
B[T2]
l1{{Release lock 1 event}}
l2{{R2}}
A --> l1 --> B
B --> l2 --> Aflowchart TD
t1[Transaction 1]
t2[Transaction 2]
a1[Insert 10]
a2[Insert 20]
b1[Insert 20]
b2[Insert 10]
t1 --> a1 --> a2 --> wait --> commit
t2 --> b1 --> b2 --> rollback[rollback due to deadlock]Most DBMS's have a deadlock detection mechanism,
They will fail one of the transactions.
Two phase locking
The double booking phenomenon.
graph TD
t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
t2[Transaction 2] --> s2[id=10, isBooked=false] --> u2[id=10, isBooked=true, name=Steve] --> commit -->
status[id=10, isBooked=true, name=Steve]Both transactions read the same data, and updated it.
Solution, create a two phase locking mechanism.
graph LR
t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
t2[Transaction 2] --> hang --> s2[id=10, isBooked=true]begin transaction;
select * from seats where id = 10 for update; -- creates an exclusive lock
update seats set isBooked = true, name = 'Bob' where id = 10;
commit;begin transaction;
select * from seats where id = 10 for update;
-- hang
-- updated data is returnedfor UPDATE creates an exclusive lock on the row indefinitelyset lock_timeout = 1000; will timeout the lock after 1 second.
Alternate solution (not ideal)
flowchart LR
t1[Read committed transaction 1] --> s1[Update when clause] ---> commit
t2[Read committed transaction 2] --> s2[Update when clause] --> hang --> rollbackupdate seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;This solution is imperfect, You are at the mercy of the DBMS
implementation.
Postgres refreshes the row after the lock is released, the same cannot be said
for other DBMS's.
Always rely on explicit locks instead of relying on implicit locks.
flowchart TD
t1[update clause] --> |reading row on heap for lock info|heap[(Heap)] --> wait{{Wait for lock to be released}}
wait --> |refresh row|heap
wait --> resolveSQL pagination with offset is very slow
This is slow, because, DBMS needs to iterate through the first 10,00,010 rows,
and then filter out the last 10 rows.
select * from users order by id limit 10 offset 1000000;instead, make user query for more data using the last-seen-id.
select * from users where id > 1000000 order by id limit 10;Connection pooling.
For every connection, the DBMS needs to allocate resources, this takes time.
If we create a connection for every request, the latency and throughput will be very bad.
If the no of connections increase, the DBMS will overload and crash.
Connection pooling is a way to reuse connections.
The connections are kept alive, and reused.
Each connection is leased to a request, and returned to the pool after the request is complete.
graph LR
A[Request 1]
B[Request 2]
C((Connection pool))
C --> |lease|A
A --> |release|C
C --> |lease|B
B --> |release|C
C --> D[(DB ip:port)]Default connection pool size is 10, in postgres.
graph LR
db[(DB ip_address:port)]
instance_1((instance 1))
instance_2((instance 2))
instance_n((instance n))
lb[Load balancer]
request --> lb --> instance_1
lb --> instance_2
lb --> instance_n
instance_1 --> |10 connection pool|db
instance_2 --> |10 connection pool|db
instance_n --> |10 connection pool|dbEach instance keeps alive 10 connections to the DB.
Each instance can send upto 10 concurrent requests to the DB.
Each function invocation leases a connection from the pool.
Once the function is executed, the connection is returned to the pool.
const { Pool } = require('pg');
const pool = new Pool({
max: 20,
idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
connectionTimeoutMillis: 2000, // timeout after 2 seconds
});
await pool.query('SELECT something FROM sometable');We cannot begin a transaction on a pooled connection, transactions need a dedicated connection.
// asking the pool to give us a dedicated connection
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query('COMMIT');
} catch (e) {
await client.query('ROLLBACK');
} finally {
// release the connection to the pool
client.release();
}Database replication
Master slave replication
graph TD
query[C-ED queries] --> master[(Master)]
master --> |replication| slave_1[(replica 1)]
master --> |replication| slave_2[(replica 2)]
master --> |replication| slave_n[(replica n)]
read_query[R queries] -.-> slave_1
read_query -.-> slave_2
read_query -.-> slave_nMulti-Master replication
graph LR
query[C-ED queries]
master_1[(Master 1)]
master_2[(Master 2)]
slave_1[(replica 1)]
slave_2[(replica 2)]
slave_3[(replica 3)]
query --> master_1
query --> master_2
master_1 --> |replication| master_2
master_2 --> |replication| slave_1
master_2 --> |replication| slave_2
master_1 --> |replication| slave_3
read_query[R queries] -.-> slave_1
read_query -.-> slave_2
read_query -.-> slave_3Difficult to implement.
Need to deal with conflicts.
Not usually preferred over master-slave replication.
Synchronous replication
graph LR
query[C-ED queries] --> master[(Master)]
master[(Master)] --> |replication| slave_1[(replica 1)]
master --> |replication| slave_2[(replica 2)]
master --> |replication| slave_n[(replica 3)]
slave_1 -.-> |ack| master
slave_2 -.-> |ack| master
master -.-> |resolve| queryReplication must be successful before the query is resolved.
If the replication fails, the query fails.
Can specify the number of replicas that must acknowledge the replication.
(first 2 / first 1 / any)
Asynchronous replication
graph LR
query[C-ED queries] --> master[(Master)]
master[(Master)] -..-> |Async replication| slave_1[(replica 1)]
master -..-> |Async replication| slave_2[(replica 2)]
master -..-> |Async replication| slave_3[(replica 3)]
master --> |resolve| queryPros and cons of replication
Pros
Horizontal scaling
Availability zones
Cons
Eventual consistency
Slow writes in sync mode
Multi-master is hard to implement
Twitter system design
Post tweet 140 chars
Follow users
Timeline
flowchart LR
classDef cache fill:#FFFF00
classDef msg_queue fill:#f96
classDef purple fill:#800080
db[(Postgres)]
client_1[Request]
pub((Publisher))
sub((Subscriber))
lb((L Balancer))
message_queue[[Message queue / Kafka]]:::msg_queue
client_1 --> lb --> pub --> message_queue --> sub -->|pool| dbL4, LB (transparent proxy)
no TLS encryption handling.
L7, LB
handles TLS encryption.
Extremely large table.
1
2
URL shortener system design
Create a shortened URL.
Redirects to the original URL.
flowchart LR
classDef cache fill:#FFFF00
classDef msg_queue fill:#f96
classDef purple fill:#800080
db_replica[(PG replica)]
db[(Postgres)]
db_replica_1[(PG replica)]
client_1[Request]
web_server_1((Server))
reverse_proxy((Rev proxy))
db -.-> |replication| db_replica
db -.-> |replication| db_replica_1
client_1 --> reverse_proxy --> |POST: url| web_server_1 <-->|UPDATE| db
web_server_1 <--> |SELECT| db_replica
web_server_1 <--> |SELECT| db_replica_1primary key
Only shard when you are not able to handle writes
base64(sha(url, salt)[:8]) # short urlDatabase Engines / Storage Engines / Embedded Databases
A library that takes care of CRED operations on the actual disk.
Databases can build on top of existing database engines.
MariaDB / MySQL allows us to switch between DB engines.
Postgres does not allow us to switch between DB engines.
LSM Trees ( watch?v=I6jB0nM9SKU )
flowchart LR
classDef cache fill:#FFFF00
classDef msg_queue fill:#f96
classDef purple fill:#800080
memtable[(Memtable / B tree)]:::cache
subgraph SSTable
sstable_1[(SSTable)]
sstable_2[(SSTable 1)]
sstable_3[(SSTable ...)]
end
writes --> memtable -->|flush to disk| sstable_1
read -->|1| memtable
read -->|2| sstable_1
read -->|3| sstable_2
read -->|4| sstable_3Log structured merge trees.
Each change to an existing key/value pairs are appended to the Memtable.
When the Memtable is full, it is flushed to disk as a SSTable.
SSTable is immutable, and is stored in the disk.
SSTables are have keys sorted in order.
Any update / delete to a key, is appended to the Memtable.
Memtable is flushed to a new SSTable, when the Memtable is full.
Bloomfilters are used to check if a key exists in the SSTable before reading it.
Compaction is used to merge multiple SSTables into one. (routine background process, for reclaiming disk space)
Duplicate and stale keys are removed during compaction.
Database Cursors
Pros
Cursors create a temporary table in memory.
Cursors are used to iterate over a large number of rows. (without loading all the rows into memory)
Allows data to be streamed.
Can cancel the cursor at any time.
Paging can be done using cursors.
Cons
Cursors are stateful, if the connection is lost, the cursor is lost.
Cannot horizontally scale cursors.
Long running cursors can cause locking issues.
Database Security
Steps to enable SSL
Run pg admin
sudo docker run -e PGADMIN_DEFAULT_EMAIL="santhosh" -e PGADMIN_DEFAULT_PASSWORD="passwd" -p 80:80 -p 5555:80 --name pgadmin dpage/pgadmin4Install vim in postgres docker container
docker exec -it pg bash;
apt-get update;
apt-get install vim;Generate SSH key
openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365Update postgresql.conf
vim var/lib/postgresql/data/postgresql.conf
ssl = on
ssl_cert_file = 'cert.pem'
ssl_key_file = 'key.pem'Connect to the db using pg admin
What is the largest sql query we can send before it fails?
Create table if not exists.
Different users for CRED operations and Table owners.
Have different privileges for different users.
Normal users should not be able to DROP tables.
Different pools for different routes (routes that require read access only will use a different pool, logged in as normal user).
Best practices when working with databases
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Homomorphic Encryption
What is encryption?
Symmetric encryption
Same key is used to encrypt and decrypt.
AES, DES, 3DES, RC4, RC5, RC6, Blowfish, Twofish, etc.
Asymmetric encryption
Different keys are used to encrypt and decrypt.
Public key and private key pair.
Why can't we always encrypt data?
Queries cannot be executed on encrypted data.
Analysis, indexing, sorting, etc cannot be done on encrypted data.
Applications must read the data, decrypt it, and then perform the operation.
TLS termination at layer 7 load balancer requires the data to be decrypted. (needs to know the request, payload, etc)
What is homomorphic encryption?
Allows us to perform arithmetic operations on encrypted data.
No need to decrypt the data to perform operations.
Its damn slow, at the moment.
Databases can index and optimize without decrypting the data.
True end to end encryption.
Answering your questions
Why does posgres do a bit map heap scan, instead of an index only scan?
create index g_index on grades(g);
explain analyze select count(*) from grades where g = 1;Postgres needs to visit the heap to check if the row is visible to all transactions.
The visibility col is a hidden col in the heap for each row.
Visibility map is used to check if the row is visible to all transactions.
vaccum grades;vaccum command creates a visibility map.
What is cost in explain analyze?
Cost is the estimated cost of the query.
Its just a number that postgres uses.
Higher the cost, slower the query.
All Isolation levels
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 11
-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Duplicate indexes in a row
Adding an index to a column that has less unique values, is not useful. (Index selectivity) Index is not selective enough.
Index is bloated and takes up more space, (same value is repeated multiple times, in the b-tree)
Postgres 13 has a de-duplication feature, that removes duplicate values from the b-tree index. node - row_id[]
Why use SELECT FOR UPDATE?
Each transaction obtains a exclusive lock on the rows that it needs to read / update.
Other transactions cannot read / update the rows that where locked.
You killed concurrency, your throughput is low.
Your transactions will never fail due to conflicts.
Serialisable isolation level is better.
Detects conflicts and fails the transaction, if there is a serialisable error.
Your throughput is high, and you have concurrency.
Its a optimistic locking mechanism.
Your transactions can fail, due to conflicts.
You need to handle retry logics in your application.
Why use a connection pool?
If you are using one database connection for every request, you are doing it wrong.
TCP blocking could happen if a lot of queries need to be executed at the same time.
flowchart LR
client[Client1]
client_2[Client2]
server[Server]
db[DB]
client --> |request| server --> |get / set| db
client_2 --> |faster request| server --> |get / set| dbflowchart RL
classDef green fill:#ccffcc,stroke:#000,stroke-width:2px;
classDef red fill:#ffcccc,stroke:#000,stroke-width:2px;
client[Client1]:::green
client_2[Client2]:::red
server[Server]
db[DB]
db --> |fast response| server --> |fast response| client
db --> |slow response| server --> |slow response| client_2There is a chance of response misdirection.
Depends on the DB implementation, if the protocol used by the database does not use query tagging, the response could be misdirected.
But in case of a connection pool,
Each connection is leased to a request, and returned to the pool after the request is complete.
No parallel requests are made on the same connection.
What is the value of a bitmap index scan?
Row can be in multiple pages.
Creates a bit map on the first condition, if index is present for the same value. This creates a bit map of pages that satisfy the first condition.
Updates a bit map on the second condition.
Goes to the heap pulls those pages.
Filters out the page for the query condition
See, "Bitmap And / Or" for more info this file
If the query planner decides it has to scan a lot of rows, based on the query condition, it will decide to do an bitmap scan vs an index scan, to reduce IO.
Does composite index order matter?
Write amplifications, composite indexes touches a lot of cols, the index needs to be updated on row update.
Composite indexes are created from left to right.
Minimum filter first. (filter that results in a larger dataset first)
select a,b,c where a=10 and b=10 and c=20;
-- a is minimum filterIn an index only scan, pg still needs to visit the heap?
To check if the table is visible to the transaction.
To get old versions (snapshot) of the row, when the transaction started.
To avoid visits to the heap, use VACUUM command.
Avoid long running transactions.
Persistence vs Durability
Persistence is the ability to store data on disk and retrieve it at a later time after a crash.
My transaction is durable is only successful, when the data is stored on disk. Postgres and Redis use WAL's to flush data to disk periodically.
Can we create index when row's are being inserted / updated?
Most DB's to create index obtains a update / insert / delete lock on the entire table, until the index is created.
Postgres allows you to create index concurrently, without locking the table.
Graph databases
Database persists and retrieves data from disk.
IO, fetching pages from disk / heap is heavy.
More IO, more latency.
Storage engine, classic storage engines use tables and rows.
Postgres, does not store large strings in-line in the row, it compresses to an external TOAST table, and points to it.
Meta data can be stored in an external data-structure.
WAL, REDO, UNDO logs
When a transaction is committed, data is expected to be persistent even after system termination.
A transaction does changes to everything in memory.
On commit, all changes are flushed to disk. ( this is slow )
If databases crashes mid commit, what should happen to data?.
A log of all changes are persisted to disk. (WAL write ahead log).
Dirty pages are kept in memory.
Dirty pages (changed pages) are flushed to WAL.
DB comes back up, detects that WAL is ahead of the data on disk.
No one reads from the WAL, it's only used for recovery.
Page is loaded in memory, and the WAL is replayed ( WAL is also called as redo log ).
Page is flushed to disk, and WAL is marked as flushed.
How often should the WAL be flushed to disk? (checkpointing)
When the wall is full, we need to flush it to disk, before we can write to the wall again.
Checkpointing is a IO intensive operation, we need to pause everything until the checkpoint is done.
DB warns you before the checkpoint is initiated. (causes spikes in CPU usage)
If the WAL is small, the checkpoint is faster and more frequent.
OS buffers the writes in memory, and flushes it to disk when it's ready ( to reduce IO and wear on the disk ).
OS waits for the write cache to be full, before flushing it to disk.
This is a problem, because the WAL is not flushed to disk, and the OS crashes, the WAL is lost.
fsync() system call is used to flush WAL is flushed to disk directly bypassing the OS cache.
UNDO logs?
Not all databases have UNDO logs.
Rollback Transactions: If a transaction fails or is explicitly rolled back by the user, the UNDO log records are used to revert the changes made by that transaction.
Recovery: In the event of a system crash, UNDO logs are used during the recovery process to ensure that any incomplete transactions are undone, maintaining the consistency of the database.
Long running transactions, can cause the UNDO logs to grow, and for any new transactions to see the old data, the UNDO logs need to be replayed to get to the old state.
Undo logs are applied after wall changes are applied when DB is recovering from a crash.
Can Null values improve database performance?
Default values will take up space, eg (int 32bit will take up 32bits in the row even if the value like 0 does not take up 32 bits))
PostGres uses a null bitmap, to store null values. Every row has a null bitmap, that stores the null values in the row. Every bit corresponds to a column in the row.
If you have more than 8 columns, the null bitmap will increment by 8 bytes. | 8cols | 8 bits | | 9cols | 16 bits | | 17cols | 24 bits |
Nulls are better than default values, in terms of space.
Don't rely on nulls all the time, there are some gotcha's
select count(field);
ignores nulls based on db
select count(*);
counts nulls
T is null or T is not null, is fine.
Not all databases support indexes on cols with nulls in them. (PG allows it, but oracle doesn't)
Postgres limits each page size to 8kb.
We need to try to fit as many rows as possible in a page.
More IO's if the row is split across pages.
If most of the cols are null, the no of rows that can fit in a page can be increased.
There is a fixed cost of saving null's
Select user_id from users where user_id is null; will not use the index on user_id.
Postgres Architecture
Postmaster process
Everything in postgres is append only mutli version concurrency control.
Process vs Threads, (threads are faster, but postgres uses processes)
well established mind physical body, mind, emotions, life energies
Distributed transaction ( a transaction spanning across multiple databases / services / processes )
Each microservice has its own database.
Making a transaction across multiple microservices is hard.
flowchart LR
A[Order service] -->|success| B[Payment service]
B -->|failed| C[Order service needs to revert]
C -->|transactions cannot be reverted| D[Problem]Ongoing transactions where stored in memory, instead of disk
But for distributed transactions this is not the case
Google atomic clocks to sequence of transactions
Compensatory transactions, revert the previous transaction
Eventing model, Kafka / Rabbit MQ,
graph TD
A[Order queue] --> B[Payment service]
B --> C[Payment queue]
C --> D[Shipping service]
B --> E[Payment fail queue]
E --> F[...]Go back to the old days, have a macro service.
Debugging is hard in a distributed micro services.
Why did Uber migrate from postgres to mysql
ctid is a internal parameter used by postgres to identify for physical row identification,
it's not meant to be used as a primary key.
ctid's change with every row update and are not stable.
When a row is updated, PostgreSQL often creates a new version of the row with a different ctid. The index is updated to point to the new ctid. This is why indexes might get bloated over time with dead tuples, necessitating regular maintenance tasks like VACUUM or REINDEX.
This is called Write amplification.
WAL's are used to keep track of these changes.
WAL's are pushed to standby servers, to keep them in sync.
SSD's are not good for updates, they are good for inserts (RocksDB and LevelsDB, optimised for SSD's).
Adding too much indexes are a bad idea, it will cause write amplifications.
Uber had databases across multiple data centers, and they had to keep them in sync.
Push WAL updates across multiple data centers, is expensive (network wise).
Acid
What is a transaction? 🔁
Life cycle of a transaction
flowchart TD
start[DB Starting State] -->
begin[BEGIN Transaction] --> work[One unit of work]
work --> commit[COMMIT Transaction]
work --> rollback[ROLLBACK Transaction]
rollback --> startBegin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;balance - 100 > 0;UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;COMMIT TRANSACTION;Atomicity
All queries in a transaction must succeed,
or the transaction must be rolled back.
Even if 1/100 query fails,
the entire transaction must be rolled back.
It the database went down prior to a commit of a transaction,
the transaction must be rolled back.
flowchart TD
start[DB Starting Snapshot] -->
begin[BEGIN Transaction] --> update_1[Queries 5/100]
update_1 --> db_crash{DB CRASH}
db_crash -->|yes| cleanup[DB Cleanup / Rollback]
db_crash -->|no| COMMIT[COMMIT Transaction]
cleanup --> startIsolation
No title
Can my inflight transaction see the changes made by other transactions?
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] --> update_1[UPDATE Data]
initial_state ---> transaction_2[BEGIN transaction_2] --> |dirty read| read_1[SELECT Data]
update_1 --> |roolback| initial_stateNon-repeatable reads
Two different reads of the same data, return different results.
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] ---> update_1[UPDATE Data] -->
commit[commit]
transaction_2[BEGIN transaction_2] --> read_1[SELECT Data] --->
|same query diff result| read_2[SELECT Data]
initial_state --> transaction_2Avoid it by starting a repeatable read transaction,
begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] --> read_1[SELECT Data]
transaction_2[BEGIN transaction_2] --> update_1[INSERT Data] --> commit
read_1 ---> |same query has more results| read_2[SELECT Data]
initial_state --> transaction_2On postgres you can use the following to prevent phantom reads,
begin transaction isolation level repeatable read;
# or
begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
flowchart TD
initial_state[SUM 10] -->
transaction_1[BEGIN transaction_1] --> read_1[+20] --> read_12[30] --> commit
initial_state -->
transaction_2[BEGIN transaction_2] --> update_1[10] --> read_2[20] --->|lost update -10| commitIsolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Consistency
Consistency in data is user defined (DBA team)
Referential integrity (through foreign keys) data is
inconsistent across tables.
Atomicity, isolation affects consistency.
Data consistency between tables,
1
a
1
2
b
2
Mate
1
Mark
1
Jack
2
Consistency in reads,
flowchart LR
data --> |update| db[(DB)] --> |read| dataIf a transaction update a row,
will the change be immediately visible during a read?.
Both SQL and NoSQL databases suffer from this problem.
Eventual consistency
flowchart LR
data_v1 --> |update| db[(DB)] -->
|read| data_v2
db --> |read| data_vn
db --> |read| data_v1You will eventually get the latest version of the data.
flowchart LR
master_db[(Master - X)]
slave_1_db[(Follower - Z)]
slave_2_db[(Follower - Z)]
data -->|updated to x| master_db
master_db --> slave_1_db
master_db --> slave_2_db
slave_1_db -->|Inconsistent data| readflowchart LR
master_db[(Master - X)]
slave_1_db[(Follower - X)]
slave_2_db[(Follower - X)]
data --> master_db
master_db -->|replicate| slave_1_db
master_db -->|replicate| slave_2_db
slave_1_db -->|Consistent data| readWhen we introduce caching or we introduce horizontal scaling,
you are eventually consistent.
If you do not have atomicity,
or have inconsistent data,
you do not have consistency at all.
Durability
Process of persisting data to disk (non-volatile storage).
Durability slows write down,
write to disk is slower than write to memory
Durability techniques
WAL - Write ahead log / AOF - Append only file
Any change to the database is written to a log file, the log file is flushed to disk.
flowchart LR
data --> |update| db[(DB)] --> |write| log[(Log)] --> |flush| disk[(Disk)]flowchart LR
db[(DB)]
log[(Log)]
disk[(Disk)]
log --> db
disk --> db
db --> reconstructed_db[(DB reconstructed)]Writing all the changes to disk is expensive (indexes, data, files,
rows...)
The log file is a compressed version of the data stored in the database.
flowchart LR
db[(DB)] --> |flush request| os -->
os_cache[(OS Cache)] -->
|batched flushes| disk[(Disk)]
db --> |fSync| diskfSync forces the OS to flush the data to disk.
This prevents data loss in case of OS crash.
Async snapshots
flowchart LR
data --> |update| db[(DB)] --> |async| snapshot[(Snapshot)] --> |flush| disk[(Disk)]
db --> |sync| snapshot[(Snapshot)]Asynchronously write the data to disk
Serializability vs Repeatable reads
Repeatable reads
flowchart LR
transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit
transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit -->
result[a, a, b, b]both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Serializability
flowchart LR
transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit --> result[a, a, b, b]
transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit_error[Could not serialize transaction]What is a transaction? 🔁
Life cycle of a transaction
Rendering diagram...
Begin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;balance - 100 > 0;UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;1COMMIT TRANSACTION;Life cycle of a transaction
Rendering diagram...
Begin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Rendering diagram...
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;1COMMIT TRANSACTION;1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;1COMMIT TRANSACTION;Atomicity
All queries in a transaction must succeed,
or the transaction must be rolled back.
Even if 1/100 query fails,
the entire transaction must be rolled back.
It the database went down prior to a commit of a transaction,
the transaction must be rolled back.
Rendering diagram...
Rendering diagram...
Isolation
No title
Can my inflight transaction see the changes made by other transactions?
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
begin transaction isolation level repeatable read;
# or
begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
flowchart TD
initial_state[SUM 10] -->
transaction_1[BEGIN transaction_1] --> read_1[+20] --> read_12[30] --> commit
initial_state -->
transaction_2[BEGIN transaction_2] --> update_1[10] --> read_2[20] --->|lost update -10| commitIsolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
1begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
Rendering diagram...
Rendering diagram...
Rendering diagram...
1begin transaction isolation level repeatable read;Rendering diagram...
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;1begin transaction isolation level serializable;Rendering diagram...
Isolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Consistency
Consistency in data is user defined (DBA team)
Referential integrity (through foreign keys) data is
inconsistent across tables.
Atomicity, isolation affects consistency.
Data consistency between tables,
1
a
1
2
b
2
Mate
1
Mark
1
Jack
2
Consistency in reads,
Rendering diagram...
If a transaction update a row,
will the change be immediately visible during a read?.
Both SQL and NoSQL databases suffer from this problem.
Eventual consistency
Rendering diagram...
You will eventually get the latest version of the data.
Rendering diagram...
Rendering diagram...
When we introduce caching or we introduce horizontal scaling,
you are eventually consistent.
If you do not have atomicity,
or have inconsistent data,
you do not have consistency at all.
Rendering diagram...
Rendering diagram...
Rendering diagram...
Rendering diagram...
Durability
Process of persisting data to disk (non-volatile storage).
Durability slows write down,
write to disk is slower than write to memory
Durability techniques
WAL - Write ahead log / AOF - Append only file
Any change to the database is written to a log file, the log file is flushed to disk.
Rendering diagram...
Rendering diagram...
Writing all the changes to disk is expensive (indexes, data, files,
rows...)
The log file is a compressed version of the data stored in the database.
Rendering diagram...
fSync forces the OS to flush the data to disk.
This prevents data loss in case of OS crash.
Async snapshots
Rendering diagram...
Asynchronously write the data to disk
Rendering diagram...
Rendering diagram...
Rendering diagram...
Rendering diagram...
Serializability vs Repeatable reads
Repeatable reads
Rendering diagram...
both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Serializability
Rendering diagram...
Repeatable reads
Rendering diagram...
both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Rendering diagram...
Serializability
Rendering diagram...
Rendering diagram...
Database Internals
How tables and indexes are stored on disk?
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Row vs Column oriented databases
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
flowchart TD
page_1[1-200,name,age]
page_2[200-400,name,age]
page_3[400-500,name,age]
sum[sum age] --> |fetch| page_1 --> |fetch| page_2 --> |fetch| page_3Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
flowchart TD
page_1[age: 1:row_id-500:row_id]
sum[sum age] --> |fetch| page_1To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Primary key vs Secondary key
There is no order maintained by default in the table,
Clustering, order enforced by the database on a table,
through a primary key or secondary key.
Postgres Clustered table / heap organised table, mySQL Index ordered table
Primary key is order inforced on the table itself,
Secondary key is order inforced through an index table on the actual
table.
Adding rows in a secondary key enforced table is slow,
since you need to update the secondary index table.
Buffer pool
flowchart LR
heap[(Heap)]
index[(Index)]
buffer_pool[(Buffer pool)]
wal[(WAL)]
read_query --> |select|index --> |select|heap --> |page,row|buffer_pool --> |row|wal
update_query --> buffer_pool --> walTo reduce the number of IO's,
all row reads are stored in a buffer pool temporarily,
this allows for subsequent reads / writes to be served from the
buffer pool,
thus reducing the number of IO's.
How tables and indexes are stored on disk?
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Rendering diagram...
Row vs Column oriented databases
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
Rendering diagram...
Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
Rendering diagram...
To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
Rendering diagram...
Rendering diagram...
Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
Rendering diagram...
To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Rendering diagram...
Primary key vs Secondary key
There is no order maintained by default in the table,
Clustering, order enforced by the database on a table,
through a primary key or secondary key.
Postgres Clustered table / heap organised table, mySQL Index ordered table
Primary key is order inforced on the table itself,
Secondary key is order inforced through an index table on the actual
table.
Adding rows in a secondary key enforced table is slow,
since you need to update the secondary index table.
Buffer pool
Rendering diagram...
To reduce the number of IO's,
all row reads are stored in a buffer pool temporarily,
this allows for subsequent reads / writes to be served from the
buffer pool,
thus reducing the number of IO's.
Rendering diagram...
Database Indexing
Creating table with random vaues.
insert into temp (t) select random()*100 from generate_series(0,100000)Explaining the explain command
Returns statistics about the query plan,
explain select * from temp where t > 50;
Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Scan types ( Seq vs Index )
Cost Time to start executing the query,
Time to finish executing the query
Rows
Number of rows return
Width
Avg size of returned rows in bytes
Creating an index
create index temp_t_idx on temp(t);Without index scan,
Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)With index scan,
Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)If info is available in the index, the database will not do a heap lookup.
Dont ask for more info than you need, ask for stuff in the index if possible.
With an index search,
Time to prepare increases 0.00 vs 0.29
Time to execute decreases
Number of rows returned visited decreases
Seq table scan vs Index scan vs Bitmap index scan (postgres)
No title
Even when a index is present, the database might choose to do a sequential scan,
or a bitmap index scan, based on the no of rows that needs to be scanned.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
explain select id from t_records where t > 10;
Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
Filter: (t > 10)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select --> heap --> resolveIndex
explain select id from t_records where t > 5000;
Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
Index Cond: (t > 5000)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select -->| Index Scan | index --> heap --> resolveBitmap
If the no of rows to be scanned is too high, and the
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
select --> bitmap --> recheck --> heapbitmap is a bit array of the index values, pointing to the page on heap.
pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
explain select id from t_records where t > 4000;
Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
Recheck Cond: (t > 4000)
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
Index Cond: (t > 4000)
(4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Bitmap And / Or
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
where_clause_1 --> bitmap --> recheck --> heap
where_clause_2 --> bitmapcondition_1 = [0, 0, 1, 0, 0, 1, ...]
condition_2 = [0, 0, 1, 0, 1, 0, ...]
bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
create index a_idx on temp(a) include (b,c);flowchart LR
index[(Index)]
select -->| Index only Scan | index --> resolveAdding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
Composite / Combined Indexes
create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
select * from temp where a > 1 and b < 3 and c > 5;flowchart LR
index[(a,b,c)]
select --> |index only scan|index --> resolveWhen should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Concurrent Index Creation's
Waits for all transactions to finish, before creating the index.
Does not lock the table read's or writes.
Takes longer time to execute.
Can crash if inserting duplicate values, while creating the index.
create index concurrently a_idx on temp(a);Bloom Filters
flowchart LR
client[Client]
server[Server]
bloom_filter[(Bloom Filter)]
db[(DB)]
client --> |request| server <--> |get / set| bloom_filter
server --> |request| dbA bloom filter is a probabilistic data structure, that can be used to check if an element is in a set.
# bloom filter
filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
filter_posstion = hash("req") % len(filter)
if filter[filter_posstion] == 1:
# maybe in set
return request_db()
# set bit
filter[filter_posstion] = 1Working with Billion row tables
Why do we need a billion row table? We have many ways to solve a problem.
Avoid having a billion row.
If thats not possible,
Can you index it?
Can you partition it in the same disk.
Can you shard horizontally on different disks, on multiple DB's.
Map Reduce (not suitabale for transactions).
Creating table with random vaues.
1insert into temp (t) select random()*100 from generate_series(0,100000)1insert into temp (t) select random()*100 from generate_series(0,100000)Explaining the explain command
Returns statistics about the query plan,
1explain select * from temp where t > 50;
2Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Scan types ( Seq vs Index )
Cost Time to start executing the query,
Time to finish executing the query
Rows
Number of rows return
Width
Avg size of returned rows in bytes
1explain select * from temp where t > 50;
2Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Creating an index
1create index temp_t_idx on temp(t);Without index scan,
1Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)With index scan,
1Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)If info is available in the index, the database will not do a heap lookup.
Dont ask for more info than you need, ask for stuff in the index if possible.
With an index search,
Time to prepare increases 0.00 vs 0.29
Time to execute decreases
Number of rows returned visited decreases
1create index temp_t_idx on temp(t);1Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)1Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)Seq table scan vs Index scan vs Bitmap index scan (postgres)
No title
Even when a index is present, the database might choose to do a sequential scan,
or a bitmap index scan, based on the no of rows that needs to be scanned.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
explain select id from t_records where t > 10;
Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
Filter: (t > 10)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select --> heap --> resolveIndex
explain select id from t_records where t > 5000;
Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
Index Cond: (t > 5000)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select -->| Index Scan | index --> heap --> resolveBitmap
If the no of rows to be scanned is too high, and the
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
select --> bitmap --> recheck --> heapbitmap is a bit array of the index values, pointing to the page on heap.
pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
explain select id from t_records where t > 4000;
Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
Recheck Cond: (t > 4000)
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
Index Cond: (t > 4000)
(4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Bitmap And / Or
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
where_clause_1 --> bitmap --> recheck --> heap
where_clause_2 --> bitmapcondition_1 = [0, 0, 1, 0, 0, 1, ...]
condition_2 = [0, 0, 1, 0, 1, 0, ...]
bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
create index a_idx on temp(a) include (b,c);flowchart LR
index[(Index)]
select -->| Index only Scan | index --> resolveAdding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
Composite / Combined Indexes
create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
select * from temp where a > 1 and b < 3 and c > 5;flowchart LR
index[(a,b,c)]
select --> |index only scan|index --> resolveWhen should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
Index
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
Bitmap
If the no of rows to be scanned is too high, and the
Rendering diagram...
bitmap is a bit array of the index values, pointing to the page on heap.
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Rendering diagram...
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the condition1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Bitmap And / Or
Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)1explain select id from t_records where t > 5000 and t < 1000;Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditions1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
1vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
1vacuum t_records;Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Adding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Composite / Combined Indexes
1create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
When should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
1create index a_b_c_combined_idx on temp(a,b,c);1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Concurrent Index Creation's
Waits for all transactions to finish, before creating the index.
Does not lock the table read's or writes.
Takes longer time to execute.
Can crash if inserting duplicate values, while creating the index.
1create index concurrently a_idx on temp(a);1create index concurrently a_idx on temp(a);Bloom Filters
Rendering diagram...
A bloom filter is a probabilistic data structure, that can be used to check if an element is in a set.
1# bloom filter
2filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
3filter_posstion = hash("req") % len(filter)
4
5if filter[filter_posstion] == 1:
6 # maybe in set
7 return request_db()
8
9# set bit
10filter[filter_posstion] = 1Rendering diagram...
1# bloom filter
2filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
3filter_posstion = hash("req") % len(filter)
4
5if filter[filter_posstion] == 1:
6 # maybe in set
7 return request_db()
8
9# set bit
10filter[filter_posstion] = 1Working with Billion row tables
Why do we need a billion row table? We have many ways to solve a problem.
Avoid having a billion row.
If thats not possible,
Can you index it?
Can you partition it in the same disk.
Can you shard horizontally on different disks, on multiple DB's.
Map Reduce (not suitabale for transactions).
B and B+ Trees
Full Table Scan
Reading the entire table, to find the rows that satisfy the query.
Reading pages from disk, that are not in memory.
Requires a lot of I/O to read the pages. (8KB - 16KB per page)
DBMS's implement multi threading to read pages in parallel.
B-Tree (Original)
B trees are not binary trees (each node can have more than 2 children),
B trees stands for balanced tree.(distance from root to any leaf node is
the same)
Each node in a B tree contains multiple keys that form a page.
Rendering diagram...
The above example has 2 keys, and 3 children.
Each key in the node points to a value on the heap (tuple),
or the primary key.
Max no of degrees, no of keys per node.
The parent must have (n - 1) keys,
where n is the number of children.
The nodes are ordered from left to right.
You cannot see this in the diagram,
each key in the node has a before and after pointer.
Before pointer is used to find the next node,
that has a value less than the following key.
Limitations of B-Trees
Because each node stores both keys and values,
the number of keys per node is limited.
The values are not used until the node is found.
This increases the number of I/O's required to find the node
(reading each node / page is an IO).
Range queries are slow, since we need to traverse for all the keys in the range.
B+ Tree
Nodes only store keys in internal and root nodes.
The values are only stored in the leaf nodes.
This allows us to store more keys per node.
Leaf nodes are linked, eacn node has a pointer to the next node linked list.
Range queries are much faster.
Rendering diagram...
Each node usually store 100's of thousands of keys, above diagram is just an example.
Leaf nodes have duplicate keys.
Leaf pointers are cheap to traverse.
1 Node = 1 DBMS Page
Internal nodes fit in memory.
Leaf nodes can be stored in the heap.
Most DBMS uses B+ trees.
Storage cost in Postgres vs mySQL
Postgres points to the heap. mySQL points to the primary key.
Write amplification is higher in Postgres. (any update to the heap, needs to be replicated in the index).
If primary keys are large (UUID), inserts in mySQL will be slower.
(Since UUID's don't have a pattern.)
Leaf nodes in mySQL contains all the data (clustered index).
Full Table Scan
Reading the entire table, to find the rows that satisfy the query.
Reading pages from disk, that are not in memory.
Requires a lot of I/O to read the pages. (8KB - 16KB per page)
DBMS's implement multi threading to read pages in parallel.
B-Tree (Original)
B trees are not binary trees (each node can have more than 2 children),
B trees stands for balanced tree.(distance from root to any leaf node is
the same)
Each node in a B tree contains multiple keys that form a page.
Rendering diagram...
The above example has 2 keys, and 3 children.
Each key in the node points to a value on the heap (tuple),
or the primary key.
Max no of degrees, no of keys per node.
The parent must have (n - 1) keys,
where n is the number of children.
The nodes are ordered from left to right.
You cannot see this in the diagram,
each key in the node has a before and after pointer.
Before pointer is used to find the next node,
that has a value less than the following key.
Rendering diagram...
Limitations of B-Trees
Because each node stores both keys and values,
the number of keys per node is limited.
The values are not used until the node is found.
This increases the number of I/O's required to find the node
(reading each node / page is an IO).
Range queries are slow, since we need to traverse for all the keys in the range.
B+ Tree
Nodes only store keys in internal and root nodes.
The values are only stored in the leaf nodes.
This allows us to store more keys per node.
Leaf nodes are linked, eacn node has a pointer to the next node linked list.
Range queries are much faster.
Rendering diagram...
Each node usually store 100's of thousands of keys, above diagram is just an example.
Leaf nodes have duplicate keys.
Leaf pointers are cheap to traverse.
1 Node = 1 DBMS Page
Internal nodes fit in memory.
Leaf nodes can be stored in the heap.
Most DBMS uses B+ trees.
Rendering diagram...
Storage cost in Postgres vs mySQL
Postgres points to the heap. mySQL points to the primary key.
Write amplification is higher in Postgres. (any update to the heap, needs to be replicated in the index).
If primary keys are large (UUID), inserts in mySQL will be slower.
(Since UUID's don't have a pattern.)
Leaf nodes in mySQL contains all the data (clustered index).
Database Partitioning
Partitioning types.
Range partitioning By date, customer id, etc.
By List By location, zip codes, etc.
Hash partitioning By hash of the primary key, etc. (consistent hashing)
Partitioning vs Sharding
In partitioning, the data is stored in the same DB. The client is agnostic.
Each shard sits in a seperate DB, the client needs to know which shard to query.
In partitioning, the table name changes.
In sharding, the table name is the same, but the DB ip changes.
Creating a partition
create table grades (id serial not null, g not null) partition by range (g);
create table g0035 (like grades including indexes);
create table g3570 (like grades including indexes);
create table g7080 (like grades including indexes);
create table g80100 (like grades including indexes);
alter table attach partition g0035 for values from (0) to (35);
alter table attach partition g3570 for values from (35) to (70);
alter table attach partition g7080 for values from (70) to (80);
alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
2create index on grades (g);or
insert into grades (g) select g from another_table;select max(g) from g0035; -- 34
select max(g) from g3570; -- 69
select max(g) from g7080; -- 79
select max(g) from g80100; -- 99Make sure ENABLE_PARTITION_PRUNING is set to on in
postgresql.conf.
use the show command to check.
show enable_partition_pruning;set enable_partition_pruning = on;Checking partition size
select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
pg_relation_size | relname
------------------+-----------------------------------------------
126935040 | g3570
126754816 | g0035
72572928 | g7090
36241408 | g90100
24305664 | g3570_g_idx
24272896 | g0035_g_idx
13910016 | g7090_g_idx
6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
Partitioning types.
Range partitioning By date, customer id, etc.
By List By location, zip codes, etc.
Hash partitioning By hash of the primary key, etc. (consistent hashing)
Partitioning vs Sharding
In partitioning, the data is stored in the same DB. The client is agnostic.
Each shard sits in a seperate DB, the client needs to know which shard to query.
In partitioning, the table name changes.
In sharding, the table name is the same, but the DB ip changes.
Creating a partition
create table grades (id serial not null, g not null) partition by range (g);
create table g0035 (like grades including indexes);
create table g3570 (like grades including indexes);
create table g7080 (like grades including indexes);
create table g80100 (like grades including indexes);
alter table attach partition g0035 for values from (0) to (35);
alter table attach partition g3570 for values from (35) to (70);
alter table attach partition g7080 for values from (70) to (80);
alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
2create index on grades (g);or
1insert into grades (g) select g from another_table;select max(g) from g0035; -- 34
select max(g) from g3570; -- 69
select max(g) from g7080; -- 79
select max(g) from g80100; -- 99Make sure ENABLE_PARTITION_PRUNING is set to on in
postgresql.conf.
use the show command to check.
1show enable_partition_pruning;set enable_partition_pruning = on;1create table grades (id serial not null, g not null) partition by range (g);
2
3create table g0035 (like grades including indexes);
4create table g3570 (like grades including indexes);
5create table g7080 (like grades including indexes);
6create table g80100 (like grades including indexes);
7
8alter table attach partition g0035 for values from (0) to (35);
9alter table attach partition g3570 for values from (35) to (70);
10alter table attach partition g7080 for values from (70) to (80);
11alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
2create index on grades (g);1insert into grades (g) select g from another_table;1select max(g) from g0035; -- 34
2select max(g) from g3570; -- 69
3select max(g) from g7080; -- 79
4select max(g) from g80100; -- 991show enable_partition_pruning;1set enable_partition_pruning = on;Checking partition size
No title
1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
No title
1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idx1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
Sharding
Pros of sharding
Horizontal scaling (Data / Memory / CPU).
Security (Access control).
Smaller Indexes.
Cons of sharding
Complex client side code. (need to know which shard to query)
Atomic transactions across shards are impossible.
Rollbacks are complicated AF.
Schema changes are harder (need to change all shards).
Joins are harder (need to query all shards).
Sharding key should be known.
When should we shard?
Sharding should be your last resort.
Partition first, then shard.
TCP connections are reaching the limit?
Try replication first.
Writes only to primary server.
Two way replication, two availabilty zones (multiple primary databases).
You cannot maintain ACID complaiance accross shards. (eventual consistency).
Sharding is an overkill for most applications.
Vitesse is a good sharding library for go.
Cons of sharding
Complex client side code. (need to know which shard to query)
Atomic transactions across shards are impossible.
Rollbacks are complicated AF.
Schema changes are harder (need to change all shards).
Joins are harder (need to query all shards).
Sharding key should be known.
When should we shard?
Sharding should be your last resort.
Partition first, then shard.
TCP connections are reaching the limit?
Try replication first.
Writes only to primary server.
Two way replication, two availabilty zones (multiple primary databases).
You cannot maintain ACID complaiance accross shards. (eventual consistency).
Sharding is an overkill for most applications.
Vitesse is a good sharding library for go.
Concurrency control
Exclusive lock's vs Shared lock's
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
gantt
title Timeline
section Alice
Exclusive lock :a1, 2020-01-10, 42h
section Bob
Shared lock :b1, 2020-01-11, 24h
section Charlie to Bob
Shared lock on Bob failed :crit, c1, 2020-01-11, 1h
Update bob :cirt c2, 2020-01-12, 24hDead locks
graph LR
A[Transaction 1]
B[T2]
l1{{Release lock 1 event}}
l2{{R2}}
A --> l1 --> B
B --> l2 --> Aflowchart TD
t1[Transaction 1]
t2[Transaction 2]
a1[Insert 10]
a2[Insert 20]
b1[Insert 20]
b2[Insert 10]
t1 --> a1 --> a2 --> wait --> commit
t2 --> b1 --> b2 --> rollback[rollback due to deadlock]Most DBMS's have a deadlock detection mechanism,
They will fail one of the transactions.
Two phase locking
The double booking phenomenon.
graph TD
t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
t2[Transaction 2] --> s2[id=10, isBooked=false] --> u2[id=10, isBooked=true, name=Steve] --> commit -->
status[id=10, isBooked=true, name=Steve]Both transactions read the same data, and updated it.
Solution, create a two phase locking mechanism.
graph LR
t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
t2[Transaction 2] --> hang --> s2[id=10, isBooked=true]begin transaction;
select * from seats where id = 10 for update; -- creates an exclusive lock
update seats set isBooked = true, name = 'Bob' where id = 10;
commit;begin transaction;
select * from seats where id = 10 for update;
-- hang
-- updated data is returnedfor UPDATE creates an exclusive lock on the row indefinitelyset lock_timeout = 1000; will timeout the lock after 1 second.
Alternate solution (not ideal)
flowchart LR
t1[Read committed transaction 1] --> s1[Update when clause] ---> commit
t2[Read committed transaction 2] --> s2[Update when clause] --> hang --> rollbackupdate seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;This solution is imperfect, You are at the mercy of the DBMS
implementation.
Postgres refreshes the row after the lock is released, the same cannot be said
for other DBMS's.
Always rely on explicit locks instead of relying on implicit locks.
flowchart TD
t1[update clause] --> |reading row on heap for lock info|heap[(Heap)] --> wait{{Wait for lock to be released}}
wait --> |refresh row|heap
wait --> resolveSQL pagination with offset is very slow
This is slow, because, DBMS needs to iterate through the first 10,00,010 rows,
and then filter out the last 10 rows.
select * from users order by id limit 10 offset 1000000;instead, make user query for more data using the last-seen-id.
select * from users where id > 1000000 order by id limit 10;Connection pooling.
For every connection, the DBMS needs to allocate resources, this takes time.
If we create a connection for every request, the latency and throughput will be very bad.
If the no of connections increase, the DBMS will overload and crash.
Connection pooling is a way to reuse connections.
The connections are kept alive, and reused.
Each connection is leased to a request, and returned to the pool after the request is complete.
graph LR
A[Request 1]
B[Request 2]
C((Connection pool))
C --> |lease|A
A --> |release|C
C --> |lease|B
B --> |release|C
C --> D[(DB ip:port)]Default connection pool size is 10, in postgres.
graph LR
db[(DB ip_address:port)]
instance_1((instance 1))
instance_2((instance 2))
instance_n((instance n))
lb[Load balancer]
request --> lb --> instance_1
lb --> instance_2
lb --> instance_n
instance_1 --> |10 connection pool|db
instance_2 --> |10 connection pool|db
instance_n --> |10 connection pool|dbEach instance keeps alive 10 connections to the DB.
Each instance can send upto 10 concurrent requests to the DB.
Each function invocation leases a connection from the pool.
Once the function is executed, the connection is returned to the pool.
const { Pool } = require('pg');
const pool = new Pool({
max: 20,
idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
connectionTimeoutMillis: 2000, // timeout after 2 seconds
});
await pool.query('SELECT something FROM sometable');We cannot begin a transaction on a pooled connection, transactions need a dedicated connection.
// asking the pool to give us a dedicated connection
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query('COMMIT');
} catch (e) {
await client.query('ROLLBACK');
} finally {
// release the connection to the pool
client.release();
}Exclusive lock's vs Shared lock's
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
Rendering diagram...
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
Rendering diagram...
Rendering diagram...
Dead locks
Rendering diagram...
Rendering diagram...
Most DBMS's have a deadlock detection mechanism,
They will fail one of the transactions.
Rendering diagram...
Rendering diagram...
Two phase locking
The double booking phenomenon.
Rendering diagram...
Both transactions read the same data, and updated it.
Solution, create a two phase locking mechanism.
Rendering diagram...
1begin transaction;
2select * from seats where id = 10 for update; -- creates an exclusive lock
3update seats set isBooked = true, name = 'Bob' where id = 10;
4commit;1begin transaction;
2select * from seats where id = 10 for update;
3-- hang
4-- updated data is returnedfor UPDATE creates an exclusive lock on the row indefinitelyset lock_timeout = 1000; will timeout the lock after 1 second.
Rendering diagram...
Rendering diagram...
1begin transaction;
2select * from seats where id = 10 for update; -- creates an exclusive lock
3update seats set isBooked = true, name = 'Bob' where id = 10;
4commit;1begin transaction;
2select * from seats where id = 10 for update;
3-- hang
4-- updated data is returnedAlternate solution (not ideal)
Rendering diagram...
1update seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;This solution is imperfect, You are at the mercy of the DBMS
implementation.
Postgres refreshes the row after the lock is released, the same cannot be said
for other DBMS's.
Always rely on explicit locks instead of relying on implicit locks.
Rendering diagram...
Rendering diagram...
1update seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;Rendering diagram...
SQL pagination with offset is very slow
This is slow, because, DBMS needs to iterate through the first 10,00,010 rows,
and then filter out the last 10 rows.
1select * from users order by id limit 10 offset 1000000;instead, make user query for more data using the last-seen-id.
1select * from users where id > 1000000 order by id limit 10;1select * from users order by id limit 10 offset 1000000;1select * from users where id > 1000000 order by id limit 10;Connection pooling.
For every connection, the DBMS needs to allocate resources, this takes time.
If we create a connection for every request, the latency and throughput will be very bad.
If the no of connections increase, the DBMS will overload and crash.
Connection pooling is a way to reuse connections.
The connections are kept alive, and reused.
Each connection is leased to a request, and returned to the pool after the request is complete.
Rendering diagram...
Default connection pool size is 10, in postgres.
Rendering diagram...
Each instance keeps alive 10 connections to the DB.
Each instance can send upto 10 concurrent requests to the DB.
Each function invocation leases a connection from the pool.
Once the function is executed, the connection is returned to the pool.
1const { Pool } = require('pg');
2const pool = new Pool({
3 max: 20,
4 idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
5 connectionTimeoutMillis: 2000, // timeout after 2 seconds
6});
7
8await pool.query('SELECT something FROM sometable');We cannot begin a transaction on a pooled connection, transactions need a dedicated connection.
// asking the pool to give us a dedicated connection
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query('COMMIT');
} catch (e) {
await client.query('ROLLBACK');
} finally {
// release the connection to the pool
client.release();
}Rendering diagram...
Rendering diagram...
1const { Pool } = require('pg');
2const pool = new Pool({
3 max: 20,
4 idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
5 connectionTimeoutMillis: 2000, // timeout after 2 seconds
6});
7
8await pool.query('SELECT something FROM sometable');1// asking the pool to give us a dedicated connection
2const client = await pool.connect();
3try {
4 await client.query('BEGIN');
5 await client.query('COMMIT');
6} catch (e) {
7 await client.query('ROLLBACK');
8} finally {
9 // release the connection to the pool
10 client.release();
11}Database replication
Master slave replication
Rendering diagram...
Multi-Master replication
Rendering diagram...
Difficult to implement.
Need to deal with conflicts.
Not usually preferred over master-slave replication.
Synchronous replication
Rendering diagram...
Replication must be successful before the query is resolved.
If the replication fails, the query fails.
Can specify the number of replicas that must acknowledge the replication.
(first 2 / first 1 / any)
Asynchronous replication
Rendering diagram...
Pros and cons of replication
Pros
Horizontal scaling
Availability zones
Cons
Eventual consistency
Slow writes in sync mode
Multi-master is hard to implement
Master slave replication
Rendering diagram...
Rendering diagram...
Multi-Master replication
Rendering diagram...
Difficult to implement.
Need to deal with conflicts.
Not usually preferred over master-slave replication.
Rendering diagram...
Synchronous replication
Rendering diagram...
Replication must be successful before the query is resolved.
If the replication fails, the query fails.
Can specify the number of replicas that must acknowledge the replication.
(first 2 / first 1 / any)
Rendering diagram...
Asynchronous replication
Rendering diagram...
Rendering diagram...
Twitter system design
Post tweet 140 chars
Follow users
Timeline
Rendering diagram...
L4, LB (transparent proxy)
no TLS encryption handling.
L7, LB
handles TLS encryption.
Extremely large table.
1
2
Rendering diagram...
URL shortener system design
Create a shortened URL.
Redirects to the original URL.
Rendering diagram...
primary key
Only shard when you are not able to handle writes
1base64(sha(url, salt)[:8]) # short urlRendering diagram...
1base64(sha(url, salt)[:8]) # short urlDatabase Engines / Storage Engines / Embedded Databases
A library that takes care of CRED operations on the actual disk.
Databases can build on top of existing database engines.
MariaDB / MySQL allows us to switch between DB engines.
Postgres does not allow us to switch between DB engines.
LSM Trees ( watch?v=I6jB0nM9SKU )
Rendering diagram...
Log structured merge trees.
Each change to an existing key/value pairs are appended to the Memtable.
When the Memtable is full, it is flushed to disk as a SSTable.
SSTable is immutable, and is stored in the disk.
SSTables are have keys sorted in order.
Any update / delete to a key, is appended to the Memtable.
Memtable is flushed to a new SSTable, when the Memtable is full.
Bloomfilters are used to check if a key exists in the SSTable before reading it.
Compaction is used to merge multiple SSTables into one. (routine background process, for reclaiming disk space)
Duplicate and stale keys are removed during compaction.
Rendering diagram...
Database Cursors
Pros
Cursors create a temporary table in memory.
Cursors are used to iterate over a large number of rows. (without loading all the rows into memory)
Allows data to be streamed.
Can cancel the cursor at any time.
Paging can be done using cursors.
Cons
Cursors are stateful, if the connection is lost, the cursor is lost.
Cannot horizontally scale cursors.
Long running cursors can cause locking issues.
Pros
Cursors create a temporary table in memory.
Cursors are used to iterate over a large number of rows. (without loading all the rows into memory)
Allows data to be streamed.
Can cancel the cursor at any time.
Paging can be done using cursors.
Cons
Cursors are stateful, if the connection is lost, the cursor is lost.
Cannot horizontally scale cursors.
Long running cursors can cause locking issues.
Database Security
Steps to enable SSL
Run pg admin
1sudo docker run -e PGADMIN_DEFAULT_EMAIL="santhosh" -e PGADMIN_DEFAULT_PASSWORD="passwd" -p 80:80 -p 5555:80 --name pgadmin dpage/pgadmin4Install vim in postgres docker container
docker exec -it pg bash;
apt-get update;
apt-get install vim;Generate SSH key
openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365Update postgresql.conf
vim var/lib/postgresql/data/postgresql.conf
ssl = on
ssl_cert_file = 'cert.pem'
ssl_key_file = 'key.pem'Connect to the db using pg admin
What is the largest sql query we can send before it fails?
Create table if not exists.
Different users for CRED operations and Table owners.
Have different privileges for different users.
Normal users should not be able to DROP tables.
Different pools for different routes (routes that require read access only will use a different pool, logged in as normal user).
Best practices when working with databases
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Steps to enable SSL
Run pg admin
1sudo docker run -e PGADMIN_DEFAULT_EMAIL="santhosh" -e PGADMIN_DEFAULT_PASSWORD="passwd" -p 80:80 -p 5555:80 --name pgadmin dpage/pgadmin4Install vim in postgres docker container
1docker exec -it pg bash;
2apt-get update;
3apt-get install vim;Generate SSH key
1openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365Update postgresql.conf
1vim var/lib/postgresql/data/postgresql.conf
2ssl = on
3ssl_cert_file = 'cert.pem'
4ssl_key_file = 'key.pem'Connect to the db using pg admin
1sudo docker run -e PGADMIN_DEFAULT_EMAIL="santhosh" -e PGADMIN_DEFAULT_PASSWORD="passwd" -p 80:80 -p 5555:80 --name pgadmin dpage/pgadmin41docker exec -it pg bash;
2apt-get update;
3apt-get install vim;1openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 3651vim var/lib/postgresql/data/postgresql.conf
2ssl = on
3ssl_cert_file = 'cert.pem'
4ssl_key_file = 'key.pem'What is the largest sql query we can send before it fails?
Create table if not exists.
Different users for CRED operations and Table owners.
Have different privileges for different users.
Normal users should not be able to DROP tables.
Different pools for different routes (routes that require read access only will use a different pool, logged in as normal user).
Best practices when working with databases
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Homomorphic Encryption
What is encryption?
Symmetric encryption
Same key is used to encrypt and decrypt.
AES, DES, 3DES, RC4, RC5, RC6, Blowfish, Twofish, etc.
Asymmetric encryption
Different keys are used to encrypt and decrypt.
Public key and private key pair.
Why can't we always encrypt data?
Queries cannot be executed on encrypted data.
Analysis, indexing, sorting, etc cannot be done on encrypted data.
Applications must read the data, decrypt it, and then perform the operation.
TLS termination at layer 7 load balancer requires the data to be decrypted. (needs to know the request, payload, etc)
What is homomorphic encryption?
Allows us to perform arithmetic operations on encrypted data.
No need to decrypt the data to perform operations.
Its damn slow, at the moment.
Databases can index and optimize without decrypting the data.
True end to end encryption.
What is encryption?
Symmetric encryption
Same key is used to encrypt and decrypt.
AES, DES, 3DES, RC4, RC5, RC6, Blowfish, Twofish, etc.
Asymmetric encryption
Different keys are used to encrypt and decrypt.
Public key and private key pair.
Why can't we always encrypt data?
Queries cannot be executed on encrypted data.
Analysis, indexing, sorting, etc cannot be done on encrypted data.
Applications must read the data, decrypt it, and then perform the operation.
TLS termination at layer 7 load balancer requires the data to be decrypted. (needs to know the request, payload, etc)
What is homomorphic encryption?
Allows us to perform arithmetic operations on encrypted data.
No need to decrypt the data to perform operations.
Its damn slow, at the moment.
Databases can index and optimize without decrypting the data.
True end to end encryption.
Answering your questions
Why does posgres do a bit map heap scan, instead of an index only scan?
create index g_index on grades(g);
explain analyze select count(*) from grades where g = 1;Postgres needs to visit the heap to check if the row is visible to all transactions.
The visibility col is a hidden col in the heap for each row.
Visibility map is used to check if the row is visible to all transactions.
vaccum grades;vaccum command creates a visibility map.
What is cost in explain analyze?
Cost is the estimated cost of the query.
Its just a number that postgres uses.
Higher the cost, slower the query.
All Isolation levels
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 11
-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Duplicate indexes in a row
Adding an index to a column that has less unique values, is not useful. (Index selectivity) Index is not selective enough.
Index is bloated and takes up more space, (same value is repeated multiple times, in the b-tree)
Postgres 13 has a de-duplication feature, that removes duplicate values from the b-tree index. node - row_id[]
Why use SELECT FOR UPDATE?
Each transaction obtains a exclusive lock on the rows that it needs to read / update.
Other transactions cannot read / update the rows that where locked.
You killed concurrency, your throughput is low.
Your transactions will never fail due to conflicts.
Serialisable isolation level is better.
Detects conflicts and fails the transaction, if there is a serialisable error.
Your throughput is high, and you have concurrency.
Its a optimistic locking mechanism.
Your transactions can fail, due to conflicts.
You need to handle retry logics in your application.
Why use a connection pool?
If you are using one database connection for every request, you are doing it wrong.
TCP blocking could happen if a lot of queries need to be executed at the same time.
flowchart LR
client[Client1]
client_2[Client2]
server[Server]
db[DB]
client --> |request| server --> |get / set| db
client_2 --> |faster request| server --> |get / set| dbflowchart RL
classDef green fill:#ccffcc,stroke:#000,stroke-width:2px;
classDef red fill:#ffcccc,stroke:#000,stroke-width:2px;
client[Client1]:::green
client_2[Client2]:::red
server[Server]
db[DB]
db --> |fast response| server --> |fast response| client
db --> |slow response| server --> |slow response| client_2There is a chance of response misdirection.
Depends on the DB implementation, if the protocol used by the database does not use query tagging, the response could be misdirected.
But in case of a connection pool,
Each connection is leased to a request, and returned to the pool after the request is complete.
No parallel requests are made on the same connection.
What is the value of a bitmap index scan?
Row can be in multiple pages.
Creates a bit map on the first condition, if index is present for the same value. This creates a bit map of pages that satisfy the first condition.
Updates a bit map on the second condition.
Goes to the heap pulls those pages.
Filters out the page for the query condition
See, "Bitmap And / Or" for more info this file
If the query planner decides it has to scan a lot of rows, based on the query condition, it will decide to do an bitmap scan vs an index scan, to reduce IO.
Does composite index order matter?
Write amplifications, composite indexes touches a lot of cols, the index needs to be updated on row update.
Composite indexes are created from left to right.
Minimum filter first. (filter that results in a larger dataset first)
select a,b,c where a=10 and b=10 and c=20;
-- a is minimum filterIn an index only scan, pg still needs to visit the heap?
To check if the table is visible to the transaction.
To get old versions (snapshot) of the row, when the transaction started.
To avoid visits to the heap, use VACUUM command.
Avoid long running transactions.
Persistence vs Durability
Persistence is the ability to store data on disk and retrieve it at a later time after a crash.
My transaction is durable is only successful, when the data is stored on disk. Postgres and Redis use WAL's to flush data to disk periodically.
Can we create index when row's are being inserted / updated?
Most DB's to create index obtains a update / insert / delete lock on the entire table, until the index is created.
Postgres allows you to create index concurrently, without locking the table.
Why does posgres do a bit map heap scan, instead of an index only scan?
1create index g_index on grades(g);
2explain analyze select count(*) from grades where g = 1;Postgres needs to visit the heap to check if the row is visible to all transactions.
The visibility col is a hidden col in the heap for each row.
Visibility map is used to check if the row is visible to all transactions.
1vaccum grades;vaccum command creates a visibility map.
1create index g_index on grades(g);
2explain analyze select count(*) from grades where g = 1;1vaccum grades;What is cost in explain analyze?
Cost is the estimated cost of the query.
Its just a number that postgres uses.
Higher the cost, slower the query.
All Isolation levels
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 11
-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transaction1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 101BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
1select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
1select * from users where id = 1 for update;Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Duplicate indexes in a row
Adding an index to a column that has less unique values, is not useful. (Index selectivity) Index is not selective enough.
Index is bloated and takes up more space, (same value is repeated multiple times, in the b-tree)
Postgres 13 has a de-duplication feature, that removes duplicate values from the b-tree index. node - row_id[]
Why use SELECT FOR UPDATE?
Each transaction obtains a exclusive lock on the rows that it needs to read / update.
Other transactions cannot read / update the rows that where locked.
You killed concurrency, your throughput is low.
Your transactions will never fail due to conflicts.
Serialisable isolation level is better.
Detects conflicts and fails the transaction, if there is a serialisable error.
Your throughput is high, and you have concurrency.
Its a optimistic locking mechanism.
Your transactions can fail, due to conflicts.
You need to handle retry logics in your application.
Why use a connection pool?
If you are using one database connection for every request, you are doing it wrong.
TCP blocking could happen if a lot of queries need to be executed at the same time.
Rendering diagram...
Rendering diagram...
There is a chance of response misdirection.
Depends on the DB implementation, if the protocol used by the database does not use query tagging, the response could be misdirected.
But in case of a connection pool,
Each connection is leased to a request, and returned to the pool after the request is complete.
No parallel requests are made on the same connection.
Rendering diagram...
Rendering diagram...
What is the value of a bitmap index scan?
Row can be in multiple pages.
Creates a bit map on the first condition, if index is present for the same value. This creates a bit map of pages that satisfy the first condition.
Updates a bit map on the second condition.
Goes to the heap pulls those pages.
Filters out the page for the query condition
See, "Bitmap And / Or" for more info this file
If the query planner decides it has to scan a lot of rows, based on the query condition, it will decide to do an bitmap scan vs an index scan, to reduce IO.
Does composite index order matter?
Write amplifications, composite indexes touches a lot of cols, the index needs to be updated on row update.
Composite indexes are created from left to right.
Minimum filter first. (filter that results in a larger dataset first)
1select a,b,c where a=10 and b=10 and c=20;
2-- a is minimum filter1select a,b,c where a=10 and b=10 and c=20;
2-- a is minimum filterIn an index only scan, pg still needs to visit the heap?
To check if the table is visible to the transaction.
To get old versions (snapshot) of the row, when the transaction started.
To avoid visits to the heap, use VACUUM command.
Avoid long running transactions.
Persistence vs Durability
Persistence is the ability to store data on disk and retrieve it at a later time after a crash.
My transaction is durable is only successful, when the data is stored on disk. Postgres and Redis use WAL's to flush data to disk periodically.
Can we create index when row's are being inserted / updated?
Most DB's to create index obtains a update / insert / delete lock on the entire table, until the index is created.
Postgres allows you to create index concurrently, without locking the table.
Graph databases
Database persists and retrieves data from disk.
IO, fetching pages from disk / heap is heavy.
More IO, more latency.
Storage engine, classic storage engines use tables and rows.
Postgres, does not store large strings in-line in the row, it compresses to an external TOAST table, and points to it.
Meta data can be stored in an external data-structure.
WAL, REDO, UNDO logs
No title
When a transaction is committed, data is expected to be persistent even after system termination.
A transaction does changes to everything in memory.
On commit, all changes are flushed to disk. ( this is slow )
If databases crashes mid commit, what should happen to data?.
A log of all changes are persisted to disk. (WAL write ahead log).
Dirty pages are kept in memory.
Dirty pages (changed pages) are flushed to WAL.
DB comes back up, detects that WAL is ahead of the data on disk.
No one reads from the WAL, it's only used for recovery.
Page is loaded in memory, and the WAL is replayed ( WAL is also called as redo log ).
Page is flushed to disk, and WAL is marked as flushed.
How often should the WAL be flushed to disk? (checkpointing)
When the wall is full, we need to flush it to disk, before we can write to the wall again.
Checkpointing is a IO intensive operation, we need to pause everything until the checkpoint is done.
DB warns you before the checkpoint is initiated. (causes spikes in CPU usage)
If the WAL is small, the checkpoint is faster and more frequent.
OS buffers the writes in memory, and flushes it to disk when it's ready ( to reduce IO and wear on the disk ).
OS waits for the write cache to be full, before flushing it to disk.
This is a problem, because the WAL is not flushed to disk, and the OS crashes, the WAL is lost.
fsync() system call is used to flush WAL is flushed to disk directly bypassing the OS cache.
UNDO logs?
Not all databases have UNDO logs.
Rollback Transactions: If a transaction fails or is explicitly rolled back by the user, the UNDO log records are used to revert the changes made by that transaction.
Recovery: In the event of a system crash, UNDO logs are used during the recovery process to ensure that any incomplete transactions are undone, maintaining the consistency of the database.
Long running transactions, can cause the UNDO logs to grow, and for any new transactions to see the old data, the UNDO logs need to be replayed to get to the old state.
Undo logs are applied after wall changes are applied when DB is recovering from a crash.
Can Null values improve database performance?
Default values will take up space, eg (int 32bit will take up 32bits in the row even if the value like 0 does not take up 32 bits))
PostGres uses a null bitmap, to store null values. Every row has a null bitmap, that stores the null values in the row. Every bit corresponds to a column in the row.
If you have more than 8 columns, the null bitmap will increment by 8 bytes. | 8cols | 8 bits | | 9cols | 16 bits | | 17cols | 24 bits |
Nulls are better than default values, in terms of space.
Don't rely on nulls all the time, there are some gotcha's
select count(field);
ignores nulls based on db
select count(*);
counts nulls
T is null or T is not null, is fine.
Not all databases support indexes on cols with nulls in them. (PG allows it, but oracle doesn't)
Postgres limits each page size to 8kb.
We need to try to fit as many rows as possible in a page.
More IO's if the row is split across pages.
If most of the cols are null, the no of rows that can fit in a page can be increased.
There is a fixed cost of saving null's
Select user_id from users where user_id is null; will not use the index on user_id.
No title
When a transaction is committed, data is expected to be persistent even after system termination.
A transaction does changes to everything in memory.
On commit, all changes are flushed to disk. ( this is slow )
If databases crashes mid commit, what should happen to data?.
A log of all changes are persisted to disk. (WAL write ahead log).
Dirty pages are kept in memory.
Dirty pages (changed pages) are flushed to WAL.
DB comes back up, detects that WAL is ahead of the data on disk.
No one reads from the WAL, it's only used for recovery.
Page is loaded in memory, and the WAL is replayed ( WAL is also called as redo log ).
Page is flushed to disk, and WAL is marked as flushed.
How often should the WAL be flushed to disk? (checkpointing)
When the wall is full, we need to flush it to disk, before we can write to the wall again.
Checkpointing is a IO intensive operation, we need to pause everything until the checkpoint is done.
DB warns you before the checkpoint is initiated. (causes spikes in CPU usage)
If the WAL is small, the checkpoint is faster and more frequent.
OS buffers the writes in memory, and flushes it to disk when it's ready ( to reduce IO and wear on the disk ).
OS waits for the write cache to be full, before flushing it to disk.
This is a problem, because the WAL is not flushed to disk, and the OS crashes, the WAL is lost.
fsync() system call is used to flush WAL is flushed to disk directly bypassing the OS cache.
UNDO logs?
Not all databases have UNDO logs.
Rollback Transactions: If a transaction fails or is explicitly rolled back by the user, the UNDO log records are used to revert the changes made by that transaction.
Recovery: In the event of a system crash, UNDO logs are used during the recovery process to ensure that any incomplete transactions are undone, maintaining the consistency of the database.
Long running transactions, can cause the UNDO logs to grow, and for any new transactions to see the old data, the UNDO logs need to be replayed to get to the old state.
Undo logs are applied after wall changes are applied when DB is recovering from a crash.
Can Null values improve database performance?
Default values will take up space, eg (int 32bit will take up 32bits in the row even if the value like 0 does not take up 32 bits))
PostGres uses a null bitmap, to store null values. Every row has a null bitmap, that stores the null values in the row. Every bit corresponds to a column in the row.
If you have more than 8 columns, the null bitmap will increment by 8 bytes. | 8cols | 8 bits | | 9cols | 16 bits | | 17cols | 24 bits |
Nulls are better than default values, in terms of space.
Don't rely on nulls all the time, there are some gotcha's
select count(field);
ignores nulls based on db
select count(*);
counts nulls
T is null or T is not null, is fine.
Not all databases support indexes on cols with nulls in them. (PG allows it, but oracle doesn't)
Postgres limits each page size to 8kb.
We need to try to fit as many rows as possible in a page.
More IO's if the row is split across pages.
If most of the cols are null, the no of rows that can fit in a page can be increased.
There is a fixed cost of saving null's
Select user_id from users where user_id is null; will not use the index on user_id.
Postgres Architecture
No title
Postmaster process
Everything in postgres is append only mutli version concurrency control.
Process vs Threads, (threads are faster, but postgres uses processes)
well established mind physical body, mind, emotions, life energies
Distributed transaction ( a transaction spanning across multiple databases / services / processes )
Each microservice has its own database.
Making a transaction across multiple microservices is hard.
Rendering diagram...
Ongoing transactions where stored in memory, instead of disk
But for distributed transactions this is not the case
Google atomic clocks to sequence of transactions
Compensatory transactions, revert the previous transaction
Eventing model, Kafka / Rabbit MQ,
Rendering diagram...
Go back to the old days, have a macro service.
Debugging is hard in a distributed micro services.
Why did Uber migrate from postgres to mysql
ctid is a internal parameter used by postgres to identify for physical row identification,
it's not meant to be used as a primary key.
ctid's change with every row update and are not stable.
When a row is updated, PostgreSQL often creates a new version of the row with a different ctid. The index is updated to point to the new ctid. This is why indexes might get bloated over time with dead tuples, necessitating regular maintenance tasks like VACUUM or REINDEX.
This is called Write amplification.
WAL's are used to keep track of these changes.
WAL's are pushed to standby servers, to keep them in sync.
SSD's are not good for updates, they are good for inserts (RocksDB and LevelsDB, optimised for SSD's).
Adding too much indexes are a bad idea, it will cause write amplifications.
Uber had databases across multiple data centers, and they had to keep them in sync.
Push WAL updates across multiple data centers, is expensive (network wise).
No title
Postmaster process
Everything in postgres is append only mutli version concurrency control.
Process vs Threads, (threads are faster, but postgres uses processes)
well established mind physical body, mind, emotions, life energies
Distributed transaction ( a transaction spanning across multiple databases / services / processes )
Each microservice has its own database.
Making a transaction across multiple microservices is hard.
Rendering diagram...
Ongoing transactions where stored in memory, instead of disk
But for distributed transactions this is not the case
Google atomic clocks to sequence of transactions
Compensatory transactions, revert the previous transaction
Eventing model, Kafka / Rabbit MQ,
Rendering diagram...
Go back to the old days, have a macro service.
Debugging is hard in a distributed micro services.
Rendering diagram...
Rendering diagram...
Why did Uber migrate from postgres to mysql
ctid is a internal parameter used by postgres to identify for physical row identification,
it's not meant to be used as a primary key.
ctid's change with every row update and are not stable.
When a row is updated, PostgreSQL often creates a new version of the row with a different ctid. The index is updated to point to the new ctid. This is why indexes might get bloated over time with dead tuples, necessitating regular maintenance tasks like VACUUM or REINDEX.
This is called Write amplification.
WAL's are used to keep track of these changes.
WAL's are pushed to standby servers, to keep them in sync.
SSD's are not good for updates, they are good for inserts (RocksDB and LevelsDB, optimised for SSD's).
Adding too much indexes are a bad idea, it will cause write amplifications.
Uber had databases across multiple data centers, and they had to keep them in sync.
Push WAL updates across multiple data centers, is expensive (network wise).
Acid
What is a transaction? 🔁
Life cycle of a transaction
flowchart TD
start[DB Starting State] -->
begin[BEGIN Transaction] --> work[One unit of work]
work --> commit[COMMIT Transaction]
work --> rollback[ROLLBACK Transaction]
rollback --> startBegin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;balance - 100 > 0;UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;COMMIT TRANSACTION;Atomicity
All queries in a transaction must succeed,
or the transaction must be rolled back.
Even if 1/100 query fails,
the entire transaction must be rolled back.
It the database went down prior to a commit of a transaction,
the transaction must be rolled back.
flowchart TD
start[DB Starting Snapshot] -->
begin[BEGIN Transaction] --> update_1[Queries 5/100]
update_1 --> db_crash{DB CRASH}
db_crash -->|yes| cleanup[DB Cleanup / Rollback]
db_crash -->|no| COMMIT[COMMIT Transaction]
cleanup --> startIsolation
No title
Can my inflight transaction see the changes made by other transactions?
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] --> update_1[UPDATE Data]
initial_state ---> transaction_2[BEGIN transaction_2] --> |dirty read| read_1[SELECT Data]
update_1 --> |roolback| initial_stateNon-repeatable reads
Two different reads of the same data, return different results.
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] ---> update_1[UPDATE Data] -->
commit[commit]
transaction_2[BEGIN transaction_2] --> read_1[SELECT Data] --->
|same query diff result| read_2[SELECT Data]
initial_state --> transaction_2Avoid it by starting a repeatable read transaction,
begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
flowchart TD
initial_state[DB Starting State] -->
transaction_1[BEGIN transaction_1] --> read_1[SELECT Data]
transaction_2[BEGIN transaction_2] --> update_1[INSERT Data] --> commit
read_1 ---> |same query has more results| read_2[SELECT Data]
initial_state --> transaction_2On postgres you can use the following to prevent phantom reads,
begin transaction isolation level repeatable read;
# or
begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
flowchart TD
initial_state[SUM 10] -->
transaction_1[BEGIN transaction_1] --> read_1[+20] --> read_12[30] --> commit
initial_state -->
transaction_2[BEGIN transaction_2] --> update_1[10] --> read_2[20] --->|lost update -10| commitIsolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Consistency
Consistency in data is user defined (DBA team)
Referential integrity (through foreign keys) data is
inconsistent across tables.
Atomicity, isolation affects consistency.
Data consistency between tables,
1
a
1
2
b
2
Mate
1
Mark
1
Jack
2
Consistency in reads,
flowchart LR
data --> |update| db[(DB)] --> |read| dataIf a transaction update a row,
will the change be immediately visible during a read?.
Both SQL and NoSQL databases suffer from this problem.
Eventual consistency
flowchart LR
data_v1 --> |update| db[(DB)] -->
|read| data_v2
db --> |read| data_vn
db --> |read| data_v1You will eventually get the latest version of the data.
flowchart LR
master_db[(Master - X)]
slave_1_db[(Follower - Z)]
slave_2_db[(Follower - Z)]
data -->|updated to x| master_db
master_db --> slave_1_db
master_db --> slave_2_db
slave_1_db -->|Inconsistent data| readflowchart LR
master_db[(Master - X)]
slave_1_db[(Follower - X)]
slave_2_db[(Follower - X)]
data --> master_db
master_db -->|replicate| slave_1_db
master_db -->|replicate| slave_2_db
slave_1_db -->|Consistent data| readWhen we introduce caching or we introduce horizontal scaling,
you are eventually consistent.
If you do not have atomicity,
or have inconsistent data,
you do not have consistency at all.
Durability
Process of persisting data to disk (non-volatile storage).
Durability slows write down,
write to disk is slower than write to memory
Durability techniques
WAL - Write ahead log / AOF - Append only file
Any change to the database is written to a log file, the log file is flushed to disk.
flowchart LR
data --> |update| db[(DB)] --> |write| log[(Log)] --> |flush| disk[(Disk)]flowchart LR
db[(DB)]
log[(Log)]
disk[(Disk)]
log --> db
disk --> db
db --> reconstructed_db[(DB reconstructed)]Writing all the changes to disk is expensive (indexes, data, files,
rows...)
The log file is a compressed version of the data stored in the database.
flowchart LR
db[(DB)] --> |flush request| os -->
os_cache[(OS Cache)] -->
|batched flushes| disk[(Disk)]
db --> |fSync| diskfSync forces the OS to flush the data to disk.
This prevents data loss in case of OS crash.
Async snapshots
flowchart LR
data --> |update| db[(DB)] --> |async| snapshot[(Snapshot)] --> |flush| disk[(Disk)]
db --> |sync| snapshot[(Snapshot)]Asynchronously write the data to disk
Serializability vs Repeatable reads
Repeatable reads
flowchart LR
transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit
transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit -->
result[a, a, b, b]both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Serializability
flowchart LR
transaction_1[BEGIN transaction_1] --> read_1[a, a, b, b] --> |update a's to b's|update_1[b, b, b, b] --> commit --> result[a, a, b, b]
transaction_2[BEGIN transaction_2] --> read_2[a, a, b, b] --> |update b's to a's|update_2[a, a, a, a] --> commit_error[Could not serialize transaction]What is a transaction? 🔁
Life cycle of a transaction
Rendering diagram...
Begin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;balance - 100 > 0;UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;1COMMIT TRANSACTION;Life cycle of a transaction
Rendering diagram...
Begin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Rendering diagram...
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;1COMMIT TRANSACTION;1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;1COMMIT TRANSACTION;Atomicity
All queries in a transaction must succeed,
or the transaction must be rolled back.
Even if 1/100 query fails,
the entire transaction must be rolled back.
It the database went down prior to a commit of a transaction,
the transaction must be rolled back.
Rendering diagram...
Rendering diagram...
Isolation
No title
Can my inflight transaction see the changes made by other transactions?
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
begin transaction isolation level repeatable read;
# or
begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
flowchart TD
initial_state[SUM 10] -->
transaction_1[BEGIN transaction_1] --> read_1[+20] --> read_12[30] --> commit
initial_state -->
transaction_2[BEGIN transaction_2] --> update_1[10] --> read_2[20] --->|lost update -10| commitIsolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
1begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
Rendering diagram...
Rendering diagram...
Rendering diagram...
1begin transaction isolation level repeatable read;Rendering diagram...
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;1begin transaction isolation level serializable;Rendering diagram...
Isolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Consistency
Consistency in data is user defined (DBA team)
Referential integrity (through foreign keys) data is
inconsistent across tables.
Atomicity, isolation affects consistency.
Data consistency between tables,
1
a
1
2
b
2
Mate
1
Mark
1
Jack
2
Consistency in reads,
Rendering diagram...
If a transaction update a row,
will the change be immediately visible during a read?.
Both SQL and NoSQL databases suffer from this problem.
Eventual consistency
Rendering diagram...
You will eventually get the latest version of the data.
Rendering diagram...
Rendering diagram...
When we introduce caching or we introduce horizontal scaling,
you are eventually consistent.
If you do not have atomicity,
or have inconsistent data,
you do not have consistency at all.
Rendering diagram...
Rendering diagram...
Rendering diagram...
Rendering diagram...
Durability
Process of persisting data to disk (non-volatile storage).
Durability slows write down,
write to disk is slower than write to memory
Durability techniques
WAL - Write ahead log / AOF - Append only file
Any change to the database is written to a log file, the log file is flushed to disk.
Rendering diagram...
Rendering diagram...
Writing all the changes to disk is expensive (indexes, data, files,
rows...)
The log file is a compressed version of the data stored in the database.
Rendering diagram...
fSync forces the OS to flush the data to disk.
This prevents data loss in case of OS crash.
Async snapshots
Rendering diagram...
Asynchronously write the data to disk
Rendering diagram...
Rendering diagram...
Rendering diagram...
Rendering diagram...
Serializability vs Repeatable reads
Repeatable reads
Rendering diagram...
both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Serializability
Rendering diagram...
Repeatable reads
Rendering diagram...
both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Rendering diagram...
Serializability
Rendering diagram...
Rendering diagram...
Database Internals
How tables and indexes are stored on disk?
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Row vs Column oriented databases
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
flowchart TD
page_1[1-200,name,age]
page_2[200-400,name,age]
page_3[400-500,name,age]
sum[sum age] --> |fetch| page_1 --> |fetch| page_2 --> |fetch| page_3Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
flowchart TD
page_1[age: 1:row_id-500:row_id]
sum[sum age] --> |fetch| page_1To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Primary key vs Secondary key
There is no order maintained by default in the table,
Clustering, order enforced by the database on a table,
through a primary key or secondary key.
Postgres Clustered table / heap organised table, mySQL Index ordered table
Primary key is order inforced on the table itself,
Secondary key is order inforced through an index table on the actual
table.
Adding rows in a secondary key enforced table is slow,
since you need to update the secondary index table.
Buffer pool
flowchart LR
heap[(Heap)]
index[(Index)]
buffer_pool[(Buffer pool)]
wal[(WAL)]
read_query --> |select|index --> |select|heap --> |page,row|buffer_pool --> |row|wal
update_query --> buffer_pool --> walTo reduce the number of IO's,
all row reads are stored in a buffer pool temporarily,
this allows for subsequent reads / writes to be served from the
buffer pool,
thus reducing the number of IO's.
How tables and indexes are stored on disk?
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Rendering diagram...
Row vs Column oriented databases
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
Rendering diagram...
Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
Rendering diagram...
To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
Rendering diagram...
Rendering diagram...
Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
Rendering diagram...
To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Rendering diagram...
Primary key vs Secondary key
There is no order maintained by default in the table,
Clustering, order enforced by the database on a table,
through a primary key or secondary key.
Postgres Clustered table / heap organised table, mySQL Index ordered table
Primary key is order inforced on the table itself,
Secondary key is order inforced through an index table on the actual
table.
Adding rows in a secondary key enforced table is slow,
since you need to update the secondary index table.
Buffer pool
Rendering diagram...
To reduce the number of IO's,
all row reads are stored in a buffer pool temporarily,
this allows for subsequent reads / writes to be served from the
buffer pool,
thus reducing the number of IO's.
Rendering diagram...
Database Indexing
Creating table with random vaues.
insert into temp (t) select random()*100 from generate_series(0,100000)Explaining the explain command
Returns statistics about the query plan,
explain select * from temp where t > 50;
Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Scan types ( Seq vs Index )
Cost Time to start executing the query,
Time to finish executing the query
Rows
Number of rows return
Width
Avg size of returned rows in bytes
Creating an index
create index temp_t_idx on temp(t);Without index scan,
Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)With index scan,
Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)If info is available in the index, the database will not do a heap lookup.
Dont ask for more info than you need, ask for stuff in the index if possible.
With an index search,
Time to prepare increases 0.00 vs 0.29
Time to execute decreases
Number of rows returned visited decreases
Seq table scan vs Index scan vs Bitmap index scan (postgres)
No title
Even when a index is present, the database might choose to do a sequential scan,
or a bitmap index scan, based on the no of rows that needs to be scanned.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
explain select id from t_records where t > 10;
Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
Filter: (t > 10)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select --> heap --> resolveIndex
explain select id from t_records where t > 5000;
Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
Index Cond: (t > 5000)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select -->| Index Scan | index --> heap --> resolveBitmap
If the no of rows to be scanned is too high, and the
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
select --> bitmap --> recheck --> heapbitmap is a bit array of the index values, pointing to the page on heap.
pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
explain select id from t_records where t > 4000;
Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
Recheck Cond: (t > 4000)
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
Index Cond: (t > 4000)
(4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Bitmap And / Or
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
where_clause_1 --> bitmap --> recheck --> heap
where_clause_2 --> bitmapcondition_1 = [0, 0, 1, 0, 0, 1, ...]
condition_2 = [0, 0, 1, 0, 1, 0, ...]
bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
create index a_idx on temp(a) include (b,c);flowchart LR
index[(Index)]
select -->| Index only Scan | index --> resolveAdding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
Composite / Combined Indexes
create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
select * from temp where a > 1 and b < 3 and c > 5;flowchart LR
index[(a,b,c)]
select --> |index only scan|index --> resolveWhen should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Concurrent Index Creation's
Waits for all transactions to finish, before creating the index.
Does not lock the table read's or writes.
Takes longer time to execute.
Can crash if inserting duplicate values, while creating the index.
create index concurrently a_idx on temp(a);Bloom Filters
flowchart LR
client[Client]
server[Server]
bloom_filter[(Bloom Filter)]
db[(DB)]
client --> |request| server <--> |get / set| bloom_filter
server --> |request| dbA bloom filter is a probabilistic data structure, that can be used to check if an element is in a set.
# bloom filter
filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
filter_posstion = hash("req") % len(filter)
if filter[filter_posstion] == 1:
# maybe in set
return request_db()
# set bit
filter[filter_posstion] = 1Working with Billion row tables
Why do we need a billion row table? We have many ways to solve a problem.
Avoid having a billion row.
If thats not possible,
Can you index it?
Can you partition it in the same disk.
Can you shard horizontally on different disks, on multiple DB's.
Map Reduce (not suitabale for transactions).
Creating table with random vaues.
1insert into temp (t) select random()*100 from generate_series(0,100000)1insert into temp (t) select random()*100 from generate_series(0,100000)Explaining the explain command
Returns statistics about the query plan,
1explain select * from temp where t > 50;
2Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Scan types ( Seq vs Index )
Cost Time to start executing the query,
Time to finish executing the query
Rows
Number of rows return
Width
Avg size of returned rows in bytes
1explain select * from temp where t > 50;
2Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Creating an index
1create index temp_t_idx on temp(t);Without index scan,
1Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)With index scan,
1Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)If info is available in the index, the database will not do a heap lookup.
Dont ask for more info than you need, ask for stuff in the index if possible.
With an index search,
Time to prepare increases 0.00 vs 0.29
Time to execute decreases
Number of rows returned visited decreases
1create index temp_t_idx on temp(t);1Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)1Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)Seq table scan vs Index scan vs Bitmap index scan (postgres)
No title
Even when a index is present, the database might choose to do a sequential scan,
or a bitmap index scan, based on the no of rows that needs to be scanned.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
explain select id from t_records where t > 10;
Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
Filter: (t > 10)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select --> heap --> resolveIndex
explain select id from t_records where t > 5000;
Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
Index Cond: (t > 5000)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select -->| Index Scan | index --> heap --> resolveBitmap
If the no of rows to be scanned is too high, and the
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
select --> bitmap --> recheck --> heapbitmap is a bit array of the index values, pointing to the page on heap.
pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
explain select id from t_records where t > 4000;
Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
Recheck Cond: (t > 4000)
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
Index Cond: (t > 4000)
(4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Bitmap And / Or
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
where_clause_1 --> bitmap --> recheck --> heap
where_clause_2 --> bitmapcondition_1 = [0, 0, 1, 0, 0, 1, ...]
condition_2 = [0, 0, 1, 0, 1, 0, ...]
bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
create index a_idx on temp(a) include (b,c);flowchart LR
index[(Index)]
select -->| Index only Scan | index --> resolveAdding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
Composite / Combined Indexes
create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
select * from temp where a > 1 and b < 3 and c > 5;flowchart LR
index[(a,b,c)]
select --> |index only scan|index --> resolveWhen should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
Index
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
Bitmap
If the no of rows to be scanned is too high, and the
Rendering diagram...
bitmap is a bit array of the index values, pointing to the page on heap.
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Rendering diagram...
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the condition1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Bitmap And / Or
Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)1explain select id from t_records where t > 5000 and t < 1000;Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditions1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
1vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
1vacuum t_records;Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Adding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Composite / Combined Indexes
1create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
When should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
1create index a_b_c_combined_idx on temp(a,b,c);1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Concurrent Index Creation's
Waits for all transactions to finish, before creating the index.
Does not lock the table read's or writes.
Takes longer time to execute.
Can crash if inserting duplicate values, while creating the index.
1create index concurrently a_idx on temp(a);1create index concurrently a_idx on temp(a);Bloom Filters
Rendering diagram...
A bloom filter is a probabilistic data structure, that can be used to check if an element is in a set.
1# bloom filter
2filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
3filter_posstion = hash("req") % len(filter)
4
5if filter[filter_posstion] == 1:
6 # maybe in set
7 return request_db()
8
9# set bit
10filter[filter_posstion] = 1Rendering diagram...
1# bloom filter
2filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
3filter_posstion = hash("req") % len(filter)
4
5if filter[filter_posstion] == 1:
6 # maybe in set
7 return request_db()
8
9# set bit
10filter[filter_posstion] = 1Working with Billion row tables
Why do we need a billion row table? We have many ways to solve a problem.
Avoid having a billion row.
If thats not possible,
Can you index it?
Can you partition it in the same disk.
Can you shard horizontally on different disks, on multiple DB's.
Map Reduce (not suitabale for transactions).
B and B+ Trees
Full Table Scan
Reading the entire table, to find the rows that satisfy the query.
Reading pages from disk, that are not in memory.
Requires a lot of I/O to read the pages. (8KB - 16KB per page)
DBMS's implement multi threading to read pages in parallel.
B-Tree (Original)
B trees are not binary trees (each node can have more than 2 children),
B trees stands for balanced tree.(distance from root to any leaf node is
the same)
Each node in a B tree contains multiple keys that form a page.
Rendering diagram...
The above example has 2 keys, and 3 children.
Each key in the node points to a value on the heap (tuple),
or the primary key.
Max no of degrees, no of keys per node.
The parent must have (n - 1) keys,
where n is the number of children.
The nodes are ordered from left to right.
You cannot see this in the diagram,
each key in the node has a before and after pointer.
Before pointer is used to find the next node,
that has a value less than the following key.
Limitations of B-Trees
Because each node stores both keys and values,
the number of keys per node is limited.
The values are not used until the node is found.
This increases the number of I/O's required to find the node
(reading each node / page is an IO).
Range queries are slow, since we need to traverse for all the keys in the range.
B+ Tree
Nodes only store keys in internal and root nodes.
The values are only stored in the leaf nodes.
This allows us to store more keys per node.
Leaf nodes are linked, eacn node has a pointer to the next node linked list.
Range queries are much faster.
Rendering diagram...
Each node usually store 100's of thousands of keys, above diagram is just an example.
Leaf nodes have duplicate keys.
Leaf pointers are cheap to traverse.
1 Node = 1 DBMS Page
Internal nodes fit in memory.
Leaf nodes can be stored in the heap.
Most DBMS uses B+ trees.
Storage cost in Postgres vs mySQL
Postgres points to the heap. mySQL points to the primary key.
Write amplification is higher in Postgres. (any update to the heap, needs to be replicated in the index).
If primary keys are large (UUID), inserts in mySQL will be slower.
(Since UUID's don't have a pattern.)
Leaf nodes in mySQL contains all the data (clustered index).
Full Table Scan
Reading the entire table, to find the rows that satisfy the query.
Reading pages from disk, that are not in memory.
Requires a lot of I/O to read the pages. (8KB - 16KB per page)
DBMS's implement multi threading to read pages in parallel.
B-Tree (Original)
B trees are not binary trees (each node can have more than 2 children),
B trees stands for balanced tree.(distance from root to any leaf node is
the same)
Each node in a B tree contains multiple keys that form a page.
Rendering diagram...
The above example has 2 keys, and 3 children.
Each key in the node points to a value on the heap (tuple),
or the primary key.
Max no of degrees, no of keys per node.
The parent must have (n - 1) keys,
where n is the number of children.
The nodes are ordered from left to right.
You cannot see this in the diagram,
each key in the node has a before and after pointer.
Before pointer is used to find the next node,
that has a value less than the following key.
Rendering diagram...
Limitations of B-Trees
Because each node stores both keys and values,
the number of keys per node is limited.
The values are not used until the node is found.
This increases the number of I/O's required to find the node
(reading each node / page is an IO).
Range queries are slow, since we need to traverse for all the keys in the range.
B+ Tree
Nodes only store keys in internal and root nodes.
The values are only stored in the leaf nodes.
This allows us to store more keys per node.
Leaf nodes are linked, eacn node has a pointer to the next node linked list.
Range queries are much faster.
Rendering diagram...
Each node usually store 100's of thousands of keys, above diagram is just an example.
Leaf nodes have duplicate keys.
Leaf pointers are cheap to traverse.
1 Node = 1 DBMS Page
Internal nodes fit in memory.
Leaf nodes can be stored in the heap.
Most DBMS uses B+ trees.
Rendering diagram...
Storage cost in Postgres vs mySQL
Postgres points to the heap. mySQL points to the primary key.
Write amplification is higher in Postgres. (any update to the heap, needs to be replicated in the index).
If primary keys are large (UUID), inserts in mySQL will be slower.
(Since UUID's don't have a pattern.)
Leaf nodes in mySQL contains all the data (clustered index).
Database Partitioning
Partitioning types.
Range partitioning By date, customer id, etc.
By List By location, zip codes, etc.
Hash partitioning By hash of the primary key, etc. (consistent hashing)
Partitioning vs Sharding
In partitioning, the data is stored in the same DB. The client is agnostic.
Each shard sits in a seperate DB, the client needs to know which shard to query.
In partitioning, the table name changes.
In sharding, the table name is the same, but the DB ip changes.
Creating a partition
create table grades (id serial not null, g not null) partition by range (g);
create table g0035 (like grades including indexes);
create table g3570 (like grades including indexes);
create table g7080 (like grades including indexes);
create table g80100 (like grades including indexes);
alter table attach partition g0035 for values from (0) to (35);
alter table attach partition g3570 for values from (35) to (70);
alter table attach partition g7080 for values from (70) to (80);
alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
2create index on grades (g);or
insert into grades (g) select g from another_table;select max(g) from g0035; -- 34
select max(g) from g3570; -- 69
select max(g) from g7080; -- 79
select max(g) from g80100; -- 99Make sure ENABLE_PARTITION_PRUNING is set to on in
postgresql.conf.
use the show command to check.
show enable_partition_pruning;set enable_partition_pruning = on;Checking partition size
select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
pg_relation_size | relname
------------------+-----------------------------------------------
126935040 | g3570
126754816 | g0035
72572928 | g7090
36241408 | g90100
24305664 | g3570_g_idx
24272896 | g0035_g_idx
13910016 | g7090_g_idx
6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
Partitioning types.
Range partitioning By date, customer id, etc.
By List By location, zip codes, etc.
Hash partitioning By hash of the primary key, etc. (consistent hashing)
Partitioning vs Sharding
In partitioning, the data is stored in the same DB. The client is agnostic.
Each shard sits in a seperate DB, the client needs to know which shard to query.
In partitioning, the table name changes.
In sharding, the table name is the same, but the DB ip changes.
Creating a partition
create table grades (id serial not null, g not null) partition by range (g);
create table g0035 (like grades including indexes);
create table g3570 (like grades including indexes);
create table g7080 (like grades including indexes);
create table g80100 (like grades including indexes);
alter table attach partition g0035 for values from (0) to (35);
alter table attach partition g3570 for values from (35) to (70);
alter table attach partition g7080 for values from (70) to (80);
alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
2create index on grades (g);or
1insert into grades (g) select g from another_table;select max(g) from g0035; -- 34
select max(g) from g3570; -- 69
select max(g) from g7080; -- 79
select max(g) from g80100; -- 99Make sure ENABLE_PARTITION_PRUNING is set to on in
postgresql.conf.
use the show command to check.
1show enable_partition_pruning;set enable_partition_pruning = on;1create table grades (id serial not null, g not null) partition by range (g);
2
3create table g0035 (like grades including indexes);
4create table g3570 (like grades including indexes);
5create table g7080 (like grades including indexes);
6create table g80100 (like grades including indexes);
7
8alter table attach partition g0035 for values from (0) to (35);
9alter table attach partition g3570 for values from (35) to (70);
10alter table attach partition g7080 for values from (70) to (80);
11alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
2create index on grades (g);1insert into grades (g) select g from another_table;1select max(g) from g0035; -- 34
2select max(g) from g3570; -- 69
3select max(g) from g7080; -- 79
4select max(g) from g80100; -- 991show enable_partition_pruning;1set enable_partition_pruning = on;Checking partition size
No title
1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
No title
1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idx1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
Sharding
Pros of sharding
Horizontal scaling (Data / Memory / CPU).
Security (Access control).
Smaller Indexes.
Cons of sharding
Complex client side code. (need to know which shard to query)
Atomic transactions across shards are impossible.
Rollbacks are complicated AF.
Schema changes are harder (need to change all shards).
Joins are harder (need to query all shards).
Sharding key should be known.
When should we shard?
Sharding should be your last resort.
Partition first, then shard.
TCP connections are reaching the limit?
Try replication first.
Writes only to primary server.
Two way replication, two availabilty zones (multiple primary databases).
You cannot maintain ACID complaiance accross shards. (eventual consistency).
Sharding is an overkill for most applications.
Vitesse is a good sharding library for go.
Cons of sharding
Complex client side code. (need to know which shard to query)
Atomic transactions across shards are impossible.
Rollbacks are complicated AF.
Schema changes are harder (need to change all shards).
Joins are harder (need to query all shards).
Sharding key should be known.
When should we shard?
Sharding should be your last resort.
Partition first, then shard.
TCP connections are reaching the limit?
Try replication first.
Writes only to primary server.
Two way replication, two availabilty zones (multiple primary databases).
You cannot maintain ACID complaiance accross shards. (eventual consistency).
Sharding is an overkill for most applications.
Vitesse is a good sharding library for go.
Concurrency control
Exclusive lock's vs Shared lock's
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
gantt
title Timeline
section Alice
Exclusive lock :a1, 2020-01-10, 42h
section Bob
Shared lock :b1, 2020-01-11, 24h
section Charlie to Bob
Shared lock on Bob failed :crit, c1, 2020-01-11, 1h
Update bob :cirt c2, 2020-01-12, 24hDead locks
graph LR
A[Transaction 1]
B[T2]
l1{{Release lock 1 event}}
l2{{R2}}
A --> l1 --> B
B --> l2 --> Aflowchart TD
t1[Transaction 1]
t2[Transaction 2]
a1[Insert 10]
a2[Insert 20]
b1[Insert 20]
b2[Insert 10]
t1 --> a1 --> a2 --> wait --> commit
t2 --> b1 --> b2 --> rollback[rollback due to deadlock]Most DBMS's have a deadlock detection mechanism,
They will fail one of the transactions.
Two phase locking
The double booking phenomenon.
graph TD
t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
t2[Transaction 2] --> s2[id=10, isBooked=false] --> u2[id=10, isBooked=true, name=Steve] --> commit -->
status[id=10, isBooked=true, name=Steve]Both transactions read the same data, and updated it.
Solution, create a two phase locking mechanism.
graph LR
t1[Transaction 1] --> s1[id=10, isBooked=false] --> u1[id=10, isBooked=true, name=Bob] --> commit
t2[Transaction 2] --> hang --> s2[id=10, isBooked=true]begin transaction;
select * from seats where id = 10 for update; -- creates an exclusive lock
update seats set isBooked = true, name = 'Bob' where id = 10;
commit;begin transaction;
select * from seats where id = 10 for update;
-- hang
-- updated data is returnedfor UPDATE creates an exclusive lock on the row indefinitelyset lock_timeout = 1000; will timeout the lock after 1 second.
Alternate solution (not ideal)
flowchart LR
t1[Read committed transaction 1] --> s1[Update when clause] ---> commit
t2[Read committed transaction 2] --> s2[Update when clause] --> hang --> rollbackupdate seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;This solution is imperfect, You are at the mercy of the DBMS
implementation.
Postgres refreshes the row after the lock is released, the same cannot be said
for other DBMS's.
Always rely on explicit locks instead of relying on implicit locks.
flowchart TD
t1[update clause] --> |reading row on heap for lock info|heap[(Heap)] --> wait{{Wait for lock to be released}}
wait --> |refresh row|heap
wait --> resolveSQL pagination with offset is very slow
This is slow, because, DBMS needs to iterate through the first 10,00,010 rows,
and then filter out the last 10 rows.
select * from users order by id limit 10 offset 1000000;instead, make user query for more data using the last-seen-id.
select * from users where id > 1000000 order by id limit 10;Connection pooling.
For every connection, the DBMS needs to allocate resources, this takes time.
If we create a connection for every request, the latency and throughput will be very bad.
If the no of connections increase, the DBMS will overload and crash.
Connection pooling is a way to reuse connections.
The connections are kept alive, and reused.
Each connection is leased to a request, and returned to the pool after the request is complete.
graph LR
A[Request 1]
B[Request 2]
C((Connection pool))
C --> |lease|A
A --> |release|C
C --> |lease|B
B --> |release|C
C --> D[(DB ip:port)]Default connection pool size is 10, in postgres.
graph LR
db[(DB ip_address:port)]
instance_1((instance 1))
instance_2((instance 2))
instance_n((instance n))
lb[Load balancer]
request --> lb --> instance_1
lb --> instance_2
lb --> instance_n
instance_1 --> |10 connection pool|db
instance_2 --> |10 connection pool|db
instance_n --> |10 connection pool|dbEach instance keeps alive 10 connections to the DB.
Each instance can send upto 10 concurrent requests to the DB.
Each function invocation leases a connection from the pool.
Once the function is executed, the connection is returned to the pool.
const { Pool } = require('pg');
const pool = new Pool({
max: 20,
idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
connectionTimeoutMillis: 2000, // timeout after 2 seconds
});
await pool.query('SELECT something FROM sometable');We cannot begin a transaction on a pooled connection, transactions need a dedicated connection.
// asking the pool to give us a dedicated connection
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query('COMMIT');
} catch (e) {
await client.query('ROLLBACK');
} finally {
// release the connection to the pool
client.release();
}Exclusive lock's vs Shared lock's
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
Rendering diagram...
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
Rendering diagram...
Rendering diagram...
Dead locks
Rendering diagram...
Rendering diagram...
Most DBMS's have a deadlock detection mechanism,
They will fail one of the transactions.
Rendering diagram...
Rendering diagram...
Two phase locking
The double booking phenomenon.
Rendering diagram...
Both transactions read the same data, and updated it.
Solution, create a two phase locking mechanism.
Rendering diagram...
1begin transaction;
2select * from seats where id = 10 for update; -- creates an exclusive lock
3update seats set isBooked = true, name = 'Bob' where id = 10;
4commit;1begin transaction;
2select * from seats where id = 10 for update;
3-- hang
4-- updated data is returnedfor UPDATE creates an exclusive lock on the row indefinitelyset lock_timeout = 1000; will timeout the lock after 1 second.
Rendering diagram...
Rendering diagram...
1begin transaction;
2select * from seats where id = 10 for update; -- creates an exclusive lock
3update seats set isBooked = true, name = 'Bob' where id = 10;
4commit;1begin transaction;
2select * from seats where id = 10 for update;
3-- hang
4-- updated data is returnedAlternate solution (not ideal)
Rendering diagram...
1update seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;This solution is imperfect, You are at the mercy of the DBMS
implementation.
Postgres refreshes the row after the lock is released, the same cannot be said
for other DBMS's.
Always rely on explicit locks instead of relying on implicit locks.
Rendering diagram...
Rendering diagram...
1update seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;Rendering diagram...
SQL pagination with offset is very slow
This is slow, because, DBMS needs to iterate through the first 10,00,010 rows,
and then filter out the last 10 rows.
1select * from users order by id limit 10 offset 1000000;instead, make user query for more data using the last-seen-id.
1select * from users where id > 1000000 order by id limit 10;1select * from users order by id limit 10 offset 1000000;1select * from users where id > 1000000 order by id limit 10;Connection pooling.
For every connection, the DBMS needs to allocate resources, this takes time.
If we create a connection for every request, the latency and throughput will be very bad.
If the no of connections increase, the DBMS will overload and crash.
Connection pooling is a way to reuse connections.
The connections are kept alive, and reused.
Each connection is leased to a request, and returned to the pool after the request is complete.
Rendering diagram...
Default connection pool size is 10, in postgres.
Rendering diagram...
Each instance keeps alive 10 connections to the DB.
Each instance can send upto 10 concurrent requests to the DB.
Each function invocation leases a connection from the pool.
Once the function is executed, the connection is returned to the pool.
1const { Pool } = require('pg');
2const pool = new Pool({
3 max: 20,
4 idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
5 connectionTimeoutMillis: 2000, // timeout after 2 seconds
6});
7
8await pool.query('SELECT something FROM sometable');We cannot begin a transaction on a pooled connection, transactions need a dedicated connection.
// asking the pool to give us a dedicated connection
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query('COMMIT');
} catch (e) {
await client.query('ROLLBACK');
} finally {
// release the connection to the pool
client.release();
}Rendering diagram...
Rendering diagram...
1const { Pool } = require('pg');
2const pool = new Pool({
3 max: 20,
4 idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
5 connectionTimeoutMillis: 2000, // timeout after 2 seconds
6});
7
8await pool.query('SELECT something FROM sometable');1// asking the pool to give us a dedicated connection
2const client = await pool.connect();
3try {
4 await client.query('BEGIN');
5 await client.query('COMMIT');
6} catch (e) {
7 await client.query('ROLLBACK');
8} finally {
9 // release the connection to the pool
10 client.release();
11}Database replication
Master slave replication
Rendering diagram...
Multi-Master replication
Rendering diagram...
Difficult to implement.
Need to deal with conflicts.
Not usually preferred over master-slave replication.
Synchronous replication
Rendering diagram...
Replication must be successful before the query is resolved.
If the replication fails, the query fails.
Can specify the number of replicas that must acknowledge the replication.
(first 2 / first 1 / any)
Asynchronous replication
Rendering diagram...
Pros and cons of replication
Pros
Horizontal scaling
Availability zones
Cons
Eventual consistency
Slow writes in sync mode
Multi-master is hard to implement
Master slave replication
Rendering diagram...
Rendering diagram...
Multi-Master replication
Rendering diagram...
Difficult to implement.
Need to deal with conflicts.
Not usually preferred over master-slave replication.
Rendering diagram...
Synchronous replication
Rendering diagram...
Replication must be successful before the query is resolved.
If the replication fails, the query fails.
Can specify the number of replicas that must acknowledge the replication.
(first 2 / first 1 / any)
Rendering diagram...
Asynchronous replication
Rendering diagram...
Rendering diagram...
Twitter system design
Post tweet 140 chars
Follow users
Timeline
Rendering diagram...
L4, LB (transparent proxy)
no TLS encryption handling.
L7, LB
handles TLS encryption.
Extremely large table.
1
2
Rendering diagram...
URL shortener system design
Create a shortened URL.
Redirects to the original URL.
Rendering diagram...
primary key
Only shard when you are not able to handle writes
1base64(sha(url, salt)[:8]) # short urlRendering diagram...
1base64(sha(url, salt)[:8]) # short urlDatabase Engines / Storage Engines / Embedded Databases
A library that takes care of CRED operations on the actual disk.
Databases can build on top of existing database engines.
MariaDB / MySQL allows us to switch between DB engines.
Postgres does not allow us to switch between DB engines.
LSM Trees ( watch?v=I6jB0nM9SKU )
Rendering diagram...
Log structured merge trees.
Each change to an existing key/value pairs are appended to the Memtable.
When the Memtable is full, it is flushed to disk as a SSTable.
SSTable is immutable, and is stored in the disk.
SSTables are have keys sorted in order.
Any update / delete to a key, is appended to the Memtable.
Memtable is flushed to a new SSTable, when the Memtable is full.
Bloomfilters are used to check if a key exists in the SSTable before reading it.
Compaction is used to merge multiple SSTables into one. (routine background process, for reclaiming disk space)
Duplicate and stale keys are removed during compaction.
Rendering diagram...
Database Cursors
Pros
Cursors create a temporary table in memory.
Cursors are used to iterate over a large number of rows. (without loading all the rows into memory)
Allows data to be streamed.
Can cancel the cursor at any time.
Paging can be done using cursors.
Cons
Cursors are stateful, if the connection is lost, the cursor is lost.
Cannot horizontally scale cursors.
Long running cursors can cause locking issues.
Pros
Cursors create a temporary table in memory.
Cursors are used to iterate over a large number of rows. (without loading all the rows into memory)
Allows data to be streamed.
Can cancel the cursor at any time.
Paging can be done using cursors.
Cons
Cursors are stateful, if the connection is lost, the cursor is lost.
Cannot horizontally scale cursors.
Long running cursors can cause locking issues.
Database Security
Steps to enable SSL
Run pg admin
1sudo docker run -e PGADMIN_DEFAULT_EMAIL="santhosh" -e PGADMIN_DEFAULT_PASSWORD="passwd" -p 80:80 -p 5555:80 --name pgadmin dpage/pgadmin4Install vim in postgres docker container
docker exec -it pg bash;
apt-get update;
apt-get install vim;Generate SSH key
openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365Update postgresql.conf
vim var/lib/postgresql/data/postgresql.conf
ssl = on
ssl_cert_file = 'cert.pem'
ssl_key_file = 'key.pem'Connect to the db using pg admin
What is the largest sql query we can send before it fails?
Create table if not exists.
Different users for CRED operations and Table owners.
Have different privileges for different users.
Normal users should not be able to DROP tables.
Different pools for different routes (routes that require read access only will use a different pool, logged in as normal user).
Best practices when working with databases
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Steps to enable SSL
Run pg admin
1sudo docker run -e PGADMIN_DEFAULT_EMAIL="santhosh" -e PGADMIN_DEFAULT_PASSWORD="passwd" -p 80:80 -p 5555:80 --name pgadmin dpage/pgadmin4Install vim in postgres docker container
1docker exec -it pg bash;
2apt-get update;
3apt-get install vim;Generate SSH key
1openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 365Update postgresql.conf
1vim var/lib/postgresql/data/postgresql.conf
2ssl = on
3ssl_cert_file = 'cert.pem'
4ssl_key_file = 'key.pem'Connect to the db using pg admin
1sudo docker run -e PGADMIN_DEFAULT_EMAIL="santhosh" -e PGADMIN_DEFAULT_PASSWORD="passwd" -p 80:80 -p 5555:80 --name pgadmin dpage/pgadmin41docker exec -it pg bash;
2apt-get update;
3apt-get install vim;1openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 3651vim var/lib/postgresql/data/postgresql.conf
2ssl = on
3ssl_cert_file = 'cert.pem'
4ssl_key_file = 'key.pem'What is the largest sql query we can send before it fails?
Create table if not exists.
Different users for CRED operations and Table owners.
Have different privileges for different users.
Normal users should not be able to DROP tables.
Different pools for different routes (routes that require read access only will use a different pool, logged in as normal user).
Best practices when working with databases
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Homomorphic Encryption
What is encryption?
Symmetric encryption
Same key is used to encrypt and decrypt.
AES, DES, 3DES, RC4, RC5, RC6, Blowfish, Twofish, etc.
Asymmetric encryption
Different keys are used to encrypt and decrypt.
Public key and private key pair.
Why can't we always encrypt data?
Queries cannot be executed on encrypted data.
Analysis, indexing, sorting, etc cannot be done on encrypted data.
Applications must read the data, decrypt it, and then perform the operation.
TLS termination at layer 7 load balancer requires the data to be decrypted. (needs to know the request, payload, etc)
What is homomorphic encryption?
Allows us to perform arithmetic operations on encrypted data.
No need to decrypt the data to perform operations.
Its damn slow, at the moment.
Databases can index and optimize without decrypting the data.
True end to end encryption.
What is encryption?
Symmetric encryption
Same key is used to encrypt and decrypt.
AES, DES, 3DES, RC4, RC5, RC6, Blowfish, Twofish, etc.
Asymmetric encryption
Different keys are used to encrypt and decrypt.
Public key and private key pair.
Why can't we always encrypt data?
Queries cannot be executed on encrypted data.
Analysis, indexing, sorting, etc cannot be done on encrypted data.
Applications must read the data, decrypt it, and then perform the operation.
TLS termination at layer 7 load balancer requires the data to be decrypted. (needs to know the request, payload, etc)
What is homomorphic encryption?
Allows us to perform arithmetic operations on encrypted data.
No need to decrypt the data to perform operations.
Its damn slow, at the moment.
Databases can index and optimize without decrypting the data.
True end to end encryption.
Answering your questions
Why does posgres do a bit map heap scan, instead of an index only scan?
create index g_index on grades(g);
explain analyze select count(*) from grades where g = 1;Postgres needs to visit the heap to check if the row is visible to all transactions.
The visibility col is a hidden col in the heap for each row.
Visibility map is used to check if the row is visible to all transactions.
vaccum grades;vaccum command creates a visibility map.
What is cost in explain analyze?
Cost is the estimated cost of the query.
Its just a number that postgres uses.
Higher the cost, slower the query.
All Isolation levels
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 11
-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Duplicate indexes in a row
Adding an index to a column that has less unique values, is not useful. (Index selectivity) Index is not selective enough.
Index is bloated and takes up more space, (same value is repeated multiple times, in the b-tree)
Postgres 13 has a de-duplication feature, that removes duplicate values from the b-tree index. node - row_id[]
Why use SELECT FOR UPDATE?
Each transaction obtains a exclusive lock on the rows that it needs to read / update.
Other transactions cannot read / update the rows that where locked.
You killed concurrency, your throughput is low.
Your transactions will never fail due to conflicts.
Serialisable isolation level is better.
Detects conflicts and fails the transaction, if there is a serialisable error.
Your throughput is high, and you have concurrency.
Its a optimistic locking mechanism.
Your transactions can fail, due to conflicts.
You need to handle retry logics in your application.
Why use a connection pool?
If you are using one database connection for every request, you are doing it wrong.
TCP blocking could happen if a lot of queries need to be executed at the same time.
flowchart LR
client[Client1]
client_2[Client2]
server[Server]
db[DB]
client --> |request| server --> |get / set| db
client_2 --> |faster request| server --> |get / set| dbflowchart RL
classDef green fill:#ccffcc,stroke:#000,stroke-width:2px;
classDef red fill:#ffcccc,stroke:#000,stroke-width:2px;
client[Client1]:::green
client_2[Client2]:::red
server[Server]
db[DB]
db --> |fast response| server --> |fast response| client
db --> |slow response| server --> |slow response| client_2There is a chance of response misdirection.
Depends on the DB implementation, if the protocol used by the database does not use query tagging, the response could be misdirected.
But in case of a connection pool,
Each connection is leased to a request, and returned to the pool after the request is complete.
No parallel requests are made on the same connection.
What is the value of a bitmap index scan?
Row can be in multiple pages.
Creates a bit map on the first condition, if index is present for the same value. This creates a bit map of pages that satisfy the first condition.
Updates a bit map on the second condition.
Goes to the heap pulls those pages.
Filters out the page for the query condition
See, "Bitmap And / Or" for more info this file
If the query planner decides it has to scan a lot of rows, based on the query condition, it will decide to do an bitmap scan vs an index scan, to reduce IO.
Does composite index order matter?
Write amplifications, composite indexes touches a lot of cols, the index needs to be updated on row update.
Composite indexes are created from left to right.
Minimum filter first. (filter that results in a larger dataset first)
select a,b,c where a=10 and b=10 and c=20;
-- a is minimum filterIn an index only scan, pg still needs to visit the heap?
To check if the table is visible to the transaction.
To get old versions (snapshot) of the row, when the transaction started.
To avoid visits to the heap, use VACUUM command.
Avoid long running transactions.
Persistence vs Durability
Persistence is the ability to store data on disk and retrieve it at a later time after a crash.
My transaction is durable is only successful, when the data is stored on disk. Postgres and Redis use WAL's to flush data to disk periodically.
Can we create index when row's are being inserted / updated?
Most DB's to create index obtains a update / insert / delete lock on the entire table, until the index is created.
Postgres allows you to create index concurrently, without locking the table.
Why does posgres do a bit map heap scan, instead of an index only scan?
1create index g_index on grades(g);
2explain analyze select count(*) from grades where g = 1;Postgres needs to visit the heap to check if the row is visible to all transactions.
The visibility col is a hidden col in the heap for each row.
Visibility map is used to check if the row is visible to all transactions.
1vaccum grades;vaccum command creates a visibility map.
1create index g_index on grades(g);
2explain analyze select count(*) from grades where g = 1;1vaccum grades;What is cost in explain analyze?
Cost is the estimated cost of the query.
Its just a number that postgres uses.
Higher the cost, slower the query.
All Isolation levels
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 11
-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transaction1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 101BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
1select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
1select * from users where id = 1 for update;Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Duplicate indexes in a row
Adding an index to a column that has less unique values, is not useful. (Index selectivity) Index is not selective enough.
Index is bloated and takes up more space, (same value is repeated multiple times, in the b-tree)
Postgres 13 has a de-duplication feature, that removes duplicate values from the b-tree index. node - row_id[]
Why use SELECT FOR UPDATE?
Each transaction obtains a exclusive lock on the rows that it needs to read / update.
Other transactions cannot read / update the rows that where locked.
You killed concurrency, your throughput is low.
Your transactions will never fail due to conflicts.
Serialisable isolation level is better.
Detects conflicts and fails the transaction, if there is a serialisable error.
Your throughput is high, and you have concurrency.
Its a optimistic locking mechanism.
Your transactions can fail, due to conflicts.
You need to handle retry logics in your application.
Why use a connection pool?
If you are using one database connection for every request, you are doing it wrong.
TCP blocking could happen if a lot of queries need to be executed at the same time.
Rendering diagram...
Rendering diagram...
There is a chance of response misdirection.
Depends on the DB implementation, if the protocol used by the database does not use query tagging, the response could be misdirected.
But in case of a connection pool,
Each connection is leased to a request, and returned to the pool after the request is complete.
No parallel requests are made on the same connection.
Rendering diagram...
Rendering diagram...
What is the value of a bitmap index scan?
Row can be in multiple pages.
Creates a bit map on the first condition, if index is present for the same value. This creates a bit map of pages that satisfy the first condition.
Updates a bit map on the second condition.
Goes to the heap pulls those pages.
Filters out the page for the query condition
See, "Bitmap And / Or" for more info this file
If the query planner decides it has to scan a lot of rows, based on the query condition, it will decide to do an bitmap scan vs an index scan, to reduce IO.
Does composite index order matter?
Write amplifications, composite indexes touches a lot of cols, the index needs to be updated on row update.
Composite indexes are created from left to right.
Minimum filter first. (filter that results in a larger dataset first)
1select a,b,c where a=10 and b=10 and c=20;
2-- a is minimum filter1select a,b,c where a=10 and b=10 and c=20;
2-- a is minimum filterIn an index only scan, pg still needs to visit the heap?
To check if the table is visible to the transaction.
To get old versions (snapshot) of the row, when the transaction started.
To avoid visits to the heap, use VACUUM command.
Avoid long running transactions.
Persistence vs Durability
Persistence is the ability to store data on disk and retrieve it at a later time after a crash.
My transaction is durable is only successful, when the data is stored on disk. Postgres and Redis use WAL's to flush data to disk periodically.
Can we create index when row's are being inserted / updated?
Most DB's to create index obtains a update / insert / delete lock on the entire table, until the index is created.
Postgres allows you to create index concurrently, without locking the table.
Graph databases
Database persists and retrieves data from disk.
IO, fetching pages from disk / heap is heavy.
More IO, more latency.
Storage engine, classic storage engines use tables and rows.
Postgres, does not store large strings in-line in the row, it compresses to an external TOAST table, and points to it.
Meta data can be stored in an external data-structure.
WAL, REDO, UNDO logs
No title
When a transaction is committed, data is expected to be persistent even after system termination.
A transaction does changes to everything in memory.
On commit, all changes are flushed to disk. ( this is slow )
If databases crashes mid commit, what should happen to data?.
A log of all changes are persisted to disk. (WAL write ahead log).
Dirty pages are kept in memory.
Dirty pages (changed pages) are flushed to WAL.
DB comes back up, detects that WAL is ahead of the data on disk.
No one reads from the WAL, it's only used for recovery.
Page is loaded in memory, and the WAL is replayed ( WAL is also called as redo log ).
Page is flushed to disk, and WAL is marked as flushed.
How often should the WAL be flushed to disk? (checkpointing)
When the wall is full, we need to flush it to disk, before we can write to the wall again.
Checkpointing is a IO intensive operation, we need to pause everything until the checkpoint is done.
DB warns you before the checkpoint is initiated. (causes spikes in CPU usage)
If the WAL is small, the checkpoint is faster and more frequent.
OS buffers the writes in memory, and flushes it to disk when it's ready ( to reduce IO and wear on the disk ).
OS waits for the write cache to be full, before flushing it to disk.
This is a problem, because the WAL is not flushed to disk, and the OS crashes, the WAL is lost.
fsync() system call is used to flush WAL is flushed to disk directly bypassing the OS cache.
UNDO logs?
Not all databases have UNDO logs.
Rollback Transactions: If a transaction fails or is explicitly rolled back by the user, the UNDO log records are used to revert the changes made by that transaction.
Recovery: In the event of a system crash, UNDO logs are used during the recovery process to ensure that any incomplete transactions are undone, maintaining the consistency of the database.
Long running transactions, can cause the UNDO logs to grow, and for any new transactions to see the old data, the UNDO logs need to be replayed to get to the old state.
Undo logs are applied after wall changes are applied when DB is recovering from a crash.
Can Null values improve database performance?
Default values will take up space, eg (int 32bit will take up 32bits in the row even if the value like 0 does not take up 32 bits))
PostGres uses a null bitmap, to store null values. Every row has a null bitmap, that stores the null values in the row. Every bit corresponds to a column in the row.
If you have more than 8 columns, the null bitmap will increment by 8 bytes. | 8cols | 8 bits | | 9cols | 16 bits | | 17cols | 24 bits |
Nulls are better than default values, in terms of space.
Don't rely on nulls all the time, there are some gotcha's
select count(field);
ignores nulls based on db
select count(*);
counts nulls
T is null or T is not null, is fine.
Not all databases support indexes on cols with nulls in them. (PG allows it, but oracle doesn't)
Postgres limits each page size to 8kb.
We need to try to fit as many rows as possible in a page.
More IO's if the row is split across pages.
If most of the cols are null, the no of rows that can fit in a page can be increased.
There is a fixed cost of saving null's
Select user_id from users where user_id is null; will not use the index on user_id.
No title
When a transaction is committed, data is expected to be persistent even after system termination.
A transaction does changes to everything in memory.
On commit, all changes are flushed to disk. ( this is slow )
If databases crashes mid commit, what should happen to data?.
A log of all changes are persisted to disk. (WAL write ahead log).
Dirty pages are kept in memory.
Dirty pages (changed pages) are flushed to WAL.
DB comes back up, detects that WAL is ahead of the data on disk.
No one reads from the WAL, it's only used for recovery.
Page is loaded in memory, and the WAL is replayed ( WAL is also called as redo log ).
Page is flushed to disk, and WAL is marked as flushed.
How often should the WAL be flushed to disk? (checkpointing)
When the wall is full, we need to flush it to disk, before we can write to the wall again.
Checkpointing is a IO intensive operation, we need to pause everything until the checkpoint is done.
DB warns you before the checkpoint is initiated. (causes spikes in CPU usage)
If the WAL is small, the checkpoint is faster and more frequent.
OS buffers the writes in memory, and flushes it to disk when it's ready ( to reduce IO and wear on the disk ).
OS waits for the write cache to be full, before flushing it to disk.
This is a problem, because the WAL is not flushed to disk, and the OS crashes, the WAL is lost.
fsync() system call is used to flush WAL is flushed to disk directly bypassing the OS cache.
UNDO logs?
Not all databases have UNDO logs.
Rollback Transactions: If a transaction fails or is explicitly rolled back by the user, the UNDO log records are used to revert the changes made by that transaction.
Recovery: In the event of a system crash, UNDO logs are used during the recovery process to ensure that any incomplete transactions are undone, maintaining the consistency of the database.
Long running transactions, can cause the UNDO logs to grow, and for any new transactions to see the old data, the UNDO logs need to be replayed to get to the old state.
Undo logs are applied after wall changes are applied when DB is recovering from a crash.
Can Null values improve database performance?
Default values will take up space, eg (int 32bit will take up 32bits in the row even if the value like 0 does not take up 32 bits))
PostGres uses a null bitmap, to store null values. Every row has a null bitmap, that stores the null values in the row. Every bit corresponds to a column in the row.
If you have more than 8 columns, the null bitmap will increment by 8 bytes. | 8cols | 8 bits | | 9cols | 16 bits | | 17cols | 24 bits |
Nulls are better than default values, in terms of space.
Don't rely on nulls all the time, there are some gotcha's
select count(field);
ignores nulls based on db
select count(*);
counts nulls
T is null or T is not null, is fine.
Not all databases support indexes on cols with nulls in them. (PG allows it, but oracle doesn't)
Postgres limits each page size to 8kb.
We need to try to fit as many rows as possible in a page.
More IO's if the row is split across pages.
If most of the cols are null, the no of rows that can fit in a page can be increased.
There is a fixed cost of saving null's
Select user_id from users where user_id is null; will not use the index on user_id.
Postgres Architecture
No title
Postmaster process
Everything in postgres is append only mutli version concurrency control.
Process vs Threads, (threads are faster, but postgres uses processes)
well established mind physical body, mind, emotions, life energies
Distributed transaction ( a transaction spanning across multiple databases / services / processes )
Each microservice has its own database.
Making a transaction across multiple microservices is hard.
Rendering diagram...
Ongoing transactions where stored in memory, instead of disk
But for distributed transactions this is not the case
Google atomic clocks to sequence of transactions
Compensatory transactions, revert the previous transaction
Eventing model, Kafka / Rabbit MQ,
Rendering diagram...
Go back to the old days, have a macro service.
Debugging is hard in a distributed micro services.
Why did Uber migrate from postgres to mysql
ctid is a internal parameter used by postgres to identify for physical row identification,
it's not meant to be used as a primary key.
ctid's change with every row update and are not stable.
When a row is updated, PostgreSQL often creates a new version of the row with a different ctid. The index is updated to point to the new ctid. This is why indexes might get bloated over time with dead tuples, necessitating regular maintenance tasks like VACUUM or REINDEX.
This is called Write amplification.
WAL's are used to keep track of these changes.
WAL's are pushed to standby servers, to keep them in sync.
SSD's are not good for updates, they are good for inserts (RocksDB and LevelsDB, optimised for SSD's).
Adding too much indexes are a bad idea, it will cause write amplifications.
Uber had databases across multiple data centers, and they had to keep them in sync.
Push WAL updates across multiple data centers, is expensive (network wise).
No title
Postmaster process
Everything in postgres is append only mutli version concurrency control.
Process vs Threads, (threads are faster, but postgres uses processes)
well established mind physical body, mind, emotions, life energies
Distributed transaction ( a transaction spanning across multiple databases / services / processes )
Each microservice has its own database.
Making a transaction across multiple microservices is hard.
Rendering diagram...
Ongoing transactions where stored in memory, instead of disk
But for distributed transactions this is not the case
Google atomic clocks to sequence of transactions
Compensatory transactions, revert the previous transaction
Eventing model, Kafka / Rabbit MQ,
Rendering diagram...
Go back to the old days, have a macro service.
Debugging is hard in a distributed micro services.
Rendering diagram...
Rendering diagram...
Why did Uber migrate from postgres to mysql
ctid is a internal parameter used by postgres to identify for physical row identification,
it's not meant to be used as a primary key.
ctid's change with every row update and are not stable.
When a row is updated, PostgreSQL often creates a new version of the row with a different ctid. The index is updated to point to the new ctid. This is why indexes might get bloated over time with dead tuples, necessitating regular maintenance tasks like VACUUM or REINDEX.
This is called Write amplification.
WAL's are used to keep track of these changes.
WAL's are pushed to standby servers, to keep them in sync.
SSD's are not good for updates, they are good for inserts (RocksDB and LevelsDB, optimised for SSD's).
Adding too much indexes are a bad idea, it will cause write amplifications.
Uber had databases across multiple data centers, and they had to keep them in sync.
Push WAL updates across multiple data centers, is expensive (network wise).
What is a transaction? 🔁
Life cycle of a transaction
Rendering diagram...
Begin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;balance - 100 > 0;UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;1COMMIT TRANSACTION;Life cycle of a transaction
Rendering diagram...
Begin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Rendering diagram...
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;1COMMIT TRANSACTION;1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;1COMMIT TRANSACTION;Atomicity
All queries in a transaction must succeed,
or the transaction must be rolled back.
Even if 1/100 query fails,
the entire transaction must be rolled back.
It the database went down prior to a commit of a transaction,
the transaction must be rolled back.
Rendering diagram...
Rendering diagram...
Isolation
No title
Can my inflight transaction see the changes made by other transactions?
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
begin transaction isolation level repeatable read;
# or
begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
flowchart TD
initial_state[SUM 10] -->
transaction_1[BEGIN transaction_1] --> read_1[+20] --> read_12[30] --> commit
initial_state -->
transaction_2[BEGIN transaction_2] --> update_1[10] --> read_2[20] --->|lost update -10| commitIsolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
1begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
Rendering diagram...
Rendering diagram...
Rendering diagram...
1begin transaction isolation level repeatable read;Rendering diagram...
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;1begin transaction isolation level serializable;Rendering diagram...
Isolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Consistency
Consistency in data is user defined (DBA team)
Referential integrity (through foreign keys) data is
inconsistent across tables.
Atomicity, isolation affects consistency.
Data consistency between tables,
1
a
1
2
b
2
Mate
1
Mark
1
Jack
2
Consistency in reads,
Rendering diagram...
If a transaction update a row,
will the change be immediately visible during a read?.
Both SQL and NoSQL databases suffer from this problem.
Eventual consistency
Rendering diagram...
You will eventually get the latest version of the data.
Rendering diagram...
Rendering diagram...
When we introduce caching or we introduce horizontal scaling,
you are eventually consistent.
If you do not have atomicity,
or have inconsistent data,
you do not have consistency at all.
Rendering diagram...
Rendering diagram...
Rendering diagram...
Rendering diagram...
Durability
Process of persisting data to disk (non-volatile storage).
Durability slows write down,
write to disk is slower than write to memory
Durability techniques
WAL - Write ahead log / AOF - Append only file
Any change to the database is written to a log file, the log file is flushed to disk.
Rendering diagram...
Rendering diagram...
Writing all the changes to disk is expensive (indexes, data, files,
rows...)
The log file is a compressed version of the data stored in the database.
Rendering diagram...
fSync forces the OS to flush the data to disk.
This prevents data loss in case of OS crash.
Async snapshots
Rendering diagram...
Asynchronously write the data to disk
Rendering diagram...
Rendering diagram...
Rendering diagram...
Rendering diagram...
Serializability vs Repeatable reads
Repeatable reads
Rendering diagram...
both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Serializability
Rendering diagram...
Repeatable reads
Rendering diagram...
both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Rendering diagram...
Serializability
Rendering diagram...
Rendering diagram...
1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
1begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
Rendering diagram...
1begin transaction isolation level repeatable read;1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;1begin transaction isolation level serializable;Rendering diagram...
Isolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
1begin transaction isolation level serializable;How tables and indexes are stored on disk?
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Row vs Column oriented databases
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
Rendering diagram...
Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
Rendering diagram...
To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
Rendering diagram...
Rendering diagram...
Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
Rendering diagram...
To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Rendering diagram...
Primary key vs Secondary key
There is no order maintained by default in the table,
Clustering, order enforced by the database on a table,
through a primary key or secondary key.
Postgres Clustered table / heap organised table, mySQL Index ordered table
Primary key is order inforced on the table itself,
Secondary key is order inforced through an index table on the actual
table.
Adding rows in a secondary key enforced table is slow,
since you need to update the secondary index table.
Buffer pool
Rendering diagram...
To reduce the number of IO's,
all row reads are stored in a buffer pool temporarily,
this allows for subsequent reads / writes to be served from the
buffer pool,
thus reducing the number of IO's.
Rendering diagram...
Creating table with random vaues.
1insert into temp (t) select random()*100 from generate_series(0,100000)1insert into temp (t) select random()*100 from generate_series(0,100000)Explaining the explain command
Returns statistics about the query plan,
1explain select * from temp where t > 50;
2Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Scan types ( Seq vs Index )
Cost Time to start executing the query,
Time to finish executing the query
Rows
Number of rows return
Width
Avg size of returned rows in bytes
1explain select * from temp where t > 50;
2Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Creating an index
1create index temp_t_idx on temp(t);Without index scan,
1Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)With index scan,
1Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)If info is available in the index, the database will not do a heap lookup.
Dont ask for more info than you need, ask for stuff in the index if possible.
With an index search,
Time to prepare increases 0.00 vs 0.29
Time to execute decreases
Number of rows returned visited decreases
1create index temp_t_idx on temp(t);1Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)1Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)Seq table scan vs Index scan vs Bitmap index scan (postgres)
No title
Even when a index is present, the database might choose to do a sequential scan,
or a bitmap index scan, based on the no of rows that needs to be scanned.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
explain select id from t_records where t > 10;
Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
Filter: (t > 10)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select --> heap --> resolveIndex
explain select id from t_records where t > 5000;
Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
Index Cond: (t > 5000)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select -->| Index Scan | index --> heap --> resolveBitmap
If the no of rows to be scanned is too high, and the
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
select --> bitmap --> recheck --> heapbitmap is a bit array of the index values, pointing to the page on heap.
pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
explain select id from t_records where t > 4000;
Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
Recheck Cond: (t > 4000)
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
Index Cond: (t > 4000)
(4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Bitmap And / Or
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
where_clause_1 --> bitmap --> recheck --> heap
where_clause_2 --> bitmapcondition_1 = [0, 0, 1, 0, 0, 1, ...]
condition_2 = [0, 0, 1, 0, 1, 0, ...]
bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
create index a_idx on temp(a) include (b,c);flowchart LR
index[(Index)]
select -->| Index only Scan | index --> resolveAdding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
Composite / Combined Indexes
create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
select * from temp where a > 1 and b < 3 and c > 5;flowchart LR
index[(a,b,c)]
select --> |index only scan|index --> resolveWhen should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
Index
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
Bitmap
If the no of rows to be scanned is too high, and the
Rendering diagram...
bitmap is a bit array of the index values, pointing to the page on heap.
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Rendering diagram...
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the condition1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Bitmap And / Or
Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)1explain select id from t_records where t > 5000 and t < 1000;Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditions1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
1vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
1vacuum t_records;Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Adding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Composite / Combined Indexes
1create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
When should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
1create index a_b_c_combined_idx on temp(a,b,c);1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Concurrent Index Creation's
Waits for all transactions to finish, before creating the index.
Does not lock the table read's or writes.
Takes longer time to execute.
Can crash if inserting duplicate values, while creating the index.
1create index concurrently a_idx on temp(a);1create index concurrently a_idx on temp(a);Bloom Filters
Rendering diagram...
A bloom filter is a probabilistic data structure, that can be used to check if an element is in a set.
1# bloom filter
2filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
3filter_posstion = hash("req") % len(filter)
4
5if filter[filter_posstion] == 1:
6 # maybe in set
7 return request_db()
8
9# set bit
10filter[filter_posstion] = 1Rendering diagram...
1# bloom filter
2filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
3filter_posstion = hash("req") % len(filter)
4
5if filter[filter_posstion] == 1:
6 # maybe in set
7 return request_db()
8
9# set bit
10filter[filter_posstion] = 1Working with Billion row tables
Why do we need a billion row table? We have many ways to solve a problem.
Avoid having a billion row.
If thats not possible,
Can you index it?
Can you partition it in the same disk.
Can you shard horizontally on different disks, on multiple DB's.
Map Reduce (not suitabale for transactions).
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
Index
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
Bitmap
If the no of rows to be scanned is too high, and the
Rendering diagram...
bitmap is a bit array of the index values, pointing to the page on heap.
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Rendering diagram...
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the condition1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Bitmap And / Or
Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)1explain select id from t_records where t > 5000 and t < 1000;Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditions1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
1vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
1vacuum t_records;Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Adding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Composite / Combined Indexes
1create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
When should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
1create index a_b_c_combined_idx on temp(a,b,c);1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)B+ Tree
Nodes only store keys in internal and root nodes.
The values are only stored in the leaf nodes.
This allows us to store more keys per node.
Leaf nodes are linked, eacn node has a pointer to the next node linked list.
Range queries are much faster.
Rendering diagram...
Each node usually store 100's of thousands of keys, above diagram is just an example.
Leaf nodes have duplicate keys.
Leaf pointers are cheap to traverse.
1 Node = 1 DBMS Page
Internal nodes fit in memory.
Leaf nodes can be stored in the heap.
Most DBMS uses B+ trees.
Creating a partition
create table grades (id serial not null, g not null) partition by range (g);
create table g0035 (like grades including indexes);
create table g3570 (like grades including indexes);
create table g7080 (like grades including indexes);
create table g80100 (like grades including indexes);
alter table attach partition g0035 for values from (0) to (35);
alter table attach partition g3570 for values from (35) to (70);
alter table attach partition g7080 for values from (70) to (80);
alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
2create index on grades (g);or
1insert into grades (g) select g from another_table;select max(g) from g0035; -- 34
select max(g) from g3570; -- 69
select max(g) from g7080; -- 79
select max(g) from g80100; -- 99Make sure ENABLE_PARTITION_PRUNING is set to on in
postgresql.conf.
use the show command to check.
1show enable_partition_pruning;set enable_partition_pruning = on;1create table grades (id serial not null, g not null) partition by range (g);
2
3create table g0035 (like grades including indexes);
4create table g3570 (like grades including indexes);
5create table g7080 (like grades including indexes);
6create table g80100 (like grades including indexes);
7
8alter table attach partition g0035 for values from (0) to (35);
9alter table attach partition g3570 for values from (35) to (70);
10alter table attach partition g7080 for values from (70) to (80);
11alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select g from another_table;1select max(g) from g0035; -- 34
2select max(g) from g3570; -- 69
3select max(g) from g7080; -- 79
4select max(g) from g80100; -- 991show enable_partition_pruning;1set enable_partition_pruning = on;Checking partition size
No title
1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
No title
1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idx1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
1create table grades (id serial not null, g not null) partition by range (g);
2
3create table g0035 (like grades including indexes);
4create table g3570 (like grades including indexes);
5create table g7080 (like grades including indexes);
6create table g80100 (like grades including indexes);
7
8alter table attach partition g0035 for values from (0) to (35);
9alter table attach partition g3570 for values from (35) to (70);
10alter table attach partition g7080 for values from (70) to (80);
11alter table attach partition g80100 for values from (80) to (100);1select max(g) from g0035; -- 34
2select max(g) from g3570; -- 69
3select max(g) from g7080; -- 79
4select max(g) from g80100; -- 991set enable_partition_pruning = on;Exclusive lock's vs Shared lock's
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
Rendering diagram...
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
Rendering diagram...
Rendering diagram...
Dead locks
Rendering diagram...
Rendering diagram...
Most DBMS's have a deadlock detection mechanism,
They will fail one of the transactions.
Rendering diagram...
Rendering diagram...
Two phase locking
The double booking phenomenon.
Rendering diagram...
Both transactions read the same data, and updated it.
Solution, create a two phase locking mechanism.
Rendering diagram...
1begin transaction;
2select * from seats where id = 10 for update; -- creates an exclusive lock
3update seats set isBooked = true, name = 'Bob' where id = 10;
4commit;1begin transaction;
2select * from seats where id = 10 for update;
3-- hang
4-- updated data is returnedfor UPDATE creates an exclusive lock on the row indefinitelyset lock_timeout = 1000; will timeout the lock after 1 second.
Rendering diagram...
Rendering diagram...
1begin transaction;
2select * from seats where id = 10 for update; -- creates an exclusive lock
3update seats set isBooked = true, name = 'Bob' where id = 10;
4commit;1begin transaction;
2select * from seats where id = 10 for update;
3-- hang
4-- updated data is returnedAlternate solution (not ideal)
Rendering diagram...
1update seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;This solution is imperfect, You are at the mercy of the DBMS
implementation.
Postgres refreshes the row after the lock is released, the same cannot be said
for other DBMS's.
Always rely on explicit locks instead of relying on implicit locks.
Rendering diagram...
Rendering diagram...
1update seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;Rendering diagram...
SQL pagination with offset is very slow
This is slow, because, DBMS needs to iterate through the first 10,00,010 rows,
and then filter out the last 10 rows.
1select * from users order by id limit 10 offset 1000000;instead, make user query for more data using the last-seen-id.
1select * from users where id > 1000000 order by id limit 10;1select * from users order by id limit 10 offset 1000000;1select * from users where id > 1000000 order by id limit 10;Connection pooling.
For every connection, the DBMS needs to allocate resources, this takes time.
If we create a connection for every request, the latency and throughput will be very bad.
If the no of connections increase, the DBMS will overload and crash.
Connection pooling is a way to reuse connections.
The connections are kept alive, and reused.
Each connection is leased to a request, and returned to the pool after the request is complete.
Rendering diagram...
Default connection pool size is 10, in postgres.
Rendering diagram...
Each instance keeps alive 10 connections to the DB.
Each instance can send upto 10 concurrent requests to the DB.
Each function invocation leases a connection from the pool.
Once the function is executed, the connection is returned to the pool.
1const { Pool } = require('pg');
2const pool = new Pool({
3 max: 20,
4 idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
5 connectionTimeoutMillis: 2000, // timeout after 2 seconds
6});
7
8await pool.query('SELECT something FROM sometable');We cannot begin a transaction on a pooled connection, transactions need a dedicated connection.
// asking the pool to give us a dedicated connection
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query('COMMIT');
} catch (e) {
await client.query('ROLLBACK');
} finally {
// release the connection to the pool
client.release();
}Rendering diagram...
Rendering diagram...
1const { Pool } = require('pg');
2const pool = new Pool({
3 max: 20,
4 idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
5 connectionTimeoutMillis: 2000, // timeout after 2 seconds
6});
7
8await pool.query('SELECT something FROM sometable');1// asking the pool to give us a dedicated connection
2const client = await pool.connect();
3try {
4 await client.query('BEGIN');
5 await client.query('COMMIT');
6} catch (e) {
7 await client.query('ROLLBACK');
8} finally {
9 // release the connection to the pool
10 client.release();
11}1// asking the pool to give us a dedicated connection
2const client = await pool.connect();
3try {
4 await client.query('BEGIN');
5 await client.query('COMMIT');
6} catch (e) {
7 await client.query('ROLLBACK');
8} finally {
9 // release the connection to the pool
10 client.release();
11}1docker exec -it pg bash;
2apt-get update;
3apt-get install vim;1openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 3651vim var/lib/postgresql/data/postgresql.conf
2ssl = on
3ssl_cert_file = 'cert.pem'
4ssl_key_file = 'key.pem'Best practices when working with databases
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Why does posgres do a bit map heap scan, instead of an index only scan?
1create index g_index on grades(g);
2explain analyze select count(*) from grades where g = 1;Postgres needs to visit the heap to check if the row is visible to all transactions.
The visibility col is a hidden col in the heap for each row.
Visibility map is used to check if the row is visible to all transactions.
1vaccum grades;vaccum command creates a visibility map.
1create index g_index on grades(g);
2explain analyze select count(*) from grades where g = 1;1vaccum grades;What is cost in explain analyze?
Cost is the estimated cost of the query.
Its just a number that postgres uses.
Higher the cost, slower the query.
All Isolation levels
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 11
-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transaction1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 101BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
1select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
1select * from users where id = 1 for update;Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Duplicate indexes in a row
Adding an index to a column that has less unique values, is not useful. (Index selectivity) Index is not selective enough.
Index is bloated and takes up more space, (same value is repeated multiple times, in the b-tree)
Postgres 13 has a de-duplication feature, that removes duplicate values from the b-tree index. node - row_id[]
Why use SELECT FOR UPDATE?
Each transaction obtains a exclusive lock on the rows that it needs to read / update.
Other transactions cannot read / update the rows that where locked.
You killed concurrency, your throughput is low.
Your transactions will never fail due to conflicts.
Serialisable isolation level is better.
Detects conflicts and fails the transaction, if there is a serialisable error.
Your throughput is high, and you have concurrency.
Its a optimistic locking mechanism.
Your transactions can fail, due to conflicts.
You need to handle retry logics in your application.
Why use a connection pool?
If you are using one database connection for every request, you are doing it wrong.
TCP blocking could happen if a lot of queries need to be executed at the same time.
Rendering diagram...
Rendering diagram...
There is a chance of response misdirection.
Depends on the DB implementation, if the protocol used by the database does not use query tagging, the response could be misdirected.
But in case of a connection pool,
Each connection is leased to a request, and returned to the pool after the request is complete.
No parallel requests are made on the same connection.
Rendering diagram...
Rendering diagram...
What is the value of a bitmap index scan?
Row can be in multiple pages.
Creates a bit map on the first condition, if index is present for the same value. This creates a bit map of pages that satisfy the first condition.
Updates a bit map on the second condition.
Goes to the heap pulls those pages.
Filters out the page for the query condition
See, "Bitmap And / Or" for more info this file
If the query planner decides it has to scan a lot of rows, based on the query condition, it will decide to do an bitmap scan vs an index scan, to reduce IO.
Does composite index order matter?
Write amplifications, composite indexes touches a lot of cols, the index needs to be updated on row update.
Composite indexes are created from left to right.
Minimum filter first. (filter that results in a larger dataset first)
1select a,b,c where a=10 and b=10 and c=20;
2-- a is minimum filter1select a,b,c where a=10 and b=10 and c=20;
2-- a is minimum filterIn an index only scan, pg still needs to visit the heap?
To check if the table is visible to the transaction.
To get old versions (snapshot) of the row, when the transaction started.
To avoid visits to the heap, use VACUUM command.
Avoid long running transactions.
Persistence vs Durability
Persistence is the ability to store data on disk and retrieve it at a later time after a crash.
My transaction is durable is only successful, when the data is stored on disk. Postgres and Redis use WAL's to flush data to disk periodically.
Can we create index when row's are being inserted / updated?
Most DB's to create index obtains a update / insert / delete lock on the entire table, until the index is created.
Postgres allows you to create index concurrently, without locking the table.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transaction1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 101select * from users where id = 1 for update;What is a transaction? 🔁
Life cycle of a transaction
Rendering diagram...
Begin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
BEGIN TRANSACTION;
SELECT balance FROM accounts WHERE account_id = 1;balance - 100 > 0;UPDATE accounts SET balance = balance - 100
WHERE account_id = 1;1COMMIT TRANSACTION;Life cycle of a transaction
Rendering diagram...
Begin,
Tell the DB that you are starting a transaction.
Commit,
Persist the changes to the database.
Rollback,
Undo the changes to the database.
Rollback due to db crashes,
commit crashes.
Rendering diagram...
Properties of a transaction
A transaction is a collection of queries
that is run as a single unit of work.
A transaction is used to modify / update / add data.
Sometimes a transaction can be read-only.
use case:
getting consistent snapshots of data before a transaction starts.
Some transitions are user defined,
but most updates are implicitly wrapped in a transaction.
Example
1
1000
2
500
1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;1COMMIT TRANSACTION;1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;1COMMIT TRANSACTION;Atomicity
All queries in a transaction must succeed,
or the transaction must be rolled back.
Even if 1/100 query fails,
the entire transaction must be rolled back.
It the database went down prior to a commit of a transaction,
the transaction must be rolled back.
Rendering diagram...
Rendering diagram...
Isolation
No title
Can my inflight transaction see the changes made by other transactions?
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
begin transaction isolation level repeatable read;
# or
begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
flowchart TD
initial_state[SUM 10] -->
transaction_1[BEGIN transaction_1] --> read_1[+20] --> read_12[30] --> commit
initial_state -->
transaction_2[BEGIN transaction_2] --> update_1[10] --> read_2[20] --->|lost update -10| commitIsolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
1begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
Rendering diagram...
Rendering diagram...
Rendering diagram...
1begin transaction isolation level repeatable read;Rendering diagram...
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;1begin transaction isolation level serializable;Rendering diagram...
Isolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
Consistency
Consistency in data is user defined (DBA team)
Referential integrity (through foreign keys) data is
inconsistent across tables.
Atomicity, isolation affects consistency.
Data consistency between tables,
1
a
1
2
b
2
Mate
1
Mark
1
Jack
2
Consistency in reads,
Rendering diagram...
If a transaction update a row,
will the change be immediately visible during a read?.
Both SQL and NoSQL databases suffer from this problem.
Eventual consistency
Rendering diagram...
You will eventually get the latest version of the data.
Rendering diagram...
Rendering diagram...
When we introduce caching or we introduce horizontal scaling,
you are eventually consistent.
If you do not have atomicity,
or have inconsistent data,
you do not have consistency at all.
Rendering diagram...
Rendering diagram...
Rendering diagram...
Rendering diagram...
Durability
Process of persisting data to disk (non-volatile storage).
Durability slows write down,
write to disk is slower than write to memory
Durability techniques
WAL - Write ahead log / AOF - Append only file
Any change to the database is written to a log file, the log file is flushed to disk.
Rendering diagram...
Rendering diagram...
Writing all the changes to disk is expensive (indexes, data, files,
rows...)
The log file is a compressed version of the data stored in the database.
Rendering diagram...
fSync forces the OS to flush the data to disk.
This prevents data loss in case of OS crash.
Async snapshots
Rendering diagram...
Asynchronously write the data to disk
Rendering diagram...
Rendering diagram...
Rendering diagram...
Rendering diagram...
Serializability vs Repeatable reads
Repeatable reads
Rendering diagram...
both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Serializability
Rendering diagram...
Repeatable reads
Rendering diagram...
both the transactions did not step on each other,
transaction one modified only first two rows [a, a, -, -],
transaction two modified only last two rows [-, -, b, b].
When the transactions are committed only modified rows are updated,[a, a, b, b]
Rendering diagram...
Serializability
Rendering diagram...
Rendering diagram...
1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
1begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
Rendering diagram...
1begin transaction isolation level repeatable read;1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;1begin transaction isolation level serializable;Rendering diagram...
Isolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
1begin transaction isolation level serializable;How tables and indexes are stored on disk?
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Storage concepts
Table representation:
1
Mate
30
bits and bytes of data,
stored on disk
row_id
in mysql, the row_id is the primary key.
in postgres, the row_id is a hidden column,
generated by the database.
Page
Fixed size rows of table, stored in a memory location.
Databases **reads and writes data in pages **, it cant read a
single row.
Each page has a fixed size (8kb in postgres, 16kb in mySQL),
IO
Operations that read and write data from disk.
IO can fetch one or more pages at a time, depending on the
disk partitions and other factors.
IO's are very expensive, you want to minimize the number of
IO's you do.
Postgres relies on OS cache to minimize the number of
IO's it does.
Heap data structure
Heap data structure is used to store the pages on disk.
Traversing a heap is expensive, indexes tell us identify
sections of the heap that we are interested in.
Index
You can create indexes on one or more columns.
Index is also stored on disk as pages.
An index is an another data structure that stores the references to
the pages on heap.
An index tells you where to find the page on the heap, instead of traversing
the heap.
B-Tree is the most common index data structure.
Indexes need to be small to reduce the number of IO's.
Rendering diagram...
In postgres, the row_id is the primary index, primary key is secondary index, the index needs to be updated even when a single row is updated.
In mysql, the primary key is the primary index, this means if the primary key is random (uuid), the data is not stored in order on disk making it hard to query.
Row vs Column oriented databases
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
Rendering diagram...
Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
Rendering diagram...
To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Row vise
Tables are stored as rows on disk.
A single IO fetches (1 page) multiple rows with their columns.
More IO's are required to fetch a single row.
Once you find the row, you have all the columns without extra computation.
A row can span multiple pages, row size is limited by the page size.
You need to go through a lot of pages to find all the values for a column. (IO intensive)
Rendering diagram...
Rendering diagram...
Column vise
name
age
Tables are stored as columns on disk.
A single IO can have one or more columns (depends on row size).
Less IO's are required to fetch more values in an column.
Rendering diagram...
To fetch a row, you need to fetch all the columns with the row_id. (IO intensive)
simple
complex
writes are fast
slow
optimal for transactions
optimal for ETL's (one column-row's)
low compression
high
aggregate queries slow
fast
efficient multi column queries
less efficient
Joining row based tables and column based tables are messy AF
Rendering diagram...
Primary key vs Secondary key
There is no order maintained by default in the table,
Clustering, order enforced by the database on a table,
through a primary key or secondary key.
Postgres Clustered table / heap organised table, mySQL Index ordered table
Primary key is order inforced on the table itself,
Secondary key is order inforced through an index table on the actual
table.
Adding rows in a secondary key enforced table is slow,
since you need to update the secondary index table.
Buffer pool
Rendering diagram...
To reduce the number of IO's,
all row reads are stored in a buffer pool temporarily,
this allows for subsequent reads / writes to be served from the
buffer pool,
thus reducing the number of IO's.
Rendering diagram...
Creating table with random vaues.
1insert into temp (t) select random()*100 from generate_series(0,100000)1insert into temp (t) select random()*100 from generate_series(0,100000)Explaining the explain command
Returns statistics about the query plan,
1explain select * from temp where t > 50;
2Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Scan types ( Seq vs Index )
Cost Time to start executing the query,
Time to finish executing the query
Rows
Number of rows return
Width
Avg size of returned rows in bytes
1explain select * from temp where t > 50;
2Seq Scan on temp (cost=0.00..2087.00 rows=100000 width=4)Creating an index
1create index temp_t_idx on temp(t);Without index scan,
1Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)With index scan,
1Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)If info is available in the index, the database will not do a heap lookup.
Dont ask for more info than you need, ask for stuff in the index if possible.
With an index search,
Time to prepare increases 0.00 vs 0.29
Time to execute decreases
Number of rows returned visited decreases
1create index temp_t_idx on temp(t);1Seq Scan on temp (cost=0.00..1443.01 rows=100001 width=4)1Index Only Scan using temp_temp_idx on temp (cost=0.29..1258.40 rows=59549 width=4)Seq table scan vs Index scan vs Bitmap index scan (postgres)
No title
Even when a index is present, the database might choose to do a sequential scan,
or a bitmap index scan, based on the no of rows that needs to be scanned.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
explain select id from t_records where t > 10;
Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
Filter: (t > 10)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select --> heap --> resolveIndex
explain select id from t_records where t > 5000;
Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
Index Cond: (t > 5000)
(2 rows)flowchart LR
index[(Index)]
heap[(Heap)]
select -->| Index Scan | index --> heap --> resolveBitmap
If the no of rows to be scanned is too high, and the
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
select --> bitmap --> recheck --> heapbitmap is a bit array of the index values, pointing to the page on heap.
pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
explain select id from t_records where t > 4000;
Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
Recheck Cond: (t > 4000)
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
Index Cond: (t > 4000)
(4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Bitmap And / Or
flowchart LR
index[(Index)]
bitmap[(Bitmap)]
heap[(Heap)]
recheck[Condition Recheck]
index --> bitmap
where_clause_1 --> bitmap --> recheck --> heap
where_clause_2 --> bitmapcondition_1 = [0, 0, 1, 0, 0, 1, ...]
condition_2 = [0, 0, 1, 0, 1, 0, ...]
bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
create index a_idx on temp(a) include (b,c);flowchart LR
index[(Index)]
select -->| Index only Scan | index --> resolveAdding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
Composite / Combined Indexes
create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
select * from temp where a > 1 and b < 3 and c > 5;flowchart LR
index[(a,b,c)]
select --> |index only scan|index --> resolveWhen should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
Index
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
Bitmap
If the no of rows to be scanned is too high, and the
Rendering diagram...
bitmap is a bit array of the index values, pointing to the page on heap.
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Rendering diagram...
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the condition1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Bitmap And / Or
Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)1explain select id from t_records where t > 5000 and t < 1000;Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditions1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
1vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
1vacuum t_records;Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Adding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Composite / Combined Indexes
1create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
When should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
1create index a_b_c_combined_idx on temp(a,b,c);1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
Concurrent Index Creation's
Waits for all transactions to finish, before creating the index.
Does not lock the table read's or writes.
Takes longer time to execute.
Can crash if inserting duplicate values, while creating the index.
1create index concurrently a_idx on temp(a);1create index concurrently a_idx on temp(a);Bloom Filters
Rendering diagram...
A bloom filter is a probabilistic data structure, that can be used to check if an element is in a set.
1# bloom filter
2filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
3filter_posstion = hash("req") % len(filter)
4
5if filter[filter_posstion] == 1:
6 # maybe in set
7 return request_db()
8
9# set bit
10filter[filter_posstion] = 1Rendering diagram...
1# bloom filter
2filter = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
3filter_posstion = hash("req") % len(filter)
4
5if filter[filter_posstion] == 1:
6 # maybe in set
7 return request_db()
8
9# set bit
10filter[filter_posstion] = 1Working with Billion row tables
Why do we need a billion row table? We have many ways to solve a problem.
Avoid having a billion row.
If thats not possible,
Can you index it?
Can you partition it in the same disk.
Can you shard horizontally on different disks, on multiple DB's.
Map Reduce (not suitabale for transactions).
Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
Index
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
Bitmap
If the no of rows to be scanned is too high, and the
Rendering diagram...
bitmap is a bit array of the index values, pointing to the page on heap.
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Rendering diagram...
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the condition1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Bitmap And / Or
Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)1explain select id from t_records where t > 5000 and t < 1000;Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditions1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
1vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
1vacuum t_records;Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Adding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Composite / Combined Indexes
1create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
When should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
1create index a_b_c_combined_idx on temp(a,b,c);1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)B+ Tree
Nodes only store keys in internal and root nodes.
The values are only stored in the leaf nodes.
This allows us to store more keys per node.
Leaf nodes are linked, eacn node has a pointer to the next node linked list.
Range queries are much faster.
Rendering diagram...
Each node usually store 100's of thousands of keys, above diagram is just an example.
Leaf nodes have duplicate keys.
Leaf pointers are cheap to traverse.
1 Node = 1 DBMS Page
Internal nodes fit in memory.
Leaf nodes can be stored in the heap.
Most DBMS uses B+ trees.
Creating a partition
create table grades (id serial not null, g not null) partition by range (g);
create table g0035 (like grades including indexes);
create table g3570 (like grades including indexes);
create table g7080 (like grades including indexes);
create table g80100 (like grades including indexes);
alter table attach partition g0035 for values from (0) to (35);
alter table attach partition g3570 for values from (35) to (70);
alter table attach partition g7080 for values from (70) to (80);
alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select floor(random() * 100) from generate_series(1, 10000000);
2create index on grades (g);or
1insert into grades (g) select g from another_table;select max(g) from g0035; -- 34
select max(g) from g3570; -- 69
select max(g) from g7080; -- 79
select max(g) from g80100; -- 99Make sure ENABLE_PARTITION_PRUNING is set to on in
postgresql.conf.
use the show command to check.
1show enable_partition_pruning;set enable_partition_pruning = on;1create table grades (id serial not null, g not null) partition by range (g);
2
3create table g0035 (like grades including indexes);
4create table g3570 (like grades including indexes);
5create table g7080 (like grades including indexes);
6create table g80100 (like grades including indexes);
7
8alter table attach partition g0035 for values from (0) to (35);
9alter table attach partition g3570 for values from (35) to (70);
10alter table attach partition g7080 for values from (70) to (80);
11alter table attach partition g80100 for values from (80) to (100);1insert into grades (g) select g from another_table;1select max(g) from g0035; -- 34
2select max(g) from g3570; -- 69
3select max(g) from g7080; -- 79
4select max(g) from g80100; -- 991show enable_partition_pruning;1set enable_partition_pruning = on;Checking partition size
No title
1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
No title
1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idx1select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;
2
3 pg_relation_size | relname
4------------------+-----------------------------------------------
5 126935040 | g3570
6 126754816 | g0035
7 72572928 | g7090
8 36241408 | g90100
9 24305664 | g3570_g_idx
10 24272896 | g0035_g_idx
11 13910016 | g7090_g_idx
12 6955008 | g90100_g_idxPros of partitioning
Increases query performanace, when the query is limited to a partition.
DB decides to use Seq scan's instead of Scattered index scan (multi index
search).
Faster bulk loads (attach / detach partitions).
Archiving old data to a cheaper storage (HDD drive).
Cons of partitioning
Row updates are a lot slower (the row needs to be moved to a different partition).
If Query is not limited to a partition, it will be slower.
Schema changes are harder (need to change all partitions), DBMS dependent.
1create table grades (id serial not null, g not null) partition by range (g);
2
3create table g0035 (like grades including indexes);
4create table g3570 (like grades including indexes);
5create table g7080 (like grades including indexes);
6create table g80100 (like grades including indexes);
7
8alter table attach partition g0035 for values from (0) to (35);
9alter table attach partition g3570 for values from (35) to (70);
10alter table attach partition g7080 for values from (70) to (80);
11alter table attach partition g80100 for values from (80) to (100);1select max(g) from g0035; -- 34
2select max(g) from g3570; -- 69
3select max(g) from g7080; -- 79
4select max(g) from g80100; -- 991set enable_partition_pruning = on;Exclusive lock's vs Shared lock's
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
Rendering diagram...
Exclusive lock's
Other queries cannot read or write to the locked row, until the lock is released.
Concurrent attemps resolve with an error.
Shared lock's
Only updates are locked.
Reads are allowed.
You cannot have both shared and exclusive locks on the same row.
Alice
100
Bob
200
Charlie
300
Rendering diagram...
Rendering diagram...
Dead locks
Rendering diagram...
Rendering diagram...
Most DBMS's have a deadlock detection mechanism,
They will fail one of the transactions.
Rendering diagram...
Rendering diagram...
Two phase locking
The double booking phenomenon.
Rendering diagram...
Both transactions read the same data, and updated it.
Solution, create a two phase locking mechanism.
Rendering diagram...
1begin transaction;
2select * from seats where id = 10 for update; -- creates an exclusive lock
3update seats set isBooked = true, name = 'Bob' where id = 10;
4commit;1begin transaction;
2select * from seats where id = 10 for update;
3-- hang
4-- updated data is returnedfor UPDATE creates an exclusive lock on the row indefinitelyset lock_timeout = 1000; will timeout the lock after 1 second.
Rendering diagram...
Rendering diagram...
1begin transaction;
2select * from seats where id = 10 for update; -- creates an exclusive lock
3update seats set isBooked = true, name = 'Bob' where id = 10;
4commit;1begin transaction;
2select * from seats where id = 10 for update;
3-- hang
4-- updated data is returnedAlternate solution (not ideal)
Rendering diagram...
1update seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;This solution is imperfect, You are at the mercy of the DBMS
implementation.
Postgres refreshes the row after the lock is released, the same cannot be said
for other DBMS's.
Always rely on explicit locks instead of relying on implicit locks.
Rendering diagram...
Rendering diagram...
1update seats set isBooked = true, name = 'Bob' where id = 10 and isBooked = false;Rendering diagram...
SQL pagination with offset is very slow
This is slow, because, DBMS needs to iterate through the first 10,00,010 rows,
and then filter out the last 10 rows.
1select * from users order by id limit 10 offset 1000000;instead, make user query for more data using the last-seen-id.
1select * from users where id > 1000000 order by id limit 10;1select * from users order by id limit 10 offset 1000000;1select * from users where id > 1000000 order by id limit 10;Connection pooling.
For every connection, the DBMS needs to allocate resources, this takes time.
If we create a connection for every request, the latency and throughput will be very bad.
If the no of connections increase, the DBMS will overload and crash.
Connection pooling is a way to reuse connections.
The connections are kept alive, and reused.
Each connection is leased to a request, and returned to the pool after the request is complete.
Rendering diagram...
Default connection pool size is 10, in postgres.
Rendering diagram...
Each instance keeps alive 10 connections to the DB.
Each instance can send upto 10 concurrent requests to the DB.
Each function invocation leases a connection from the pool.
Once the function is executed, the connection is returned to the pool.
1const { Pool } = require('pg');
2const pool = new Pool({
3 max: 20,
4 idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
5 connectionTimeoutMillis: 2000, // timeout after 2 seconds
6});
7
8await pool.query('SELECT something FROM sometable');We cannot begin a transaction on a pooled connection, transactions need a dedicated connection.
// asking the pool to give us a dedicated connection
const client = await pool.connect();
try {
await client.query('BEGIN');
await client.query('COMMIT');
} catch (e) {
await client.query('ROLLBACK');
} finally {
// release the connection to the pool
client.release();
}Rendering diagram...
Rendering diagram...
1const { Pool } = require('pg');
2const pool = new Pool({
3 max: 20,
4 idleTimeoutMillis: 30000, // connection will be closed after 30 seconds in idle
5 connectionTimeoutMillis: 2000, // timeout after 2 seconds
6});
7
8await pool.query('SELECT something FROM sometable');1// asking the pool to give us a dedicated connection
2const client = await pool.connect();
3try {
4 await client.query('BEGIN');
5 await client.query('COMMIT');
6} catch (e) {
7 await client.query('ROLLBACK');
8} finally {
9 // release the connection to the pool
10 client.release();
11}1// asking the pool to give us a dedicated connection
2const client = await pool.connect();
3try {
4 await client.query('BEGIN');
5 await client.query('COMMIT');
6} catch (e) {
7 await client.query('ROLLBACK');
8} finally {
9 // release the connection to the pool
10 client.release();
11}1docker exec -it pg bash;
2apt-get update;
3apt-get install vim;1openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 3651vim var/lib/postgresql/data/postgresql.conf
2ssl = on
3ssl_cert_file = 'cert.pem'
4ssl_key_file = 'key.pem'Best practices when working with databases
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Separate read and write pools.
This adds additional security to the database.
An compromised read only user, cannot drop tables.
Use Pools
All read operations should use a read only pool.
A database pool is a collection of TCP connections.
Each connection is leased to a request, and returned to the pool after the request is complete.
This way we can reuse connections, and avoid creating new connections for every request.
Less overhead to create and close connections for the database.
Don't use unbounded queries
select * from users instead of
select * from users limit 10 offset 1000000.
Use template literals
select * from users where id = ${id} instead of
select * from users where id = hardcoded_id.
Don't show database errors to the client
Show generic error messages to the client instead.
Why does posgres do a bit map heap scan, instead of an index only scan?
1create index g_index on grades(g);
2explain analyze select count(*) from grades where g = 1;Postgres needs to visit the heap to check if the row is visible to all transactions.
The visibility col is a hidden col in the heap for each row.
Visibility map is used to check if the row is visible to all transactions.
1vaccum grades;vaccum command creates a visibility map.
1create index g_index on grades(g);
2explain analyze select count(*) from grades where g = 1;1vaccum grades;What is cost in explain analyze?
Cost is the estimated cost of the query.
Its just a number that postgres uses.
Higher the cost, slower the query.
All Isolation levels
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 11
-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10
-- locks the rows that we read
-- parallel transaction cannot update the rows that we read
-- parallet transaction can append new rows
select COUNT(*) from users where id GTE 100 and id LTE 200;
-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Repeatable read
The rows that we read, will remain the same, until the transaction is complete.
Obtains the lock on the things we read.
Phantom reads are possible. ( locks are not obtained on the rows that we did not read before )
New rows that where inserted during the transaction, will not be visible to the transaction.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transaction1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transactionSnapshot isolation level
Each transaction has a version.
xmin is the transaction version of the transaction that created the row.
xmax is the transaction version of the transaction that deleted the row.
xmin and xmax are stored in the heap.
When I start a transaction, only the rows that are visible to the transaction are returned.
Any new rows that are inserted during the transaction, will not be visible to the transaction.
Any rows that are deleted during the transaction, will be visible to the transaction.
xmin is used to filter out rows that where inserted after the transaction
started.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 101BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 10Repeatable read vs Snapshot isolation level
Depends on the database.
In Posgres Repeatable read is Repeatable read.
Select for update. (Pessimistic locking)
1select * from users where id = 1 for update;Obtains an exclusive lock on the rows that we read.
Other transactions cannot read / update the rows that where locked.
They have to wait for the previous transaction to release the lock.
1select * from users where id = 1 for update;Serializable isolation level. (Optimistic locking)
Uses MVCC to check if the rows that we read are modified by other
transactions.
Detects conflicts, and fails the transaction.
Duplicate indexes in a row
Adding an index to a column that has less unique values, is not useful. (Index selectivity) Index is not selective enough.
Index is bloated and takes up more space, (same value is repeated multiple times, in the b-tree)
Postgres 13 has a de-duplication feature, that removes duplicate values from the b-tree index. node - row_id[]
Why use SELECT FOR UPDATE?
Each transaction obtains a exclusive lock on the rows that it needs to read / update.
Other transactions cannot read / update the rows that where locked.
You killed concurrency, your throughput is low.
Your transactions will never fail due to conflicts.
Serialisable isolation level is better.
Detects conflicts and fails the transaction, if there is a serialisable error.
Your throughput is high, and you have concurrency.
Its a optimistic locking mechanism.
Your transactions can fail, due to conflicts.
You need to handle retry logics in your application.
Why use a connection pool?
If you are using one database connection for every request, you are doing it wrong.
TCP blocking could happen if a lot of queries need to be executed at the same time.
Rendering diagram...
Rendering diagram...
There is a chance of response misdirection.
Depends on the DB implementation, if the protocol used by the database does not use query tagging, the response could be misdirected.
But in case of a connection pool,
Each connection is leased to a request, and returned to the pool after the request is complete.
No parallel requests are made on the same connection.
Rendering diagram...
Rendering diagram...
What is the value of a bitmap index scan?
Row can be in multiple pages.
Creates a bit map on the first condition, if index is present for the same value. This creates a bit map of pages that satisfy the first condition.
Updates a bit map on the second condition.
Goes to the heap pulls those pages.
Filters out the page for the query condition
See, "Bitmap And / Or" for more info this file
If the query planner decides it has to scan a lot of rows, based on the query condition, it will decide to do an bitmap scan vs an index scan, to reduce IO.
Does composite index order matter?
Write amplifications, composite indexes touches a lot of cols, the index needs to be updated on row update.
Composite indexes are created from left to right.
Minimum filter first. (filter that results in a larger dataset first)
1select a,b,c where a=10 and b=10 and c=20;
2-- a is minimum filter1select a,b,c where a=10 and b=10 and c=20;
2-- a is minimum filterIn an index only scan, pg still needs to visit the heap?
To check if the table is visible to the transaction.
To get old versions (snapshot) of the row, when the transaction started.
To avoid visits to the heap, use VACUUM command.
Avoid long running transactions.
Persistence vs Durability
Persistence is the ability to store data on disk and retrieve it at a later time after a crash.
My transaction is durable is only successful, when the data is stored on disk. Postgres and Redis use WAL's to flush data to disk periodically.
Can we create index when row's are being inserted / updated?
Most DB's to create index obtains a update / insert / delete lock on the entire table, until the index is created.
Postgres allows you to create index concurrently, without locking the table.
1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transaction1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 101select * from users where id = 1 for update;1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
1begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
Rendering diagram...
1begin transaction isolation level repeatable read;1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;1begin transaction isolation level serializable;Rendering diagram...
Isolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
1begin transaction isolation level serializable;1begin transaction isolation level serializable;Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
Index
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
Bitmap
If the no of rows to be scanned is too high, and the
Rendering diagram...
bitmap is a bit array of the index values, pointing to the page on heap.
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Rendering diagram...
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the condition1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Bitmap And / Or
Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)1explain select id from t_records where t > 5000 and t < 1000;Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditions1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
1vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
1vacuum t_records;Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Adding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Composite / Combined Indexes
1create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
When should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
1create index a_b_c_combined_idx on temp(a,b,c);1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1create table grades (id serial not null, g not null) partition by range (g);
2
3create table g0035 (like grades including indexes);
4create table g3570 (like grades including indexes);
5create table g7080 (like grades including indexes);
6create table g80100 (like grades including indexes);
7
8alter table attach partition g0035 for values from (0) to (35);
9alter table attach partition g3570 for values from (35) to (70);
10alter table attach partition g7080 for values from (70) to (80);
11alter table attach partition g80100 for values from (80) to (100);1select max(g) from g0035; -- 34
2select max(g) from g3570; -- 69
3select max(g) from g7080; -- 79
4select max(g) from g80100; -- 991set enable_partition_pruning = on;1// asking the pool to give us a dedicated connection
2const client = await pool.connect();
3try {
4 await client.query('BEGIN');
5 await client.query('COMMIT');
6} catch (e) {
7 await client.query('ROLLBACK');
8} finally {
9 // release the connection to the pool
10 client.release();
11}1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transaction1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 101select * from users where id = 1 for update;1BEGIN TRANSACTION;
2SELECT balance FROM accounts WHERE account_id = 1;1balance - 100 > 0;1UPDATE accounts SET balance = balance - 100
2WHERE account_id = 1;Read phenomena's,
Dirty reads
A read of uncommitted data,
the data can be rolled back / committed.
The data needs to be fully flushed to disk (committed to disk)
Rendering diagram...
Non-repeatable reads
Two different reads of the same data, return different results.
Rendering diagram...
Avoid it by starting a repeatable read transaction,
1begin transaction isolation level repeatable read;Phantom reads
A read of a set of rows,
where the set of rows changes between reads.
Rendering diagram...
On postgres you can use the following to prevent phantom reads,
1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;mysql,
begin transaction isolation level serializable;Lost updates
Two transactions update the same data,
and one overwrites the other.
The read of the first transaction is lost.
Rendering diagram...
1begin transaction isolation level repeatable read;1begin transaction isolation level repeatable read;
2# or
3begin transaction isolation level serializable;1begin transaction isolation level serializable;Rendering diagram...
Isolation levels
Read uncommitted
Dirty reads are allowed.
Read committed
Dirty reads are not allowed.
Non-repeatable reads are allowed,
committed data will be picked up mid transaction
Repeatable reads
Read query results in same transactions are always consistent.
Snapshot
Only changes committed up to the start of the transaction are visible.
Serializable
All transactions are executed in order.
*Each DBMS system implements isolation levels differently
read uncommitted
⛔
⛔
⛔
⛔
read committed
✅
⛔
⛔
⛔
repeatable read
✅
✅
⛔
⛔
snapshot
✅
✅
✅
⛔
serializable
✅
✅
✅
✅
Serialisable is slower than snapshots
Approaches to isolation
Pessimist approach
Data (rows / tables / page) are locked until they are committed.
Expensive and slow ( transaction wait for locks to be released
).
Needs to track locked data.
Optimestic approach
No locks,
If there is a conflict, the transaction fails.
Repeatable read locks
Locks the data that is read,
locks are released when the transaction is committed.
Searliability
Optimistic concurrency control, no locks until conflicts during commit.
Pessimistic concurrency control, locks are used to prevent conflicts.
1begin transaction isolation level serializable;1begin transaction isolation level serializable;Sequential
If the no of rows to be scanned is too high,
it would be easier to iterate over the heap,
instead of iterating over the index and then going to the heap.
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 10;
2
3 Seq Scan on t_records (cost=0.00..15233.01 rows=898039 width=4)
4 Filter: (t > 10)
5 (2 rows)Rendering diagram...
Index
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
1explain select id from t_records where t > 5000;
2
3 Index Scan using t_records_t_idx on t_records (cost=0.42..8.44 rows=1 width=4)
4 Index Cond: (t > 5000)
5 (2 rows)Rendering diagram...
Bitmap
If the no of rows to be scanned is too high, and the
Rendering diagram...
bitmap is a bit array of the index values, pointing to the page on heap.
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the conditionPostgres would only scan the pages that have rows that satisfy the condition.
1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Postgres may decide to do a bitmap scan based on the where clause, if it needs to compare index with multiple other data-points.
Rendering diagram...
1pages = [0, 0, 1, 0, 0, 0, ...] # 1 if page has rows that satisfy the condition1explain select id from t_records where t > 4000;
2
3 Bitmap Heap Scan on t_records (cost=1942.75..8103.70 rows=174236 width=4)
4 Recheck Cond: (t > 4000)
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..1899.19 rows=174236 width=0)
6 Index Cond: (t > 4000)
7 (4 rows)Bitmap And / Or
Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditionsexplain select id from t_records where t > 5000 and t < 1000;
Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
Recheck Cond: ((t > 5000) AND (t < 1000))
-> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
Index Cond: ((t > 5000) AND (t < 1000))
(4 rows)1explain select id from t_records where t > 5000 and t < 1000;Rendering diagram...
1condition_1 = [0, 0, 1, 0, 0, 1, ...]
2condition_2 = [0, 0, 1, 0, 1, 0, ...]
3
4bitmap_and = [0, 0, 1, 0, 0, 0, ...] # only page 3 staifies both conditions1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;Vaccum
The vaccum command is a clean up function, that's auto run by postgres, to
reclaim disk space.
1vacuum t_records;The vacuum command is recommended to be run after a large number of CRED
operations.
1vacuum t_records;Analyze
Use analyze to update statistics about the table after large no of inserts,
this will help the database to make better decisions when executing queries.
Non-Key Indexes / Covering Indexes
Indexes with reference to non-key columns, are called covering indexes.
This allows for the database to serve the query from the index itself,
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Adding large columns to the index, will increase the size of the index. This makes the index unable to fit in memory, and thus slower (stored in disk instead).
When searching for b / c, postgres cannot use a_idx, and must go to the heap.
1create index a_idx on temp(a) include (b,c);Rendering diagram...
Composite / Combined Indexes
1create index a_b_c_combined_idx on temp(a,b,c);Multiple permutations of indexes can be created, to serve different queries.
The index will be ordered from left to right.
1
2
3
1
2
4
1
3
3
1
3
4
2
2
3
Postgres will decide to use combined indexes,
when we are comparing multiple columns in the where clause.
Since the index is ordered by the left-most column,
then database does not need to scan the entire index,
it can just scan the index until the first non-matching value.
1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
When should I use combined indexes?
When using a where clause with multiple columns,
Combined indexes are faster than covered indexes,
because the database does not need to do additional I/O's to map
intersections between data sets.
When should I not use combined indexes?
When not including the first column in the where clause. (just use a covering index)
When not using the second and up columns in the where clause, and just
selecting them.
(consider using a covering index, this avoids creating b-trees that are not
used)
1create index a_b_c_combined_idx on temp(a,b,c);1select * from temp where a > 1 and b < 3 and c > 5;Rendering diagram...
Clustered Indexes vs Non-Clustered Indexes
Data is ordered
Data is not ordered
Reads are faster
Slower
Edits are Slower
Faster
In mySQL, the primary key is a clustered index.
The heap itself is ordered by the primary key.
Inserts, Deletes and Updates are slower, in clustered indexes,
since the database needs to re-order the data, by physically moving rows.
Postgres by default does not support clustered indexes.
all rows are inserted into the end of the table, with a new row_id
How Database Optimizers Decide to Use Indexes
If we get too many rows from the index, it will be faster to just scan the heap.
If two indexes are used, one index returns more rows than the other,
the database will use the index that returns the least rows.
1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1create table grades (id serial not null, g not null) partition by range (g);
2
3create table g0035 (like grades including indexes);
4create table g3570 (like grades including indexes);
5create table g7080 (like grades including indexes);
6create table g80100 (like grades including indexes);
7
8alter table attach partition g0035 for values from (0) to (35);
9alter table attach partition g3570 for values from (35) to (70);
10alter table attach partition g7080 for values from (70) to (80);
11alter table attach partition g80100 for values from (80) to (100);1select max(g) from g0035; -- 34
2select max(g) from g3570; -- 69
3select max(g) from g7080; -- 79
4select max(g) from g80100; -- 991set enable_partition_pruning = on;1// asking the pool to give us a dedicated connection
2const client = await pool.connect();
3try {
4 await client.query('BEGIN');
5 await client.query('COMMIT');
6} catch (e) {
7 await client.query('ROLLBACK');
8} finally {
9 // release the connection to the pool
10 client.release();
11}1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 11
10
11-- same query returns different results, within the same transaction1BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
2select COUNT(*) from users where id GTE 100 and id LTE 200;
3-- val 10
4
5-- locks the rows that we read
6-- parallel transaction cannot update the rows that we read
7-- parallet transaction can append new rows
8select COUNT(*) from users where id GTE 100 and id LTE 200;
9-- val 101select * from users where id = 1 for update;1begin transaction isolation level serializable;1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)1begin transaction isolation level serializable;1explain select id from t_records where t > 5000 and t < 1000;
2
3 Bitmap Heap Scan on t_records (cost=62.55..4302.44 rows=4500 width=4)
4 Recheck Cond: ((t > 5000) AND (t < 1000))
5 -> Bitmap Index Scan on t_records_t_idx (cost=0.00..61.42 rows=4500 width=0)
6 Index Cond: ((t > 5000) AND (t < 1000))
7 (4 rows)