<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Sagyam's Blog]]></title><description><![CDATA[Sagyam's Blog]]></description><link>https://blog.sagyamthapa.com.np</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 13:04:44 GMT</lastBuildDate><atom:link href="https://blog.sagyamthapa.com.np/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[An Interactive Guide To Count Min Sketch]]></title><description><![CDATA[Introduction
Count min sketch is a probabilistic data structure that can estimate the frequency of items in a stream. It is an improvement over Hyperloglog. While hyperloglog can estimate the number of unique items in a fixed amount of data, count mi...]]></description><link>https://blog.sagyamthapa.com.np/an-interactive-guide-to-count-min-sketch</link><guid isPermaLink="true">https://blog.sagyamthapa.com.np/an-interactive-guide-to-count-min-sketch</guid><category><![CDATA[count-min-sketch]]></category><category><![CDATA[probabilistic data structure]]></category><dc:creator><![CDATA[Sagyam Thapa]]></dc:creator><pubDate>Wed, 25 Jun 2025 17:58:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750873204106/b7240185-5094-44a3-bb6f-e5eef6317dc1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Count-min sketch is a probabilistic data structure that can estimate the frequency of items in a stream. It is a natural companion to <a target="_blank" href="https://blog.sagyamthapa.com.np/an-interactive-guide-to-hyperloglog">Hyperloglog</a>: while hyperloglog estimates the number of unique items in a dataset, count-min sketch estimates how often each item appears, even over an unbounded stream. Think of hyperloglog as something that can count the unique items in an image <em>(something that is fixed)</em>, while count-min sketch can track item frequencies in a live video stream <em>(a stream of data)</em>. Even when you don’t know how long the data stream is, you can still estimate the frequency of items in it.</p>
<p>This blog is the third installment of my probabilistic data structure series. I have written similar interactive guides on <a target="_blank" href="https://blog.sagyamthapa.com.np/an-interactive-guide-to-bloom-filter">Bloom Filter</a> and <a target="_blank" href="https://blog.sagyamthapa.com.np/an-interactive-guide-to-hyperloglog">Hyperloglog</a>. If you are unfamiliar with probabilistic data structures, the guide on <a target="_blank" href="https://blog.sagyamthapa.com.np/an-interactive-guide-to-hyperloglog">Hyperloglog</a> is a good place to start.</p>
<h2 id="heading-working-principle">Working principle</h2>
<ul>
<li><p><strong>A Count-Min Sketch is made of</strong></p>
<ul>
<li><p>A 2D array of counters with <code>d</code> rows and <code>w</code> columns.</p>
</li>
<li><p>Each row has its own hash function <em>(h1, h2, h3..)</em>.</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750863919322/0160d9fe-f665-48a0-baa3-816d7ec5fa04.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
</li>
</ul>
    <div data-node-type="callout">
    <div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Here we hash the first item from the stream of data (the number 2) and obtain the cell positions that need to be incremented.</div>
    </div>

<ul>
<li><p><strong>Insert Operation (Adding an item):</strong></p>
<ul>
<li><p>For an item <code>x</code>, hash it using all <code>d</code> hash functions.</p>
</li>
<li><p>For each hash, increment the corresponding counter in its row:</p>
<pre><code class="lang-python">  count[i][hash_i(x)] += <span class="hljs-number">1</span>
</code></pre>
</li>
</ul>
</li>
<li><p><strong>Query Operation (Getting frequency estimate of</strong> <code>x</code>):</p>
<ul>
<li><p>Hash <code>x</code> with the same <code>d</code> hash functions.</p>
</li>
<li><p>Fetch the counts from the corresponding cells.</p>
</li>
<li><p><strong>Return the minimum</strong> value among those <code>d</code> counters:</p>
<pre><code class="lang-python">  estimate = min(count[<span class="hljs-number">0</span>][h1(x)], count[<span class="hljs-number">1</span>][h2(x)], ..., count[d<span class="hljs-number">-1</span>][hd(x)])
</code></pre>
</li>
</ul>
</li>
</ul>
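<p>Putting the insert and query operations together, here is a minimal Python sketch of the data structure. Salting a single <code>hashlib</code> digest with the row index, as done below, is just one convenient way to simulate <code>d</code> independent hash functions; production implementations usually use faster pairwise-independent hashes.</p>
<pre><code class="lang-python">import hashlib

class CountMinSketch:
    def __init__(self, d, w):
        self.d = d                                  # rows, one per hash function
        self.w = w                                  # columns per row
        self.count = [[0] * w for _ in range(d)]

    def _hash(self, x, i):
        # Salt one hash function with the row index to get d "independent" hashes
        digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def insert(self, x):
        for i in range(self.d):
            self.count[i][self._hash(x, i)] += 1    # bump one cell per row

    def query(self, x):
        # The minimum across rows is the least-inflated estimate
        return min(self.count[i][self._hash(x, i)] for i in range(self.d))
</code></pre>
<p>Inserting <code>"apple"</code> three times makes <code>query("apple")</code> return at least 3, never less — the no-underestimate guarantee in action.</p>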
<p>I have created a fun little <a target="_blank" href="https://tools.sagyamthapa.com.np/cms-working">app</a> that lets you see the working of Count-Min Sketch. Adjust the number of rows and columns, then click to generate a random number. The number is hashed once for each row, and each hash locates a cell whose value is incremented by one. Clicking on a number follows a similar process, except instead of incrementing the values, we take the minimum of all the cells to get the estimate for that item.</p>
<p><strong>Can you get a count-min sketch to always get it right?</strong></p>
<div class="hn-embed-widget" id="cms-working"></div><p> </p>
<h2 id="heading-fun-facts">Fun facts</h2>
<ul>
<li><p>It is called count-min sketch because it counts the minimum from a sketch <em>(sketch is like a compact summary of a large dataset)</em>.</p>
</li>
<li><p>It has sub-linear space complexity, meaning it takes less space than storing an accurate count.</p>
</li>
<li><p>The reason it never underestimates is that counters can only ever be incremented, and the minimum count is taken.</p>
</li>
<li><p>Increasing <code>d</code> (rows) gives a higher probability of an accurate result because there are more independent estimates, but each operation takes more time.</p>
</li>
<li><p>Increasing <code>w</code> (columns) means better accuracy due to less chance of collision but more memory usage.</p>
</li>
</ul>
<h2 id="heading-demo">Demo</h2>
<p>I have created a fun little <a target="_blank" href="https://tools.sagyamthapa.com.np/count-min-sketch">app</a> that puts all the pieces together to show you how count-min sketch works. Here the app guesses the frequency of fruits in a stream of 5000 fruits. Hit start and watch the stream of fruits appear, and see how the hash table is updated in real time. Notice that the count-min sketch never underestimates the real amount.</p>
<div class="hn-embed-widget" id="count-min-sketch"></div><p> </p>
<h2 id="heading-mathematical-relationships">Mathematical Relationships</h2>
<h3 id="heading-error-bounds">Error Bounds</h3>
<ul>
<li><p>Error in frequency estimate ≤ ε × N with probability 1 - δ</p>
</li>
<li><p>Where:</p>
<ul>
<li><p>ε = error factor (e.g., 0.001 means 0.1% error)</p>
</li>
<li><p>N = total number of items processed</p>
</li>
<li><p>δ = failure probability (e.g., 0.01 means 99% confidence)</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-formula-for-parameters">Formula for Parameters</h3>
<ul>
<li><p>Width: <code>w = ⌈e/ε⌉ (where e ≈ 2.718)</code></p>
</li>
<li><p>Depth: <code>d = ⌈ln(1/δ)⌉</code></p>
</li>
</ul>
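<p>These formulas translate directly into code. As a quick sanity check with the example values above, ε = 0.001 and δ = 0.01 give a table of 5 rows by 2719 columns — only 13,595 counters regardless of how many items the stream contains.</p>
<pre><code class="lang-python">import math

def cms_parameters(epsilon, delta):
    """Width and depth for error bound epsilon with failure probability delta."""
    w = math.ceil(math.e / epsilon)     # w = ceil(e / epsilon)
    d = math.ceil(math.log(1 / delta))  # d = ceil(ln(1 / delta))
    return w, d
</code></pre>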
<h3 id="heading-use-cases"><strong>Use Cases</strong></h3>
<ol>
<li><p>Finding heavy hitters in a stream.</p>
</li>
<li><p>Detecting DDoS attack.</p>
</li>
<li><p>Tracking popular search queries in search engine</p>
</li>
</ol>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a target="_blank" href="https://dsf.berkeley.edu/cs286/papers/countmin-latin2004.pdf">Paper</a></p>
</li>
<li><p><a target="_blank" href="https://www.wikiwand.com/en/articles/Count%E2%80%93min_sketch">Wikipedia</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[An Interactive Guide To Caching Strategies]]></title><description><![CDATA[Introduction
Word cache originates from French word cacher which means to hide. Outside computer science circle it refers to a secret place where you hide things, usually emergency supplies. In computer science though the meaning of the word is flipp...]]></description><link>https://blog.sagyamthapa.com.np/an-interactive-guide-to-caching-strategies</link><guid isPermaLink="true">https://blog.sagyamthapa.com.np/an-interactive-guide-to-caching-strategies</guid><category><![CDATA[caching strategies]]></category><category><![CDATA[interactive guide]]></category><dc:creator><![CDATA[Sagyam Thapa]]></dc:creator><pubDate>Fri, 20 Jun 2025 14:42:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750428353601/a835e7d5-e3d1-4776-9702-a2959671c18c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>The word cache originates from the French word <code>cacher</code>, which means <code>to hide</code>. Outside computer science circles, it refers to a secret place where you hide things, usually emergency supplies. In computer science, though, the meaning of the word is flipped: a cache is a place where you store your frequently accessed data. Caching is one of the most effective ways to improve application performance, but choosing the right caching strategy can be tricky. Each strategy has its own strengths, trade-offs, and ideal use cases.</p>
<h2 id="heading-terminologies">Terminologies</h2>
<ul>
<li><p><strong>Cache Hit:</strong> When the data you are looking for is found in the cache</p>
</li>
<li><p><strong>Cache Miss:</strong> When the data you are looking for is not found in the cache</p>
</li>
<li><p><strong>Asynchronous Writes:</strong> When you write multiple items to the database back to back without waiting for the last write to complete</p>
</li>
<li><p><strong>Eventual Consistency:</strong> A data syncing model where updates are not propagated immediately, but all copies converge over time</p>
</li>
<li><p><strong>Cache Stampede:</strong> A situation where data that is not in the cache is suddenly in high demand, sending a flood of requests to the database</p>
</li>
<li><p><strong>Pre-Warming cache:</strong> Loading the cache with frequently used data before it’s even requested.</p>
</li>
<li><p><strong>Cache pollution:</strong> When rarely re-used data fills the cache and evicts frequently accessed data, degrading performance</p>
</li>
</li>
</ul>
<p>In this guide, we'll explore six common caching strategies that every developer should understand.</p>
<h2 id="heading-cache-aside">Cache Aside</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750427846671/1bcfccb4-77e4-4c3b-b97d-9f80c35a4db1.png" alt="Diagram showing a user request process involving a server, cache, and database. Steps: 1. User asks server for data. 2. Cache miss occurs. 3. Server reads from database. 4. Database sends data. 5. Server updates cache." class="image--center mx-auto" /></p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Note that at step 2 the cache miss is reported to the server, and the server is responsible for updating the cache.</div>
</div>

<p><strong>Introduction:</strong> The server takes the responsibility of managing the cache</p>
<p><strong>Cache Hit Behavior:</strong> Return data directly from cache</p>
<p><strong>Cache Miss Behavior:</strong> Load from database, update cache, return data</p>
<p><strong>Write Behavior:</strong> Write to database only, invalidate cache entry</p>
<p><strong>Consistency:</strong> Eventual consistency, possible stale data</p>
<p><strong>Performance:</strong> Fast reads on hit, slower on miss</p>
<p><strong>Use Cases:</strong> Read-heavy workloads, unpredictable access patterns</p>
<p><strong>Advantages:</strong> Simple implementation, application has full control</p>
<p><strong>Disadvantages:</strong> Cache stampede risk, code duplication across services</p>
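<p>The hit, miss, and write behaviors above fit in a few lines of Python. This is only an illustrative sketch: the <code>db</code> dict stands in for a real database, and a plain dict plays the role of the cache client.</p>
<pre><code class="lang-python">class CacheAside:
    """Cache-aside sketch: the application manages the cache itself."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def read(self, key):
        if key in self.cache:          # cache hit: return directly
            return self.cache[key]
        value = self.db[key]           # cache miss: load from the database...
        self.cache[key] = value        # ...then populate the cache
        return value

    def write(self, key, value):
        self.db[key] = value           # write to the database only...
        self.cache.pop(key, None)      # ...and invalidate the cache entry
</code></pre>
<p>Invalidating on write (rather than updating the cache) is what makes the strategy eventually consistent: the fresh value only enters the cache on the next read.</p>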
<h3 id="heading-link-to-interactive-apphttpstoolssagyamthapacomnpcachingstrategycache-aside"><a target="_blank" href="https://tools.sagyamthapa.com.np/caching?strategy=cache-aside">Link to interactive app</a></h3>
<div class="hn-embed-widget" id="cache-aside"></div><p> </p>
<h2 id="heading-read-through">Read Through</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750427868738/3495c5aa-5ab5-491f-9e5b-4286e438809d.png" alt="Flowchart illustrating a cache system. The user requests data from the server, which first attempts to read from the cache. If a cache miss occurs, the server reads from the database, retrieves the data, and updates the cache before returning the data to the user." class="image--center mx-auto" /></p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Note that at step 2 the cache does not actually return a cache miss to the server. Instead it transparently fetches the data from the database, updates itself, and sends the data to the server.</div>
</div>

<p><strong>Introduction:</strong> The cache itself takes full responsibility for fetching and storing data</p>
<p><strong>Cache Hit Behavior:</strong> Cache returns data directly</p>
<p><strong>Cache Miss Behavior:</strong> Cache loads from database transparently</p>
<p><strong>Write Behavior:</strong> Typically combined with write-through or write-back</p>
<p><strong>Consistency:</strong> Depends on write strategy used</p>
<p><strong>Performance:</strong> Consistent read performance</p>
<p><strong>Use Cases:</strong> Uniform data access patterns</p>
<p><strong>Advantages:</strong> Simplified application code, centralized cache logic</p>
<p><strong>Disadvantages:</strong> Less flexibility, cache becomes critical component</p>
<h3 id="heading-link-to-interactive-apphttpstoolssagyamthapacomnpcachingstrategyread-through"><a target="_blank" href="https://tools.sagyamthapa.com.np/caching?strategy=read-through">Link to interactive app</a></h3>
<div class="hn-embed-widget" id="read-through"></div><p> </p>
<h2 id="heading-write-through">Write Through</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750428483063/6a3b5207-1993-42ff-88db-cc59acaf7ce3.png" alt="Diagram showing data flow from a user to a server, then writing to a cache and synchronously to a database." class="image--center mx-auto" /></p>
<p><strong>Introduction:</strong> Writes go to the cache first, then are immediately written to the database</p>
<p><strong>Cache Hit Behavior:</strong> Return cached data</p>
<p><strong>Cache Miss Behavior:</strong> Load from database into cache</p>
<p><strong>Write Behavior:</strong> Write to cache and database before confirming</p>
<p><strong>Consistency:</strong> Strong consistency guaranteed</p>
<p><strong>Performance:</strong> Slower writes due to dual updates</p>
<p><strong>Use Cases:</strong> Financial systems, inventory management</p>
<p><strong>Advantages:</strong> No data loss risk, always consistent</p>
<p><strong>Disadvantages:</strong> Higher write latency, no benefit for write-heavy loads</p>
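<p>A minimal Python sketch of the same idea, again with plain dicts standing in for the cache and the database. The write confirms only after both stores are updated, which is exactly where the extra write latency comes from.</p>
<pre><code class="lang-python">class WriteThroughCache:
    """Write-through sketch: every write updates cache and database together,
    so the two never diverge."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value        # update the cache...
        self.db[key] = value           # ...and the database synchronously
        return value                   # confirm only after both succeed

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db[key]           # miss: load from database into cache
        self.cache[key] = value
        return value
</code></pre>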
<h3 id="heading-link-to-interactive-apphttpstoolssagyamthapacomnpcachingstrategywrite-through"><a target="_blank" href="https://tools.sagyamthapa.com.np/caching?strategy=write-through">Link to interactive app</a></h3>
<div class="hn-embed-widget" id="write-through"></div><p> </p>
<h2 id="heading-write-back">Write Back</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750427984765/1c579fcb-c9d0-4fd6-90d2-b06228b3c446.png" alt="Diagram showing a user interacting with a server, which writes constantly to a cache. The cache asynchronously writes to a database occasionally." class="image--center mx-auto" /></p>
<p><strong>Introduction:</strong> Writes go to the cache first and are eventually written to the database</p>
<p><strong>Cache Hit Behavior:</strong> Return cached data</p>
<p><strong>Cache Miss Behavior:</strong> Load from database if not in write queue</p>
<p><strong>Write Behavior:</strong> Write to cache immediately, batch/delay database writes</p>
<p><strong>Consistency:</strong> Eventual consistency</p>
<p><strong>Performance:</strong> Very fast writes</p>
<p><strong>Use Cases:</strong> Write-heavy workloads, analytics data</p>
<p><strong>Advantages:</strong> Excellent write performance, reduced database load</p>
<p><strong>Disadvantages:</strong> Risk of data loss, complex failure handling</p>
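<p>Here is an illustrative Python sketch. The <code>batch_size</code> threshold is a made-up flushing policy for the demo; real systems flush on timers, size thresholds, or eviction. Note how the database stays empty until a flush happens — that gap is the data-loss risk of this strategy.</p>
<pre><code class="lang-python">class WriteBackCache:
    """Write-back sketch: writes hit the cache immediately and are flushed
    to the database in batches."""

    def __init__(self, db, batch_size=3):
        self.db = db
        self.cache = {}
        self.dirty = set()             # keys written but not yet persisted
        self.batch_size = batch_size

    def write(self, key, value):
        self.cache[key] = value        # fast path: cache only
        self.dirty.add(key)
        if len(self.dirty) >= self.batch_size:
            self.flush()               # batch the slow database writes

    def flush(self):
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()
</code></pre>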
<h3 id="heading-link-to-interactive-apphttpstoolssagyamthapacomnpcachingstrategywrite-back"><a target="_blank" href="https://tools.sagyamthapa.com.np/caching?strategy=write-back">Link to interactive app</a></h3>
<div class="hn-embed-widget" id="write-back"></div><p> </p>
<h2 id="heading-write-around">Write Around</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750428701241/64dd991a-96a5-457d-8534-ac7495f87531.png" alt class="image--center mx-auto" /></p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Note that at step 4 the cache miss is not surfaced to the server; instead the cache fetches the data from the database, keeps a copy for itself, and returns the data. Also note that step 2 can be asynchronous or synchronous.</div>
</div>

<p><strong>Introduction:</strong> Writes bypass the cache and go directly to the database</p>
<p><strong>Cache Hit Behavior:</strong> Return cached data</p>
<p><strong>Cache Miss Behavior:</strong> Load from database, optionally cache</p>
<p><strong>Write Behavior:</strong> Direct to database, don't update cache</p>
<p><strong>Consistency:</strong> Cache can serve stale data after a write until the entry expires or is re-fetched</p>
<p><strong>Performance:</strong> Good for infrequent reads after writes</p>
<p><strong>Use Cases:</strong> Bulk imports, audit logs</p>
<p><strong>Advantages:</strong> Prevents <a target="_blank" href="https://www.wikiwand.com/en/articles/Cache_pollution">cache pollution</a>, simpler write path</p>
<p><strong>Disadvantages:</strong> First read after write is slow</p>
<h3 id="heading-link-to-interactive-apphttpstoolssagyamthapacomnpcachingstrategywrite-around"><a target="_blank" href="https://tools.sagyamthapa.com.np/caching?strategy=write-around">Link to interactive app</a></h3>
<div class="hn-embed-widget" id="write-around"></div><p> </p>
<h2 id="heading-refresh-ahead">Refresh Ahead</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750427956768/89630563-1a9c-41d3-9ce4-17b7f6d1985d.png" alt class="image--center mx-auto" /></p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Note that it’s not wise to re-fetch data just because it’s stale; this causes unnecessary strain on the database. It is better to re-fetch stale data only when it’s requested.</div>
</div>

<p><strong>Introduction:</strong> Proactively refreshes cache entries before they expire; needs pairing with another strategy for writes</p>
<p><strong>Cache Hit Behavior:</strong> Always return fresh data</p>
<p><strong>Cache Miss Behavior:</strong> Rare, only on first access</p>
<p><strong>Write Behavior:</strong> Depends on combined strategy</p>
<p><strong>Consistency:</strong> Near real-time data freshness</p>
<p><strong>Performance:</strong> Consistent fast reads</p>
<p><strong>Use Cases:</strong> Frequently accessed data, predictable patterns</p>
<p><strong>Advantages:</strong> Minimizes cache misses, predictable performance</p>
<p><strong>Disadvantages:</strong> Wastes resources on unused data, complex prediction logic</p>
<h3 id="heading-link-to-interactive-apphttpstoolssagyamthapacomnpcachingstrategyrefresh-ahead"><a target="_blank" href="https://tools.sagyamthapa.com.np/caching?strategy=refresh-ahead">Link to interactive app</a></h3>
<div class="hn-embed-widget" id="refresh-ahead-strategy"></div><p> </p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>As you see from these demos, picking the best caching strategy depends on your use case. Remember that these strategies aren't mutually exclusive. Many production systems combine strategies.</p>
<p>Here are some of my recommendations:</p>
<h3 id="heading-content-management-system">Content Management System</h3>
<p><strong>Recommended:</strong> Write Through + Read Through</p>
<p><strong>Requirements:</strong></p>
<ul>
<li><p>Ensures published content is immediately available</p>
</li>
<li><p>Simplifies application logic</p>
</li>
<li><p>Strong consistency for content updates</p>
</li>
</ul>
<h3 id="heading-e-commerce-product-catalog">E-commerce Product Catalog</h3>
<p><strong>Recommended:</strong> Cache Aside + Refresh Ahead</p>
<p><strong>Requirements:</strong></p>
<ul>
<li><p>Cache Aside for flexibility with varying access patterns</p>
</li>
<li><p>Refresh Ahead for popular items to ensure availability</p>
</li>
<li><p>Handles both predictable (trending) and unpredictable (<a target="_blank" href="https://www.wikiwand.com/en/articles/Long_tail">long-tail</a>) access</p>
</li>
</ul>
<h3 id="heading-financial-trading-system">Financial Trading System</h3>
<p><strong>Recommended:</strong> Write Through</p>
<p><strong>Requirements:</strong></p>
<ul>
<li><p>Strong consistency is a must</p>
</li>
<li><p>Every transaction must be persisted immediately</p>
</li>
<li><p>Cache serves only to reduce read latency</p>
</li>
</ul>
<h3 id="heading-real-time-chat-application">Real-time Chat Application</h3>
<p><strong>Recommended:</strong> Write Back + Read Through</p>
<p><strong>Requirements:</strong></p>
<ul>
<li><p>Write Back for message sending performance</p>
</li>
<li><p>Read Through for message history</p>
</li>
<li><p>Recent messages stay in cache</p>
</li>
</ul>
<h3 id="heading-gaming-leader-boards">Gaming Leaderboards</h3>
<p><strong>Recommended:</strong> Write Back + Refresh Ahead</p>
<p><strong>Requirements:</strong></p>
<ul>
<li><p>Write Back for rapid score updates</p>
</li>
<li><p>Refresh Ahead for top players</p>
</li>
<li><p>Eventual consistency acceptable</p>
</li>
</ul>
<h3 id="heading-api-rate-limiting">API Rate Limiting</h3>
<p><strong>Recommended:</strong> Write Back</p>
<p><strong>Requirements:</strong></p>
<ul>
<li><p>Extremely high update frequency</p>
</li>
<li><p>Small data loss acceptable</p>
</li>
<li><p>Performance critical for API gateway</p>
</li>
</ul>
<p>Start simple with Cache-Aside, measure your performance, and evolve your caching strategy as your application grows. Happy caching!</p>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a target="_blank" href="https://www.prisma.io/dataguide/managing-databases/introduction-database-caching?query=&amp;page=1">Prisma</a></p>
</li>
<li><p><a target="_blank" href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3smq5msfo852zeoej5iz.jpg">ByeByteGo</a></p>
</li>
<li><p><a target="_blank" href="https://www.enjoyalgorithms.com/system-design/">Enjoy Algorithms</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[An Interactive Guide To Rate Limiting]]></title><description><![CDATA[Introduction
Rate limiting is a must-have strategy for every back-end app. It prevents one user from overusing a resource and degrading the quality of service for other users. Here are some benefits of rate limiting

It prevents resource starvation

Re...]]></description><link>https://blog.sagyamthapa.com.np/interactive-guide-to-rate-limiting</link><guid isPermaLink="true">https://blog.sagyamthapa.com.np/interactive-guide-to-rate-limiting</guid><category><![CDATA[rate-limiting]]></category><dc:creator><![CDATA[Sagyam Thapa]]></dc:creator><pubDate>Wed, 04 Jun 2025 22:19:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/a-hCmlnehyU/upload/9b95d9fd3e800c60d20595e644eb1e2d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Rate limiting is a must-have strategy for every back-end app. It prevents one user from overusing a resource and degrading the quality of service for other users. Here are some benefits of rate limiting:</p>
<ul>
<li><p>It prevents resource starvation</p>
</li>
<li><p>Reduces server hosting cost</p>
</li>
<li><p>Provides basic protection against <a target="_blank" href="https://en.wikipedia.org/wiki/Denial-of-service_attack">DDoS</a></p>
</li>
</ul>
<p>I have made four interactive apps that let you play around with common rate limiting algorithms.</p>
<h2 id="heading-token-bucket">Token bucket</h2>
<h3 id="heading-working">Working:</h3>
<ul>
<li><p>A bucket holds a fixed number of tokens</p>
</li>
<li><p>Tokens are added to the bucket at a fixed rate</p>
</li>
<li><p>When a request comes in:</p>
<ul>
<li><p>If a token is available, it’s removed from the bucket and the request is allowed.</p>
</li>
<li><p>If no tokens are available, the request is rejected or delayed.</p>
</li>
</ul>
</li>
<li><p>Allows occasional short bursts if tokens are available</p>
</li>
</ul>
<p>I have created an <a target="_blank" href="https://tools.sagyamthapa.com.np/token-bucket">app</a> that lets you play with the token bucket algorithm.</p>
<div class="hn-embed-widget" id="token-bucket"></div><p> </p>
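<p>The steps above can be sketched in Python. This version refills lazily whenever a request arrives, rather than running a background timer — a common simplification:</p>
<pre><code class="lang-python">import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # maximum tokens the bucket can hold
        self.tokens = capacity            # start full
        self.refill_rate = refill_rate    # tokens added per second
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill lazily based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1              # spend a token: allow the request
            return True
        return False                      # no tokens left: reject
</code></pre>
<p>Because the bucket starts full, a burst of up to <code>capacity</code> requests is allowed before the refill rate takes over.</p>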
<h2 id="heading-leaky-bucket">Leaky bucket</h2>
<h3 id="heading-working-1">Working</h3>
<ul>
<li><p>Think of it as a bucket leaking at a fixed rate</p>
</li>
<li><p>Incoming requests are added to the bucket</p>
</li>
<li><p>Requests are processed (or "leak") at a <strong>constant rate</strong></p>
</li>
<li><p>If the bucket is full when a new request arrives, the request is dropped</p>
</li>
<li><p>Smooths out bursts; outputs requests at a steady rate</p>
<p>  I have made an <a target="_blank" href="https://tools.sagyamthapa.com.np/leaky-bucket">app</a> that lets you play with the leaky bucket algorithm.</p>
</li>
</ul>
<div class="hn-embed-widget" id="leaky-bucket"></div><p> </p>
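<p>A minimal Python sketch of the idea, using an explicit <code>tick()</code> to make the constant-rate leak visible (a real implementation would leak on a timer):</p>
<pre><code class="lang-python">from collections import deque

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity     # max requests the bucket can queue
        self.leak_rate = leak_rate   # requests processed per tick
        self.queue = deque()

    def add(self, request):
        if len(self.queue) >= self.capacity:
            return False             # bucket full: drop the request
        self.queue.append(request)
        return True

    def tick(self):
        # "Leak" requests at a constant rate, one tick at a time
        processed = []
        for _ in range(min(self.leak_rate, len(self.queue))):
            processed.append(self.queue.popleft())
        return processed
</code></pre>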
<h2 id="heading-fixed-window-counter">Fixed window counter</h2>
<h3 id="heading-working-2">Working:</h3>
<ul>
<li><p>Time is divided into fixed size windows (e.g., 1 minute)</p>
</li>
<li><p>A counter tracks the number of requests per client/IP in the current window</p>
</li>
<li><p>If the count exceeds the limit, further requests are rejected until the next window</p>
</li>
<li><p>Simple and efficient, but allows burst traffic spike at end/start</p>
<p>  I have created an <a target="_blank" href="https://tools.sagyamthapa.com.np/fixed-window">app</a> that lets you play with the fixed window algorithm.</p>
</li>
</ul>
<div class="hn-embed-widget" id="fixed-window"></div><p> </p>
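<p>A small Python sketch of the counter. Bucketing timestamps with <code>now - (now % window)</code> gives the start of the current window, so the per-client count resets implicitly when a new window begins:</p>
<pre><code class="lang-python">class FixedWindowCounter:
    def __init__(self, limit, window_seconds):
        self.limit = limit                 # max requests per window
        self.window = window_seconds
        self.counts = {}                   # (client, window start) -> count

    def allow(self, client, now):
        window_start = now - (now % self.window)
        key = (client, window_start)
        count = self.counts.get(key, 0)
        if count >= self.limit:
            return False                   # limit reached for this window
        self.counts[key] = count + 1
        return True
</code></pre>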
<h2 id="heading-sliding-window-counter">Sliding window log</h2>
<h3 id="heading-working-3">Working:</h3>
<ul>
<li><p>Keeps a timestamped log of each request</p>
</li>
<li><p>When a request comes in, logs are checked to count how many requests were made in the last <code>X</code> seconds</p>
</li>
<li><p>If under the limit, the request is allowed and logged; otherwise, it’s rejected</p>
<p>  I have created an <a target="_blank" href="https://tools.sagyamthapa.com.np/sliding-window">app</a> that lets you play with the sliding window algorithm.</p>
</li>
</ul>
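<p>The log-based approach can be sketched in Python with a <code>deque</code> of timestamps; entries older than the window are evicted before each decision:</p>
<pre><code class="lang-python">from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()                 # timestamps of allowed requests

    def allow(self, now):
        # Evict log entries older than the window
        while self.log and now - self.log[0] >= self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False                   # over the limit: reject
        self.log.append(now)               # under the limit: allow and log
        return True
</code></pre>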
<div class="hn-embed-widget" id="sliding-window"></div>]]></content:encoded></item><item><title><![CDATA[Instrument your NodeJS App With OpenTelemetry]]></title><description><![CDATA[Introduction
Have you ever had a bug that occurred in production with no idea what went wrong because your logs won’t tell you, or a request that takes unusually long to process.
Sometimes debugging these issues without a...]]></description><link>https://blog.sagyamthapa.com.np/distributed-tracing-with-opentelemetry-and-jaeger-for-nest-application</link><guid isPermaLink="true">https://blog.sagyamthapa.com.np/distributed-tracing-with-opentelemetry-and-jaeger-for-nest-application</guid><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Node.js]]></category><category><![CDATA[jaeger]]></category><dc:creator><![CDATA[Sagyam Thapa]]></dc:creator><pubDate>Tue, 03 Jun 2025 18:15:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732552404401/b1da1bec-c093-48e8-bedd-07f8b47d1d5a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-introduction">Introduction</h3>
<p>Have you ever had a bug occur in production with no idea what went wrong because your logs won’t tell you, or a request that takes unusually long to process?</p>
<p>Sometimes debugging these issues without a tracing system is impossible. A tracing system is like a CCTV camera that captures everything: <em>what happened, when it happened, the order of events, and how long each event took</em>. This information is vital for debugging and identifying performance bottlenecks in complex distributed applications.</p>
<h3 id="heading-prerequisite">Prerequisite</h3>
<ul>
<li><p>NodeJS</p>
</li>
<li><p>Typescript</p>
</li>
<li><p>NestJS</p>
</li>
<li><p>Docker</p>
</li>
</ul>
<h3 id="heading-terminology">Terminology</h3>
<ul>
<li><p><strong>Trace</strong>: A trace is like a complete journey map of a single request as it moves through your entire distributed system. Imagine it as a detailed travel log that follows a request from its starting point to its final destination, capturing every stop and interaction along the way.</p>
<p>  <img src="https://www.jaegertracing.io/img/spans-traces.png" alt="https://www.jaegertracing.io/img/spans-traces.png" /></p>
</li>
<li><p><strong>Instrumentation</strong>: The process of adding code to your application to collect telemetry data. It's like installing GPS trackers in different parts of your system.</p>
</li>
<li><p><strong>Exporter</strong>: A component responsible for sending collected trace data to a back-end system for storage and analysis. Think of it as a postal service that sends your travel logs to a central archive.</p>
</li>
<li><p><strong>Span:</strong></p>
<ul>
<li><p>Root span: The first span in a trace, marking the beginning of the entire request journey. It's like the starting point of your travel log.</p>
</li>
<li><p>Child span: A span that is nested within another span, representing a more specific operation within a broader process.</p>
</li>
</ul>
</li>
<li><p><strong>Context propagation:</strong> The mechanism of transferring trace information between different services and components. It's like passing a traveler's passport that contains their complete journey details.</p>
</li>
<li><p><strong>Metrics</strong>: Metrics are numerical data that tell us about the app’s performance, health, and behavior.</p>
</li>
<li><p><strong>Logs</strong>: Logs are text entries describing usage patterns, activities, and operations within your application.</p>
</li>
</ul>
<h3 id="heading-three-horsemen-of-observability">Three horsemen of observability</h3>
<p>Observability lets you understand a system from the outside by letting you ask questions about that system without knowing its inner workings. It allows you to easily troubleshoot and handle novel problems, that is, <strong>“unknown unknowns”</strong>. It also answers the question <strong>“Why is this happening?”</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732559314516/82a38ba4-fd0a-4767-a55f-45903d918b47.png" alt="Venn diagram showing realtion between Tracing, Metrics and Logging" class="image--center mx-auto" /></p>
<h3 id="heading-setting-up-the-project">Setting up the project</h3>
<pre><code class="lang-bash">pnpm i -g @nestjs/cli
nest new tracing-app
<span class="hljs-built_in">cd</span> tracing-app
</code></pre>
<h3 id="heading-installing-dependencies"><strong>Installing dependencies</strong></h3>
<p>Install Jaeger and OpenTelemetry related libraries:</p>
<pre><code class="lang-bash">pnpm install @opentelemetry/sdk-trace-node @opentelemetry/resources @opentelemetry/sdk-trace-base 
pnpm install @opentelemetry/instrumentation @prisma/instrumentation @opentelemetry/instrumentation-net @opentelemetry/instrumentation-http @opentelemetry/instrumentation-express
pnpm install @opentelemetry/exporter-trace-otlp-http
pnpm install @opentelemetry/api @opentelemetry/semantic-conventions
</code></pre>
<p>Install Prisma ORM and SQLite:</p>
<pre><code class="lang-bash">pnpm install @prisma/client sqlite3 class-validator
pnpm install prisma --save-dev
pnpm install --save @nestjs/swagger
</code></pre>
<p>Initialize Prisma:</p>
<pre><code class="lang-bash">npx prisma init
</code></pre>
<p>This will create a <code>prisma</code> directory with a <code>schema.prisma</code> file.</p>
<pre><code class="lang-sql">datasource db {
  provider = "sqlite"
  url      = "file:./dev.db"
}

generator client {
  provider = "prisma-client-js"
}

model User {
  id    Int     @id @default(autoincrement())
  name  String
  email String  @unique
}
</code></pre>
<p>Run Prisma migrations:</p>
<pre><code class="lang-bash">npx prisma migrate dev --name init
</code></pre>
<p>Generate Prisma Client:</p>
<pre><code class="lang-bash">npx prisma generate
</code></pre>
<h3 id="heading-setup-a-crud-endpoint"><strong>Setup a CRUD endpoint</strong></h3>
<p>Generate a CRUD module for users:</p>
<pre><code class="lang-bash">pnpm nest generate resource users
</code></pre>
<p>This will create a <code>users</code> module with a controller, service, and DTOs.</p>
<p>Create a <code>prisma.service.ts</code> file in the <code>prisma</code> folder:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { Injectable, OnModuleInit, OnModuleDestroy } <span class="hljs-keyword">from</span> <span class="hljs-string">'@nestjs/common'</span>;
<span class="hljs-keyword">import</span> { PrismaClient } <span class="hljs-keyword">from</span> <span class="hljs-string">'@prisma/client'</span>;

<span class="hljs-meta">@Injectable</span>()
<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> PrismaService <span class="hljs-keyword">extends</span> PrismaClient <span class="hljs-keyword">implements</span> OnModuleInit, OnModuleDestroy {
  <span class="hljs-keyword">async</span> onModuleInit() {
    <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.$connect();
  }

  <span class="hljs-keyword">async</span> onModuleDestroy() {
    <span class="hljs-keyword">await</span> <span class="hljs-built_in">this</span>.$disconnect();
  }
}
</code></pre>
<p>Update the <code>users.module.ts</code> file to include the <code>PrismaService</code>:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { Module } <span class="hljs-keyword">from</span> <span class="hljs-string">'@nestjs/common'</span>;
<span class="hljs-keyword">import</span> { UsersService } <span class="hljs-keyword">from</span> <span class="hljs-string">'./users.service'</span>;
<span class="hljs-keyword">import</span> { UsersController } <span class="hljs-keyword">from</span> <span class="hljs-string">'./users.controller'</span>;
<span class="hljs-keyword">import</span> { PrismaService } <span class="hljs-keyword">from</span> <span class="hljs-string">'../../prisma/prisma.service'</span>;

<span class="hljs-meta">@Module</span>({
  controllers: [UsersController],
  providers: [UsersService, PrismaService],
})
<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> UsersModule { }
</code></pre>
<p>Create a file named <code>create-user.dto.ts</code> in the <code>users/dto</code> directory:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { IsEmail, IsNotEmpty, IsString } <span class="hljs-keyword">from</span> <span class="hljs-string">'class-validator'</span>;
<span class="hljs-keyword">import</span> { ApiProperty } <span class="hljs-keyword">from</span> <span class="hljs-string">'@nestjs/swagger'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> CreateUserDto {
    <span class="hljs-meta">@ApiProperty</span>({
        description: <span class="hljs-string">'The name of the user'</span>,
        example: <span class="hljs-string">'John Doe'</span>,
    })
    <span class="hljs-meta">@IsNotEmpty</span>()
    <span class="hljs-meta">@IsString</span>()
    name: <span class="hljs-built_in">string</span>;

    <span class="hljs-meta">@ApiProperty</span>({
        description: <span class="hljs-string">'The email of the user'</span>,
        example: <span class="hljs-string">'email@domain.com'</span>,
    })
    <span class="hljs-meta">@IsNotEmpty</span>()
    <span class="hljs-meta">@IsEmail</span>()
    email: <span class="hljs-built_in">string</span>;
}
</code></pre>
<p>The <code>nest generate resource</code> command already created an <code>update-user.dto.ts</code> defining <code>UpdateUserDto</code> as a <code>PartialType</code> of <code>CreateUserDto</code>, which is what the service and controller import.</p>
<p>Update the <code>users.service.ts</code> file to use Prisma:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { Injectable } <span class="hljs-keyword">from</span> <span class="hljs-string">'@nestjs/common'</span>;
<span class="hljs-keyword">import</span> { PrismaService } <span class="hljs-keyword">from</span> <span class="hljs-string">'../prisma/prisma.service'</span>;
<span class="hljs-keyword">import</span> { CreateUserDto } <span class="hljs-keyword">from</span> <span class="hljs-string">'./dto/create-user.dto'</span>;
<span class="hljs-keyword">import</span> { UpdateUserDto } <span class="hljs-keyword">from</span> <span class="hljs-string">'./dto/update-user.dto'</span>;

<span class="hljs-meta">@Injectable</span>()
<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> UsersService {
  <span class="hljs-keyword">constructor</span>(<span class="hljs-params"><span class="hljs-keyword">private</span> prisma: PrismaService</span>) {}

  create(createUserDto: CreateUserDto) {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.prisma.user.create({
      data: createUserDto,
    });
  }

  findAll() {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.prisma.user.findMany();
  }

  findOne(id: <span class="hljs-built_in">number</span>) {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.prisma.user.findUnique({
      where: { id },
    });
  }

  update(id: <span class="hljs-built_in">number</span>, updateUserDto: UpdateUserDto) {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.prisma.user.update({
      where: { id },
      data: updateUserDto,
    });
  }

  remove(id: <span class="hljs-built_in">number</span>) {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.prisma.user.delete({
      where: { id },
    });
  }
}
</code></pre>
<p><strong>Update the</strong> <code>users.controller.ts</code> file:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { Controller, Get, Post, Body, Patch, Param, Delete } <span class="hljs-keyword">from</span> <span class="hljs-string">'@nestjs/common'</span>;
<span class="hljs-keyword">import</span> { UsersService } <span class="hljs-keyword">from</span> <span class="hljs-string">'./users.service'</span>;
<span class="hljs-keyword">import</span> { CreateUserDto } <span class="hljs-keyword">from</span> <span class="hljs-string">'./dto/create-user.dto'</span>;
<span class="hljs-keyword">import</span> { UpdateUserDto } <span class="hljs-keyword">from</span> <span class="hljs-string">'./dto/update-user.dto'</span>;
<span class="hljs-keyword">import</span> { ApiGoneResponse, ApiNotFoundResponse, ApiOkResponse, ApiOperation, ApiParam, ApiTags } <span class="hljs-keyword">from</span> <span class="hljs-string">'@nestjs/swagger'</span>;

<span class="hljs-meta">@ApiTags</span>(<span class="hljs-string">'users'</span>)
<span class="hljs-meta">@Controller</span>(<span class="hljs-string">'users'</span>)
<span class="hljs-keyword">export</span> <span class="hljs-keyword">class</span> UsersController {
  <span class="hljs-keyword">constructor</span>(<span class="hljs-params"><span class="hljs-keyword">private</span> <span class="hljs-keyword">readonly</span> usersService: UsersService</span>) { }

  <span class="hljs-meta">@ApiOperation</span>({ summary: <span class="hljs-string">'Create user'</span> })
  <span class="hljs-meta">@ApiOkResponse</span>({ description: <span class="hljs-string">'User created'</span> })
  <span class="hljs-meta">@Post</span>()
  create(<span class="hljs-meta">@Body</span>() createUserDto: CreateUserDto) {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.usersService.create(createUserDto);
  }

  <span class="hljs-meta">@ApiOperation</span>({ summary: <span class="hljs-string">'Get all users'</span> })
  <span class="hljs-meta">@ApiOkResponse</span>({ description: <span class="hljs-string">'Users found'</span> })
  <span class="hljs-meta">@Get</span>()
  findAll() {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.usersService.findAll();
  }

  <span class="hljs-meta">@ApiOperation</span>({ summary: <span class="hljs-string">'Get user by id'</span> })
  <span class="hljs-meta">@ApiOkResponse</span>({ description: <span class="hljs-string">'User found'</span> })
  <span class="hljs-meta">@ApiNotFoundResponse</span>({ description: <span class="hljs-string">'User not found'</span> })
  <span class="hljs-meta">@ApiParam</span>({ name: <span class="hljs-string">'id'</span>, description: <span class="hljs-string">'User id'</span> })
  <span class="hljs-meta">@Get</span>(<span class="hljs-string">':id'</span>)
  findOne(<span class="hljs-meta">@Param</span>(<span class="hljs-string">'id'</span>) id: <span class="hljs-built_in">string</span>) {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.usersService.findOne(+id);
  }

  <span class="hljs-meta">@ApiOperation</span>({ summary: <span class="hljs-string">'Update user'</span> })
  <span class="hljs-meta">@ApiOkResponse</span>({ description: <span class="hljs-string">'User updated'</span> })
  <span class="hljs-meta">@ApiNotFoundResponse</span>({ description: <span class="hljs-string">'User not found'</span> })
  <span class="hljs-meta">@ApiParam</span>({ name: <span class="hljs-string">'id'</span>, description: <span class="hljs-string">'User id'</span> })
  <span class="hljs-meta">@Patch</span>(<span class="hljs-string">':id'</span>)
  update(<span class="hljs-meta">@Param</span>(<span class="hljs-string">'id'</span>) id: <span class="hljs-built_in">string</span>, <span class="hljs-meta">@Body</span>() updateUserDto: UpdateUserDto) {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.usersService.update(+id, updateUserDto);
  }

  <span class="hljs-meta">@ApiOperation</span>({ summary: <span class="hljs-string">'Delete user'</span> })
  <span class="hljs-meta">@ApiGoneResponse</span>({ description: <span class="hljs-string">'User deleted'</span> })
  <span class="hljs-meta">@ApiParam</span>({ name: <span class="hljs-string">'id'</span>, description: <span class="hljs-string">'User id'</span> })
  <span class="hljs-meta">@Delete</span>(<span class="hljs-string">':id'</span>)
  remove(<span class="hljs-meta">@Param</span>(<span class="hljs-string">'id'</span>) id: <span class="hljs-built_in">string</span>) {
    <span class="hljs-keyword">return</span> <span class="hljs-built_in">this</span>.usersService.remove(+id);
  }
}
</code></pre>
<h3 id="heading-configuring-exporters"><strong>Configuring exporters</strong></h3>
<p>Create a file <strong><em>tracing.ts</em></strong> in your <strong><em>src</em></strong> directory:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/semantic-conventions'</span>;
<span class="hljs-keyword">import</span> { BatchSpanProcessor } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/sdk-trace-base'</span>;
<span class="hljs-keyword">import</span> { ExpressInstrumentation } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/instrumentation-express'</span>;
<span class="hljs-keyword">import</span> { HttpInstrumentation } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/instrumentation-http'</span>;
<span class="hljs-keyword">import</span> { NetInstrumentation } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/instrumentation-net'</span>;
<span class="hljs-keyword">import</span> { NodeTracerProvider } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/sdk-trace-node'</span>;
<span class="hljs-keyword">import</span> { OTLPTraceExporter } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/exporter-trace-otlp-http'</span>;
<span class="hljs-keyword">import</span> { PrismaInstrumentation } <span class="hljs-keyword">from</span> <span class="hljs-string">'@prisma/instrumentation'</span>;
<span class="hljs-keyword">import</span> { Resource } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/resources'</span>;
<span class="hljs-keyword">import</span> { diag, DiagConsoleLogger, DiagLogLevel } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/api'</span>;
<span class="hljs-keyword">import</span> { registerInstrumentations } <span class="hljs-keyword">from</span> <span class="hljs-string">'@opentelemetry/instrumentation'</span>;

<span class="hljs-keyword">export</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">setupTracing</span>(<span class="hljs-params"></span>) </span>{
    <span class="hljs-comment">// Enable OpenTelemetry diagnostic logging</span>
    diag.setLogger(<span class="hljs-keyword">new</span> DiagConsoleLogger(), DiagLogLevel.INFO);

    <span class="hljs-comment">// Create a resource with service information</span>
    <span class="hljs-keyword">const</span> resource = <span class="hljs-keyword">new</span> Resource({
        [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || <span class="hljs-string">'tracer-app'</span>,
        [ATTR_SERVICE_VERSION]: process.env.npm_package_version || <span class="hljs-string">'1.0.0'</span>,
    });

    <span class="hljs-keyword">const</span> otlpExporter = <span class="hljs-keyword">new</span> OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || <span class="hljs-string">'http://localhost:4318/v1/traces'</span>,
    });


    <span class="hljs-comment">// Create tracer provider with resource and span processors</span>
    <span class="hljs-keyword">const</span> provider = <span class="hljs-keyword">new</span> NodeTracerProvider({
        resource,
        spanProcessors: [
            <span class="hljs-keyword">new</span> BatchSpanProcessor(otlpExporter, {
                maxQueueSize: <span class="hljs-number">100</span>,
                scheduledDelayMillis: <span class="hljs-number">5000</span>,
                exportTimeoutMillis: <span class="hljs-number">30000</span>,
                maxExportBatchSize: <span class="hljs-number">50</span>,
            })
        ]
    });

    <span class="hljs-comment">// Register instrumentations with more comprehensive coverage</span>
    registerInstrumentations({
        tracerProvider: provider,
        instrumentations: [
            <span class="hljs-keyword">new</span> HttpInstrumentation({
                requestHook: <span class="hljs-function">(<span class="hljs-params">span, request</span>) =&gt;</span> {
                    span.setAttribute(<span class="hljs-string">'http.request.method'</span>, request.method);
                },
            }),
            <span class="hljs-keyword">new</span> NetInstrumentation(),
            <span class="hljs-keyword">new</span> ExpressInstrumentation(),
            <span class="hljs-keyword">new</span> PrismaInstrumentation({ middleware: <span class="hljs-literal">true</span> }),
        ],
    });

    <span class="hljs-comment">// Register the provider</span>
    provider.register();

    <span class="hljs-comment">// Return the provider for potential manual instrumentation</span>
    <span class="hljs-keyword">return</span> provider;
}

<span class="hljs-comment">// Call this at application startup</span>
setupTracing();
</code></pre>
<h3 id="heading-explanation"><strong>Explanation</strong></h3>
<ul>
<li><p><strong>Diagnostic Logging:</strong> Enables diagnostic logging using a console logger at the <code>INFO</code> level to debug tracing setup.</p>
</li>
<li><p><strong>Resource Initialization:</strong></p>
<ul>
<li><p>Defines metadata about the service, like <code>SERVICE_NAME</code> and <code>SERVICE_VERSION</code>.</p>
</li>
<li><p>This metadata is attached to every trace and helps identify which service the trace belongs to.</p>
</li>
</ul>
</li>
<li><p><strong>OTLP Trace Exporter:</strong> Configures the OpenTelemetry Protocol (<strong>OTLP</strong>) exporter to send trace data over <strong><em>HTTP</em></strong> to the backend, which for now is Jaeger. Note that Jaeger can be swapped for another backend such as Honeycomb or Zipkin, and the transport can be switched from <strong>HTTP</strong> to the more efficient <strong>gRPC</strong>.</p>
</li>
<li><p><strong>Tracer Provider with Span Processor:</strong> Creates a <code>NodeTracerProvider</code>, which manages tracers and spans:</p>
<ul>
<li><p><strong>Resource</strong>: Includes service metadata.</p>
</li>
<li><p><strong>BatchSpanProcessor</strong>: Buffers spans and exports them in batches to minimize performance impact. Key configurations:</p>
<ul>
<li><p><code>maxQueueSize</code>: Maximum number of spans held in the queue; spans arriving beyond this limit are dropped.</p>
</li>
<li><p><code>scheduledDelayMillis</code>: Interval between scheduled batch exports.</p>
</li>
<li><p><code>exportTimeoutMillis</code>: Max time allowed for export.</p>
</li>
<li><p><code>maxExportBatchSize</code>: Maximum spans per export batch.</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Register Instrumentation</strong>: Automatically captures traces for libraries and frameworks:</p>
<ul>
<li><p><code>HttpInstrumentation</code>: Captures time taken by HTTP requests/responses.</p>
</li>
<li><p><code>NetInstrumentation</code>: Captures time taken by low-level networking events.</p>
</li>
<li><p><code>ExpressInstrumentation</code>: Tracks time taken by Express middleware and routes.</p>
</li>
<li><p><code>PrismaInstrumentation</code>: Tracks time taken by SQL queries generated by Prisma</p>
</li>
</ul>
</li>
</ul>
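<p>To make the <code>BatchSpanProcessor</code> settings concrete, here is a tiny dependency-free simulation of the queueing behavior. <code>FakeBatcher</code> is a made-up name for illustration, not an OpenTelemetry API:</p>

```typescript
// Hypothetical mini-batcher mimicking BatchSpanProcessor's queue semantics.
class FakeBatcher {
  private queue: string[] = [];
  exported: string[][] = [];

  constructor(
    private maxQueueSize: number,      // spans beyond this are dropped
    private maxExportBatchSize: number // spans sent per export call
  ) {}

  // Called when a span ends; mirrors the processor's onEnd hook.
  onEnd(span: string): void {
    if (this.queue.length >= this.maxQueueSize) return; // queue full: drop the span
    this.queue.push(span);
  }

  // The real processor runs this every scheduledDelayMillis.
  flush(): void {
    while (this.queue.length > 0) {
      this.exported.push(this.queue.splice(0, this.maxExportBatchSize));
    }
  }
}

const batcher = new FakeBatcher(100, 50);
for (let i = 0; i !== 120; i++) batcher.onEnd(`span-${i}`);
batcher.flush();
console.log(batcher.exported.length); // 2 batches of 50: 100 queued, 20 dropped
```

<p>With these numbers a burst above 100 spans silently loses data, which is why these defaults are worth tuning for high-traffic services.</p>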
<h3 id="heading-inject-instrumenting-code-in-your-application"><strong>Inject instrumenting code in your application</strong></h3>
<p>Import and initialize the tracing configuration in your main application file <strong><em>main.ts</em></strong>:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { NestFactory } <span class="hljs-keyword">from</span> <span class="hljs-string">'@nestjs/core'</span>;
<span class="hljs-keyword">import</span> { SwaggerModule, DocumentBuilder } <span class="hljs-keyword">from</span> <span class="hljs-string">'@nestjs/swagger'</span>;
<span class="hljs-keyword">import</span> { AppModule } <span class="hljs-keyword">from</span> <span class="hljs-string">'./app.module'</span>;
<span class="hljs-keyword">import</span> <span class="hljs-string">'./tracing'</span>;

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">bootstrap</span>(<span class="hljs-params"></span>) </span>{
  <span class="hljs-keyword">const</span> app = <span class="hljs-keyword">await</span> NestFactory.create(AppModule);

  <span class="hljs-keyword">const</span> config = <span class="hljs-keyword">new</span> DocumentBuilder()
    .setTitle(<span class="hljs-string">'Tracing example'</span>)
    .setDescription(<span class="hljs-string">'The tracing API description'</span>)
    .setVersion(<span class="hljs-string">'1.0'</span>)
    .addTag(<span class="hljs-string">'tracing'</span>)
    .build();
  <span class="hljs-keyword">const</span> documentFactory = <span class="hljs-function">() =&gt;</span> SwaggerModule.createDocument(app, config);
  SwaggerModule.setup(<span class="hljs-string">'api-docs'</span>, app, documentFactory);

  <span class="hljs-keyword">await</span> app.listen(process.env.PORT ?? <span class="hljs-number">3000</span>);
}
bootstrap();
</code></pre>
<h3 id="heading-run-your-application">Run Your Application</h3>
<pre><code class="lang-typescript">pnpm run start:dev
</code></pre>
<h3 id="heading-setting-jaeger-for-development-environment"><strong>Setting Jaeger for development environment</strong></h3>
<p>Easiest way to setup Jaeger is with docker-compose winch will work fine a development environment.</p>
<p>Create <code>docker-compose.yaml</code> file:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">services:</span>
  <span class="hljs-attr">jaeger:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">jaegertracing/all-in-one:1.63.0</span>
    <span class="hljs-attr">container_name:</span> <span class="hljs-string">jaeger</span>
    <span class="hljs-attr">environment:</span>
      <span class="hljs-attr">COLLECTOR_OTLP_ENABLED:</span> <span class="hljs-string">"true"</span>
    <span class="hljs-attr">ports:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"4317:4317"</span> <span class="hljs-comment"># For Jaeger-GRPC</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"4318:4318"</span> <span class="hljs-comment"># For Jaeger-HTTP</span>
      <span class="hljs-bullet">-</span> <span class="hljs-string">"16686:16686"</span> <span class="hljs-comment"># # Web UI</span>

<span class="hljs-attr">networks:</span>
  <span class="hljs-attr">default:</span>
    <span class="hljs-attr">driver:</span> <span class="hljs-string">bridge</span>
</code></pre>
<h3 id="heading-containerize-app-optional"><strong>Containerize app (optional)</strong></h3>
<p>You can use the <a target="_blank" href="https://docs.docker.com/reference/cli/docker/init/"><strong>docker init</strong></a> command to automatically generate an optimized Dockerfile if you have a newer version of Docker installed.</p>
<pre><code class="lang-dockerfile"><span class="hljs-comment"># Arguments for versions</span>
<span class="hljs-keyword">ARG</span> NODE_VERSION=<span class="hljs-number">20.18</span>.<span class="hljs-number">0</span>
<span class="hljs-keyword">ARG</span> PNPM_VERSION=<span class="hljs-number">9.12</span>.<span class="hljs-number">2</span>
<span class="hljs-keyword">ARG</span> ALPINE_VERSION=<span class="hljs-number">3.20</span>

<span class="hljs-comment">################################################################################</span>
<span class="hljs-comment"># Base stage: Build the application</span>
<span class="hljs-keyword">FROM</span> node:${NODE_VERSION}-alpine${ALPINE_VERSION} AS builder

<span class="hljs-comment"># Set working directory</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /usr/src/app</span>

<span class="hljs-comment"># Install pnpm globally with cache</span>
<span class="hljs-keyword">RUN</span><span class="bash"> --mount=<span class="hljs-built_in">type</span>=cache,target=/root/.npm \
    npm install -g pnpm@<span class="hljs-variable">${PNPM_VERSION}</span></span>

<span class="hljs-comment"># Copy package.json and pnpm-lock.yaml to install dependencies</span>
<span class="hljs-keyword">COPY</span><span class="bash"> ../package.json pnpm-lock.yaml ./</span>

<span class="hljs-comment"># Install dependencies with cache</span>
<span class="hljs-keyword">RUN</span><span class="bash"> --mount=<span class="hljs-built_in">type</span>=cache,target=/root/.pnpm-store \
    pnpm install --frozen-lockfile</span>

<span class="hljs-comment"># Copy the all application code</span>
<span class="hljs-keyword">COPY</span><span class="bash"> .. .</span>

<span class="hljs-comment"># Setup prisma</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pnpm prisma generate</span>

<span class="hljs-comment"># Build the application</span>
<span class="hljs-keyword">RUN</span><span class="bash"> pnpm run build</span>

<span class="hljs-comment"># Runner Stage</span>
<span class="hljs-keyword">FROM</span> node:${NODE_VERSION}-alpine${ALPINE_VERSION} AS runner

<span class="hljs-comment"># Set working directory</span>
<span class="hljs-keyword">WORKDIR</span><span class="bash"> /usr/src/app</span>

<span class="hljs-comment"># Copy the built application from the builder stage</span>
<span class="hljs-keyword">COPY</span><span class="bash"> --from=builder /usr/src/app/dist ./dist</span>
<span class="hljs-keyword">COPY</span><span class="bash"> ../package.json pnpm-lock.yaml ./</span>
<span class="hljs-keyword">COPY</span><span class="bash"> ../prisma/schema.prisma ./prisma/schema.prisma</span>

<span class="hljs-comment"># Install pnpm globally</span>
<span class="hljs-keyword">RUN</span><span class="bash"> --mount=<span class="hljs-built_in">type</span>=cache,target=/root/.npm \
    npm install -g pnpm@<span class="hljs-variable">${PNPM_VERSION}</span></span>

<span class="hljs-comment"># Install dependencies with cache</span>
<span class="hljs-keyword">RUN</span><span class="bash"> --mount=<span class="hljs-built_in">type</span>=cache,target=/root/.pnpm-store \
    pnpm install --frozen-lockfile --prod</span>

<span class="hljs-comment"># Set NODE_ENV to production</span>
<span class="hljs-keyword">ENV</span> NODE_ENV=production

<span class="hljs-comment"># Run the application</span>
<span class="hljs-keyword">CMD</span><span class="bash"> [<span class="hljs-string">"pnpm"</span>, <span class="hljs-string">"run"</span>, <span class="hljs-string">"start:prod"</span>]</span>
</code></pre>
<h3 id="heading-swagger-ui"><strong>Swagger UI</strong></h3>
<p>Visit <a target="_blank" href="http://localhost:3000/api-docs"><code>http://localhost:3000/api-docs</code></a> and make some API calls.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732656370443/c7db3296-da38-44f2-8724-472032225594.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-visualizing-traces"><strong>Visualizing traces</strong></h3>
<p>Open your browser and go to <a target="_blank" href="http://localhost:16686"><code>http://localhost:16686</code></a> to see the Jaeger UI. Run some requests, click <strong>Find Traces</strong>, then click on a trace.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732656011574/c85346cc-15f4-4f59-854b-afe3c5dd330e.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732655235246/92991171-f32d-4e18-8ccd-4646f08dd44f.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1732655291696/f22f8c63-81cb-4ca6-9b14-4fffc3f67b0f.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Scaling PostgreSQL with Kubernetes]]></title><description><![CDATA[A case for vertical scaling
If you have read any article or a book on system design then you probably know what vertical and horizontal scaling is and benefits of horizontal scaling. Before I explain how to setup proper horizontal scaling with Postgr...]]></description><link>https://blog.sagyamthapa.com.np/scaling-postgresql-with-kubernetes</link><guid isPermaLink="true">https://blog.sagyamthapa.com.np/scaling-postgresql-with-kubernetes</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[PostgreSQL]]></category><category><![CDATA[Distributed Database]]></category><category><![CDATA[high availability]]></category><dc:creator><![CDATA[Sagyam Thapa]]></dc:creator><pubDate>Sun, 25 May 2025 20:09:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/YXwt-vJ3szA/upload/a96f88a74a8f84457f1a3af7b94373c9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-a-case-for-vertical-scaling">A case for vertical scaling</h2>
<p>If you have read any article or book on system design, you probably know what vertical and horizontal scaling are and the benefits of horizontal scaling. Before I explain how to set up proper horizontal scaling with Postgres, let me make a case for when you should not try this.</p>
<ol>
<li><p>Simplicity: A single-node database runs out of the box, although I recommend running <a target="_blank" href="https://pgtune.leopard.in.ua/">PGTune</a> for a quick preset or visiting <a target="_blank" href="http://postgresqlco.nf">postgresqlco.nf</a> for a full breakdown.</p>
</li>
<li><p>Easier backup and recovery: No need to think about state across replicas when creating or restoring a backup.</p>
</li>
<li><p>No network overhead, especially with write-heavy operations.</p>
</li>
<li><p>A temporary fix: If you need a fix right now, scaling up a single node provides instant relief.</p>
</li>
</ol>
<h2 id="heading-prerequisite">Prerequisite</h2>
<p>Make sure you have the following tools installed.</p>
<ul>
<li><p><a target="_blank" href="https://kubernetes.io/docs/reference/kubectl/">kubectl</a></p>
</li>
<li><p><a target="_blank" href="https://helm.sh/">helm</a></p>
</li>
<li><p><a target="_blank" href="https://minikube.sigs.k8s.io/docs/start/?arch=%2Flinux%2Fx86-64%2Fstable%2Fbinary+download">minikube</a></p>
</li>
<li><p><a target="_blank" href="https://k9scli.io/">k9s</a></p>
</li>
</ul>
<p>Following this guide requires a basic understanding of Kubernetes, <a target="_blank" href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">CRDs</a>, and Helm. Nothing deep; a quick AI summary will suffice.</p>
<h2 id="heading-replication">Replication</h2>
<p>Replication means keeping multiple copies of data on multiple machines connected via network. Here is why you might want to do that:</p>
<ul>
<li><p>It keeps your data close to your users.</p>
</li>
<li><p>It acts as a hot backup if a node goes down.</p>
</li>
<li><p>It helps with scaling if most of your workload is read operations (which is the case for most <a target="_blank" href="https://www.wikiwand.com/en/articles/Online_transaction_processing">OLTP</a> systems)</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747838067614/bce32bd8-7609-4918-843d-72e7626c4cf8.png" alt="Diagram depicting a database architecture with a leader and two followers. The leader handles create, delete, and update queries, while followers handle read queries. Data synchronization is done through WAL sync. User queries are directed through a pg-pool component." /></p>
<blockquote>
<p>Here <a target="_blank" href="https://www.pgpool.net/docs/46/en/html/intro-whatis.html">pg-pool</a> acts as load balancer, it distributes read request evenly among followers and mutation request to the leader. Notice that Leader periodically syncs it <a target="_blank" href="https://www.wikiwand.com/en/articles/Write-ahead_logging">WAL</a> with it’s followers.</p>
</blockquote>
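<p>The read/write split that pg-pool performs can be sketched as a simple router: mutations always go to the leader, reads rotate across the followers. This is an illustrative sketch under those assumptions, not pg-pool's actual algorithm:</p>

```typescript
// Illustrative query router in the spirit of pg-pool's load balancing.
type DbNode = { name: string };

class QueryRouter {
  private next = 0;

  constructor(private leader: DbNode, private followers: DbNode[]) {}

  route(sql: string): DbNode {
    const isRead = /^\s*select\b/i.test(sql);
    if (!isRead || this.followers.length === 0) return this.leader; // mutations hit the leader
    const target = this.followers[this.next % this.followers.length]; // round-robin reads
    this.next += 1;
    return target;
  }
}

const router = new QueryRouter({ name: 'leader' }, [{ name: 'follower-1' }, { name: 'follower-2' }]);
console.log(router.route('SELECT * FROM users').name);        // follower-1
console.log(router.route('UPDATE users SET name = $1').name); // leader
console.log(router.route('select count(*) from users').name); // follower-2
```

<p>Real pg-pool also inspects transactions and session state; inside a transaction even SELECTs may be pinned to the leader to guarantee read-your-writes consistency.</p>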
<h3 id="heading-setup-stackgres-and-enable-load-balancer">Setup StackGres and enable load balancer</h3>
<pre><code class="lang-bash">minikube addons <span class="hljs-built_in">enable</span> metallb
minikube tunnel
</code></pre>
<pre><code class="lang-bash">helm install stackgres-operator stackgres-charts/stackgres-operator \
    --namespace stackgres-operator \
    --create-namespace
</code></pre>
<h3 id="heading-define-crd-for-replicated-cluster">Define CRD for replicated cluster</h3>
<pre><code class="lang-yaml"><span class="hljs-attr">apiVersion:</span> <span class="hljs-string">stackgres.io/v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">SGCluster</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">cluster</span>

<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">instances:</span> <span class="hljs-number">3</span> <span class="hljs-comment"># 1 primary + 2 replicas</span>

  <span class="hljs-attr">postgres:</span>
    <span class="hljs-attr">version:</span> <span class="hljs-string">"15"</span>

  <span class="hljs-attr">pods:</span>
    <span class="hljs-attr">persistentVolume:</span>
      <span class="hljs-attr">size:</span> <span class="hljs-string">"1Gi"</span>

  <span class="hljs-attr">profile:</span> <span class="hljs-string">development</span>

  <span class="hljs-attr">postgresServices:</span>
    <span class="hljs-attr">primary:</span>
      <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
    <span class="hljs-attr">replicas:</span>
      <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>
</code></pre>
<h3 id="heading-apply-the-crd">Apply the CRD</h3>
<pre><code class="lang-bash">kubectl apply -f ./replication.yaml
kubectl get pods -w
</code></pre>
<h3 id="heading-get-credentials">Get credentials</h3>
<pre><code class="lang-bash">PG_PASSWORD=$(kubectl -n default get secret cluster --template <span class="hljs-string">'{{ printf "%s" (index .data "superuser-password" | base64decode) }}'</span>)
<span class="hljs-built_in">echo</span> <span class="hljs-string">"The superuser password is: <span class="hljs-variable">$PG_PASSWORD</span>"</span>
</code></pre>
<h3 id="heading-see-who-is-who">See who is who</h3>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it cluster-0  -c patroni -- patronictl list
</code></pre>
<h3 id="heading-kill-the-primary">Kill the primary</h3>
<pre><code class="lang-bash">kubectl delete pod cluster-0
</code></pre>
<h3 id="heading-see-who-is-in-charge-now">See who is in charge now</h3>
<p><a target="_blank" href="https://patroni.readthedocs.io/en/latest/">Patroni</a> should have elected a new leader by now.</p>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it cluster-1 -c patroni -- patronictl list
</code></pre>
<h3 id="heading-tell-something-only-to-the-primary">Tell something only to the primary</h3>
<pre><code class="lang-bash">PRIMARY=$(kubectl <span class="hljs-built_in">exec</span> -it cluster-1 -c patroni -- patronictl list | grep Leader | awk <span class="hljs-string">'{print $2}'</span>)
kubectl <span class="hljs-built_in">exec</span> -it <span class="hljs-variable">$PRIMARY</span> -c patroni -- psql -U postgres -c <span class="hljs-string">"CREATE TABLE replication_test_table (id SERIAL PRIMARY KEY, data TEXT);"</span>
kubectl <span class="hljs-built_in">exec</span> -it <span class="hljs-variable">$PRIMARY</span> -c patroni -- psql -U postgres -c <span class="hljs-string">"INSERT INTO replication_test_table (data) VALUES ('Spread the word about our lord savior PostgreSQL!');"</span>
</code></pre>
<h3 id="heading-primary-tell-his-followers">Primary tell his followers</h3>
<pre><code class="lang-bash">kubectl <span class="hljs-built_in">exec</span> -it cluster-0 -c patroni -- psql -U postgres -c <span class="hljs-string">"SELECT * FROM replication_test_table;"</span>
kubectl <span class="hljs-built_in">exec</span> -it cluster-1 -c patroni -- psql -U postgres -c <span class="hljs-string">"SELECT * FROM replication_test_table;"</span>
kubectl <span class="hljs-built_in">exec</span> -it cluster-2 -c patroni -- psql -U postgres -c <span class="hljs-string">"SELECT * FROM replication_test_table;"</span>
</code></pre>
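<p>You can also confirm that WAL streaming is healthy straight from Postgres’ own statistics view, reusing the <code>$PRIMARY</code> variable from the previous step. One row should appear per connected follower:</p>
<pre><code class="lang-bash">kubectl exec -it $PRIMARY -c patroni -- psql -U postgres -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"
</code></pre>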
<p>As you can see, the word spread quickly. This is possible because StackGres uses <a target="_blank" href="https://patroni.readthedocs.io/en/latest/">Patroni</a> under the hood to coordinate all the replication.</p>
<h2 id="heading-partitioning">Partitioning</h2>
<p>Partitioning splits the data (a table in our case) into smaller, more manageable parts. This is done <strong>within a single database instance</strong>. Postgres supports this out of the box: partitions are defined at the data definition (DDL) layer, and keeping multiple replicas of a partition makes it highly available. It works best for time-series data, logs, or region-based segmentation.</p>
<h3 id="heading-types-of-partitioning">Types of Partitioning</h3>
<ol>
<li><p><strong>Range Partitioning</strong> – Data is partitioned based on value ranges (e.g., date ranges).</p>
</li>
<li><p><strong>List Partitioning</strong> – Partitioning based on a list of values (e.g., regions or categories).</p>
</li>
<li><p><strong>Hash Partitioning</strong> – Data is distributed using a hash function (e.g., <code>MOD(user_id, 4)</code>).</p>
</li>
</ol>
<p>The following code creates an <code>orders</code> table and derives child tables from it using range, list, and hash partitioning in a hierarchical way: the <code>orders</code> table is split by year, each year is further split by region, and each region is finally split by hash.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747840673080/c6e6cc7b-7f90-4d07-a35a-130eb3ef1c52.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>Notice that only hash-based partitioning guarantees that all partitions are roughly the same size.</p>
</blockquote>
<h3 id="heading-setup-stackgres-and-enable-load-balancer-1">Setup StackGres and enable load balancer</h3>
<pre><code class="lang-bash">helm install stackgres-operator stackgres-charts/stackgres-operator \
    --namespace stackgres-operator \
    --create-namespace

minikube addons <span class="hljs-built_in">enable</span> metallb
minikube tunnel
</code></pre>
<h3 id="heading-get-credentials-1">Get credentials</h3>
<pre><code class="lang-bash">PG_PASSWORD=$(kubectl -n default get secret cluster --template <span class="hljs-string">'{{ printf "%s" (index .data "superuser-password" | base64decode) }}'</span>)
<span class="hljs-built_in">echo</span> <span class="hljs-string">"The superuser password is: <span class="hljs-variable">$PG_PASSWORD</span>"</span>
</code></pre>
<blockquote>
<p>Your database should now be available at <code>postgresql://postgres:&lt;password&gt;@localhost:5432</code></p>
</blockquote>
<p>Now open an SQL Editor like <a target="_blank" href="https://www.pgadmin.org/">pgAdmin</a>, and run the following.</p>
<pre><code class="lang-pgsql"><span class="hljs-comment">-- Parent table</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders (
    order_id    <span class="hljs-type">INT</span>,
    customer_id <span class="hljs-type">INT</span>,
    order_date  <span class="hljs-type">DATE</span>,
    region      <span class="hljs-type">TEXT</span>,
    amount      <span class="hljs-type">INT</span>,
    <span class="hljs-keyword">PRIMARY KEY</span> (order_id, order_date, region, customer_id)
) <span class="hljs-keyword">PARTITION BY RANGE</span> (order_date);


<span class="hljs-comment">-- Range: Year 2024</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders_2024 <span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">OF</span> orders
    <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">VALUES</span> <span class="hljs-keyword">FROM</span> (<span class="hljs-string">'2024-01-01'</span>) <span class="hljs-keyword">TO</span> (<span class="hljs-string">'2025-01-01'</span>)
    <span class="hljs-keyword">PARTITION BY LIST</span> (region);

<span class="hljs-comment">-- Range: Year 2025</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders_2025 <span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">OF</span> orders
    <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">VALUES</span> <span class="hljs-keyword">FROM</span> (<span class="hljs-string">'2025-01-01'</span>) <span class="hljs-keyword">TO</span> (<span class="hljs-string">'2026-01-01'</span>)
    <span class="hljs-keyword">PARTITION BY LIST</span> (region);

<span class="hljs-comment">-- 2024 - US region</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders_2024_us <span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">OF</span> orders_2024
    <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">VALUES</span> <span class="hljs-keyword">IN</span> (<span class="hljs-string">'US'</span>)
    <span class="hljs-keyword">PARTITION BY HASH</span> (customer_id);

<span class="hljs-comment">-- 2024 - EU region</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders_2024_eu <span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">OF</span> orders_2024
    <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">VALUES</span> <span class="hljs-keyword">IN</span> (<span class="hljs-string">'EU'</span>)
    <span class="hljs-keyword">PARTITION BY HASH</span> (customer_id);

<span class="hljs-comment">-- 2024 - US - Hash partitions</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders_2024_us_0 <span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">OF</span> orders_2024_us <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">VALUES</span> <span class="hljs-keyword">WITH</span> (MODULUS <span class="hljs-number">2</span>, REMAINDER <span class="hljs-number">0</span>);
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders_2024_us_1 <span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">OF</span> orders_2024_us <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">VALUES</span> <span class="hljs-keyword">WITH</span> (MODULUS <span class="hljs-number">2</span>, REMAINDER <span class="hljs-number">1</span>);

<span class="hljs-comment">-- 2024 - EU - Hash partitions</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders_2024_eu_0 <span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">OF</span> orders_2024_eu <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">VALUES</span> <span class="hljs-keyword">WITH</span> (MODULUS <span class="hljs-number">2</span>, REMAINDER <span class="hljs-number">0</span>);
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders_2024_eu_1 <span class="hljs-keyword">PARTITION</span> <span class="hljs-keyword">OF</span> orders_2024_eu <span class="hljs-keyword">FOR</span> <span class="hljs-keyword">VALUES</span> <span class="hljs-keyword">WITH</span> (MODULUS <span class="hljs-number">2</span>, REMAINDER <span class="hljs-number">1</span>);
</code></pre>
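<p>Postgres can print the whole partition hierarchy for you (the <code>pg_partition_tree</code> function is built in since Postgres 12, so it is available on the Postgres 15 cluster defined above):</p>
<pre><code class="lang-pgsql">-- List every partition of orders with its parent and depth
SELECT relid, parentrelid, isleaf, level
FROM pg_partition_tree('orders');
</code></pre>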
<h3 id="heading-bulk-insert-synthetic-data">Bulk insert synthetic data</h3>
<pre><code class="lang-sql"><span class="hljs-comment">-- Generate 1000 random orders</span>
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> orders (order_id, customer_id, order_date, region, amount)
<span class="hljs-keyword">SELECT</span> 
    <span class="hljs-comment">-- Generate order IDs between 1000 and 9999</span>
    <span class="hljs-number">1000</span> + <span class="hljs-keyword">floor</span>(random() * <span class="hljs-number">9000</span>)::<span class="hljs-built_in">int</span> <span class="hljs-keyword">AS</span> order_id,
    <span class="hljs-comment">-- Generate customer IDs between 1000 and 9999</span>
    <span class="hljs-number">1000</span> + <span class="hljs-keyword">floor</span>(random() * <span class="hljs-number">9000</span>)::<span class="hljs-built_in">int</span> <span class="hljs-keyword">AS</span> customer_id,    
    <span class="hljs-comment">-- Generate dates in 2024 (to fit the 2024 partition)</span>
    <span class="hljs-built_in">DATE</span> <span class="hljs-string">'2024-01-01'</span> + (<span class="hljs-keyword">floor</span>(random() * <span class="hljs-number">366</span>)::<span class="hljs-built_in">int</span> * <span class="hljs-built_in">INTERVAL</span> <span class="hljs-string">'1 day'</span>) <span class="hljs-keyword">AS</span> order_date,    
    <span class="hljs-comment">-- Randomly select region</span>
    (<span class="hljs-built_in">ARRAY</span>[<span class="hljs-string">'US'</span>, <span class="hljs-string">'EU'</span>])[<span class="hljs-number">1</span> + <span class="hljs-keyword">floor</span>(random() * <span class="hljs-number">2</span>)::<span class="hljs-built_in">int</span>] <span class="hljs-keyword">AS</span> region,    
    <span class="hljs-comment">-- Generate random amounts between 10 and 1000</span>
    <span class="hljs-number">10</span> + <span class="hljs-keyword">floor</span>(random() * <span class="hljs-number">990</span>)::<span class="hljs-built_in">int</span> <span class="hljs-keyword">AS</span> amount
<span class="hljs-keyword">FROM</span> 
    generate_series(<span class="hljs-number">1</span>, <span class="hljs-number">1000</span>) <span class="hljs-keyword">AS</span> i;            <span class="hljs-comment">-- 1k rows</span>
</code></pre>
<pre><code class="lang-sql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> orders
<span class="hljs-keyword">WHERE</span> order_date = <span class="hljs-string">'2024-06-10'</span>
  <span class="hljs-keyword">AND</span> region = <span class="hljs-string">'US'</span>;
</code></pre>
<blockquote>
<p>Querying <code>orders</code> does not require you to know which partition holds the data; Postgres routes the query to the right partition automatically.</p>
</blockquote>
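<p>You can watch partition pruning happen by prefixing the query with <code>EXPLAIN</code>; the plan should only touch the partitions that can contain matching rows:</p>
<pre><code class="lang-pgsql">EXPLAIN SELECT * FROM orders
WHERE order_date = '2024-06-10'
  AND region = 'US';
-- Expect scans of the orders_2024_us_* hash partitions only
</code></pre>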
<h2 id="heading-sharding-with-replication">Sharding with replication</h2>
<p>Sharding splits a large database into small pieces called shards. The shards are then spread across multiple machines so that the database can continue to function even if we lose a few machines. Routing of queries to the proper shard is done by a coordinator, and just like in the replication example, <code>pg-pool</code> does the load balancing within each shard.</p>
<h3 id="heading-types-of-sharding">Types of sharding</h3>
<ol>
<li><p><strong>Row based:</strong> Think of it like splitting a very thick book into many volumes <strong><em>(shards)</em></strong> and creating a new volume just to keep track of the table of contents <strong><em>(coordinator)</em></strong>. Picture a table whose schema is simple but whose row count and write volume have gone crazy. With this method both read and write operations can scale per shard as needed.</p>
</li>
<li><p><strong>Schema based:</strong> Just like last time we are still splitting the book, but this time we take a few related chapters and turn them into a book about a sub-topic. Think of how a very thick physics textbook can be split into Optics, Thermodynamics, and Quantum Mechanics. Picture a table with a large number of columns where you don’t need all the columns every time a query is made. So you split the table into shards such that related columns are placed together.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748175349647/636859c6-5d9f-47f7-ab08-2eb4f1e83ac8.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>Notice the resiliency of this architecture: not only do we have multiple replicas for each shard, but also for the coordinator. As long as we have at least 3 machines running our sharded cluster, the failure of a single machine will not bring down our database.</p>
</blockquote>
<h3 id="heading-setup-stackgres-and-enable-load-balancer-2">Setup StackGres and enable load balancer</h3>
<pre><code class="lang-bash">helm install stackgres-operator stackgres-charts/stackgres-operator \
    --namespace stackgres-operator \
    --create-namespace

minikube addons <span class="hljs-built_in">enable</span> metallb
minikube tunnel
</code></pre>
<h3 id="heading-define-crd-for-sharded-cluster">Define CRD for Sharded Cluster</h3>
<pre><code class="lang-yaml"><span class="hljs-comment"># shard.yaml</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">stackgres.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">SGShardedCluster</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">cluster</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">type:</span> <span class="hljs-string">citus</span>
  <span class="hljs-attr">database:</span> <span class="hljs-string">mydatabase</span>
  <span class="hljs-attr">postgres:</span>
    <span class="hljs-attr">version:</span> <span class="hljs-string">'latest'</span>
  <span class="hljs-attr">coordinator:</span>
    <span class="hljs-attr">instances:</span> <span class="hljs-number">2</span> <span class="hljs-comment"># Number of coordinator instances</span>
    <span class="hljs-attr">pods:</span>
      <span class="hljs-attr">persistentVolume:</span>
        <span class="hljs-attr">size:</span> <span class="hljs-string">'1Gi'</span>
  <span class="hljs-attr">shards:</span>
    <span class="hljs-attr">clusters:</span> <span class="hljs-number">3</span> <span class="hljs-comment"># Number of shards</span>
    <span class="hljs-attr">instancesPerCluster:</span> <span class="hljs-number">3</span> <span class="hljs-comment"># 1 primary and 2 replicas</span>
    <span class="hljs-attr">pods:</span>
      <span class="hljs-attr">persistentVolume:</span>
        <span class="hljs-attr">size:</span> <span class="hljs-string">'1Gi'</span>
  <span class="hljs-attr">postgresServices:</span>
    <span class="hljs-attr">coordinator:</span>
      <span class="hljs-attr">primary:</span>
        <span class="hljs-attr">type:</span> <span class="hljs-string">LoadBalancer</span>

  <span class="hljs-attr">profile:</span> <span class="hljs-string">development</span>
</code></pre>
<h3 id="heading-apply-citus-crd">Apply Citus CRD</h3>
<pre><code class="lang-bash">kubectl apply -f ./shard.yaml
</code></pre>
<h3 id="heading-get-credentials-2">Get credentials</h3>
<pre><code class="lang-bash">PG_PASSWORD=$(kubectl -n default get secret cluster --template <span class="hljs-string">'{{ printf "%s" (index .data "superuser-password" | base64decode) }}'</span>)
<span class="hljs-built_in">echo</span> <span class="hljs-string">"The superuser password is: <span class="hljs-variable">$PG_PASSWORD</span>"</span>
</code></pre>
<blockquote>
<p>Your database should now be available at <code>postgresql://postgres:&lt;password&gt;@localhost:5432</code></p>
</blockquote>
<p>Now open an SQL editor like <a target="_blank" href="https://www.pgadmin.org/">pgAdmin</a>, and run the following.</p>
<h3 id="heading-create-some-distributed-table">Create some distributed table</h3>
<pre><code class="lang-pgsql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> users (
    id <span class="hljs-type">BIGINT</span> <span class="hljs-keyword">PRIMARY KEY</span>,
    <span class="hljs-type">name</span> <span class="hljs-type">TEXT</span>
);
<span class="hljs-keyword">SELECT</span> create_distributed_table(<span class="hljs-string">'users'</span>, <span class="hljs-string">'id'</span>);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> orders (
    id <span class="hljs-type">BIGINT</span>,
    user_id <span class="hljs-type">BIGINT</span>,
    product_id <span class="hljs-type">BIGINT</span>,
    amount <span class="hljs-type">INTEGER</span>,
    <span class="hljs-keyword">PRIMARY KEY</span> (user_id, id)
);
<span class="hljs-keyword">SELECT</span> create_distributed_table(<span class="hljs-string">'orders'</span>, <span class="hljs-string">'user_id'</span>);

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> products (
    id <span class="hljs-type">BIGINT</span> <span class="hljs-keyword">PRIMARY KEY</span>,
    <span class="hljs-type">name</span> <span class="hljs-type">TEXT</span>,
    price <span class="hljs-type">NUMERIC</span>
);
<span class="hljs-keyword">SELECT</span> create_reference_table(<span class="hljs-string">'products'</span>);
</code></pre>
<h3 id="heading-insert-some-data">Insert some data</h3>
<pre><code class="lang-pgsql"><span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> users (id, <span class="hljs-type">name</span>) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-number">1</span>, <span class="hljs-string">'Alice'</span>),
(<span class="hljs-number">2</span>, <span class="hljs-string">'Bob'</span>),
(<span class="hljs-number">3</span>, <span class="hljs-string">'Charlie'</span>);

<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> orders (id, user_id, product_id, amount) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>),
(<span class="hljs-number">2</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>),
(<span class="hljs-number">3</span>, <span class="hljs-number">2</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>),
(<span class="hljs-number">4</span>, <span class="hljs-number">3</span>, <span class="hljs-number">3</span>, <span class="hljs-number">5</span>);
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> products (id, <span class="hljs-type">name</span>, price) <span class="hljs-keyword">VALUES</span>
(<span class="hljs-number">1</span>, <span class="hljs-string">'Product A'</span>, <span class="hljs-number">10.00</span>),
(<span class="hljs-number">2</span>, <span class="hljs-string">'Product B'</span>, <span class="hljs-number">20.00</span>),
(<span class="hljs-number">3</span>, <span class="hljs-string">'Product C'</span>, <span class="hljs-number">30.00</span>);
</code></pre>
<h3 id="heading-see-how-shards-are-spread">See how shards are spread</h3>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> citus_shards
<span class="hljs-keyword">WHERE</span> <span class="hljs-built_in">table_name</span> = <span class="hljs-string">'orders'</span>::<span class="hljs-type">regclass</span>;
</code></pre>
<h3 id="heading-find-which-node-host-which-shard">Find which node host which shard</h3>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span>
  s.shardid,
  n.nodename,
  n.nodeport
<span class="hljs-keyword">FROM</span> pg_dist_shard s
<span class="hljs-keyword">JOIN</span> pg_dist_shard_placement p <span class="hljs-keyword">ON</span> s.shardid = p.shardid
<span class="hljs-keyword">JOIN</span> pg_dist_node n <span class="hljs-keyword">ON</span> p.nodename = n.nodename
<span class="hljs-keyword">WHERE</span> s.logicalrelid = <span class="hljs-string">'orders'</span>::<span class="hljs-type">regclass</span>;
</code></pre>
<h3 id="heading-find-which-has-a-specific-row">Find which shard holds a specific row</h3>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span> get_shard_id_for_distribution_column(<span class="hljs-string">'orders'</span>, <span class="hljs-number">1</span>);
</code></pre>
<h3 id="heading-join-distributed-distributed-co-located">Join distributed-distributed (co-located)</h3>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span>
    o.id <span class="hljs-keyword">AS</span> order_id,
    u.name <span class="hljs-keyword">AS</span> customer,
    o.amount
<span class="hljs-keyword">FROM</span> orders o
<span class="hljs-keyword">JOIN</span> users u <span class="hljs-keyword">ON</span> o.user_id = u.id;
</code></pre>
<p>This join is efficient because <code>orders</code> and <code>users</code> are sharded on the same key (<code>user_id</code> and <code>id</code> respectively), so matching rows are co-located on the same node.</p>
<h3 id="heading-join-distributed-reference">Join distributed-reference</h3>
<pre><code class="lang-pgsql"><span class="hljs-keyword">SELECT</span>
    o.id <span class="hljs-keyword">AS</span> order_id,
    u.name <span class="hljs-keyword">AS</span> customer,
    p.name <span class="hljs-keyword">AS</span> product,
    o.amount
<span class="hljs-keyword">FROM</span> orders o
<span class="hljs-keyword">JOIN</span> users u <span class="hljs-keyword">ON</span> o.user_id = u.id
<span class="hljs-keyword">JOIN</span> products p <span class="hljs-keyword">ON</span> o.product_id = p.id;
</code></pre>
<p>This works well because <code>products</code> is a reference table that is replicated to every node, so the join never has to move shard data across the network.</p>
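<p>To see how Citus executes a distributed join, run <code>EXPLAIN</code> on the coordinator. The plan is wrapped in a Citus custom scan with one task per shard:</p>
<pre><code class="lang-pgsql">EXPLAIN SELECT o.id, u.name, o.amount
FROM orders o
JOIN users u ON o.user_id = u.id;
-- Look for "Custom Scan (Citus Adaptive)" and the Task Count
</code></pre>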
<h2 id="heading-references">References</h2>
<ul>
<li><p><a target="_blank" href="https://www.postgresql.org/docs/current/ddl-partitioning.html">Postgres partitioning docs</a></p>
</li>
<li><p><a target="_blank" href="https://stackgres.io/doc/1.2/reference/crd/sgcluster/">StackGres docs</a></p>
</li>
<li><p><a target="_blank" href="https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/">Citus data</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[An interactive Guide to HyperLogLog]]></title><description><![CDATA[The problem
Imagine you’re running a large scale online store. Thousands of users visit your website every second, and you want to know how many unique users visit each day. This sounds straightforward just track each user by their IP address or logi...]]></description><link>https://blog.sagyamthapa.com.np/an-interactive-guide-to-hyperloglog</link><guid isPermaLink="true">https://blog.sagyamthapa.com.np/an-interactive-guide-to-hyperloglog</guid><category><![CDATA[hyperloglog]]></category><category><![CDATA[interactive_blog]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Sagyam Thapa]]></dc:creator><pubDate>Mon, 16 Dec 2024 19:13:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1734202494700/1652c5f4-f19a-4338-bc54-54cd6747f374.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-the-problem">The problem</h2>
<p>Imagine you’re running a large scale online store. Thousands of users visit your website every second, and you want to know how many unique users visit each day. This sounds straightforward: just track each user by their IP address or login ID. But here’s the catch: keeping a list of every unique user requires a lot of memory, especially as the number of users grows into the billions.</p>
<p>How do you solve this problem without drowning in memory usage? This is where <strong>HyperLogLog</strong>, a probabilistic data structure, comes into play.</p>
<h2 id="heading-the-solution">The solution</h2>
<p>HyperLogLog (HLL) is a clever algorithm that provides an approximate count of unique items <strong><em>(billions of items)</em></strong> while using a fraction of the memory <strong><em>(about 1.5 kB)</em></strong> required by exact methods. It achieves this by trading off a small amount of accuracy for significant space savings.</p>
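<p>If you just want to use HyperLogLog rather than build it, Redis ships it out of the box. A quick sketch with <code>redis-cli</code> (the key name <code>daily_visitors</code> is just an example):</p>
<pre><code class="lang-bash"># Add elements to the sketch (duplicates are ignored)
redis-cli PFADD daily_visitors 203.0.113.5 203.0.113.6 203.0.113.5

# Approximate number of unique elements seen so far
redis-cli PFCOUNT daily_visitors
</code></pre>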
<h2 id="heading-play-with-hyperloglog">Play with HyperLogLog</h2>
<p>I have created a fun little <a target="_blank" href="https://tools.sagyamthapa.com.np/hyperloglog">app</a> that lets you play with HyperLogLog. Here is how the app works:</p>
<ul>
<li><p><strong>Input IP Address</strong>:</p>
<ul>
<li><p>Click <strong>"Random IP"</strong> to generate IP address automatically.</p>
</li>
<li><p>Add the entered IP to the HyperLogLog by clicking <strong>"Add to HLL"</strong>.</p>
</li>
</ul>
</li>
<li><p><strong>Adjust Bucket Count</strong>:</p>
<ul>
<li><p>Use the slider to adjust the number of buckets.</p>
</li>
<li><p>This resets the HyperLogLog and clears all previous data.</p>
</li>
</ul>
</li>
<li><p><strong>Add Multiple Random IPs</strong>:</p>
<ul>
<li>Use the preset buttons to add 1K, 5K, 10K, 50K, or 100K random IP addresses to the HyperLogLog.</li>
</ul>
</li>
<li><p><strong>View Metrics</strong>:</p>
<ul>
<li><p>Check <strong>Actual Count</strong>, <strong>Estimated Count</strong>, <strong>Difference</strong>, <strong>Margin of Error</strong>, and <strong>Actual Error</strong> in the metrics cards.</p>
</li>
<li><p>Accurate metrics are displayed as the HyperLogLog processes the inputs.</p>
</li>
</ul>
</li>
<li><p><strong>Inspect Buckets</strong>:</p>
<ul>
<li>Scroll through individual buckets to observe how HyperLogLog distributes and calculates run lengths.</li>
</ul>
</li>
</ul>
<div class="hn-embed-widget" id="hyper-log-log"></div><p> </p>
<h2 id="heading-things-to-notice">Things to notice</h2>
<ul>
<li><p>As you increase the number of buckets, the <strong><em>Margin of error</em></strong> shrinks. <em>This is because a few buckets may get unlucky and see a long run early on, but when you spread that luck over a large number of buckets, the chance of such mistakes drops.</em> (Kinda like how insurance works.)</p>
</li>
<li><p>The error never actually reaches zero, because this is a probabilistic algorithm, i.e. unexpected wild swings are always possible.</p>
</li>
<li><p>See what happens to the estimate if you skimp on the number of buckets.</p>
</li>
</ul>
<h2 id="heading-working">Working</h2>
<ol>
<li><h3 id="heading-hash-functions-and-uniform-distribution">Hash Functions and Uniform Distribution</h3>
<p> If you have read this far, I trust you know how a hash function works. To refresh your memory:</p>
<p> Hashing involves using a hash function to convert input data (like an IP address) into a fixed size output, often a number. Good hash functions are deterministic <em>(they always produce the same output for the same input)</em> and uniformly distribute outputs across the possible range.</p>
</li>
<li><h3 id="heading-leading-zeros-and-cardinality">Leading Zeros and Cardinality</h3>
<p> The core insight is that for a uniform random hash value:</p>
<ul>
<li><p>The probability of encountering a hash value with at least \(k\) leading zeros in its binary representation is \(2^{-k}\).<br />  Example: For \(k=3\), the binary representation must start with \(000\), which occurs \(2^{-3} = \frac{1}{8}\) of the time.</p>
</li>
<li><p>The expected maximum run of leading zeros in the hash values increases logarithmically with the number of distinct elements \(n\) in the dataset.</p>
</li>
</ul>
</li>
</ol>
<ol start="3">
<li><h3 id="heading-bucketing-and-parallelism">Bucketing and Parallelism</h3>
</li>
</ol>
<p>To reduce variance and improve accuracy, the HyperLogLog algorithm splits the hash values into \(m=2^p\) buckets (where \(p\) is a tunable parameter).</p>
<ul>
<li><p>Each bucket is determined by the first \(p\) bits of the hash value, which serve as the <strong>bucket index</strong>.</p>
</li>
<li><p>The remaining bits of the hash value are used to compute the number of leading zeros for that bucket.</p>
</li>
<li><p>Each bucket keeps track of the <strong>maximum number of leading zeros</strong> observed for hash values assigned to it.</p>
</li>
</ul>
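<p>A small worked example (illustrative bit pattern, with \(p = 4\)): suppose an IP address hashes to the binary value</p>
<p>$$\underbrace{1011}_{\text{bucket index } = 11}\;\underbrace{0000\,1101\ldots}_{\text{4 leading zeros}}$$</p>
<p>The first 4 bits send this value to bucket 11, and the remaining bits start with a run of 4 zeros, so the bucket stores \(M[11] = \max(M[11], 4)\). (Implementations vary in whether they record the count of leading zeros or the position of the first 1-bit.)</p>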
<ol start="4">
<li><h3 id="heading-harmonic-mean-of-maximum-leading-zeros">Harmonic Mean of Maximum Leading Zeros</h3>
<p> Each bucket contributes an estimate of the cardinality based on the leading zeros it observes. Since these estimates can vary significantly, HyperLogLog uses the <strong>harmonic mean</strong> of these estimates to combine the results:</p>
</li>
</ol>
<p>$$E = α_m \cdot m^2 \cdot \left(\sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$$</p><p>Where:</p>
<ul>
<li><p>\(m=2^p\): Number of buckets.</p>
</li>
<li><p>\(M[j]\): Maximum number of leading zeros observed in the \(j-th\) bucket.</p>
</li>
<li><p>\(α_m\)​: A bias correction constant dependent on \(m\) derived empirically.</p>
</li>
</ul>
<ol start="5">
<li><h3 id="heading-bias-correction-and-range-adjustment">Bias Correction and Range Adjustment</h3>
<p> The raw estimate \(E\) can be biased for small or large cardinalities. HyperLogLog applies bias correction in the following ways:</p>
 <p> 1. <strong>Small Range Correction</strong>: If \(E\) is small (\(E \leq \frac{5}{2}m\)), it applies a correction to handle the underestimation caused by hash collisions:</p>
</li>
</ol>
<p>$$E_{\text{corrected}} = m \cdot \log \left( \frac{m}{V} \right)$$</p><p>Where \(V\) is the number of empty buckets.</p>
<ol start="2">
<li><p><strong>Large Range Correction</strong>: If \(E\) exceeds a threshold (typically when \(n\) approaches \(2^{32}\)), the algorithm applies the correction \(E^* = -2^{32} \log\left(1 - \frac{E}{2^{32}}\right)\) to compensate for hash collisions in the 32-bit hash space.</p>
</li>
</ol>
<ol start="6">
<li><h3 id="heading-error-and-memory-efficiency">Error and Memory Efficiency</h3>
</li>
</ol>
<p>The relative error of HyperLogLog is approximately:</p>
<p>$$Error \approx \frac{1.04}{\sqrt{m}}$$</p><ul>
<li><p>Larger \(m\) (more buckets) reduces error but increases memory usage.</p>
</li>
<li><p>Memory usage is proportional to \(m \log_2(\log_2(n))\) bits, making HyperLogLog extremely space efficient.</p>
</li>
</ul>
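<p>Plugging in a few values of \(m\) makes the trade-off concrete (example numbers of my own):</p>

```python
import math

# Relative error shrinks as 1.04 / sqrt(m), so quadrupling the bucket
# count halves the error.
for p in (10, 14, 16):
    m = 2 ** p
    print(f"p={p:2d}: m={m:6d} buckets, error ~ {1.04 / math.sqrt(m):.2%}")
```

<p>For instance, \(m = 2^{14} = 16384\) buckets give a standard error of about 0.81%.</p>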
<ol start="7">
<li><h3 id="heading-intuition-behind-logarithmic-behavior">Intuition Behind Logarithmic Behavior</h3>
</li>
</ol>
<p>The logarithmic behavior of leading zeros stems from the exponential relationship between probabilities and cardinalities:</p>
<ul>
<li><p>As the cardinality \(n\) grows, the maximum number of leading zeros observed tends to grow as \(\log_2 n\), since each additional leading zero is half as likely to occur.</p>
</li>
<li><p>HyperLogLog aggregates these local estimates (per bucket) and normalizes them using the harmonic mean, resulting in a robust global estimate.</p>
</li>
</ul>
<h2 id="heading-where-its-used">Where it’s used</h2>
<ul>
<li><p><strong>Web Analytics</strong>: Google Analytics and YouTube use algorithms similar to HyperLogLog to estimate unique visitors.</p>
</li>
<li><p><strong>Databases</strong>: <a target="_blank" href="https://docs.timescale.com/use-timescale/latest/hyperfunctions/approx-count-distincts/hyperloglog/">TimescaleDB</a> and <a target="_blank" href="https://redis.io/docs/latest/develop/data-types/probabilistic/hyperloglogs/">Redis</a> implement HyperLogLog for approximate distinct counts.</p>
</li>
<li><p><strong>Big Data Platforms</strong>: <a target="_blank" href="https://druid.apache.org/docs/latest/querying/hll-old#hyperunique-aggregator">Apache Druid</a> and <a target="_blank" href="https://prestodb.io/docs/current/functions/hyperloglog.html#">Presto</a> use HyperLogLog to provide fast, approximate query results.</p>
</li>
</ul>
<h2 id="heading-downsides">Downsides</h2>
<p>While HyperLogLog is powerful, it’s important to understand its limitations:</p>
<ul>
<li><p><strong>Approximation</strong>: The algorithm provides an estimate, not an exact count. The error rate is about \(\frac{1.04}{\sqrt{m}}\).</p>
</li>
<li><p><strong>Hash Collisions</strong>: The accuracy depends on the quality of the hash function. Poor hashing can lead to inaccuracies.</p>
</li>
</ul>
<h2 id="heading-further-reading">Further reading</h2>
<p><a target="_blank" href="https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf">HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm</a></p>
<p><a target="_blank" href="https://engineering.fb.com/2018/12/13/data-infrastructure/hyperloglog/">Engineering at Meta: HyperLogLog in Presto: A significantly faster way to handle cardinality estimation</a></p>
<p><a target="_blank" href="https://redis.io/docs/latest/develop/data-types/probabilistic/hyperloglogs/">Redis Docs</a></p>
]]></content:encoded></item><item><title><![CDATA[An Interactive Guide to Bloom Filter]]></title><description><![CDATA[Introduction
Bloom filter is space efficient probabilistic data structure that can tell if a given element is already present in a database. It saves us from doing an expensive query to our database. While Bloom filters can guarantee that an element ...]]></description><link>https://blog.sagyamthapa.com.np/an-interactive-guide-to-bloom-filter</link><guid isPermaLink="true">https://blog.sagyamthapa.com.np/an-interactive-guide-to-bloom-filter</guid><category><![CDATA[interactive_blog]]></category><category><![CDATA[bloom filter]]></category><category><![CDATA[data structures]]></category><dc:creator><![CDATA[Sagyam Thapa]]></dc:creator><pubDate>Sun, 01 Dec 2024 16:56:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1732723787386/a448c312-6aa9-4b3d-884a-1b22f1abfc10.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>A Bloom filter is a <strong>space-efficient probabilistic data structure</strong> that can tell if a given element is already present in a database. It saves us from doing an expensive query to our database. <strong>While Bloom filters can guarantee that an element is not in the set, they cannot guarantee its presence.</strong> Instead, they can sometimes return false positives, indicating an element is in the set when it is not, but they never return false negatives.</p>
<h1 id="heading-problem">Problem</h1>
<p>Before diving into how Bloom filters work, let’s consider the problem they solve. Imagine you run a website that needs to process thousands or even millions of requests every second. One of your tasks is to check whether the IP address making a request is in a list of banned IPs.</p>
<p>If you store this list in a traditional database or an in-memory data structure like a hash table, every lookup will consume resources, and the cost will grow with the size of the list. For every incoming request, you’ll have to query the database or search through the list, which could severely impact the website’s performance.</p>
<p>Wouldn’t it be great if there was a magic solution to quickly determine whether an IP address is banned in constant time without querying the database? Enter Bloom filters.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733064544669/86d95e4f-ecf8-4cbc-9933-27493f2a9255.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-prerequisite">Prerequisite</h1>
<h2 id="heading-hashing">Hashing</h2>
<p>To understand Bloom filters, you need to be familiar with the concept of <strong>hashing</strong>. Hashing involves using a hash function to convert input data (like an IP address) into a fixed size output, often a number. Good hash functions are deterministic (they always produce the same output for the same input) and uniformly distribute outputs across the possible range.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1733025523446/548bdb33-99e4-4c0d-b6ed-a53b0a22e3dc.png" alt="Diagram illustrating a hash function mapping keys to hash values. &quot;1.1.1.1&quot; maps to &quot;00,&quot; &quot;2.2.2.2&quot; and &quot;3.3.3.3&quot; map to &quot;05,&quot; showing a hash collision. &quot;4.4.4.4&quot; maps to &quot;05.&quot; Hash values range from &quot;00&quot; to &quot;06.&quot;" class="image--center mx-auto" /></p>
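<p>For intuition, here is a tiny hashing sketch in Python matching the figure above. The function name and table size of 7 are my own choices; any good hash function works:</p>

```python
import hashlib

def hash_ip(ip: str, table_size: int = 7) -> int:
    # Deterministic: the same IP always maps to the same slot;
    # different IPs may collide, as in the figure.
    digest = hashlib.md5(ip.encode()).digest()
    return int.from_bytes(digest[:8], "big") % table_size

print(hash_ip("1.1.1.1") == hash_ip("1.1.1.1"))  # True: deterministic
```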
<h1 id="heading-working">Working</h1>
<p>Bloom filters address the problem of quickly checking membership by using multiple hash functions and a bit array. Here’s how it works:</p>
<ol>
<li><p><strong>Initialization</strong>: A Bloom filter uses a fixed size bit array <code>(m)</code>, initially set to all zeros. It also uses <code>k</code> independent hash functions.</p>
</li>
<li><p><strong>Adding an element</strong>:</p>
<ul>
<li><p>To add an element, it is passed through all <code>k</code> hash functions.</p>
</li>
<li><p>Each hash function maps the element to a position in the bit array, and the corresponding bits at these positions are set to 1.</p>
</li>
</ul>
</li>
<li><p><strong>Checking for membership</strong>:</p>
<ul>
<li><p>To check if an element is in the set, the element is hashed with the same <code>k</code> hash functions.</p>
</li>
<li><p>If all the bits at the positions indicated by the hash functions are set to 1, the filter reports that the element <em>might</em> be in the set.</p>
</li>
<li><p>If any of these bits are 0, the element is definitely not in the set.</p>
</li>
</ul>
</li>
</ol>
<p>This design ensures that the Bloom filter is both space efficient and fast. However, there is a trade off: the possibility of <strong>false positives</strong>, which occurs when the bits set by other elements overlap, making it appear that an element is in the set when it is not.</p>
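<p>The three steps above can be sketched as a minimal Bloom filter in Python. This is an illustrative toy, not a production implementation; the <code>k</code> hash functions are simulated by salting a single SHA-256 hash:</p>

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m          # fixed-size bit array, all zeros

    def _positions(self, item: str):
        # Simulate k independent hash functions by salting one hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1       # set all k positions to 1

    def might_contain(self, item: str) -> bool:
        # Any zero bit -> definitely absent; all ones -> maybe present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=64, k=3)
bf.add("1.1.1.1")
print(bf.might_contain("1.1.1.1"))   # True: no false negatives
```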
<p>I have created a fun little <a target="_blank" href="https://tools.sagyamthapa.com.np/bloom-filter">app</a> that lets you play with a Bloom filter.</p>
<div class="hn-embed-widget" id="bloom-filter"></div><p> </p>
<ul>
<li><p>See what happens when you fill the filter with all ones.</p>
</li>
<li><p>Can you get the Bloom filter to return a false positive?</p>
</li>
<li><p>Notice how increasing the number of hash functions fills up the filter faster.</p>
</li>
<li><p>Notice how decreasing the number of hash functions affects the false positive probability.</p>
</li>
</ul>
<h1 id="heading-tuning">Tuning</h1>
<p>I have made another fun little <a target="_blank" href="https://tools.sagyamthapa.com.np/bloom-calculator">app</a> that lets you play around with the parameters of a Bloom filter.</p>
<ul>
<li><p>Number of elements <code>N</code></p>
</li>
<li><p>Size of filter <code>M</code></p>
</li>
<li><p>Number of hash functions <code>K</code></p>
</li>
</ul>
<div class="hn-embed-widget" id="bloom-calculator"></div><p> </p>
<p>Some conclusions you can draw from the graphs, for a well designed filter:</p>
<ul>
<li><p>False positive rate vs. number of items follows a logistic curve.</p>
</li>
<li><p>False positive rate vs. number of hash functions follows a J curve.</p>
</li>
<li><p>False positive rate vs. filter size follows a downward sloping line.</p>
</li>
</ul>
<h1 id="heading-formulae">Formulae</h1>
<h3 id="heading-1-probability-of-a-false-positive">1. Probability of a False Positive</h3>
<p>The probability of a false positive in a Bloom Filter is given by:</p>
<p>$$P \approx \left( 1 - e^{- \frac{kn}{m}} \right)^k$$</p><p>Where:</p>
<ul>
<li><p><code>m</code>: Number of bits in the Bloom Filter.</p>
</li>
<li><p><code>k</code>: Number of hash functions.</p>
</li>
<li><p><code>n</code>: Number of elements inserted into the filter.</p>
</li>
</ul>
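<p>These formulas are easy to evaluate directly. A quick sketch in Python, using hypothetical sizing numbers of my own (the optimal-\(k\) rule of thumb \(k = \frac{m}{n} \ln 2\) is used to pick the hash count):</p>

```python
import math

def false_positive_rate(m: int, k: int, n: int) -> float:
    # P ~ (1 - e^(-kn/m))^k
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> float:
    # k = (m/n) * ln 2 minimizes the false positive rate.
    return (m / n) * math.log(2)

# Hypothetical sizing: 10,000 banned IPs in a 96,000-bit filter.
m, n = 96_000, 10_000
k = round(optimal_k(m, n))   # ~ 9.6 * ln 2 ~ 6.65 -> 7 hash functions
print(k, false_positive_rate(m, k, n))  # roughly a 1% false positive rate
```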
<h3 id="heading-2-optimal-number-of-hash-functions">2. Optimal Number of Hash Functions</h3>
<p>The optimal number of hash functions <code>k</code>, to minimize the false positive rate, is:</p>
<p>$$k = \frac{m}{n} \ln 2$$</p><h3 id="heading-3-expected-fraction-of-bits-set-to-1">3. Expected Fraction of Bits Set to 1</h3>
<p>The fraction <code>f</code> of bits in the Bloom Filter that are set to 1 after <code>n</code> insertions is:</p>
<p>$$f = 1 - \left( 1 - \frac{1}{m} \right)^{kn}$$</p><h1 id="heading-improvements">Improvements</h1>
<ul>
<li><h3 id="heading-cuckoo-filters">Cuckoo Filters:</h3>
<p>  A lighter and faster version that allows you to delete an inserted item. It collects fingerprint of inserted item and stores it in an array of buckets. Works great for application needing frequency counts or large scale de-duplication.</p>
</li>
<li><h3 id="heading-counting-bloom-filters">Counting Bloom filters</h3>
<p>  Uses a <strong>counter array</strong>, where each position in the array is a small integer. Lookup and delete are performed by incrementing and decrementing the positions in the array. Works great for application with high performance and low memory usage</p>
</li>
</ul>
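<p>A counting Bloom filter along these lines can be sketched in Python (illustrative only; a real implementation would use fixed-width counters and guard against overflow):</p>

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m      # small integers instead of single bits

    def _positions(self, item: str):
        # Simulate k independent hash functions by salting one hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1  # increment on insert

    def remove(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1  # decrement on delete

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=64, k=3)
cbf.add("2.2.2.2")
cbf.remove("2.2.2.2")
print(cbf.might_contain("2.2.2.2"))  # False: deletion is now possible
```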
<h1 id="heading-application">Application</h1>
<p>Bloom filters have a wide range of applications, including:</p>
<ul>
<li><p><strong>Database Query Optimization</strong>: Reduce database lookups by quickly discarding queries for non-existent elements.</p>
</li>
<li><p><strong>Web Caching</strong>: Check if a URL is cached before attempting to fetch it.</p>
</li>
<li><p><strong>Spam Detection</strong>: Quickly determine whether an email sender is blacklisted.</p>
</li>
<li><p><strong>Distributed Systems</strong>: Identify duplicate data or requests in distributed storage and processing systems.</p>
</li>
</ul>
<h1 id="heading-further-reading">Further reading</h1>
<ul>
<li><p><a target="_blank" href="https://blog.cloudflare.com/when-bloom-filters-dont-bloom">When Bloom filters don't bloom</a></p>
</li>
<li><p><a target="_blank" href="https://en.wikipedia.org/wiki/Bloom_filter">Wikipedia</a></p>
</li>
<li><p><a target="_blank" href="https://www.jasondavies.com/bloomfilter/">JasonDavies</a></p>
</li>
<li><p><a target="_blank" href="https://samwho.dev/bloom-filters/">SamWhoCode</a></p>
</li>
</ul>
]]></content:encoded></item></channel></rss>