Symmetric Hash Join Implementation
🤝

Symmetric Hash Join Implementation

Tags
C
PostgreSQL
Published
November 29, 2023

1. Objective

The task is to develop a symmetric hash join operator to enhance PostgreSQL's query processing capabilities. This development will involve modifications to both the optimizer and executor components of PostgreSQL.

2. Understanding Hash Join

The current hash join mechanism in PostgreSQL, used for joining two relations (T1 and T2) on attributes (attr1 and attr2), involves two phases: building and probing. Initially, the inner relation (T1) is processed, followed by the outer relation (T2). The algorithm has limitations due to memory requirements for large relations, leading to the adoption of a Hybrid Hash Join. This variant partitions tuples into batches, processing them sequentially with non-memory-resident batches stored on disk.

3. Implementing Symmetric Hash Join

The project involves implementing a symmetric hash join. This approach differs from the regular hash join by maintaining two separate hash tables, each using a distinct hash function for the involved relations. It alternates between reading from the inner and outer relations, probing for matches in the opposite table. The implementation must adhere to the demand-pull pipeline interface, necessitating state preservation between calls for continuity in join output tuple production.

4. Modifying PostgreSQL Components for Implementation

4.1 Optimizer

The optimizer's role is to generate an execution plan from the query parser's output, constructing Path trees and selecting the most cost-effective path as the plan. It involves building RelOptInfo structures for base and join relations and assessing various paths for scans and joins. The final plan is derived from the cheapest path for the relevant RelOptInfo.

4.2 Executor

The executor processes the plan tree, comprising of Plan nodes, into a demand-pull pipeline of tuple operations. Each node produces the next tuple in its output sequence. The executor maintains a parallel tree of executor state nodes corresponding to the Plan nodes, storing execution-specific data. This structure allows for read-only plan trees, facilitating plan caching and reuse.

4.3 PostgreSQL Hash Join Operator

In PostgreSQL, the hash join is implemented in nodeHashjoin.c, and hash table creation in nodeHash.c. The hash join node includes subplans representing the outer and inner relations. PostgreSQL's hybrid hash join approach is utilized to manage large relations. For the symmetric hash join implementation, the complexity of multiple batches can be ignored, assuming both hash tables fit in memory.

5. Results

notion image

Getting Started

  1. Enter Postgres Folder: Navigate to the folder with cd postgresql-15.4.
  1. Install PostgreSQL: Run the following commands in order: ./configure make sudo su (enter your password) make install Create PostgreSQL User: Make a 'postgres' user in your system settings and set a password.
  1. Create Data Folder: Run mkdir -p /usr/local/pgsql/data and change ownership with chown postgres /usr/local/pgsql/data.
  1. Switch to PostgreSQL User: Use su - postgres.
  1. Test Installation:
Initialize database: /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data Start server: /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l logfile start Create a test database: /usr/local/pgsql/bin/createdb testdb
  1. Run python3 swapFiles.py
  1. Test SQL Command: In the terminal, run psql.
 

Prerequisites

  • C
  • PostgreSQL 8.1.7
 
Check out the project on GitHub:
3130-A1
Deniz-JasaUpdated Oct 19, 2023