Understanding SQL Parsers

It’s rare to talk about deterministic systems when there’s so much hype around probabilistic ones. This post is boring, but stay with me.

My work at Atlan has touched SQL parsing since the beginning. I made early contributions to the query engine with policy-based authorization powering Insights, and we generate SQL lineage by parsing queries to power column-level lineage. Along the way, my colleagues and I have evaluated a lot of SQL parsers, both open source and commercial: SQLGlot, sqlparser-rs, sqloxide, Apache Calcite, Gudusoft GSP, JSqlParser, and others. Each with different tradeoffs.

This post is my attempt to distill what I’ve learned. I’m not an expert, just someone curious enough to ask: why do so many SQL parsers exist? And what are they actually doing under the hood?

The idea for this writeup came during a drive back from SFO after dropping a friend off at the airport. That turned into about 10 hours of conversation with Claude to pull it all together.



What is a SQL Parser?

A SQL parser reads SQL text and converts it into a structured representation, usually a tree, that computers can work with. It’s the “understanding” step.

INPUT:  "CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'"
                                            ↓
                                       [SQL PARSER]
                                            ↓
OUTPUT: A tree structure representing the query's meaning

Think of it like how your brain parses a sentence to extract meaning. The parser does the same for SQL.


The Full Pipeline

Every SQL parser follows the same fundamental pipeline. This isn’t a design choice. It’s a consequence of how language processing works.

Lexer: Breaks the SQL string into tokens. Keywords, identifiers, operators, literals. Like recognizing words in a sentence.

Parser: Takes tokens and builds a tree based on grammar rules. Like understanding “subject-verb-object” structure.

AST: The Abstract Syntax Tree. A clean, navigable representation of the query’s structure.

Semantic Analysis: Adds meaning. Does this table exist? What type is this column? This is where you need schema information.


Lexical Analysis (The Lexer)

The lexer converts a stream of characters into meaningful chunks called tokens.

Input:

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Output:

Token(CREATE)
Token(TABLE)
Token(IDENTIFIER, "active_users")
Token(AS)
Token(SELECT)
Token(IDENTIFIER, "id")
Token(COMMA)
Token(IDENTIFIER, "name")
Token(COMMA)
Token(IDENTIFIER, "email")
Token(FROM)
Token(IDENTIFIER, "users")
Token(WHERE)
Token(IDENTIFIER, "status")
Token(EQUALS)
Token(STRING, "active")

The lexer handles dialect-specific decisions early:

| Decision         | Standard SQL | MySQL            | SQL Server  | PostgreSQL             |
|------------------|--------------|------------------|-------------|------------------------|
| Identifier quote | "name"       | `name`           | [name]      | "name"                 |
| String quote     | 'text'       | 'text' or "text" | 'text'      | 'text'                 |
| Line comment     | --           | -- or #          | --          | --                     |
| Case sensitivity | Insensitive  | Insensitive      | Insensitive | Insensitive (keywords) |

Lexers are simple. They look at one character (or a few) at a time, don’t need to understand nesting or structure, and can be implemented with state machines or regexes. Tokenization is pure CPU-bound character scanning, which is why SQLGlot’s optional Rust tokenizer yields roughly a 30% speedup.
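To make this concrete, here’s a toy tokenizer in plain Python. It’s a sketch of the idea, not how SQLGlot or any real library does it; the token names and keyword set are made up for this example.

```python
import re

# A minimal SQL lexer sketch: a prioritized list of regexes, tried at the
# current position until the input is consumed. Real lexers also handle
# comments, numeric literals, escapes, and dialect-specific quote styles.
TOKEN_SPEC = [
    ("STRING",     r"'[^']*'"),               # single-quoted string literal
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("EQUALS",     r"="),
    ("COMMA",      r","),
    ("SKIP",       r"\s+"),                   # whitespace: discarded
]
KEYWORDS = {"CREATE", "TABLE", "AS", "SELECT", "FROM", "WHERE"}

def lex(sql):
    tokens, pos = [], 0
    while pos < len(sql):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, sql[pos:])
            if m:
                text = m.group()
                if name == "IDENTIFIER" and text.upper() in KEYWORDS:
                    tokens.append((text.upper(), text))  # promote to keyword
                elif name != "SKIP":
                    tokens.append((name, text))
                pos += len(text)
                break
        else:
            raise ValueError(f"unexpected character at {pos}: {sql[pos]!r}")
    return tokens

print(lex("SELECT id, name FROM users WHERE status = 'active'"))
```

Note the one design decision even this toy version has to make: keywords are just identifiers until the lexer promotes them, which is also how real lexers cope with keywords used as column names.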


Syntactic Analysis (The Parser)

The parser reads tokens and builds a tree structure based on grammar rules.

Input tokens:

[CREATE, TABLE, active_users, AS, SELECT, id, COMMA, name, COMMA, email, FROM, users, WHERE, status, =, 'active']

Grammar rule:

ctas_stmt    := CREATE TABLE table_name AS select_stmt
select_stmt  := SELECT column_list FROM table_name [WHERE condition]

Parser thinks:

  • “I see CREATE TABLE… this must be a ctas_stmt”
  • “Next I need a table_name… I see ‘active_users’”
  • “Next I need AS… got it”
  • “Now I need a select_stmt…”
  • “I see SELECT… parsing the inner query”
  • “Column list: id, name, email”
  • “FROM users”
  • “WHERE status = ‘active’”

The parser is essentially a state machine following grammar rules, consuming tokens and building tree nodes.
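Here’s that “parser thinking” turned into a toy recursive-descent parser for the two grammar rules above. The dict-shaped tree and function names are invented for illustration; real parsers also track source positions, operator precedence, and hundreds of rules.

```python
def parse_ctas(tokens):
    """ctas_stmt := CREATE TABLE table_name AS select_stmt"""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def select_stmt():
        """select_stmt := SELECT column_list FROM table_name [WHERE condition]"""
        eat("SELECT")
        cols = [eat()]
        while peek() == ",":          # column_list: identifiers joined by commas
            eat(",")
            cols.append(eat())
        eat("FROM")
        node = {"type": "Select", "columns": cols, "from": eat()}
        if peek() == "WHERE":         # optional WHERE clause
            eat("WHERE")
            node["where"] = {"left": eat(), "op": eat("="), "right": eat()}
        return node

    eat("CREATE"); eat("TABLE")
    name = eat()
    eat("AS")
    return {"type": "CreateTableAsSelect", "table": name, "query": select_stmt()}

tokens = ["CREATE", "TABLE", "active_users", "AS", "SELECT", "id", ",",
          "name", ",", "email", "FROM", "users", "WHERE", "status", "=", "'active'"]
tree = parse_ctas(tokens)
print(tree["table"], tree["query"]["columns"])  # active_users ['id', 'name', 'email']
```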

Why Parsing is Harder Than Lexing

Lexing is pattern matching. Parsing is about structure.

SELECT * FROM (SELECT * FROM (SELECT * FROM t))

A lexer sees: (, (, (, ), ), )

A parser must match each ( with its corresponding ).

This is why regex can’t parse SQL. Matching arbitrarily nested parentheses requires counting, and regular expressions provably can’t count. It’s the classic boundary between regular languages and context-free languages.
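What a parser (but not a regex) can carry is a counter — a stack, in the general case. A minimal sketch:

```python
# Balanced-parenthesis checking needs unbounded state (a depth counter),
# which is exactly what a finite automaton / regex cannot carry.
def balanced(sql):
    depth = 0
    for ch in sql:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a closer with no matching opener
                return False
    return depth == 0            # every opener was closed

print(balanced("SELECT * FROM (SELECT * FROM (SELECT * FROM t))"))  # True
print(balanced("SELECT (a + b"))                                    # False
```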


The Abstract Syntax Tree

The AST represents the query’s structure in a clean, navigable tree.

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
CreateTableAsSelect
├── table_name: "active_users"
└── query:
    └── SelectStatement
        ├── columns:
        │   ├── Column { name: "id" }
        │   ├── Column { name: "name" }
        │   └── Column { name: "email" }
        ├── from:
        │   └── Table { name: "users" }
        └── where:
            └── BinaryOp
                ├── left: Column { name: "status" }
                ├── op: Equals
                └── right: Literal { value: "active" }

The AST is the central data structure. Everything downstream (analysis, transformation, code generation) operates on it.

What You Can Do With an AST

| Operation | Description                                   |
|-----------|-----------------------------------------------|
| Traversal | Walk the tree, collect information            |
| Transform | Rewrite nodes (add filters, change structure) |
| Generate  | Convert back to SQL string (round-trip)       |
| Transpile | Generate SQL for a different dialect          |
| Lineage   | Trace where data comes from                   |
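Traversal is the operation everything else builds on. Here’s a generic walk over a dict-shaped AST like the one sketched above; real libraries expose a similar idea (Python’s `ast.walk`, SQLGlot’s `find_all`), though the node shapes here are invented for illustration.

```python
# Depth-first walk over a nested dict/list AST, yielding every dict node.
def walk(node):
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)

ast = {"type": "Select",
       "columns": [{"type": "Column", "name": "id"},
                   {"type": "Column", "name": "name"}],
       "from": {"type": "Table", "name": "users"}}

# Collecting information is then just a filter over the walk.
tables = [n["name"] for n in walk(ast) if n.get("type") == "Table"]
print(tables)  # ['users']
```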

Syntax vs Semantics

This is where things get interesting.

Syntactic analysis (parsing): Is this valid SQL grammar?

Semantic analysis: Does this SQL make sense given the database schema?

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
| Syntactic (parser answers) | Semantic (analyzer answers)                      |
|----------------------------|--------------------------------------------------|
| ✓ Valid CTAS structure     | Does ‘users’ table exist?                        |
| ✓ Valid SELECT clause      | Does ‘users’ have ‘id’, ‘name’, ‘email’ columns? |
| ✓ Valid WHERE condition    | Is ‘status’ a valid column?                      |
| ✓ Correct keyword order    | Is comparing status to a string valid?           |

Key insight: A parser with no schema information can only do syntactic analysis. Semantic analysis requires external knowledge about the database.

What Syntactic Analysis Catches

SELECT FROM users        -- Missing column list
FROM users SELECT *      -- Wrong keyword order
SELECT (a + b            -- Unbalanced parentheses

What Syntactic Analysis Cannot Catch

SELECT * FROM nonexistent    -- Parser doesn't know if table exists
SELECT foo FROM users        -- Parser doesn't know columns
WHERE name > 5               -- Parser doesn't know types
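With schema information in hand, those checks become simple lookups. A sketch, with a made-up schema dict standing in for real catalog metadata:

```python
# A toy semantic analyzer: validate table and column references against a
# schema. The schema here is invented for the running example.
SCHEMA = {"users": {"id": "INTEGER", "name": "TEXT",
                    "email": "TEXT", "status": "TEXT"}}

def check(table, columns, schema=SCHEMA):
    errors = []
    if table not in schema:
        errors.append(f"unknown table: {table}")
        return errors                      # can't check columns without the table
    for col in columns:
        if col not in schema[table]:
            errors.append(f"unknown column: {table}.{col}")
    return errors

print(check("users", ["id", "email"]))   # [] — semantically valid
print(check("users", ["foo"]))           # ['unknown column: users.foo']
print(check("nonexistent", ["id"]))      # ['unknown table: nonexistent']
```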

Column-Level Lineage

Lineage traces where data comes from and where it goes.

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Table-level lineage (easy, no schema needed):

READ:  [users]
WRITE: [active_users]

Column-level lineage (needs schema):

active_users.id    ← users.id (direct)
active_users.name  ← users.name (direct)
active_users.email ← users.email (direct)
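For a flat query like this one, column-level lineage reduces to pairing each output column with the single source table. A deliberately naive sketch that ignores aliases, joins, expressions, and `SELECT *` — all the things that make real lineage hard:

```python
# Direct column lineage for a single-table CTAS: each target column maps
# straight back to the same-named column of the source table.
def column_lineage(target_table, columns, source_table):
    return {f"{target_table}.{col}": f"{source_table}.{col}" for col in columns}

print(column_lineage("active_users", ["id", "name", "email"], "users"))
# {'active_users.id': 'users.id', 'active_users.name': 'users.name',
#  'active_users.email': 'users.email'}
```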

Data Flow vs Control Flow

This is where lineage gets nuanced.

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Question: Does status contribute to the lineage of active_users?

| Perspective  | status in lineage? | Reasoning                               |
|--------------|--------------------|-----------------------------------------|
| Data flow    | ❌ No              | status doesn’t appear in output columns |
| Control flow | ✅ Yes             | status affects which rows are included  |

Most “lineage” discussions mean data flow. But impact analysis needs both: changing the status column could break this query even though it never appears in the output.


SQL Dialects

SQL is a “standard” that nobody fully implements.

The Dialect Landscape

Every database vendor implements a subset of the SQL standard, adds proprietary extensions, and has different syntax for the same operations.

                         ┌─────────────────┐
                         │  SQL Standard   │
                         │  (SQL-92, 99,   │
                         │   2003, 2016)   │
                         └────────┬────────┘
                                  │
         ┌───────────┬──────────┬─┴───────┬──────────┬───────────┐
         ▼           ▼          ▼         ▼          ▼           ▼
    ┌──────────┐┌──────────┐┌────────┐┌────────┐┌──────────┐┌──────────┐
    │PostgreSQL││  MySQL   ││ Oracle ││  SQL   ││Snowflake ││ BigQuery │
    │          ││          ││        ││ Server ││          ││          │
    │ +ARRAY   ││ +LIMIT   ││+ROWNUM ││ +TOP   ││ +FLATTEN ││ +STRUCT  │
    │ +JSONB   ││ +BACKTICK││+DUAL   ││ +[]    ││ +VARIANT ││ +UNNEST  │
    │ +::cast  ││ +AUTO_INC││+PL/SQL ││ +T-SQL ││ +$$      ││ +SAFE_   │
    └──────────┘└──────────┘└────────┘└────────┘└──────────┘└──────────┘

   Each dialect: ~80% common SQL + ~20% proprietary extensions

This fragmentation is why so many SQL parsers exist. A parser built for PostgreSQL won’t understand MySQL’s backtick identifiers. A parser built for standard SQL won’t handle Snowflake’s FLATTEN function.

Identifier Quoting

| Dialect      | Quote style     | Example                      |
|--------------|-----------------|------------------------------|
| Standard SQL | "double quotes" | SELECT "Column" FROM "Table" |
| MySQL        | `backticks`     | SELECT `Column` FROM `Table` |
| SQL Server   | [brackets]      | SELECT [Column] FROM [Table] |
| BigQuery     | `backticks`     | SELECT `Column` FROM `Table` |
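A tiny taste of what a transpiler does with this table: rewriting MySQL backticks to standard double quotes. This toy regex version would mangle a backtick inside a string literal, which is exactly why real transpilers rewrite the token stream or the AST instead of the raw text:

```python
import re

# Rewrite MySQL-style `identifier` quoting to standard "identifier" quoting.
# A text-level toy: it cannot tell identifiers apart from string contents.
def backticks_to_standard(sql):
    return re.sub(r"`([^`]*)`", r'"\1"', sql)

print(backticks_to_standard("SELECT `Column` FROM `Table`"))
# SELECT "Column" FROM "Table"
```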

LIMIT / Pagination

-- MySQL, PostgreSQL, SQLite
SELECT * FROM users LIMIT 10 OFFSET 5

-- SQL Server
SELECT TOP 10 * FROM users
-- Or (SQL Server 2012+)
SELECT * FROM users ORDER BY id OFFSET 5 ROWS FETCH NEXT 10 ROWS ONLY

-- Oracle (traditional)
SELECT * FROM users WHERE ROWNUM <= 10
-- Oracle 12c+
SELECT * FROM users FETCH FIRST 10 ROWS ONLY

Type Casting

-- Standard SQL
CAST(x AS INTEGER)

-- PostgreSQL
x::INTEGER

-- MySQL
CAST(x AS SIGNED)  -- No INTEGER, use SIGNED/UNSIGNED

-- BigQuery
CAST(x AS INT64)
SAFE_CAST(x AS INT64)  -- Returns NULL instead of error

Function Name Differences

The same operation, different names across dialects:

| Operation         | PostgreSQL         | MySQL      | SQL Server | Snowflake           |
|-------------------|--------------------|------------|------------|---------------------|
| Current timestamp | NOW()              | NOW()      | GETDATE()  | CURRENT_TIMESTAMP() |
| String length     | LENGTH()           | LENGTH()   | LEN()      | LENGTH()            |
| If null           | COALESCE()         | IFNULL()   | ISNULL()   | NVL()               |
| Date add          | + INTERVAL '1 day' | DATE_ADD() | DATEADD()  | DATEADD()           |

How Parsers Handle Dialects

Dialect flags (sqlparser-rs): Single parser with ~50 boolean flags like supports_filter_during_aggregation, supports_group_by_expr. Simple but can’t handle major syntax differences.

Parameterized grammar (SQLGlot): Base parser with overridable methods per dialect. Each dialect class inherits and overrides specific parsing methods. Flexible but has complex inheritance.

Separate grammars (Gudusoft): One complete grammar file per database. Complete accuracy but high maintenance burden. When a database releases a new version, you update that grammar file.
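The first two strategies can be sketched in a few lines. The class and method names here are illustrative only, not the actual APIs of sqlparser-rs or SQLGlot:

```python
# Strategy 1 (dialect flags): one shared parser consults a boolean flag.
# Strategy 2 (parameterized grammar): a subclass overrides a parsing method.
class Dialect:
    supports_backtick_identifiers = False

    def identifier_quote(self):
        # shared parser code branches on the flag
        return "`" if self.supports_backtick_identifiers else '"'

class MySQL(Dialect):
    supports_backtick_identifiers = True   # flag flip; no new parser code

class SQLServer(Dialect):
    def identifier_quote(self):            # method override; new behavior
        return "["

print(Dialect().identifier_quote(),
      MySQL().identifier_quote(),
      SQLServer().identifier_quote())
```

Flags are cheap but only toggle behavior the shared parser already anticipates; overrides can introduce genuinely new syntax, at the cost of a deeper inheritance hierarchy — the exact tradeoff described above.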


Parsers vs Query Engines

This distinction trips people up. A parser and a query engine are not the same thing.

┌─────────────────────────────┐    ┌─────────────────────────────────┐
│         PARSER              │    │         QUERY ENGINE            │
│                             │    │                                 │
│  • Lexical analysis         │    │  • Query planning               │
│  • Syntactic analysis       │    │  • Query optimization           │
│  • AST construction         │    │  • Physical execution           │
│  • (Optional) Semantic      │    │  • Data access                  │
│    analysis                 │    │  • Join algorithms              │
│                             │    │  • Aggregation                  │
│  INPUT: SQL string          │    │  • Sorting                      │
│  OUTPUT: AST or errors      │    │  • Result materialization       │
│                             │    │                                 │
│  NO data access             │    │  READS/WRITES data              │
│  NO execution               │    │  EXECUTES query                 │
└─────────────────────────────┘    └─────────────────────────────────┘

Examples:                          Examples:
• SQLGlot                          • PostgreSQL
• sqlparser-rs                     • DuckDB
• JSqlParser                       • Apache Spark
                                   • Presto/Trino

The Full Query Processing Pipeline

When you run a query in a database, it goes through many stages. Parsing is just the first.

┌─────────────────┐
│    SQL String   │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│     PARSER      │  ◄── SQLGlot, sqlparser-rs stop here
└────────┬────────┘
         │ AST
         ▼
┌─────────────────┐
│    ANALYZER     │  ◄── Semantic analysis (name resolution, types)
└────────┬────────┘
         │ Analyzed AST
         ▼
┌─────────────────┐
│    PLANNER      │  ◄── Logical plan (relational algebra)
└────────┬────────┘
         │ Logical Plan
         ▼
┌─────────────────┐
│   OPTIMIZER     │  ◄── Rule-based and cost-based optimization
└────────┬────────┘
         │ Optimized Plan
         ▼
┌─────────────────┐
│   EXECUTOR      │  ◄── Physical operators, data access
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    RESULTS      │
└─────────────────┘

What Parsers Do

| Capability         | Parser does                             |
|--------------------|-----------------------------------------|
| Syntax validation  | ✅ Detect SELECT FROM (missing columns) |
| AST construction   | ✅ Build tree structure                 |
| Transpilation      | ✅ Convert MySQL → PostgreSQL           |
| Lineage extraction | ✅ Find table/column dependencies       |
| Query formatting   | ✅ Pretty-print SQL                     |

What Parsers Don’t Do

| Capability          | Parser | Engine |
|---------------------|--------|--------|
| Execute query       | ❌     | ✅     |
| Return results      | ❌     | ✅     |
| Optimize execution  | ❌     | ✅     |
| Choose join order   | ❌     | ✅     |
| Manage transactions | ❌     | ✅     |

Why This Matters

If you’re building a data catalog and need lineage, you need a parser. You don’t need a query engine.

If you’re building an IDE with autocomplete, you need a parser. You don’t need to execute anything.

If you’re building a transpiler to migrate queries from Snowflake to Databricks, you need a parser with good dialect support. You don’t need to run those queries.

The tools are different because the problems are different.


At a Glance

| Parser         | Language | License     | Dialects | One-liner                                  |
|----------------|----------|-------------|----------|--------------------------------------------|
| SQLGlot        | Python   | MIT         | 31       | Most feature-complete open source          |
| sqlparser-rs   | Rust     | Apache 2.0  | ~15      | Fast, minimal, foundation for Rust engines |
| Apache Calcite | Java     | Apache 2.0  | ~10      | Full query planning framework              |
| Gudusoft GSP   | Java/C#  | Commercial  | 25+      | Enterprise, stored procedure support       |
| JSqlParser     | Java     | Apache/LGPL | ~6       | Mature, simple Java parser                 |

Layer Support

| Library      | Lexer | Parser | AST | Semantic | Lineage |
|--------------|-------|--------|-----|----------|---------|
| SQLGlot      | ✅    | ✅     | ✅  | ✅       | ✅      |
| sqlparser-rs | ✅    | ✅     | ✅  | ❌       | ❌      |
| Calcite      | ✅    | ✅     | ✅  | ✅       | ⚠️      |
| Gudusoft GSP | ✅    | ✅     | ✅  | ✅       | ✅      |
| JSqlParser   | ✅    | ✅     | ✅  | ❌       | ❌      |

Features

| Library      | Transpile | Format | Lineage | Schema | Round-trip |
|--------------|-----------|--------|---------|--------|------------|
| SQLGlot      | ✅        | ✅     | ✅      | ✅     | ✅         |
| sqlparser-rs | ❌        | ⚠️     | ❌      | ❌     | ⚠️         |
| Calcite      | ⚠️        | ❌     | ❌      | ✅     | ⚠️         |
| Gudusoft GSP | ⚠️        | ✅     | ✅      | ✅     | ✅         |
| JSqlParser   | ❌        | ❌     | ❌      | ❌     | ⚠️         |

Quick Decision Matrix

| Your situation             | Recommended         |
|----------------------------|---------------------|
| Python, need transpilation | SQLGlot             |
| Python, need lineage       | SQLGlot             |
| Rust query engine          | sqlparser-rs        |
| Browser-based SQL tool     | sqlparser-rs (WASM) |
| Java, basic parsing        | JSqlParser          |
| Java, query planning       | Apache Calcite      |
| Enterprise, stored procs   | Gudusoft GSP        |

Further Reading

Still interested? There’s more to read. Hopefully I’ve piqued your interest. For any comments or feedback, please reach out and I’ll address them.

Parsing Algorithms

Parser Libraries