Understanding SQL Parsers
It’s rare to talk about deterministic systems when there’s so much hype around probabilistic ones. This post is boring, but stay with me.
My work at Atlan has touched SQL parsing since the beginning. I made early contributions to the query engine with policy-based authorization powering Insights, and we generate column-level lineage by parsing queries. Along the way, my colleagues and I have evaluated a lot of SQL parsers, both open source and commercial: SQLGlot, sqlparser-rs, sqloxide, Apache Calcite, Gudusoft GSP, JSqlParser, and others, each with different tradeoffs.
This post is my attempt to distill what I’ve learned. I’m not an expert, just someone curious enough to ask: why do so many SQL parsers exist? And what are they actually doing under the hood?
The idea for this writeup came during a drive back from SFO after dropping a friend off at the airport. That turned into about 10 hours of conversation with Claude to pull it all together.
Table of Contents
- What is a SQL Parser?
- The Full Pipeline
- Lexical Analysis (The Lexer)
- Syntactic Analysis (The Parser)
- The Abstract Syntax Tree
- Syntax vs Semantics
- Column-Level Lineage
- SQL Dialects
- Parsers vs Query Engines
- Comparing SQL Parsers
- Further Reading
What is a SQL Parser?
A SQL parser reads SQL text and converts it into a structured representation, usually a tree, that computers can work with. It’s the “understanding” step.
INPUT: "CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'"
↓
[SQL PARSER]
↓
OUTPUT: A tree structure representing the query's meaning
Think of it like how your brain parses a sentence to extract meaning. The parser does the same for SQL.
The Full Pipeline
Every SQL parser follows the same fundamental pipeline. This isn’t a design choice. It’s a consequence of how language processing works.
Lexer: Breaks the SQL string into tokens. Keywords, identifiers, operators, literals. Like recognizing words in a sentence.
Parser: Takes tokens and builds a tree based on grammar rules. Like understanding “subject-verb-object” structure.
AST: The Abstract Syntax Tree. A clean, navigable representation of the query’s structure.
Semantic Analysis: Adds meaning. Does this table exist? What type is this column? This is where you need schema information.
Lexical Analysis (The Lexer)
The lexer converts a stream of characters into meaningful chunks called tokens.
Input:
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
Output:
Token(CREATE)
Token(TABLE)
Token(IDENTIFIER, "active_users")
Token(AS)
Token(SELECT)
Token(IDENTIFIER, "id")
Token(COMMA)
Token(IDENTIFIER, "name")
Token(COMMA)
Token(IDENTIFIER, "email")
Token(FROM)
Token(IDENTIFIER, "users")
Token(WHERE)
Token(IDENTIFIER, "status")
Token(EQUALS)
Token(STRING, "active")
The lexer handles dialect-specific decisions early:
| Decision | Standard SQL | MySQL | SQL Server | PostgreSQL |
|---|---|---|---|---|
| Identifier quote | "name" | \`name\` | [name] | "name" |
| String quote | 'text' | 'text' or "text" | 'text' | 'text' |
| Line comment | -- | -- or # | -- | -- |
| Case sensitivity | Insensitive | Insensitive | Insensitive | Insensitive (keywords) |
Lexers are simple. They look at one character (or a few) at a time, don’t need to understand nesting or structure, and can be implemented with state machines or regexes. This is why SQLGlot’s optional Rust tokenizer gives roughly a 30% speedup: tokenization is pure CPU-bound character scanning.
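To make that concrete, here is a minimal regex-driven tokenizer for the running example. It is an illustrative sketch, not any real parser’s implementation: the token names, patterns, and keyword list are assumptions chosen to match the output shown above.

```python
import re

# One pattern per token class, tried in order (STRING first so that
# quoted text isn't split into identifiers).
TOKEN_SPEC = [
    ("STRING",     r"'[^']*'"),
    ("KEYWORD",    r"\b(?:CREATE|TABLE|AS|SELECT|FROM|WHERE)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("EQUALS",     r"="),
    ("COMMA",      r","),
    ("SKIP",       r"\s+"),     # whitespace is recognized, then discarded
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)

def tokenize(sql):
    """Return a list of (kind, text) pairs; whitespace is dropped.

    A production lexer would also raise on unexpected characters;
    this sketch silently skips anything it doesn't recognize.
    """
    tokens = []
    for match in MASTER.finditer(sql):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize("SELECT id, name FROM users WHERE status = 'active'"))
```

Note how `\b` word boundaries keep `created_at` from matching the `CREATE` keyword: the lexer’s only job is classifying character runs, which is why it stays simple and fast.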
Syntactic Analysis (The Parser)
The parser reads tokens and builds a tree structure based on grammar rules.
Input tokens:
[CREATE, TABLE, active_users, AS, SELECT, id, COMMA, name, COMMA, email, FROM, users, WHERE, status, =, 'active']
Grammar rule:
ctas_stmt := CREATE TABLE table_name AS select_stmt
select_stmt := SELECT column_list FROM table_name [WHERE condition]
Parser thinks:
- “I see CREATE TABLE… this must be a ctas_stmt”
- “Next I need a table_name… I see ‘active_users’”
- “Next I need AS… got it”
- “Now I need a select_stmt…”
- “I see SELECT… parsing the inner query”
- “Column list: id, name, email”
- “FROM users”
- “WHERE status = ‘active’”
The parser is essentially a state machine following grammar rules, consuming tokens and building tree nodes.
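The “parser thinks” steps above map directly onto a recursive-descent parser: one function per grammar rule, each consuming tokens and returning a tree node. The sketch below implements only the simplified grammar shown; the token kinds and dict-based nodes are illustrative assumptions, and real parsers of course also handle WHERE clauses, expressions, aliases, and much more.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens   # list of (kind, text) pairs
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def expect(self, kind, text=None):
        k, t = self.peek()
        if k != kind or (text is not None and t.upper() != text):
            raise SyntaxError(f"expected {text or kind}, got {t!r}")
        self.pos += 1
        return t

    # ctas_stmt := CREATE TABLE table_name AS select_stmt
    def parse_ctas(self):
        self.expect("KEYWORD", "CREATE")
        self.expect("KEYWORD", "TABLE")
        name = self.expect("IDENTIFIER")
        self.expect("KEYWORD", "AS")
        return {"type": "ctas", "table": name, "query": self.parse_select()}

    # select_stmt := SELECT column_list FROM table_name  (WHERE omitted)
    def parse_select(self):
        self.expect("KEYWORD", "SELECT")
        columns = [self.expect("IDENTIFIER")]
        while self.peek()[0] == "COMMA":
            self.pos += 1
            columns.append(self.expect("IDENTIFIER"))
        self.expect("KEYWORD", "FROM")
        return {"type": "select", "columns": columns, "from": self.expect("IDENTIFIER")}

tokens = [("KEYWORD", "CREATE"), ("KEYWORD", "TABLE"), ("IDENTIFIER", "active_users"),
          ("KEYWORD", "AS"), ("KEYWORD", "SELECT"), ("IDENTIFIER", "id"),
          ("COMMA", ","), ("IDENTIFIER", "name"), ("KEYWORD", "FROM"),
          ("IDENTIFIER", "users")]
print(Parser(tokens).parse_ctas())
```

Each function mirrors one grammar rule, which is why hand-written recursive descent is so common for SQL: the code reads like the grammar.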
Why Parsing is Harder Than Lexing
Lexing is pattern matching. Parsing is about structure.
SELECT * FROM (SELECT * FROM (SELECT * FROM t))
A lexer sees: (, (, (, ), ), )
A parser must match each ( with its corresponding ).
This is why regex can’t parse SQL: regular expressions cannot count arbitrarily nested parentheses. That’s a proven limitation, not an implementation detail. Balanced nesting requires a context-free language, while regex can only match regular languages.
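What regex cannot do, a single counter (the degenerate case of a parser’s stack) handles trivially. A small sketch, with the string-literal handling deliberately simplified:

```python
def parens_balanced(sql):
    """Check that parentheses balance, ignoring those inside string literals."""
    depth = 0
    in_string = False
    for ch in sql:
        if ch == "'":                # naive: doesn't handle escaped quotes
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:        # a ')' with no matching '('
                    return False
    return depth == 0

print(parens_balanced("SELECT * FROM (SELECT * FROM (SELECT * FROM t))"))  # True
print(parens_balanced("SELECT (a + b"))                                    # False
```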
The Abstract Syntax Tree
The AST represents the query’s structure in a clean, navigable tree.
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
CreateTableAsSelect
├── table_name: "active_users"
└── query:
└── SelectStatement
├── columns:
│ ├── Column { name: "id" }
│ ├── Column { name: "name" }
│ └── Column { name: "email" }
├── from:
│ └── Table { name: "users" }
└── where:
└── BinaryOp
├── left: Column { name: "status" }
├── op: Equals
└── right: Literal { value: "active" }
The AST is the central data structure. Everything downstream (analysis, transformation, code generation) operates on it.
What You Can Do With an AST
| Operation | Description |
|---|---|
| Traversal | Walk the tree, collect information |
| Transform | Rewrite nodes (add filters, change structure) |
| Generate | Convert back to SQL string (round-trip) |
| Transpile | Generate SQL for a different dialect |
| Lineage | Trace where data comes from |
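Traversal, the operation everything else builds on, is just a depth-first walk over node objects. The node classes below are illustrative stand-ins, not any library’s actual types (SQLGlot, for instance, has its own expression classes and a `find_all` helper):

```python
from dataclasses import dataclass

# Tiny illustrative AST node types.
@dataclass
class Column:
    name: str

@dataclass
class Table:
    name: str

@dataclass
class Select:
    columns: list
    from_: Table

def walk(node):
    """Yield every node in the tree, depth-first."""
    yield node
    for value in vars(node).values():
        children = value if isinstance(value, list) else [value]
        for child in children:
            if hasattr(child, "__dataclass_fields__"):   # is it an AST node?
                yield from walk(child)

query = Select(columns=[Column("id"), Column("name")], from_=Table("users"))
print([n.name for n in walk(query) if isinstance(n, Table)])   # ['users']
```

Collecting every `Table` node in a walk like this is exactly how table-level lineage extraction works.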
Syntax vs Semantics
This is where things get interesting.
Syntactic analysis (parsing): Is this valid SQL grammar?
Semantic analysis: Does this SQL make sense given the database schema?
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
| Syntactic (Parser answers) | Semantic (Analyzer answers) |
|---|---|
| ✓ Valid CTAS structure | Does ‘users’ table exist? |
| ✓ Valid SELECT clause | Does ‘users’ have ‘id’, ‘name’, ‘email’ columns? |
| ✓ Valid WHERE condition | Is ‘status’ a valid column? |
| ✓ Correct keyword order | Is comparing status to string valid? |
Key insight: A parser with no schema information can only do syntactic analysis. Semantic analysis requires external knowledge about the database.
What Syntactic Analysis Catches
SELECT FROM users -- Missing column list
FROM users SELECT * -- Wrong keyword order
SELECT (a + b -- Unbalanced parentheses
What Syntactic Analysis Cannot Catch
SELECT * FROM nonexistent -- Parser doesn't know if table exists
SELECT foo FROM users -- Parser doesn't know columns
WHERE name > 5 -- Parser doesn't know types
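Semantic analysis is what you get when you hand the analyzer a catalog. A sketch, where the schema dict shape and error messages are my own assumptions:

```python
# Map of table name -> {column name: type}; the "external knowledge"
# a bare parser doesn't have.
schema = {"users": {"id": "INT", "name": "TEXT", "email": "TEXT", "status": "TEXT"}}

def check_select(table, columns, schema):
    """Return a list of semantic errors for a simple single-table SELECT."""
    errors = []
    if table not in schema:
        errors.append(f"unknown table: {table}")
        return errors                      # can't check columns without the table
    for col in columns:
        if col not in schema[table]:
            errors.append(f"unknown column: {table}.{col}")
    return errors

print(check_select("users", ["id", "email"], schema))   # []
print(check_select("users", ["foo"], schema))           # ['unknown column: users.foo']
print(check_select("nonexistent", ["id"], schema))      # ['unknown table: nonexistent']
```

A fuller analyzer would also resolve aliases and check types (the `WHERE name > 5` case), but the shape is the same: every check consults the catalog.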
Column-Level Lineage
Lineage traces where data comes from and where it goes.
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
Table-level lineage (easy, no schema needed):
READ: [users]
WRITE: [active_users]
Column-level lineage (needs schema):
active_users.id ← users.id (direct)
active_users.name ← users.name (direct)
active_users.email ← users.email (direct)
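For a plain projection like this, column-level lineage is a direct mapping, and the schema’s role becomes obvious as soon as you hit `SELECT *`. A sketch with made-up function names:

```python
def projection_lineage(target_table, source_table, select_columns):
    """Map each output column to its source column(s) for a simple projection."""
    return {f"{target_table}.{col}": [f"{source_table}.{col}"]
            for col in select_columns}

def star_lineage(target_table, source_table, schema):
    # Expanding SELECT * requires the schema: a parser alone cannot
    # know which columns the source table has.
    return projection_lineage(target_table, source_table, schema[source_table])

lineage = projection_lineage("active_users", "users", ["id", "name", "email"])
for target, sources in lineage.items():
    print(f"{target} <- {sources[0]}")
```

Real lineage engines must also trace through expressions (`UPPER(name) AS n` still derives from `users.name`), joins, CTEs, and subqueries; each output column can map to many sources.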
Data Flow vs Control Flow
This is where lineage gets nuanced.
CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
Question: Does status contribute to the lineage of active_users?
| Perspective | status in lineage? | Reasoning |
|---|---|---|
| Data Flow | ❌ No | status doesn’t appear in output columns |
| Control Flow | ✅ Yes | status affects which rows are included |
Most “lineage” discussions mean data flow, but impact analysis needs both: changing the status column could break this query even though it never appears in the output.
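The distinction boils down to which clause a column reference appears in. A minimal sketch for the query above:

```python
# Classify column references by clause: SELECT-list columns flow into
# the output; WHERE/JOIN/GROUP BY columns only control which rows appear.
select_columns = {"id", "name", "email"}   # appear in the output
where_columns = {"status"}                 # filter rows, never surface

data_flow = select_columns                 # "lineage" in the usual sense
impact_analysis = select_columns | where_columns   # what a change could break

print("status" in data_flow)         # False
print("status" in impact_analysis)   # True
```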
SQL Dialects
SQL is a “standard” that nobody fully implements.
The Dialect Landscape
Every database vendor implements a subset of the SQL standard, adds proprietary extensions, and has different syntax for the same operations.
                   ┌─────────────────┐
                   │  SQL Standard   │
                   │  (SQL-92, 99,   │
                   │   2003, 2016)   │
                   └────────┬────────┘
                            │
     ┌───────────┬──────────┴────────┬─────────┬───────────┐
     ▼           ▼         ▼         ▼         ▼           ▼
┌──────────┐┌──────────┐┌────────┐┌──────┐┌──────────┐┌──────────┐
│PostgreSQL││  MySQL   ││ Oracle ││ SQL  ││Snowflake ││ BigQuery │
│          ││          ││        ││Server││          ││          │
│ +ARRAY   ││ +LIMIT   ││+ROWNUM ││ +TOP ││ +FLATTEN ││ +STRUCT  │
│ +JSONB   ││ +BACKTICK││+DUAL   ││ +[]  ││ +VARIANT ││ +UNNEST  │
│ +::cast  ││ +AUTO_INC││+PLSQL  ││+T-SQL││ +$$      ││ +SAFE_   │
└──────────┘└──────────┘└────────┘└──────┘└──────────┘└──────────┘
Each dialect: ~80% common SQL + ~20% proprietary extensions
This fragmentation is why so many SQL parsers exist. A parser built for PostgreSQL won’t understand MySQL’s backtick identifiers. A parser built for standard SQL won’t handle Snowflake’s FLATTEN function.
Identifier Quoting
| Dialect | Quote Style | Example |
|---|---|---|
| Standard SQL | "double quotes" | SELECT "Column" FROM "Table" |
| MySQL | \`backticks\` | SELECT \`Column\` FROM \`Table\` |
| SQL Server | [brackets] | SELECT [Column] FROM [Table] |
| BigQuery | \`backticks\` | SELECT \`Column\` FROM \`Table\` |
LIMIT / Pagination
-- MySQL, PostgreSQL, SQLite
SELECT * FROM users LIMIT 10 OFFSET 5
-- SQL Server
SELECT TOP 10 * FROM users
-- Or (SQL Server 2012+)
SELECT * FROM users ORDER BY id OFFSET 5 ROWS FETCH NEXT 10 ROWS ONLY
-- Oracle (traditional)
SELECT * FROM users WHERE ROWNUM <= 10
-- Oracle 12c+
SELECT * FROM users FETCH FIRST 10 ROWS ONLY
Type Casting
-- Standard SQL
CAST(x AS INTEGER)
-- PostgreSQL
x::INTEGER
-- MySQL
CAST(x AS SIGNED) -- No INTEGER, use SIGNED/UNSIGNED
-- BigQuery
CAST(x AS INT64)
SAFE_CAST(x AS INT64) -- Returns NULL instead of error
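This is the kind of difference a transpiler’s generator resolves: same AST node, different rendering per target dialect. A sketch (the dialect rules here are simplified, and the function is mine, not any library’s):

```python
def render_cast(expr, dialect):
    """Render an integer cast of `expr` for a given target dialect."""
    if dialect == "postgres":
        return f"{expr}::INTEGER"
    if dialect == "mysql":
        return f"CAST({expr} AS SIGNED)"     # MySQL has no INTEGER cast target
    if dialect == "bigquery":
        return f"CAST({expr} AS INT64)"
    return f"CAST({expr} AS INTEGER)"        # fall back to standard SQL

for d in ("standard", "postgres", "mysql", "bigquery"):
    print(d, "->", render_cast("x", d))
```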
Function Name Differences
The same operation, different names across dialects:
| Operation | PostgreSQL | MySQL | SQL Server | Snowflake |
|---|---|---|---|---|
| Current timestamp | NOW() | NOW() | GETDATE() | CURRENT_TIMESTAMP() |
| String length | LENGTH() | LENGTH() | LEN() | LENGTH() |
| If null | COALESCE() | IFNULL() | ISNULL() | NVL() |
| Date add | + INTERVAL '1 day' | DATE_ADD() | DATEADD() | DATEADD() |
How Parsers Handle Dialects
Dialect flags (sqlparser-rs): Single parser with ~50 boolean flags like supports_filter_during_aggregation, supports_group_by_expr. Simple but can’t handle major syntax differences.
Parameterized grammar (SQLGlot): Base parser with overridable methods per dialect. Each dialect class inherits and overrides specific parsing methods. Flexible but has complex inheritance.
Separate grammars (Gudusoft): One complete grammar file per database. Complete accuracy but high maintenance burden. When a database releases a new version, you update that grammar file.
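The parameterized-grammar idea is easiest to see in code. The sketch below mimics the pattern with made-up class and method names; it is not SQLGlot’s actual API, just an illustration of a base class with per-dialect overrides:

```python
class BaseDialect:
    identifier_quote = '"'               # standard SQL default

    def quote_identifier(self, name):
        q = self.identifier_quote
        return f"{q}{name}{q}"

    def limit_clause(self, n):
        return f"LIMIT {n}"

class MySQLDialect(BaseDialect):
    identifier_quote = "`"               # only the quote character changes

class SQLServerDialect(BaseDialect):
    def quote_identifier(self, name):
        return f"[{name}]"               # brackets, not a quote character

    def limit_clause(self, n):
        return f"TOP {n}"                # SQL Server uses TOP, not LIMIT

print(MySQLDialect().quote_identifier("order"))   # `order`
print(SQLServerDialect().limit_clause(10))        # TOP 10
```

Each dialect only overrides what differs, which is the flexibility-versus-inheritance-complexity tradeoff described above: shared behavior lives in one place, but tracing which method actually runs for a given dialect means walking the class hierarchy.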
Parsers vs Query Engines
This distinction trips people up. A parser and a query engine are not the same thing.
┌─────────────────────────────┐ ┌─────────────────────────────────┐
│ PARSER │ │ QUERY ENGINE │
│ │ │ │
│ • Lexical analysis │ │ • Query planning │
│ • Syntactic analysis │ │ • Query optimization │
│ • AST construction │ │ • Physical execution │
│ • (Optional) Semantic │ │ • Data access │
│ analysis │ │ • Join algorithms │
│ │ │ • Aggregation │
│ INPUT: SQL string │ │ • Sorting │
│ OUTPUT: AST or errors │ │ • Result materialization │
│ │ │ │
│ NO data access │ │ READS/WRITES data │
│ NO execution │ │ EXECUTES query │
└─────────────────────────────┘ └─────────────────────────────────┘
Parser examples: SQLGlot, sqlparser-rs, JSqlParser.
Query engine examples: PostgreSQL, DuckDB, Apache Spark, Presto/Trino.
The Full Query Processing Pipeline
When you run a query in a database, it goes through many stages. Parsing is just the first.
┌─────────────────┐
│ SQL String │
└─────────────────┘
│
▼
┌─────────────────┐
│ PARSER │ ◄── SQLGlot, sqlparser-rs stop here
└────────┬────────┘
│ AST
▼
┌─────────────────┐
│ ANALYZER │ ◄── Semantic analysis (name resolution, types)
└────────┬────────┘
│ Analyzed AST
▼
┌─────────────────┐
│ PLANNER │ ◄── Logical plan (relational algebra)
└────────┬────────┘
│ Logical Plan
▼
┌─────────────────┐
│ OPTIMIZER │ ◄── Rule-based and cost-based optimization
└────────┬────────┘
│ Optimized Plan
▼
┌─────────────────┐
│ EXECUTOR │ ◄── Physical operators, data access
└────────┬────────┘
│
▼
┌─────────────────┐
│ RESULTS │
└─────────────────┘
What Parsers Do
| Capability | Parser Does |
|---|---|
| Syntax validation | ✅ Detect SELECT FROM (missing columns) |
| AST construction | ✅ Build tree structure |
| Transpilation | ✅ Convert MySQL → PostgreSQL |
| Lineage extraction | ✅ Find table/column dependencies |
| Query formatting | ✅ Pretty-print SQL |
What Parsers Don’t Do
| Capability | Parser | Engine |
|---|---|---|
| Execute query | ❌ | ✅ |
| Return results | ❌ | ✅ |
| Optimize execution | ❌ | ✅ |
| Choose join order | ❌ | ✅ |
| Manage transactions | ❌ | ✅ |
Why This Matters
If you’re building a data catalog and need lineage, you need a parser. You don’t need a query engine.
If you’re building an IDE with autocomplete, you need a parser. You don’t need to execute anything.
If you’re building a transpiler to migrate queries from Snowflake to Databricks, you need a parser with good dialect support. You don’t need to run those queries.
The tools are different because the problems are different.
At a Glance
| Parser | Language | License | Dialects | One-liner |
|---|---|---|---|---|
| SQLGlot | Python | MIT | 31 | Most feature-complete open-source option |
| sqlparser-rs | Rust | Apache 2.0 | ~15 | Fast, minimal, foundation for Rust engines |
| Apache Calcite | Java | Apache 2.0 | ~10 | Full query planning framework |
| Gudusoft GSP | Java/C# | Commercial | 25+ | Enterprise, stored procedure support |
| JSqlParser | Java | Apache/LGPL | ~6 | Mature, simple Java parser |
Layer Support
| Parser | Lexer | Parser | AST | Semantic | Lineage |
|---|---|---|---|---|---|
| SQLGlot | ✅ | ✅ | ✅ | ✅ | ✅ |
| sqlparser-rs | ✅ | ✅ | ✅ | ❌ | ❌ |
| Calcite | ✅ | ✅ | ✅ | ✅ | ⚠️ |
| Gudusoft GSP | ✅ | ✅ | ✅ | ✅ | ✅ |
| JSqlParser | ✅ | ✅ | ✅ | ❌ | ❌ |
Features
| Parser | Transpile | Format | Lineage | Schema | Round-trip |
|---|---|---|---|---|---|
| SQLGlot | ✅ | ✅ | ✅ | ✅ | ✅ |
| sqlparser-rs | ❌ | ⚠️ | ❌ | ❌ | ⚠️ |
| Calcite | ⚠️ | ❌ | ⚠️ | ✅ | ✅ |
| Gudusoft GSP | ⚠️ | ✅ | ✅ | ✅ | ✅ |
| JSqlParser | ❌ | ⚠️ | ❌ | ❌ | ✅ |
Quick Decision Matrix
| Your Situation | Recommended |
|---|---|
| Python, need transpilation | SQLGlot |
| Python, need lineage | SQLGlot |
| Rust query engine | sqlparser-rs |
| Browser-based SQL tool | sqlparser-rs (WASM) |
| Java, basic parsing | JSqlParser |
| Java, query planning | Apache Calcite |
| Enterprise, stored procs | Gudusoft GSP |
Further Reading
Still interested? There’s more to read below; hopefully I’ve piqued your interest. For any comments or feedback, please reach out and I’ll address them.
Parsing Algorithms
- Recursive Descent – Top-down, hand-written parsers. Most SQL parsers use this.
- Pratt Parsing – Elegant handling of operator precedence.
- LR Parsing – Bottom-up, table-driven. Used by parser generators like Bison.
- Parser Combinators – Functional composition of parsers.
Parser Libraries
- SQLGlot – Python, 31 dialects, transpilation, lineage
- sqlparser-rs – Rust, fast, WASM support
- Apache Calcite – Java, full query planning
- JSqlParser – Java, mature, simple
- Gudusoft GSP – Commercial, enterprise features