clex + cparse — A Complete Parsing Toolkit for C

The Parsing Pipeline

Source Text (your input string) ⟶ clex (regex-based tokenizer) ⟶ cparse (LR(1) / LALR(1) parser) ⟶ Parse Tree (structured output)

Why clex + cparse?

Everything you need to build lexers and parsers in plain C.

⚙

No Code Generation

Unlike traditional tools such as lex and flex, clex needs no separate code generation step: you register patterns at runtime and start tokenizing immediately.

➡

NFA-Based Regex Engine

Thompson NFA construction provides reliable matching with support for grouping, alternation, character classes, ranges, and quantifiers.
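To make the idea of NFA-based matching concrete, here is a minimal sketch. It is not clex's implementation: it hand-builds the NFA for the pattern a(b|c)* and simulates it by tracking the set of active states as a bitmask. The names `DemoEdge`, `demo_closure`, and `demo_nfa_match`, and the state numbering, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One NFA edge; ch == 0 marks an epsilon (empty) transition. */
typedef struct { int from; char ch; int to; } DemoEdge;

/* Hand-built NFA for a(b|c)*: state 0 is the start, state 4 accepts. */
static const DemoEdge demo_edges[] = {
    {0, 'a', 1},
    {1, 0, 2}, {1, 0, 4},      /* enter the loop body or skip it  */
    {2, 'b', 3}, {2, 'c', 3},  /* alternation b|c                 */
    {3, 0, 1},                 /* loop back for the '*' quantifier */
};
#define DEMO_EDGE_COUNT (sizeof demo_edges / sizeof demo_edges[0])
#define DEMO_ACCEPT_STATE 4

/* Extend a state set (bitmask) with everything reachable via epsilon edges. */
static unsigned demo_closure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (size_t i = 0; i < DEMO_EDGE_COUNT; i++)
            if (demo_edges[i].ch == 0 && (set & (1u << demo_edges[i].from)))
                set |= 1u << demo_edges[i].to;
    } while (set != prev);
    return set;
}

/* Return 1 if the whole input matches a(b|c)*, 0 otherwise. */
int demo_nfa_match(const char *input) {
    assert(input != NULL);
    unsigned active = demo_closure(1u << 0);
    for (; *input != '\0'; input++) {
        unsigned next = 0;
        for (size_t i = 0; i < DEMO_EDGE_COUNT; i++)
            if (demo_edges[i].ch == *input && (active & (1u << demo_edges[i].from)))
                next |= 1u << demo_edges[i].to;
        active = demo_closure(next);
        if (active == 0) return 0;  /* no live states left */
    }
    return (active & (1u << DEMO_ACCEPT_STATE)) != 0;
}
```

Because every active state advances in lockstep, the running time of this approach is bounded by input length times the number of edges, with no backtracking.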

✓

LR(1) & LALR(1) Parsing

Full LR(1) parser construction with optional LALR(1) state merging. Handles left-recursive grammars and complex language constructs.
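As an illustration of what "left-recursive" means here, the sketch below checks whether a single production line has an alternative that starts with its own left-hand side, as in `Expr -> Expr PLUS Term | Term`. It assumes the `Lhs -> alt | alt` line shape and whitespace-separated symbols; `demo_is_left_recursive` is a hypothetical helper, not part of cparse.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if a single production line "Lhs -> alt | alt" has an
   alternative that begins with Lhs itself (direct left recursion). */
int demo_is_left_recursive(const char *line) {
    assert(line != NULL);
    const char *lhs = line + strspn(line, " \t");
    size_t lhs_len = strcspn(lhs, " \t-");   /* name ends at a space or at "->" */
    const char *arrow = strstr(lhs, "->");
    if (arrow == NULL || lhs_len == 0) return 0;

    const char *alt = arrow + 2;
    while (alt != NULL) {
        alt += strspn(alt, " \t");            /* first symbol of this alternative */
        size_t len = strcspn(alt, " \t|");
        if (len == lhs_len && strncmp(alt, lhs, len) == 0) return 1;
        alt = strchr(alt, '|');               /* jump to the next alternative */
        if (alt != NULL) alt++;
    }
    return 0;
}
```

A production like this sends a naive recursive-descent parser into infinite recursion, while an LR parser handles it through ordinary shift and reduce steps.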

⚡

Predictable Performance

clex stores NFA transitions dynamically, so there is no fixed per-node transition cap. cparse indexes its action/goto tables by symbol ID, which removes fixed slot limits and speeds up parse-time lookups.
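The general idea behind a symbol-ID indexed table can be sketched as follows. This is an illustration with hypothetical names (`DemoAction`, `demo_action`, `demo_lookup`) and a toy grammar `S -> NUMBER`, not cparse's internal layout: each lookup is a direct array index by state and symbol ID, with no string comparison or linear search.

```c
#include <assert.h>

/* DEMO_ERROR is 0 so that zero-initialized cells mean "syntax error". */
enum { DEMO_ERROR = 0, DEMO_SHIFT, DEMO_REDUCE, DEMO_ACCEPT };

typedef struct { int kind; int target; } DemoAction;

/* Symbol IDs: 0 = NUMBER, 1 = end of input. */
enum { DEMO_NUM_STATES = 3, DEMO_NUM_SYMBOLS = 2 };

/* Action table for the toy grammar S -> NUMBER. */
static const DemoAction demo_action[DEMO_NUM_STATES][DEMO_NUM_SYMBOLS] = {
    /* state 0 */ {{DEMO_SHIFT, 1}, {DEMO_ERROR, 0}},
    /* state 1 */ {{DEMO_ERROR, 0}, {DEMO_REDUCE, 0}},  /* reduce by production 0 */
    /* state 2 */ {{DEMO_ERROR, 0}, {DEMO_ACCEPT, 0}},  /* reached via goto on S  */
};

/* Constant-time lookup: one array index per (state, symbol) pair. */
DemoAction demo_lookup(int state, int symbol_id) {
    assert(state >= 0 && state < DEMO_NUM_STATES);
    assert(symbol_id >= 0 && symbol_id < DEMO_NUM_SYMBOLS);
    return demo_action[state][symbol_id];
}
```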

🛡

Safe Failure Modes

Lexer and parser operations return typed status codes. Structured errors include exact position, expected token(s), offending lexeme, and detailed LR conflict diagnostics during parser generation.
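The status-code-plus-structured-error pattern can be sketched in a few lines. The types and function below (`DemoStatus`, `DemoError`, `demo_check_digits`) are hypothetical and much simpler than the library's own `clexStatus` and `clexError`; they only show the shape of the pattern: the return value says whether the call succeeded, and a separate struct carries the details.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { DEMO_STATUS_OK = 0, DEMO_STATUS_UNEXPECTED_CHAR } DemoStatus;

typedef struct {
    size_t      offset;     /* byte offset of the offending character */
    char        offending;  /* the character that was rejected        */
    const char *expected;   /* human-readable description of what fits */
} DemoError;

/* Accept only ASCII digits; on failure, fill *err and return a typed status. */
DemoStatus demo_check_digits(const char *input, DemoError *err) {
    assert(input != NULL && err != NULL);
    for (size_t i = 0; input[i] != '\0'; i++) {
        if (!isdigit((unsigned char)input[i])) {
            err->offset = i;
            err->offending = input[i];
            err->expected = "digit";
            return DEMO_STATUS_UNEXPECTED_CHAR;
        }
    }
    return DEMO_STATUS_OK;
}
```

Callers branch on the status first and read the error struct only when the status is not OK, which is the same order the tokenizer example on this page follows.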

🌳

Parse Tree Output

Get structured parse trees with symbol values, matched tokens for terminals, and child node vectors for easy traversal.

clex

A tiny, battle-tested lexer generator for C. Feed it regular expressions, get tokens back.

C11 MIT

Highlights

✓ Compact runtime API — no code generation required
✓ Regex: grouping, alternation, character classes, ranges, * + ?
✓ Dynamic NFA transition storage — no fixed per-node transition cap
✓ Whitespace between tokens is skipped automatically
✓ Up to 1024 token rules (configurable via CLEX_MAX_RULES)
✓ NFA visualization via Graphviz output
✓ Typed status codes + structured lexer errors (clexError)
✓ Token source spans include byte offset + line/column
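A source position that combines byte offset with line and column can be derived as in the sketch below. The `DemoPosition` struct and `demo_position_at` function are hypothetical, and the 1-based line and column numbering is an assumption rather than a statement about clex's convention.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t offset, line, column; } DemoPosition;

/* Walk the text up to `offset`, counting newlines to derive line and column.
   Lines and columns are 1-based here (an assumption for this sketch). */
DemoPosition demo_position_at(const char *text, size_t offset) {
    assert(text != NULL);
    DemoPosition pos = {0, 1, 1};
    for (size_t i = 0; i < offset && text[i] != '\0'; i++) {
        if (text[i] == '\n') {
            pos.line++;
            pos.column = 1;
        } else {
            pos.column++;
        }
        pos.offset = i + 1;
    }
    return pos;
}
```

A lexer would normally update these counters incrementally as it consumes input instead of rescanning from the start for every token.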

Function	Description
clexInit()	Allocate and return a new lexer
clexRegisterKind()	Register a regex pattern for a token kind (returns `clexStatus`)
clexReset()	Point lexer at a new input string
clex()	Lex the next token into an out-parameter (returns `clexStatus`)
clexGetLastError()	Retrieve structured lexer error details
clexDeleteKinds()	Clear all rules for reuse
clexLexerDestroy()	Free all lexer resources

Supported Regex Syntax

(ab) Grouping

a|b Alternation

[a-z] Character classes

[A-Z] Ranges

a* Zero or more

a+ One or more

a? Optional

\( Escape sequences
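Since patterns are passed as C string literals, a regex backslash has to be escaped once more in source code: the regex `\(` is written `"\\("`, as in the examples on this page. The short sketch below (with the hypothetical names `demo_open_paren` and `demo_pattern_length`) shows that the string handed to the regex engine is only two characters long.

```c
#include <assert.h>
#include <string.h>

/* The regex \( (a literal open parenthesis) as it appears in C source. */
static const char demo_open_paren[] = "\\(";

/* Number of characters the regex engine actually receives. */
size_t demo_pattern_length(const char *pattern) {
    assert(pattern != NULL);
    return strlen(pattern);
}
```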

Example — Tokenizing C Code

tokenizer.c

#include "clex.h"
#include <stdio.h>
#include <stdlib.h>

typedef enum TokenKind { INT, OPAREN, CPAREN, IDENTIFIER, CONSTANT, SEMICOL } TokenKind;

int main(void) {
    clexLexer *lexer = clexInit();

    /* Each call pairs a regex with a token kind; "int" is registered before the identifier rule. */
    clexRegisterKind(lexer, "int",                                   INT);
    clexRegisterKind(lexer, "\\(",                                   OPAREN);
    clexRegisterKind(lexer, "\\)",                                   CPAREN);
    clexRegisterKind(lexer, "[1-9][0-9]*",                           CONSTANT);
    clexRegisterKind(lexer, ";",                                     SEMICOL);
    clexRegisterKind(lexer, "[a-zA-Z_]([a-zA-Z_]|[0-9])*",           IDENTIFIER);

    clexReset(lexer, "int main()");

    clexToken tok;
    clexTokenInit(&tok);
    while (1) {
        clexStatus st = clex(lexer, &tok);
        if (st == CLEX_STATUS_EOF) break;
        if (st != CLEX_STATUS_OK) {
            const clexError *err = clexGetLastError(lexer);
            fprintf(stderr, "lexical error at %zu:%zu near '%s'\n",
                    err->position.line, err->position.column,
                    err->offending_lexeme ? err->offending_lexeme : "");
            break;
        }
        printf("kind=%d lexeme='%s' @ %zu:%zu\n", tok.kind, tok.lexeme,
               tok.span.start.line, tok.span.start.column);
    }
    clexTokenClear(&tok);

    clexLexerDestroy(lexer);
}

cparse

An LR(1) and LALR(1) parser generator for C. Define grammars in plain text, get parse trees out.

C11 MIT

Highlights

✓ LR(1) construction with optional LALR(1) state merging
✓ Intuitive textual grammar format
✓ Automatic First/Follow set computation
✓ Structured LR conflict diagnostics with state/item/action details
✓ Typed parse statuses + structured parser errors
✓ Parse tree production with source spans on each node
✓ Symbol-ID indexed action/goto tables + dynamic internals (no fixed-size limits)
✓ Ships with libcparse.a static library build
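For readers unfamiliar with First sets, the sketch below computes them for the arithmetic grammar used on this page by iterating to a fixed point. It is a textbook-style illustration, not cparse's code: the grammar is hard-coded, sets are bitmasks, and all names (`DemoProd`, `demo_prods`, `demo_first_sets`, `DEMO_EPS`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { NUMBER, PLUS, STAR, LPAREN, RPAREN, NUM_TERMINALS,
       EXPR = NUM_TERMINALS, EXPR_TAIL, TERM, TERM_TAIL, FACTOR, NUM_SYMBOLS };

#define DEMO_EPS (1u << NUM_TERMINALS)  /* bit meaning "can derive the empty string" */

typedef struct { int lhs; int rhs[3]; int len; } DemoProd;

static const DemoProd demo_prods[] = {
    {EXPR,      {TERM, EXPR_TAIL},         2},
    {EXPR_TAIL, {PLUS, TERM, EXPR_TAIL},   3},
    {EXPR_TAIL, {0},                       0},  /* epsilon */
    {TERM,      {FACTOR, TERM_TAIL},       2},
    {TERM_TAIL, {STAR, FACTOR, TERM_TAIL}, 3},
    {TERM_TAIL, {0},                       0},  /* epsilon */
    {FACTOR,    {NUMBER},                  1},
    {FACTOR,    {LPAREN, EXPR, RPAREN},    3},
};

/* Fill first[] (one bitmask per symbol) by repeating until nothing changes. */
void demo_first_sets(unsigned first[NUM_SYMBOLS]) {
    assert(first != NULL);
    for (int s = 0; s < NUM_SYMBOLS; s++)
        first[s] = s < NUM_TERMINALS ? 1u << s : 0;  /* FIRST(t) = {t} */

    int changed = 1;
    while (changed) {
        changed = 0;
        for (size_t p = 0; p < sizeof demo_prods / sizeof demo_prods[0]; p++) {
            const DemoProd *pr = &demo_prods[p];
            unsigned add = 0;
            int i = 0;
            for (; i < pr->len; i++) {
                add |= first[pr->rhs[i]] & ~DEMO_EPS;
                if (!(first[pr->rhs[i]] & DEMO_EPS)) break;  /* not nullable: stop */
            }
            if (i == pr->len) add |= DEMO_EPS;  /* empty or fully nullable RHS */
            if ((first[pr->lhs] | add) != first[pr->lhs]) {
                first[pr->lhs] |= add;
                changed = 1;
            }
        }
    }
}
```

For this grammar the result is that Expr, Term, and Factor can each start with NUMBER or LPAREN, while ExprTail and TermTail start with PLUS and STAR respectively or derive the empty string.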

Function	Description
cparseGrammar()	Parse a grammar string into internal representation
cparseCreateLR1Parser()	Build an LR(1) parser from a grammar and token-name map (array + count)
cparseCreateLALR1Parser()	Build an LALR(1) parser (merged states) from a token-name map (array + count)
cparseAccept()	Validate input (returns `cparseStatus`)
cparse()	Parse input into an out-parameter parse tree (returns `cparseStatus`)
cparseGetLastError()	Retrieve parser error details: position, expected terminals, offending lexeme
cparseFreeParseTree()	Recursively release a parse tree
cparseFreeParser()	Release parser state
cparseFreeGrammar()	Release grammar data structures

Grammar Format

Syntax Rules

Each line defines a production: NonTerminal -> symbol1 symbol2 | alt
Tokens are whitespace-separated
Use epsilon for empty productions
Lines starting with # are comments
The first nonterminal becomes the start symbol
Use | to specify alternative productions
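The rules above can be exercised with a small line classifier. This is a sketch under the stated rules only (comment lines start with #, a production contains `->`, alternatives are separated by `|`); `demo_count_alternatives` is a hypothetical helper and says nothing about how cparseGrammar() is implemented.

```c
#include <assert.h>
#include <string.h>

/* Count the alternatives in "Lhs -> a b | c".
   Returns 0 for comments, blank lines, and lines without "->". */
int demo_count_alternatives(const char *line) {
    assert(line != NULL);
    line += strspn(line, " \t");              /* ignore leading whitespace */
    if (*line == '#' || *line == '\0') return 0;

    const char *rhs = strstr(line, "->");
    if (rhs == NULL) return 0;

    int count = 1;                            /* at least one alternative */
    for (rhs += 2; *rhs != '\0'; rhs++)
        if (*rhs == '|') count++;
    return count;
}
```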

grammar.txt

# Arithmetic expression grammar
Expr     -> Term ExprTail
ExprTail -> PLUS Term ExprTail | epsilon
Term     -> Factor TermTail
TermTail -> STAR Factor TermTail | epsilon
Factor   -> NUMBER | LPAREN Expr RPAREN

Parse Tree Structure

parse_tree_node.h

typedef struct ParseTreeNode {
    char     *value;      /* grammar symbol       */
    clexToken token;      /* matched token (term) */
    clexSourceSpan span;  /* node source range    */
    PtrVec    children;   /* ParseTreeNode*       */
} ParseTreeNode;
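Traversal is a plain depth-first walk over the children. The sketch below uses a simplified stand-in struct, `DemoNode`, with a raw pointer array in place of `PtrVec` (whose layout is not shown here), so the field names `children` and `child_count` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for ParseTreeNode: a symbol name plus child pointers. */
typedef struct DemoNode {
    const char       *value;
    struct DemoNode **children;
    size_t            child_count;
} DemoNode;

/* Count leaf nodes (terminals and epsilon markers) with a depth-first walk. */
size_t demo_count_leaves(const DemoNode *node) {
    if (node == NULL) return 0;
    assert(node->child_count == 0 || node->children != NULL);
    if (node->child_count == 0) return 1;

    size_t total = 0;
    for (size_t i = 0; i < node->child_count; i++)
        total += demo_count_leaves(node->children[i]);
    return total;
}
```

The same recursion shape works for printing, evaluation, or building an AST: handle the node, then recurse into each child in order.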

tree output for "8 + 5 * 2"

Expr
 ├─ Term
 │   ├─ Factor
 │   │   └─ NUMBER "8"
 │   └─ TermTail
 │       └─ epsilon
 └─ ExprTail
     ├─ PLUS "+"
     ├─ Term
     │   ├─ Factor
     │   │   └─ NUMBER "5"
     │   └─ TermTail
     │       ├─ STAR "*"
     │       ├─ Factor
     │       │   └─ NUMBER "2"
     │       └─ TermTail
     │           └─ epsilon
     └─ ExprTail
         └─ epsilon

Complete Pipeline Example

clex and cparse working together to parse arithmetic expressions.

expr_parser.c

#include "cparse.h"
#include "clex/clex.h"
#include <stdio.h>

int main(void) {
    /* ── Step 1: Set up the lexer ─────────────────────────────── */
    clexLexer *lexer = clexInit();
    clexRegisterKind(lexer, "[0-9]+",  0);   /* NUMBER */
    clexRegisterKind(lexer, "\\+",     1);   /* PLUS   */
    clexRegisterKind(lexer, "\\*",     2);   /* STAR   */
    clexRegisterKind(lexer, "\\(",     3);   /* LPAREN */
    clexRegisterKind(lexer, "\\)",     4);   /* RPAREN */

    /* ── Step 2: Define the grammar ──────────────────────────── */
    const char *grammar_src =
        "Expr     -> Term ExprTail\n"
        "ExprTail -> PLUS Term ExprTail | epsilon\n"
        "Term     -> Factor TermTail\n"
        "TermTail -> STAR Factor TermTail | epsilon\n"
        "Factor   -> NUMBER | LPAREN Expr RPAREN";

    Grammar *grammar = cparseGrammar(grammar_src);

    /* ── Step 3: Build the parser ───────────────────────────── */
    const char *names[] = {"NUMBER", "PLUS", "STAR", "LPAREN", "RPAREN"};
    LALR1Parser *parser = cparseCreateLALR1Parser(
        grammar, lexer, names, sizeof(names) / sizeof(names[0]));

    /* ── Step 4: Parse input ───────────────────────────────── */
    const char *input = "8 + 5 * 2";

    if (cparseAccept(parser, input) == CPARSE_STATUS_OK) {
        ParseTreeNode *tree = NULL;
        if (cparse(parser, input, &tree) == CPARSE_STATUS_OK) {
            /* ... traverse or inspect the parse tree ... */
        }
        cparseFreeParseTree(tree);
    } else {
        const cparseError *err = cparseGetLastError(parser);
        /* err->position, err->expected_tokens, err->offending_lexeme */
    }

    /* ── Cleanup ────────────────────────────────────────────── */
    cparseFreeParser(parser);
    cparseFreeGrammar(grammar);
    clexLexerDestroy(lexer);
}

Get Started

Up and running in under a minute.

1. Clone the repository

cparse bundles clex as a git submodule.

git clone https://github.com/h2337/cparse.git

2. Initialize submodules

Pull in the clex lexer dependency.

cd cparse && git submodule update --init --recursive

3. Build and test

Builds libcparse.a and runs the test suite.

make test

4. Try the examples

Build and run the expression parser demo.

make examples && ./examples/expr_parser "8 + 5 * 2"

Using clex Standalone

If you only need the lexer, you can use clex on its own.

terminal

# Clone clex standalone
git clone https://github.com/h2337/clex.git
cd clex

# Run the test suite
make test-all

# Build for library use
make lib

# Or compile directly
gcc your_app.c fa.c clex.c -o your_app

Linking cparse into Your Project

terminal

# After building, link the static library
gcc your_parser.c -L. -lcparse clex/clex.o clex/fa.o -o your_parser

# Or embed the sources directly
gcc your_parser.c grammar.c lr1_lalr1.c util.c \
    clex/clex.c clex/fa.c -o your_parser

A complete parsing toolkit for C. From source text to parse trees in minutes.
