Portable C11 · Zero dependencies · MIT License
clex + cparse

A complete parsing toolkit for C.
From source text to parse trees in minutes.

Lexer generator · LR(1) & LALR(1) parser generator · NFA engine

6 clex API functions · 8 cparse API functions · 0 external dependencies · C11 portable standard

The Parsing Pipeline
Source Text (your input string) → clex (regex-based tokenizer) → cparse (LR(1) / LALR(1)) → Parse Tree (structured output)

Why clex + cparse?
Everything you need to build lexers and parsers in plain C.

No Code Generation

Unlike traditional lexer generators such as lex and flex, clex needs no separate code generation step. Define patterns at runtime and start tokenizing immediately.

NFA-Based Regex Engine

Thompson NFA construction provides reliable matching with support for grouping, alternation, character classes, ranges, and quantifiers.

LR(1) & LALR(1) Parsing

Full LR(1) parser construction with optional LALR(1) state merging. Handles left-recursive grammars and complex language constructs.

Predictable Performance

clex stores NFA transitions dynamically, so there is no fixed per-node slot limit. cparse indexes its action/goto tables by symbol ID, which keeps parse-time lookups fast.

Safe Failure Modes

Lexer and parser operations return typed status codes. Structured errors include exact position, expected token(s), offending lexeme, and detailed LR conflict diagnostics during parser generation.

Parse Tree Output

Get structured parse trees with symbol values, matched tokens for terminals, and child node vectors for easy traversal.


clex: a tiny, battle-tested lexer generator for C. Feed it regular expressions, get tokens back.

C11 · MIT · Header + Source

Highlights

  • Compact runtime API — no code generation required
  • Regex: grouping, alternation, character classes, ranges, * + ?
  • Dynamic NFA transition storage — no fixed per-node transition cap
  • Whitespace between tokens is skipped automatically
  • Up to 1024 token rules (configurable via CLEX_MAX_RULES)
  • NFA visualization via Graphviz output
  • Typed status codes + structured lexer errors (clexError)
  • Token source spans include byte offset + line/column

Function             Description
clexInit()           Allocate and return a new lexer
clexRegisterKind()   Register a regex pattern for a token kind (returns clexStatus)
clexReset()          Point lexer at a new input string
clex()               Lex the next token into an out-parameter (returns clexStatus)
clexGetLastError()   Retrieve structured lexer error details
clexDeleteKinds()    Clear all rules for reuse
clexLexerDestroy()   Free all lexer resources
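
To make the "byte offset + line/column" spans and the automatic whitespace skipping more concrete, here is a minimal standalone sketch of that bookkeeping. It does not use clex itself: DemoPos, demo_skip_ws and demo_scan_word are hypothetical names, and 1-based lines and columns are an assumption, not something this page states about clex.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for a source position (not the clex type). */
typedef struct {
    size_t offset;  /* byte offset from the start of the input */
    size_t line;    /* assumed 1-based */
    size_t column;  /* assumed 1-based */
} DemoPos;

/* Consume one byte and update offset, line and column. */
static void demo_advance(DemoPos *pos, char c) {
    pos->offset++;
    if (c == '\n') {
        pos->line++;
        pos->column = 1;
    } else {
        pos->column++;
    }
}

/* Skip whitespace; the result is where the next token would start. */
DemoPos demo_skip_ws(const char *src, DemoPos pos) {
    assert(src != NULL);
    while (src[pos.offset] != '\0' && isspace((unsigned char)src[pos.offset]))
        demo_advance(&pos, src[pos.offset]);
    return pos;
}

/* Consume a run of non-whitespace bytes; the result is one past its end. */
DemoPos demo_scan_word(const char *src, DemoPos pos) {
    assert(src != NULL);
    while (src[pos.offset] != '\0' && !isspace((unsigned char)src[pos.offset]))
        demo_advance(&pos, src[pos.offset]);
    return pos;
}
```

Starting from offset 0, line 1, column 1, alternating demo_skip_ws and demo_scan_word yields a start and end position for each whitespace-separated word, which is the same kind of span a token carries.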

Supported Regex Syntax

(ab) Grouping
a|b Alternation
[a-z] Character classes
[A-Z] Ranges
a* Zero or more
a+ One or more
a? Optional
\( Escape sequences
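
As a rough illustration of how grouping, alternation and `*` map onto a Thompson-style NFA, the sketch below hand-builds the NFA for the pattern a(b|c)* and simulates it by tracking a set of active states. It is a standalone teaching example, not clex's internal representation: DemoState, demo_nfa and demo_nfa_match are hypothetical names, and the bitmask limits it to 32 states.

```c
#include <assert.h>
#include <stdbool.h>

enum { NFA_CHAR, NFA_SPLIT, NFA_MATCH };

typedef struct {
    int  kind;  /* NFA_CHAR, NFA_SPLIT or NFA_MATCH */
    char ch;    /* label, used by NFA_CHAR only */
    int  out1;  /* next state (CHAR) or first epsilon edge (SPLIT) */
    int  out2;  /* second epsilon edge (SPLIT only) */
} DemoState;

/* Hand-built NFA for a(b|c)* */
static const DemoState demo_nfa[] = {
    { NFA_CHAR,  'a', 1, -1 },  /* 0: consume 'a' */
    { NFA_SPLIT, 0,   2,  5 },  /* 1: loop entry: repeat the group or finish */
    { NFA_SPLIT, 0,   3,  4 },  /* 2: alternation: b or c */
    { NFA_CHAR,  'b', 1, -1 },  /* 3 */
    { NFA_CHAR,  'c', 1, -1 },  /* 4 */
    { NFA_MATCH, 0,  -1, -1 },  /* 5 */
};

/* Add state s and everything reachable from it via epsilon edges. */
static void demo_add_state(const DemoState *nfa, int s, unsigned *set) {
    assert(s >= 0 && s < 32);
    if (*set & (1u << s)) return;  /* already present; also stops cycles */
    *set |= 1u << s;
    if (nfa[s].kind == NFA_SPLIT) {
        demo_add_state(nfa, nfa[s].out1, set);
        demo_add_state(nfa, nfa[s].out2, set);
    }
}

/* Return true if the whole of text is matched by the NFA. */
bool demo_nfa_match(const DemoState *nfa, int nstates, const char *text) {
    unsigned current = 0;
    demo_add_state(nfa, 0, &current);
    for (const char *p = text; *p != '\0'; p++) {
        unsigned next = 0;
        for (int s = 0; s < nstates; s++) {
            if ((current & (1u << s)) && nfa[s].kind == NFA_CHAR && nfa[s].ch == *p)
                demo_add_state(nfa, nfa[s].out1, &next);
        }
        current = next;
    }
    for (int s = 0; s < nstates; s++) {
        if ((current & (1u << s)) && nfa[s].kind == NFA_MATCH) return true;
    }
    return false;
}
```

Because every active state is advanced in lockstep, the running time grows with input length times state count, with no backtracking.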

Example — Tokenizing C Code

tokenizer.c
#include "clex.h"
#include <stdio.h>
#include <stdlib.h>

typedef enum TokenKind { INT, OPAREN, CPAREN, IDENTIFIER, CONSTANT, SEMICOL } TokenKind;

int main(void) {
    clexLexer *lexer = clexInit();

    clexRegisterKind(lexer, "int",                                   INT);
    clexRegisterKind(lexer, "\\(",                                   OPAREN);
    clexRegisterKind(lexer, "\\)",                                   CPAREN);
    clexRegisterKind(lexer, "[1-9][0-9]*",                           CONSTANT);
    clexRegisterKind(lexer, ";",                                     SEMICOL);
    clexRegisterKind(lexer, "[a-zA-Z_]([a-zA-Z_]|[0-9])*",           IDENTIFIER);

    clexReset(lexer, "int main()");

    clexToken tok;
    clexTokenInit(&tok);
    while (1) {
        clexStatus st = clex(lexer, &tok);
        if (st == CLEX_STATUS_EOF) break;
        if (st != CLEX_STATUS_OK) {
            const clexError *err = clexGetLastError(lexer);
            fprintf(stderr, "lexical error at %zu:%zu near '%s'\n",
                    err->position.line, err->position.column,
                    err->offending_lexeme ? err->offending_lexeme : "");
            break;
        }
        printf("kind=%d lexeme='%s' @ %zu:%zu\n", tok.kind, tok.lexeme,
               tok.span.start.line, tok.span.start.column);
    }
    clexTokenClear(&tok);

    clexLexerDestroy(lexer);
}

cparse: an LR(1) and LALR(1) parser generator for C. Define grammars in plain text, get parse trees out.

C11 · MIT · Static Library

Highlights

  • LR(1) construction with optional LALR(1) state merging
  • Intuitive textual grammar format
  • Automatic First/Follow set computation
  • Structured LR conflict diagnostics with state/item/action details
  • Typed parse statuses + structured parser errors
  • Parse tree production with source spans on each node
  • Symbol-ID indexed action/goto tables + dynamic internals (no fixed-size limits)
  • Ships with libcparse.a static library build

Function                    Description
cparseGrammar()             Parse a grammar string into internal representation
cparseCreateLR1Parser()     Build an LR(1) parser from a grammar and token-name map (array + count)
cparseCreateLALR1Parser()   Build an LALR(1) parser (merged states) from a token-name map (array + count)
cparseAccept()              Validate input (returns cparseStatus)
cparse()                    Parse input into an out-parameter parse tree (returns cparseStatus)
cparseGetLastError()        Retrieve parser error details: position, expected terminals, offending lexeme
cparseFreeParseTree()       Recursively release a parse tree
cparseFreeParser()          Release parser state
cparseFreeGrammar()         Release grammar data structures
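
For readers unfamiliar with table-driven LR parsing, the sketch below shows the general idea behind "symbol-ID indexed action/goto tables": terminal and nonterminal IDs are used directly as column indexes, so each parse step is a plain array lookup. The tables are hand-built for a tiny two-rule grammar, and every name here (DemoAction, demo_action, demo_lr_accepts, ...) is hypothetical; cparse generates its own tables and its internal layout may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Terminal and nonterminal IDs double as column indexes into the tables. */
enum { DEMO_NUM, DEMO_PLUS, DEMO_END, DEMO_N_TERMS };
enum { DEMO_E, DEMO_N_NONTERMS };
enum { DEMO_ERR = 0, DEMO_SHIFT, DEMO_REDUCE, DEMO_ACCEPT };

typedef struct { int kind; int arg; } DemoAction;

/* Hand-built tables for:  (1) E -> E PLUS NUM   (2) E -> NUM
 * Entries left out are zero-initialized, i.e. DEMO_ERR. */
static const DemoAction demo_action[5][DEMO_N_TERMS] = {
    [0] = { [DEMO_NUM]  = { DEMO_SHIFT, 1 } },
    [1] = { [DEMO_PLUS] = { DEMO_REDUCE, 2 }, [DEMO_END] = { DEMO_REDUCE, 2 } },
    [2] = { [DEMO_PLUS] = { DEMO_SHIFT, 3 },  [DEMO_END] = { DEMO_ACCEPT, 0 } },
    [3] = { [DEMO_NUM]  = { DEMO_SHIFT, 4 } },
    [4] = { [DEMO_PLUS] = { DEMO_REDUCE, 1 }, [DEMO_END] = { DEMO_REDUCE, 1 } },
};
static const int demo_goto[5][DEMO_N_NONTERMS] = { {2}, {-1}, {-1}, {-1}, {-1} };
static const int demo_rhs_len[] = { 0, 3, 1 };           /* indexed by rule number */
static const int demo_lhs[]     = { 0, DEMO_E, DEMO_E }; /* indexed by rule number */

/* tokens must be terminated by DEMO_END. Returns true if the input is accepted. */
bool demo_lr_accepts(const int *tokens) {
    assert(tokens != NULL);
    int stack[64];
    int top = 0;
    size_t i = 0;
    stack[0] = 0;
    for (;;) {
        int t = tokens[i];
        if (t < 0 || t >= DEMO_N_TERMS) return false;
        DemoAction a = demo_action[stack[top]][t];
        switch (a.kind) {
        case DEMO_SHIFT:
            if (top + 1 >= 64) return false;
            stack[++top] = a.arg;   /* push the target state, consume the token */
            i++;
            break;
        case DEMO_REDUCE: {
            top -= demo_rhs_len[a.arg];  /* pop one state per right-hand-side symbol */
            int next = demo_goto[stack[top]][demo_lhs[a.arg]];
            if (next < 0 || top + 1 >= 64) return false;
            stack[++top] = next;
            break;
        }
        case DEMO_ACCEPT:
            return true;
        default:
            return false;           /* DEMO_ERR: no action for this state/token */
        }
    }
}
```

A real parser would also build tree nodes on each shift and reduce; this sketch only answers accept or reject, in the spirit of cparseAccept().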

Grammar Format

Syntax Rules

  • Each line defines a production: NonTerminal -> symbol1 symbol2 | alt
  • Symbols within a production are separated by whitespace
  • Use epsilon for empty productions
  • Lines starting with # are comments
  • The first nonterminal becomes the start symbol
  • Use | to specify alternative productions
grammar.txt
# Arithmetic expression grammar
Expr     -> Term ExprTail
ExprTail -> PLUS Term ExprTail | epsilon
Term     -> Factor TermTail
TermTail -> STAR Factor TermTail | epsilon
Factor   -> NUMBER | LPAREN Expr RPAREN
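
cparse computes First/Follow sets for you. To show what that involves for the grammar above, here is a minimal fixed-point sketch that derives the nullable flags and FIRST sets by hand. It is illustrative only and not cparse's implementation: the grammar is hard-coded as an array rather than parsed from text, and DemoProd, demo_prods and demo_first_sets are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Terminals first, so a FIRST set fits in a bitmask of terminal IDs. */
enum { NUMBER, PLUS, STAR, LPAREN, RPAREN, N_TERMS,
       EXPR = N_TERMS, EXPR_TAIL, TERM, TERM_TAIL, FACTOR, N_SYMS };

typedef struct { int lhs; int len; int rhs[3]; } DemoProd;

/* The arithmetic grammar; len == 0 encodes an epsilon production. */
static const DemoProd demo_prods[] = {
    { EXPR,      2, { TERM, EXPR_TAIL } },
    { EXPR_TAIL, 3, { PLUS, TERM, EXPR_TAIL } },
    { EXPR_TAIL, 0, { 0 } },
    { TERM,      2, { FACTOR, TERM_TAIL } },
    { TERM_TAIL, 3, { STAR, FACTOR, TERM_TAIL } },
    { TERM_TAIL, 0, { 0 } },
    { FACTOR,    1, { NUMBER } },
    { FACTOR,    3, { LPAREN, EXPR, RPAREN } },
};

/* Fill first[] (bitmask of terminals) and nullable[] for every symbol. */
void demo_first_sets(unsigned first[N_SYMS], bool nullable[N_SYMS]) {
    assert(first != NULL && nullable != NULL);
    for (int s = 0; s < N_SYMS; s++) {
        first[s] = (s < N_TERMS) ? (1u << s) : 0u;  /* FIRST(t) = { t } */
        nullable[s] = false;
    }
    bool changed = true;
    while (changed) {  /* iterate until nothing grows any more */
        changed = false;
        for (size_t p = 0; p < sizeof demo_prods / sizeof demo_prods[0]; p++) {
            const DemoProd *pr = &demo_prods[p];
            unsigned add = 0;
            int i = 0;
            /* Walk the right-hand side while the prefix can derive epsilon. */
            for (; i < pr->len; i++) {
                add |= first[pr->rhs[i]];
                if (!nullable[pr->rhs[i]]) break;
            }
            if ((first[pr->lhs] | add) != first[pr->lhs]) {
                first[pr->lhs] |= add;
                changed = true;
            }
            if (i == pr->len && !nullable[pr->lhs]) {
                nullable[pr->lhs] = true;
                changed = true;
            }
        }
    }
}
```

For this grammar the result is FIRST(Expr) = FIRST(Term) = FIRST(Factor) = { NUMBER, LPAREN }, FIRST(ExprTail) = { PLUS } and FIRST(TermTail) = { STAR }, with only ExprTail and TermTail nullable.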

Parse Tree Structure

parse_tree_node.h
typedef struct ParseTreeNode {
    char     *value;      /* grammar symbol       */
    clexToken token;      /* matched token (term) */
    clexSourceSpan span;  /* node source range    */
    PtrVec    children;   /* ParseTreeNode*       */
} ParseTreeNode;
tree output for "8 + 5 * 2"
Expr
 ├─ Term
 │   ├─ Factor
 │   │   └─ NUMBER "8"
 │   └─ TermTail
 │       └─ epsilon
 └─ ExprTail
     ├─ PLUS "+"
     ├─ Term
     │   ├─ Factor
     │   │   └─ NUMBER "5"
     │   └─ TermTail
     │       ├─ STAR "*"
     │       ├─ Factor
     │       │   └─ NUMBER "2"
     │       └─ TermTail
     │           └─ epsilon
     └─ ExprTail
         └─ epsilon
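
Walking such a tree is a plain depth-first recursion over the child vector. Since the PtrVec accessors are not shown on this page, the sketch below uses a simplified stand-in node (DemoNode, with a raw child array) instead of the real ParseTreeNode; demo_tree_yield is likewise a hypothetical helper. It collects the terminal lexemes from left to right, which reconstructs the token sequence that was parsed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for ParseTreeNode (assumption: not the real layout). */
typedef struct DemoNode {
    const char       *symbol;      /* grammar symbol, e.g. "Expr" or "NUMBER" */
    const char       *lexeme;      /* matched text for terminals, else NULL   */
    struct DemoNode **children;    /* child nodes, NULL for leaves            */
    size_t            n_children;
} DemoNode;

/* Append the lexemes of all terminal leaves under node to buf, separated by
 * single spaces. buf must already hold a NUL-terminated string shorter than
 * cap; output that does not fit is truncated. */
void demo_tree_yield(const DemoNode *node, char *buf, size_t cap) {
    assert(node != NULL && buf != NULL && cap > 0);
    if (node->n_children == 0) {
        if (node->lexeme == NULL) return;  /* e.g. an epsilon leaf */
        size_t used = strlen(buf);
        snprintf(buf + used, cap - used, "%s%s", used ? " " : "", node->lexeme);
        return;
    }
    for (size_t i = 0; i < node->n_children; i++)
        demo_tree_yield(node->children[i], buf, cap);
}
```

The same recursion shape works for pretty-printing or evaluation: handle the leaf case, then visit each child in order.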

Complete Pipeline Example
clex and cparse working together to parse arithmetic expressions.
expr_parser.c
#include "cparse.h"
#include "clex/clex.h"
#include <stdio.h>

int main(void) {
    /* ── Step 1: Set up the lexer ─────────────────────────────── */
    clexLexer *lexer = clexInit();
    clexRegisterKind(lexer, "[0-9]+",  0);   /* NUMBER */
    clexRegisterKind(lexer, "\\+",     1);   /* PLUS   */
    clexRegisterKind(lexer, "\\*",     2);   /* STAR   */
    clexRegisterKind(lexer, "\\(",     3);   /* LPAREN */
    clexRegisterKind(lexer, "\\)",     4);   /* RPAREN */

    /* ── Step 2: Define the grammar ──────────────────────────── */
    const char *grammar_src =
        "Expr     -> Term ExprTail\n"
        "ExprTail -> PLUS Term ExprTail | epsilon\n"
        "Term     -> Factor TermTail\n"
        "TermTail -> STAR Factor TermTail | epsilon\n"
        "Factor   -> NUMBER | LPAREN Expr RPAREN";

    Grammar *grammar = cparseGrammar(grammar_src);

    /* ── Step 3: Build the parser ───────────────────────────── */
    const char *names[] = {"NUMBER", "PLUS", "STAR", "LPAREN", "RPAREN"};
    LALR1Parser *parser = cparseCreateLALR1Parser(
        grammar, lexer, names, sizeof(names) / sizeof(names[0]));

    /* ── Step 4: Parse input ───────────────────────────────── */
    const char *input = "8 + 5 * 2";

    if (cparseAccept(parser, input) == CPARSE_STATUS_OK) {
        ParseTreeNode *tree = NULL;
        if (cparse(parser, input, &tree) == CPARSE_STATUS_OK) {
            /* ... traverse or inspect the parse tree ... */
        }
        cparseFreeParseTree(tree);
    } else {
        const cparseError *err = cparseGetLastError(parser);
        /* err->position, err->expected_tokens, err->offending_lexeme */
    }

    /* ── Cleanup ────────────────────────────────────────────── */
    cparseFreeParser(parser);
    cparseFreeGrammar(grammar);
    clexLexerDestroy(lexer);
}
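
In the example above, the names array appears to line up with the token kinds registered in Step 1: names[0] is "NUMBER" for kind 0, names[1] is "PLUS" for kind 1, and so on. Assuming that index-equals-kind convention (an inference from the example, not a documented guarantee), resolving a grammar terminal to its token kind is a simple lookup. demo_kind_for_terminal is a hypothetical helper, not part of the cparse API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the token kind for a grammar terminal name, assuming names[i] is
 * the name of kind i. Returns -1 when symbol is not a known terminal
 * (for example, when it is a nonterminal such as "Expr"). */
int demo_kind_for_terminal(const char *const *names, size_t count,
                           const char *symbol) {
    assert(names != NULL && symbol != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(names[i], symbol) == 0) return (int)i;
    }
    return -1;
}
```

Keeping the kind IDs in an enum and building the names array in the same order is one way to keep the two from drifting apart.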

Get Started
Up and running in under a minute.

1. Clone the repository

cparse bundles clex as a git submodule.

git clone https://github.com/h2337/cparse.git

2. Initialize submodules

Pull in the clex lexer dependency.

cd cparse && git submodule update --init --recursive

3. Build and test

Builds libcparse.a and runs the test suite.

make test

4. Try the examples

Build and run the expression parser demo.

make examples && ./examples/expr_parser "8 + 5 * 2"

Using clex Standalone

If you only need the lexer, you can use clex on its own.

terminal
# Clone clex standalone
git clone https://github.com/h2337/clex.git
cd clex

# Run the test suite
make test-all

# Build for library use
make lib

# Or compile directly
gcc your_app.c fa.c clex.c -o your_app

Linking cparse into Your Project

terminal
# After building, link the static library
gcc your_parser.c -L. -lcparse clex/clex.o clex/fa.o -o your_parser

# Or embed the sources directly
gcc your_parser.c grammar.c lr1_lalr1.c util.c \
    clex/clex.c clex/fa.c -o your_parser