The LexBuild AST is a semantic intermediate representation between raw XML and rendered Markdown. It is not a 1:1 mapping of XML elements. Instead, it is a partially interpreted tree that captures the structural and textual meaning needed for rendering. Source packages produce AST nodes from their respective XML formats; the core renderer consumes them to generate Markdown. This producer/consumer split allows multiple legal sources to share a single rendering pipeline.
All AST types are defined in packages/core/src/ast/types.ts and exported from @lexbuild/core.
Why an AST Layer
Legislative XML schemas contain dozens of element types, namespace variations, and structural patterns that are irrelevant to Markdown rendering. The AST collapses this complexity into a small set of typed nodes:
- Structural elements map to LevelNode
- Text blocks map to ContentNode
- Inline formatting maps to InlineNode
- Annotations map to NoteNode and SourceCreditNode
- Tabular data maps to TableNode
Both USLM (for U.S. Code) and GPO/SGML (for eCFR and FR) XML are mapped to these same node types. The renderer operates on AST nodes without knowing which source produced them.
Transformation Overview
XML Source              AST                       Markdown Output
<section>           ->  LevelNode             ->  # Section Heading
  <num>                   .num, .numValue
  <heading>               .heading
  <content>         ->    ContentNode         ->  Paragraph text with
    <ref>                   InlineNode(ref)       [link](path.md)
    text                    InlineNode(text)      plain text
  <subsection>      ->    LevelNode           ->  **(a)** Subsection
    <chapeau>       ->      ContentNode       ->  Introductory text--
    <paragraph>     ->      LevelNode         ->  **(1)** Paragraph text
  <sourceCredit>    ->    SourceCreditNode    ->  (Pub. L. 111-350, ...)
  <notes>           ->    NotesContainerNode
    <note>          ->      NoteNode          ->  ### Editorial Notes
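As a rough illustration of the mapping above, the section in the diagram might produce a tree shaped like the following. The interfaces are trimmed-down local stand-ins for the real types (the `Sketch` names are hypothetical), and the heading text and link target are invented values, so treat this as a sketch of the shape rather than actual builder output.

```typescript
// Trimmed-down local stand-ins for the node shapes described in this
// document, declared here only so the example is self-contained.
interface InlineSketch {
  type: "inline";
  inlineType: "text" | "ref";
  text?: string;
  href?: string;
  children?: InlineSketch[];
}

interface ContentSketch {
  type: "content";
  variant: "content" | "chapeau" | "continuation" | "proviso";
  children: InlineSketch[];
}

interface SourceCreditSketch {
  type: "sourceCredit";
  children: InlineSketch[];
}

interface LevelSketch {
  type: "level";
  levelType: "section" | "subsection" | "paragraph";
  num?: string;
  numValue?: string;
  heading?: string;
  identifier?: string;
  children: Array<LevelSketch | ContentSketch | SourceCreditSketch>;
}

// Small helper that keeps the literal below readable.
const plain = (text: string): InlineSketch => ({ type: "inline", inlineType: "text", text });

// Invented values throughout; only the nesting mirrors the diagram.
const sectionSketch: LevelSketch = {
  type: "level",
  levelType: "section",
  num: "§ 1.",
  numValue: "1",
  heading: "Section Heading",
  identifier: "/us/usc/t1/s1",
  children: [
    {
      type: "content",
      variant: "content",
      children: [
        plain("Paragraph text with "),
        { type: "inline", inlineType: "ref", href: "/us/usc/t1/s2", children: [plain("link")] },
      ],
    },
    {
      type: "level",
      levelType: "subsection",
      num: "(a)",
      numValue: "a",
      heading: "Subsection",
      children: [
        { type: "content", variant: "chapeau", children: [plain("Introductory text--")] },
        {
          type: "level",
          levelType: "paragraph",
          num: "(1)",
          numValue: "1",
          children: [{ type: "content", variant: "content", children: [plain("Paragraph text")] }],
        },
      ],
    },
    { type: "sourceCredit", children: [plain("(Pub. L. 111-350, ...)")] },
  ],
};
```

Note that `<num>` and `<heading>` do not become nodes of their own: they are folded into fields on the enclosing level, which is what makes the tree "partially interpreted" rather than a mirror of the XML.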
Node Types
BaseNode
All nodes extend the BaseNode interface:
interface BaseNode {
  readonly type: string; // Discriminator for the node type
  identifier?: string; // Canonical URI (e.g., "/us/usc/t1/s1")
  sourceElement?: string; // Original XML element name (for diagnostics)
}
The type field is the discriminator for TypeScript’s discriminated union. The identifier carries the canonical URI when one exists in the source XML. The sourceElement preserves the original element name for debugging.
LevelNode
Represents any hierarchical level in the document structure, from an entire title down to a subsubitem.
interface LevelNode extends BaseNode {
  readonly type: "level";
  levelType: LevelType; // Which level in the hierarchy
  num?: string; // Display text (e.g., "§ 1.", "(a)")
  numValue?: string; // Normalized value (e.g., "1", "a")
  heading?: string; // Heading text
  status?: string; // Legal status (e.g., "repealed")
  children: ASTNode[]; // Child nodes
}
The 26 level types are organized into three groups:
Big levels (17 types, above section): title, appendix, subtitle, chapter, subchapter, compiledAct, reorganizationPlans, reorganizationPlan, courtRules, courtRule, article, subarticle, part, subpart, division, subdivision, preliminary
Primary level: section
Small levels (8 types, below section): subsection, paragraph, subparagraph, clause, subclause, item, subitem, subsubitem
These groups are exported as BIG_LEVELS and SMALL_LEVELS sets, and the full ordered list as the LEVEL_TYPES array.
The renderer treats these groups differently. Big levels produce Markdown headings (H1 through H5, capped to avoid overuse of deep heading levels). Small levels use bold inline numbering (for example, **(a)** Text...). The section level itself produces the top-level heading of a section file.
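The grouping described above can be sketched as a small lookup. The two sets below are local copies of the lists in this section (the real `BIG_LEVELS` and `SMALL_LEVELS` are exported from `@lexbuild/core`), and `levelPrefix` is a hypothetical helper: in particular, the way `depth` is chosen is an assumption, not the renderer's actual heading logic.

```typescript
// Local copies of the groups listed above, re-declared so the sketch runs
// on its own without importing @lexbuild/core.
const BIG_LEVELS_SKETCH = new Set<string>([
  "title", "appendix", "subtitle", "chapter", "subchapter", "compiledAct",
  "reorganizationPlans", "reorganizationPlan", "courtRules", "courtRule",
  "article", "subarticle", "part", "subpart", "division", "subdivision",
  "preliminary",
]);

const SMALL_LEVELS_SKETCH = new Set<string>([
  "subsection", "paragraph", "subparagraph", "clause", "subclause",
  "item", "subitem", "subsubitem",
]);

const MAX_HEADING_DEPTH = 5; // the H5 cap described above

// Hypothetical helper: picks a Markdown prefix for a level. `depth` is the
// 1-based nesting depth among big levels (an assumption for this sketch).
function levelPrefix(levelType: string, num: string, depth: number): string {
  if (levelType === "section") {
    return `# ${num}`; // a section is the top-level heading of its own file
  }
  if (BIG_LEVELS_SKETCH.has(levelType)) {
    const capped = Math.min(Math.max(depth, 1), MAX_HEADING_DEPTH);
    return `${"#".repeat(capped)} ${num}`;
  }
  if (SMALL_LEVELS_SKETCH.has(levelType)) {
    return `**${num}**`; // bold inline numbering
  }
  throw new Error(`Unknown level type: ${levelType}`);
}
```

Under these assumptions, `levelPrefix("subsection", "(a)", 3)` yields `**(a)**`, while a big level nested seven deep is still rendered with five `#` characters.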
ContentNode
A block of text content within a level.
interface ContentNode extends BaseNode {
  readonly type: "content";
  variant: ContentVariant; // "content" | "chapeau" | "continuation" | "proviso"
  children: InlineNode[]; // Inline children (text, formatting, refs)
}
The four variants correspond to distinct roles in legal text:
| Variant | Role |
|---|---|
| content | Standard text block |
| chapeau | Introductory text before sub-levels (e.g., “The following conditions apply—”) |
| continuation | Text after or between sub-levels |
| proviso | “Provided that…” conditional text |
All four variants render as plain paragraph text in Markdown. The distinction is preserved in the AST for consumers that need to differentiate these roles.
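For a consumer that does care about the distinction, a minimal sketch might group the text blocks of a level by variant. `TextBlockSketch` is a simplified, hypothetical stand-in that stores plain text instead of `InlineNode` children, and `groupByVariant` is not part of the library.

```typescript
type VariantSketch = "content" | "chapeau" | "continuation" | "proviso";

// Simplified stand-in for ContentNode: the real node holds InlineNode
// children rather than a flat string.
interface TextBlockSketch {
  type: "content";
  variant: VariantSketch;
  text: string;
}

// Buckets the text of each block under its variant, preserving order
// within each bucket.
function groupByVariant(blocks: TextBlockSketch[]): Record<VariantSketch, string[]> {
  const grouped: Record<VariantSketch, string[]> = {
    content: [],
    chapeau: [],
    continuation: [],
    proviso: [],
  };
  for (const block of blocks) {
    grouped[block.variant].push(block.text);
  }
  return grouped;
}
```

A consumer could use the result to, for example, index provisos separately from the main text while still rendering everything as ordinary paragraphs.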
InlineNode
Inline text and formatting within content blocks. InlineNodes can nest (for example, bold text inside a reference).
interface InlineNode extends BaseNode {
  readonly type: "inline";
  inlineType: InlineType; // Discriminator for inline kind
  text?: string; // Text content (leaf nodes)
  href?: string; // Link target (ref nodes)
  idref?: string; // Footnote target ID (footnoteRef nodes)
  children?: InlineNode[]; // Nested inline children
}
The inline type determines how the node renders to Markdown:
| Inline Type | Markdown Output |
|---|---|
| text | Literal text |
| bold | `**text**` |
| italic | `*text*` |
| ref | `[text](path)` or plain text (depending on link resolution) |
| date | Literal text |
| term | `**text**` (bold, same as defined terms) |
| quoted | `"text"` |
| sup | `<sup>text</sup>` |
| sub | `<sub>text</sub>` |
| footnoteRef | `[^id]` |
Leaf nodes use the text field. Non-leaf nodes use children to hold nested inline content. The href field is specific to ref nodes and carries the cross-reference URI.
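The table above translates fairly directly into a recursive function. The following is a minimal sketch, not the renderer's actual code: `InlineSketch` is a local stand-in for `InlineNode`, and the `ResolveLink` callback is a hypothetical way to model "link resolution", returning `undefined` when a target cannot be resolved.

```typescript
type InlineKind =
  | "text" | "bold" | "italic" | "ref" | "date" | "term"
  | "quoted" | "sup" | "sub" | "footnoteRef";

// Local stand-in for InlineNode so the sketch is self-contained.
interface InlineSketch {
  type: "inline";
  inlineType: InlineKind;
  text?: string;
  href?: string;
  idref?: string;
  children?: InlineSketch[];
}

// Hypothetical resolver: maps a cross-reference URI to a link path, or
// returns undefined when the target is not part of the output.
type ResolveLink = (href: string) => string | undefined;

function renderInline(node: InlineSketch, resolve: ResolveLink): string {
  // Leaf nodes carry `text`; non-leaf nodes carry `children`.
  const inner =
    node.text ?? (node.children ?? []).map((child) => renderInline(child, resolve)).join("");

  switch (node.inlineType) {
    case "text":
    case "date":
      return inner;
    case "bold":
    case "term":
      return `**${inner}**`;
    case "italic":
      return `*${inner}*`;
    case "quoted":
      return `"${inner}"`;
    case "sup":
      return `<sup>${inner}</sup>`;
    case "sub":
      return `<sub>${inner}</sub>`;
    case "footnoteRef":
      return `[^${node.idref ?? inner}]`;
    case "ref": {
      const target = node.href === undefined ? undefined : resolve(node.href);
      // Fall back to plain text when the link cannot be resolved.
      return target === undefined ? inner : `[${inner}](${target})`;
    }
  }
}
```

Because nested children are rendered before the wrapper is applied, bold text inside a reference comes out as a bold link label, and an unresolved reference degrades to its label alone.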
NoteNode
Editorial notes, statutory notes, and amendment history.
interface NoteNode extends BaseNode {
  readonly type: "note";
  topic?: string; // Semantic category
  role?: string; // Role refinement (e.g., "crossHeading")
  noteType?: string; // Placement type
  heading?: string; // Heading text
  children: ASTNode[]; // Child nodes
}
Notes have two independent classification axes:
- noteType (placement): uscNote, footnote, inline, endnote
- topic (semantic category): amendments, codification, changeOfName, crossReferences, effectiveDateOfAmendment, miscellaneous, repeals, regulations, dispositionOfSections, enacting
The role field with value "crossHeading" marks notes that act as section dividers within a notes container. A cross-heading note with heading “Editorial Notes” or “Statutory Notes” establishes the category for subsequent notes until the next cross-heading. The renderer uses this to support selective notes filtering without modifying the AST.
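The cross-heading behaviour can be sketched as a single pass that remembers the most recent cross-heading. `NoteSketch`, `NoteCategory`, `categorizeNotes`, and `selectNotes` are hypothetical names for this illustration; the real filtering options in the renderer may look different.

```typescript
// Local stand-in for NoteNode, reduced to the fields the sketch needs.
interface NoteSketch {
  type: "note";
  role?: string;
  topic?: string;
  heading?: string;
}

type NoteCategory = "editorial" | "statutory" | "uncategorized";

interface CategorizedNote {
  category: NoteCategory;
  note: NoteSketch;
}

// Walks the notes in document order and tags each one with the category set
// by the most recent cross-heading. Cross-headings themselves are not returned.
function categorizeNotes(notes: NoteSketch[]): CategorizedNote[] {
  const result: CategorizedNote[] = [];
  let current: NoteCategory = "uncategorized";
  for (const note of notes) {
    if (note.role === "crossHeading") {
      if (note.heading === "Editorial Notes") current = "editorial";
      else if (note.heading === "Statutory Notes") current = "statutory";
      else current = "uncategorized";
      continue;
    }
    result.push({ category: current, note });
  }
  return result;
}

// Returns only the notes in the wanted categories. The input array is left
// untouched, mirroring the "without modifying the AST" property.
function selectNotes(notes: NoteSketch[], wanted: Set<NoteCategory>): NoteSketch[] {
  return categorizeNotes(notes)
    .filter((entry) => wanted.has(entry.category))
    .map((entry) => entry.note);
}
```

Filtering at render time this way keeps a single AST usable for several output configurations, for example one run with all notes and another with statutory notes only.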
SourceCreditNode
Enactment source citations (such as public law references).
interface SourceCreditNode extends BaseNode {
  readonly type: "sourceCredit";
  children: InlineNode[];
}
Rendered as a horizontal rule followed by a bold “Source Credit” label and the citation text.
TableNode
Represents both XHTML tables and USLM layout tables.
interface TableNode extends BaseNode {
  readonly type: "table";
  variant: "xhtml" | "layout";
  headers: string[][]; // Header rows (array of cell arrays)
  rows: string[][]; // Body rows
  rawHtml?: string; // Fallback for tables too complex to decompose
}
The variant distinguishes XHTML namespace <table> elements (common in USC XML) from USLM <layout>/<column> elements. Both are normalized into the same row/column structure. Tables that cannot be simplified to rows and columns (those with colspan or rowspan) fall back to rawHtml. Simple tables render as Markdown pipe tables.
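A minimal sketch of the table path might look like the following. It makes several assumptions that the source does not confirm: that the presence of `rawHtml` is what signals the fallback, that only the first header row becomes the Markdown header (pipe tables allow just one), and that short rows are padded with empty cells. `TableSketch` is a local stand-in for `TableNode`.

```typescript
// Local stand-in for TableNode so the sketch is self-contained.
interface TableSketch {
  type: "table";
  variant: "xhtml" | "layout";
  headers: string[][];
  rows: string[][];
  rawHtml?: string;
}

// Escapes pipes and flattens newlines so cell text cannot break the layout.
function escapeCell(cell: string): string {
  return cell.replace(/\|/g, "\\|").replace(/\n/g, " ");
}

function renderRow(cells: string[]): string {
  return `| ${cells.map(escapeCell).join(" | ")} |`;
}

function renderTable(table: TableSketch): string {
  // Assumption: a populated rawHtml means the table could not be decomposed.
  if (table.rawHtml !== undefined) {
    return table.rawHtml;
  }
  const allRows = [...table.headers, ...table.rows];
  const width = Math.max(0, ...allRows.map((row) => row.length));
  // Pads short rows with empty cells so every row has the same width.
  const pad = (cells: string[]): string[] => [
    ...cells,
    ...new Array<string>(width - cells.length).fill(""),
  ];
  const headerRow = table.headers[0] ?? [];
  // Extra header rows are demoted to body rows in this sketch.
  const bodyRows = [...table.headers.slice(1), ...table.rows];
  return [
    renderRow(pad(headerRow)),
    renderRow(new Array<string>(width).fill("---")),
    ...bodyRows.map((row) => renderRow(pad(row))),
  ].join("\n");
}
```

Both `xhtml` and `layout` variants go through the same function here, which reflects the point that the two source formats are normalized into one row/column structure before rendering.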
Other Node Types
- TOCNode and TOCItemNode: Table of contents structures. Present in the AST for completeness but skipped during Markdown rendering, since TOC structures in the source XML are navigational aids rather than legal text content.
- NotesContainerNode: Wraps <notes> containers. Children include NoteNode entries with cross-heading roles acting as category dividers.
- QuotedContentNode: Quoted legal text, typically quoted bills embedded in statutory notes. Rendered as a Markdown blockquote. Sections inside quoted content are suppressed during emission to prevent them from being written as standalone output files.
The Complete Union Type
type ASTNode =
  | LevelNode
  | ContentNode
  | InlineNode
  | NoteNode
  | SourceCreditNode
  | TableNode
  | TOCNode
  | TOCItemNode
  | NotesContainerNode
  | QuotedContentNode;
The renderer dispatches on node.type via a switch statement. TypeScript’s discriminated union ensures exhaustive handling — adding a new node type without a corresponding render case produces a compile-time error.
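One common way to get that compile-time guarantee is a `never` check in the `default` branch. Whether the LexBuild renderer uses this exact pattern is an assumption; the three-member `NodeSketch` union and both functions below are hypothetical and exist only to show the technique.

```typescript
// A deliberately small union standing in for ASTNode.
type NodeSketch =
  | { type: "level"; heading?: string }
  | { type: "content"; text: string }
  | { type: "sourceCredit"; text: string };

// Accepts only values of type `never`. If a new member is added to
// NodeSketch without a matching case, the call below stops compiling.
// At runtime it throws if an unexpected value slips through anyway.
function assertNever(value: never): never {
  throw new Error(`Unhandled node: ${JSON.stringify(value)}`);
}

function renderSketch(node: NodeSketch): string {
  switch (node.type) {
    case "level":
      return `# ${node.heading ?? ""}`;
    case "content":
      return node.text;
    case "sourceCredit":
      return `**Source Credit**: ${node.text}`;
    default:
      // `node` has been narrowed to `never` here because every case is handled.
      return assertNever(node);
  }
}
```

Inside each `case`, the compiler narrows `node` to the matching member, which is why `node.text` is accessible in the `content` branch but would be an error in the `level` branch.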
Source Type Discriminator
Every output file carries a source discriminator and a legal_status classification:
type SourceType = "usc" | "ecfr" | "fr";
type LegalStatus =
  | "official_legal_evidence" // USC positive law titles
  | "official_prima_facie" // USC non-positive law titles
  | "authoritative_unofficial"; // eCFR and FR content
The SourceType union is designed for extension. Adding a new source means adding a new value here.
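The comments on `LegalStatus` imply a simple mapping from source to status, which could be sketched as below. `legalStatusFor` and its `positiveLaw` parameter are hypothetical; how each source package actually decides the status is not specified here.

```typescript
type SourceType = "usc" | "ecfr" | "fr";
type LegalStatus =
  | "official_legal_evidence"
  | "official_prima_facie"
  | "authoritative_unofficial";

// Hypothetical helper: `positiveLaw` is only meaningful for USC titles and
// is ignored for eCFR and FR content.
function legalStatusFor(source: SourceType, positiveLaw: boolean): LegalStatus {
  switch (source) {
    case "usc":
      return positiveLaw ? "official_legal_evidence" : "official_prima_facie";
    case "ecfr":
    case "fr":
      return "authoritative_unofficial";
  }
}
```

With a switch like this, adding a new value to `SourceType` makes the function fail to type-check until the new source is given a status, which fits the extension story described above.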
FrontmatterData
The FrontmatterData interface defines all fields used to generate YAML frontmatter for output files. Required fields appear on every file; optional fields are included when applicable.
Required fields include source, legal_status, identifier (canonical URI), title, title_number, title_name, currency, and last_updated. Structural context fields like chapter_number, section_number, and part_number are included when the hierarchy provides them.
Source-specific optional fields cover eCFR metadata (authority, agency, cfr_part) and FR metadata (document_number, document_type, fr_citation, agencies, effective_date). Two constants are appended automatically: FORMAT_VERSION and GENERATOR.
Context Types
EmitContext
When the builder emits a completed subtree, it provides an EmitContext:
interface EmitContext {
  ancestors: AncestorInfo[]; // Ancestor chain from document root
  documentMeta: DocumentMeta; // Document-level metadata
}
The ancestors array gives you the full hierarchy path from the document root to the emitted node’s parent. This provides enough context for determining file paths, generating frontmatter, and resolving relative links — all without holding the entire document tree in memory.
AncestorInfo
A lightweight snapshot of an ancestor level:
interface AncestorInfo {
  levelType: LevelType;
  numValue?: string;
  heading?: string;
  identifier?: string;
}
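To show how an ancestor chain can drive file placement, here is a minimal sketch that turns ancestors into a relative path. The path scheme (`levelType-numValue` directories and a `.md` leaf) is invented for this example and is not LexBuild's actual output layout; `AncestorSketch` and `outputPath` are hypothetical names.

```typescript
// Local stand-in for AncestorInfo so the sketch is self-contained.
interface AncestorSketch {
  levelType: string;
  numValue?: string;
  heading?: string;
  identifier?: string;
}

// Hypothetical path scheme: one directory per numbered ancestor, then a
// Markdown file for the emitted node. Ancestors without a numValue are skipped.
function outputPath(
  ancestors: AncestorSketch[],
  leaf: { levelType: string; numValue: string },
): string {
  const segments: string[] = [];
  for (const ancestor of ancestors) {
    if (ancestor.numValue !== undefined) {
      segments.push(`${ancestor.levelType}-${ancestor.numValue}`);
    }
  }
  segments.push(`${leaf.levelType}-${leaf.numValue}.md`);
  return segments.join("/");
}
```

Because each snapshot carries only a handful of strings, a function like this needs nothing beyond the `EmitContext`, which is the property that lets the builder discard a subtree once it has been emitted.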
DocumentMeta
Metadata extracted from the XML <meta> block. Held in memory for the full duration of parsing and included in every EmitContext. Contains fields like dcTitle, docNumber, releasePoint, positivelaw, and the root document identifier.