This is the authoritative specification for LexBuild’s Markdown output format. Every .md file that LexBuild produces follows this format, which is designed for RAG pipelines, vector databases, and LLM context windows. If you are building a system that consumes LexBuild output, this is the document you need.
Format Versioning
The current format version is 1.1.0, defined by the FORMAT_VERSION constant in @lexbuild/core. Breaking changes to the output format increment the major version.
The format version is recorded in two places:
- The format_version field in every Markdown file’s YAML frontmatter.
- The format_version field in every _meta.json sidecar index.
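A consumer can use the recorded version to guard against breaking changes. The following is a minimal sketch, assuming the version string always has the three-part MAJOR.MINOR.PATCH shape shown above; the function names and the SUPPORTED_MAJOR constant are hypothetical and not part of LexBuild.

```python
SUPPORTED_MAJOR = 1  # assumption: this consumer was written against format 1.x


def parse_format_version(value: str) -> tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in value.split("."))
    return major, minor, patch


def is_compatible(format_version: str, supported_major: int = SUPPORTED_MAJOR) -> bool:
    """Return True when the major version matches the one this consumer supports."""
    return parse_format_version(format_version)[0] == supported_major


print(is_compatible("1.1.0"))  # True
print(is_compatible("2.0.0"))  # False
```

Checking only the major component follows from the rule that breaking changes increment the major version.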
Directory Layout
USC Output
LexBuild supports three output granularities for U.S. Code content. The granularity determines how content is partitioned into files.
Section granularity (default):
output/usc/
├── title-01/
│   ├── chapter-01/
│   │   ├── section-1.md
│   │   ├── section-2.md
│   │   └── _meta.json
│   ├── chapter-02/
│   │   ├── section-101.md
│   │   └── _meta.json
│   ├── _meta.json
│   └── README.md
└── title-54/
    └── ...
Chapter granularity:
output/usc/
├── title-01/
│   ├── chapter-01/
│   │   └── chapter-01.md
│   ├── chapter-02/
│   │   └── chapter-02.md
│   ├── _meta.json
│   └── README.md
└── ...
Title granularity:
output/usc/
├── title-01.md
├── title-02.md
└── ...
Title granularity produces flat files with no subdirectories and no sidecar files. The frontmatter is enriched with aggregate statistics (chapter_count, section_count, total_token_estimate).
eCFR Output
LexBuild supports four output granularities for Code of Federal Regulations content.
Section granularity (default):
output/ecfr/
├── title-01/
│   ├── chapter-I/
│   │   ├── part-1/
│   │   │   ├── section-1.1.md
│   │   │   ├── section-1.2.md
│   │   │   └── _meta.json
│   │   └── part-2/
│   │       ├── section-2.1.md
│   │       └── _meta.json
│   ├── _meta.json
│   └── README.md
└── title-50/
    └── ...
Part granularity:
output/ecfr/
├── title-17/
│   ├── chapter-II/
│   │   ├── part-240.md
│   │   └── part-249.md
│   └── ...
└── ...
Chapter granularity:
output/ecfr/
├── title-17/
│   ├── chapter-I.md
│   ├── chapter-II.md
│   └── ...
└── ...
Title granularity:
output/ecfr/
├── title-01.md
├── title-17.md
└── ...
FR Output
Federal Register documents produce one file per document, organized by publication date:
output/fr/
├── 2026/
│   ├── 01/
│   │   └── 2026-00123.md
│   └── 03/
│       ├── 2026-06029.md
│       └── 2026-06048.md
└── 2025/
    └── ...
No granularity options are available for FR output. FR documents are already atomic (one file per document).
Naming Conventions
| Component | Pattern | Examples | Notes |
|---|---|---|---|
| Title dir (USC) | title-{NN} | title-01, title-54 | 2-digit zero-padded |
| Title dir (eCFR) | title-{NN} | title-01, title-50 | 2-digit zero-padded |
| Appendix dir | title-{NN}-appendix | title-05-appendix | USC only: titles 5, 11, 18, 28 |
| Chapter dir (USC) | chapter-{NN} | chapter-01, chapter-99 | 2-digit zero-padded |
| Chapter dir (eCFR) | chapter-{X} | chapter-I, chapter-IV | Roman numerals |
| Part dir (eCFR) | part-{N} | part-1, part-240 | Not zero-padded |
| Section file (USC) | section-{ID}.md | section-1.md, section-7801.md, section-202a.md | Not zero-padded; may be alphanumeric |
| Section file (eCFR) | section-{N.N}.md | section-1.1.md, section-240.10b-5.md | Part-prefixed section number |
| Duplicate sections | section-{ID}-2.md | section-3598-2.md | USC only; -2, -3 suffix for subsequent occurrences |
| Year dir (FR) | {YYYY} | 2026 | 4-digit year |
| Month dir (FR) | {MM} | 01, 03 | 2-digit zero-padded month |
| Document file (FR) | {doc_number}.md | 2026-06029.md | FR document number |
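The patterns in this table can be turned into path builders. The sketch below covers only the common cases and makes several assumptions: paths are relative to the output root, the FR year and month come from the publication date, and appendix directories and duplicate-section suffixes are ignored. All function names are hypothetical.

```python
from pathlib import PurePosixPath


def usc_section_path(title: int, chapter: int, section_id: str) -> PurePosixPath:
    """USC section file: title and chapter are 2-digit zero-padded, the section is not."""
    return PurePosixPath(
        "usc", f"title-{title:02d}", f"chapter-{chapter:02d}", f"section-{section_id}.md"
    )


def ecfr_section_path(title: int, chapter: str, part: str, section: str) -> PurePosixPath:
    """eCFR section file: Roman-numeral chapter, unpadded part, part-prefixed section."""
    return PurePosixPath(
        "ecfr", f"title-{title:02d}", f"chapter-{chapter}", f"part-{part}", f"section-{section}.md"
    )


def fr_document_path(document_number: str, publication_date: str) -> PurePosixPath:
    """FR document file under {YYYY}/{MM}, taken from a YYYY-MM-DD publication date."""
    year, month, _day = publication_date.split("-")
    return PurePosixPath("fr", year, month, f"{document_number}.md")


print(usc_section_path(1, 1, "1"))                     # usc/title-01/chapter-01/section-1.md
print(ecfr_section_path(17, "II", "240", "240.10b-5"))  # ecfr/title-17/chapter-II/part-240/section-240.10b-5.md
print(fr_document_path("2026-06029", "2026-03-28"))     # fr/2026/03/2026-06029.md
```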
Frontmatter Schema
Every output file begins with a YAML frontmatter block delimited by ---. Fields are serialized in a controlled order using double-quoted string values for consistency.
Common Fields
Every file, regardless of source or granularity, includes these fields:
| Field | Type | Description |
|---|---|---|
| identifier | string | Canonical URI identifier (e.g., "/us/usc/t1/s1", "/us/cfr/t17/s240.10b-5") |
| source | string | Content source: "usc", "ecfr", or "fr" |
| legal_status | string | Legal provenance (see Legal Status Values) |
| title | string | Human-readable display title |
| title_number | number | Numeric title designation |
| title_name | string | Title heading text |
| positive_law | boolean | Whether the title is enacted as positive law |
| currency | string | USC: release point identifier (e.g., "119-73"); eCFR: ISO date of the conversion run (e.g., "2025-03-15") |
| last_updated | string | ISO date of the conversion run |
| format_version | string | Output format version ("1.1.0") |
| generator | string | Generator identifier (e.g., "[email protected]") |
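Because values are serialized as double-quoted strings, numbers, or booleans, the frontmatter can be read with a very small parser. The sketch below is an illustration only, assuming flat key/value lines plus simple string lists and a closing --- delimiter on its own line; a full YAML parser is the safer choice in production. The function name split_frontmatter is hypothetical.

```python
import json


def _scalar(raw: str):
    """Decode a quoted string, number, or boolean; fall back to the raw text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def split_frontmatter(text: str) -> tuple[dict, str]:
    """Return (fields, body) for a file that begins with a '---' delimited block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text  # unterminated block; treat everything as body
    fields: dict = {}
    current_list = None
    for line in lines[1:end]:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- ") and current_list is not None:
            current_list.append(_scalar(stripped[2:]))  # item of the open list
            continue
        key, _, raw = line.partition(":")
        raw = raw.strip()
        if raw == "":
            current_list = fields.setdefault(key.strip(), [])  # list follows
        else:
            current_list = None
            fields[key.strip()] = _scalar(raw)
    return fields, "\n".join(lines[end + 1:])
```

Decoding scalars as JSON is a shortcut that works here because double-quoted YAML strings, integers, and true/false are written the same way in JSON.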
USC Section-Level Frontmatter
A complete USC section file includes all common fields plus section-specific context:
---
identifier: "/us/usc/t1/s1"
source: "usc"
legal_status: "official_legal_evidence"
title: "1 USC § 1 - Words denoting number, gender, and so forth"
title_number: 1
title_name: "GENERAL PROVISIONS"
section_number: "1"
section_name: "Words denoting number, gender, and so forth"
chapter_number: 1
chapter_name: "RULES OF CONSTRUCTION"
positive_law: true
currency: "119-73"
last_updated: "2025-12-03"
format_version: "1.1.0"
generator: "[email protected]"
source_credit: "(July 30, 1947, ch. 388, 61 Stat. 633.)"
---
Optional fields that appear when applicable:
| Field | Type | Condition |
|---|---|---|
| subchapter_number | string | Present when section is within a subchapter |
| subchapter_name | string | Present when section is within a subchapter |
| source_credit | string | Present when the section has a source credit annotation |
| status | string | Present for non-current sections (see Section Status Values) |
eCFR Section-Level Frontmatter
eCFR sections include additional regulatory metadata. Note that currency and last_updated reflect the date the conversion was run, not the source data’s last amendment date:
---
identifier: "/us/cfr/t17/s240.10b-5"
source: "ecfr"
legal_status: "authoritative_unofficial"
title: "17 CFR § 240.10b-5 - Employment of manipulative and deceptive devices"
title_number: 17
title_name: "Commodity and Securities Exchanges"
section_number: "240.10b-5"
section_name: "Employment of manipulative and deceptive devices"
chapter_name: "Securities and Exchange Commission"
part_number: "240"
part_name: "General Rules and Regulations, Securities Exchange Act of 1934"
positive_law: false
currency: "2025-03-21"
last_updated: "2025-03-21"
format_version: "1.1.0"
generator: "[email protected]"
authority: "15 U.S.C. 78a et seq."
regulatory_source: "[37 FR 23603, Nov. 4, 1972]"
cfr_part: "240"
---
eCFR-specific optional fields:
| Field | Type | Description |
|---|---|---|
| part_number | string | CFR part number (e.g., "240") |
| part_name | string | Part heading text |
| chapter_number | number | Only set when the chapter designator is a parseable integer (CFR chapters use Roman numerals, which are captured in chapter_name instead) |
| chapter_name | string | Chapter heading text |
| authority | string | Regulatory authority citation (from part-level AUTH element) |
| regulatory_source | string | Publication source (from part-level SOURCE element) |
| cfr_part | string | CFR part number |
| cfr_subpart | string | CFR subpart identifier |
| source_credit | string | Citation for the section (from CITA element) |
FR Document-Level Frontmatter
FR documents include all common fields plus FR-specific metadata. When a JSON sidecar from the API is available, frontmatter is enriched with structured agency, CFR reference, docket, and date information:
---
identifier: "/us/fr/2026-06029"
source: "fr"
legal_status: "authoritative_unofficial"
title: "Amendments to Exchange Act Rule 10b-5"
title_number: 0
title_name: "Federal Register"
section_number: "2026-06029"
section_name: "Amendments to Exchange Act Rule 10b-5"
positive_law: false
currency: "2026-03-28"
last_updated: "2026-03-28"
format_version: "1.1.0"
generator: "[email protected]"
document_number: "2026-06029"
document_type: "rule"
fr_citation: "91 FR 14523"
fr_volume: 91
publication_date: "2026-03-28"
agencies:
- "Securities and Exchange Commission"
cfr_references:
- "17 CFR Part 240"
docket_ids:
- "Release No. 34-99999"
rin: "3235-AM00"
effective_date: "2026-05-27"
fr_action: "Final rule."
---
FR-specific optional fields:
| Field | Type | Description |
|---|---|---|
| document_number | string | FR document number (e.g., "2026-06029") |
| document_type | string | Normalized type: "rule", "proposed_rule", "notice", "presidential_document" |
| fr_citation | string | Full FR citation (e.g., "91 FR 14523") |
| fr_volume | number | FR volume number |
| publication_date | string | Publication date (YYYY-MM-DD) |
| agencies | string[] | Issuing agency names |
| cfr_references | string[] | Affected CFR titles/parts |
| docket_ids | string[] | Docket identifiers |
| rin | string | Regulation Identifier Number |
| effective_date | string | When the rule takes effect |
| comments_close_date | string | Comment period end date (proposed rules) |
| fr_action | string | Action description (e.g., "Final rule.") |
Title-Level Enriched Frontmatter
Title granularity files include aggregate statistics instead of section/chapter context fields:
---
identifier: "/us/usc/t1"
source: "usc"
legal_status: "official_legal_evidence"
title: "Title 1 — GENERAL PROVISIONS"
title_number: 1
title_name: "GENERAL PROVISIONS"
positive_law: true
currency: "119-73"
last_updated: "2025-12-03"
format_version: "1.1.0"
generator: "[email protected]"
chapter_count: 3
section_count: 15
total_token_estimate: 12500
---
| Field | Type | Description |
|---|---|---|
| chapter_count | number | Number of chapters in the title |
| section_count | number | Total sections across all chapters |
| total_token_estimate | number | Estimated token count for the entire title |
| part_count | number | Number of parts (eCFR title-level only) |
Legal Status Values
| Value | Meaning | Applies To |
|---|---|---|
| official_legal_evidence | Positive law titles; the text itself is legal evidence | USC titles enacted as positive law |
| official_prima_facie | Non-positive law titles; prima facie evidence of the law | USC titles not enacted as positive law |
| authoritative_unofficial | Authoritative but not official; derived from official sources | All eCFR and FR content |
Identifier Format
USC identifiers use the canonical URI scheme from USLM identifier attributes:
/us/usc/t{title} Title level
/us/usc/t{title}/ch{chapter} Chapter level
/us/usc/t{title}/s{section} Section level
/us/usc/t{title}/s{section}/{sub} Subsection level
CFR identifiers are constructed from eCFR XML attributes and use /us/cfr/ (content type), not /us/ecfr/ (data source):
/us/cfr/t{title} Title level
/us/cfr/t{title}/ch{chapter} Chapter level
/us/cfr/t{title}/pt{part} Part level
/us/cfr/t{title}/s{section} Section level
Both eCFR and future annual CFR sources share the /us/cfr/ identifier space.
FR identifiers use document numbers (unique, stable, API primary key):
/us/fr/{document_number} Document level
For a detailed breakdown of the identifier format and link resolution behavior, see the identifier format reference.
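The identifier shapes listed above can be decomposed with plain string handling. This is a sketch under the assumption that identifiers follow exactly the patterns shown in this section; the function name and the keys of the returned dictionary are hypothetical.

```python
# Prefix of the fourth path segment mapped to the level it names.
LEVEL_PREFIXES = (("ch", "chapter"), ("pt", "part"), ("s", "section"))


def parse_identifier(identifier: str) -> dict:
    """Break a /us/usc/..., /us/cfr/..., or /us/fr/... identifier into components."""
    parts = identifier.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "us":
        raise ValueError(f"not a recognized identifier: {identifier!r}")
    corpus = parts[1]
    if corpus == "fr":
        return {"corpus": "fr", "document_number": parts[2]}
    if corpus not in ("usc", "cfr") or not parts[2].startswith("t"):
        raise ValueError(f"not a recognized identifier: {identifier!r}")
    result = {"corpus": corpus, "title": parts[2][1:]}
    if len(parts) > 3:
        for prefix, level in LEVEL_PREFIXES:
            if parts[3].startswith(prefix):
                result[level] = parts[3][len(prefix):]
                break
        else:
            raise ValueError(f"unknown level in identifier: {identifier!r}")
    if len(parts) > 4:
        # Everything below the section (subsection and deeper) is kept as one string.
        result["subdivision"] = "/".join(parts[4:])
    return result


print(parse_identifier("/us/cfr/t17/s240.10b-5"))
# {'corpus': 'cfr', 'title': '17', 'section': '240.10b-5'}
```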
Content Structure
Section Heading
Every section file begins with a level-1 heading displaying the section number and name:
# § 1. Words denoting number, gender, and so forth
For eCFR content:
# § 240.10b-5 Employment of manipulative and deceptive devices
Inline Hierarchy (Small Levels)
Subsections and all levels below use bold inline numbering rather than Markdown headings. This is a deliberate design choice: headings would imply document structure, but legal subsections are subordinate to the section and should not appear in a table of contents.
**(a)** For the purposes of any Federal law, an individual shall be
considered married if that individual's marriage is between 2 individuals
and is valid in the State where the marriage was entered into.
**(b)** In this section, the term "State" means a State, the District of
Columbia, the Commonwealth of Puerto Rico, or any other territory or
possession of the United States.
When a subsection has a heading, it follows the number in bold:
**(a)** **In general.** — The Secretary shall prescribe regulations...
The numbering scheme communicates hierarchical depth:
| Level | Style | Example |
|---|---|---|
| Subsection | Lowercase letter | **(a)** |
| Paragraph | Arabic numeral | **(1)** |
| Subparagraph | Uppercase letter | **(A)** |
| Clause | Lowercase Roman numeral | **(i)** |
| Subclause | Uppercase Roman numeral | **(I)** |
| Item | Double lowercase | **(aa)** |
| Subitem | Double uppercase | **(AA)** |
| Subsubitem | Triple lowercase | **(aaa)** |
Content is never indented with leading spaces. Markdown indentation would create code blocks, defeating the purpose. Hierarchy is communicated exclusively through the numbering scheme.
Title-Level Heading Hierarchy
When rendering at title or chapter granularity (multiple sections in a single file), structural headings use an increasing depth:
| Element | Heading Level |
|---|---|
| Title | # (H1) |
| Chapter | ## (H2) |
| Section | ### (H3) |
| Subchapter | ## or ### depending on nesting |
Structural headings cap at H5. Big-level headings that would exceed H5 render as bold text instead.
Source Credits
Source credits are separated from the body by a horizontal rule and rendered with a bold label:
---
**Source Credit**: (July 30, 1947, ch. 388, 61 Stat. 633.)
Notes
Notes appear after the source credit. Cross-heading notes that categorize groups of notes render as level-2 headings. Individual note headings render as level-3 headings:
## Editorial Notes
### Amendments
2022—Pub. L. 117–228 amended section generally.
1996—Pub. L. 104–199 added this section.
## Statutory Notes and Related Subsidiaries
### Severability
If any provision of this Act is held to be unconstitutional,
the remainder shall not be affected.
Notes are included by default and can be controlled with CLI flags:
- --no-include-notes disables all notes.
- --include-editorial-notes enables only editorial notes.
- --include-statutory-notes enables only statutory notes.
- --include-amendments enables only amendment history.
Quoted Content
Quoted legal text (from <quotedContent> elements, typically quoted bills in statutory notes) renders as Markdown blockquotes:
> (a) The Secretary shall establish a program...
>
> (b) The program shall include...
Footnotes
Footnotes use Markdown footnote syntax. References appear inline as [^N] and definitions appear at the bottom of the section file:
The term applies to all cases[^1] under this section.
[^1]: As defined in section 101 of title 5.
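When chunking a section, it can help to pull footnote definitions out of the body so they travel with the text that references them. A minimal sketch, assuming each definition sits on a single line in the form shown above; the function name is hypothetical.

```python
import re

# A definition starts a line: "[^1]: text". A reference is "[^1]" not followed by ":".
FOOTNOTE_DEF = re.compile(r"^\[\^(?P<label>[^\]]+)\]:\s*(?P<text>.*)$", re.MULTILINE)
FOOTNOTE_REF = re.compile(r"\[\^(?P<label>[^\]]+)\](?!:)")


def extract_footnotes(markdown: str) -> tuple[list[str], dict[str, str]]:
    """Return (reference labels in order of appearance, definitions keyed by label)."""
    definitions = {m["label"]: m["text"] for m in FOOTNOTE_DEF.finditer(markdown)}
    references = [m["label"] for m in FOOTNOTE_REF.finditer(markdown)]
    return references, definitions


sample = (
    "The term applies to all cases[^1] under this section.\n"
    "\n"
    "[^1]: As defined in section 101 of title 5.\n"
)
print(extract_footnotes(sample))
# (['1'], {'1': 'As defined in section 101 of title 5.'})
```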
Defined Terms
Terms being defined (from <term> elements) render in bold:
The term **employee** means an individual employed by the Government.
Inline Formatting
| Source Element | Markdown Output |
|---|---|
| <b> / bold | **text** |
| <i> / italic | *text* |
| <sup> | <sup>text</sup> |
| <sub> | <sub>text</sub> |
| <term> | **text** |
| <quotedContent> | > blockquote |
Tables
Simple Tables
Tables without colspan or rowspan render as standard Markdown pipe tables:
| Rate | Amount | Date |
| --- | --- | --- |
| Basic | $100 | 2024-01-01 |
| Premium | $250 | 2024-06-01 |
| Enterprise | $500 | 2024-12-01 |
Pipe characters within cell content are escaped as \|. Backslashes are escaped as \\.
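The two escaping rules interact, so order matters when reproducing or reversing them. The sketch below illustrates the rules as stated here and is not LexBuild's own implementation; it assumes every row has a leading and a trailing pipe, and both function names are hypothetical.

```python
def escape_cell(text: str) -> str:
    """Escape backslashes first so the backslash added for a pipe is not doubled."""
    return text.replace("\\", "\\\\").replace("|", "\\|")


def split_row(row: str) -> list[str]:
    """Split a pipe-table row on unescaped pipes and undo the escaping in each cell."""
    row = row.strip()
    cells, current, i = [], [], 0
    while i < len(row):
        ch = row[i]
        if ch == "\\" and i + 1 < len(row):
            current.append(row[i + 1])  # escaped character, taken literally
            i += 2
            continue
        if ch == "|":
            cells.append("".join(current).strip())  # unescaped pipe ends the cell
            current = []
        else:
            current.append(ch)
        i += 1
    cells.append("".join(current).strip())
    return cells[1:-1]  # drop the empty strings outside the outer pipes


print(split_row("| Basic | $100 | 2024-01-01 |"))  # ['Basic', '$100', '2024-01-01']
```

A naive split on the pipe character would break any cell that contains an escaped pipe, which is why the scanner consumes escape pairs before testing for a delimiter.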
Layout Tables
USLM <layout> elements (column-oriented display, common in pay schedules) also render as Markdown pipe tables when their structure is compatible:
| Grade | Step 1 | Step 2 |
| --- | --- | --- |
| GS-1 | $20,000 | $21,000 |
| GS-2 | $25,000 | $26,500 |
Complex Tables
Tables with colspan, rowspan, or other features that do not map cleanly to Markdown pipe syntax are rendered as best-effort pipe tables. Complex layout features (such as multi-column headers or cells spanning multiple rows) may be flattened or approximated. If you require lossless table structure, refer to the source XML.
Cross-Reference Links
Cross-reference rendering is controlled by the --link-style option. Three styles are available:
Plaintext (default)
References render as unlinked text:
section 101 of title 5
Relative
References to sections within the converted corpus resolve to relative Markdown links. References outside the corpus fall back to an external URL or to plain text, depending on the identifier prefix (see Link Resolution Rules):
[section 101 of title 5](../title-05/chapter-01/section-101.md)
The link resolver uses a two-pass approach: all section identifiers are registered before any rendering occurs, enabling both forward and backward cross-references to resolve.
Canonical
USC references link to the OLRC website (uscode.house.gov):
[section 101 of title 5](https://uscode.house.gov/view.xhtml?req=granuleid:USC-prelim-title5-section101)
Link Resolution Rules
| Identifier Prefix | Resolved As |
|---|---|
| /us/usc/ | Relative link when resolved; otherwise OLRC fallback URL |
| /us/cfr/ | Relative link when resolved; otherwise plain text |
| /us/fr/ | Relative link when resolved; otherwise federalregister.gov fallback URL |
| /us/stat/ | Always plain text (Statutes at Large) |
| /us/pl/ | Always plain text (Public Law) |
Fallback URLs for unresolved references:
- USC: https://uscode.house.gov/view.xhtml?req=granuleid:USC-prelim-title{N}-section{N}
Unresolved CFR references (/us/cfr/) are rendered as plain text. No automatic ecfr.gov fallback URLs are generated.
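The resolution rules for the relative link style can be summarized in a few lines. This sketch mirrors the table above rather than LexBuild's actual resolver. It assumes the caller has already looked up a relative path (or None), and it leaves unresolved FR references as plain text because the federalregister.gov URL pattern is not given in this document. The function name render_reference is hypothetical.

```python
import re
from typing import Optional

USC_SECTION = re.compile(r"^/us/usc/t(?P<title>[^/]+)/s(?P<section>[^/]+)")
USC_FALLBACK = (
    "https://uscode.house.gov/view.xhtml"
    "?req=granuleid:USC-prelim-title{title}-section{section}"
)


def render_reference(text: str, identifier: str, resolved_path: Optional[str] = None) -> str:
    """Render one cross-reference following the prefix rules for relative links."""
    if identifier.startswith(("/us/stat/", "/us/pl/")):
        return text  # Statutes at Large and Public Law are always plain text
    if resolved_path is not None:
        return f"[{text}]({resolved_path})"  # target exists in the converted corpus
    match = USC_SECTION.match(identifier)
    if match:
        return f"[{text}]({USC_FALLBACK.format(**match.groupdict())})"
    # Unresolved CFR references stay plain text. Unresolved FR references are also
    # left as text here (assumption: the FR fallback URL pattern is not documented).
    return text


print(render_reference("section 101 of title 5", "/us/usc/t5/s101"))
# [section 101 of title 5](https://uscode.house.gov/view.xhtml?req=granuleid:USC-prelim-title5-section101)
```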
Metadata Index (_meta.json)
Sidecar JSON index files are generated at section granularity for all sources. USC additionally generates chapter-level indexes. eCFR currently generates _meta.json files only when converting at section granularity; conversions at chapter, part, or title granularity emit no sidecar files. Title granularity uses enriched frontmatter instead. The sidecar indexes enable index-based retrieval without parsing individual Markdown files.
Title-Level Index
For USC output, each title directory contains a _meta.json with aggregate metadata and a listing of all chapters.
USC title-level _meta.json:
{
"format_version": "1.1.0",
"generator": "[email protected]",
"generated_at": "2025-12-03T12:00:00.000Z",
"identifier": "/us/usc/t1",
"title_number": 1,
"title_name": "GENERAL PROVISIONS",
"positive_law": true,
"currency": "119-73",
"release_point": "us/pl/119/73not60",
"source_xml": "usc01.xml",
"granularity": "section",
"stats": {
"chapter_count": 3,
"section_count": 15,
"total_files": 15,
"total_tokens_estimate": 12500
},
"chapters": [
{
"identifier": "/us/usc/t1/ch1",
"number": 1,
"name": "RULES OF CONSTRUCTION",
"directory": "chapter-01",
"sections": [
{
"identifier": "/us/usc/t1/s1",
"number": "1",
"name": "Words denoting number, gender, and so forth",
"file": "section-1.md",
"token_estimate": 250,
"has_notes": true,
"status": "current"
}
]
}
]
}
eCFR title-level _meta.json:
{
"format_version": "1.1.0",
"generator": "[email protected]",
"generated_at": "2025-03-15T12:00:00.000Z",
"identifier": "/us/cfr/t17",
"title_number": 17,
"title_name": "Commodity and Securities Exchanges",
"source": "ecfr",
"legal_status": "authoritative_unofficial",
"currency": "2025-03-15",
"source_xml": "ECFR-title17.xml",
"granularity": "section",
"stats": {
"part_count": 42,
"section_count": 3500,
"total_files": 3500,
"total_tokens_estimate": 2500000
},
"parts": [
{
"identifier": "/us/cfr/t17/pt240",
"number": "240",
"name": "General Rules and Regulations, Securities Exchange Act of 1934",
"directory": "part-240",
"sections": [
{
"identifier": "/us/cfr/t17/s240.10b-5",
"number": "240.10b-5",
"name": "Employment of manipulative and deceptive devices",
"file": "section-240.10b-5.md",
"token_estimate": 150,
"has_notes": false,
"status": "current"
}
]
}
]
}
Chapter-Level Index (USC)
Each chapter directory contains a _meta.json with section listings:
{
"format_version": "1.1.0",
"identifier": "/us/usc/t1/ch1",
"chapter_number": 1,
"chapter_name": "RULES OF CONSTRUCTION",
"title_number": 1,
"section_count": 8,
"sections": [
{
"identifier": "/us/usc/t1/s1",
"number": "1",
"name": "Words denoting number, gender, and so forth",
"file": "section-1.md",
"token_estimate": 250,
"has_notes": true,
"status": "current"
}
]
}
Part-Level Index (eCFR)
Each part directory contains a _meta.json with section listings:
{
"format_version": "1.1.0",
"identifier": "/us/cfr/t17/pt240",
"part_number": "240",
"part_name": "General Rules and Regulations, Securities Exchange Act of 1934",
"title_number": 17,
"section_count": 450,
"sections": [
{
"identifier": "/us/cfr/t17/s240.10b-5",
"number": "240.10b-5",
"name": "Employment of manipulative and deceptive devices",
"file": "section-240.10b-5.md",
"token_estimate": 150,
"has_notes": false,
"status": "current"
}
]
}
Section Entry Schema
Each section entry in a _meta.json sections array contains:
| Field | Type | Description |
|---|---|---|
| identifier | string | Canonical URI identifier |
| number | string | Section number (may be alphanumeric) |
| name | string | Section heading text |
| file | string | Filename within the containing directory |
| token_estimate | number | Estimated token count for the section |
| has_notes | boolean | Whether the section contains editorial or statutory notes |
| status | string | Section status (e.g., "current", "repealed") |
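Section entries make it possible to pick files without opening any Markdown. The sketch below filters a chapter- or part-level index by status and token budget. The first sample entry is taken from the examples above; the second (section 99, repealed) is made up for illustration, and the function name is hypothetical.

```python
import json

# Sample index: the second entry is hypothetical and exists only to show filtering.
SAMPLE_INDEX = json.loads("""
{
  "format_version": "1.1.0",
  "identifier": "/us/usc/t1/ch1",
  "sections": [
    {"identifier": "/us/usc/t1/s1", "number": "1", "file": "section-1.md",
     "token_estimate": 250, "has_notes": true, "status": "current"},
    {"identifier": "/us/usc/t1/s99", "number": "99", "file": "section-99.md",
     "token_estimate": 5, "has_notes": false, "status": "repealed"}
  ]
}
""")


def select_sections(meta: dict, max_tokens: int) -> list[dict]:
    """Return current sections whose token estimate fits within max_tokens."""
    return [
        entry
        for entry in meta["sections"]
        if entry.get("status", "current") == "current"
        and entry["token_estimate"] <= max_tokens
    ]


print([entry["file"] for entry in select_sections(SAMPLE_INDEX, 512)])  # ['section-1.md']
```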
Token Estimation
Token counts use a bytes-divided-by-four heuristic:
token_estimate = Math.ceil(contentLength / 4)
The contentLength is the byte length of the rendered Markdown content (including the YAML frontmatter). This approximation is intentionally simple and errs on the side of overestimation. It is suitable for capacity planning and chunking decisions, not precise billing.
Token estimates appear in three places:
- token_estimate per section in _meta.json section entries.
- total_tokens_estimate in _meta.json stats objects.
- total_token_estimate in title-level enriched frontmatter.
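The heuristic is easy to reproduce outside LexBuild, for example to estimate tokens for a chunk cut from a larger file. A minimal sketch, assuming "byte length" means the UTF-8 encoded length of the text; the function name is hypothetical.

```python
import math


def estimate_tokens(markdown: str) -> int:
    """Byte length divided by four, rounded up."""
    # Assumption: the byte length is measured on the UTF-8 encoding of the content.
    return math.ceil(len(markdown.encode("utf-8")) / 4)


print(estimate_tokens("abcd"))   # 1
print(estimate_tokens("abcde"))  # 2
print(estimate_tokens("§"))      # 1 (two UTF-8 bytes, rounded up)
```

Under this assumption, characters such as § count for more than one byte, so text with many non-ASCII characters yields a higher estimate than a pure character count would.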
Section Status Values
Sections may carry a status field in both frontmatter and _meta.json entries. The status reflects the legal state of the section as recorded in the source XML.
| Status | Meaning | Rendered Content |
|---|---|---|
| current | Active, in-force provision | Full section text |
| repealed | Explicitly repealed by legislation | [Repealed] |
| transferred | Moved to a different location in the code | [Transferred to section N of title N] |
| omitted | Omitted from the code (e.g., expired appropriations) | [Omitted] |
| reserved | Placeholder reserved for future use | [Reserved] |
| renumbered | Renumbered to a different section | Note text indicating new designation |
| redesignated | Redesignated with a new number | Note text indicating new designation |
| expired | Expired by its own terms | [Expired] or note text |
| terminated | Terminated by operation of law | [Terminated] or note text |
| suspended | Temporarily suspended | Note text indicating suspension |
When a section is not current, its frontmatter includes a status field. Current sections omit the field entirely (the absence of status implies "current").
README Files
At section granularity, each title directory receives a README.md providing a human-readable summary table and chapter/part listing. These files are generated artifacts and are not intended for RAG ingestion.
RAG Integration Guidance
Chunking Strategy
The output is designed to align with common RAG chunking strategies:
- Section level: Individual section files range from approximately 500 to 3,000 tokens each, fitting naturally into most embedding models’ context windows. Each file is a self-contained, citable legal provision with rich metadata. This is the recommended granularity for vector storage.
- Chapter/part level: When using chapter or part granularity files, split on # § heading patterns to recover individual sections. Each # § heading begins a new logical unit.
- Title level: Best suited for direct LLM context window injection (e.g., “read Title 1 in its entirety”). Title files can exceed model context limits for large titles (Title 26 or Title 42 produce multi-million-token output). Not recommended for vector storage.
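Recovering sections from a chapter or part file might look like the following. This is a sketch under two assumptions: a section heading is any Markdown heading whose text starts with §, at whatever depth it is rendered, and the file's own frontmatter has already been removed. The second heading in the sample is a made-up placeholder, and the function name is hypothetical.

```python
import re

# Any heading level followed by "§ ", so H1 and H3 section headings both match.
SECTION_HEADING = re.compile(r"^#{1,6} § .*$", re.MULTILINE)


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (heading text, body) pairs, one per § heading in the file."""
    matches = list(SECTION_HEADING.finditer(markdown))
    sections = []
    for index, match in enumerate(matches):
        # A section runs until the next § heading or the end of the file.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(markdown)
        heading = match.group(0).lstrip("#").strip()
        body = markdown[match.end():end].strip()
        sections.append((heading, body))
    return sections


sample = (
    "## CHAPTER 1\n\n"
    "### § 1. Words denoting number, gender, and so forth\n\n"
    "Text of the first section.\n\n"
    "### § 2. Placeholder heading\n\n"
    "Text of the second section.\n"
)
for heading, body in split_sections(sample):
    print(heading, "->", body)
```

Note headings such as "Editorial Notes" do not start with §, so they stay inside the body of the section they belong to. A chapter or subchapter heading that falls between two sections would end up at the tail of the preceding body and may need separate handling.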
Metadata for Vector Stores
When indexing section-level files into a vector database, extract these frontmatter fields as structured metadata for filtering and retrieval:
| Field | Purpose |
|---|---|
| identifier | Unique key; stable across conversions of the same source data |
| source | Filter by corpus ("usc", "ecfr", or "fr") |
| title_number | Filter by title |
| section_number | Section-level lookup |
| legal_status | Filter by legal authority level |
| status | Exclude non-current sections from search results |
| currency | Track data freshness |
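Building the metadata record from parsed frontmatter takes only a few lines. The sketch below assumes the frontmatter has already been parsed into a dictionary; the function name and the METADATA_FIELDS tuple are hypothetical. It fills in "current" when status is absent, since current sections omit that field.

```python
METADATA_FIELDS = (
    "identifier",
    "source",
    "title_number",
    "section_number",
    "legal_status",
    "currency",
)


def vector_metadata(frontmatter: dict) -> dict:
    """Pick the filterable fields and make the implicit 'current' status explicit."""
    metadata = {name: frontmatter[name] for name in METADATA_FIELDS if name in frontmatter}
    metadata["status"] = frontmatter.get("status", "current")
    return metadata


example = {
    "identifier": "/us/usc/t1/s1",
    "source": "usc",
    "legal_status": "official_legal_evidence",
    "title_number": 1,
    "section_number": "1",
    "currency": "119-73",
    "positive_law": True,
}
print(vector_metadata(example)["status"])  # current
```

Storing an explicit status on every record lets a vector store filter on status == "current" without special handling for missing values.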
File Path Stability
Output file paths are deterministic: converting the same source XML at the same granularity always produces the same directory structure and filenames. Paths change only when:
- The source data itself changes (new release point or updated eCFR date).
- The output format version changes.
- The granularity option changes.
This stability makes file paths suitable as document identifiers in vector stores, provided the source version is also tracked.
Programmatic Access via the Data API
The LexBuild Data API provides REST access to the same content, stored in a SQLite database. The lexbuild ingest CLI command populates that database from the section-level output files. All frontmatter fields are available as JSON response fields, and the full Markdown body is retrievable per document. The API supports content negotiation (JSON, Markdown, or plaintext), field selection, full-text search with faceted filtering, and paginated listings with sorting. It is an alternative to direct file ingestion for applications that prefer an HTTP interface over filesystem access.