
This is the authoritative specification for LexBuild’s Markdown output format. Every .md file that LexBuild produces follows this format, which is designed for RAG pipelines, vector databases, and LLM context windows. If you are building a system that consumes LexBuild output, this is the document you need.

Format Versioning

The current format version is 1.1.0, defined by the FORMAT_VERSION constant in @lexbuild/core. Breaking changes to the output format increment the major version.

The format version is recorded in two places:

  • The format_version field in every Markdown file’s YAML frontmatter.
  • The format_version field in every _meta.json sidecar index.
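Because breaking changes increment the major version, a consumer can gate ingestion on the major component of format_version. The following is a minimal sketch, not part of LexBuild; the names parse_version, is_supported, and SUPPORTED_MAJOR are hypothetical.

```python
SUPPORTED_MAJOR = 1  # assumption: this consumer was written against 1.x output


def parse_version(version: str) -> tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_supported(format_version: str) -> bool:
    """Accept a file only if its major version matches the supported one."""
    return parse_version(format_version)[0] == SUPPORTED_MAJOR
```

A pipeline might call is_supported on the value read from frontmatter or _meta.json and skip, or flag, anything that returns False.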

Directory Layout

USC Output

LexBuild supports three output granularities for U.S. Code content. The granularity determines how content is partitioned into files.

Section granularity (default):

output/usc/
├── title-01/
│   ├── chapter-01/
│   │   ├── section-1.md
│   │   ├── section-2.md
│   │   └── _meta.json
│   ├── chapter-02/
│   │   ├── section-101.md
│   │   └── _meta.json
│   ├── _meta.json
│   └── README.md
└── title-54/
    └── ...

Chapter granularity:

output/usc/
├── title-01/
│   ├── chapter-01/
│   │   └── chapter-01.md
│   ├── chapter-02/
│   │   └── chapter-02.md
│   ├── _meta.json
│   └── README.md
└── ...

Title granularity:

output/usc/
├── title-01.md
├── title-02.md
└── ...

Title granularity produces flat files with no subdirectories and no sidecar files. The frontmatter is enriched with aggregate statistics (chapter_count, section_count, total_token_estimate).

eCFR Output

LexBuild supports four output granularities for Code of Federal Regulations content.

Section granularity (default):

output/ecfr/
├── title-01/
│   ├── chapter-I/
│   │   ├── part-1/
│   │   │   ├── section-1.1.md
│   │   │   ├── section-1.2.md
│   │   │   └── _meta.json
│   │   └── part-2/
│   │       ├── section-2.1.md
│   │       └── _meta.json
│   ├── _meta.json
│   └── README.md
└── title-50/
    └── ...

Part granularity:

output/ecfr/
├── title-17/
│   ├── chapter-II/
│   │   ├── part-240.md
│   │   └── part-249.md
│   └── ...
└── ...

Chapter granularity:

output/ecfr/
├── title-17/
│   ├── chapter-I.md
│   ├── chapter-II.md
│   └── ...
└── ...

Title granularity:

output/ecfr/
├── title-01.md
├── title-17.md
└── ...

FR Output

Federal Register documents produce one file per document, organized by publication date:

output/fr/
├── 2026/
│   ├── 01/
│   │   └── 2026-00123.md
│   └── 03/
│       ├── 2026-06029.md
│       └── 2026-06048.md
└── 2025/
    └── ...

No granularity options are available for FR output. FR documents are already atomic (one file per document).

Naming Conventions

| Component | Pattern | Examples | Notes |
| --- | --- | --- | --- |
| Title dir (USC) | title-{NN} | title-01, title-54 | 2-digit zero-padded |
| Title dir (eCFR) | title-{NN} | title-01, title-50 | 2-digit zero-padded |
| Appendix dir | title-{NN}-appendix | title-05-appendix | USC only: titles 5, 11, 18, 28 |
| Chapter dir (USC) | chapter-{NN} | chapter-01, chapter-99 | 2-digit zero-padded |
| Chapter dir (eCFR) | chapter-{X} | chapter-I, chapter-IV | Roman numerals |
| Part dir (eCFR) | part-{N} | part-1, part-240 | Not zero-padded |
| Section file (USC) | section-{ID}.md | section-1.md, section-7801.md, section-202a.md | Not zero-padded; may be alphanumeric |
| Section file (eCFR) | section-{N.N}.md | section-1.1.md, section-240.10b-5.md | Part-prefixed section number |
| Duplicate sections | section-{ID}-2.md | section-3598-2.md | USC only; -2, -3 suffix for subsequent occurrences |
| Year dir (FR) | {YYYY} | 2026 | 4-digit year |
| Month dir (FR) | {MM} | 01, 03 | 2-digit zero-padded month |
| Document file (FR) | {doc_number}.md | 2026-06029.md | FR document number |
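The patterns above can be applied mechanically to predict where a file will land. The sketch below is illustrative only: the function names are hypothetical, USC chapter numbers are assumed to be integers, and paths are relative to the per-source output root (output/usc, output/ecfr, output/fr).

```python
from pathlib import PurePosixPath


def usc_section_path(title: int, chapter: int, section_id: str,
                     occurrence: int = 1) -> PurePosixPath:
    """USC section file; later duplicates get a -2, -3, ... suffix."""
    suffix = "" if occurrence == 1 else f"-{occurrence}"
    return (PurePosixPath(f"title-{title:02d}")
            / f"chapter-{chapter:02d}"
            / f"section-{section_id}{suffix}.md")


def ecfr_section_path(title: int, chapter: str, part: str,
                      section: str) -> PurePosixPath:
    """eCFR section file; chapter is a Roman numeral, part is not padded."""
    return (PurePosixPath(f"title-{title:02d}")
            / f"chapter-{chapter}"
            / f"part-{part}"
            / f"section-{section}.md")


def fr_document_path(year: int, month: int, doc_number: str) -> PurePosixPath:
    """FR document file under year and zero-padded month directories."""
    return PurePosixPath(f"{year:04d}") / f"{month:02d}" / f"{doc_number}.md"
```

For example, ecfr_section_path(17, "II", "240", "240.10b-5") yields title-17/chapter-II/part-240/section-240.10b-5.md.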

Frontmatter Schema

Every output file begins with a YAML frontmatter block delimited by ---. Fields are serialized in a controlled order using double-quoted string values for consistency.
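A consumer needs to separate the frontmatter from the body before doing anything else. The sketch below uses only the standard library and is deliberately limited: it assumes flat key/value lines and simple string lists, and it decodes double-quoted strings with json.loads, which covers common escapes but not every YAML escape. A full YAML parser is the safer choice in production. All function names are hypothetical.

```python
import json


def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (yaml_block, body). Only the first two '---' lines delimit the
    frontmatter; later '---' lines are horizontal rules in the body."""
    lines = text.split("\n")
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a frontmatter block")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    raise ValueError("frontmatter block is not closed")


def parse_scalar(raw: str):
    """Decode a double-quoted string, boolean, or integer."""
    raw = raw.strip()
    if raw.startswith('"'):
        return json.loads(raw)
    if raw in ("true", "false"):
        return raw == "true"
    try:
        return int(raw)
    except ValueError:
        return raw  # leave anything unrecognized as plain text


def parse_frontmatter(block: str) -> dict:
    """Parse 'key: value' lines and '  - item' lists (no nested maps)."""
    data: dict = {}
    current_list = None
    for line in block.split("\n"):
        if not line.strip():
            continue
        if line.lstrip().startswith("- ") and current_list is not None:
            current_list.append(parse_scalar(line.lstrip()[2:]))
            continue
        key, _, value = line.partition(":")
        if value.strip():
            data[key.strip()] = parse_scalar(value)
            current_list = None
        else:
            # A key with no inline value starts a list.
            current_list = data.setdefault(key.strip(), [])
    return data
```

Stopping at the second delimiter matters because the body also uses --- as a horizontal rule before source credits.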

Common Fields

Every file, regardless of source or granularity, includes these fields:

| Field | Type | Description |
| --- | --- | --- |
| identifier | string | Canonical URI identifier (e.g., "/us/usc/t1/s1", "/us/cfr/t17/s240.10b-5") |
| source | string | Content source: "usc", "ecfr", or "fr" |
| legal_status | string | Legal provenance (see Legal Status Values) |
| title | string | Human-readable display title |
| title_number | number | Numeric title designation |
| title_name | string | Title heading text |
| positive_law | boolean | Whether the title is enacted as positive law |
| currency | string | USC: release point identifier (e.g., "119-73"); eCFR: ISO date of the conversion run (e.g., "2025-03-15") |
| last_updated | string | ISO date of the conversion run |
| format_version | string | Output format version ("1.1.0") |
| generator | string | Generator identifier (e.g., "[email protected]") |

USC Section-Level Frontmatter

A complete USC section file includes all common fields plus section-specific context:

---
identifier: "/us/usc/t1/s1"
source: "usc"
legal_status: "official_legal_evidence"
title: "1 USC § 1 - Words denoting number, gender, and so forth"
title_number: 1
title_name: "GENERAL PROVISIONS"
section_number: "1"
section_name: "Words denoting number, gender, and so forth"
chapter_number: 1
chapter_name: "RULES OF CONSTRUCTION"
positive_law: true
currency: "119-73"
last_updated: "2025-12-03"
format_version: "1.1.0"
generator: "[email protected]"
source_credit: "(July 30, 1947, ch. 388, 61 Stat. 633.)"
---

Optional fields that appear when applicable:

| Field | Type | Condition |
| --- | --- | --- |
| subchapter_number | string | Present when section is within a subchapter |
| subchapter_name | string | Present when section is within a subchapter |
| source_credit | string | Present when the section has a source credit annotation |
| status | string | Present for non-current sections (see Section Status Values) |

eCFR Section-Level Frontmatter

eCFR sections include additional regulatory metadata. Note that currency and last_updated reflect the date the conversion was run, not the source data’s last amendment date:

---
identifier: "/us/cfr/t17/s240.10b-5"
source: "ecfr"
legal_status: "authoritative_unofficial"
title: "17 CFR § 240.10b-5 - Employment of manipulative and deceptive devices"
title_number: 17
title_name: "Commodity and Securities Exchanges"
section_number: "240.10b-5"
section_name: "Employment of manipulative and deceptive devices"
chapter_name: "Securities and Exchange Commission"
part_number: "240"
part_name: "General Rules and Regulations, Securities Exchange Act of 1934"
positive_law: false
currency: "2025-03-21"
last_updated: "2025-03-21"
format_version: "1.1.0"
generator: "[email protected]"
authority: "15 U.S.C. 78a et seq."
regulatory_source: "[37 FR 23603, Nov. 4, 1972]"
cfr_part: "240"
---

eCFR-specific optional fields:

| Field | Type | Description |
| --- | --- | --- |
| part_number | string | CFR part number (e.g., "240") |
| part_name | string | Part heading text |
| chapter_number | number | Only set when the chapter designator is a parseable integer (CFR chapters use Roman numerals, which are captured in chapter_name instead) |
| chapter_name | string | Chapter heading text |
| authority | string | Regulatory authority citation (from part-level AUTH element) |
| regulatory_source | string | Publication source (from part-level SOURCE element) |
| cfr_part | string | CFR part number |
| cfr_subpart | string | CFR subpart identifier |
| source_credit | string | Citation for the section (from CITA element) |

FR Document-Level Frontmatter

FR documents include all common fields plus FR-specific metadata. When a JSON sidecar from the API is available, frontmatter is enriched with structured agency, CFR reference, docket, and date information:

---
identifier: "/us/fr/2026-06029"
source: "fr"
legal_status: "authoritative_unofficial"
title: "Amendments to Exchange Act Rule 10b-5"
title_number: 0
title_name: "Federal Register"
section_number: "2026-06029"
section_name: "Amendments to Exchange Act Rule 10b-5"
positive_law: false
currency: "2026-03-28"
last_updated: "2026-03-28"
format_version: "1.1.0"
generator: "[email protected]"
document_number: "2026-06029"
document_type: "rule"
fr_citation: "91 FR 14523"
fr_volume: 91
publication_date: "2026-03-28"
agencies:
  - "Securities and Exchange Commission"
cfr_references:
  - "17 CFR Part 240"
docket_ids:
  - "Release No. 34-99999"
rin: "3235-AM00"
effective_date: "2026-05-27"
fr_action: "Final rule."
---

FR-specific optional fields:

| Field | Type | Description |
| --- | --- | --- |
| document_number | string | FR document number (e.g., "2026-06029") |
| document_type | string | Normalized type: "rule", "proposed_rule", "notice", "presidential_document" |
| fr_citation | string | Full FR citation (e.g., "91 FR 14523") |
| fr_volume | number | FR volume number |
| publication_date | string | Publication date (YYYY-MM-DD) |
| agencies | string[] | Issuing agency names |
| cfr_references | string[] | Affected CFR titles/parts |
| docket_ids | string[] | Docket identifiers |
| rin | string | Regulation Identifier Number |
| effective_date | string | When the rule takes effect |
| comments_close_date | string | Comment period end date (proposed rules) |
| fr_action | string | Action description (e.g., "Final rule.") |

Title-Level Enriched Frontmatter

Title granularity files include aggregate statistics instead of section/chapter context fields:

---
identifier: "/us/usc/t1"
source: "usc"
legal_status: "official_legal_evidence"
title: "Title 1 — GENERAL PROVISIONS"
title_number: 1
title_name: "GENERAL PROVISIONS"
positive_law: true
currency: "119-73"
last_updated: "2025-12-03"
format_version: "1.1.0"
generator: "[email protected]"
chapter_count: 3
section_count: 15
total_token_estimate: 12500
---

| Field | Type | Description |
| --- | --- | --- |
| chapter_count | number | Number of chapters in the title |
| section_count | number | Total sections across all chapters |
| total_token_estimate | number | Estimated token count for the entire title |
| part_count | number | Number of parts (eCFR title-level only) |

Legal Status Values

| Value | Meaning | Applies To |
| --- | --- | --- |
| official_legal_evidence | Positive law titles; the text itself is legal evidence | USC titles enacted as positive law |
| official_prima_facie | Non-positive law titles; prima facie evidence of the law | USC titles not enacted as positive law |
| authoritative_unofficial | Authoritative but not official; derived from official sources | All eCFR and FR content |

Identifier Format

USC identifiers use the canonical URI scheme from USLM identifier attributes:

/us/usc/t{title}                   Title level
/us/usc/t{title}/ch{chapter}       Chapter level
/us/usc/t{title}/s{section}        Section level
/us/usc/t{title}/s{section}/{sub}  Subsection level

CFR identifiers are constructed from eCFR XML attributes and use /us/cfr/ (content type), not /us/ecfr/ (data source):

/us/cfr/t{title}                   Title level
/us/cfr/t{title}/ch{chapter}       Chapter level
/us/cfr/t{title}/pt{part}          Part level
/us/cfr/t{title}/s{section}        Section level

Both eCFR and future annual CFR sources share the /us/cfr/ identifier space.

FR identifiers use document numbers (unique, stable, API primary key):

/us/fr/{document_number}           Document level

For a detailed breakdown of the identifier format and link resolution behavior, see the identifier format reference.
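The identifier schemes above are regular enough to parse with a pair of patterns. This is a sketch under the assumption that identifiers follow exactly the shapes listed here; parse_identifier and the returned key names are hypothetical, not a LexBuild API.

```python
import re

# /us/usc/... and /us/cfr/... share one shape: title, then an optional
# chapter, part, or section (the section may carry a subsection path).
_CODE_ID = re.compile(
    r"^/us/(?P<corpus>usc|cfr)/t(?P<title>[^/]+)"
    r"(?:/(?:ch(?P<chapter>[^/]+)|pt(?P<part>[^/]+)"
    r"|s(?P<section>[^/]+)(?:/(?P<sub>.+))?))?$"
)
_FR_ID = re.compile(r"^/us/fr/(?P<document_number>[^/]+)$")


def parse_identifier(identifier: str) -> dict:
    """Break an identifier into its components plus a 'level' label."""
    match = _FR_ID.match(identifier)
    if match:
        return {"corpus": "fr", "level": "document", **match.groupdict()}
    match = _CODE_ID.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized identifier: {identifier}")
    parts = {k: v for k, v in match.groupdict().items() if v is not None}
    if "section" in parts:
        parts["level"] = "subsection" if "sub" in parts else "section"
    elif "part" in parts:
        parts["level"] = "part"
    elif "chapter" in parts:
        parts["level"] = "chapter"
    else:
        parts["level"] = "title"
    return parts
```

Title and section values are kept as strings because section numbers may be alphanumeric (for example 240.10b-5).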

Content Structure

Section Heading

Every section file begins with a level-1 heading displaying the section number and name:

# § 1. Words denoting number, gender, and so forth

For eCFR content:

# § 240.10b-5 Employment of manipulative and deceptive devices

Inline Hierarchy (Small Levels)

Subsections and all levels below use bold inline numbering rather than Markdown headings. This is a deliberate design choice: headings would imply document structure, but legal subsections are subordinate to the section and should not appear in a table of contents.

**(a)** For the purposes of any Federal law, an individual shall be
considered married if that individual's marriage is between 2 individuals
and is valid in the State where the marriage was entered into.

**(b)** In this section, the term "State" means a State, the District of
Columbia, the Commonwealth of Puerto Rico, or any other territory or
possession of the United States.

When a subsection has a heading, it follows the number in bold:

**(a)** **In general.** — The Secretary shall prescribe regulations...

The numbering scheme communicates hierarchical depth:

| Level | Style | Example |
| --- | --- | --- |
| Subsection | Lowercase letter | **(a)** |
| Paragraph | Arabic numeral | **(1)** |
| Subparagraph | Uppercase letter | **(A)** |
| Clause | Lowercase Roman numeral | **(i)** |
| Subclause | Uppercase Roman numeral | **(I)** |
| Item | Double lowercase | **(aa)** |
| Subitem | Double uppercase | **(AA)** |
| Subsubitem | Triple lowercase | **(aaa)** |

Content is never indented with leading spaces. Markdown indentation would create code blocks, defeating the purpose. Hierarchy is communicated exclusively through the numbering scheme.
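Since hierarchy lives in the bold markers, a consumer that wants sub-section chunks can detect them at the start of a line. The sketch below extracts the marker and an optional bold heading; it assumes the marker is always the first thing on the line, and leading_marker is a hypothetical helper. Note that a marker such as (i) is ambiguous on its own (subsection letter or Roman-numeral clause), so depth is best tracked from the surrounding sequence rather than from the marker style alone.

```python
import re

# A bold parenthesized marker, optionally followed by a bold heading that
# does not itself start with "(" (so a second marker is not taken as one).
_MARKER = re.compile(
    r"^\*\*\((?P<marker>[0-9A-Za-z]+)\)\*\*"
    r"(?:\s+\*\*(?P<heading>[^*(][^*]*)\*\*)?"
)


def leading_marker(line: str):
    """Return (marker, heading or None), or None if the line has no marker."""
    match = _MARKER.match(line)
    if match is None:
        return None
    return match["marker"], match["heading"]
```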

Title-Level Heading Hierarchy

When rendering at title or chapter granularity (multiple sections in a single file), structural headings use an increasing depth:

| Element | Heading Level |
| --- | --- |
| Title | # (H1) |
| Chapter | ## (H2) |
| Section | ### (H3) |
| Subchapter | ## or ### depending on nesting |

Structural headings cap at H5. Big-level headings that would exceed H5 render as bold text instead.

Source Credits

Source credits are separated from the body by a horizontal rule and rendered with a bold label:

---

**Source Credit**: (July 30, 1947, ch. 388, 61 Stat. 633.)

Notes

Notes appear after the source credit. Cross-heading notes that categorize groups of notes render as level-2 headings. Individual note headings render as level-3 headings:

## Editorial Notes

### Amendments

2022—Pub. L. 117–228 amended section generally.

1996—Pub. L. 104–199 added this section.

## Statutory Notes and Related Subsidiaries

### Severability

If any provision of this Act is held to be unconstitutional,
the remainder shall not be affected.

Notes are included by default and can be controlled with CLI flags:

  • --no-include-notes disables all notes.
  • --include-editorial-notes enables only editorial notes.
  • --include-statutory-notes enables only statutory notes.
  • --include-amendments enables only amendment history.

Quoted Content

Quoted legal text (from <quotedContent> elements, typically quoted bills in statutory notes) renders as Markdown blockquotes:

> (a) The Secretary shall establish a program...
>
> (b) The program shall include...

Footnotes

Footnotes use Markdown footnote syntax. References appear inline as [^N] and definitions appear at the bottom of the section file:

The term applies to all cases[^1] under this section.

[^1]: As defined in section 101 of title 5.

Defined Terms

Terms being defined (from <term> elements) render in bold:

The term **employee** means an individual employed by the Government.

Inline Formatting

| Source Element | Markdown Output |
| --- | --- |
| <b> / bold | **text** |
| <i> / italic | *text* |
| <sup> | <sup>text</sup> |
| <sub> | <sub>text</sub> |
| <term> | **text** |
| <quotedContent> | > blockquote |

Tables

Simple Tables

Tables without colspan or rowspan render as standard Markdown pipe tables:

| Rate | Amount | Date |
| --- | --- | --- |
| Basic | $100 | 2024-01-01 |
| Premium | $250 | 2024-06-01 |
| Enterprise | $500 | 2024-12-01 |

Pipe characters within cell content are escaped as \|. Backslashes are escaped as \\.
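The order of the two escapes matters: backslashes must be doubled before pipes are escaped, or the backslash introduced for a pipe would itself be doubled. The sketch below illustrates the rule and its inverse for consumers that parse cells back out; both function names are hypothetical and this is not LexBuild's own code.

```python
def escape_cell(text: str) -> str:
    """Escape backslashes first, then pipes, as described above."""
    return text.replace("\\", "\\\\").replace("|", "\\|")


def unescape_cell(text: str) -> str:
    """Reverse escape_cell by consuming one escape pair at a time."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in "\\|":
            out.append(text[i + 1])  # keep the escaped character only
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```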

Layout Tables

USLM <layout> elements (column-oriented display, common in pay schedules) also render as Markdown pipe tables when their structure is compatible:

| Grade | Step 1 | Step 2 |
| --- | --- | --- |
| GS-1 | $20,000 | $21,000 |
| GS-2 | $25,000 | $26,500 |

Complex Tables

Tables with colspan, rowspan, or other features that do not map cleanly to Markdown pipe syntax are rendered as best-effort pipe tables. Complex layout features (such as multi-column headers or cells spanning multiple rows) may be flattened or approximated. If you require lossless table structure, refer to the source XML.

Cross-References

Cross-reference rendering is controlled by the --link-style option. Three styles are available:

Plaintext (default)

References render as unlinked text:

section 101 of title 5

Relative

References outside the converted corpus fall back to an external URL (USC, FR) or to plain text (CFR). References to sections within the corpus resolve to relative Markdown links:

[section 101 of title 5](../title-05/chapter-01/section-101.md)

The link resolver uses a two-pass approach: all section identifiers are registered before any rendering occurs, enabling both forward and backward cross-references to resolve.
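The two-pass idea can be sketched as a registry that is filled completely before any link is rendered. This is an illustration of the approach, not LexBuild's implementation: LinkResolver and render_link are hypothetical names, the paths are relative to the output root, and fallback URLs are omitted, so unresolved references simply render as plain text here.

```python
import posixpath


class LinkResolver:
    """Pass 1 registers identifier -> output path; pass 2 resolves links."""

    def __init__(self) -> None:
        self._paths: dict[str, str] = {}

    def register(self, identifier: str, path: str) -> None:
        self._paths[identifier] = path

    def resolve(self, identifier: str, from_path: str):
        """Relative href from the file at from_path, or None if unknown."""
        target = self._paths.get(identifier)
        if target is None:
            return None
        # Anchor both paths at "/" so the result ignores the working directory.
        start = "/" + posixpath.dirname(from_path)
        return posixpath.relpath("/" + target, start=start)


def render_link(resolver: LinkResolver, text: str, identifier: str,
                from_path: str) -> str:
    href = resolver.resolve(identifier, from_path)
    return f"[{text}]({href})" if href else text
```

Because registration finishes before rendering starts, a reference in an early file to a section that appears later in the corpus resolves just as well as a backward reference.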

Canonical

USC references link to the OLRC website (uscode.house.gov):

[section 101 of title 5](https://uscode.house.gov/view.xhtml?req=granuleid:USC-prelim-title5-section101)

| Identifier Prefix | Resolved As |
| --- | --- |
| /us/usc/ | Relative link when resolved; otherwise OLRC fallback URL |
| /us/cfr/ | Relative link when resolved; otherwise plain text |
| /us/fr/ | Relative link when resolved; otherwise federalregister.gov fallback URL |
| /us/stat/ | Always plain text (Statutes at Large) |
| /us/pl/ | Always plain text (Public Law) |

Fallback URLs for unresolved references:

  • USC: https://uscode.house.gov/view.xhtml?req=granuleid:USC-prelim-title{N}-section{N}

Unresolved CFR references (/us/cfr/) are rendered as plain text. No automatic ecfr.gov fallback URLs are generated.
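The USC fallback pattern can be filled in directly from a section identifier. The helper below is a hypothetical sketch that follows the URL template above; it returns None for anything that is not a USC section identifier, mirroring the plain-text behavior described for CFR references.

```python
import re

_USC_SECTION = re.compile(r"^/us/usc/t(?P<title>[^/]+)/s(?P<section>[^/]+)")


def usc_fallback_url(identifier: str):
    """Build the OLRC URL for a USC section identifier, else return None."""
    match = _USC_SECTION.match(identifier)
    if match is None:
        return None
    return (
        "https://uscode.house.gov/view.xhtml?req=granuleid:"
        f"USC-prelim-title{match['title']}-section{match['section']}"
    )
```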

Metadata Index (_meta.json)

Sidecar JSON index files are generated at section granularity for all sources. USC additionally generates chapter-level indexes. eCFR currently generates _meta.json only when converting at section granularity; its part, chapter, and title granularities emit no sidecar files. Title granularity uses enriched frontmatter instead. These files enable index-based retrieval without parsing individual Markdown files.

Title-Level Index

For USC output, each title directory contains a _meta.json with aggregate metadata and a listing of all chapters.

USC title-level _meta.json:

{
  "format_version": "1.1.0",
  "generator": "[email protected]",
  "generated_at": "2025-12-03T12:00:00.000Z",
  "identifier": "/us/usc/t1",
  "title_number": 1,
  "title_name": "GENERAL PROVISIONS",
  "positive_law": true,
  "currency": "119-73",
  "release_point": "us/pl/119/73not60",
  "source_xml": "usc01.xml",
  "granularity": "section",
  "stats": {
    "chapter_count": 3,
    "section_count": 15,
    "total_files": 15,
    "total_tokens_estimate": 12500
  },
  "chapters": [
    {
      "identifier": "/us/usc/t1/ch1",
      "number": 1,
      "name": "RULES OF CONSTRUCTION",
      "directory": "chapter-01",
      "sections": [
        {
          "identifier": "/us/usc/t1/s1",
          "number": "1",
          "name": "Words denoting number, gender, and so forth",
          "file": "section-1.md",
          "token_estimate": 250,
          "has_notes": true,
          "status": "current"
        }
      ]
    }
  ]
}

eCFR title-level _meta.json:

{
  "format_version": "1.1.0",
  "generator": "[email protected]",
  "generated_at": "2025-03-15T12:00:00.000Z",
  "identifier": "/us/cfr/t17",
  "title_number": 17,
  "title_name": "Commodity and Securities Exchanges",
  "source": "ecfr",
  "legal_status": "authoritative_unofficial",
  "currency": "2025-03-15",
  "source_xml": "ECFR-title17.xml",
  "granularity": "section",
  "stats": {
    "part_count": 42,
    "section_count": 3500,
    "total_files": 3500,
    "total_tokens_estimate": 2500000
  },
  "parts": [
    {
      "identifier": "/us/cfr/t17/pt240",
      "number": "240",
      "name": "General Rules and Regulations, Securities Exchange Act of 1934",
      "directory": "part-240",
      "sections": [
        {
          "identifier": "/us/cfr/t17/s240.10b-5",
          "number": "240.10b-5",
          "name": "Employment of manipulative and deceptive devices",
          "file": "section-240.10b-5.md",
          "token_estimate": 150,
          "has_notes": false,
          "status": "current"
        }
      ]
    }
  ]
}

Chapter-Level Index (USC)

Each chapter directory contains a _meta.json with section listings:

{
  "format_version": "1.1.0",
  "identifier": "/us/usc/t1/ch1",
  "chapter_number": 1,
  "chapter_name": "RULES OF CONSTRUCTION",
  "title_number": 1,
  "section_count": 8,
  "sections": [
    {
      "identifier": "/us/usc/t1/s1",
      "number": "1",
      "name": "Words denoting number, gender, and so forth",
      "file": "section-1.md",
      "token_estimate": 250,
      "has_notes": true,
      "status": "current"
    }
  ]
}

Part-Level Index (eCFR)

Each part directory contains a _meta.json with section listings:

{
  "format_version": "1.1.0",
  "identifier": "/us/cfr/t17/pt240",
  "part_number": "240",
  "part_name": "General Rules and Regulations, Securities Exchange Act of 1934",
  "title_number": 17,
  "section_count": 450,
  "sections": [
    {
      "identifier": "/us/cfr/t17/s240.10b-5",
      "number": "240.10b-5",
      "name": "Employment of manipulative and deceptive devices",
      "file": "section-240.10b-5.md",
      "token_estimate": 150,
      "has_notes": false,
      "status": "current"
    }
  ]
}

Section Entry Schema

Each section entry in a _meta.json sections array contains:

| Field | Type | Description |
| --- | --- | --- |
| identifier | string | Canonical URI identifier |
| number | string | Section number (may be alphanumeric) |
| name | string | Section heading text |
| file | string | Filename within the containing directory |
| token_estimate | number | Estimated token count for the section |
| has_notes | boolean | Whether the section contains editorial or statutory notes |
| status | string | Section status (e.g., "current", "repealed") |
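With this schema, a consumer can pick files to ingest from the index alone. The sketch below walks either a title-level index (sections grouped under chapters or parts) or a chapter/part-level index (a flat sections array) and keeps current sections under a token budget. The function names are hypothetical, and the sample is abbreviated; its second entry is invented purely to exercise the status filter.

```python
import json


def iter_sections(index: dict):
    """Yield (directory, entry) pairs; directory is '' for flat indexes."""
    # Chapter- and part-level indexes list sections directly.
    for entry in index.get("sections", []):
        yield "", entry
    # Title-level indexes group sections under "chapters" or "parts".
    for group_key in ("chapters", "parts"):
        for group in index.get(group_key, []):
            for entry in group.get("sections", []):
                yield group["directory"], entry


def select_current(index: dict, max_tokens: int) -> list:
    """Return (directory, file) for current sections within max_tokens."""
    selected = []
    for directory, entry in iter_sections(index):
        if entry.get("status", "current") != "current":
            continue
        if entry["token_estimate"] > max_tokens:
            continue
        selected.append((directory, entry["file"]))
    return selected


# Abbreviated, partly invented sample in the shape of a title-level index.
sample = json.loads("""
{
  "identifier": "/us/usc/t1",
  "chapters": [
    {"directory": "chapter-01", "sections": [
      {"file": "section-1.md", "token_estimate": 250, "status": "current"},
      {"file": "section-example.md", "token_estimate": 40, "status": "repealed"}
    ]}
  ]
}
""")
```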

Token Estimation

Token counts use a character-divided-by-four heuristic:

token_estimate = Math.ceil(contentLength / 4)

The contentLength is the byte length of the rendered Markdown content (including the YAML frontmatter). This approximation is intentionally simple and errs on the side of overestimation. It is suitable for capacity planning and chunking decisions, not precise billing.
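For consumers that want to reproduce or sanity-check the estimate outside of LexBuild, an equivalent can be written in a few lines. This sketch follows the byte-length wording above and assumes UTF-8 encoding, which is an assumption rather than something this specification states; estimate_tokens is a hypothetical name.

```python
import math


def estimate_tokens(markdown: str) -> int:
    """Approximate the token count as ceil(byte length / 4)."""
    content_length = len(markdown.encode("utf-8"))  # assumption: UTF-8 bytes
    return math.ceil(content_length / 4)
```

The input should be the full rendered file, frontmatter included, to match the definition of contentLength.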

Token estimates appear in three places:

  1. token_estimate per section in _meta.json section entries.
  2. total_tokens_estimate in _meta.json stats objects.
  3. total_token_estimate in title-level enriched frontmatter.

Section Status Values

Sections may carry a status field in both frontmatter and _meta.json entries. The status reflects the legal state of the section as recorded in the source XML.

| Status | Meaning | Rendered Content |
| --- | --- | --- |
| current | Active, in-force provision | Full section text |
| repealed | Explicitly repealed by legislation | [Repealed] |
| transferred | Moved to a different location in the code | [Transferred to section N of title N] |
| omitted | Omitted from the code (e.g., expired appropriations) | [Omitted] |
| reserved | Placeholder reserved for future use | [Reserved] |
| renumbered | Renumbered to a different section | Note text indicating new designation |
| redesignated | Redesignated with a new number | Note text indicating new designation |
| expired | Expired by its own terms | [Expired] or note text |
| terminated | Terminated by operation of law | [Terminated] or note text |
| suspended | Temporarily suspended | Note text indicating suspension |

When a section is not current, its frontmatter includes a status field. Current sections omit the field entirely (the absence of status implies "current").

README Files

At section granularity, each title directory receives a README.md providing a human-readable summary table and chapter/part listing. These files are generated artifacts and are not intended for RAG ingestion.

RAG Integration Guidance

Chunking Strategy

The output is designed to align with common RAG chunking strategies:

  • Section level: Most individual section files fall between roughly 500 and 3,000 tokens, which fits within most embedding models’ context windows. Each file is a self-contained, citable legal provision with rich metadata. This is the recommended granularity for vector storage.

  • Chapter/part level: When using chapter or part granularity files, split on section headings (heading lines whose text starts with §, such as ### § 1.) to recover individual sections. Each section heading begins a new logical unit.

  • Title level: Best suited for direct LLM context window injection (e.g., “read Title 1 in its entirety”). Title files can exceed model context limits for large titles (Title 26 or Title 42 produce multi-million-token output). Not recommended for vector storage.
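Recovering sections from a multi-section file can be sketched as a split on heading lines that start with §. The pattern below accepts any heading depth from # to ######, which is an assumption made so the same code works regardless of nesting; split_sections is a hypothetical helper. One known limitation: a structural heading that follows the last section of a chapter (for example the next chapter's heading) stays attached to that section's chunk.

```python
import re

# A Markdown heading of any depth whose text begins with the section sign.
_SECTION_HEADING = re.compile(r"^#{1,6} § ", re.MULTILINE)


def split_sections(body: str) -> list[str]:
    """Return one chunk per section heading; text before the first is dropped."""
    starts = [m.start() for m in _SECTION_HEADING.finditer(body)]
    ends = starts[1:] + [len(body)]
    return [body[start:end].rstrip() for start, end in zip(starts, ends)]
```

Note headings such as ## Editorial Notes do not match the pattern, so notes remain inside the chunk of the section they belong to.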

Metadata for Vector Stores

When indexing section-level files into a vector database, extract these frontmatter fields as structured metadata for filtering and retrieval:

| Field | Purpose |
| --- | --- |
| identifier | Unique key; stable across conversions of the same source data |
| source | Filter by corpus ("usc", "ecfr", or "fr") |
| title_number | Filter by title |
| section_number | Section-level lookup |
| legal_status | Filter by legal authority level |
| status | Exclude non-current sections from search results |
| currency | Track data freshness |
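Given a parsed frontmatter mapping, the metadata record for a vector store can be built by selecting these fields. The sketch below assumes the frontmatter has already been parsed into a dict; vector_metadata and METADATA_FIELDS are hypothetical names. It fills in status as "current" when the field is absent, following the rule that current sections omit it.

```python
METADATA_FIELDS = (
    "identifier",
    "source",
    "title_number",
    "section_number",
    "legal_status",
    "status",
    "currency",
)


def vector_metadata(frontmatter: dict) -> dict:
    """Select the filterable fields and make the implicit status explicit."""
    metadata = {
        field: frontmatter[field]
        for field in METADATA_FIELDS
        if field in frontmatter
    }
    # Current sections omit the status field in frontmatter.
    metadata.setdefault("status", "current")
    return metadata
```

Storing an explicit status lets a retrieval filter such as status == "current" work without special handling for missing keys.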

File Path Stability

Output file paths are deterministic: converting the same source XML at the same granularity always produces the same directory structure and filenames. Paths change only when:

  • The source data itself changes (new release point or updated eCFR date).
  • The output format version changes.
  • The granularity option changes.

This stability makes file paths suitable as document identifiers in vector stores, provided the source version is also tracked.

Programmatic Access via the Data API

The LexBuild Data API provides REST access to the same content stored in a SQLite database. The lexbuild ingest CLI command populates the database from the section-level output files. All frontmatter fields are available as JSON response fields, and the full Markdown body is retrievable per document. The API supports content negotiation (JSON, Markdown, or plaintext), field selection, full-text search with faceted filtering, and paginated listings with sorting. This is an alternative to direct file ingestion for applications that prefer an HTTP interface over filesystem access.