# Dataset Schema This page documents the directory structure and JSON schemas for the FormulationBench dataset. The `formulation-bench` Python package is the recommended way to work with the dataset. This schema reference is provided for those modifying or extending the dataset (see {doc}`user_guide/index`). (directory-structure)= ## Directory Structure Here is a summary of the directory structure: ``` dataset/ ├── lakefile.toml # Lean project configuration ├── lake-manifest.json # Pinned Lean library dependencies ├── lean-toolchain # Pinned Lean version ├── Common.lean # Lean definitions ├── dataset.json # Problem IDs and reformulation pairs ├── problems/ # One subdirectory per problem │ ├── p1/ │ │ ├── description.md # Natural-language problem description │ │ ├── problem.json # Problem name, data parameters, metadata │ │ ├── data.json # Concrete parameter values for one instance │ │ ├── solution.json # Reference optimal solution and objective │ │ └── formulations/ │ │ ├── a/ │ │ │ ├── formulation.json # Structured MILP formulation │ │ │ ├── gen_params.py # Maps data.json to MILP parameters │ │ │ └── Formulation.lean # Lean 4 MILP formulation │ │ └── ... │ └── ... └── reformulations/ # Lean 4 reformulation proofs ├── p1/ │ ├── a_b.lean # Proof that formulation b is a reformulation of a │ └── ... └── ... ``` The root of the dataset directory contains: - Lean project files: `lakefile.toml`, `lake-manifest.json`, `lean-toolchain` (see {doc}`/user_guide/build_lean`) - Lean formulation and reformulation definitions: `Common.lean` (see {doc}`/definitions`) - A manifest of the problems and reformulations contained in the dataset: `dataset.json` - The problems subdirectory (see {ref}`problem-directory`) - The reformulations subdirectory (see {ref}`reformulation-pairs-and-proofs`) The {class}`Dataset ` loads in this dataset directory. :::{note} Two files appear inside a formulation directory at runtime but are *not* shipped with the dataset: `parameters.json`, written by {meth}`Formulation.run_gen_params() `, and `solve.py`, written by {meth}`Formulation.gen_solve_py() `. Both are derived from the files above and can be regenerated at any time. ::: ## `dataset.json` This is the manifest at the dataset root. It has two fields: - **`problems`** — list of integer problem IDs (e.g. `1` for `p1`). Each ID has a `problems/pN/` directory. - **`reformulations`** — flat list of labelled formulation pairs (see {ref}`reformulation-pairs-and-proofs`) (problem-directory)= ## Problem Directory Each `problems/pN/` directory describes one optimization problem and loads into a {class}`Problem `. ### `description.md` A self-contained natural-language description of the optimization problem, exposed as the `description` attribute of {class}`Problem `. It is also interpolated into the `problem_description` block when a formulation is rendered with {meth}`Formulation.render_markdown() `. :::{dropdown} `problems/p1/description.md` :icon: code ```{literalinclude} ../../../dataset/problems/p1/description.md :language: markdown :class: wrap ``` ::: ### `problem.json` Defines the problem `name`, its data `parameters`, and freeform `metadata`: - **`name`** — human-readable problem name. - **`parameters`** — schema of the problem's data parameters, keyed by name. Each value is a {class}`Parameter ` with a `description`, `type`, and `shape` (see {ref}`variable-shape-notation`). - **`metadata`** — freeform; typically a `source` field recording which dataset the problem was adapted from and a `notes` field with commentary. These populate the source and notes blocks on the {doc}`/problems/index` pages. :::{dropdown} `problems/p1/problem.json` :icon: code ```{literalinclude} ../../../dataset/problems/p1/problem.json :language: json ``` ::: ### `data.json` A single concrete instance: a JSON object mapping each parameter name in `problem.json` to a value. The same instance is reused across every formulation of the problem (each formulation's `gen_params.py` translates it into formulation-specific parameters). Exposed as the `data` attribute of {class}`Problem `. :::{dropdown} `problems/p1/data.json` :icon: code ```{literalinclude} ../../../dataset/problems/p1/data.json :language: json ``` ::: ### `solution.json` A reference optimal solution for the instance in `data.json`, with `variables` (values keyed by name) and `objective` (the optimal value). It loads into a {class}`Solution `, exposed as the `solution` attribute of {class}`Problem `. Variable values are specific to a *formulation*; by convention they use the variable names of formulation `a`. :::{dropdown} `problems/p1/solution.json` :icon: code ```{literalinclude} ../../../dataset/problems/p1/solution.json :language: json ``` ::: (formulation-directory)= ## Formulation Directory Each `problems/pN/formulations/x/` directory describes one MILP formulation of the parent problem and loads into a {class}`Formulation `. ### `formulation.json` The structured description of the MILP. Each field maps onto an attribute of {class}`Formulation `: - **`valid`** — boolean indicating whether the formulation is a faithful formulation of the parent problem. - **`parameters`** — parameters of the MILP formulation, each a {class}`Parameter `. - **`assumptions`** — list of {class}`Assumption ` records, each with a `description`, a LaTeX `formulation`, an `explicit` flag, and `code.python`. The `explicit` flag is `true` when the assumption is stated explicitly in the original problem text and `false` when it is implicit (e.g. non-negativity of an obviously physical rate). Assumptions constrain *parameters*. - **`definitions`** — *(optional)* ordered map of named derived quantities computed from parameters before variables are declared, each a {class}`Definition `. Typical uses: big-M constants and pre-computed index sets referenced in variable or constraint code. - **`variables`** — decision variables keyed by name, each a {class}`Variable ` with a `description`, `type` (`"integer"`, `"continuous"`, or `"binary"`), and a `shape` — optionally accompanied by `indices` (see {ref}`variable-shape-notation`). - **`constraints`** — list of {class}`Constraint ` records, each with a `description`, a LaTeX `formulation`, an `explicit` flag, and `code.gurobipy`. The `explicit` flag is `false` for implied constraints such as non-negativity bounds. - **`objective`** — a single {class}`Objective ` with a `description`, a LaTeX `formulation`, and `code.gurobipy`. - **`imports`** — *(optional)* list of additional Python import statements emitted into the generated `solve.py` (e.g. `["import math"]`). The `code.python` / `code.gurobipy` snippets are what {meth}`Formulation.gen_solve_py() ` assembles into a runnable solver script. :::{dropdown} `problems/p12/formulations/a/formulation.json` :icon: code ```{literalinclude} ../../../dataset/problems/p12/formulations/a/formulation.json :language: json ``` ::: ### `gen_params.py` A Python script that reads the problem's `data.json` and writes a `parameters.json` holding this formulation's parameters. It keeps a single source of instance data per problem while letting each formulation define its own parametrization. The script accepts the input `data.json` and output `parameters.json` paths as positional arguments and is run by {meth}`Formulation.run_gen_params() `. :::{dropdown} `problems/p1/formulations/b/gen_params.py` :icon: code ```{literalinclude} ../../../dataset/problems/p1/formulations/b/gen_params.py :language: python ``` ::: ### `Formulation.lean` A Lean 4 encoding of the MILP formulation as a `MILPFormulation` (defined in `Common.lean`). {ref}`formulation-definition` describes how a formulation is encoded in Lean; the path to this file is exposed as the `lean_formulation_path` attribute of {class}`Formulation `. :::{dropdown} `problems/p1/formulations/a/Formulation.lean` :icon: code ```{literalinclude} ../../../dataset/problems/p1/formulations/a/Formulation.lean :language: lean ``` ::: (reformulation-pairs-and-proofs)= ## Reformulation Pairs and Proofs The `reformulations` field of `dataset.json` is a flat list of formulation pairs. Each entry names two formulations `a` and `b` and labels whether `b` is a reformulation of `a`: ```json { "a": {"problem": 1, "formulation": "a"}, "b": {"problem": 1, "formulation": "b"}, "reformulation": true } ``` Each entry loads into a {class}`Reformulation `. For every *positive* pair (`"reformulation": true`) there is a corresponding Lean file `reformulations/pN/a_b.lean` constructing a `MILPReformulation` instance (see {ref}`reformulation-definition`). The path to the Lean file is exposed as the `lean_proof_path` attribute. Pairs that are *not* reformulations of one another have no Lean file. :::{dropdown} `reformulations/p1/a_b.lean` :icon: code ```{literalinclude} ../../../dataset/reformulations/p1/a_b.lean :language: lean ``` ::: (variable-shape-notation)= ## Variable Shape Notation The `shape` field of a variable and a parameter is a list of strings encoding the index dimensions. `formulation-bench` parses it into a {class}`Shape `, an ordered sequence of typed {class}`Dimension ` objects. An empty list `[]` denotes a **scalar**. Every dimension is one of four {class}`DimensionType ` types: - **Fixed** (`DimensionType.fixed`) — a string naming a scalar parameter (e.g., `"n"`, `"T"`): the dimension ranges over `range(param)`. - **Expression** (`DimensionType.expression`) — a string containing an arithmetic expression over scalar parameters (e.g., `"K+N"`): used as-is in generated code. - **Ragged** (`DimensionType.ragged`) — a string of the form `"X[Y]"` where `X` is a scalar array and `Y` is another dimension already in the shape: the size of this dimension depends on the value of `X` at the outer index `Y`. For example, `["n_G", "n_S[n_G]"]` means for each generator `g in range(n_G)`, the second dimension ranges over `range(n_S[g])`. Ragged variables are represented as dicts in generated code. - **Cardinality** (`DimensionType.cardinality`) — a string of the form `"|X|"` denotes the cardinality of set `X`. When an `|X|` dimension appears, the variable must also provide an `indices` expression. The {class}`Shape `, {class}`Dimension `, {class}`Parameter `, and {class}`Variable ` API references give worked examples of each case.