Dataset Schema¶

This page documents the directory structure and JSON schemas for the FormulationBench dataset. The formulation-bench Python package is the recommended way to work with the dataset. This schema reference is provided for those modifying or extending the dataset (see User Guides).

Directory Structure¶

Here is a summary of the directory structure:

dataset/
├── lakefile.toml            # Lean project configuration
├── lake-manifest.json       # Pinned Lean library dependencies
├── lean-toolchain           # Pinned Lean version
├── Common.lean              # Lean definitions
├── dataset.json             # Problem IDs and reformulation pairs
├── problems/                # One subdirectory per problem
│   ├── p1/
│   │   ├── description.md   # Natural-language problem description
│   │   ├── problem.json     # Problem name, data parameters, metadata
│   │   ├── data.json        # Concrete parameter values for one instance
│   │   ├── solution.json    # Reference optimal solution and objective
│   │   └── formulations/
│   │       ├── a/
│   │       │   ├── formulation.json   # Structured MILP formulation
│   │       │   ├── gen_params.py      # Maps data.json to MILP parameters
│   │       │   └── Formulation.lean   # Lean 4 MILP formulation
│   │       └── ...
│   └── ...
└── reformulations/          # Lean 4 reformulation proofs
    ├── p1/
    │   ├── a_b.lean         # Proof that formulation b is a reformulation of a
    │   └── ...
    └── ...

The root of the dataset directory contains:

Lean project files: lakefile.toml, lake-manifest.json, lean-toolchain (see Compiling Lean files)
Lean formulation and reformulation definitions: Common.lean (see Definitions)
A manifest of the problems and reformulations contained in the dataset: dataset.json
The problems subdirectory (see Problem Directory)
The reformulations subdirectory (see Reformulation Pairs and Proofs)

The Dataset loads in this dataset directory.

Note

Two files appear inside a formulation directory at runtime but are not shipped with the dataset: parameters.json, written by Formulation.run_gen_params(), and solve.py, written by Formulation.gen_solve_py(). Both are derived from the files above and can be regenerated at any time.

`dataset.json`¶

This is the manifest at the dataset root. It has two fields:

problems — list of integer problem IDs (e.g. 1 for p1). Each ID has a problems/pN/ directory.
reformulations — flat list of labelled formulation pairs (see Reformulation Pairs and Proofs)

Problem Directory¶

Each problems/pN/ directory describes one optimization problem and loads into a Problem.

`description.md`¶

A self-contained natural-language description of the optimization problem, exposed as the description attribute of Problem. It is also interpolated into the problem_description block when a formulation is rendered with Formulation.render_markdown().

`problem.json`¶

Defines the problem name, its data parameters, and freeform metadata:

name — human-readable problem name.
parameters — schema of the problem’s data parameters, keyed by name. Each value is a Parameter with a description, type, and shape (see Variable Shape Notation).
metadata — freeform; typically a source field recording which dataset the problem was adapted from and a notes field with commentary. These populate the source and notes blocks on the Problems pages.

`data.json`¶

A single concrete instance: a JSON object mapping each parameter name in problem.json to a value. The same instance is reused across every formulation of the problem (each formulation’s gen_params.py translates it into formulation-specific parameters). Exposed as the data attribute of Problem.

`solution.json`¶

A reference optimal solution for the instance in data.json, with variables (values keyed by name) and objective (the optimal value). It loads into a Solution, exposed as the solution attribute of Problem. Variable values are specific to a formulation; by convention they use the variable names of formulation a.

Formulation Directory¶

Each problems/pN/formulations/x/ directory describes one MILP formulation of the parent problem and loads into a Formulation.

`formulation.json`¶

The structured description of the MILP. Each field maps onto an attribute of Formulation:

valid — boolean indicating whether the formulation is a faithful formulation of the parent problem.
parameters — parameters of the MILP formulation, each a Parameter.
assumptions — list of Assumption records, each with a description, a LaTeX formulation, an explicit flag, and code.python. The explicit flag is true when the assumption is stated explicitly in the original problem text and false when it is implicit (e.g. non-negativity of an obviously physical rate). Assumptions constrain parameters.
definitions — (optional) ordered map of named derived quantities computed from parameters before variables are declared, each a Definition. Typical uses: big-M constants and pre-computed index sets referenced in variable or constraint code.
variables — decision variables keyed by name, each a Variable with a description, type ("integer", "continuous", or "binary"), and a shape — optionally accompanied by indices (see Variable Shape Notation).
constraints — list of Constraint records, each with a description, a LaTeX formulation, an explicit flag, and code.gurobipy. The explicit flag is false for implied constraints such as non-negativity bounds.
objective — a single Objective with a description, a LaTeX formulation, and code.gurobipy.
imports — (optional) list of additional Python import statements emitted into the generated solve.py (e.g. ["import math"]).

The code.python / code.gurobipy snippets are what Formulation.gen_solve_py() assembles into a runnable solver script.

problems/p12/formulations/a/formulation.json

{
    "valid": true,
    "parameters": {
        "n": {
            "description": "Number of cities",
            "type": "integer",
            "shape": []
        },
        "c": {
            "description": "Travel cost from city i to city j",
            "type": "continuous",
            "shape": [
                "n",
                "n"
            ]
        }
    },
    "assumptions": [
        {
            "description": "There must be at least two cities to form a tour.",
            "formulation": "n \\geq 2",
            "explicit": false,
            "code": {
                "python": "assert n >= 2"
            }
        }
    ],
    "variables": {
        "x": {
            "description": "1 if the tour goes directly from city i to city j, 0 otherwise",
            "type": "binary",
            "shape": [
                "n",
                "n"
            ]
        },
        "u": {
            "description": "MTZ position of city i in the tour",
            "type": "continuous",
            "shape": [
                "n"
            ]
        }
    },
    "constraints": [
        {
            "description": "Each city has exactly one outgoing arc in the tour.",
            "formulation": "\\sum_{j \\in V,\\, j \\neq i} x_{ij} = 1 \\quad \\forall i \\in V",
            "code": {
                "gurobipy": "model.addConstrs(gp.quicksum(x[i, j] for j in range(n) if j != i) == 1 for i in range(n))"
            },
            "explicit": true
        },
        {
            "description": "Each city has exactly one incoming arc in the tour.",
            "formulation": "\\sum_{i \\in V,\\, i \\neq j} x_{ij} = 1 \\quad \\forall j \\in V",
            "code": {
                "gurobipy": "model.addConstrs(gp.quicksum(x[i, j] for i in range(n) if i != j) == 1 for j in range(n))"
            },
            "explicit": true
        },
        {
            "description": "MTZ subtour elimination constraint.",
            "formulation": "u_i - u_j + n \\cdot x_{ij} \\leq n - 1 \\quad \\forall i, j \\in V \\setminus \\{0\\},\\; i \\neq j",
            "code": {
                "gurobipy": "model.addConstrs(u[i] - u[j] + n * x[i, j] <= n - 1 for i in range(1, n) for j in range(1, n) if i != j)"
            },
            "explicit": true
        },
        {
            "description": "Depot position is fixed to 1 to anchor the tour ordering.",
            "formulation": "u_0 = 1",
            "code": {
                "gurobipy": "model.addConstr(u[0] == 1)"
            },
            "explicit": true
        },
        {
            "description": "Lower bound on MTZ position: each non-depot city's position is at least 2.",
            "formulation": "u_i \\geq 2 \\quad \\forall i \\in V \\setminus \\{0\\}",
            "code": {
                "gurobipy": "model.addConstrs(u[i] >= 2 for i in range(1, n))"
            },
            "explicit": true
        },
        {
            "description": "Upper bound on MTZ position: each non-depot city's position is at most n.",
            "formulation": "u_i \\leq n \\quad \\forall i \\in V \\setminus \\{0\\}",
            "code": {
                "gurobipy": "model.addConstrs(u[i] <= n for i in range(1, n))"
            },
            "explicit": true
        },
        {
            "description": "No self-loops: a city cannot have an arc to itself.",
            "formulation": "x_{ii} = 0 \\quad \\forall i \\in V",
            "code": {
                "gurobipy": "model.addConstrs(x[i, i] == 0 for i in range(n))"
            },
            "explicit": false
        }
    ],
    "objective": {
        "description": "Minimize the total travel cost of the Hamiltonian cycle.",
        "formulation": "\\min \\sum_{i \\in V} \\sum_{j \\in V,\\, j \\neq i} c_{ij} \\cdot x_{ij}",
        "code": {
            "gurobipy": "model.setObjective(gp.quicksum(c[i][j] * x[i, j] for i in range(n) for j in range(n) if i != j), GRB.MINIMIZE)"
        }
    },
    "metadata": {
        "source": {
            "dataset": "EvoCut"
        },
        "notes": [
            "This is the Miller-Tucker-Zemlin (MTZ) MILP formulation given in Section F.1 of EvoCut arXiv submission `v1`."
        ]
    }
}

`gen_params.py`¶

A Python script that reads the problem’s data.json and writes a parameters.json holding this formulation’s parameters. It keeps a single source of instance data per problem while letting each formulation define its own parametrization. The script accepts the input data.json and output parameters.json paths as positional arguments and is run by Formulation.run_gen_params().

`Formulation.lean`¶

A Lean 4 encoding of the MILP formulation as a MILPFormulation (defined in Common.lean). MILP Formulation describes how a formulation is encoded in Lean; the path to this file is exposed as the lean_formulation_path attribute of Formulation.

Reformulation Pairs and Proofs¶

The reformulations field of dataset.json is a flat list of formulation pairs. Each entry names two formulations a and b and labels whether b is a reformulation of a:

{
  "a": {"problem": 1, "formulation": "a"},
  "b": {"problem": 1, "formulation": "b"},
  "reformulation": true
}

Each entry loads into a Reformulation. For every positive pair ("reformulation": true) there is a corresponding Lean file reformulations/pN/a_b.lean constructing a MILPReformulation instance (see Reformulation). The path to the Lean file is exposed as the lean_proof_path attribute. Pairs that are not reformulations of one another have no Lean file.

reformulations/p1/a_b.lean

import Common
import problems.p1.formulations.a.Formulation
import problems.p1.formulations.b.Formulation

namespace P1

-- ============================================================================
-- § Parameter Mapping
-- ============================================================================

private def paramMap (p : P1.a.Params) : P1.b.Params :=
  { A := p.CashMachineProcessingRate
    K := p.CardMachineProcessingRate
    Y := p.CashMachinePaperRolls
    W := p.CardMachinePaperRolls
    U := p.MinPeopleProcessed
    V := p.MaxPaperRolls
    hA_nn := p.hCashMachineProcessingRate_nn
    hK_nn := p.hCardMachineProcessingRate_nn
    hY_nn := p.hCashMachinePaperRolls_nn
    hW_nn := p.hCardMachinePaperRolls_nn
    hU_nn := p.hMinPeopleProcessed_nn
    hV_nn := p.hMaxPaperRolls_nn }

-- ============================================================================
-- § Forward Mapping and Feasibility
-- ============================================================================

private def fwd (p : P1.a.Params) (v : P1.a.Vars p) : P1.b.Vars (paramMap p) :=
  { s := v.NumCashMachines
    r := v.NumCardMachines }

private lemma fwd_feas (p : P1.a.Params) (v : P1.a.Vars p)
    (h : P1.a.Feasible p v) :
    P1.b.Feasible (paramMap p) (fwd p v) :=
  { hpeople := h.hpeople
    hpaper  := h.hpaper
    hcard   := h.hcard
    hs_nn   := h.hNumCashMachines_nn
    hr_nn   := h.hNumCardMachines_nn }

-- ============================================================================
-- § Backward Mapping and Feasibility
-- ============================================================================

private def bwd (p : P1.a.Params) (v : P1.b.Vars (paramMap p)) : P1.a.Vars p :=
  { NumCashMachines := v.s
    NumCardMachines := v.r }

private lemma bwd_feas (p : P1.a.Params) (v : P1.b.Vars (paramMap p))
    (h : P1.b.Feasible (paramMap p) v) :
    P1.a.Feasible p (bwd p v) :=
  { hpeople  := h.hpeople
    hpaper   := h.hpaper
    hcard    := h.hcard
    hNumCashMachines_nn := h.hs_nn
    hNumCardMachines_nn := h.hr_nn }

-- ============================================================================
-- § Reformulation Structure
-- ============================================================================

def aBReformulation : MILPReformulation P1.a.formulation P1.b.formulation where
  paramMap    := paramMap
  fwd         := fwd
  bwd         := bwd
  fwd_feas    := fwd_feas
  bwd_feas    := bwd_feas
  objMap      := id
  objMap_mono := strictMono_id
  fwd_obj _ _ _ := rfl
  bwd_obj _ _ _ := rfl

end P1

Variable Shape Notation¶

The shape field of a variable and a parameter is a list of strings encoding the index dimensions. formulation-bench parses it into a Shape, an ordered sequence of typed Dimension objects. An empty list [] denotes a scalar. Every dimension is one of four DimensionType types:

Fixed (DimensionType.fixed) — a string naming a scalar parameter (e.g., "n", "T"): the dimension ranges over range(param).
Expression (DimensionType.expression) — a string containing an arithmetic expression over scalar parameters (e.g., "K+N"): used as-is in generated code.
Ragged (DimensionType.ragged) — a string of the form "X[Y]" where X is a scalar array and Y is another dimension already in the shape: the size of this dimension depends on the value of X at the outer index Y. For example, ["n_G", "n_S[n_G]"] means for each generator g in range(n_G), the second dimension ranges over range(n_S[g]). Ragged variables are represented as dicts in generated code.
Cardinality (DimensionType.cardinality) — a string of the form "|X|" denotes the cardinality of set X. When an |X| dimension appears, the variable must also provide an indices expression.

The Shape, Dimension, Parameter, and Variable API references give worked examples of each case.

Dataset Schema¶

Directory Structure¶

dataset.json¶

Problem Directory¶

description.md¶

problem.json¶

data.json¶

solution.json¶