Dataset Schema¶
This page documents the directory structure and JSON schemas for the
FormulationBench dataset. The formulation-bench Python package is the
recommended way to work with the dataset. This schema reference is provided for those modifying or extending the dataset (see User Guides).
Directory Structure¶
Here is a summary of the directory structure:
dataset/
├── lakefile.toml # Lean project configuration
├── lake-manifest.json # Pinned Lean library dependencies
├── lean-toolchain # Pinned Lean version
├── Common.lean # Lean definitions
├── dataset.json # Problem IDs and reformulation pairs
├── problems/ # One subdirectory per problem
│ ├── p1/
│ │ ├── description.md # Natural-language problem description
│ │ ├── problem.json # Problem name, data parameters, metadata
│ │ ├── data.json # Concrete parameter values for one instance
│ │ ├── solution.json # Reference optimal solution and objective
│ │ └── formulations/
│ │ ├── a/
│ │ │ ├── formulation.json # Structured MILP formulation
│ │ │ ├── gen_params.py # Maps data.json to MILP parameters
│ │ │ └── Formulation.lean # Lean 4 MILP formulation
│ │ └── ...
│ └── ...
└── reformulations/ # Lean 4 reformulation proofs
├── p1/
│ ├── a_b.lean # Proof that formulation b is a reformulation of a
│ └── ...
└── ...
The root of the dataset directory contains:
Lean project files:
lakefile.toml,lake-manifest.json,lean-toolchain(see Compiling Lean files)Lean formulation and reformulation definitions:
Common.lean(see Definitions)A manifest of the problems and reformulations contained in the dataset:
dataset.jsonThe problems subdirectory (see Problem Directory)
The reformulations subdirectory (see Reformulation Pairs and Proofs)
The Dataset loads in this dataset directory.
Note
Two files appear inside a formulation directory at runtime but are not
shipped with the dataset: parameters.json, written by Formulation.run_gen_params(), and solve.py,
written by Formulation.gen_solve_py(). Both are derived
from the files above and can be regenerated at any time.
dataset.json¶
This is the manifest at the dataset root. It has two fields:
problems— list of integer problem IDs (e.g.1forp1). Each ID has aproblems/pN/directory.reformulations— flat list of labelled formulation pairs (see Reformulation Pairs and Proofs)
Problem Directory¶
Each problems/pN/ directory describes one optimization problem and loads
into a Problem.
description.md¶
A self-contained natural-language description of the optimization problem,
exposed as the description attribute of Problem. It is also interpolated into the
problem_description block when a formulation is rendered with
Formulation.render_markdown().
problems/p1/description.md
An amusement park is installing cash-based machines and card-only machines. A cash-based machine can process CashMachineProcessingRate people per hour, while a card-only machine can process CardMachineProcessingRate people per hour. The cash-based machine needs CashMachinePaperRolls rolls of paper per hour, while the card-only machine requires CardMachinePaperRolls rolls of paper per hour. The amusement park needs to be able to process at least MinPeopleProcessed people per hour but can use at most MaxPaperRolls paper rolls per hour. Additionally, the number of card-only machines must not exceed the number of cash-based machines. The objective is to minimize the total number of machines in the park.
problem.json¶
Defines the problem name, its data parameters, and freeform metadata:
name— human-readable problem name.parameters— schema of the problem’s data parameters, keyed by name. Each value is aParameterwith adescription,type, andshape(see Variable Shape Notation).metadata— freeform; typically asourcefield recording which dataset the problem was adapted from and anotesfield with commentary. These populate the source and notes blocks on the Problems pages.
problems/p1/problem.json
{
"name": "Amusement Park Ticket Machines",
"parameters": {
"CashMachineProcessingRate": {
"description": "Processing rate of a cash-based machine in people per hour",
"type": "continuous",
"shape": []
},
"CardMachineProcessingRate": {
"description": "Processing rate of a card-only machine in people per hour",
"type": "continuous",
"shape": []
},
"CashMachinePaperRolls": {
"description": "Number of paper rolls used per hour by a cash-based machine",
"type": "continuous",
"shape": []
},
"CardMachinePaperRolls": {
"description": "Number of paper rolls used per hour by a card-only machine",
"type": "continuous",
"shape": []
},
"MinPeopleProcessed": {
"description": "Minimum number of people that must be processed per hour",
"type": "continuous",
"shape": []
},
"MaxPaperRolls": {
"description": "Maximum number of paper rolls that can be used per hour",
"type": "continuous",
"shape": []
}
},
"metadata": {
"source": {
"dataset": "EquivaFormulation",
"instance_id": 47
},
"notes": [
"The source variable type is continuous; we correct the variable type to integer.",
"The cutting plane (source variation `47_e`) is excluded because it is instance-specific.",
"Implicit non-negativity assumptions are added for every parameter."
]
}
}
data.json¶
A single concrete instance: a JSON object mapping each parameter name in
problem.json to a value. The same instance is reused across every
formulation of the problem (each formulation’s gen_params.py translates it
into formulation-specific parameters). Exposed as the data attribute of
Problem.
problems/p1/data.json
{
"CashMachineProcessingRate": 20,
"CardMachineProcessingRate": 30,
"CashMachinePaperRolls": 4,
"CardMachinePaperRolls": 5,
"MinPeopleProcessed": 500,
"MaxPaperRolls": 90
}
solution.json¶
A reference optimal solution for the instance in data.json, with
variables (values keyed by name) and objective (the optimal value). It
loads into a Solution, exposed
as the solution attribute of Problem. Variable values are specific to a
formulation; by convention they use the variable names of formulation a.
problems/p1/solution.json
{
"variables": {
"NumCashMachines": 10.0,
"NumCardMachines": 10.0
},
"objective": 20.0
}
Formulation Directory¶
Each problems/pN/formulations/x/ directory describes one MILP formulation
of the parent problem and loads into a Formulation.
formulation.json¶
The structured description of the MILP. Each field maps onto an attribute of
Formulation:
valid— boolean indicating whether the formulation is a faithful formulation of the parent problem.parameters— parameters of the MILP formulation, each aParameter.assumptions— list ofAssumptionrecords, each with adescription, a LaTeXformulation, anexplicitflag, andcode.python. Theexplicitflag istruewhen the assumption is stated explicitly in the original problem text andfalsewhen it is implicit (e.g. non-negativity of an obviously physical rate). Assumptions constrain parameters.definitions— (optional) ordered map of named derived quantities computed from parameters before variables are declared, each aDefinition. Typical uses: big-M constants and pre-computed index sets referenced in variable or constraint code.variables— decision variables keyed by name, each aVariablewith adescription,type("integer","continuous", or"binary"), and ashape— optionally accompanied byindices(see Variable Shape Notation).constraints— list ofConstraintrecords, each with adescription, a LaTeXformulation, anexplicitflag, andcode.gurobipy. Theexplicitflag isfalsefor implied constraints such as non-negativity bounds.objective— a singleObjectivewith adescription, a LaTeXformulation, andcode.gurobipy.imports— (optional) list of additional Python import statements emitted into the generatedsolve.py(e.g.["import math"]).
The code.python / code.gurobipy snippets are what
Formulation.gen_solve_py() assembles into a
runnable solver script.
problems/p12/formulations/a/formulation.json
{
"valid": true,
"parameters": {
"n": {
"description": "Number of cities",
"type": "integer",
"shape": []
},
"c": {
"description": "Travel cost from city i to city j",
"type": "continuous",
"shape": [
"n",
"n"
]
}
},
"assumptions": [
{
"description": "There must be at least two cities to form a tour.",
"formulation": "n \\geq 2",
"explicit": false,
"code": {
"python": "assert n >= 2"
}
}
],
"variables": {
"x": {
"description": "1 if the tour goes directly from city i to city j, 0 otherwise",
"type": "binary",
"shape": [
"n",
"n"
]
},
"u": {
"description": "MTZ position of city i in the tour",
"type": "continuous",
"shape": [
"n"
]
}
},
"constraints": [
{
"description": "Each city has exactly one outgoing arc in the tour.",
"formulation": "\\sum_{j \\in V,\\, j \\neq i} x_{ij} = 1 \\quad \\forall i \\in V",
"code": {
"gurobipy": "model.addConstrs(gp.quicksum(x[i, j] for j in range(n) if j != i) == 1 for i in range(n))"
},
"explicit": true
},
{
"description": "Each city has exactly one incoming arc in the tour.",
"formulation": "\\sum_{i \\in V,\\, i \\neq j} x_{ij} = 1 \\quad \\forall j \\in V",
"code": {
"gurobipy": "model.addConstrs(gp.quicksum(x[i, j] for i in range(n) if i != j) == 1 for j in range(n))"
},
"explicit": true
},
{
"description": "MTZ subtour elimination constraint.",
"formulation": "u_i - u_j + n \\cdot x_{ij} \\leq n - 1 \\quad \\forall i, j \\in V \\setminus \\{0\\},\\; i \\neq j",
"code": {
"gurobipy": "model.addConstrs(u[i] - u[j] + n * x[i, j] <= n - 1 for i in range(1, n) for j in range(1, n) if i != j)"
},
"explicit": true
},
{
"description": "Depot position is fixed to 1 to anchor the tour ordering.",
"formulation": "u_0 = 1",
"code": {
"gurobipy": "model.addConstr(u[0] == 1)"
},
"explicit": true
},
{
"description": "Lower bound on MTZ position: each non-depot city's position is at least 2.",
"formulation": "u_i \\geq 2 \\quad \\forall i \\in V \\setminus \\{0\\}",
"code": {
"gurobipy": "model.addConstrs(u[i] >= 2 for i in range(1, n))"
},
"explicit": true
},
{
"description": "Upper bound on MTZ position: each non-depot city's position is at most n.",
"formulation": "u_i \\leq n \\quad \\forall i \\in V \\setminus \\{0\\}",
"code": {
"gurobipy": "model.addConstrs(u[i] <= n for i in range(1, n))"
},
"explicit": true
},
{
"description": "No self-loops: a city cannot have an arc to itself.",
"formulation": "x_{ii} = 0 \\quad \\forall i \\in V",
"code": {
"gurobipy": "model.addConstrs(x[i, i] == 0 for i in range(n))"
},
"explicit": false
}
],
"objective": {
"description": "Minimize the total travel cost of the Hamiltonian cycle.",
"formulation": "\\min \\sum_{i \\in V} \\sum_{j \\in V,\\, j \\neq i} c_{ij} \\cdot x_{ij}",
"code": {
"gurobipy": "model.setObjective(gp.quicksum(c[i][j] * x[i, j] for i in range(n) for j in range(n) if i != j), GRB.MINIMIZE)"
}
},
"metadata": {
"source": {
"dataset": "EvoCut"
},
"notes": [
"This is the Miller-Tucker-Zemlin (MTZ) MILP formulation given in Section F.1 of EvoCut arXiv submission `v1`."
]
}
}
gen_params.py¶
A Python script that reads the problem’s data.json and writes a
parameters.json holding this formulation’s parameters. It keeps a single
source of instance data per problem while letting each formulation define
its own parametrization. The script accepts the input data.json and output
parameters.json paths as positional arguments and is run by
Formulation.run_gen_params().
problems/p1/formulations/b/gen_params.py
import argparse
import json
def main(data_path: str, output_path: str) -> None:
with open(data_path) as f:
data = json.load(f)
params = {
"A": data["CashMachineProcessingRate"],
"K": data["CardMachineProcessingRate"],
"Y": data["CashMachinePaperRolls"],
"W": data["CardMachinePaperRolls"],
"U": data["MinPeopleProcessed"],
"V": data["MaxPaperRolls"],
}
with open(output_path, "w") as f:
json.dump(params, f, indent=4)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("data", help="Path to data.json")
parser.add_argument("output", help="Path to write parameters.json")
args = parser.parse_args()
main(args.data, args.output)
Formulation.lean¶
A Lean 4 encoding of the MILP formulation as a MILPFormulation (defined in
Common.lean). MILP Formulation
describes how a formulation is encoded in Lean; the path to this file is exposed as the lean_formulation_path attribute of Formulation.
problems/p1/formulations/a/Formulation.lean
import Common
namespace P1.a
structure Params where
CashMachineProcessingRate : ℝ -- cash machine processing rate (people/hour)
CardMachineProcessingRate : ℝ -- card machine processing rate (people/hour)
CashMachinePaperRolls : ℝ -- paper rolls/hour for cash machine
CardMachinePaperRolls : ℝ -- paper rolls/hour for card machine
MinPeopleProcessed : ℝ -- min people processed per hour
MaxPaperRolls : ℝ -- max paper rolls per hour
-- Implicit Assumptions
hCashMachineProcessingRate_nn : 0 ≤ CashMachineProcessingRate
hCardMachineProcessingRate_nn : 0 ≤ CardMachineProcessingRate
hCashMachinePaperRolls_nn : 0 ≤ CashMachinePaperRolls
hCardMachinePaperRolls_nn : 0 ≤ CardMachinePaperRolls
hMinPeopleProcessed_nn : 0 ≤ MinPeopleProcessed
hMaxPaperRolls_nn : 0 ≤ MaxPaperRolls
structure Vars (p : Params) where
NumCashMachines : ℤ -- number of cash machines
NumCardMachines : ℤ -- number of card machines
structure Feasible (p : Params) (v : Vars p) : Prop where
-- Process at least MinPeopleProcessed people per hour
hpeople : p.MinPeopleProcessed ≤ p.CashMachineProcessingRate * (v.NumCashMachines : ℝ) + p.CardMachineProcessingRate * (v.NumCardMachines : ℝ)
-- Use at most MaxPaperRolls paper rolls per hour
hpaper : (v.NumCashMachines : ℝ) * p.CashMachinePaperRolls + (v.NumCardMachines : ℝ) * p.CardMachinePaperRolls ≤ p.MaxPaperRolls
-- Card machines ≤ cash machines
hcard : v.NumCardMachines ≤ v.NumCashMachines
-- [Implicit Constraints]
hNumCashMachines_nn : 0 ≤ v.NumCashMachines
hNumCardMachines_nn : 0 ≤ v.NumCardMachines
-- Minimize the total number of machines
def obj (p : Params) (v : Vars p) : ℝ := (v.NumCashMachines : ℝ) + (v.NumCardMachines : ℝ)
def formulation : MILPFormulation where
Params := Params
Vars := Vars
feasible := Feasible
obj := obj
end P1.a
Reformulation Pairs and Proofs¶
The reformulations field of dataset.json is a flat list of formulation
pairs. Each entry names two formulations a and b and labels whether b is a reformulation of a:
{
"a": {"problem": 1, "formulation": "a"},
"b": {"problem": 1, "formulation": "b"},
"reformulation": true
}
Each entry loads into a Reformulation. For every positive pair
("reformulation": true) there is a corresponding Lean file
reformulations/pN/a_b.lean constructing a MILPReformulation instance (see
Reformulation). The path to the Lean file is exposed as the lean_proof_path attribute. Pairs that are not reformulations of one another have no Lean file.
reformulations/p1/a_b.lean
import Common
import problems.p1.formulations.a.Formulation
import problems.p1.formulations.b.Formulation
namespace P1
-- ============================================================================
-- § Parameter Mapping
-- ============================================================================
private def paramMap (p : P1.a.Params) : P1.b.Params :=
{ A := p.CashMachineProcessingRate
K := p.CardMachineProcessingRate
Y := p.CashMachinePaperRolls
W := p.CardMachinePaperRolls
U := p.MinPeopleProcessed
V := p.MaxPaperRolls
hA_nn := p.hCashMachineProcessingRate_nn
hK_nn := p.hCardMachineProcessingRate_nn
hY_nn := p.hCashMachinePaperRolls_nn
hW_nn := p.hCardMachinePaperRolls_nn
hU_nn := p.hMinPeopleProcessed_nn
hV_nn := p.hMaxPaperRolls_nn }
-- ============================================================================
-- § Forward Mapping and Feasibility
-- ============================================================================
private def fwd (p : P1.a.Params) (v : P1.a.Vars p) : P1.b.Vars (paramMap p) :=
{ s := v.NumCashMachines
r := v.NumCardMachines }
private lemma fwd_feas (p : P1.a.Params) (v : P1.a.Vars p)
(h : P1.a.Feasible p v) :
P1.b.Feasible (paramMap p) (fwd p v) :=
{ hpeople := h.hpeople
hpaper := h.hpaper
hcard := h.hcard
hs_nn := h.hNumCashMachines_nn
hr_nn := h.hNumCardMachines_nn }
-- ============================================================================
-- § Backward Mapping and Feasibility
-- ============================================================================
private def bwd (p : P1.a.Params) (v : P1.b.Vars (paramMap p)) : P1.a.Vars p :=
{ NumCashMachines := v.s
NumCardMachines := v.r }
private lemma bwd_feas (p : P1.a.Params) (v : P1.b.Vars (paramMap p))
(h : P1.b.Feasible (paramMap p) v) :
P1.a.Feasible p (bwd p v) :=
{ hpeople := h.hpeople
hpaper := h.hpaper
hcard := h.hcard
hNumCashMachines_nn := h.hs_nn
hNumCardMachines_nn := h.hr_nn }
-- ============================================================================
-- § Reformulation Structure
-- ============================================================================
def aBReformulation : MILPReformulation P1.a.formulation P1.b.formulation where
paramMap := paramMap
fwd := fwd
bwd := bwd
fwd_feas := fwd_feas
bwd_feas := bwd_feas
objMap := id
objMap_mono := strictMono_id
fwd_obj _ _ _ := rfl
bwd_obj _ _ _ := rfl
end P1
Variable Shape Notation¶
The shape field of a variable and a parameter is a list of strings
encoding the index dimensions. formulation-bench parses it into a
Shape, an ordered sequence of
typed Dimension objects. An
empty list [] denotes a scalar. Every dimension is one of four
DimensionType types:
Fixed (
DimensionType.fixed) — a string naming a scalar parameter (e.g.,"n","T"): the dimension ranges overrange(param).Expression (
DimensionType.expression) — a string containing an arithmetic expression over scalar parameters (e.g.,"K+N"): used as-is in generated code.Ragged (
DimensionType.ragged) — a string of the form"X[Y]"whereXis a scalar array andYis another dimension already in the shape: the size of this dimension depends on the value ofXat the outer indexY. For example,["n_G", "n_S[n_G]"]means for each generatorg in range(n_G), the second dimension ranges overrange(n_S[g]). Ragged variables are represented as dicts in generated code.Cardinality (
DimensionType.cardinality) — a string of the form"|X|"denotes the cardinality of setX. When an|X|dimension appears, the variable must also provide anindicesexpression.
The Shape, Dimension, Parameter, and Variable API references give worked examples of
each case.