Dataset Schema

This page documents the directory structure and JSON schemas for the FormulationBench dataset. The formulation-bench Python package is the recommended way to work with the dataset. This schema reference is provided for those modifying or extending the dataset (see User Guides).

Directory Structure

Here is a summary of the directory structure:

dataset/
├── lakefile.toml            # Lean project configuration
├── lake-manifest.json       # Pinned Lean library dependencies
├── lean-toolchain           # Pinned Lean version
├── Common.lean              # Lean definitions
├── dataset.json             # Problem IDs and reformulation pairs
├── problems/                # One subdirectory per problem
│   ├── p1/
│   │   ├── description.md   # Natural-language problem description
│   │   ├── problem.json     # Problem name, data parameters, metadata
│   │   ├── data.json        # Concrete parameter values for one instance
│   │   ├── solution.json    # Reference optimal solution and objective
│   │   └── formulations/
│   │       ├── a/
│   │       │   ├── formulation.json   # Structured MILP formulation
│   │       │   ├── gen_params.py      # Maps data.json to MILP parameters
│   │       │   └── Formulation.lean   # Lean 4 MILP formulation
│   │       └── ...
│   └── ...
└── reformulations/          # Lean 4 reformulation proofs
    ├── p1/
    │   ├── a_b.lean         # Proof that formulation b is a reformulation of a
    │   └── ...
    └── ...

The root of the dataset directory contains:

The Dataset loads in this dataset directory.

Note

Two files appear inside a formulation directory at runtime but are not shipped with the dataset: parameters.json, written by Formulation.run_gen_params(), and solve.py, written by Formulation.gen_solve_py(). Both are derived from the files above and can be regenerated at any time.

dataset.json

This is the manifest at the dataset root. It has two fields:

  • problems — list of integer problem IDs (e.g. 1 for p1). Each ID has a problems/pN/ directory.

  • reformulations — flat list of labelled formulation pairs (see Reformulation Pairs and Proofs)

Problem Directory

Each problems/pN/ directory describes one optimization problem and loads into a Problem.

description.md

A self-contained natural-language description of the optimization problem, exposed as the description attribute of Problem. It is also interpolated into the problem_description block when a formulation is rendered with Formulation.render_markdown().

problems/p1/description.md
An amusement park is installing cash-based machines and card-only machines. A cash-based machine can process CashMachineProcessingRate people per hour, while a card-only machine can process CardMachineProcessingRate people per hour. The cash-based machine needs CashMachinePaperRolls rolls of paper per hour, while the card-only machine requires CardMachinePaperRolls rolls of paper per hour. The amusement park needs to be able to process at least MinPeopleProcessed people per hour but can use at most MaxPaperRolls paper rolls per hour. Additionally, the number of card-only machines must not exceed the number of cash-based machines. The objective is to minimize the total number of machines in the park.

problem.json

Defines the problem name, its data parameters, and freeform metadata:

  • name — human-readable problem name.

  • parameters — schema of the problem’s data parameters, keyed by name. Each value is a Parameter with a description, type, and shape (see Variable Shape Notation).

  • metadata — freeform; typically a source field recording which dataset the problem was adapted from and a notes field with commentary. These populate the source and notes blocks on the Problems pages.

problems/p1/problem.json
{
    "name": "Amusement Park Ticket Machines",
    "parameters": {
        "CashMachineProcessingRate": {
            "description": "Processing rate of a cash-based machine in people per hour",
            "type": "continuous",
            "shape": []
        },
        "CardMachineProcessingRate": {
            "description": "Processing rate of a card-only machine in people per hour",
            "type": "continuous",
            "shape": []
        },
        "CashMachinePaperRolls": {
            "description": "Number of paper rolls used per hour by a cash-based machine",
            "type": "continuous",
            "shape": []
        },
        "CardMachinePaperRolls": {
            "description": "Number of paper rolls used per hour by a card-only machine",
            "type": "continuous",
            "shape": []
        },
        "MinPeopleProcessed": {
            "description": "Minimum number of people that must be processed per hour",
            "type": "continuous",
            "shape": []
        },
        "MaxPaperRolls": {
            "description": "Maximum number of paper rolls that can be used per hour",
            "type": "continuous",
            "shape": []
        }
    },
    "metadata": {
        "source": {
            "dataset": "EquivaFormulation",
            "instance_id": 47
        },
        "notes": [
            "The source variable type is continuous; we correct the variable type to integer.",
            "The cutting plane (source variation `47_e`) is excluded because it is instance-specific.",
            "Implicit non-negativity assumptions are added for every parameter."
        ]
    }
}

data.json

A single concrete instance: a JSON object mapping each parameter name in problem.json to a value. The same instance is reused across every formulation of the problem (each formulation’s gen_params.py translates it into formulation-specific parameters). Exposed as the data attribute of Problem.

problems/p1/data.json
{
    "CashMachineProcessingRate": 20,
    "CardMachineProcessingRate": 30,
    "CashMachinePaperRolls": 4,
    "CardMachinePaperRolls": 5,
    "MinPeopleProcessed": 500,
    "MaxPaperRolls": 90
}

solution.json

A reference optimal solution for the instance in data.json, with variables (values keyed by name) and objective (the optimal value). It loads into a Solution, exposed as the solution attribute of Problem. Variable values are specific to a formulation; by convention they use the variable names of formulation a.

problems/p1/solution.json
{
    "variables": {
        "NumCashMachines": 10.0,
        "NumCardMachines": 10.0
    },
    "objective": 20.0
}

Formulation Directory

Each problems/pN/formulations/x/ directory describes one MILP formulation of the parent problem and loads into a Formulation.

formulation.json

The structured description of the MILP. Each field maps onto an attribute of Formulation:

  • valid — boolean indicating whether the formulation is a faithful formulation of the parent problem.

  • parameters — parameters of the MILP formulation, each a Parameter.

  • assumptions — list of Assumption records, each with a description, a LaTeX formulation, an explicit flag, and code.python. The explicit flag is true when the assumption is stated explicitly in the original problem text and false when it is implicit (e.g. non-negativity of an obviously physical rate). Assumptions constrain parameters.

  • definitions(optional) ordered map of named derived quantities computed from parameters before variables are declared, each a Definition. Typical uses: big-M constants and pre-computed index sets referenced in variable or constraint code.

  • variables — decision variables keyed by name, each a Variable with a description, type ("integer", "continuous", or "binary"), and a shape — optionally accompanied by indices (see Variable Shape Notation).

  • constraints — list of Constraint records, each with a description, a LaTeX formulation, an explicit flag, and code.gurobipy. The explicit flag is false for implied constraints such as non-negativity bounds.

  • objective — a single Objective with a description, a LaTeX formulation, and code.gurobipy.

  • imports(optional) list of additional Python import statements emitted into the generated solve.py (e.g. ["import math"]).

The code.python / code.gurobipy snippets are what Formulation.gen_solve_py() assembles into a runnable solver script.

problems/p12/formulations/a/formulation.json
{
    "valid": true,
    "parameters": {
        "n": {
            "description": "Number of cities",
            "type": "integer",
            "shape": []
        },
        "c": {
            "description": "Travel cost from city i to city j",
            "type": "continuous",
            "shape": [
                "n",
                "n"
            ]
        }
    },
    "assumptions": [
        {
            "description": "There must be at least two cities to form a tour.",
            "formulation": "n \\geq 2",
            "explicit": false,
            "code": {
                "python": "assert n >= 2"
            }
        }
    ],
    "variables": {
        "x": {
            "description": "1 if the tour goes directly from city i to city j, 0 otherwise",
            "type": "binary",
            "shape": [
                "n",
                "n"
            ]
        },
        "u": {
            "description": "MTZ position of city i in the tour",
            "type": "continuous",
            "shape": [
                "n"
            ]
        }
    },
    "constraints": [
        {
            "description": "Each city has exactly one outgoing arc in the tour.",
            "formulation": "\\sum_{j \\in V,\\, j \\neq i} x_{ij} = 1 \\quad \\forall i \\in V",
            "code": {
                "gurobipy": "model.addConstrs(gp.quicksum(x[i, j] for j in range(n) if j != i) == 1 for i in range(n))"
            },
            "explicit": true
        },
        {
            "description": "Each city has exactly one incoming arc in the tour.",
            "formulation": "\\sum_{i \\in V,\\, i \\neq j} x_{ij} = 1 \\quad \\forall j \\in V",
            "code": {
                "gurobipy": "model.addConstrs(gp.quicksum(x[i, j] for i in range(n) if i != j) == 1 for j in range(n))"
            },
            "explicit": true
        },
        {
            "description": "MTZ subtour elimination constraint.",
            "formulation": "u_i - u_j + n \\cdot x_{ij} \\leq n - 1 \\quad \\forall i, j \\in V \\setminus \\{0\\},\\; i \\neq j",
            "code": {
                "gurobipy": "model.addConstrs(u[i] - u[j] + n * x[i, j] <= n - 1 for i in range(1, n) for j in range(1, n) if i != j)"
            },
            "explicit": true
        },
        {
            "description": "Depot position is fixed to 1 to anchor the tour ordering.",
            "formulation": "u_0 = 1",
            "code": {
                "gurobipy": "model.addConstr(u[0] == 1)"
            },
            "explicit": true
        },
        {
            "description": "Lower bound on MTZ position: each non-depot city's position is at least 2.",
            "formulation": "u_i \\geq 2 \\quad \\forall i \\in V \\setminus \\{0\\}",
            "code": {
                "gurobipy": "model.addConstrs(u[i] >= 2 for i in range(1, n))"
            },
            "explicit": true
        },
        {
            "description": "Upper bound on MTZ position: each non-depot city's position is at most n.",
            "formulation": "u_i \\leq n \\quad \\forall i \\in V \\setminus \\{0\\}",
            "code": {
                "gurobipy": "model.addConstrs(u[i] <= n for i in range(1, n))"
            },
            "explicit": true
        },
        {
            "description": "No self-loops: a city cannot have an arc to itself.",
            "formulation": "x_{ii} = 0 \\quad \\forall i \\in V",
            "code": {
                "gurobipy": "model.addConstrs(x[i, i] == 0 for i in range(n))"
            },
            "explicit": false
        }
    ],
    "objective": {
        "description": "Minimize the total travel cost of the Hamiltonian cycle.",
        "formulation": "\\min \\sum_{i \\in V} \\sum_{j \\in V,\\, j \\neq i} c_{ij} \\cdot x_{ij}",
        "code": {
            "gurobipy": "model.setObjective(gp.quicksum(c[i][j] * x[i, j] for i in range(n) for j in range(n) if i != j), GRB.MINIMIZE)"
        }
    },
    "metadata": {
        "source": {
            "dataset": "EvoCut"
        },
        "notes": [
            "This is the Miller-Tucker-Zemlin (MTZ) MILP formulation given in Section F.1 of EvoCut arXiv submission `v1`."
        ]
    }
}

gen_params.py

A Python script that reads the problem’s data.json and writes a parameters.json holding this formulation’s parameters. It keeps a single source of instance data per problem while letting each formulation define its own parametrization. The script accepts the input data.json and output parameters.json paths as positional arguments and is run by Formulation.run_gen_params().

problems/p1/formulations/b/gen_params.py
import argparse
import json


def main(data_path: str, output_path: str) -> None:
    with open(data_path) as f:
        data = json.load(f)

    params = {
        "A": data["CashMachineProcessingRate"],
        "K": data["CardMachineProcessingRate"],
        "Y": data["CashMachinePaperRolls"],
        "W": data["CardMachinePaperRolls"],
        "U": data["MinPeopleProcessed"],
        "V": data["MaxPaperRolls"],
    }

    with open(output_path, "w") as f:
        json.dump(params, f, indent=4)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("data", help="Path to data.json")
    parser.add_argument("output", help="Path to write parameters.json")
    args = parser.parse_args()
    main(args.data, args.output)

Formulation.lean

A Lean 4 encoding of the MILP formulation as a MILPFormulation (defined in Common.lean). MILP Formulation describes how a formulation is encoded in Lean; the path to this file is exposed as the lean_formulation_path attribute of Formulation.

problems/p1/formulations/a/Formulation.lean
import Common

namespace P1.a

structure Params where
  CashMachineProcessingRate :   -- cash machine processing rate (people/hour)
  CardMachineProcessingRate :   -- card machine processing rate (people/hour)
  CashMachinePaperRolls :   -- paper rolls/hour for cash machine
  CardMachinePaperRolls :   -- paper rolls/hour for card machine
  MinPeopleProcessed :   -- min people processed per hour
  MaxPaperRolls :   -- max paper rolls per hour
  -- Implicit Assumptions
  hCashMachineProcessingRate_nn : 0  CashMachineProcessingRate
  hCardMachineProcessingRate_nn : 0  CardMachineProcessingRate
  hCashMachinePaperRolls_nn : 0  CashMachinePaperRolls
  hCardMachinePaperRolls_nn : 0  CardMachinePaperRolls
  hMinPeopleProcessed_nn : 0  MinPeopleProcessed
  hMaxPaperRolls_nn : 0  MaxPaperRolls

structure Vars (p : Params) where
  NumCashMachines :   -- number of cash machines
  NumCardMachines :   -- number of card machines

structure Feasible (p : Params) (v : Vars p) : Prop where
  -- Process at least MinPeopleProcessed people per hour
  hpeople : p.MinPeopleProcessed  p.CashMachineProcessingRate * (v.NumCashMachines : ) + p.CardMachineProcessingRate * (v.NumCardMachines : )
  -- Use at most MaxPaperRolls paper rolls per hour
  hpaper : (v.NumCashMachines : ) * p.CashMachinePaperRolls + (v.NumCardMachines : ) * p.CardMachinePaperRolls  p.MaxPaperRolls
  -- Card machines ≤ cash machines
  hcard : v.NumCardMachines  v.NumCashMachines
  -- [Implicit Constraints]
  hNumCashMachines_nn : 0  v.NumCashMachines
  hNumCardMachines_nn : 0  v.NumCardMachines

-- Minimize the total number of machines
def obj (p : Params) (v : Vars p) :  := (v.NumCashMachines : ) + (v.NumCardMachines : )

def formulation : MILPFormulation where
  Params   := Params
  Vars     := Vars
  feasible := Feasible
  obj      := obj

end P1.a

Reformulation Pairs and Proofs

The reformulations field of dataset.json is a flat list of formulation pairs. Each entry names two formulations a and b and labels whether b is a reformulation of a:

{
  "a": {"problem": 1, "formulation": "a"},
  "b": {"problem": 1, "formulation": "b"},
  "reformulation": true
}

Each entry loads into a Reformulation. For every positive pair ("reformulation": true) there is a corresponding Lean file reformulations/pN/a_b.lean constructing a MILPReformulation instance (see Reformulation). The path to the Lean file is exposed as the lean_proof_path attribute. Pairs that are not reformulations of one another have no Lean file.

reformulations/p1/a_b.lean
import Common
import problems.p1.formulations.a.Formulation
import problems.p1.formulations.b.Formulation

namespace P1

-- ============================================================================
-- § Parameter Mapping
-- ============================================================================

private def paramMap (p : P1.a.Params) : P1.b.Params :=
  { A := p.CashMachineProcessingRate
    K := p.CardMachineProcessingRate
    Y := p.CashMachinePaperRolls
    W := p.CardMachinePaperRolls
    U := p.MinPeopleProcessed
    V := p.MaxPaperRolls
    hA_nn := p.hCashMachineProcessingRate_nn
    hK_nn := p.hCardMachineProcessingRate_nn
    hY_nn := p.hCashMachinePaperRolls_nn
    hW_nn := p.hCardMachinePaperRolls_nn
    hU_nn := p.hMinPeopleProcessed_nn
    hV_nn := p.hMaxPaperRolls_nn }

-- ============================================================================
-- § Forward Mapping and Feasibility
-- ============================================================================

private def fwd (p : P1.a.Params) (v : P1.a.Vars p) : P1.b.Vars (paramMap p) :=
  { s := v.NumCashMachines
    r := v.NumCardMachines }

private lemma fwd_feas (p : P1.a.Params) (v : P1.a.Vars p)
    (h : P1.a.Feasible p v) :
    P1.b.Feasible (paramMap p) (fwd p v) :=
  { hpeople := h.hpeople
    hpaper  := h.hpaper
    hcard   := h.hcard
    hs_nn   := h.hNumCashMachines_nn
    hr_nn   := h.hNumCardMachines_nn }

-- ============================================================================
-- § Backward Mapping and Feasibility
-- ============================================================================

private def bwd (p : P1.a.Params) (v : P1.b.Vars (paramMap p)) : P1.a.Vars p :=
  { NumCashMachines := v.s
    NumCardMachines := v.r }

private lemma bwd_feas (p : P1.a.Params) (v : P1.b.Vars (paramMap p))
    (h : P1.b.Feasible (paramMap p) v) :
    P1.a.Feasible p (bwd p v) :=
  { hpeople  := h.hpeople
    hpaper   := h.hpaper
    hcard    := h.hcard
    hNumCashMachines_nn := h.hs_nn
    hNumCardMachines_nn := h.hr_nn }

-- ============================================================================
-- § Reformulation Structure
-- ============================================================================

def aBReformulation : MILPReformulation P1.a.formulation P1.b.formulation where
  paramMap    := paramMap
  fwd         := fwd
  bwd         := bwd
  fwd_feas    := fwd_feas
  bwd_feas    := bwd_feas
  objMap      := id
  objMap_mono := strictMono_id
  fwd_obj _ _ _ := rfl
  bwd_obj _ _ _ := rfl

end P1

Variable Shape Notation

The shape field of a variable and a parameter is a list of strings encoding the index dimensions. formulation-bench parses it into a Shape, an ordered sequence of typed Dimension objects. An empty list [] denotes a scalar. Every dimension is one of four DimensionType types:

  • Fixed (DimensionType.fixed) — a string naming a scalar parameter (e.g., "n", "T"): the dimension ranges over range(param).

  • Expression (DimensionType.expression) — a string containing an arithmetic expression over scalar parameters (e.g., "K+N"): used as-is in generated code.

  • Ragged (DimensionType.ragged) — a string of the form "X[Y]" where X is a scalar array and Y is another dimension already in the shape: the size of this dimension depends on the value of X at the outer index Y. For example, ["n_G", "n_S[n_G]"] means for each generator g in range(n_G), the second dimension ranges over range(n_S[g]). Ragged variables are represented as dicts in generated code.

  • Cardinality (DimensionType.cardinality) — a string of the form "|X|" denotes the cardinality of set X. When an |X| dimension appears, the variable must also provide an indices expression.

The Shape, Dimension, Parameter, and Variable API references give worked examples of each case.