# Data

This page describes the datasets included in BAIT and how they are processed. The `data` directory contains the processing scripts and utilities, separated by dataset. The processing scripts typically download the raw data files into the respective dataset directory and then process them into the format needed for experiments.
```
├── data
│   ├── HOL4
│   ├── HOList
│   ├── INT
│   ├── LeanDojo
│   ├── premise_selection
│   └── utils
```
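As a minimal sketch (not part of BAIT itself), downstream code can resolve a dataset's directory under the data root shown above; the helper and constant names here are hypothetical, only the directory names come from the layout:

```python
from pathlib import Path

# Top-level dataset directories from the layout above.
DATASET_DIRS = ["HOL4", "HOList", "INT", "LeanDojo", "premise_selection", "utils"]

def dataset_dir(data_root: str, name: str) -> Path:
    """Return the subdirectory for one dataset, validating the name first."""
    if name not in DATASET_DIRS:
        raise ValueError(f"unknown dataset {name!r}; expected one of {DATASET_DIRS}")
    return Path(data_root) / name
```

For example, `dataset_dir("data", "HOL4")` yields the `data/HOL4` path, while an unknown name raises a `ValueError` instead of silently producing a missing path.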
## HOL4

Processed for premise selection and for TacticZero experiments.

By default, creates a `hol4` MongoDB database, with the following collections used by the experiments:

- `split_data`
- `vocab`
- `expression_graphs`
- `paper_goals`

It also includes the following collections with additional data on the HOL4 expressions and their dependencies:

- `dependency_data`
- `expression_metadata`
## HOList

Contains data based on the HOL Light proofs found in the HOList benchmark. Used both for the HOList supervised training experiment and for evaluation with the HOList Eval experiment.

By default, creates a `holist` MongoDB database, with the following collections used for the HOList experiments:

- `split_data`
- `vocab`
- `train_thm_ls`
- `expression_graphs`
## INT

Data processed for the INT environment and experiments. The INT environment is used to generate user-configured synthetic proving problems, which are saved in this directory. The processing code is taken unchanged from the original paper.
## Premise Selection

### HOLStep

The HOLStep premise selection dataset, processed for Premise Selection experiments. Note that full graph processing takes a very long time, as there are over 1M expressions.

By default, creates a `holstep` MongoDB database, with the following collections used for Premise Selection:

- `split_data`
- `vocab`
- `expression_graphs`
### LeanStep

Data from the LeanStep benchmark, processed for Premise Selection experiments. Modified to include s-expression output, which enables graph parsing.

By default, creates a `leanstep` MongoDB database, with the following collections for premise selection:

- `split_data`
- `vocab`
- `expression_graphs`

It also processes and stores the additional co-training data outlined in the paper, for possible future experiments, in the following collections:

- `next_lemma`
- `premise_classification`
- `proof_step`
- `proof_term`
- `skip_proof`
- `theorem_name`
- `type_prediction`
### MIZAR40

The MIZAR40 premise selection dataset, processed into the format used for Premise Selection experiments.

By default, creates a `mizar40` MongoDB database, with the following collections used for Premise Selection:

- `split_data`
- `vocab`
- `expression_graphs`
## LeanDojo

Data generated from the LeanDojo benchmark, used for End-to-End experiments and for training ReProver-based models. Processed into a format that enables training generative models.

This will be updated in a future release.