Data
This page includes details of the datasets included in BAIT and how they are processed.
The data directory includes processing scripts and utilities,
separated into each dataset. The processing scripts typically download the raw data files into the
respective dataset directory, and then process them into the format needed for experiments.
├── data
├── HOL4
├── HOList
├── INT
├── LeanDojo
├── premise_selection
├── utils
HOL4
Processed for premise selection and for TacticZero experiments.
By default, creates a hol4 MongoDB database, with the following collections used by the experiments:
split_datavocabexpression_graphspaper_goals
Also includes the following collections with additional data on the HOL4 expressions and their dependencies:
dependency_dataexpression_metadata
HOList
Contains data based on HOL-Light proofs found in the HOList benchmark. Used for both the HOList Supervised training experiment, and for evaluation with the HOList Eval experiment.
By default, creates a holist MongoDB database, with the following collections used for the HOList experiments:
split_datavocabtrain_thm_lsexpression_graphs
INT
Data processed for the INT environment and experiments. Uses the INT environment to generate user configured synthetic proving problems, which are saved in this directory. The processing code for this is taken from the original paper, and unchanged.
Premise Selection
HOLStep
The HOLStep premise selection dataset.
Processed for Premise Selection experiments. Note that full graph processing takes a very long time, with over 1M expressions.
By default, creates a holstep MongoDB database, with the following collections used for Premise Selection:
split_datavocabexpression_graphs
LeanStep
Data from the LeanStep benchmark, processed for Premise Selection experiments. Modified to include s-expression output to enable graph parsing.
By default, creates a leanstep MongoDB database, with the following collections for premise selection:
split_datavocabexpression_graphs
Also processes and stores additional co-training data as outlined in the paper, for possible future experiments, with the following collections:
next_lemmapremise_classificationproof_stepproof_termskip_prooftheorem_nametype_prediction
MIZAR40
The MIZAR40 premise selection dataset. Processed for the format in Premise Selection experiments.
By default, creates a mizar40 MongoDB database, with the following collections used for Premise Selection:
split_datavocabexpression_graphs
LeanDojo
Data generated from the LeanDojo benchmark, used for End-to-End experiments and for training ReProver based models. Processed into a format which enables training generative models.
Will be updated in a future release.