Autodidactic Iteration (ADI) is the learning procedure used in DeepCube to train a neural network that evaluates Rubik’s Cube states (and provides move priors) without hand-crafted cube features or human demonstrations.[1] The trained model is then used as guidance inside Monte Carlo Tree Search (MCTS).[1:1]
ADI trains a parameterized function

$$f_\theta(x) = (v, p)$$

where:

- $x$ is a cube state,
- $v \in \mathbb{R}$ is a scalar value estimate of how close $x$ is to the solved state,
- $p$ is a probability distribution over the 12 possible moves.
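The two-headed function above can be sketched as a tiny NumPy network. This is purely illustrative: the layer sizes, weight scales, and the one-hot state dimension are assumptions, not the paper's architecture.

```python
import numpy as np

# Minimal sketch of f_theta(x) -> (value v, policy p) over 12 cube moves.
# Hidden size and weight initialization are hypothetical.
rng = np.random.default_rng(0)
N_ACTIONS = 12        # 6 faces x 2 turn directions
STATE_DIM = 20 * 24   # one-hot cubie/orientation encoding (assumed size)

W_h = rng.normal(scale=0.01, size=(STATE_DIM, 64))
W_v = rng.normal(scale=0.01, size=(64, 1))
W_p = rng.normal(scale=0.01, size=(64, N_ACTIONS))

def f_theta(state):
    """Return (scalar value estimate, softmax policy over the 12 moves)."""
    h = np.maximum(state @ W_h, 0.0)      # shared trunk, ReLU
    v = (h @ W_v).item()                  # value head: scalar
    logits = h @ W_p                      # policy head: 12 logits
    p = np.exp(logits - logits.max())
    return v, p / p.sum()

x = rng.random(STATE_DIM)
v, p = f_theta(x)
```

The shared trunk with separate value and policy heads mirrors the combined-network design the paper describes, but everything else here is a stand-in.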
Rubik’s Cube is an extreme sparse-reward environment: only the solved state provides a positive reward signal, and random interaction almost never reaches it at meaningful depths.[1:3] ADI avoids this failure mode by generating its own curriculum: training states are produced by scrambling backward from the solved state, so every sample lies a known number of moves from the reward, and learning targets are bootstrapped from a one-step lookahead rather than from long random rollouts.
Training inputs are created by starting from the solved state $s_{\text{solved}}$ and applying sequences of random moves to obtain states at increasing scramble distances.[1:5]
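This sampling step can be sketched as follows. The `apply_move` callable is a stand-in for a real cube simulator, and the toy "state" below is just the accumulated move history; the parameter names (`k_max` scramble depth, `l_runs` number of scrambles) are assumptions.

```python
import random

MOVES = [f + d for f in "UDLRFB" for d in ("", "'")]  # 12 quarter-turn moves

def generate_training_states(k_max, l_runs, apply_move, solved, rng):
    """ADI sampling sketch: for each run, scramble from the solved state
    with 1..k_max random moves, recording every intermediate state and
    its scramble depth (the depth is reused later for loss weighting)."""
    samples = []
    for _ in range(l_runs):
        s = solved
        for depth in range(1, k_max + 1):
            s = apply_move(s, rng.choice(MOVES))
            samples.append((s, depth))
    return samples

# Toy stand-in "cube": the state is simply the move history (hypothetical).
rng = random.Random(0)
data = generate_training_states(k_max=5, l_runs=3,
                                apply_move=lambda s, m: s + (m,),
                                solved=(), rng=rng)
```

Each run contributes every intermediate state, not just the endpoint, so shallow states are sampled at least as often as deep ones.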
For each training state $x_i$, ADI performs a depth-1 breadth-first search (BFS) expansion over all actions $a$ and evaluates the current network on the children:

$$\big(v_{x_i}(a),\, p_{x_i}(a)\big) = f_\theta\big(A(x_i, a)\big)$$

where $A(x_i, a)$ is the child state reached by taking action $a$ from $x_i$. Targets are defined using a max-backup rule:

$$y_{v_i} = \max_a \Big( R\big(A(x_i, a)\big) + v_{x_i}(a) \Big), \qquad y_{p_i} = \operatorname*{arg\,max}_a \Big( R\big(A(x_i, a)\big) + v_{x_i}(a) \Big)$$

with reward $R(s) = 1$ if $s$ is solved and $-1$ otherwise. This yields a supervised-learning dataset of $(x_i, y_{v_i}, y_{p_i})$ tuples that approximates an improvement step in classic dynamic programming / value-based learning.[1:6][2][3]
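The depth-1 expansion and max backup can be made concrete with a small sketch. Here `child_of`, `reward`, and `net_value` are hypothetical stand-ins for the cube simulator and the current network; the toy numbers at the bottom are hand-picked.

```python
def adi_targets(x, actions, child_of, reward, net_value):
    """Depth-1 expansion + max backup:
       y_v = max_a [ R(A(x,a)) + v(A(x,a)) ],
       y_p = argmax over the same backed-up scores."""
    backed_up = []
    for a in actions:
        c = child_of(x, a)
        backed_up.append(reward(c) + net_value(c))
    y_v = max(backed_up)
    y_p = backed_up.index(y_v)   # index of the best action (one-hot target)
    return y_v, y_p

# Toy example with 3 actions; action 2 reaches the solved state (R = +1).
vals = {0: -0.5, 1: 0.2, 2: -0.1}
rews = {0: -1.0, 1: -1.0, 2: 1.0}
y_v, y_p = adi_targets("x", [0, 1, 2],
                       child_of=lambda x, a: a,
                       reward=lambda c: rews[c],
                       net_value=lambda c: vals[c])
```

Because the reward dominates the backed-up score near the goal, the targets propagate correct values outward from the solved state even when the network's own estimates start out arbitrary.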
The network is updated to fit the value targets (regression) and policy targets (classification). DeepCube reports using RMSProp, mean-squared error for value, and softmax cross-entropy for policy.[1:7]
To reduce divergence, the DeepCube implementation weights samples closer to the solved state more strongly (inverse-distance weighting).[1:8]
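A minimal sketch of that weighting for the value loss, assuming the weight is $1/D(x_i)$ where $D(x_i)$ is the number of scramble moves used to generate $x_i$ (the normalization below is an assumption):

```python
def weighted_value_loss(preds, targets, depths):
    """Mean-squared error where each sample is weighted by 1/depth,
    so states near the solved state dominate the update."""
    total, total_w = 0.0, 0.0
    for v, y, d in zip(preds, targets, depths):
        w = 1.0 / d
        total += w * (v - y) ** 2
        total_w += w
    return total / total_w

# Two samples: the deeper sample's larger error counts for half as much.
loss = weighted_value_loss(preds=[0.0, 0.0], targets=[1.0, 2.0], depths=[1, 2])
```

Down-weighting deep, noisy targets keeps the bootstrap stable while reliable near-goal values are learned first.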
ADI does not directly produce a solver on its own. Its role is to produce a model whose outputs are used to guide planning:

- the value head estimates how promising each visited state is, and
- the policy head supplies move priors that bias which branches the search expands first.
DeepCube then combines these signals with an asynchronous MCTS procedure to obtain explicit move sequences on fully scrambled cubes.[1:10]
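The way the priors and values steer the tree policy can be illustrated with a PUCT-style selection rule. This is a sketch in the spirit of DeepCube's tree policy, not its exact formula: the exploration constant and the precise exploration term are assumptions.

```python
import math

def select_action(priors, q_values, visit_counts, c_explore=1.5):
    """Pick the action maximizing Q + U, where the exploration bonus U
    grows with the network's prior and shrinks with the visit count."""
    total_n = sum(visit_counts)
    best_a, best_score = None, -math.inf
    for a, (p, q, n) in enumerate(zip(priors, q_values, visit_counts)):
        u = c_explore * p * math.sqrt(total_n + 1) / (1 + n)
        if q + u > best_score:
            best_a, best_score = a, q + u
    return best_a

# At an unvisited node, the network's prior alone decides the first expansion.
a = select_action(priors=[0.7, 0.2, 0.1],
                  q_values=[0.0, 0.0, 0.0],
                  visit_counts=[0, 0, 0])
```

Once children have been visited, the value estimates backed up into `q_values` take over from the prior, so the search gradually trusts observed evidence more than the network's first guess.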
Stephen McAleer, Forest Agostinelli, Alexander Shmakov, Pierre Baldi. Solving the Rubik’s Cube Without Human Knowledge. arXiv:1805.07470, 2018.
Richard Bellman. Dynamic Programming. Princeton University Press, 1957.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.