Within these pages are recorded my attempts to wield the highest arcane art and conjure minds that play the game of CodeCraft.
As with all advanced AI technologies, our tale begins with hacky plumbing that lets our game speak in the serpent’s tongue and links its fleeting worlds with magic mirrors made of chrome that let us peer inside. Even now, with our project just begun, we have committed numerous atrocities, including blocking inside async code and “JVM based enterprise-grade server framework”. To complete the cursed contraption we disguise CodeCraft as a gym, conjoining it with algorithms bestowed upon us by the Baselines. All that remains to seal the spell: a function meting out reward.
Modest is our first demand: to maximize the score, move to the origin, as closely as you can. Lo and behold, three weeks of effort culminate in utter failure on this most trivial of tasks:
But blame lies squarely with our own impatience, which has cut short the life of hapless policies before they had a chance to prove their worth. Requests, now batched, fly forth with tenfold speed, and suddenly we see first signs of life. Quickly we pursue the signal, tuning our incantation, ’til, finally, our policies can master simple navigation within mere millions of steps of their creation:
Recalling lessons taught to us by grizzled SREs, we animate untiring golems to supervise experiments and transcribe detailed records that we can review at our leisure. Dozens of attempts per day are aimed at finding further insight and soon discover new improvements. Two tricks of yore, well known to the initiated, prove particularly effective: award not the negation of the distance, but its change as the reward, and stagger start times of concurrent games to get decorrelated samples. A nuisance throughout all of this, the code proves an effective fuzzer, and turns up bugs galore in CodeCraft’s faulty engine.
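The two tricks of yore can be sketched in the serpent’s tongue. This is a minimal illustration under our own assumptions, not the project’s actual code: the names `delta_distance_reward` and `staggered_offsets` are hypothetical, and positions are taken to be plain `(x, y)` pairs.

```python
import math


def distance_to_origin(pos):
    """Euclidean distance of an (x, y) position from the origin."""
    return math.hypot(pos[0], pos[1])


def delta_distance_reward(prev_pos, pos):
    """Reward the *change* in distance: positive when the drone moved closer."""
    return distance_to_origin(prev_pos) - distance_to_origin(pos)


def staggered_offsets(num_envs, episode_length):
    """Evenly spaced start offsets (in ticks) so that concurrent games
    sit at different phases of an episode (hypothetical helper)."""
    return [i * episode_length // num_envs for i in range(num_envs)]
```

Summed over an episode, these rewards telescope to the starting distance minus the final distance, so the optimum is unchanged while every single step now carries a signal.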
Arriving at the origin is but a stepping stone. Henceforth, a new reward: Move as closely as you can towards the largest crystal to be found. Proprioception insufficient, drones must sense all nearby objects to succeed. A first attempt: concatenate the features of the closest ten into a single vector, zero-padding missing entries, and feed it to the network whole. To our surprise ’tis all it takes and neural networks soon make sense of data so imperfectly supplied. We notice that the features for each crystal twice contain the x, omit the y, and vanquishing the bug we solve the task:
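A minimal sketch of such an observation vector, assuming each crystal is an `(x, y, size)` triple and that three features per crystal suffice. The function name and the feature layout are our own invention, not CodeCraft’s.

```python
import math

MAX_OBJECTS = 10          # the closest ten, as described above
FEATURES_PER_OBJECT = 3   # assumption: relative x, relative y, size


def encode_nearby(drone_xy, crystals):
    """Flatten the nearest crystals into one fixed-length, zero-padded vector."""
    x0, y0 = drone_xy
    nearest = sorted(
        crystals, key=lambda c: math.hypot(c[0] - x0, c[1] - y0)
    )[:MAX_OBJECTS]
    features = []
    for x, y, size in nearest:
        # Both x and y this time, unlike the buggy version that repeated x.
        features.extend([x - x0, y - y0, size])
    # Missing entries are padded with zeros up to the fixed length.
    padding = MAX_OBJECTS * FEATURES_PER_OBJECT - len(features)
    features.extend([0.0] * padding)
    return features
```

The fixed length is what lets an ordinary feed-forward network swallow the vector whole, whatever the number of crystals in sight.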
A drone can win no games alone, it must go forth and multiply; we shall reward for every offspring. The solution to the task is much the same as just before but sparsity makes learning it much harder. No longer does the drone receive reward on every tick. Instead, it must, perchance, seek out a crystal, harvest, build, and then divine which of its many prior actions led it to success. Ten thousand steps may pass before it can succeed just once. We try to help, award a partial score for crystals gained, but drones now hoard as decreased score from crystals spent exceeds discounted future gains. Eventually, we find a better path: to leave reward untainted, increase gamma, trust the math. At this time, too, our powers grow, and policies now hone their skills on countless cores. And yet perfection still eludes:
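Why a larger gamma helps can be read off a little arithmetic. The sketch below uses two illustrative discount factors, 0.99 and 0.9999, which are assumptions of ours and not the values of the original runs; only the ten thousand steps come from the tale above.

```python
def discounted_weight(gamma, delay):
    """Weight given today to a reward that arrives `delay` steps from now."""
    return gamma ** delay


def effective_horizon(gamma):
    """Rule-of-thumb horizon 1 / (1 - gamma) over which rewards still matter."""
    return 1.0 / (1.0 - gamma)


if __name__ == "__main__":
    for gamma in (0.99, 0.9999):  # illustrative values only
        print(gamma, effective_horizon(gamma), discounted_weight(gamma, 10_000))
```

With gamma at 0.99, a reward ten thousand steps away is discounted to practically nothing; at 0.9999 it keeps roughly a third of its worth, so the untainted sparse reward can still reach the actions that earned it.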
The cause revealed, another bug: we don’t keep track of drones with stable ordering so features jump between all allies, rendering them blind. Sight is restored, thrice-fold increased the score, success!
Deeper magic is required to proceed, and so we must outgrow constraints imposed upon us by the Baselines. At first we try to mold the code and make it our own, but sadly find we are no match for TensorFlow, a fearsome foe with countless victims that include great sages. We place our faith in a more lenient God instead and so a bright torch lights the way as we descend down into darkness.
Crossing the Valley of Despair
With reignited fervor, we learn to cast the grand old spells ourselves and soon add REINFORCE to our arsenal. Our metrics, that have just escaped the shackles cast by TensorBoard, are brought to life anew by One the Bee, most benevolent of spirits, who henceforth joins us on our quest. For one whole moon, we weave deep lore into the code and excise nasty bugs. Imbued now with the full extent of GAE and PPO, our spells should once again perform just as before. But as we try them on our hardest task we find them either slow or wickedly unstable. Many times we meditate, reread the lore, examine incantations, but further bugs resist all revelation. The crucial insight might have been produced by painstaking comparison of latent states induced by both implementations, but once again the TensorFlow gives rise to great frustration. Dark and unusual thoughts of giving up emerge from the abyss, returning stronger each time they’re dismissed. Lest our quest so meet a premature demise, we swear a pact to push for three more moons until we see the sun of the next solar cycle rise.
The Fates are not yet done with us and send assistance through a fellow bright disciple making known his own interpretation of the lore which we adopt in place of our own. At first, we stumble with the alien magic, but soon adapt a new disguise for our world and learn to shape our tensors as required. Alas, the instabilities persist. Perhaps the Baselines used some tricks not mentioned in the lore? With careful study we reveal a value function head initialized to naught and clipping of its loss. We add the tweaks but find they’re not the cause of our despair.
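For the curious, the two unwritten tricks might look roughly like this. It is a sketch in plain Python for a single sample, in the form commonly seen in PPO implementations; the function names and the default `clip_range` of 0.2 are assumptions of ours, not a transcription of the Baselines.

```python
def clipped_value_loss(value, old_value, ret, clip_range=0.2):
    """Value loss where the new prediction may stray at most `clip_range`
    from the prediction made when the sample was collected."""
    delta = max(-clip_range, min(clip_range, value - old_value))
    clipped_value = old_value + delta
    unclipped_loss = (value - ret) ** 2
    clipped_loss = (clipped_value - ret) ** 2
    # Taking the larger of the two losses makes the objective pessimistic.
    return 0.5 * max(unclipped_loss, clipped_loss)


def zero_init_value_head(num_inputs):
    """Weights and bias of a linear value head, all initialized to naught,
    so that every state starts out with a predicted value of zero."""
    weights = [0.0] * num_inputs
    bias = 0.0
    return weights, bias
```

In a real network the same idea would be applied to whole batches of tensors; the scalar version merely shows the shape of the spell.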
Delving into records spawned by both implementations, we notice strange connections. Could it be that our sorrow’s source is to be found in unrelated code? An easy test, we rerun old configs on simpler tasks tested before and find performance similarly poor. Reverting to the simpler spells helps not and once again we carefully review all diffs. A fateful pair of small commits now catches our eye. After moving serialization out of the serpent’s slow domain, we added but a single number that conveys the total strength of allied forces to calculate rewards. This quantity can grow to tens, nay, hundreds strong and, fed into the network, wreaks havoc on the artificial brains. Oh, if only we had listened to the wisdom of our elders who admonished us to normalize all inputs. The number tamed, we watch the metrics of our latest run with anxious apprehension when suddenly they soar to heights not reached before! We trap the moment in a spell, and though since then distorted badly by the winds of time, it still remains until this very day a true epitome of joy.
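How the number was tamed the chronicle does not say. One common way to heed the elders is a running normalizer that tracks mean and variance as samples arrive; the sketch below uses Welford’s method, and the class name `RunningNormalizer` is hypothetical.

```python
import math


class RunningNormalizer:
    """Tracks a running mean and variance and rescales a scalar input
    to roughly zero mean and unit variance (one possible remedy)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the mean
        self.eps = eps  # guards against division by zero

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / self.count if self.count > 0 else 1.0
        return (x - self.mean) / math.sqrt(var + self.eps)
```

Fed the raw strengths 10, 20 and 30, the normalizer maps 20 to zero and 30 to a little over one, a range in which artificial brains are far more at ease than among the hundreds.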
With the magic back under control we once again start making progress on our quest. Collecting crystals all but solved, it’s time to try our hands at battle. We grant our policies another sense, to see all nearby drones, and order them to win in combat. In an initial test the mothership and one opponent share a tiny world. It works, but not as planned: instead of going quickly for the kill the mothership first finds some tiny crystals placed by accident to build and reach a higher score before it closes out the game:
Complexity of strategy cannot emerge without opponents fighting back. We seek Selfplay, a truly terrifying force with powers of creation, and channel it to do our bidding. A first attempt pits deadly seeker against unarmed hider. The task’s asymmetry proves potently destabilizing: Whichever role is easier to master quickly wins all games and stalls all learning ’til abilities regress. We tweak and randomize the size and speed of all contestants which allows us to proceed:
The time has come to lift the final limitation; let policies wield many drones at once. We give two drones to each contestant, either 1m and 1p, or 1m twice, a challenge that turns out to be much harder than expected. While lacking long-delayed rewards of prior tasks, good micro exacts flawless movement that depends on subtle aspects of all nearby drones’ positions. Many weeks we spend refining our code until first hints of strategy emerge:
One strange failure case remains, but is revealed by glimpse into the future as a bug not meant to be discovered yet. Forging onwards we create another world. Size of the map increased, the mothership 2s2c, it calls for all the skills we’ve taught our drones so far combined. The orange policy named Polar Fire shows promise right away, and three days later is defeated three to one by newly trained Whole Sun:
Just in time before departing to the annual gathering for processors of information, we double our powers once again with surging liquids keeping cool intricate mechanisms steeled in the forges of Aen Vidia. Returning to ancestral lands to there observe the solstice our masters grant us short respite, and so we can spend much of our time devoted solely to the craft. We send out sweeping hyperscans to find the rarest of params, employ enchanted value functions that peer beyond the fog of war, implore the God of Entropy, observe rotational invariance and scatter hidden states in convoluted cylinders. To guard it all, we summon a scaled multi-headed beast with restless eyes and piercing gaze which judges all who dare approach and lets just those deemed worthy pass.
Rich Lake, the culmination of our work, in blue, defeats Whole Sun by nine to one:
Notwithstanding our success and steady increase of all skills, our drones remain at times beset by twitching spasms. Our failed hypotheses as to the cause are now too numerous to count. So often have we seen the curse that we have come to recognize familiar patterns. We slow down time to but a crawl and pull two structures from the chaos: LR when turned away from crystals, LLFRRF when facing them at a right angle, both forming stable loops.
It is as if the drones intend to walk the ideal path but act on information from the past. We know now where to look, and take apart the cursed contraption once again. With its true name revealed, the fiendish demon can no longer hide and is soon cornered and expelled. We see a burning afterimage of its form: a race condition, triggering when we observe the game, delaying actions by one tick. Not the Heisenbug of yore, but a close relative for sure, and just as fierce. Now unpossessed by evil spirits, our policies spring forth devoid of obvious flaws:
We set a harder task with larger map and three more builds: 1m1p, 1s1c, 2m. Some tweaks and then it works though hardly well:
Our policies lack memory, cannot recall that which they saw just one step in the past. Whenever enemies retreat into the fog of war, the fact of their existence is immediately forgotten. And scouting, too, can’t well be done if visited locations fail to be remembered. On smaller maps ’twas not an issue but now the limitation becomes clear.
An ancient spell, LSTM, will soon dispel our problem. To cast it right our rituals must be rearranged, and while it works on proof-of-concept tasks it fails to bring the hoped-for gains. We start to doubt this fickle spell is our path to success; it breaks the symmetries we need to make our features most efficient. Previous attempts at sharing inputs between drones had failed and sharing memories seems harder yet. Too difficult and risky of an undertaking this late in our quest, we look to other options. Not quite as elegant, but simple and effective, we engineer two features: Enemies, when hidden by the fog of war, still show their previous position, and ghostly objects mark locations of the map that benefit from scouting.
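The first of the two features can be sketched as a small ledger of last-known positions. All names here are hypothetical, and we assume that enemies carry stable ids and that a visibility flag accompanies each entry; the real feature may well be shaped differently.

```python
class EnemyMemory:
    """Remembers where each enemy was last seen, so that foes hidden
    by the fog of war still appear at their previous position."""

    def __init__(self):
        self.last_seen = {}  # enemy id -> (x, y)

    def observe(self, visible):
        """`visible` maps enemy id -> (x, y) for enemies currently in sight.
        Returns (id, x, y, is_visible) for every enemy ever seen,
        sorted by id so the ordering stays stable between ticks."""
        self.last_seen.update(visible)
        return [
            (enemy_id, x, y, enemy_id in visible)
            for enemy_id, (x, y) in sorted(self.last_seen.items())
        ]

    def forget(self, enemy_id):
        """Drop an enemy known to be destroyed."""
        self.last_seen.pop(enemy_id, None)
```

The memory thus lives in the features rather than in the network, which keeps the policy itself free of recurrent state.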
In parallel to these endeavors we optimize the incantation and figure out how we can train our policies for longer: Lucky Voice One Oh Five M beats Glad Breeze Two Five M by 3 to 1:
Still less than optimal, we see no longer major faults caused not by lack of memory. Encouraged by this quick progression, why not proceed to the full game: A world whose size is many times that of those that came before, a mothership 10 modules strong, a great variety of builds.
We train on smaller versions of the map, and still, just one run out of four survives the rigors of this task. On a whim, we now attempt the challenge we set out to meet: surpass the Replicator, a mindless mechanism built in our youth that follows rules carved into stone. To our surprise, the Curious Dust defeats it on occasion, albeit on smaller maps and without great imagination:
On map size five we’re still entirely outclassed:
But surely now our victory is guaranteed, a matter just of time and more refinements. Yet progress stalls, no changes work, and Curious Dust remains the champion. Our policies make no effective use of past information, produce exclusively 1m, play passively and fail to mount attacks. Maybe a bug in memory? But no, on tasks designed specifically to test this skill it works quite well.
Are micro-skills with many types of drones too difficult to learn? ’Tis not the case, when forced to face and steer a wide array of drones in battles to the death, the policies learn good control and group to take down bigger foes:
So say we mix this micro practice into normal training runs? No large effect. Another month where all our tools yield no results, no tuning works and longer training shows no benefit. We randomize the mothership but policies just overfit to all the diff’rent types. Shaped rewards yield no success and no more gains are found in better architectures. And so our policies from weeks ago remain still unsurpassed. The final goal, which seemed so close, we failed to grasp, and at long last, we must admit defeat and try to be content with what we have achieved so far.
As if to seal the final nail, now nature rises up and clogs our prized machines. Our tools kaput, we rest, recharging our depleted mana. When we return, the plan is clear: one last methodical attempt to realize our method’s full potential. A daunting task as three imposing eldritch horrors block our path. Undefeated as of yet, our previous attempts did manage to reveal their shape:
- The mothership, equipped with mighty weapons and impenetrable shields, annihilates whoever strays into her path. All drones, consumed by fear, will quickly learn to stay away and simply spin in place.
- Small maps cannot teach scouting skills, and large ones prevent learning altogether.
- Catch 22: not knowing how to micro other builds, production soon collapses onto drone 1m, precluding any chance of practice.
Together the triumvirate holds strong, but perhaps one we can defeat. A ruse: we change the world to one already solved, but with a twist, and so the mothership stands all alone without support from other foes:
Other options long exhausted, we turn to dark forbidden arts. With intricately shaped rewards we try to induce smoother paths. But our transgression’s sole reward is one amusing failure as enemies collude to kill off excess drones. This plan then, too, must be abandoned:
We pivot to a direct angle of attack: The mothership is plain too strong, so weaken it. Previously such efforts failed, but now we add one key ingredient by making randomness continuous. In some games now the mothership is strong, in others weak, but infinitely many points lie in between and policies, conditioned on the true amount, can slowly grow in strength and transfer skills from one game to the next. Our aim was true, the strike connects, a hellish scream, the foe drops dead:
One down, two more to go. We brace for battle with our newfound heavy hitting hammer held in hand. Our foes, so clearly nails, shall surely shatter on the next assault. A mighty swing of deadly force: all module costs are randomized, the policy conditioned once again, and woe to those who still just build 1m when cheaper builds of stronger drones abound:
A fatal blow, but dusk draws near. No matter, there’s no need for sleep, a boundless force of racing thoughts compels us forth at breakneck speed. And so the final touch that spells the last opponent’s doom: we randomize the map and grow it gradually in size. We start the run, and watch with glee as our latest students train. Rejoice! Just before a distant dream, Wandering Eon now slays the Replicator fair and square in 1 game out of 4.
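The hammer swung in these three battles is continuous randomization with a policy conditioned on the true values. A rough sketch follows; the parameter names, the ranges, the single `cost_multiplier` standing in for per-module costs, and the linear growth schedule are all assumptions of ours.

```python
import random


def sample_game_config(rng, progress):
    """Draw one game's settings. `progress` in [0, 1] is the fraction
    of training completed and widens the range of map sizes over time."""
    return {
        # Anywhere between helpless and full strength, not just two settings.
        "mothership_strength": rng.uniform(0.0, 1.0),
        # Hypothetical stand-in for randomized module costs.
        "cost_multiplier": rng.uniform(0.5, 1.5),
        # Maps start small and are allowed to grow as training advances.
        "map_scale": rng.uniform(0.2, 0.2 + 0.8 * progress),
    }


def conditioned_observation(base_obs, config):
    """Append the true randomized values so the policy can condition on them."""
    return list(base_obs) + [
        config["mothership_strength"],
        config["cost_multiplier"],
    ]
```

Because the sampled values form a continuum, skills learned against a feeble mothership can carry over, little by little, to games against a mighty one.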
Swiftly do we now progress, and our methods are refined over the course of many moons.
Our quest long since complete, we have scarce reason to procrastinate and must embark on one more final task. Though so much more could still be done, the time has come to lay down arms, take up the quill, record the knowledge gained, and make our own small contribution to the grand edifice of machine learning.
Thanks to Anssi Kanervisto for reviewing… whatever this is. I have taken inspiration from many sources and credit assignment is as of yet an unsolved problem, but some related creations that I’ve enjoyed in the past include the Bastion of the Turbofish, the description of the Rustonomicon, King James Programming, and, of course, Hexing the technical interview and friends.