How to depress a computer without really trying
There exists a fighting game called MUGEN. It's been in development since (holy shit I just checked) 1999. The really novel thing about it is that it's a freeware fighting game engine that supports characters and stages as plugins.
I will not be talking about MUGEN.
There exists a website called SaltyBet. It's existed since (their wiki has a question mark) 2013. It uses Twitch to stream bots fighting in MUGEN and players can bet virtual currency, called Salt, on these fights. SaltyBet pays for itself by players subscribing and buying Salty Gold. Gold, to the best of my knowledge, lets you vote on fights showing up in exhibitions and gives you stats about fighters. I looked at this last detail and thought, "Yeah right, I can just scrape that info."
I'm not certain when I started scraping it. I have a file with 11556 fights dated to September of 2015. I know that some time in 2016, when visiting a friend, I started cobbling together a small pile of machine learning in order to exploit this info and guide my bets.
Denial
The basic question I want answered is "Who do I bet on?"
The available data I have for each fight is, of course, the fighters' names and who won. This is sufficient for just checking "has this fight happened before?" But I also know the length of the fight, the tiers of the fighters, how "loud" chat is (how many lines of text were posted), their win streaks, their previous tiers, and how much people bet on each fighter. The JSON blob that SaltyBet returns actually says what each bettor's bank balance is, too, so I can see if people bet the entire farm on a particular fight.
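For concreteness, one scraped fight might look something like this. This is a minimal Python sketch; the field names are mine, not SaltyBet's actual schema.

```python
from dataclasses import dataclass

@dataclass
class FightRecord:
    """One scraped fight. Field names are illustrative, not SaltyBet's schema."""
    fighter_a: str
    fighter_b: str
    winner: str            # "A" or "B"
    duration_seconds: int
    tier: str              # the tier the fight took place in
    chat_lines: int        # how "loud" chat was
    streak_a: int
    streak_b: int
    salt_bet_on_a: int     # total Salt wagered on each side
    salt_bet_on_b: int
```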
It's easy enough to just aggregate this data into "This guy wins 60% of the time and this guy wins 55% of the time" but I have so much more data available and, most importantly, I don't know how to weight the data. Is their streak length important at all? How about how recently they've changed tiers? And should I read anything into the average length of their fights?
Anger
The first tool I reached for was Decision Tree Learning. This is a technique that takes many labeled examples, each with a set of attributes, and uses those attributes to build a decision tree. A classic example would be classifying mushrooms as poisonous or not. Each example would have attributes like cap-color:brown, cap-shape:bell, odor:almond, gill-spacing:crowded, and so on. From a big aggregate of this data you decide which attributes to split on, and when, to create a decision tree that you can follow to the end with a novel example. The leaf you end up at hopefully contains only one class, in the mushroom case poisonous or non-poisonous. That means that, at that point, the training examples matching those attribute values consisted exclusively of poisonous or of non-poisonous mushrooms.
Sure, some brown mushrooms are poisonous and some aren't. But every brown mushroom that also has an almond smell might be totally safe! That is the information that this technique tries to distill.
(aside: do not take any advice from a machine learning model frequently implemented by students about which mushrooms will kill you)
If we get to the end, having exhausted all our attributes, and we still have some poisonous and some non-poisonous shrooms, what can we do? Just figure out the counts and be like "Yeah, at this point there's an 85% chance you'll live."
One of the convenient things about this technique is that redundant attributes, that is, attributes that have the same predictive power as other attributes, are simply ignored. If every mushroom that's almond-scented is lethal that'll (probably) be the first split the algorithm does. It seeks to split on the best available predictor.
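To make that concrete, here's a minimal sketch of the kind of ID3-style tree building I'm describing: measure entropy, split on the attribute with the best information gain, and leave class counts at the leaves. This is illustrative Python, not the Lua I actually wrote, and the names (build_tree, classify) are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(examples, attributes):
    """Pick the attribute whose split gives the biggest information gain."""
    base = entropy([label for _, label in examples])

    def gain(attr):
        remainder = 0.0
        for value in {feats[attr] for feats, _ in examples}:
            subset = [label for feats, label in examples if feats[attr] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)

def build_tree(examples, attributes):
    """Recursively split on the best attribute; leaves keep class counts."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels)  # leaf: "at this point, 85% you'll live"
    attr = best_attribute(examples, attributes)
    children = {
        value: build_tree(
            [(f, l) for f, l in examples if f[attr] == value],
            [a for a in attributes if a != attr],
        )
        for value in {feats[attr] for feats, _ in examples}
    }
    return (attr, children)

def classify(tree, features):
    """Follow a novel example down the tree; return the majority class."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children.get(features[attr], Counter())
    return tree.most_common(1)[0][0] if tree else None
```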
One of the inconvenient things (in the toy implementation I wrote) about the algorithm is that you need to have attributes that are discrete values. It won't work correctly on a real number like win-rate! It'll see 0.5, 0.501, 0.502, etc, all as discrete values. It'll still work, but poorly. It won't understand that 0.501 is between 0.5 and 0.502, so it won't generalize anything useful. Therefore it's necessary to bucket the data. Today, I would use an algorithm like k-means to find clusters, but I went with a potentially cursed choice:
When comparing the winrate of Fighter A to the winrate of Fighter B, I computed (fArate+1)/(fBrate+1). I then fit a 1/x curve to it, with 11 buckets, to find how "evenly matched" they were. So two evenly matched combatants, for this attribute, would go in the 1 bucket. If Fighter A was x times better, the pair would land in the 10's bucket; conversely, if Fighter A was x times worse, they would go into the 0.1 bucket.
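I don't remember the exact curve fit, but a plausible sketch of that bucketing, assuming 11 buckets spread over a log scale from 0.1 to 10 (with anything beyond that clamped into the end buckets), looks something like this:

```python
import math

def ratio_bucket(a_rate, b_rate, n_buckets=11):
    """Drop the (fArate+1)/(fBrate+1) ratio into one of 11 buckets.

    The middle bucket is "evenly matched" (ratio near 1), the top bucket is
    "A is ~10x better", the bottom is "A is ~10x worse".
    """
    ratio = (a_rate + 1.0) / (b_rate + 1.0)
    # Map the ratio into [-1, 1] in log10 space, then spread it over the buckets.
    log_ratio = max(-1.0, min(1.0, math.log10(ratio)))
    return round((log_ratio + 1.0) / 2.0 * (n_buckets - 1))
```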
So an example fight was the aggregated properties of Fighter A divided by the corresponding aggregated properties of Fighter B, with Fighter A always as the numerator and Fighter B as the denominator, and the class I was training on was "A" or "B", depending on who won.
As an aside, I could basically double my training data for free through a technique called data augmentation. I would just swap Fighter A and Fighter B, flip the class, and add that as an additional training example. This was especially important so that the model would not learn any bias towards always betting A or B: it saw an exactly even number of each.
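A sketch of that augmentation, assuming each example is still a dict of raw A-over-B ratios (before bucketing) plus a label:

```python
def mirror(examples):
    """Double the training data by swapping Fighter A and Fighter B.

    Assumes each example is (ratio_features, label), where every feature is
    an A-over-B ratio of the form (a+1)/(b+1) from earlier (so it's never
    zero). Swapping the fighters inverts each ratio and flips the label.
    """
    flipped = [
        ({name: 1.0 / value for name, value in feats.items()},
         "B" if label == "A" else "A")
        for feats, label in examples
    ]
    return examples + flipped
```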
Finally, I added a technique called Random Forest. It avoids overfitting by making multiple trees and only training each tree on a subset of the data. This was especially convenient because it meant I could train 5 trees, each with 80% of the data, and then interrogate them about the 20% they weren't trained on.
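In sketch form, reusing the hypothetical build_tree and classify from earlier:

```python
import random

def train_forest(examples, attributes, n_trees=5, holdout=0.2):
    """Train each tree on a random 80% of the data and score it on the
    20% it never saw, using the build_tree/classify sketched above."""
    forest = []
    for _ in range(n_trees):
        shuffled = examples[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - holdout))
        training, held_out = shuffled[:cut], shuffled[cut:]
        tree = build_tree(training, attributes)
        accuracy = sum(
            classify(tree, feats) == label for feats, label in held_out
        ) / len(held_out)
        forest.append((tree, accuracy))
    return forest
```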
This reported about 70% accuracy, which I validated by opening saltybet, starting an excel sheet, and betting.
Bargaining
70% is great and all, but surely we can bet better even without actually making our model more accurate. The predictor has a flaw in that it always suggests who to bet on. It makes the fatal mistake of conflating strategic and operational concerns: respectively, "do we bet?" and "who do we bet on?" The first question has been assumed to be "yes", so we always go ahead and calculate a bet. What if we had a way to tell if we should bet at all?
This model, our self-evaluator, would understand if we fell into an edge case where our data misled us, where our predictions were unreliable, where we needed to tread carefully. We can create a second set of training data by evaluating the first five trees on data they've never seen, with the label now being not "A" or "B" but "correct" or "incorrect". This is essentially the same training pipeline, only instead of our class being the class recorded in the training data, it's whether our predicted class matches the recorded class.
Given that our predictor has an earlier-mentioned 70% accuracy we'll have 70% of our labeled examples as correct and 30% as incorrect. I, for not particularly mathematically-sound reasons, decided that I wanted it to take particular note of the cases where it was wrong and weigh those more heavily. If I wanted it to have an effective ratio of training data of 50/50, I would increase the weight of incorrect cases by 1/70% = 1.4286. A simple tweak to the algorithm, when it counts up the number of examples in the node belonging to each class when deciding to split, lets me say "Now, when you split this node, pretend that there's more of this class."
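The tweak might look something like this, assuming the split-selection code calls a helper to count classes at each node (the helper and the weight values here are illustrative):

```python
def weighted_counts(node_examples, class_weights):
    """Count examples per class at a node, but let some classes count extra.

    Passing class_weights={"incorrect": 1.4286, "correct": 1.0} makes the
    split-selection code behave as if the incorrect examples were as
    numerous as the correct ones.
    """
    counts = {}
    for _, label in node_examples:
        counts[label] = counts.get(label, 0.0) + class_weights.get(label, 1.0)
    return counts
```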
This model cross-validated at, shockingly, 70% as well. I did not, back then, do a deeper analysis of the actual accuracy. I didn't break it down to true-positive, false-positive, true-negative, and false-negative. I somewhat suspect that, without my mucking about, the model would cross-validate higher. Surely there were cases where the predictor would correctly predict the winner, but the fight fell into a marginal case where there were 5 corrects and 4 incorrects in that terminal node of the tree. And were it not for the 1.4286 weight on the incorrects, the self-evaluator would have said "Yes, I'll bet correctly".
However, this weighting would not increase the false-positive rate. That is the most important rate for betting. If my false-negatives go up, or my true-positives go down, I simply accrue fake wealth more slowly. Put a pin in this thought.
Depression
But this technique where I make the training data the ratios of the fighters is cursed. Most importantly, we can't distinguish between fighters with high or low numbers in some categories. If we want to use the number of times we've seen a fighter, we can't distinguish between two fighters with 100 observed fights and two fighters with 10 observed fights: the ratio comes out to 1 either way. And while that might not influence our "A" or "B" decision, it's easy to imagine how it might influence our "should we bet?" decision.
So I rewrote it. Instead of using an old toy Decision Tree Learning algorithm I wrote in Lua for a class, I switched to Java and brought in a fairly bare-bones neural network library. Once I translated my data-processing pipeline I could start planning the actual training process.
One of the things you need to tweak for neural nets is the structure of the neurons. For classification tasks like this you generally want to use fully-connected layers. For image classification you want to start with convolutional layers, where each neuron on a layer pulls from only a handful on the layer above, but in the end you need to get a label out, and so you'll see a few fully-connected layers at the end. The quintessential example of this is AlexNet, which was also one of the first famous deep-learning networks, thanks in part to its use of the ReLU activation function.
So I picked a few structures that looked good, some deeper, some shallower, some wider, some thinner, and trained on 95% of the training data and used the last 5% to validate it. This is the brute-force technique I used to figure out what shape to use.
The specifics of training are simple: I instantiate a neural network with a given structure and pass labeled training examples to it. The network's weights are randomly initialized, so it takes the input data and starts tweaking the weights so that its actual output gets closer to the expected output. I repeat this process, going over the training data each time, slowly decreasing the magnitude of the tweaks on each iteration. The process stops when the amount to tweak comes within some very tiny threshold of zero.
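As a sketch, assuming the library exposes some per-example training call that reports how big its weight update was (the method name here is made up, not the library's actual API):

```python
def train(network, examples, learning_rate=0.1, decay=0.999, epsilon=1e-6):
    """Keep sweeping the training data, shrinking the tweaks each epoch,
    until the largest weight update falls within a tiny threshold of zero.

    network.train_on(features, label, rate) stands in for whatever
    single-example backprop call the library provides; it's assumed to
    return the magnitude of the weight update it applied.
    """
    while True:
        largest_update = 0.0
        for features, label in examples:
            update = network.train_on(features, label, learning_rate)
            largest_update = max(largest_update, update)
        learning_rate *= decay  # slowly decrease the magnitude of the tweaks
        if largest_update < epsilon:
            return network
```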
This cross-validated at 68%, which was annoyingly lower than Decision Tree Learning. I didn't realize it at the time, but looking back I suspect it's because I put non-normalized numbers in for training: some input neurons started at a range of 0-1, others started at a range of 0-millions. That's not a very good way to fly. At the very least, I should have taken the log of them.
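Something as simple as this would probably have helped. It's a sketch of the preprocessing I skipped, not anything the library provided:

```python
import math

def normalize(rows):
    """Log-scale the huge counts, then rescale every column to [0, 1].

    rows is a list of raw feature vectors, some of whose columns range
    0-1 and some of which range into the millions.
    """
    logged = [[math.log1p(v) if v > 1.0 else v for v in row] for row in rows]
    lows = [min(col) for col in zip(*logged)]
    highs = [max(col) for col in zip(*logged)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in logged
    ]
```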
Well, if that quality decreased a small amount, surely it'll be worth it because the new data will help the "should I bet?" model.
So I sorted my training data into two camps: correctly predicted and incorrectly predicted. I did a pass on the correctly predicted data, and then I did a pass on the incorrectly predicted data, with the above-mentioned 1/68% = 1.4706 multiplier applied to the amount to tweak.
One last important difference between neural nets and decision tree learning: DTL would happily report "100% A" or "25% B" depending on how many examples were in a terminal node. The neural net, however, is an approximation of a function, so all of its outputs are real-valued. Mine had two output neurons, A and B (or correct and incorrect), and I simply took the neuron with the greater output, and the real-valued number associated with it, as its percent confidence in that output.
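In code, that readout is about as simple as it sounds (a sketch; the output format here is assumed):

```python
def read_prediction(outputs):
    """Pick the bigger of the two output neurons and treat its value as the
    confidence. `outputs` is assumed to look like {"A": 0.73, "B": 0.27}
    (or {"correct": ..., "incorrect": ...} for the self-evaluator)."""
    label = max(outputs, key=outputs.get)
    return label, outputs[label]
```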
I started betting with this and noticed that the "should I bet?" predictor was only giving me two outputs:
- Mostly: 0.99998 incorrect
- Rarely: 0.995 correct
Why did the predictor, with such high confidence, almost always assess itself as incorrect?
Acceptance
You might notice the amount of detail I put into the order of examples I trained it on.
What had gone wrong is that I trained it, rather strongly, on the 32% of the data that was pure incorrect examples. At the end of every epoch of training, I blasted this poor artificial brain with a constant stream of "No, you fucked up". No wonder it got depressed! When I stopped the training process, the last hundred thousand examples had all been correcting it towards the same output; how could it have learned anything besides "just output this"? The only time it could output "correct" was when I hit a path that had never ever seen an example of being wrong.
A pessimistic decision to bet isn't the worst thing, so I rolled with this for a while. I was a bit burnt out, anyway.
These days, I use a Python library that provides a small neural network implementation. It's back to 71% accuracy, thanks to the fact that the library handles all the tricky things like normalizing, shuffling data for training, iterating epochs, and so on. My Python script also provides a REST server that a Tampermonkey script can use to automatically execute bets.
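The glue is about this complicated. This is a sketch using Flask with a stand-in predictor, not my actual script:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def model_predict(fighter_a, fighter_b):
    # Stand-in for the real pipeline: look up both fighters' stats, run the
    # network, and return (who to bet on, confidence). Hard-coded here.
    return "A", 0.71

@app.route("/predict", methods=["POST"])
def predict():
    """The userscript POSTs the two fighter names; we answer with who to
    bet on and how confident the model is."""
    fight = request.get_json()
    label, confidence = model_predict(fight["fighter_a"], fight["fighter_b"])
    return jsonify({"bet_on": label, "confidence": confidence})

if __name__ == "__main__":
    app.run(port=5000)
```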
There's a saying: Don't roll your own crypto. Cryptography can be a matter of life-or-death, if your user is in an oppressive environment, be that at the governmental or societal level. But that's when you're using it in production, for things that really matter. Whenever you make some software project, there's the core capability you're providing: gamedevs provide the game, payment processors provide the platform, and so on. But none of these guys have any business making their own database engine. Nobody does! Except the folks who make database engines.
The lesson, if that's what I'm even trying to convey here, is that your life is complicated enough without reinventing the wheel. The core capability my predictor provides is predicting. Recording the data and processing it is my wheelhouse; everything else I should phone in.
But this only applies to making a product. If you're trying to learn, roll your own X. Whatever you wanna learn about, tinker with it, break it, and come to understand why it broke.
The inspiration for the title comes from a blog post called "How to make a racist AI without really trying", which presents a series of perfectly reasonable steps for training an AI, then has it evaluate some sentences for sentiment, and it responds like:
My name is Emily: Great
My name is Yvonne: Good
My name is Shaniqua: Bad
when all "my name is" sentences should probably, in a world that conforms to some definition of "just", evaluate as neutral.
But that's not quite right, is it? What if I wanted a model that came up with an intimidating name for my villain? And you would probably expect a model, even in a "just" world, to evaluate "My name is Mr. Rogers" as "Excellent".
Even when you do everything right it is really easy to create a biased model. And, as I said, I rolled with the biased model for a while. All models are wrong; some are useful. All the models can do is reflect the biases that they are trained upon. Awareness of those biases is critical to making them useful.