AI cheats! (with real data)



Originally posted by Mannheim Tanker:

Great analysis! I believe you're confusing level of significance (commonly accepted as 0.05) with the probability of a Type I error (p-value). The commonly accepted value for significance in the p-value is 0.001 IIRC. It's easy to confuse the two, even if you deal with statistics on a somewhat regular basis.

Hehe...I'm sure someone will correct me if I'm wrong. redface.gif

IIRC, the level of significance, alpha, IS the probability of a Type-I error. This value is commonly set to be .05. The P-value that has been evaluated here is referring to the chance that the experimenter has of receiving those same results or results even more extreme under the current assumption that the AI does NOT "cheat". Hence, only 3 times out of 1000, according to the original experiment made here, would the experimenter expect to get those results or results even more extreme (in favor of the AI) assuming the AI does not have an advantage.

So, if the p-value, the chance of receiving said data or data even more extreme, is less than the chance of a Type-I error, then the results are statistically significant. The original hypothesis, that the AI does not have an advantage, should be rejected for the alternative hypothesis, that the AI does in fact have an advantage.
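Jim's decision rule (reject the null when p < alpha) can be sketched with a one-sided binomial tail. The duel counts below are made up for illustration; they are not Warren's actual data:

```python
from math import comb

def binom_p_one_sided(n, k, p0=0.5):
    """P(X >= k) for X ~ Binomial(n, p0): the chance of results at least
    this extreme if the AI has no per-duel advantage (null hypothesis)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

alpha = 0.05                        # level of significance = accepted Type-I error rate
p = binom_p_one_sided(n=36, k=27)   # hypothetical: AI wins 27 of 36 even duels
print(f"p = {p:.4f}")
if p < alpha:
    print("reject H0: data this extreme is unlikely if the AI has no advantage")
```

This is just the textbook reject-if-p-below-alpha rule; the actual test Warren ran was a chi-square on a 2x2 table, not a binomial tail.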

However, as has been noted, we really need to ask Warren what, exactly, he did in his experiment and what level of control he was exerting. The experiment seems as if it could be sound. There is replication in the experiment, and he controlled for any other confounding variables by switching sides from Russian to German. But if tanks could see through the lanes of tall pines, and if he was moving his tanks around, then this experiment would certainly lose some validity. Some of the above, controlled experiments would provide some helpful insight.

I, for one, am a bit skeptical, as I do not believe that BTS would mislead us as to any advantages the AI may have. Very interested to hear from Warren.

-Jim


Excellent.

As score keeper and historian, here is how I see the argument progressing. (WP = Warren Peace, the protagonist. )

Phase I -- Our Hero Sallies Forth

1. WP challenges the view that the AI does not take advantage of knowledge that a reasonably intelligent and informed human would not have, and the AI does not take advantage of procedures or actions not available to that human.

2. WP says, "If this view is correct, then humans and the AI should be about equally successful in a series of contests in which they each use the same weapons. Half the time the human uses A against the AI's B; half the time it's the AI using A against the human's B."

3. WP sets up an experiment to test that. He finds results very unlikely to result from equally capable foes. The AI's wins are significantly higher than expected. Clear statistical significance.

Phase II -- The Detractors Are Drawn From the Forest, As Flies To Carrion

1. WP is measuring the wrong thing. "I don't believe it's very sensible to measure the number of kills. It makes more sense to measure chance-to-hit-%." -- Sgt. Emren

2. WP's n is too small. -- Various posters.

3. WP may have set up his experiment in some peculiar fashion that forced results that are not generalizable. Give us the necessary information to permit replication. "Ok. So give us the data so we can reproduce your results. ~730M. Open terrain? Lanes of tall pines? I want to see if I can replicate your results" -- Sunflower Farm Boy

3b. (A specific instantiation of the general attack in [3].) WP's results may not have been due to the AI "cheating", but to uncontrolled confounding variables such as the AI's tanks happening to be better positioned. "In any case, as others have pointed out, the real sources of error creeping into your experiment are due to the other variables not being set as constants (the only variable changing should ideally be the one you're testing). In short, your assumptions are flawed. For example, were the starting positions of the AFVs identical in each run?" -- Mannheim Tanker

3c. "Is it possible that Global Morale is affecting the outcome? Is it more accurate to use 72 single 1-on-1 tank duels, where Global Morale is not an issue, or should you do 6 12-on-12 battles, where the losing side's performance should decrease (snowball, really) as they start to fall behind in the duels?" -- Silvio Manuel

4. WP's outcomes were not independent, and thus any statistical test assuming independent outcomes would be invalid. His statistical results are thus invalid. " 'Played until one side or the other had all vehicles destroyed.' .....This can only be accomplished if all vehicles have LOS to all other vehicles." -- Ace Pilot "-- it's possible, if he allowed vehicles to move from lane to lane after destroying their opponents in their own lanes. Of course, this completely annihilates the "independent trial" assumption " -- Mud

---------------------

Other posters have offered variations and elaborations on the above or have offered attacks that are simply spurious. "I believe you're confusing level of significance (commonly accepted as 0.05) with the probability of a Type I error (p-value). The commonly accepted value for significance in the p-value is 0.001 IIRC" --Mannheim Tanker Mr. Tanker needs to stick to his core competencies, from which he often presents us with enjoyable and informative gems. p values are exactly the probability of Type I errors and the calcified standard is 0.05, with 0.01 being the only alternative standard.

I'm waiting for the next Phase. As always, try to get your reply turns in as promptly as possible. In the AAR of the next Phase, I'm hoping to report that someone has weighed in with a clarification about observed significance level and n. It is clear to me that those arguing the n is too small are very vulnerable and exposed. Expect an assault. I'm also looking for some defense of the original methodology. Were "lanes" fully isolated? And I'm hoping one or more of WP's supporters will clarify the emerging new theory, saying exactly what advantage the AI has that the human does not. (So far, the new theory seems mostly based on AI being able to compute optimal first target selection in a way that humans cannot. Elaborate on this if you believe there are other ways the AI may be taking advantage.)

-------------------

More replies have come in even as I was compiling the AAR above. I'm especially excited by Lt. Tankersley's attempt at replication, with variations (improvements? hmmm?) in methodology, a much clearer Methods section, a clear Results section, different results, and an interesting anecdotal observation that should alter the course of this battle.

Jolly good!

[ October 30, 2002, 02:28 PM: Message edited by: Lt. Kije ]


Maybe you guys are forgetting one small thing?

Chance.

Just because a mathematical equation says something should happen doesn't mean that it will happen.

**edit**

Afterthought: If you wanted to test simple AI vs human wouldn't it be easier to have T-34s take on T-34s, or Stugs take on Stugs etc... (Ok so maybe the game doesn't allow you to do that but I guess it's an idea).

[ October 30, 2002, 02:24 PM: Message edited by: BulletRat ]


Originally posted by L.Tankersley:

One very interesting observation: I did not gather statistics on this (yet), but my strong impression is that the AI side fired first in virtually all engagements. That is, in the "human as allied" trials, the StuGs got the first shot, and in the "human as axis" trials, the T-34s got the first shot. Given the relatively high hit probability and lethality at the test ranges, a discrepancy here could very well have a significant effect. I suggest that to gather data on this, you view the firing range from above using view 9 and watch the smoke plumes. Again, my impression is that most or all of the AI tanks fired before any "human" tanks fired, regardless of nationality.

Interesting. Why has nobody run a test yet with captured equipment, so we get a "Ryu vs Ryu" fight??

Just thought of this, but, DUH!

Edited so everyone knows that I didn't read the post above mine before writing this.

[ October 30, 2002, 02:28 PM: Message edited by: Lumbergh ]


OK, I see I have stirred up the pot.

I encourage all of you to set up similar tests and see if you get similar results. I am a scientist by training (PhD in biochemistry) and I run a molecular genetics research program. I know something of scientific methods and I am also quite aware of the importance of independent validation to science. I do use statistics regularly, but I am not a statistician. For my P value I simply plugged my numbers into a 2x2 table and the statistics program I use (Statistica) gave me the value using a McNemar chi-square.
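Warren's McNemar figure can be sketched in a few lines. His actual 2x2 counts aren't posted, so the discordant-pair counts below are hypothetical; McNemar's statistic (here without continuity correction) uses only the two discordant cells of a paired table:

```python
from math import erfc, sqrt

def mcnemar_chi2(b, c):
    """McNemar's chi-square (no continuity correction) from the two
    discordant cells of a paired 2x2 table: pairings won only as side A (b)
    vs. won only as side B (c). With df=1, the p-value is the normal tail."""
    chi2 = (b - c) ** 2 / (b + c)
    p = erfc(sqrt(chi2 / 2))  # survival function of chi-square with df=1
    return chi2, p

# Hypothetical discordant counts, for illustration only:
chi2, p = mcnemar_chi2(b=3, c=17)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Statistica may apply a continuity correction, so don't expect this sketch to reproduce Warren's p = .003 exactly.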

I will respond to some of the comments and questions.

1) Each battle is actually 6 identical sub-battles. (This is what the trees are for.) There is no variability in the starting positions of the tanks, although they are free to move. (However, this rarely happens before they are destroyed.) The only user input is the initial targeting line to the only visible enemy tank. I have tried not targeting and this appeared to make no difference.

Most of the battles end within the first minute, with an occasional duel lasting longer. Thus there is a total of 36 identical battles, so I think this is sufficient for chi-square analysis.

Many people keep mentioning other uncontrolled variables. Please tell me what these might be?

Several people have mentioned that I should do the experiment in hot-seat mode and see what happens. I too thought of this, but I had to go to work! I will try and get to it tonight, unless someone else beats me to it!

Warren


Originally posted by Lt. Kije:

The whole process of Normal Science can now go forward. Warren Peace's informed detractors will first attempt to find flaws in the experimental design. They will second argue for alternative explanations of his data. Third, they will attempt to modify the theory his data threatens so that it can accommodate the new findings. Finally, when all of this fails, they will adopt the stance of, "Yes. Of course Warren Peace's experimental results are correct. Everyone knows that. Only a fool such as yourself would even raise this as an issue. Don't waste my time." His uninformed detractors will engage in ad hominem attacks on Warren Peace and anyone who seems to support him.

-- Lt. Kije

"Finally! Something I know something about!"

It sounds a little like you are saying that the burden of proof is not on the creator of the new hypothesis, but on the rest of the world to disprove it. This means I can come up with any cockamamie idea and the world will assume it's true. I say Santa Claus is real, prove me wrong... I saw Jesus in a tortilla, prove me wrong...

The burden of proof is on he who asserts the existence of a thing, not on the world to disprove his assertion. Few laymen here can understand the original argument, thus it can't get much mileage. Is the objective to impress us with knowledge of arcane statistics or to prove a claim that the 'AI cheats'? I am perhaps impressed by your knowledge of P-square whatchamacallit, but since I don't know what you are talking about, the argument is not convincing.

Ren


Originally posted by Warren Peace:

Many people keep mentioning other uncontrolled variables. Please tell me what these might be?

Warren

Just because we can't think of them off of the top of our head doesn't mean they don't exist! This is *not* biology, unfortunately.

That said, I'll take your bait. Some uncontrolled variables might fall under the category of AI vs player subroutine processing. Maybe somewhere under the hood there is a series of subroutines that runs faster for the AI calculations than for the player's side. Pure speculation, but it could have a large effect!

The point is, we are not sure what uncontrolled variables are affecting our work, so let's be scientists and use something a wee bit more robust than a chi-squared test. Thassall.

[ October 30, 2002, 02:38 PM: Message edited by: Lumbergh ]


How fascinating.

One could argue that a Bayesian approach would be more appropriate (given the prior probability of the AI cheating is really quite low since we haven't caught Charles lying before), but then we'd have a religious stat war going on.

I'm very suspicious that your trials are not independent of each other. Since this is a critical assumption in calculating statistical probabilities (for the approaches mentioned to date), I'd be concerned. One culprit could be global morale; it certainly affects AFVs, but it's not clear if it's a continuous or step function.
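The Bayesian alternative mentioned above can be sketched with a conjugate Beta-Binomial update. The duel counts are hypothetical, and the prior concentrated near 0.5 stands in for our prior trust that the AI has no edge:

```python
import random

random.seed(42)

# Beta(20, 20) prior: fairly confident the per-duel AI win chance is near 0.5
prior_a, prior_b = 20, 20
ai_wins, ai_losses = 27, 9   # hypothetical duel outcomes

# Conjugate update: posterior is Beta(prior_a + wins, prior_b + losses)
post_a, post_b = prior_a + ai_wins, prior_b + ai_losses

# Monte Carlo estimate of P(per-duel win chance > 0.5 | data)
draws = [random.betavariate(post_a, post_b) for _ in range(100_000)]
prob_advantage = sum(d > 0.5 for d in draws) / len(draws)
print(f"P(AI has an edge | data) ~= {prob_advantage:.3f}")
```

A stronger prior (say Beta(200, 200)) encodes more trust in Charles and needs correspondingly more lopsided data to move, which is exactly the "religious stat war" the poster anticipates.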

The AI "shooting first" phenomenon warrants a closer look.


I've collected some data as well. I set up 44 firing lanes, each 20 meters wide. Lanes were separated by tall pines so each test was independent. At the end of each lane I put a T-34/85 model 1944 and a StuG IIIG (middle). The tanks were 752 meters apart and were in rough, so they were immobilized and couldn't move. Further, each tank had only AP and HE rounds; I manually removed smoke and T rounds. Finally, I threw in a bunch of pillboxes for both sides such that nobody could see the pillboxes; this may reduce the effects of Global Morale. I issued no orders to my troops. I've put the scenario up online here. I encourage everyone to take a look, do your own tests with it, and post the results here so we have as much data as possible.

I ran 2 tests in each of the following situations:

Human = Allies. Axis lose 11, Allies lose 33 and Axis lose 12, Allies lose 35

Human = Axis. Axis lose 18, Allies lose 28 and Axis lose 17, Allies lose 30.

(that's a total of 88 duels playing each side, 176 trials total)

I also ran one test in each of the following situations:

Human = both (hotseat). Allies process turns. Axis lose 14, Allies lose 33

Human = both (hotseat). Axis process turns. Axis lose 10, Allies lose 35

So, I'll calculate Chi Square in two different ways.

First I'll calculate the Chi Square using the average of all Human vs. AI trials as the Expected value. So:

Allied expected losses = 63

Axis expected losses = 29

When AI is playing Axis, observed Axis losses = 23

When AI is playing Axis, observed Allied losses = 68

When AI is playing Allies, observed Axis losses = 35

When AI is playing Allies, observed Allied losses = 58

So Chi Square of the AI's performance difference from the expected performance = 1.638

And Chi Square of the Human's performance difference from the expected performance = 1.638

So p = .201 in both cases. In other words, *not* significant.

Now I will make similar calculations, but for the expected value I will use the average of both trials that were played hotseat. This should, theoretically, eliminate all variables except the AI and Human issue. I'm playing the exact same scenario with the exact same orders (none) so the only variable is AI control of troops.

Allied expected losses = 68

Axis expected losses = 24

When AI is playing Axis, observed Axis losses = 23

When AI is playing Axis, observed Allied losses = 68

When AI is playing Allies, observed Axis losses = 35

When AI is playing Allies, observed Allied losses = 58

So Chi Square of the AI's performance difference from the expected performance = 5.041

And Chi Square of the Human's performance difference from the expected performance = 1.512

So p = .219 in the Human's case. In other words, a human vs. the AI seems to take the same losses as a human vs. another human. But p = .025 in the AI's case. In other words, the AI seems to have an advantage over a human, as opposed to a human vs. human situation.
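The arithmetic above can be replayed directly. Reading each side's score as the losses it inflicted, compared against the hotseat expectation, reproduces the 5.041 and 1.512 figures; with one degree of freedom the chi-square p-value reduces to a normal tail (erfc). This is a sketch of the poster's method, not a claim that it is the textbook goodness-of-fit setup (the observed and expected totals don't match, as they would in a standard Pearson test):

```python
from math import erfc, sqrt

def chi2_gof(observed, expected):
    """Pearson-style statistic sum((O-E)^2/E) and its df=1 p-value."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, erfc(sqrt(chi2 / 2))

# Losses each side inflicted, vs. the hotseat (human-vs-human) expectation:
expected = [68, 24]          # expected Allied, Axis losses
ai_inflicted = [68, 35]      # AI as Axis killed 68 Allies; AI as Allies killed 35 Axis
human_inflicted = [58, 23]   # human as Axis killed 58 Allies; as Allies killed 23 Axis

chi2_ai, p_ai = chi2_gof(ai_inflicted, expected)     # ~5.04, p ~ .025
chi2_h, p_h = chi2_gof(human_inflicted, expected)    # ~1.51, p ~ .219
print(f"AI: chi2={chi2_ai:.3f} p={p_ai:.3f}  Human: chi2={chi2_h:.3f} p={p_h:.3f}")
```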

Comments to others

-----------------

Mannheim Tanker -- You said I've confused significance with the p value. I don't believe this is the case. According to the website I reference above (http://faculty.vassar.edu/lowry/webtext.html) they are the same. Can you explain what you believe the difference to be and how one converts from one value to the other?

L.Tankersley -- I get a p value of .416 with your data. This is pretty close to what you got, but I'm curious what the difference is.

Finally, let me encourage everyone to run a set of tests with the scenario I link to. The more data we have, the more accurate our conclusions will be. It only takes 15 minutes smile.gif

--Chris


Limberg,

What statistic do you suggest? Chi square seems an appropriate test in this situation. However, if you would like to use some other method I'm all ears.

Your comment on variables is interesting. If it is a sub-routine problem that lets the AI shoot quicker or hit more often, I'd think that qualifies as an AI cheat, don't you think?

Warren

PS Marlow mentioned that this was observed in CMBO. I'd like to find that thread.

Warren


Originally posted by Warren Peace:

Limberg,

What statistic do you suggest? Chi square seems an appropriate test in this situation. However, if you would like to use some other method I'm all ears.

Your comment on variables is interesting. If it is a sub-routine problem that lets the AI shoot quicker or hit more often, I'd think that qualifies as an AI cheat, don't you think?

Warren

PS Marlow mentioned that this was observed in CMBO. I'd like to find that thread.

Warren

No, I said that I saw this in CMBO. I never mentioned it on the forum. I was taking a look at Shermans vs. Mark IVs, and I noticed that the outcome was dependent on which side was AI. I also noticed that this was likely due to the AI usually firing first. Manually targeting gave better results for the human. I don't have the actual numbers any longer.

Originally posted by Warren Peace:

Many people keep mentioning other uncontrolled variables. Please tell me what these might be?

Warren

I have only an elementary understanding of what different levels of AI operate in the game, so take this with a grain of salt. I know there is a TacAI that “helps” the human player AFTER he has plotted his move. I’m assuming this TacAI also helps the computer after it has plotted its move. Let's assume these act equally for both sides. I also assume the TacAI is different from the AI that plots the computer’s move (I’ll call it the PlotAI). Since you aren’t giving any orders to the human side, could this be giving the computer a slight edge, since the PlotAI is going up against a “do nothing” approach? I realize there isn’t a whole lot going on, especially since you said most fights are over by the end of the first minute. However, could the computer be rotating or even moving its tanks to gain a slight advantage, or for those fights that go over a minute, unbuttoning while the human side stays buttoned? Other events, possibly?

Just speculating,

Ace


What fun...

With all due respect, Lt. Kije, not everyone waded in here looking to shoot WP's hypothesis, or testing, down in flames.

The first clue was his use of the word 'rigorously'. That he would speak of rigor at all told me that he is experienced in the area. Expert? Who knows? Am I? Not any more, but 15 years ago, maybe.

By his own admission, the test could have been more rigorous. What has been proposed in a number of posts are possible solutions to increasing the rigor of his test. Careful elimination of potential confounding variables is what sadistics is all about. Will variable X corrupt your results? If you aren't sure, do your damnedest to control it.

Possible confounding factors include *nods* global morale. Another possibility, although I'm not sure how it would affect the combat AI, might be borg-spotting. Single tank to tank battles would probably be the best design, with no borg-caused distractions. And anything else anyone else can think of.

If you can control it down to just one variable, then you are testing just one variable. If you miss just one other influencing variable, you effectively have three influences on your outcome, the one you know, the one you don't, and the interaction between the one you know and the one you don't. Interactions are things of evil, and must be eliminated or accounted for.

So, stop keeping score, get off the sideline, and try to improve the design.

Edited fer spellin'

[ October 30, 2002, 03:15 PM: Message edited by: Herr Oberst ]


Originally posted by Warren Peace:

Limberg,

What statistic do you suggest? Chi square seems an appropriate test in this situation. However, if you would like to use some other method I'm all ears.

Your comment on variables is interesting. If it is a sub-routine problem that lets the AI shoot quicker or hit more often, I'd think that qualifies as an AI cheat, don't you think?

Warren

You've got me on the test-statistic--I'm just sitting here with my econometrics books and throwing stuff out there. I guess I was thinking that we could look at this in a regression analysis framework, but that would require a number of instrumental variables to account for unobservables.

As for quicker AI subroutines, as long as it was not intentionally programmed in there by BTS, then it does not constitute cheating! From Webster's, one of their definitions of cheating is "to violate rules deliberately, as in a game". You could see how it would be difficult to test the code for the unintentional possibility of this...especially when there is only one guy doing the programming.


Hold on, before everyone goes accusing the AI, pause for a second smile.gif

I also started running tests. One tank vs one tank and recorded the results (33 trials so far). Since it was 1 v 1, I paid attention to more details. 742 meters, Pz VA (early) vs Pz VA (early, captured). Neither had T rounds. I recorded first shot stats, not first kill. Both had a 50% chance to hit and a fair chance to kill. No FOW, regular crews, 1943. Immobilized on rough.

The first 20 trials I ran as the Soviet side, AI on the other. The second (13 so far) trials as the Germans, AI on the other.

The AI shot first, every time. Why? Because I did not choose a target BUT the StratAI did during the computer "thinking" phase. So, while I relied upon the TacAI to make my targeting choices, the computer would pick its target.

So this would likely explain why we see the AI "get an advantage" in tests where A) we don't choose the target and B) we're looking at first shot statistics. With guns that can kill on the first shot in many cases, or cause morale effects that delay the reply shot, then we'll see an inordinate amount of kills by vehicles on the AI side, even when testing total vehicles destroyed. By NOT choosing a target, we give the AI an advantage. So we have to eliminate that from our tests.

So, the AI isn't cheating; it does pick the target during the "thinking" phase, however. So, the way to eliminate this from the testing is to either have only one tank firing, AI "controlled" for one set of trials and human for a second set, or have one v one where you set the target during your plotting phase. Or use a hot seat game and leave it all up to the TacAI.

I won't give my results because until I do the test to not allow the AI's StratAI to muddy the waters, it doesn't give clean results.

However, I did notice something interesting. When I allowed the Soviet TacAI to choose the target (as I did in all the tests), it took them about 5 seconds to acquire in clear terrain over 742 meters. The German side, however, when I let the TacAI target, took about 10 seconds to acquire.

Those were consistent. The Soviet's TacAI would acquire the German tank before the German's first shot. However, the German TacAI would not acquire the Soviet tank until after the first shot by the Soviets and sometimes not until after the second shot.

[Edit]

I intend to follow my own advice and redo the test to eliminate the "advantage" the AI gets from a human "do nothing" approach during turn calc. My hypothesis is that the AI is not cheating at all here.

[ October 30, 2002, 03:19 PM: Message edited by: Cameroon ]


Originally posted by Maastrictian:

Now I will make similar calculations, but for the expected value I will use the average of both trials that were played hotseat. This should, theoretically, eliminate all variables except the AI and Human issue. I'm playing the exact same scenario with the exact same orders (none) so the only variable is AI control of troops.

Allied expected losses = 68

Axis expected losses = 24

When AI is playing Axis, observed Axis losses = 23

When AI is playing Axis, observed Allied losses = 68

When AI is playing Allies, observed Axis losses = 35

When AI is playing Allies, observed Allied losses = 58

The AI seems to do a bit better as the Allies.

Lt. Kije, you forgot one item for your score:

Lt. Kije shouts out misinformed, unhelpful advice from the peanut gallery that in no way contributes to the discussion. Yeah, I think I have it right. FWIW, you're not the only one on the forum that has taught graduate students, and neither am I. So if you're going to continue sharpshooting people, you might at least do so in a constructive manner, since your credentials alone don't earn you anything around here. Best to ya.


More data, generated while I wait for my meeting to start...

I decided to do a quick test of the "who shoots first" question. I modified the Axis forces to use captured T-34/85 M44 tanks (same as the Russians were using).

Sidebar: the point value for the captured T-34s is 169 (this is June of '44, south region). The point value for the Russian T-34s is 153. Clearly the point value calculation takes nationality into account in some way. Some of the Russian tanks did get a couple of Tungsten rounds; maybe that chance is factored into the value in some way.

Anyway, things were set up the same as before. (I neglected to mention earlier that I set up the games with Fog of War OFF and with computer player set up restricted to scenario defaults.) Hit chance for both sides was 44% according to the LOS tool.

I ran 4 trials, two with "human as Axis" and two with "human as allied." In the first trial of each pair, I let the TacAI handle targetting orders. In the second trial, I manually issued a targetting order for each human-controlled tank before hitting "go."

Trial 1 (human as Axis, no manual orders)

At T=0, all AI tanks were immediately targetting human tanks (red targetting lines). At T=2, 3 human tanks were targetting. At T=4, 7 human tanks were targetting. At T=5, 3 of the AI tanks had fired. At T=6, all ten AI tanks had fired, and only one human-controlled tank had fired (all were targetting).

Trial 2 (human as Axis, manual orders)

From the start, all tanks were targetting. At T=4, 4 of the human-controlled tanks had fired. At T=5, all ten human tanks had fired, as had 4 AI tanks. At T=6, all tanks had fired.

Trial 3 (human as Allied, no manual orders)

At T=2, 3 human tanks were targetting. At T=4, 9 human tanks were targetting, and 4 AI tanks had fired. At T=5, all ten AI tanks had fired. Human controlled tanks didn't begin firing until T=7, and some fired as late as T=10 (a few were knocked out before firing).

Trial 4 (human as Allied, manual orders)

At T=4, five AI tanks fired. At T=5, all ten AI tanks had fired, and 2 human tanks had fired. At T=7, all human tanks had fired.

Analysis

Hmmmm. It seems pretty clear that issuing manual targetting orders does help get shots off faster. How this compares to the computer player's performance is less obvious -- Trial 1 vs Trial 2 makes it appear that the computer player has the edge when you leave targetting to the TacAI, but that if you manually issue targetting orders, the advantage is neutralized or even shifts to the human player. But Trial 3 vs Trial 4 is less compelling. There may be an Axis vs Allied confound, an adjustment for use of captured equipment, or something else in the mix.


Originally posted by Warren Peace:

Maastrician:

Your results seem consistent with my initial observations. But you have extended them with the all important hot seat experiment.

Seems to support an AI advantage.

Warren

Actually, my results do not support yours, as I find no significant difference when using the average of the AI vs. human battles as the expected result. The only significant finding I made was looking at the AI's advantage over a human vs. human battle.

I'm very interested to see others test using the same battle and posting their results. If everyone who has posted to this thread runs the battle once as the Allies, once as the Axis, and once as hotseat, we will have destroyed more tanks than the Germans had at Kursk smile.gif And with a few thousand trials we will certainly be able to push the p values into the thousandths place... if that is where they want to go.

I also want to make clear to the general board here what I (at least) mean by "AI cheating". There is no reason to assume, nor am I assuming, that BTS (BFC, whatever) has made any conscious decisions about the performance of the AI vs. the performance of a human player. The differences that are being seen (or are not being seen, in some cases) could easily be the result of minor programming error. I pursue this not because I want to show that BTS is a liar or something silly like that, but because I want to improve the game.

--Chris


This topic is now closed to further replies.
