
AI cheats! (with real data)



Originally posted by Sgt. Emren:

You can stop testing. Your results already prove that the AI does not have an advantage (with your particular setup). You don't need to try it out more than 30-50 times! smile.gif

He has proved nothing of the sort. What has been proved is that the difference is not sufficient to be significant in 200 test firings.


Ok, more data.

This is using Cameroon's scenario he posted a while back (page 3 or 4??). This is testing first hit when the German side is controlled by the AI or a human. No FOW is used, and only the Germans have any ammo, and it's all AP or HE.

Run 1: AI hits 5, Human hits 7
Run 2: AI hits 4, Human hits 5
Run 3: AI hits 7, Human hits 6
Run 4: AI hits 5, Human hits 8
Run 5: AI hits 4, Human hits 5
Run 6: AI hits 7, Human hits 5
Run 7: AI hits 4, Human hits 7
Run 8: AI hits 5, Human hits 4
Run 9: AI hits 2, Human hits 2
Run 10: AI hits 5, Human hits 4
Run 11: AI hits 8, Human hits 5
Run 12: AI hits 3, Human hits 6
Run 13: AI hits 6, Human hits 8
Run 14: AI hits 6, Human hits 6
Run 15: AI hits 5, Human hits 6
Run 16: AI hits 6, Human hits 3
Run 17: AI hits 5, Human hits 5
Run 18: AI hits 4, Human hits 7
Run 19: AI hits 3, Human hits 8
Run 20: AI hits 4, Human hits 5

So that's 200 trials on each side. Total results are 98 hits for the AI and 112 hits for the Human. Combining this with Cameroon's data (100 trials on each side) gives us:

Expected = 157

Observed AI hits = 153

Observed Human hits = 161

Chi Square = 0.2038

P = 0.6516
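For anyone who wants to check the arithmetic, here is a minimal sketch (assuming Python with scipy available; it is not part of anyone's test setup) of the goodness-of-fit calculation behind those numbers:

```python
# Chi-square goodness of fit on the combined data: 153 AI first hits and
# 161 human first hits, scored against the pooled expectation of 157 each.
from scipy.stats import chisquare

stat, p = chisquare(f_obs=[153, 161], f_exp=[157, 157])
print(stat, p)   # ~0.204 and ~0.65, matching the figures above
```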

So this shows no significant results. More importantly, this shows the Human edging out the AI, which is something I don't believe we've seen before, and it sets my mind at ease to some extent.

Later today or tonight I'll combine the data Warren got with my scenario with the data I got with it before and see what the combined results give us.

--Chris


Lovely!

To summarize, again.

We are seeing a heartbreakingly beautiful unfolding of Normal Science. At the moment we are witnessing the search for the Critical Experiment. Including the important business of disputing each other's methods. smile.gif And the always entertaining activity of watching the two camps stare at the same data set and see different things.

Here are the significant developments.

1. Maastrician's proposed world-wide test bed is not catching on. Various parties are cobbling together their own test methods. This will provide immense entertainment later if their results cannot be matched up, squared off against one another. Keep an eye out for this big fun!

2. Warren Peace reports a very important finding. Running Maastrician's scenario (hooray for replication!) he sees no AI advantage, but re-running his own he continues to see an AI advantage. The difference between the scenarios? Whether the tanks are allowed to move or not! (In Maastrician's scenario they cannot; in Warren Peace's they can.)

3. L. Tankersley repeats and reinforces the point of view that we should get clear on our dependent variable. Are we measuring hit, kill, or who fires first? He is addressing an important tension here, the tension between everybody agreeing on a common methodology, including measure, vs. different measures (hit, kill, first) having their own individual advantages.

(a) kill is the one of real-world importance, but hit is a simpler, more direct measure of AI proficiency; it has less noise in it (e.g. kill has some random luck variation after hit has been achieved). First hit is even more primitive, more clean, more direct.

(b) More primitive measures are going to be more useful in theory testing, saying exactly what it is that the AI does better. They are like fine scalpels whereas kill percentage is like a lump of Semtex.

4. There continues to be wasted heat on issues of sample size. Some folks hold strong feelings about this but are not very well tutored in this arcane lore. Here are two simple things to hold on to. They are true and every statistician knows them (as do many graduate students).

(a) a significant effect with a small n is probably tapping a real effect

(b) finding no significant effect with a small n is not very informative

It is a sad consequence of this that those whose theoretical positions look to a finding of 'no difference' have a much higher bar to cross when they collect data. They have to say, "If the effect existed it would have shown itself after n trials," and then justify that. In academic science, there is a standardized procedure for coming up with that n. It is called 'estimating the power of the test', and we are not getting into it here. (Yes, you are welcome.)

Here's how it can play out. Let's say Treeburst runs his test 200 times, finds AI hits first 41% of the time, human hits first 36%. Sgt. Emren then concludes 'no difference'; someone runs a chi-square and sees the significance level is p < 0.15. Not the 0.05 that science by convention requires. What if Treeburst now runs another 1000 trials and finds the same 41% vs 36%? Now p < 0.01. So is the effect there or is it not?

The answer is the following. Yes it is there, but when you have to run very large n studies to demonstrate an effect, the effect must be pretty small. A tension arises between statistical vs. practical significance. If you have to run 10,000 trials to demonstrate an AI advantage, then it isn't much of an AI advantage.
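To put a rough number on 'estimating the power of the test': here is a sketch (assuming Python with scipy; the 0.05 and 0.80 cutoffs are just the usual conventions, and the 41%/36% figures are the hypothetical ones from the example above) of how many first shots per side would be needed to reliably detect that size of difference.

```python
# Standard normal-approximation sample-size formula for comparing two proportions.
from math import sqrt, ceil
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate trials per side needed to detect a true p1 vs p2 difference."""
    z_a = norm.ppf(1 - alpha / 2)   # two-sided significance cutoff
    z_b = norm.ppf(power)           # power cutoff
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(n_per_group(0.41, 0.36))      # on the order of 1,500 first shots per side
```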

To repeat the main point. You cannot fail to find an effect then claim victory. Observing 'no difference' is a tough road and requires more justification than observing 'statistically significant difference'.

This is part of the reason why scientific literature is largely reports of significant differences. Papers reporting null effects are much harder to get published, although very important if they have met the tougher standards.

Looking forward, I see Warren Peace exploring why his method and Maastrician's lead to different outcomes. I predict very productive theory development resulting from this. I see Treeburst making a major contribution by virtue of careful and highly public methodology followed by relentless, daisy cutter, burn the entire nation to the ground scale of data collection. Maastrician and Warren Peace may prove to be mere 105s compared to Treeburst 155. Let's see.

I think you guys are doing great work! From the sidelines, your scorekeeper and historian salutes you!

-- Lt. Kije

Scorekeeper and Historian

[ October 31, 2002, 12:35 PM: Message edited by: Lt. Kije ]


My 4.5% difference in HIT percentage is not enough to be significant with only 200 trials. It just can't be. Even a margin of error of 2% or so would be too much. It's only logical to me that, the more times I test, the closer I will get to the TRUE percentages. If I flip a coin 10 times, odds are I will come up heads 50% of the time; BUT, this may very well not prove out until I flip the coin several hundred times. The more I flip, the closer my results will be to 50%. Empirical data is the key, especially if you don't know statistics. smile.gif

Treeburst155 out.


Guest Sgt. Emren
(a) a significant effect with a small n is probably tapping a real effect

(b) finding no significant effect with a small n is not very informative

I based my conclusion admittedly on a gut feeling, but it turned out that I was right (at least if we can agree that p=0.15 is not significant). I also agree with the above. Lt. Kije, do you consider n=200 to be small?

4.5% difference in HIT percentage is not enough to be significant with only 200 trials
With the results of your test, you may reject the notion that there is a difference and be 85% sure that you are correct.

At any rate, IF the AI has a 4.5% bonus to hit on the first shot, then I think I can live with that. With regards to sample size, a practical consideration is always that of cost. To the question "How big should my sample size be?", you may reply "As big as you can afford". So you must ask yourself this question: "How much time do I really want to spend running this boring scenario and counting first hits?"

[ October 31, 2002, 01:03 PM: Message edited by: Sgt. Emren ]


Treeburst, you are exactly correct. Larger and larger samples come closer and closer to revealing the true effect. If you measure 100 of the trees in a 10,000 tree forest, you have a good estimate of average tree height. If you measure all 10,000, you no longer have an estimate; you know the true value of 'average tree height'. (Assuming accurate measurement, of course.) Larger n = closer estimate of the true population parameter. This is The Law Of Large Numbers. (First stated by Jacob Bernoulli, published posthumously in 1713 in a book that many people treat as the beginnings of a rigorous treatment of probability and thus of statistics. When I taught stat at Berkeley, I found the students really enjoyed the subject and learned it both clearly and durably if they developed these theories out of their own intuitions, as Treeburst has done here. Just giving students a bunch of formulas then punishing them for not knowing them or applying them correctly is the more typical but less effective method. Learning under this method, I made nothing but C's and D's when I was forced to take stat as a psychology student.)
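A small simulation (assuming Python with numpy; the 38% "true" hit chance is purely illustrative) of that same Law of Large Numbers point: the running hit rate wanders around early on and settles toward the underlying probability as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.38                                    # illustrative "true" first-hit chance
shots = rng.random(10_000) < true_p              # simulated first-shot outcomes
running_rate = np.cumsum(shots) / np.arange(1, shots.size + 1)

for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} shots: observed rate {running_rate[n - 1]:.3f}")
```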

Sgt. Emren, 200 trials showing a 41% vs. 36% difference does not convince me that there is no true difference. It may be that there is a true difference of 41% vs 36% and 1000 trials would yield significance. It may also be that the true parameters are 40% and 40% and there is no real difference, in which case 1000 trials would probably bring us closer to those 40/40 numbers. Or it could be that the true difference is 50% vs 30% and we just need more trials to home in on those. I dunno. And I know that I dunno.

-- Lt. Kije

Scorekeeper and Historian


Originally posted by Maastrictian:

The weird thing about all these tests is that we always see some small advantage to the AI. I have no good explanation for this, nor have we definitively seen this to a significant P value, but that is why I'm still conducting tests and am interested in seeing others' tests. It's just so odd.

Yeah, I agree. The consistent (but non-significant) bias is a bit troubling. Hopefully Treeburst's data collection extravaganza will help us figure out whether there is a real effect, or not.

Wow, go out of town for a couple of days and lookie what happens smile.gif

There is no need to do any more tests. The AI has zero advantage over the Human player. As has been our stated position since before any of you even heard of us, we do not make AIs that cheat. We also don't lie to our customers. So if we say it doesn't cheat, and we also don't lie, one doesn't need a degree in mathematics to understand what that means smile.gif

Steve


I hope this posts. Third try.

In order to test the idea that immobilization may play a role, I modified Chris's superb range by replacing the broken terrain with open. I did his range 3x as both Allies and Axis. Aggregate data is as follows.

Human as Allies

90 T34 vs. 51 Stugs (Kill ratio is 1.76:1)

Human as Germans

92 T34 vs. 47 Stugs (Kill ratio is 1.96:1)

I won't bother with the statistics; they look identical.

I have played my small scenario (6 bank range) 9 times from each side (not including initial 6 in first post) and have found the following

Human as Allies

44 T34 vs. 11 Stugs (Kill ratio is 4:1!)

Human as Germans

35 T34 vs. 21 Stugs (Kill ratio is 1.67:1)

Chi2 is 4.15; P<0.05
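A sketch (assuming Python with scipy, and no continuity correction) of the 2x2 comparison that reproduces that Chi2 of 4.15:

```python
from scipy.stats import chi2_contingency

table = [[44, 11],   # human as Allies:  44 T34 / 11 StuGs, as tallied above
         [35, 21]]   # human as Germans: 35 T34 / 21 StuGs
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, p)       # ~4.15 and p just under 0.05, matching the post
```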

My best guess is that the weirdness I observe may be related to some sort of global morale effect; smaller battles would amplify such an effect because each kill is a higher percentage of total morale. Perhaps global morale is not affecting the AI as it should? Just a thought.

Warren


Steve:

No one (including me) thinks the AI cheats. However, there is concern that the AI may have some sort of advantage in certain situations. This advantage appears magnified in small battles. Are you sure this is not possible? Not a cheat but more a possible undiscovered programming oversight.

Warren


Yeah, Steve, we're just checking for a possible oddity somewhere deep in that beautiful program. Nobody is doubting the integrity of BFC. It's all in fun. If we come up with results that interest Charles, then we're helping. :D If not, we're having fun anyway.

The AI, after 600 shots, has a first round hit percentage of 38.0%

I, after 600 shots, have a first round hit percentage of 32.67%

The AI's lead has been reduced to 5.33%

With each group of 200 shots it becomes more and more unlikely that the percentages will change significantly, but I'm still not convinced they are accurate enough.

Treeburst155 out.


Warren Peace,

This advantage appears magnified in small battles. Are you sure this is not possible? Not a cheat but more a possible undiscovered programming oversight.
I don't see how it can be possible. The way the code is set up there is no difference. The gunnery and ballistics calculations don't know who is controlling the forces they are calculating. One part of the code asks another part "this tank is shooting at that tank, what happens?". The other part of the code states an outcome and that is that. There is no way that I can see how this system would have some sort of (unintentional) bias introduced. I'll ask Charles, but I think the chance of there being a problem here is just about zero. If there are some differences in the numbers you guys are seeing there must be another reason for it.

Steve
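To make the separation Steve describes concrete, here is a purely hypothetical sketch (not BFC's actual code, just an editorial illustration): the shot-resolution routine is handed the shooter, the target, and the conditions, and never learns who is controlling either tank, so it has nothing it could bias.

```python
from dataclasses import dataclass
import random

@dataclass
class Tank:
    gun_skill: float   # crew/gun quality, 0..1 (illustrative)
    exposure: float    # how exposed the tank is, 0..1 (illustrative)

def resolve_shot(shooter: Tank, target: Tank, range_m: float) -> bool:
    """One part of the code asks: this tank shoots at that tank -- what happens?"""
    hit_chance = shooter.gun_skill * target.exposure * max(0.1, 1.0 - range_m / 2000.0)
    return random.random() < hit_chance   # note: no 'controlled_by' argument anywhere
```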


Originally posted by Battlefront.com:

If there are some differences in the numbers you guys are seeing there must be another reason for it.

Steve

I think that is probably what is making this so much fun - what variables are accounting for these differences (if they are in fact differences and not just noise)? The fact that this game takes SO many things into account (things that I wouldn't have even thought of) makes it fun trying to uncover just what those factors are. It appears that one such factor is the difference between manual targeting and letting the AI target for you. It seems that manual targeting results in a slightly faster first shot. After thinking about it, it actually makes sense to me. I see it as the difference between the tank commander giving a "Fire at will" order (equal to AI targeting) and giving a "Fire at the tank at 12 o'clock" order (manual targeting). In the first case, the gunner must take a couple of seconds to scan the battlefield and select his target, whereas, in the second case, he wastes no time deciding where to shoot. If this is what BFC intended, I am once again amazed by the level of detail and consideration that went into this game. And if it was unintentional, perhaps the fact that it seems to make so much sense is a reflection of how getting so many details right results in other (unintended) details falling into place.

Either way, thanks for a great game. Back to the discussion.

Ace


Originally posted by Steve

There is no need to do any more tests. The AI has zero advantage over the Human player. As has been our stated position since before any of you even heard of us, we do not make AIs that cheat. We also don't lie to our customers. So if we say it doesn't cheat, and we also don't lie, one doesn't need a degree in mathematics to understand what that means
Key word "IF". IF you do lie, how would we know? tongue.gif

And yes,

Wow, go out of town for a couple of days and lookie what happens

Now you know we can't be trusted to behave while you are away. It's like having a couple thousand of your own children. Why didn't you leave a babysitter or somefink? :D

Here are Treeburst155's latest numbers (the 600-trial run):

Observed AI hits: 228

Observed Human hits: 196

Expected: 212

Chi Square: 2.4150943

P: 0.1202

So still not significant smile.gif . But getting closer. I'd be very curious to see your scenario, Treeburst155, so that maybe we can see what difference there is, if any, between yours and Cameroon's. After 300 trials Cameroon's seems to show no bias or hint of bias, but with 600 trials I have to defer to your work at this point.

Regarding Steve's post: Yeah, as has been said, no one thinks that BTS is lying to us; if there is a difference it is due to a bug, nothing more. On the other hand, if at the end of the day we find no discrepancy, then that will certainly silence BTS's detractors. Think the AI cheats? Well, look at this thread where more than 1500 tests were run (and counting).

--Chris


After 800 shots the human player improves his first round hit percentage to 33.125%, and closes the gap just a hair.

The AI, after 800 shots has a first round hit percentage of 38.25%

The all-important difference is now 5.125 percentage points in favor of the AI.

Since the gap WILL be very near 5% after 1,000 trials, I think I should run the test another 500 times in order to put all doubts to rest. I want my margin of error to be very small since we only see a 5% difference anyway.
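For a sense of the margin of error in play, here is a rough sketch (assuming Python with scipy; an editorial illustration, not part of the test procedure) of the approximate 95% half-width on the difference between two hit percentages, using the 800-shot tallies above and the planned 1,500:

```python
from math import sqrt
from scipy.stats import norm

def margin_of_error(p1, p2, n, conf=0.95):
    """Approximate half-width of a CI on (p1 - p2), with n shots per side."""
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return norm.ppf(1 - (1 - conf) / 2) * se

print(margin_of_error(0.3825, 0.33125, 800))    # about +/-0.047 (4.7 points)
print(margin_of_error(0.3825, 0.33125, 1500))   # about +/-0.034 at 1,500 shots
```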

Treeburst155 out.


Not to be picky, but I don't think you are using the chi square properly. Your "expected" value assumes that the average of the two trials is the real expected value. This may or may not be true. In fact the proper way to do this is to simply do the chi2 using one set of values as the expected and the other as the experimental. Remember, the test simply asks what is the likelihood that the results obtained could have come from the same underlying distribution.

If one does this the chi2 comes out to 3.73 and the P<.053
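For what it's worth, one reading that reproduces both numbers (a sketch assuming Python with scipy, using Treeburst155's 600-shot tallies of 228 AI hits and 196 human hits): scoring only the hit counts against the pooled expectation of 212 gives the earlier 2.4, while treating the data as a full 2x2 table of hits and misses per side gives the 3.73 quoted above.

```python
from scipy.stats import chisquare, chi2_contingency

# Hits only, against the pooled expectation of 212 per side
stat, p = chisquare([228, 196], f_exp=[212, 212])
print(stat, p)                       # chi2 ~ 2.42, p ~ 0.12

# Full 2x2 table: [hits, misses] for the AI and for the human, 600 shots each
table = [[228, 372],
         [196, 404]]
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, p)                       # chi2 ~ 3.73, p ~ 0.053
```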

Warren


Originally posted by Lt. Kije:

Lovely!

It is a sad consequence of this that those whose theoretical positions look to a finding of 'no difference' have a much higher bar to cross when they collect data. They have to say, "If the effect existed it would have shown itself after n trials," and then justify that. In academic science, there is a standardized procedure for coming up with that n. It is called 'estimating the power of the test', and we are not getting into it here. (Yes, you are welcome.)

-- Lt. "Pandora" Kije

Scorekeeper and Historian

Grr... one of the first really nasty things I had to do, years and years ago, was to explain to my employer (a PhD mind you, I was the upstart Masters student) how his understanding of Power was incorrect. The protocol evidenced a small effect. He wanted to know how many more subjects needed to be run to guarantee significance at a certain level. I told him that there was no guarantee for him unless he could guarantee me that the remaining subjects would perform exactly like his previous subjects. I leave it to the reader as an exercise to guess the likely outcome of the results after running the additional subjects.

There are essentially four, count 'em, four items we should be concerned about.

  • Type I Error, significance level, that little alpha thingy. This is the chance that we will conclude that there is a relationship, when in fact there is not. Unless you're fond of seeming a git, you want this to be low.
  • Confidence level, 1 minus the alpha thingy. This is the odds that we will say "No relationship" when in actuality there is none. You want a high confidence level; it makes you look smart.
  • Type II Error, the beta doohickey. This is the chance that you will conclude that there is no relationship when, in actuality, there is one. This is the alternative bonehead position.
  • Power, 1 minus the beta doohickey. This is the chance that we will say there is a relationship, when in reality there is one. Just like Tim Allen, "more power" should be your battlecry.

There is a natural and unavoidable tension between the notion of minimizing your chance of a Type I error and maximizing your power. The problem is the following:

Lower your alpha level, and you lower your power.

Increasing alpha increases power (you reject your null hypothesis more often, and therefore, when the alternative is in fact true, you have a greater chance to accept it).
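A small Monte Carlo sketch (assuming Python with numpy and scipy; the 41%/36%/600-shot figures are borrowed from the running example in this thread, not anyone's claim) of that tension: with a real underlying difference, tightening alpha shrinks the fraction of experiments that detect it.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n, p_ai, p_human, runs = 600, 0.41, 0.36, 2_000

pvals = []
for _ in range(runs):
    ai_hits = rng.binomial(n, p_ai)          # simulated AI first hits out of n shots
    hu_hits = rng.binomial(n, p_human)       # simulated human first hits out of n shots
    table = [[ai_hits, n - ai_hits], [hu_hits, n - hu_hits]]
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    pvals.append(p)
pvals = np.array(pvals)

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: power ~ {(pvals < alpha).mean():.2f}")
```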

Oh, and as Treeburst155 keeps running his test mill, at some point we might want to consider whether we should be changing our tests since we are no longer in the realm of small samples...

[ October 31, 2002, 04:17 PM: Message edited by: Herr Oberst ]


Surprising development!! The human turns in his best performance yet. Not only that, the AI has its worst round!

After 1,000 shots, the AI first round HIT percentage is 37.6%

The human first round hit percentage after 1,000 shots is up to 34.2%

The percentage difference is now only 3.4!

Clearly, the AI and I will have to each fire 500 more times. Maybe even 1,000 more times.

Treeburst155 out.

[ October 31, 2002, 04:36 PM: Message edited by: Treeburst155 ]


This topic is now closed to further replies.
