AI cheats! (with real data)


Three observations here.

First, I am very surprised that an initial 600 trials vary significantly from an additional 1200 trials. I am tempted to replicate, but not tempted enough. Not questioning Treeburst's methodology, but I wish people had taken up Maastrician's challenge and done a dozen independent experiments using a single test-bed.

Second, I'm surprised nobody has noticed that good old Warren Peace's work is still of great interest. His methods are the ones that continue to show some anomalous AI advantage and thus seem to me to demand further exploration. His methods have led to an interesting theoretical conjecture -- that the AI does not suffer the same global morale problem as human forces.

This is of great interest because it would explain a variety of other things that have been reported in other threads! E.g. a computer-owned Russian crew bails out of a just-destroyed scout car, late in a game where the Russians have lost 90% of their battalion, into a rocky patch surrounded by fresh Veteran Panzergrenadiers, with a MK IV panzer thirty yards away and unoccupied by other targets, and the Russian crew does not surrender but rather pulls out their (two) pistols and starts to shoot at the Panzergrenadier squad a few feet away. Maybe the computer's forces almost never surrender because their global morale does not sink as it should?

Will no one pick up the Warren Peace Regimental standard and carry it forward? More data collected exploring his conjecture could lead to an important discovery that could improve CM.

Third, I never thought I would hear Bangor, Maine described as a 'central' location. Thewood, thank you for that laugh. It started my day and my day will be better for it.

-- Lt. Kije

Scorekeeper and Historian

Most people don't realize that Bangor, Maine used to be the main(e) stopping point for transatlantic flights due to the B-52-capable runway that exists there. Many planes continuing further west used it as a refueling point.

Not only that, but proximity to the UofM means a lot of drunk, toothless, and loose women. Enough to make any Maineiac drool.

PS It's always better than Bangor, WA...

[ November 01, 2002, 10:44 AM: Message edited by: thewood ]

Originally posted by Lt. Kije:

Originally posted by Treeburst 155:

quote: BTW, the worst human performance was 60/200. The best was 77/200. Worst AI performance 70/200, the best was 82/200.

I picture a pair of bell shaped curves, one shifted a bit to the right of the other.

-- Lt. Kije

Lt. Kije,

I've been wondering if I made some errors in my earliest test groups. The AI scored its very highest score in the first group of 200. The human scored his very worst score in the second group. I'm a real careful guy however.

I only ran 9 groups of 200. Taking each group as a whole, it's not too difficult to imagine the first two groups were unfortunately just the outliers. They weren't very out of whack at all really. It's just that the highest AI scores came at the same time as the lowest human scores by coincidence I think. In any case, the last 1,200 shots showed things to be perfectly even.
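For a rough sense of how much spread to expect between groups of 200, a binomial back-of-envelope check helps. The sketch below assumes every shot is an independent trial with a hit rate near 36% (an assumed figure, roughly in line with the overall percentages reported elsewhere in this thread). It says nothing about how the game itself resolves shots.

```python
import math

def binomial_sd(n, p):
    """Standard deviation of the hit count in n independent shots."""
    return math.sqrt(n * p * (1 - p))

GROUP_SIZE = 200
ASSUMED_HIT_RATE = 0.36  # assumption, not a figure taken from the game engine

expected_hits = GROUP_SIZE * ASSUMED_HIT_RATE
sd = binomial_sd(GROUP_SIZE, ASSUMED_HIT_RATE)

# Best and worst group scores quoted earlier in the thread.
z_scores = {}
for score in (60, 70, 77, 82):
    z_scores[score] = (score - expected_hits) / sd
    print(f"{score}/200 is {z_scores[score]:+.2f} SD from the expected {expected_hits:.0f}")
```

Under those assumptions all four extreme scores sit within two standard deviations of the expected count, which fits the reading that the early groups were mild outliers rather than errors.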

As for Warren's testing, I think he may have something there with the morale question. Now that I'm satisfied with my test I will study his stuff carefully from the beginning. We have eliminated first shot, and first round hits, as possible sources of AI advantage.

Treeburst155 out.

Warren,

Send it my way, and I will add to the database. :D Email's in the profile.

Lt Kije,

I was under the impression that at no point in my early testing were the results considered significant, in spite of the 5-6% gap. I don't think I ever managed a gap vs sample number situation that would have passed the P factor test.

Treeburst,

First, my hat's off to you for all your persistence. No science ever gets done without an exhausting deathmarch through data collection and analysis. Hero Of The Forum medal to Treeburst!

Second, were I not a greybeard I would be embarrassed to admit the number of times I've run experiments, realized late in the process that the early trials were not carried out or observed right, then had to start all over again. What was supposed to be The Experiment turned out to be a pilot experiment. And (I blush as I write this) sometimes the next, better, version turned out to be just another pilot experiment.

A guy I know (Danny Kahneman) just won a Nobel Prize, only the second ever won by a psychologist (if you don't count Pavlov as a psychologist). He was able to get interesting papers out every year, but only because he had an army of grad students running new experiments every week. Almost all got thrown out, a few became pilot experiments for further exploration, and one in a hundred actually turned out to have been done right and to lead to something interesting. Good science can be 90% 'wasted effort'.

I would not be surprised to hear that you changed your observation technique as the first 600 trials played out and that the final 1200 trials were observed in a more consistent fashion than the first 600.

Third, yes, I recall that in the 'early' :) trials the computed chi-squares did not achieve significance. Which is what we would hope to see if your later trials demonstrate little or no difference. An apparent difference that has not achieved statistical reliability is exactly that, not reliable. 'Not reliable' means that if we were to re-run the experiment we might well get the opposite difference or no difference.

I have to add a tiny editorial comment that may offend a bit. Warren Peace re-ran Maastrician's chi square tests and got different results from Maastrician. I was never completely comfortable with Maastrician's chi-square reasoning and was hoping someone else would independently run the tests. I see the early p value reporting as possibly correct, possibly incorrect.

Your 'blue collar statistics' approach, though, may be taking us so close to indisputable population parameters that p value computations are no longer needed.

I am heartened that you may bring your bone crushing techniques to Warren Peace's conjecture.

-- Lt. Kije

Scorekeeper and Historian

Since Warren has not sent me his tests yet, I created one of my own. I know, I know. We want to run the same tests. In the meantime however, here's what I'm doing.

20 isolated firing lanes. T34/85 against captured T34/85, regular crews on both sides, No FOW, tungsten rounds removed from both sides, Range 740 meters, kill chance OK both sides, hit chance 43% for both sides, all units confined to one 20 meter square by terrain. June '44 so Russian regulars will hopefully be as good as German regulars.

NO extra forces to alleviate possible morale issues!!

I will play the German side and manually target the enemy. I will run only one turn. At the end of the turn I will count the unscathed vehicles on each side. 'Unscathed' will be defined as NO vehicle damage or crew casualties. Crews may button up, be alerted, cautious, or shaken and still be considered unscathed; but they cannot have casualties. Immobilized, abandoned, KO, gun damaged, and bailing crews will not be counted.

Reasoning for testing like this:

I think whatever happens in the first minute will determine the end results of the battle if it were played out. It is enough to know, IMO, if a side does consistently better in the first minute when played by the AI as compared to a human playing the same side.

After a hundred runs as the Germans, I will play the Allies one hundred times, again manually targeting.

Treeburst155 out.

[ November 01, 2002, 02:17 PM: Message edited by: Treeburst155 ]

Oh man... this thread still lives? Wow :)

Warren Peace wrote:

I suspect that the global morale effect is only working for the human and is not engaged by the AI. I suggest that BFC examine this possibility.
No need :) Global Morale has nothing to do with unit behavior related to gunnery. Global Morale only affects how fragile units are when they are significantly spooked. Thus a very low Global Morale will lead to more units wigging out, all things considered. But it doesn't go any deeper than that. Global Morale also acts as a higher level trigger to end games early.

Steve

By the way, Treeburst's excellent work proves the point (for the umpteenth time) that it is folly to judge how something works from a small sample. This thread started out with the "proven" sample showing that the AI has an advantage and went all the way to Treeburst's adjusted testing (i.e. tossing out the first 600) showing statistically zero difference. One was a quick study and the other a more statistically significant one.

This is a good lesson for those who don't understand statistics and how vital repetition is to proving a point like the one this thread is based around.

Oh, and there is no inherent bonus/penalty for any side when firing rounds. A Veteran Soviet crew is identical (statistically and probability-wise) to a Veteran Polish, Veteran German, Veteran Italian, Veteran Romanian, or Veteran Hungarian crew. Finns are of course +1 better :) There might be some differences with the vehicles, however. So a Vet crew in a PzIVG might act differently than one in a T-34/76 (1942). However, this is generally not applicable unless there is some difference like cupola and rate of fire.

Steve

Steve,

Treeburst's findings have not invalidated Warren Peace's original finding of an AI advantage. Treeburst's method (so far) has been different from Warren Peace's.

But you are right about the other point. In any domain where the findings from an initial 600 trials can be negated or reversed by the next 1200 trials, statistics are of no help. In such a domain, no sample of 30 or 40 trials could ever give us results we could have confidence in or generalize from.

Of course, I've never run across such a domain before now. Doesn't make me question statistics. Makes me question the domain.

[Edit: Actually, I have been able to imagine a domain where 600 trials is not enough to give a stable estimation of population parameters. It is one in which noise is exceedingly high and 600 trials have provided very little 'signal' in proportion to that noise. I'm now trying to imagine that being possible in Treeburst's experiment. So far, I cannot.]

It is vanishingly unlikely that I can draw 600 random samples from a book bag filled with checkers (with an unknown to me proportion of red vs. black ones), finding 400 black and 200 red ones, then continue drawing randomly from that book bag 1200 more times and draw 300 black and 900 red ones. (I replace each checker after drawing it.) Yes, it "could" happen. But if it did, I'd be pretty doggone amazed. And I'd have no confidence at all, even after 1800 draws, what the proportion of red to black was inside that book bag.
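The book bag argument can be checked with a quick simulation. This is only a sketch: it assumes a bag that is half black (the proportion that produces the most sampling noise), draws with replacement, and uses a fixed seed and run count chosen purely for illustration.

```python
import random

def black_fraction(rng, p_black, draws):
    """Fraction of black checkers in `draws` draws with replacement."""
    return sum(rng.random() < p_black for _ in range(draws)) / draws

def largest_shift(p_black, runs=500, seed=1):
    """Largest gap seen between the first 600 and the next 1200 draws."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(runs):
        first = black_fraction(rng, p_black, 600)
        later = black_fraction(rng, p_black, 1200)
        worst = max(worst, abs(first - later))
    return worst

# The hypothetical shift from the post: 400/600 black, then 300/1200 black.
observed_shift = 400 / 600 - 300 / 1200
worst_simulated = largest_shift(p_black=0.5)  # assumed bag composition

print(f"shift in the example:   {observed_shift:.3f}")
print(f"largest simulated shift: {worst_simulated:.3f}")
```

The shift in the example is about 0.42, while the standard error of the difference between the two batches is only about 0.025, so a fair run of simulated bags should never come close to it.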

-- Lt. Kije

Scorekeeper and Historian

[ November 01, 2002, 03:52 PM: Message edited by: Lt. Kije ]

Hi All,

I am not a statistician by profession, but I am a software developer, and I must agree with a previous post that it simply does not make sense from a S/W development viewpoint to design the game engine to have a separate code path for turn resolution and execution to accommodate the AI. So I suspect that the original test case that spawned this thread, while well intentioned, may be flawed in setup and/or analysis.

To see if this is the case, simply run the same tests again. If the original test is legit, each set of results from subsequent tests should be similar to the original, as there is only a 1/333 chance that the original result was achieved by chance. If these new tests diverge from the original, then I suggest that the sample size of n=36 is an overstatement and is really n=6, and you will need to run 24 more tests on each side before your data set is meaningful.

BTW, the results from my tests show that the Axis will lose 3 StuGs for every 4 T-34/85s regardless of which side the AI is playing.

So what have I learned from this? First, that the AI does not cheat (at least in this instance); second, that I really need to get a life :)

-E

Steve,

I think everything is totally consistent with a global morale problem. I am simply counting how frequently tanks are abandoned. If global morale were lower, wouldn't that lead to crews panicking quicker and abandoning their tanks quicker? I have noticed that if tanks get immobilized in these battles the crews almost always abandon them before the tank actually gets penetrated.

Warren

[ November 01, 2002, 03:14 PM: Message edited by: Warren Peace ]

Elbrus,

Actually if you sum up all of the six-tank tests I have performed I have done at least 30 for each side. The result is very consistent. I'll be happy to send you the scenario for your own testing if you wish.

Treeburst,

20 is too many tanks. Global morale issues will be swamped out because each tank destroyed will have a smaller effect on morale. Try a six vs. six tank battle. I will send you my scenario when I get home.

Warren

[ November 01, 2002, 03:38 PM: Message edited by: Warren Peace ]

Fighting 220 one minute duels as the Germans, and manually targeting the enemy, I survived unscathed* 80 times. My AI Allied opponent survived the mad minute unscathed* 79 times.

Fighting the exact same duels again, but taking the Allied side, and again manually targeting, the AI German survived unscathed* 89 times, while I survived unscathed* 79 times.

German side human controlled:

German survivors: 80

Russian survivors: 79

German side AI controlled:

German survivors: 89

Russian survivors: 79

*Unscathed is defined as NO vehicle damage or crew casualties. Crews could be panicked as long as none were casualties. This happened exactly one time. It must have been a penetration with no damage.

By this definition an immobilized vehicle, or one with just one crew casualty would not be counted. IOW, they were thrown in with the killed and abandoned vehicles. This probably makes the test less meaningful. Also, there is the fact that there were a few occasions where the duel was still in progress at the one minute mark. I would estimate approx. 5% of the duels were still going after the mad minute, some with immobilized vehicles I'm sure.
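Whether 80 versus 89 German survivors is a meaningful gap can be checked with a plain 2x2 chi-square test. The sketch below assumes 220 independent duels under each controller (the post's figure, read as one duel per trial) and uses Pearson's statistic without a continuity correction. The function name is made up for illustration.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square and p-value (1 d.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

DUELS = 220  # assumed number of duels per controller
# Rows: German side human-controlled, German side AI-controlled.
# Columns: unscathed, not unscathed.
chi2, p_value = chi_square_2x2(80, DUELS - 80, 89, DUELS - 89)
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

Under these assumptions the statistic falls well short of the 3.84 needed for significance at the 5% level, so a nine-tank gap over 220 duels is within ordinary sampling noise.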

Treeburst155 out.

[ November 01, 2002, 04:17 PM: Message edited by: Treeburst155 ]

Lt Kije,

At the end of 600 samples the AI percentage was 38.00%. At the end of 1,800 samples the AI percentage was 36.94%.

For the human at 600 the percentage was 32.67%. After 1,800 it was 35.11%. Is this change over 1,200 additional samples that unlikely?
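One way to put numbers on that question is a two-proportion z-test. The hit counts below are reconstructed from the quoted percentages (228 and 196 of 600, 665 and 632 of 1,800), and the sketch assumes the AI and the human each fired the same number of independent shots. Both are assumptions, not figures stated in the thread.

```python
import math

def two_proportion_z(hits_a, hits_b, n):
    """Pooled two-proportion z-test for two samples of equal size n."""
    pooled = (hits_a + hits_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (hits_a / n - hits_b / n) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value

# Counts reconstructed from the percentages (assumed, see above).
AI_600, HUMAN_600 = 228, 196      # 38.00% and 32.67% of 600
AI_1800, HUMAN_1800 = 665, 632    # 36.94% and 35.11% of 1,800

z_600, p_600 = two_proportion_z(AI_600, HUMAN_600, 600)
z_1800, p_1800 = two_proportion_z(AI_1800, HUMAN_1800, 1800)
z_last, p_last = two_proportion_z(AI_1800 - AI_600, HUMAN_1800 - HUMAN_600, 1200)

for label, z, p in (("first 600", z_600, p_600),
                    ("all 1,800", z_1800, p_1800),
                    ("last 1,200", z_last, p_last)):
    print(f"{label}: z = {z:+.2f}, p = {p:.3f}")
```

On these reconstructed counts the first 600 shots fall just short of the usual 5% threshold, the full 1,800 are nowhere near it, and the last 1,200 are almost exactly even, which matches the description of the early groups as mild outliers.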

Treeburst155 out.

[ November 01, 2002, 04:15 PM: Message edited by: Treeburst155 ]

Warren,

Are you tracking abandoned tanks, and attempting to determine the cause of the abandonment? Are you looking at how many non-casualty crewmembers are actually abandoning the vehicles? Are you studying each of the six vehicles individually to determine what has occurred to cause them to abandon? It all seems fuzzy to me, and VERY time consuming.

Treeburst155 out.

Just checking in as an amazed witness of all of this fascinating if increasingly arcane process. Clearly the qualities of persistence and attention to detail TB has revealed as a tester are also what make him an outstanding tourney organizer. I'm also enjoying the contributions of Steve, Warren Peace & Lt. Kije.

At least we've provoked Steve to tell us more about the CM engine. And Lt. K's discourses on scientific method are fascinating. In short, the accidental consequences of the testing process may be more interesting than the direct. That having been said, test on!

[ November 01, 2002, 05:03 PM: Message edited by: CombinedArms ]

Nothing of the sort. I play the scenario until all firing stops (usually two, at most three turns). I see how many of my tanks are alive. I surrender and then see how many of my opponent's are alive. Then I play again. With 6 tanks it does not take very long.

My comment about abandonment is simply an observation that often after the first turn I will have an immobile tank with the crew inside, and on the next turn the crew will immediately abandon the tank rather than continue to shoot.
