Wednesday, March 3, 2010

Some musings over data

I was playing around with some data from last year’s tournament. I wanted to see whether models containing Top 50 Plus/Minus were better or worse than models containing Top 100 Plus/Minus at predicting whether a team would be in the tournament. I ran two logistic regressions, each of which also included SOS and conference win-loss record. Both models missed on the same 5 teams: Utah St, Maryland, San Diego St, Creighton, and New Mexico.
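For anyone who wants to try this kind of comparison, here is a rough sketch of the idea: fit two logistic regressions that differ only in which plus/minus variable they use, then list the teams each one gets wrong. Everything in it is an assumption for illustration. The teams and numbers are randomly generated (not last year's data), the column names are made up, and the plain gradient-descent fit stands in for whatever a stats package would do.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # w[0] is the intercept; w[1:] line up with the features in x.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit [intercept, w1, ...] by batch gradient descent on log loss."""
    w = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict_proba(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w

def misses(w, rows, labels, names):
    # Teams whose predicted class (cutoff 0.5) disagrees with the label.
    return [name for name, x, y in zip(names, rows, labels)
            if (predict_proba(w, x) >= 0.5) != bool(y)]

# Synthetic, standardized data: NOT real teams or real results.
random.seed(0)
names = [f"Team {i}" for i in range(60)]
data = []
for _ in names:
    top50_pm = random.gauss(0, 1)
    top100_pm = top50_pm + random.gauss(0, 0.5)  # correlated with top50_pm
    sos = random.gauss(0, 1)
    conf_wl = random.gauss(0, 1)
    data.append((top50_pm, top100_pm, sos, conf_wl))

# Made-up selection rule so the example has something to learn (1 = in).
labels = [1 if t100 + 0.5 * s + 0.5 * c + random.gauss(0, 0.5) > 0 else 0
          for (_, t100, s, c) in data]

# Model A uses Top 50 Plus/Minus; Model B uses Top 100 Plus/Minus.
rows50 = [(t50, s, c) for (t50, _, s, c) in data]
rows100 = [(t100, s, c) for (_, t100, s, c) in data]

w50 = fit_logistic(rows50, labels)
w100 = fit_logistic(rows100, labels)

misses50 = misses(w50, rows50, labels, names)
misses100 = misses(w100, rows100, labels, names)
print("Top 50 model misses:", misses50)
print("Top 100 model misses:", misses100)
```

Counting misses at a 50% cutoff is only one way to compare the two models; a team sitting at 49% and a team sitting at 4% both count as one miss, which is why it is worth looking at the predicted probabilities too.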

What surprised me the most was that the selection model gave Utah St only a 4% chance of being selected. This indicated that Utah St, despite earning an 11 seed, was an auto-bid team that would not have made it as an at-large. Thus, it must have earned a true seed of 12 and then been swapped to avoid conflicts. What conflict? Well, three of the four 4/5/12/13 pods were located in Boise or Portland, so Utah St couldn’t go to any of those under the protection requirement. The final 4/5/12/13 slot was taken by Utah, whom Utah St couldn’t play. Since teams cannot be swapped into the top 4 or bottom 4 lines, Utah St had to go up.

So, I switched the dummy variable tag on Utah St to “Out”. On the next run, both Creighton and Utah St were predicted correctly. San Diego St was now correctly predicted as well, but Siena was not. Maryland was hovering around 50%. New Mexico seemed to be the only outlier in the group, but I don’t remember them being talked about much last year.

Another critical nugget: in every regression I ran involving a variable for road/neutral games*, that variable turned up as not statistically significant. Moreover, of the teams I was looking at, Arizona had the fewest road/neutral wins and still got in. Therefore, when choosing your last few teams in, you should avoid making arguments from road/neutral results. It does appear, however, that the seed a team is given is correlated with road/neutral winning percentage. Since the committee selects and seeds at different times, it is possible that some committee members weigh road/neutral results more heavily when seeding than when selecting.
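One common way to check whether a single variable earns its place in a logistic regression is a likelihood-ratio test: fit the model with and without the variable and see how much the log-likelihood improves. The sketch below shows that idea on made-up data. It is not necessarily the test behind the results above (a stats package typically reports a Wald p-value per coefficient), the column names are hypothetical, and the gradient-descent fit only approximates the maximum likelihood, so the statistic is approximate as well.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit(rows, labels, step=0.5, epochs=5000):
    """Approximate maximum-likelihood fit by batch gradient descent."""
    w = [0.0] * (len(rows[0]) + 1)  # w[0] is the intercept
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], x))) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - step * g / n for wi, g in zip(w, grad)]
    return w

def log_likelihood(w, rows, labels):
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], x)))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total += math.log(p) if y else math.log(1 - p)
    return total

# Synthetic data: NOT real teams. The road/neutral column is pure noise
# here by construction, so the test should usually find it unhelpful.
random.seed(1)
rows_full, labels = [], []
for _ in range(80):
    top100_pm = random.gauss(0, 1)
    sos = random.gauss(0, 1)
    road_neutral = random.gauss(0, 1)
    rows_full.append((top100_pm, sos, road_neutral))
    labels.append(1 if top100_pm + 0.5 * sos + random.gauss(0, 1) > 0 else 0)

rows_reduced = [r[:2] for r in rows_full]  # same model minus road_neutral

ll_full = log_likelihood(fit(rows_full, labels), rows_full, labels)
ll_reduced = log_likelihood(fit(rows_reduced, labels), rows_reduced, labels)

# Likelihood-ratio statistic, clamped at 0 in case of fitting error.
lr_stat = max(0.0, 2.0 * (ll_full - ll_reduced))
# Upper tail of a chi-square with 1 degree of freedom: erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2.0))
print(f"LR statistic = {lr_stat:.3f}, p-value = {p_value:.3f}")
```

A large p-value here says only that the data cannot distinguish the variable's effect from zero in this model; with a small sample of bubble teams, that is a weaker statement than "the committee ignores it."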

(Oh, right. The initial question. Top 100 Plus/Minus is better than Top 50 wins, which is better than Top 50 Plus/Minus.)

*The variables I tried: road/neutral wins, road/neutral plus/minus, and non-conference road/neutral wins.

