I did statistical tests to determine the significance of my Meal Tolerance Test (MTT) results.

I started with one-tailed tests of products that I had reason to suspect would raise my blood glucose more than the others. One-tailed tests are appropriate for cases where there is a priori knowledge. The suspect products were Smart Carb #1 bread from Julian Bakery, The Crisp Bar from eat-rite, and Glucerna Shakes. I expected Smart Carb #1 and The Crisp Bar to raise my blood glucose a lot based on my previous tests and from evaluating their ingredients. Glucerna Shakes were suspect because of their high carb count and because corn maltodextrin is their top nutrient.

The test statistic is the difference between the mean change in blood glucose (ΔBG) for the product being tested and the mean ΔBG for all the other products in the category (excluding the one being tested). The significance level is the probability that a difference this large could have arisen by chance if, in fact, the products have no differential effect on blood glucose.

One-tailed tests confirm or disprove an a priori hypothesis – e.g. in the case of Smart Carb #1, that all other breads would have lower ΔBG. Two-tailed tests make no a priori assumptions; they are for post hoc hypothesizing. Think of two-tailed tests as fishing expeditions to find significant differences between products. There is little difference in how one- and two-tailed tests are calculated: one-tailed tests are based on the actual difference of the mean ΔBGs, two-tailed tests on the absolute value of the difference of the mean ΔBGs.

I did two types of statistical test: homoscedastic t-tests, which assume all observations are drawn from the same normal population, and randomization tests (also called permutation tests), which make no assumptions about the underlying distribution.
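For readers unfamiliar with the homoscedastic (pooled-variance) t-test, the sketch below computes its t statistic and degrees of freedom using only the Python standard library. The function name `pooled_t` and the readings are hypothetical, and this is not the macro code used for the analyses on this site. Turning the statistic into a significance level requires the t distribution, which a statistics library can supply (for example, SciPy's `scipy.stats.ttest_ind` with `equal_var=True`).

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(product_dbg, other_dbg):
    """Return (t, degrees of freedom) for an equal-variance two-sample t-test."""
    n1, n2 = len(product_dbg), len(other_dbg)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(product_dbg)
                  + (n2 - 1) * variance(other_dbg)) / df
    t = (mean(product_dbg) - mean(other_dbg)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Hypothetical ΔBG readings in mg/dl, invented for illustration only.
t, df = pooled_t([60, 70, 80], [30, 40, 50, 40])
print(t, df)
```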

## Randomization Tests

The null hypothesis is that the ΔBGs are independent of the products. By random shuffling, we ensure that the ΔBGs are indeed independent of the products. The randomization test compares the difference between the measured mean ΔBG of a product and the measured mean ΔBG of all other products to the differences obtained from a large number of random shuffles, to determine the likelihood that the actual difference between measured means could have occurred randomly.

*Randomization test process* – All the ΔBGs for all the products in a category are put into a single data set. This data set is randomly shuffled. The mean of the first Nt shuffled ΔBGs in the data set is calculated, where Nt is the number of ΔBGs observed for the product being examined. The mean of the rest of the ΔBGs in the data set is also calculated. If the difference between these two means is greater than or equal to the original difference between the mean ΔBGs, a counter (nge = number greater than or equal to) is increased by 1. This process is repeated 999 times for each randomization test. The significance level is then (nge+1)/(NS+1), where NS is the number of shuffles (999). This is essentially the proportion of shuffles in which the test statistic is at least as large as it was for the unshuffled data. A more complex explanation follows the example.
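The process just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the macro code actually used: the function name `randomization_test`, its parameters, and the `seed` argument (added so runs can be repeated) are all hypothetical.

```python
import random
from statistics import mean

def randomization_test(product_dbg, other_dbg, n_shuffles=999,
                       two_tailed=False, seed=None):
    """Return the significance level (nge + 1) / (NS + 1)."""
    rng = random.Random(seed)

    def stat(a, b):
        diff = mean(a) - mean(b)
        return abs(diff) if two_tailed else diff

    actual = stat(product_dbg, other_dbg)          # unshuffled test statistic
    pooled = list(product_dbg) + list(other_dbg)   # single data set
    nt = len(product_dbg)                          # Nt in the text
    nge = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        # First Nt values stand in for the product, the rest for the others.
        if stat(pooled[:nt], pooled[nt:]) >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)
```

Because the unshuffled arrangement is counted as one of the NS+1 cases, the smallest value this can return with 999 shuffles is 1/1000, or 0.1%.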

## “Fishing Expedition” Significance Level

When searching for the best product without a priori knowledge, the significance level produced by randomization tests or t-tests is misleading. It needs to be corrected based on the number of products tested. The more products tested, the more likely it is that products will be found to be better than others purely by chance. For example, if I test 10 products and significance levels are valid, on average I should find one product whose significance level is 10% or better even if there is no differential ΔBG between the products. In fact, the probability of finding at least one product whose apparent significance level is 10% or better is in excess of 65% ($1 - 0.90^{10}$). Therefore, a significance level of 10% for one or more products when testing 10 or more products is not at all unusual even if there is no differential ΔBG between the products. To correct for this, I calculate a “Fishing Expedition” significance level. This is the probability of obtaining a significance level better than or equal to the actual significance level obtained when there are multiple products. The “Fishing Expedition” significance level is $1 - (1 - \text{significance level})^{\text{number of products}}$.
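The correction is a one-line calculation. The sketch below uses a hypothetical function name, `fishing_expedition_level`, and reproduces the 10-product example from the paragraph above. Note that the formula treats the tests on the individual products as independent of one another, which is an approximation.

```python
def fishing_expedition_level(significance, n_products):
    """Chance that at least one of n_products reaches this significance
    level when no product truly differs from the others."""
    return 1 - (1 - significance) ** n_products

# Ten products, each tested at the 10% level.
print(round(fishing_expedition_level(0.10, 10), 4))  # 0.6513
```

With a single product the corrected level equals the uncorrected one, and it grows toward 100% as more products are tested.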

# Example

I used one-tailed tests for Smart Carb #1 because I had a priori knowledge that it would probably spike my blood glucose more than the other breads. The significance level of the one-tailed randomization test of the difference in the means between Smart Carb #1 bread and all of the other breads is 0.1%. This means that if there was really no difference at all between Smart Carb #1 and all of the other breads (regardless of the distribution), the probability that Smart Carb #1 would have shown such poor blood glucose results in my experiment just by chance is about 0.1%. This is a very low probability, so I have a high degree of confidence that Smart Carb #1 really does spike my blood glucose.

I could not use one-tailed tests for Chompie’s Carbs…Not! Sesame Bread because I had no a priori knowledge of what the results would be. The “Fishing Expedition” significance level of the two-tailed t-test for Chompie’s Carbs…Not! Sesame Bread is 15%. This means that if there was really no difference at all between Chompie’s Carbs…Not! Sesame Bread and all of the other breads and all of the observations were drawn from the same normal population, the probability that at least one of the products would have shown such good blood glucose results in my experiment just by chance is 15%.

Similarly, the “Fishing Expedition” significance level of the two-tailed randomization test for Chompie’s Carbs…Not! Sesame Bread is 12%, meaning that if there was really no difference at all between Chompie’s Carbs…Not! Sesame Bread and all of the other breads (regardless of the distribution), the probability that at least one of the products would have shown such good blood glucose results in my experiment just by chance is 12%.

The 15% and 12% “Fishing Expedition” significance levels are low enough that I am reasonably confident that Chompie’s Carbs…Not! Sesame Bread has less effect on my blood sugar than the other breads.

# Further Explanation of Randomization Tests

Suppose the data consist of 10 ΔBG scores for Product X and 40 ΔBG scores for other products. Compute the mean ΔBG score for Product X and the mean ΔBG score for all the other products. Suppose we expected, based on the ingredients in Product X, that it would have a more unfavorable effect on blood glucose levels than would other products. In other words, our hypothesis is that the mean ΔBG score for Product X will be larger than the mean ΔBG score for other products. Suppose it turns out this way and the difference in the means is 18 mg/dl. But is this significant, or could it have occurred due to chance? It is possible that due to naturally occurring variations in blood glucose levels having nothing to do with ingestion of the products, one could get a difference in the mean ΔBG scores as large as 18 mg/dl. Indeed, because of naturally occurring variations in blood glucose levels and glucometer inaccuracies, in an experiment involving a small number of readings, one would seldom find that the mean ΔBG scores of two products would be the same. So how does one determine whether a difference in mean ΔBG scores as large as 18 mg/dl could be due to chance or is more likely due to a real difference between the products?

One way to do this is a randomization test. I used a computer, but the mechanics of the test are best visualized with a deck of 3″ by 5″ note cards. Recall that in our experiment Product X has 10 ΔBG observations and the other products have 40 observations and the mean ΔBG score for Product X is 18 mg/dl higher than the mean ΔBG score for the other products. This difference in the means is our test statistic and the “actual value” of the test statistic is 18 mg/dl.

Now write down each of the 50 ΔBG scores on a separate note card. The null hypothesis is that the ΔBG scores are independent of the products – in other words, the products have no effect on the ΔBG scores. The alternative hypothesis is that the products do have an effect on the ΔBG scores and indeed, the mean ΔBG score is higher (worse) for Product X than for the other products. We test the null hypothesis (which we hope to reject in favor of the alternative hypothesis) using the following procedure:

If the null hypothesis is true, the observed ΔBG scores are independent of the products. We can ensure this is true by shuffling the note cards into two piles – one containing 10 cards and the other containing 40 cards. We then compute the mean ΔBG score for each pile. The difference in these means we call the “pseudo statistic”. If the alternative hypothesis is correct and Product X really is worse than the other products, then ordinarily when we shuffle the data the value of the pseudo statistic should be less than the value of the statistic for the unshuffled data, which in this case was 18 mg/dl. However, by chance the pseudo statistic could be greater than or equal to 18 mg/dl – we know it could be at least 18 mg/dl by chance because even after shuffling, the note cards in the Product X pile could be the actual ΔBGs for Product X. It is unlikely, but it could happen.

If it turns out that the difference in the mean ΔBG score for the shuffled data is greater than or equal to 18 mg/dl, we will add 1 to a counter we call “nge”, which stands for number greater than or equal. Shuffle the note cards again into two piles of 10 cards and 40 cards. Recompute the difference in the mean ΔBG scores. Compare this to 18 mg/dl. If it is greater than or equal to 18 mg/dl, add 1 to the nge counter. Repeat NS (number of shuffles) times, where NS is a large number like 999.

After completing NS shuffles, the significance of the score 18 mg/dl is determined by computing (nge+1)/(NS+1). Let’s suppose that (nge+1)/(NS+1) is 0.003 after conducting 999 shuffles. This means that in 999 shuffles, there were only two instances where the difference in the mean ΔBG scores was greater than or equal to 18 mg/dl. We can conclude that if the products have no effect on ΔBG scores (which we ensure by shuffling the ΔBG scores relative to the products), it is extremely unlikely that the difference in the mean ΔBG scores between Product X and the other products would have been as large as 18 mg/dl by chance. Therefore, we reject the null hypothesis of no product-specific effect on blood glucose levels in favor of the alternative hypothesis that Product X is worse than the other products.
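Written out, the arithmetic for this hypothetical case, with nge = 2 and NS = 999, is:

$$
\frac{nge + 1}{NS + 1} = \frac{2 + 1}{999 + 1} = \frac{3}{1000} = 0.003
$$

The same formula shows why the significance level can never be reported as zero: even if no shuffle reaches 18 mg/dl (nge = 0), the result with 999 shuffles is 1/1000, or 0.1%.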

## Acknowledgement

Dr. Eric Noreen wrote most of this Statistical Analysis page and wrote the computer macros used in all our statistical analyses. Gary Noreen bears sole responsibility for everything on this web site, however.