For the expected values E, you simply enter the O values of the large corpus. However, the reference corpus should be considerably larger than the data under test so that it can reliably serve as a reference. If the reference data is only of similar size to the O data, the contingency-table method is preferable, since it calculates E values from both the reference data and the O data to be tested, while taking their respective sizes into account.
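The difference between the two ways of obtaining E can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are invented, the function names are hypothetical, and the reference counts are assumed to be rescaled to the size of the sample before they are used as E.

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_from_reference(observed, reference):
    """Rescale the reference counts so that they sum to the sample total."""
    scale = sum(observed) / sum(reference)
    return [r * scale for r in reference]


def expected_from_contingency(observed, reference):
    """E values of a 2 x k table: row total * column total / grand total."""
    grand = sum(observed) + sum(reference)
    col_totals = [o + r for o, r in zip(observed, reference)]
    exp_obs = [sum(observed) * c / grand for c in col_totals]
    exp_ref = [sum(reference) * c / grand for c in col_totals]
    return exp_obs, exp_ref


# Invented counts for three categories (illustration only).
sample = [30, 50, 20]
reference = [4000, 4500, 1500]

# Method 1: the large corpus alone supplies the E values.
e_ref = expected_from_reference(sample, reference)
chi_gof = chi_square(sample, e_ref)

# Method 2: E values are built from both data sets together.
e_obs, e_ref_row = expected_from_contingency(sample, reference)
chi_table = chi_square(sample, e_obs) + chi_square(reference, e_ref_row)

print(round(chi_gof, 3), round(chi_table, 3))
```

With a reference this much larger than the sample, the two statistics lie close together; the gap widens as the two data sets approach the same size, which is when the contingency-table method matters.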
The example uses the distribution of the first 200 non-finite verbs of 5 different texts:
The distribution of the auxiliary have in the 3rd person singular present indicative across the different sections of an arbitrary part of the LOB Corpus (part B in this case) should be normal. By calculating a normal curve from the mean and standard deviation and using it as the expected (E) values, we can use the chi-square test to test for normality.
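A minimal Python sketch of this procedure follows. The per-section counts and the class boundaries are invented for illustration and are not taken from the LOB Corpus; the function name is hypothetical. The sketch fits a normal curve to the values, sorts the values into classes, and compares the observed class frequencies with those the curve predicts.

```python
from statistics import NormalDist, mean, stdev


def normality_chi_square(values, edges):
    """Chi-square statistic for binned values against a fitted normal curve.

    `edges` are the inner class boundaries; the two outermost classes
    are open-ended so that the expected frequencies sum to len(values).
    """
    fitted = NormalDist(mean(values), stdev(values))
    bounds = [float("-inf")] + list(edges) + [float("inf")]
    n = len(values)
    chi2 = 0.0
    table = []
    for lo, hi in zip(bounds, bounds[1:]):
        observed = sum(1 for v in values if lo <= v < hi)
        expected = n * (fitted.cdf(hi) - fitted.cdf(lo))
        chi2 += (observed - expected) ** 2 / expected
        table.append((observed, expected))
    return chi2, table


# Invented occurrences of "has" in 20 sections (illustration only).
counts = [12, 15, 9, 14, 11, 13, 10, 16, 12, 13,
          8, 14, 12, 11, 15, 13, 10, 12, 17, 11]

chi2, table = normality_chi_square(counts, edges=[10.5, 12.5, 14.5])
for observed, expected in table:
    print(observed, round(expected, 2))
print("chi-square:", round(chi2, 3))
```

Because the mean and the standard deviation are estimated from the same data, the degrees of freedom are the number of classes minus 3, which gives 1 for the four classes used here.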