Introduction to Statistics for Linguists

Morphological Analysis and Lexicon Construction (Lecture 9)

Lecturer: Gerold Schneider

Overview

NAME: Introduction to Statistics for Linguists

AIM: To give an outline of the theoretical background of statistics for computational linguists, with a practical example of a descriptive linguistic test and the fundamental ideas of information theory, on which e.g. statistical taggers are based.

TOC:

CLAIM: Statistical methods are vital in quantitative linguistics. Although the theoretical background involves complicated mathematics, the tools most relevant to linguistics, e.g. the standard deviation or the χ² test, are relatively simple to use. While information theory is the major method for statistical NLP, its central assumptions are foreshadowed in Zipf's laws.

FRAME: Literature:

GAME: Let's start!

Introduction

Qualitative and Quantitative Measures

Qualitative vs. quantitative linguistics

  • quantitative measures:
    gradable: 1.71 m, 3h, etc.
  • qualitative measures:
    features: [+female], [-Verb], etc.
    Frequencies of features can be expressed in a ratio

"Average": Mean, Median, Mode

mean: what we usually mean by "average". mean(1,5,6)=4. Equal area on both sides.

(1)    x̄ = Σxᵢ / n

median: the value in the middle of a sorted list. median(1,2,6)=2. 50% of the tokens have higher values, 50% have lower ones.

mode: the value which is most frequent. mode (1,1,4,6)=1. Peak in distribution graph.
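A minimal sketch with Python's statistics module, using the toy values from above:

# Mean, median and mode with Python's standard library.
from statistics import mean, median, mode

print(mean([1, 5, 6]))      # -> 4, equal "area" on both sides
print(median([1, 2, 6]))    # -> 2, the value in the middle of the sorted list
print(mode([1, 1, 4, 6]))   # -> 1, the most frequent value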

Distribution Graphs and Normal Distribution

Suppose we want to illustrate test marks with a bar-graph:

(A)

We can easily draw similar distribution graphs for e.g.

  • word lengths (in letters) in 6 "London Gazettes" of 1691:

(B)

  • The weight of UK citizens (a sample of 1000 arbitrarily chosen people).
  • Frequencies of "has" in the 44 texts of LOB Category A (Press:reviews):

(C)

etc.

In these cases, you will get charts which more or less resemble the so-called normal distribution:

In many other cases, you will not get a normal distribution. While we often expect a normal distribution for qualitative measures, we often expect an even distribution for quantitative measures.

Non-finite Verbs in Texts I-V

Dispersion: Variance, Standard Deviation

Q1: Suggest a distribution graph for "Time needed to travel between home and work per day" on a scale of 0 to 3 hours.

Q2: Suggest a distribution graph for "Height of people" on a scale of 0 to 3m.

Q3: Compare the two graphs.

Even if results have a similar mean, median or mode, their dispersion may vary greatly. Consequently, comparing only average values is at best a hint at statistical peculiarities, not a reliable tool, let alone a "proof" of any theory. [Butler 68-9]: "If we have demonstrated a difference in, say, the means or medians for two sets of data, can we not just say that one is indeed greater than the other, and leave it at that? This would, in fact, be most unwise, and would show a lack of understanding of the nature of statistical inference."

Measures of dispersion needed:

First idea: Sum of differences from mean: [Woods/Fletcher/Hughes 41]

-> positives and negatives cancel each other out. The result is always ZERO!

Better idea: Sum of squared differences

-> renders positive values for each token

-> weighs strong deviations more heavily.

The sum of the squared differences, divided by the number of tokens (minus one) is the variance:

(2)    s² = Σ(xᵢ − x̄)² / (n − 1)

It is useful to use s instead of s². s is called the Standard Deviation:

(3)    s = √( Σ(xᵢ − x̄)² / (n − 1) )

The standard deviation tells us how much a token deviates from the mean on average, i.e. how much we can expect a token to deviate. E.g. if we add one more token to our sample, we can expect it to deviate by about the standard deviation s.

In a PERFECT normal distribution, 68% of all sample values lie within x̄ ± s, i.e. within the mean plus or minus the standard deviation, while 95% of all values lie within x̄ ± 2s. But most real distributions look more or less different from the PERFECT normal distribution, so these percentages vary accordingly. If for any particular value x we want to find out how much it deviates from the mean x̄ in relation to the standard deviation, we simply divide x minus x̄ by s.

(4)    z = (x − x̄) / s

This is the so-called z-score. For x = x̄ + s, for example, z = 1.

An easy way to express the amount of dispersion of a distribution graph is to calculate the standard deviation in relation to the mean, i.e. to calculate a relative standard deviation. This is a coefficient, a ratio, and can therefore be expressed as a percentage (thus the × 100% in formula (5)). The percentage, called the variation coefficient, conveys what percentage of the mean the standard deviation amounts to.

(5)    v = (s / x̄) × 100%
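A minimal sketch of formulas (2) to (5) in Python (the sample values are invented):

from math import sqrt

data = [12, 15, 15, 16, 18, 20]          # hypothetical sample
n    = len(data)
mean = sum(data) / n

variance = sum((x - mean) ** 2 for x in data) / (n - 1)    # formula (2)
s        = sqrt(variance)                                  # formula (3)

z_scores  = [(x - mean) / s for x in data]                 # formula (4)
var_coeff = s / mean * 100                                 # formula (5), in percent

print(mean, variance, s, var_coeff)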

By comparing on the one hand the averages (mean, mode and median) and on the other hand the standard deviations of two sets of data, we already get a much clearer picture of whether differences between them are statistically relevant or not. But still, because real distributions differ from the perfect normal distribution, they do not deliver reliable data. A simple test of "how normal" a distribution is consists in calculating the mean, the median and the mode. Since they coincide in a perfect normal distribution, the amount of their differences gives a (very) rough idea of how close a distribution is to "normal".

In a perfect normal distribution, 95% of all values lie within x̄ ± 2s. A value outside this interval (95% is a common confidence interval) can be said to be statistically "abnormal".

Since we often want to compare sets of data, and since most distributions are not perfectly normal, different tests are needed. Such tests do in fact exist; one of them is the χ² (chi-square) test.

Statistical Relevance

When can a feature be said to be statistically relevant?

Relevance and Probability

What we want to know in descriptive linguistics (for sociolinguistic studies, etc.) is not the amount of differences between two observable sets, but the PROBABILITY of observing them. E.g. normal distribution, two events (like coin-tossing) with equal probability.

HEAD: 1/2        HEAD, followed by HEAD: 1/2 x 1/2=1/4 etc.
TAIL: 1/2        TAIL, followed by TAIL: 1/2 x 1/2=1/4 etc.

Let us name the probability of HEAD as h and the probability of TAIL as t.

Tosses   Probabilities = binomials                                        Pascal's triangle      q   (p(one outcome) = 1/q)

1        1 = (h+t)¹ = h + t                                               1  1                    2
2        1 = (h+t)² = h² + 2ht + t²                                       1  2  1                 4
3        1 = (h+t)³ = h³ + 3h²t + 3ht² + t³                               1  3  3  1              8
4        1 = (h+t)⁴ = h⁴ + 4h³t + 6h²t² + 4ht³ + t⁴                       1  4  6  4  1           16
5        1 = (h+t)⁵ = h⁵ + 5h⁴t + 10h³t² + 10h²t³ + 5ht⁴ + t⁵             1  5  10  10  5  1      32
n        1 = (h+t)ⁿ = ...                                                 ...                     2ⁿ

This is indeed the mathematical model of the normal distribution!
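A sketch that reproduces such a row of the triangle and the resulting probabilities (n = 5 tosses, chosen arbitrarily):

from math import comb

n = 5                        # number of tosses
q = 2 ** n                   # number of equally likely outcome sequences
coeffs = [comb(n, k) for k in range(n + 1)]   # Pascal's triangle row: 1 5 10 10 5 1
probs  = [c / q for c in coeffs]              # probability of k HEADs in n tosses

print(coeffs)                # [1, 5, 10, 10, 5, 1]
print(probs)                 # approximates a normal curve as n grows
print(sum(probs))            # 1.0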

If, for example, we want to test how normal our sampled data is, we need to compare it with such a model, hoping the data will match it closely. Or, if we want to show that our data is NOT normally distributed, it has to match as little as possible. Obviously such tests involve complex mathematics. Fortunately, charts with the results are available, so we do NOT have to bother too much about the mathematical details.

Test for fit of data                          Test of statistical 
to a model or theory               vs.        relevance of a discrepancy
 
                                 CLAIM:
 
The probability of achieving                  The probability of achieving
the ACTUAL distribution of data               the ACTUAL distribution of data
is very high (> 95%)                          is very low (< 5%)
 
                               PROCEDURE:
 
"prove" that the null-hypothesis              "prove" that the null-hypothesis
does apply                                    does NOT apply

The null-hypothesis suggests that the deviations and fluctuations in our data are due to chance, the small sample size, or insufficient care in selecting our sample, and that consequently our data is indeed very probable. [Butler 69-70]

Sample Size and Intervals

Number of groups: We need to divide our data into appropriate intervals. In order to attain a representation which can resemble a normal distribution, you need at the very least 3, but better at least 6 intervals. The χ² test needs at least two intervals. The more intervals you make, the more data you need to avoid gaps and fluctuations. It is hardly useful to make more than 20 intervals.

Number of values per group: Groups with very low values cannot yield reliable statistical information. This is the so-called sparse data problem. For the χ² test, e.g., every interval must contain at least 5 values. "Border intervals" may be collapsed (cf. ill. C.2).

Number of total values: From the above it follows that the χ² test needs at the very least 10 samples to work. [Woods/Fletcher/Hughes 103]
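The following sketch is not part of the handout's procedure, just one possible way to collapse sparse border intervals until every remaining group contains at least 5 values:

def collapse_borders(counts, minimum=5):
    """Merge border intervals into their neighbours until every
    remaining group holds at least `minimum` values."""
    groups = list(counts)
    # collapse from the left border
    while len(groups) > 2 and groups[0] < minimum:
        groups[1] += groups[0]
        del groups[0]
    # collapse from the right border
    while len(groups) > 2 and groups[-1] < minimum:
        groups[-2] += groups[-1]
        del groups[-1]
    return groups

# hypothetical counts per interval, e.g. texts per "has"-frequency band
print(collapse_borders([1, 3, 9, 12, 7, 2]))   # -> [13, 12, 9]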

The following remarks refer to the χ² test, the only test I intend to introduce here.

Ad Graph (A): 29 pupils, barely enough to make 3 groups. The result will not be very reliable, but still valid. Including one or two more classes would be recommended.

Ad Graph (B): Fine. Groups 16 to 18 are collapsed and then contain 20 values.

Ad Graph (C): Re-grouping is necessary: one new group containing three others. One text did not contain any "has"; this fact should also be included in the chart, which entails collapsing the first two groups as well. Sampling more data (e.g. LOB B & C, Brown A, or counting "has" per 1000 words instead of "has" per text @ 2000 words) would be nice, but is not necessary. The re-grouped graph:

(C.2)

Comparing to a Standard ("goodness-of-fit"): the χ² Test (Chi-square Test)

There is a great variety of statistical tests; the χ² test is just one of them, perhaps not always the most suitable, but probably the most universal one.

Its principle: Compare the value of each interval with its corresponding expected value (from a "standard"), i.e. calculate the difference. In order to eliminate negative values, and in order to count big aberrations more strongly, we square this difference (similar to the standard deviation: x minus x̄). We do not want to know the absolute difference, but the difference relative to the height of the bar, so we divide it all by the expected value. As for the standard deviation, we then add up all the values. In a formula (o = observed value, e = expected value, df + 1 = number of intervals):

(6a) χ² test, step 1:

     SD = χ² = Σ (oᵢ − eᵢ)² / eᵢ

alternatively,

The total deviance SD does not yet convey information on significance directly. SD is further processed by a complex probabilistic calculation, whose results are compiled into charts which are much easier to handle.

(6b) χ² test, step 2: Look up the value v in the chart under the correct df, either

  • in the p = 95% column, for supporting the null-hypothesis: if SD < v, then accept the null-hypothesis,

or

  • in the p = 5% column, for refuting the null-hypothesis: if SD > v, then refute the null-hypothesis.

Degrees of Freedom (df): Generally the number of groups minus 1. [Woods/Fletcher/Hughes 138]: "[T]he degrees of freedom can be considered in a sense as the number of independent pieces of information we have on which to base the test of a hypothesis". In contingency tables the df is: (number of columns -1) x (number of rows-1).
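A minimal sketch of steps 1 and 2 in Python (the observed and expected counts are invented; scipy's chi-square distribution replaces the printed chart of critical values):

from scipy.stats import chi2

observed = [18, 25, 32, 25]        # hypothetical interval counts
expected = [25, 25, 25, 25]        # counts predicted by the "standard"

# step 1: sum of (o - e)^2 / e over all intervals
sd = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# step 2: compare with the critical value v at df = number of groups - 1
df = len(observed) - 1
v  = chi2.ppf(0.95, df)            # critical value at the 5% level (the v from the chart)
print(sd, v, sd > v)               # True would mean: refute the null-hypothesis (p < 5%)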

A standard may be:

  • a theoretical model, e.g. the normal distribution, which then gives a very accurate test for normality:

    Calculate the appropriate standard values (the z-scores of the interval boundaries), then look up the expected proportions in a chart and multiply them by the number of samples -> e. Then we proceed to the χ² test, steps 1 + 2 (see the sketch after this list).

  • other sampled data, the bigger the better. In this case, contingency tables, a special variant of the χ² test, are especially suitable. [Woods/Fletcher/Hughes 140]. Because this situation of qualitative measurement is very frequent in linguistics, I am going to base my first practical example on it.
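For the first kind of standard, a sketch of how the expected values e for a test of normality could be computed (interval boundaries, counts, mean and s are all invented for illustration; scipy's normal CDF plays the role of the printed chart of proportions):

from scipy.stats import norm

counts     = [4, 11, 19, 12, 4]          # observed counts per interval
boundaries = [2.0, 4.0, 6.0, 8.0]        # inner interval boundaries
n          = sum(counts)

# sample mean and standard deviation would normally come from the data;
# here they are simply assumed
mean, s = 5.0, 2.0

# z-scores of the boundaries, then expected proportion per interval
z        = [(b - mean) / s for b in boundaries]
cdf      = [0.0] + [norm.cdf(zi) for zi in z] + [1.0]
expected = [n * (cdf[i + 1] - cdf[i]) for i in range(len(counts))]

# these e-values then go into the chi-square test, steps 1 + 2
# (border intervals with small e would be collapsed, as described above)
print(expected)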

 

A Practical Example of a Contingency Table

Returning to our discussion from the last lecture about the nominal/verbal character of Scientific English, we can now test if our findings are statistically relevant or a chance fluctuation.

VALUES FROM LOB:

ABS(K)   ABS(J)   Tag
294      1472     _BE
274      584      _BED
1000     1127     _BEDZ
46       149      _BEG
104      14       _BEM
172      480      _BEN
155      1007     _BER
328      2486     _BEZ
223      159      _DO
167      75       _DOD
37       108      _DOZ
279      595      _HV
791      343      _HVD
29       64       _HVG
21       23       _HVN
81       469      _HVZ
1033     2135     _MD
12       56       _NC
7130     26933    _NN
2055     9897     _NNS
9        775      _NNU
2        9        _NP
79       147      _NR
1        5        _NRS
249      159      _PN
3180     2585     _PP
1761     563      _PPA
531      755      _PPAS
124      110      _PPL
48       81       _PPLS
819      113      _PPO
234      205      _PPOS
2484     4024     _VB
2820     1363     _VBD
987      1491     _VBG
1375     5284     _VBN
151      1449     _VBZ

ABS(K)   ABS(J)
29085    67294    All of the above
56853    154691   All words

Contingency table for the χ² test, with absolute values:

OBSERVED:

                     ABS(K)    ABS(J)    Totals
N&Pro                16234     42393     58627
All Verbs            12851     24901     37752
Verbs&N&Pro          29085     67294     96379
All Words            56853     154691    211544
V&N&P / All Words    51.16%    43.50%

The contingency table above sums up the nominal and the verbal categories. Let us assume that we simply want to compare the relation of all nominal categories (nouns, pronouns) to all verbal categories (main verbs, auxiliaries, modals).

EXPECTED: (row total × column total) / grand total

                     ABS(K)    ABS(J)
N&Pro                17692     40935
All Verbs            11393     26359
Verbs&N&Pro          29085     67294

(O - E):

                     ABS(K)    ABS(J)
N&Pro                -1458     1458
All Verbs            1458      -1458

(O - E)² / E:

                     ABS(K)    ABS(J)
N&Pro                120       52
All Verbs            187       81

TOTAL = chi-square value: 439

Look up in a chart or use a probability program:

Probability at df = 1: < 0.1%
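The same figures can be reproduced with scipy (a minimal sketch; correction=False switches off Yates' continuity correction so that the value matches the hand calculation above):

from scipy.stats import chi2_contingency

observed = [[16234, 42393],     # N&Pro     in LOB K and J
            [12851, 24901]]     # all verbs in LOB K and J

chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(round(chi2), df, p)       # roughly 439, df = 1, p far below 0.1%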

More elaborate examples are available here.

Information Theory

Information theory is used in many CL contexts, e.g. tagging.

Information Theory Terms

The term "information" is understood as a measure of rarity, unexpectedness and uncertainty.

Let us look at a short conversation between A and B:

A1: Hello!
B1: Oh, hi!
A2: How are you?
B2: Fine, and you?
A3: Great, I have just been on holidays!
B3: Holidays? Lucky you! Where to?
A4: O, to the Mediterranean.
B4: Loads of sunshine ...
A5: You can count on that!
B5: Well ... I've got to move on ...
    Are you in tomorrow for a cup of vodka?
A6: That would be great!
B6: Fine. See you then!
A7: See you!

At which places are there many options, are we thus uncertain how the conversation could go on? --> Entropy

Where do we encounter an unexpected, unlikely statement? --> Rareness, Mutual Information

Entropy is generally a measure of randomness, a concept also used in the natural sciences. Entropy is low in situations where the probabilities are very unequal. E.g. a greeting is usually answered by a greeting; A1 to B2 contain virtually no information, as we expect this continuation of the conversation. In situations where there are many possibilities of equal probability, such as after the "great" in A3, entropy is very high. In this sense, the utterance that A was on holidays has a very high information content.

Where entropy and thus information is low, mutual information (MI) between two succeeding units is high. While the word "Mediterranean" is generally infrequent and thus informative, the "holidays" context renders it much more likely and thus much less informative.

p("Meditarranean") < p("Mediterranean"|"holidays")

On the other hand, the "vodka" comes as a surprise in the "cup" context. Perhaps "vodka" is rarer in this context than in general language - although this is a hypothesis one would have to prove:

p("vodka") > p("vodka"|"cup") ?

On the word level, the transition between "cup" and "of"

p("of"|"cup")

is likely, MI thus high, the transition from "of" to "vodka"

p("vodka"|"of")

more unlikely, but the collocation of "cup" and "vodka"

p("vodka"|"cup")

is most unlikely - there is hardly any or no MI.

On the POS level, however, all the transition probabilities p(nᵢ | nᵢ₋₁) are relatively high:

p(N|PREP), p(PREP|N)
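A toy sketch of how such conditional probabilities and (pointwise) mutual information could be estimated from bigram counts (all counts are invented):

from math import log2

# hypothetical unigram and bigram counts from some corpus
unigrams = {"cup": 200, "of": 5000, "vodka": 15}
bigrams  = {("cup", "of"): 180, ("of", "vodka"): 3, ("cup", "vodka"): 0}
N        = 100_000                     # total number of tokens

def p(word):                           # unigram probability
    return unigrams[word] / N

def p_cond(word, given):               # p(word | given)
    return bigrams[(given, word)] / unigrams[given]

print(p_cond("of", "cup"))             # high -> high MI between "cup" and "of"
print(p_cond("vodka", "of"))           # low
print(p_cond("vodka", "cup"))          # 0    -> hardly any or no MI

# pointwise mutual information: log2( p(x,y) / (p(x) * p(y)) )
pmi = log2((bigrams[("cup", "of")] / N) / (p("cup") * p("of")))
print(pmi)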

Bayesian Statistics

p(A|B) is the relative or conditional probability of event A GIVEN event B, i.e. the probability of A if we already know that B holds. Relative probabilities are used in Bayesian statistics, on which most statistical NLP approaches rely.

(7)    p(A|B) = p(A ∩ B) / p(B)

While this formula is not easily proven, it can be seen that p(A|B) depends on p(A ∩ B) and p(B).

If p(A ∩ B) is bigger while p(B) remains constant, then p(A|B) increases (positive correlation); there is more mutual information, and A and B are more dependent on each other.

If p(B) is bigger while p(A ∩ B) remains constant, then p(A|B) decreases (negative correlation); there are more B cases which are not in A, which decreases the dependency of A on B and thus the mutual information.

We have seen uses of conditional probabilities for hidden Markov models in the lecture on statistical taggers.

If it is easier to determine p(B|A) than p(A|B), the one conditional probability is related to the other by means of Bayes' theorem:

p(A|B) = p(B|A) × p(A) / p(B)
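A tiny numerical sketch of Bayes' theorem with invented tagging probabilities:

# Bayes' theorem: p(A|B) = p(B|A) * p(A) / p(B)
# Hypothetical tagging example: A = "the tag is NN", B = "the word is 'flies'"
p_B_given_A = 0.0002      # p(word = "flies" | tag = NN)
p_A         = 0.25        # p(tag = NN)
p_B         = 0.0001      # p(word = "flies")

p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)        # 0.5: given the word "flies", the tag NN has probability 0.5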

Information, Communication, Efficiency

Information theory was developed in the 1940s by Claude Shannon in order to calculate maximally efficient compression algorithms for sending data over slow telephone lines. Entropy, the measure of unexpectedness and information of a random variable, is normally measured in bits in computing science. In order to transmit the outcome of rolling a regular 8-sided die, 3 bits of information are necessary:

1    2    3    4    5    6    7    8
001  010  011  100  101  110  111  000

If, however, certain outcomes or patterns of the language to be transmitted are frequent, then they should be represented by the shortest bit-sequences possible for optimal data compression. Modern data compression algorithms are still based on this simple idea. In a (simplified) version of Polynesian, only 6 letters are known, with the following frequencies:

p    t    k    a    i    u    
1/8  1/4  1/8  1/4  1/8  1/8

The letter entropy is 2.5 bits. In order to produce the shortest possible encoding, the frequent letters are given 2-bit codes, the others 3-bit codes:

p    t    k    a    i    u    
100  00   101  01   110  111

As 2-bit codes begin with 0 and 3-bit codes with 1, decoding is never ambiguous.
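A short sketch that recomputes the 2.5 bits and checks that the code above indeed reaches this optimum:

from math import log2

freq = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
code = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

# entropy H = - sum of p * log2(p)
H = -sum(p * log2(p) for p in freq.values())
print(H)                                          # 2.5 bits

# expected code length of the encoding above
L = sum(freq[c] * len(code[c]) for c in freq)
print(L)                                          # 2.5 bits: the code is optimal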

Zipf's laws, which state that

  • the most frequent words are shortest
  • the most frequent words are most ambiguous (but humans can easily disambiguate them in context)
  • wordlist rank * frequency is constant, i.e. the most frequent words are extremely frequent and the most expressive (informative!) ones very rare

are in full agreement with information theory, almost a consequence of it.