© Abbott Analytics, Inc. 2001-2013
Multi-Word Features: N-Grams
• N-Grams
• Combinations of characters or words
• “N” is the number of consecutive characters or words in each group you extract
• 2-grams (bigrams, digrams): “vice president”
• 3-grams (trigrams): “central intelligence agency”
• 4-grams: “united states of america”
• Constructing bigrams: a simple model, P(A|B), the probability that word A follows word B, estimated as count(B A) / count(B)
• Example Corpus (from Jurafsky and Martin):
• <s> I am Sam </s>
• <s> Sam I am </s>
• <s> I do not like green eggs and ham </s>
• P(I|<s>) = 2/3; P(Sam|<s>) = 1/3; P(Sam|am) = 1/2; P(am|Sam) = 0/2, etc.
• More sophisticated: allow gaps between words
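The bigram estimates above can be reproduced with a short Python sketch. It is illustrative only, not the author's implementation: the names `ngrams`, `skip_bigrams`, and `bigram_prob` are made up for this example, sentences are tokenized on whitespace, and "gaps between words" is read here as skip-bigrams (word pairs with up to `max_gap` words between them), which is one possible reading.

```python
from collections import Counter
from fractions import Fraction

# The three-sentence example corpus from the slide.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_gap):
    """Return ordered word pairs with at most max_gap words between them.

    Assumption: this is one way to "allow gaps between words".
    """
    pairs = []
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs


# Count single tokens and adjacent pairs over the whole corpus.
unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(ngrams(tokens, 2))


def bigram_prob(word, prev):
    """Estimate P(word | prev) as count(prev word) / count(prev).

    Raises ZeroDivisionError if prev never occurs in the corpus.
    """
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])


print(bigram_prob("I", "<s>"))    # P(I|<s>)
print(bigram_prob("Sam", "am"))   # P(Sam|am)
print(skip_bigrams("I do not like".split(), max_gap=1))
```

Using `Fraction` keeps the estimates as exact ratios, so they can be compared directly with the values on the slide. A pair that never occurs, such as "Sam am", gets probability 0, which is why practical models usually add smoothing.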
Wednesday, July 10, 13