In this assignment, you will implement models that correct words that are 1 and 2 edit distances away.
An edit could consist of one of the following options: deleting a letter, switching two adjacent letters, replacing a letter with another, or inserting a new letter. These correspond to the four functions (delete, switch, replace, insert) described below.
Given the dictionary of word counts, compute the probability that each word will appear if randomly selected from the corpus of words.
\[P(w_i) = \frac{C(w_i)}{M} \tag{Eqn-2}\]
where
$C(w_i)$ is the total number of times $w_i$ appears in the corpus.
$M$ is the total number of words in the corpus.
For example, the probability of the word ‘am’ in the sentence ‘I am happy because I am learning’ is:
\[P(am) = \frac{C(am)}{M} = \frac{2}{7} \tag{Eqn-3}\]
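As a quick check of Eqn-2 and Eqn-3, here is a minimal sketch in Python; the names word_counts and probs are illustrative, not names required by the assignment:

```python
from collections import Counter

# Build the word-count dictionary for the example sentence and apply Eqn-2.
corpus = "I am happy because I am learning".lower().split()
word_counts = Counter(corpus)                  # C(w_i) for every word
M = sum(word_counts.values())                  # total number of words in the corpus
probs = {w: count / M for w, count in word_counts.items()}
print(probs["am"])                             # 2/7 ≈ 0.2857
```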
delete_letter(): given a word, it returns all the possible strings that have one character deleted. For example, given the word nice, it would return the set: {‘ice’, ‘nce’, ‘nic’, ‘nie’}.
Step 1: Create a list of ‘splits’. This is all the ways you can split a word into a Left and a Right part. For example,
‘nice’ is split into: [('', 'nice'), ('n', 'ice'), ('ni', 'ce'), ('nic', 'e'), ('nice', '')]
This is common to all four functions (delete, replace, switch, insert).
Step 2: This is specific to delete_letter. Here, we are generating all words that result from deleting one character.
This can be done in a single line with a list comprehension. You can make use of this type of syntax:
[f(a,b) for a, b in splits if condition]
For our ‘nice’ example you get: [‘ice’, ‘nce’, ‘nie’, ‘nic’]
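For reference, here is a minimal sketch that combines Step 1 and Step 2; it illustrates the splits-plus-comprehension approach rather than the graded solution:

```python
def delete_letter(word):
    # Step 1: all (Left, Right) splits of the word
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Step 2: drop the first character of each non-empty Right part
    deletes = [L + R[1:] for L, R in splits if R]
    return deletes

print(delete_letter("nice"))  # ['ice', 'nce', 'nie', 'nic']
```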
In Python, the logical operators and and or have two useful properties: they can operate on lists, and they have ‘short-circuit’ behavior. Try these:
# example of logical operation on lists or sets
print( [] and ["a","b"] )
print( [] or ["a","b"] )
# example of short-circuit behavior
val1 = ["Most","Likely"] or ["Less","so"] or ["least","of","all"] # selects first, does not evaluate remainder
print(val1)
val2 = [] or [] or ["least","of","all"] # continues evaluation until there is a non-empty list
print(val2)
The diagram below describes how to initialize the table. Each entry in D[i,j] represents the minimum cost of converting string source[0:i] to string target[0:j]. The first column is initialized to represent the cumulative cost of deleting the source characters to convert string “EER” to “”. The first row is initialized to represent the cumulative cost of inserting the target characters to convert from “” to “NEAR”.
Filling in the remainder of the table utilizes the ‘Per Cell Operations’ in equation (5) above. Note that the diagram below shows some of the 3 sub-calculations in light grey. Only the ‘min’ of those operations is stored in the table by the min_edit_distance() function.
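A minimal sketch of this table construction is shown below; the costs (delete = 1, insert = 1, replace = 2) are assumptions for illustration rather than values taken from the text:

```python
import numpy as np

def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    m, n = len(source), len(target)
    D = np.zeros((m + 1, n + 1), dtype=int)
    # First column: cumulative cost of deleting the source characters
    for i in range(1, m + 1):
        D[i, 0] = D[i - 1, 0] + del_cost
    # First row: cumulative cost of inserting the target characters
    for j in range(1, n + 1):
        D[0, j] = D[0, j - 1] + ins_cost
    # Fill the rest of the table with the per-cell minimum of the 3 operations
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r_cost = 0 if source[i - 1] == target[j - 1] else rep_cost
            D[i, j] = min(D[i - 1, j] + del_cost,      # delete
                          D[i, j - 1] + ins_cost,      # insert
                          D[i - 1, j - 1] + r_cost)    # replace (or match)
    return D, D[m, n]

D, med = min_edit_distance("EER", "NEAR")
print(med)  # 3 with these assumed costs
```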
The smoothing was done as follows:
\[P(t_i | t_{i-1}) = \frac{C(t_{i-1}, t_{i}) + \alpha}{C(t_{i-1}) + \alpha N}\tag{3}\]
where
$N$ is the total number of tags.
$C(t_{i-1}, t_{i})$ is the count of the tag pair (previous POS tag, current POS tag) in the transition_counts dictionary.
$C(t_{i-1})$ is the count of the previous POS tag in the tag_counts dictionary.
$\alpha$ is a smoothing parameter.
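As an illustration of equation (3), here is a minimal sketch that computes one smoothed row of the transition matrix; the helper name and the value of alpha are assumptions, not part of the assignment:

```python
import numpy as np

# Add-alpha smoothing (equation 3) for one row of the transition matrix A;
# alpha = 0.001 is an assumed value for illustration only.
def smoothed_transition_row(prev_tag, all_tags, transition_counts, tag_counts, alpha=0.001):
    N = len(all_tags)                                        # total number of tags
    row = np.zeros(N)
    for i, tag in enumerate(all_tags):
        count = transition_counts.get((prev_tag, tag), 0)    # C(t_{i-1}, t_i)
        row[i] = (count + alpha) / (tag_counts[prev_tag] + alpha * N)
    return row
```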
Instructions: Write a program below that initializes the best_probs and best_paths matrices.
Both matrices will be initialized to zero, except for column zero of best_probs.
best_probs is initialized with the assumption that the first word of the corpus was preceded by a start token (“--s--”). Here is how to initialize column 0 of best_probs:
Conceptually, it looks like this: $best_probs[i, 0] = \mathbf{A}[s_{idx}, i] \times \mathbf{B}[i, corpus[0]]$
In order to avoid multiplying and storing small values on the computer, we’ll take the log of the product, which becomes the sum of two logs:
$best_probs[i,0] = log(A[s_{idx}, i]) + log(B[i, vocab[corpus[0]]])$
Also, to avoid taking the log of 0 (which would evaluate to negative infinity), the code will simply set $best_probs[i,0] = float('-inf')$ when $A[s_{idx}, i] == 0$.
So the implementation to initialize $best_probs$ looks like this:
${if}\ A[s_{idx}, i] \neq 0 : best_probs[i,0] = log(A[s_{idx}, i]) + log(B[i, vocab[corpus[0]]])$
${if}\ A[s_{idx}, i] == 0 : best_probs[i,0] = float('-inf')$
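Putting the two cases together, a minimal sketch of the initialization could look like the following; the argument names follow the description above, but the exact signature, and the assumption that A's rows follow the ordering of tag_counts' keys, are illustrative:

```python
import math
import numpy as np

def initialize(A, B, tag_counts, vocab, corpus, start_token="--s--"):
    num_tags = len(tag_counts)
    # Both matrices start at zero; only column 0 of best_probs is filled in here.
    best_probs = np.zeros((num_tags, len(corpus)))
    best_paths = np.zeros((num_tags, len(corpus)), dtype=int)
    # Assumed: A's rows are ordered like tag_counts' keys, so this gives s_idx.
    s_idx = list(tag_counts.keys()).index(start_token)
    for i in range(num_tags):
        if A[s_idx, i] == 0:
            best_probs[i, 0] = float("-inf")   # avoid log(0)
        else:
            best_probs[i, 0] = math.log(A[s_idx, i]) + math.log(B[i, vocab[corpus[0]]])
    return best_probs, best_paths
```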
In this part of the assignment, you will implement the viterbi_forward segment. In other words, you will populate your best_probs and best_paths matrices.
Unlike predict_pos (the ‘warm-up’ exercise), this will include the path up to that (word, tag) combination. Here is an example with a three-word corpus “Loss tracks upward”:
Compute the probability that the tag of the second word (‘tracks’) is a verb, 3rd person singular present (VBZ).
In the best_probs matrix, go to the column of the second word (‘tracks’) and row 40 (VBZ); this cell is highlighted in light orange in the diagram below.
Consider the path coming from the previous word ‘Loss’ with the tag ‘NN’: the best_probs matrix contains the value -14.32 in the column for ‘Loss’ and row for ‘NN’.
Go to the A transition matrix, and go to the row for ‘NN’ and the column for ‘VBZ’. The value is $4.37e-02$, which is circled in the diagram, so add $-14.32 + log(4.37e-02)$.
Add the log of the emission probability of ‘tracks’ given ‘VBZ’ from the B matrix, and store the resulting sum in the best_probs matrix at row ‘VBZ’ and column ‘tracks’ (as seen in the cell that is highlighted in light orange in the diagram).
The path from ‘NN’ produces the largest value in best_probs, and so the most likely path to ‘VBZ’ is from ‘NN’. ‘NN’ is in row 20 of the best_probs matrix, so $20$ is the most likely path.
Store 20 in the best_paths table. This is highlighted in light orange in the diagram below.
In pseudocode, the forward pass is:
for each word in the corpus
for each POS tag type that this word may be
for each POS tag type that the previous word could be
compute the probability that the previous word had a given POS tag, that the current word has a given POS tag, and that the POS tag would emit this current word.
retain the highest probability computed for the current word
set best_probs to this highest probability
set best_paths to the index 'k', representing the POS tag of the previous word which produced the highest probability
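A minimal sketch of this pseudocode is shown below; it assumes the smoothed A and B matrices contain no zero entries (so every log is defined) and that vocab maps each word to its column index in B. The signature is illustrative:

```python
import math

def viterbi_forward(A, B, corpus, best_probs, best_paths, vocab):
    num_tags = best_probs.shape[0]
    for i in range(1, len(corpus)):           # for each word (after the first)
        for j in range(num_tags):             # for each POS tag of the current word
            best_prob_i = float("-inf")
            best_path_i = 0
            for k in range(num_tags):         # for each POS tag of the previous word
                # previous best path + transition k -> j + emission of the current word by j
                prob = (best_probs[k, i - 1]
                        + math.log(A[k, j])
                        + math.log(B[j, vocab[corpus[i]]]))
                if prob > best_prob_i:        # retain the highest probability
                    best_prob_i = prob
                    best_path_i = k
            best_probs[j, i] = best_prob_i
            best_paths[j, i] = best_path_i    # index of the previous tag that produced it
    return best_probs, best_paths
```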
Now you will implement the Viterbi backward algorithm. This pass retrieves the most likely POS tag for each word in the corpus using the best_paths and the best_probs matrices. The example below shows how to walk backwards through the best_paths matrix to get the POS tags of each word in the corpus. Recall that this example corpus has three words: “Loss tracks upward”.
POS tag for ‘upward’ is RB
Go to the last column of the best_prob table. In the last column of best_probs, the estimated probability is -34.99, which is larger than the other values in the column. So the most likely POS tag for ‘upward’ is RB, an adverb, at row 28 of best_prob.
z is an array that stores the unique integer ID of the predicted POS tags for each word in the corpus. In array z, at position 2, store the value 28 to indicate that the word ‘upward’ (at index 2 in the corpus) most likely has the POS tag associated with unique ID 28 (which is RB).
pred contains the POS tags in string form. So pred at index 2 stores the string RB.
POS tag for ‘tracks’ is VBZ
Since the POS tag of ‘upward’ is RB, which is uniquely identified by integer ID 28, go to the best_paths matrix in column 2, row 28. The value stored in best_paths, column 2, row 28 indicates the unique ID of the POS tag of the previous word. In this case, the value stored here is 40, which is the unique ID for POS tag VBZ (verb, 3rd person singular present). So the most likely POS tag for ‘tracks’ is VBZ.
In array z, store the value 40 at position 1, and for array pred, store the string VBZ to indicate that the word ‘tracks’ most likely has POS tag VBZ.
POS tag for ‘Loss’ is NN
In best_paths at column 1, the unique ID stored at row 40 is 20. 20 is the unique ID for POS tag NN.
In array z at position 0, store 20. In array pred at position 0, store NN.
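A minimal sketch of this backward walk is shown below; states is an assumed list mapping each unique integer ID back to its POS tag string:

```python
import numpy as np

def viterbi_backward(best_probs, best_paths, states):
    m = best_probs.shape[1]                     # number of words in the corpus
    z = np.zeros(m, dtype=int)                  # unique integer IDs of the predicted tags
    pred = [None] * m                           # predicted tags in string form
    # Last word: pick the row with the highest probability in the last column.
    z[m - 1] = np.argmax(best_probs[:, m - 1])
    pred[m - 1] = states[z[m - 1]]
    # Walk backwards: best_paths[row, i] stores the tag ID of the previous word.
    for i in range(m - 1, 0, -1):
        z[i - 1] = best_paths[z[i], i]
        pred[i - 1] = states[z[i - 1]]
    return pred
```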
In this section, you will develop the n-gram language model.
The conditional probability for the word at position ‘t’ in the sentence, given that the words preceding it are $w_{t-n}\cdots w_{t-2}, w_{t-1}$ is:
\[P(w_t | w_{t-n}\dots w_{t-1}) \tag{1}\]
You can estimate this probability by counting the occurrences of these sequences of words in the training data:
\[\hat{P}(w_t | w_{t-n}\dots w_{t-1}) = \frac{C(w_{t-n}\dots w_{t-1}, w_t)}{C(w_{t-n}\dots w_{t-1})} \tag{2}\]
Later, you will modify equation (2) by adding k-smoothing, which avoids errors when any counts are zero.
Equation (2) tells us that to estimate probabilities based on n-grams, you need the counts of n-grams (for the denominator) and (n+1)-grams (for the numerator).
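To make equation (2) concrete, here is a minimal sketch; the helper names count_n_grams and estimate_probability and the toy sentence are illustrative assumptions, not the assignment's required interface:

```python
from collections import defaultdict

def count_n_grams(sentences, n):
    # Count every n-gram (as a tuple of words) in a list of tokenized sentences.
    counts = defaultdict(int)
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return counts

def estimate_probability(word, previous_n_gram, n_gram_counts, n_plus1_gram_counts):
    previous_n_gram = tuple(previous_n_gram)
    denominator = n_gram_counts.get(previous_n_gram, 0)               # C(w_{t-n}...w_{t-1})
    numerator = n_plus1_gram_counts.get(previous_n_gram + (word,), 0) # C(w_{t-n}...w_{t-1}, w_t)
    return numerator / denominator if denominator > 0 else 0.0

sentences = [["i", "am", "happy", "because", "i", "am", "learning"]]
bigrams = count_n_grams(sentences, 2)
trigrams = count_n_grams(sentences, 3)
print(estimate_probability("happy", ["i", "am"], bigrams, trigrams))  # 1/2 = 0.5
```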