Week 1

In this assignment, you will implement models that correct words that are 1 and 2 edit distances away.

An edit could consist of one of the following four options: delete a letter, switch two adjacent letters, replace a letter with another, or insert a letter.

Given the dictionary of word counts, compute the probability that each word will appear if randomly selected from the corpus of words.

\[P(w_i) = \frac{C(w_i)}{M} \tag{Eqn-2}\]

where

$C(w_i)$ is the total number of times $w_i$ appears in the corpus.

$M$ is the total number of words in the corpus.

For example, the probability of the word ‘am’ in the sentence ‘I am happy because I am learning’ is:

\[P(\text{am}) = \frac{C(\text{am})}{M} = \frac{2}{7} \tag{Eqn-3}\]
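As a rough illustration (the dictionary name word_count_dict below is an assumption for this sketch, not necessarily the assignment's variable), the probabilities could be computed like this:

# Minimal sketch: turn a dictionary of word counts into word probabilities.
# 'word_count_dict' is an assumed name for illustration only.
word_count_dict = {'i': 2, 'am': 2, 'happy': 1, 'because': 1, 'learning': 1}

m = sum(word_count_dict.values())   # M: total number of words in the corpus
probs = {word: count / m for word, count in word_count_dict.items()}

print(probs['am'])   # 2/7 ≈ 0.2857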

Python list comprehensions

delete_letter: given a word, it returns all the possible strings that have one character deleted.

For example, given the word nice, it would return the set: {‘ice’, ‘nce’, ‘nic’, ‘nie’}.

Step 1: Create a list of 'splits'. This is all the ways you can split a word into a Left part and a Right part. For example, 'nice' is split into: [('', 'nice'), ('n', 'ice'), ('ni', 'ce'), ('nic', 'e'), ('nice', '')]. This step is common to all four functions (delete, replace, switch, insert).

Step 2: This is specific to delete_letter. Here, we are generating all words that result from deleting one character.
This can be done in a single line with a list comprehension. You can make use of this type of syntax:
[f(a,b) for a, b in splits if condition]

For our ‘nice’ example you get: [‘ice’, ‘nce’, ‘nie’, ‘nic’]
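Putting the two steps together, a minimal sketch of delete_letter could look like the following (the exact signature and return type in the assignment may differ):

def delete_letter(word):
    # Step 1: all the ways to split the word into a Left and a Right part
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Step 2: drop the first character of the Right part wherever one exists
    return [L + R[1:] for L, R in splits if R]

print(delete_letter('nice'))   # ['ice', 'nce', 'nie', 'nic']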

Short circuit

In Python, the logical operators 'and' and 'or' have two useful properties: they can operate on lists, and they have 'short-circuit' behavior. Try these:

# example of logical operation on lists or sets
print( [] and ["a","b"] )
print( [] or ["a","b"] )
#example of Short circuit behavior
val1 =  ["Most","Likely"] or ["Less","so"] or ["least","of","all"]  # selects first, does not evaluate remainder
print(val1)
val2 =  [] or [] or ["least","of","all"] # continues evaluation until there is a non-empty list
print(val2)

Dynamic Programming

The diagram below describes how to initialize the table. Each entry D[i,j] represents the minimum cost of converting string source[0:i] to string target[0:j]. The first column is initialized to represent the cumulative cost of deleting the source characters to convert string “EER” to “”. The first row is initialized to represent the cumulative cost of inserting the target characters to convert from “” to “NEAR”.

Filling in the remainder of the table uses the 'Per Cell Operations' in equation (5) above. Note that the diagram below shows some of the three sub-calculations in light grey; only the 'min' of those values is stored in the table by the min_edit_distance() function.
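To make the per-cell operation concrete, here is a hedged sketch of min_edit_distance, assuming costs of 1 for insert and delete and 2 for replace (these costs, and the exact signature, are assumptions for this sketch):

import numpy as np

def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    # D[i, j] = minimum cost of converting source[:i] into target[:j]
    m, n = len(source), len(target)
    D = np.zeros((m + 1, n + 1), dtype=int)
    D[:, 0] = np.arange(m + 1) * del_cost   # first column: delete every source character
    D[0, :] = np.arange(n + 1) * ins_cost   # first row: insert every target character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            r = 0 if source[i - 1] == target[j - 1] else rep_cost
            # per-cell operation: only the minimum of the three sub-calculations is stored
            D[i, j] = min(D[i - 1, j] + del_cost,    # delete
                          D[i, j - 1] + ins_cost,    # insert
                          D[i - 1, j - 1] + r)       # replace (free if characters match)
    return D[m, n]

print(min_edit_distance('eer', 'near'))   # 3 with these costs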

Week 2: Part of Speech Tagging

The smoothing was done as follows:

\[P(t_i | t_{i-1}) = \frac{C(t_{i-1}, t_{i}) + \alpha }{C(t_{i-1}) + \alpha N}\tag{3}\]

where $N$ is the total number of tags and $\alpha$ is the smoothing parameter.
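As an illustration of equation (3), here is a sketch of building the smoothed transition matrix A, assuming tag_counts maps each tag to its count and transition_counts maps (previous tag, tag) pairs to counts (these names and structures are assumptions for this sketch):

import numpy as np

def create_transition_matrix(alpha, tag_counts, transition_counts):
    # Add-alpha smoothing of transition probabilities, as in equation (3).
    # tag_counts: {tag: count}; transition_counts: {(prev_tag, tag): count} -- assumed structures.
    all_tags = sorted(tag_counts.keys())
    num_tags = len(all_tags)                  # N in equation (3)
    A = np.zeros((num_tags, num_tags))
    for i, prev_tag in enumerate(all_tags):
        for j, tag in enumerate(all_tags):
            count = transition_counts.get((prev_tag, tag), 0)
            A[i, j] = (count + alpha) / (tag_counts[prev_tag] + alpha * num_tags)
    return A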

Exercise 6 - initialize

Instructions: Write a program below that initializes the best_probs and best_paths matrices.

Both matrices will be initialized to zero except for column zero of best_probs.

Here is how to initialize column 0 of best_probs:

Conceptually, it looks like this: $\text{best\_probs}[i, 0] = \mathbf{A}[s_{idx}, i] \times \mathbf{B}[i, corpus[0]]$

In order to avoid multiplying and storing small values on the computer, we’ll take the log of the product, which becomes the sum of two logs:

$\text{best\_probs}[i,0] = \log(A[s_{idx}, i]) + \log(B[i, vocab[corpus[0]]])$

Also, to avoid taking the log of 0 (which is undefined, and tends to negative infinity), the code will simply set best_probs[i,0] = float('-inf') when A[s_idx, i] == 0.

So the implementation to initialize best_probs looks like this:

if A[s_idx, i] != 0: best_probs[i,0] = log(A[s_idx, i]) + log(B[i, vocab[corpus[0]]])

if A[s_idx, i] == 0: best_probs[i,0] = float('-inf')
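A minimal sketch of the initialization, assuming A and B are NumPy arrays, vocab maps words to column indices of B, and s_idx is the row index of the start token in A (these names follow the description above; the assignment's exact signature may differ):

import math
import numpy as np

def initialize(tag_counts, A, B, corpus, vocab, s_idx):
    num_tags = len(tag_counts)
    # Both matrices start at zero; only column 0 of best_probs is filled in here.
    best_probs = np.zeros((num_tags, len(corpus)))
    best_paths = np.zeros((num_tags, len(corpus)), dtype=int)
    for i in range(num_tags):
        if A[s_idx, i] == 0:
            best_probs[i, 0] = float('-inf')
        else:
            best_probs[i, 0] = math.log(A[s_idx, i]) + math.log(B[i, vocab[corpus[0]]])
    return best_probs, best_paths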

3.2 - Viterbi Forward

In this part of the assignment, you will implement the viterbi_forward segment. In other words, you will populate your best_probs and best_paths matrices.

Here is an example with a three-word corpus “Loss tracks upward”:

Compute the probability that the tag of the second word (‘tracks’) is a verb, 3rd person singular present (VBZ).

for each word in the corpus

    for each POS tag type that this word may be
    
        for each POS tag type that the previous word could be

            compute the probability of the best path ending in that previous POS tag, times the transition probability from the previous tag to the current tag, times the probability that the current tag emits the current word.
            
            retain the highest probability computed for the current word
            
            set best_probs to this highest probability
            
            set best_paths to the index 'k', representing the POS tag of the previous word which produced the highest probability 
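The loop above could be realized along these lines (a sketch assuming smoothed A and B with no zero entries, and the best_probs and best_paths matrices from the initialization step; the assignment's exact signature may differ):

import math

def viterbi_forward(A, B, corpus, best_probs, best_paths, vocab):
    num_tags = best_probs.shape[0]
    for t in range(1, len(corpus)):              # each word in the corpus (after the first)
        for j in range(num_tags):                # each POS tag the current word may have
            best_prob, best_path = float('-inf'), 0
            for k in range(num_tags):            # each POS tag the previous word could have
                # best path to the previous tag + transition k -> j + emission of the current word by tag j
                prob = (best_probs[k, t - 1]
                        + math.log(A[k, j])
                        + math.log(B[j, vocab[corpus[t]]]))
                if prob > best_prob:             # retain the highest probability
                    best_prob, best_path = prob, k
            best_probs[j, t] = best_prob
            best_paths[j, t] = best_path         # index k of the previous tag that produced it
    return best_probs, best_paths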

3.3 - Viterbi Backward

Now you will implement the Viterbi backward algorithm.

The example below shows how to walk backwards through the best_paths matrix to get the POS tags of each word in the corpus. Recall that this example corpus has three words: “Loss tracks upward”.

POS tag for ‘upward’ is RB

POS tag for ‘tracks’ is VBZ

POS tag for ‘Loss’ is NN
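Following the walk described above, here is a sketch of the backward pass, assuming a list states that maps tag indices to tag names (an assumed name for illustration):

import numpy as np

def viterbi_backward(best_probs, best_paths, states):
    # Start from the most probable tag of the last word, then follow best_paths
    # backwards to recover the tag index of every earlier word.
    m = best_probs.shape[1]                      # number of words in the corpus
    z = [0] * m                                  # tag indices, filled from the end
    z[m - 1] = int(np.argmax(best_probs[:, m - 1]))
    for t in range(m - 1, 0, -1):
        z[t - 1] = int(best_paths[z[t], t])
    return [states[idx] for idx in z]            # convert tag indices to tag names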

Course 3: Language Models and Auto-Complete

Develop n-gram based Language Models

In this section, you will develop the n-gram language model.

The conditional probability of the word at position $t$ in the sentence, given that the words preceding it are $w_{t-n}, \dots, w_{t-2}, w_{t-1}$, is:

\[P(w_t | w_{t-n}\dots w_{t-1} ) \tag{1}\]

You can estimate this probability by counting the occurrences of these sequences of words in the training data.

\[\hat{P}(w_t | w_{t-n} \dots w_{t-1}) = \frac{C(w_{t-n}\dots w_{t-1}, w_t)}{C(w_{t-n}\dots w_{t-1})} \tag{2}\]

Later, you will modify equation (2) by adding k-smoothing, which avoids errors when any counts are zero.

Equation (2) tells us that to estimate probabilities based on n-grams, you need the counts of n-grams (for the denominator) and (n+1)-grams (for the numerator).
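As a sketch of how equation (2) can be computed from counts (the helper names count_n_grams and estimate_probability below are illustrative; the assignment's own functions may differ):

from collections import defaultdict

def count_n_grams(tokens, n):
    # Count every n-gram (a tuple of n consecutive tokens) in the token list.
    counts = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

def estimate_probability(word, previous_n_gram, n_gram_counts, n_plus1_gram_counts):
    # Equation (2): C(previous n-gram, word) / C(previous n-gram), without smoothing.
    previous_n_gram = tuple(previous_n_gram)
    denominator = n_gram_counts.get(previous_n_gram, 0)
    if denominator == 0:
        return 0.0                               # the k-smoothing added later avoids this case
    numerator = n_plus1_gram_counts.get(previous_n_gram + (word,), 0)
    return numerator / denominator

tokens = "i am happy because i am learning".split()
unigram_counts, bigram_counts = count_n_grams(tokens, 1), count_n_grams(tokens, 2)
print(estimate_probability("happy", ("am",), unigram_counts, bigram_counts))   # C(am, happy) / C(am) = 1/2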