Lab 1

The transformers library provides pipelines for popular tasks like sentiment analysis, summarization, and text generation.

DistilBert for Q&A

from transformers import pipeline

question_answerer = pipeline(task="question-answering", model="distilbert-base-cased-distilled-squad")

# Provide the context as a block of text
context = """
Tea is an aromatic beverage prepared by pouring hot or boiling water over cured or fresh leaves of Camellia sinensis,
an evergreen shrub native to China and East Asia. After water, it is the most widely consumed drink in the world. 
There are many different types of tea; some, like Chinese greens and Darjeeling, have a cooling, slightly bitter, 
and astringent flavour, while others have vastly different profiles that include sweet, nutty, floral, or grassy 
notes. Tea has a stimulating effect in humans primarily due to its caffeine content.

The tea plant originated in the region encompassing today's Southwest China, Tibet, north Myanmar and Northeast India,
where it was used as a medicinal drink by various ethnic groups. An early credible record of tea drinking dates to 
the 3rd century AD, in a medical text written by Hua Tuo. It was popularised as a recreational drink during the 
Chinese Tang dynasty, and tea drinking spread to other East Asian countries. Portuguese priests and merchants 
introduced it to Europe during the 16th century. During the 17th century, drinking tea became fashionable among the 
English, who started to plant tea on a large scale in India.

The term herbal tea refers to drinks not made from Camellia sinensis: infusions of fruit, leaves, or other plant 
parts, such as steeps of rosehip, chamomile, or rooibos. These may be called tisanes or herbal infusions to prevent
confusion with 'tea' made from the tea plant.
"""

result = question_answerer(question="Where is tea native to?", context=context)
print(result['answer'])

# Or ask several questions against the same context:

questions = ["Where is tea native to?",
             "When was tea discovered?",
             "What is the species name for tea?"]

results = question_answerer(question=questions, context=context)

for q, r in zip(questions, results):
    print(q, "\n>> " + r['answer'])
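
Each result returned by the question-answering pipeline is a dictionary that, besides 'answer', also carries a confidence 'score' and the 'start' and 'end' character offsets of the answer within the context. The sketch below shows one way these fields might be used. The sample dictionary is a hand-written stand-in with illustrative values (not real model output), and describe_answer, min_score, and demo_context are hypothetical names introduced only for this example.

```python
def describe_answer(result, context, min_score=0.5):
    """Format a QA pipeline result, flagging low-confidence answers.

    `result` is expected to have the keys 'score', 'start', 'end', 'answer'.
    `min_score` is an arbitrary illustrative threshold, not a recommended value.
    """
    # The start/end offsets index into the original context string.
    span = context[result["start"]:result["end"]]
    flag = "" if result["score"] >= min_score else " (low confidence)"
    return f"{span!r} score={result['score']:.2f}{flag}"


# Hand-written stand-in for a pipeline result; the values are illustrative only.
demo_context = "Tea is made from Camellia sinensis, a shrub native to China and East Asia."
demo_start = demo_context.index("China and East Asia")
sample = {
    "score": 0.42,
    "start": demo_start,
    "end": demo_start + len("China and East Asia"),
    "answer": "China and East Asia",
}

print(describe_answer(sample, demo_context))
```

With a real pipeline, the same helper could be applied to each element of results in the loop above.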

Fine-tuning

To fine-tune your model, you will leverage three components provided by Hugging Face:

Datasets: Library that contains some datasets and different metrics to evaluate the performance of your models.
Tokenizer: Object in charge of preprocessing your text to be given as input for the transformer models.
Transformers: Library with the pre-trained model checkpoints and the trainer object.

Because the dataset was saved in Apache Arrow format, you have to load it with the load_from_disk function from the datasets library:

# Execute this cell if you will use the data we processed instead of downloading it.
from datasets import load_from_disk

# The path where the dataset is stored
path = '/content/tydiqa_data/'

# Load the dataset
tydiqa_data = load_from_disk(path)

idx = 600

# start index
start_index = tydiqa_data['train'][idx]['annotations']['minimal_answers_start_byte'][0]

# end index
end_index = tydiqa_data['train'][idx]['annotations']['minimal_answers_end_byte'][0]

print("Question: " + tydiqa_data['train'][idx]['question_text'])
print("\nContext (truncated): " + tydiqa_data['train'][idx]['document_plaintext'][0:512] + '...')
print("\nAnswer: " + tydiqa_data['train'][idx]['document_plaintext'][start_index:end_index])
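
The field names minimal_answers_start_byte and minimal_answers_end_byte suggest byte offsets, whereas slicing a Python str uses character offsets. The two coincide for pure-ASCII text but can diverge once a document contains multi-byte characters. Below is a minimal sketch, assuming the offsets are UTF-8 byte positions (an assumption, not something confirmed above). slice_by_bytes is a hypothetical helper and the sample string is made up for illustration.

```python
def slice_by_bytes(text, start_byte, end_byte, encoding="utf-8"):
    """Return the substring covered by a byte range of the encoded text."""
    raw = text.encode(encoding)
    # Decode only the requested byte window; errors="replace" avoids a crash
    # if an offset lands in the middle of a multi-byte character.
    return raw[start_byte:end_byte].decode(encoding, errors="replace")


# Made-up sample: the accented character takes two bytes in UTF-8.
doc = "Café au lait was popular in Paris."
answer = "Paris"

# Byte offsets of the answer in the UTF-8 encoding of the document.
start_byte = doc.encode("utf-8").index(answer.encode("utf-8"))
end_byte = start_byte + len(answer.encode("utf-8"))

print(slice_by_bytes(doc, start_byte, end_byte))  # byte-aware slice
print(doc[start_byte:end_byte])                   # naive character slice, shifted by one
```

If an extracted answer looks shifted for some example, comparing the two slices this way is a quick check.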

Next, flatten the dataset so that nested fields become top-level columns, giving you a table structure instead of nested dictionaries. This simplifies the pre-processing steps.

flattened_train_data = tydiqa_data['train'].flatten()
flattened_test_data = tydiqa_data['validation'].flatten()

# Selecting a subset of the train dataset
flattened_train_data = flattened_train_data.select(range(3000))

# Selecting a subset of the test dataset
flattened_test_data = flattened_test_data.select(range(1000))
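
To illustrate what flattening does, here is a plain-Python sketch of the idea on a single hand-made record: nested dictionaries are expanded into top-level keys joined by a dot, so a nested field such as annotations -> minimal_answers_start_byte becomes one column-like key. This is an illustration only, not the datasets implementation. flatten_record and the sample record are hypothetical.

```python
def flatten_record(record, parent_key="", sep="."):
    """Expand nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix along.
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hand-made record that mimics the nesting used above; the values are illustrative.
record = {
    "question_text": "Where is tea native to?",
    "annotations": {
        "minimal_answers_start_byte": [10],
        "minimal_answers_end_byte": [15],
    },
}

for name in flatten_record(record):
    print(name)
```

After flattening with the datasets library, the nested annotation fields are reached through dotted column names of this kind rather than through nested indexing.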