14.3.1. Task 1

Learning Objectives:

Read and process multiple text files using Python.
Build and normalize frequency models and save them as CSV files.
Visualize analysis results using plots.

Introduction

Identifying the language of a written text can be a challenging task, but using what you have learned about Python programming, you can create a program to analyze and compare different languages. A simple way to compare written languages is to examine how often different character patterns appear.

One common approach is to build n-gram frequency models. An n-gram is a sequence of $n$ consecutive items. In this task, we will focus on character n-grams, where the items are the individual characters of a given text. For example, the 2-grams (or bigrams) in the word hello are he, el, ll, and lo.
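The bigram example above can be reproduced with string slicing. This is only a minimal sketch to illustrate the definition; the helper name char_ngrams is hypothetical and is not part of the assignment template.

```python
def char_ngrams(text, n):
    """Return the character n-grams of text, in order of appearance."""
    # A string of length L contains L - n + 1 n-grams (none if L < n).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("hello", 2))  # ['he', 'el', 'll', 'lo']
```

Each n-gram overlaps the previous one by n - 1 characters, which is why a five-letter word yields four bigrams.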

Different languages tend to produce different n-gram distributions. In this task, you will build n-gram relative frequency models for several languages using sample text files. You will then save these models as CSV files and visualize the results using plots. These models will be used in the next task to classify the language of an unknown text.

Task Instructions

Develop a Python program that takes in multiple language sample text files, builds n-gram relative frequency models for each language, and displays the results.

Before creating the program, create a flowchart of the algorithm you will use and save it as py4_ind_1_username.pdf. Then start your program from a copy of the ENGR133_Python_Template.py Python template. Your program should be named py4_ind_1_username.py. You will also need to create a folder named sample_texts within the same folder as your Python script. Then, download each of the sample texts in Table 14.7 and place them into your sample_texts folder. Your program should do the following:

Load the sample text files.
Clean the text data by converting it to a consistent case (e.g., all lowercase) and removing punctuation and any other characters that are not letters or spaces.
For each language sample, build n-gram counts for all values of $n$ from $1$ to $5$.
Normalize the n-gram counts to obtain relative frequencies.
Save models to a CSV output file for each language.
Generate plots for each language showing the top 10 n-grams for a user-specified value of $n$.

Table 14.7 Sample Text Download
Sample Text	Download Link
Dutch Sample Text	`sample_dutch.txt`
English Sample Text	`sample_english.txt`
French Sample Text	`sample_french.txt`
German Sample Text	`sample_german.txt`
Italian Sample Text	`sample_italian.txt`
Spanish Sample Text	`sample_spanish.txt`

Note

Make sure you have saved the samples into your sample_texts folder, and that the folder is located in the same directory as your Python script.

Load Samples Function

Create a function named load_samples that has no parameters and returns a dictionary where the keys are the language names (e.g., “english”, “french”) and the values are the corresponding text data from the sample files.

When using open() to read the text files, ensure you specify the correct encoding (encoding='utf-8') to handle special characters properly.

Note

The iterdir() method from the pathlib module can be useful for listing all files in a directory.

from pathlib import Path

path = Path("sample_texts")
files = list(path.iterdir())

Path objects have a name attribute that can be useful in extracting the language name from the file name. You can read more about it in the official documentation.
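One possible way to combine these pieces is sketched below. It assumes the files follow the sample_<language>.txt naming used in Table 14.7. The folder parameter is an assumption added here so the sketch can be tried on any directory; the assignment asks for a function with no parameters, so in your own program the folder would be fixed. The sketch uses the stem attribute (the file name without its extension) rather than name.

```python
from pathlib import Path


def load_samples(folder="sample_texts"):
    """Map each language name to the full text of its sample file."""
    samples = {}
    # Assumption: every sample file is named sample_<language>.txt
    for file_path in sorted(Path(folder).glob("sample_*.txt")):
        # "sample_english.txt" -> stem "sample_english" -> "english"
        language = file_path.stem.split("_", 1)[1]
        samples[language] = file_path.read_text(encoding="utf-8")
    return samples
```

Calling load_samples() with no arguments reads from the sample_texts folder relative to the current working directory.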

Clean Text Function

Create a function named clean_text that takes in a dictionary of texts (as returned by the load_samples function) and returns a new dictionary with the same keys but with cleaned text data. The cleaning process should do the following:

Convert all text to lowercase.
Remove all characters that are not in that language’s alphabet or a space (i.e., remove punctuation, numbers, special characters, and newlines).

Note

You can find several useful methods for string manipulation in the official documentation.
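The two cleaning steps can be sketched as follows. This sketch assumes that str.isalpha() is an acceptable stand-in for "in that language's alphabet": it accepts any Unicode letter, including accented ones such as é or ü, which is broader than a strict per-language alphabet. Treat it as one possible approach, not the required one.

```python
def clean_text(texts):
    """Return a new dictionary with lowercased, letters-and-spaces-only text."""
    cleaned = {}
    for language, text in texts.items():
        lowered = text.lower()
        # Assumption: keep any Unicode letter plus the space character;
        # digits, punctuation, and newlines are dropped.
        cleaned[language] = "".join(
            ch for ch in lowered if ch.isalpha() or ch == " "
        )
    return cleaned
```

Note that dropping newlines outright joins the last word of one line to the first word of the next; whether that matters depends on how the sample files are laid out.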

Create N-gram Function

Create a function named create_n_gram that takes in a number n, the language name, and the cleaned text dictionary (as returned by the clean_text function). This function should return a dictionary where each key is an n-gram and each value is the count of how many times that n-gram appears in the text for the specified language.
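A minimal sketch of the counting step is shown below, using the parameter order described above. It assumes that spaces kept by the cleaning step count as ordinary characters, so n-grams may span word boundaries; the assignment text does not say otherwise, but check this against your own expectations.

```python
def create_n_gram(n, language, cleaned_texts):
    """Count how often each character n-gram occurs in one language's text."""
    text = cleaned_texts[language]
    counts = {}
    # Slide a window of width n across the text, one character at a time.
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    return counts


print(create_n_gram(2, "demo", {"demo": "banana"}))
# {'ba': 1, 'an': 2, 'na': 2}
```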

Normalize N-gram Function

Create a function named normalize_n_gram that takes in an n-gram count dictionary (as returned by the create_n_gram function) and returns a new dictionary with the same keys but with relative frequencies as values. The relative frequency of an n-gram is calculated by dividing its count by the total number of n-grams.
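The division described above might look like the sketch below. Returning an empty dictionary for an empty input is an assumption made here to avoid dividing by zero; the assignment does not specify that case.

```python
def normalize_n_gram(counts):
    """Convert n-gram counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: an empty model stays empty rather than raising an error.
        return {}
    return {gram: count / total for gram, count in counts.items()}


print(normalize_n_gram({"ba": 1, "an": 2, "na": 2}))
# {'ba': 0.2, 'an': 0.4, 'na': 0.4}
```

Because every count is divided by the same total, the ranking of n-grams is unchanged; only the scale differs, which makes texts of different lengths comparable.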

Plotting Function

Fill in the plot_top_k function in the provided template. This function takes three arguments: a dictionary that maps each language name to its relative frequency model for the user-specified $n$ only, the n-gram size $n$, and the number $k$ of top n-grams to plot. The function should generate a bar plot for each language showing the top $k$ n-grams and their relative frequencies.

import matplotlib.pyplot as plt

def plot_top_k(models, n, k=10):
    fig, axs = plt.subplots(2, 3, figsize=(15, 10))

    row = 0
    col = 0

    for language in models:
        ax = axs[row][col]

        n_gram = models[language]

        # TODO: set top_ngrams to the list of (n-gram, frequency) tuples
        # from n_gram, sorted by frequency in descending order
        top_ngrams = []  # placeholder so the template runs before it is filled in
        top_ngrams = top_ngrams[:k]  # keep only the top k n-grams

        grams = []
        freqs = []
        for gram, freq in top_ngrams:
            # TODO: append each n-gram to grams and each frequency to freqs
            pass

        ax.bar(grams, freqs)
        # TODO :
        # set title and axis labels
        ax.tick_params(axis="x", rotation=45)

        col += 1
        # move to next row after 3 columns
        if col == 3:
            col = 0
            row += 1

    plt.tight_layout()
    plt.show()

Note

To sort a dictionary by its values, you can use the sorted() function along with a lambda function as the key. Here is an example of how to sort a dictionary in descending order based on its values:

sample_dict = {'a': 3, 'b': 1, 'c': 2}
sorted_list = sorted(sample_dict.items(), key=lambda item: item[1], reverse=True)
print(sorted_list)  # Output: [('a', 3), ('c', 2), ('b', 1)]

# If you want to convert it back to a dictionary
sorted_dict = dict(sorted_list)
print(sorted_dict)  # Output: {'a': 3, 'c': 2, 'b': 1}

Main Function

In your main function, you will need to do the following:

Load the sample texts and clean them.
For each language, create relative frequency n-grams for $n$ from $1$ to $5$.
For each language, save all n-gram models to a single CSV file named ‘py4_ind_1_lang.csv’, where lang is the name of the language (e.g., ‘py4_ind_1_english.csv’). The CSV file can be structured in any way you find appropriate (remember that you will be reading from this file in the next assignment). You can use the csv module from the Python standard library to help with this task; refer to the official documentation for more information.
Prompt the user for an n-gram size.
Plot the top $10$ n-grams for the user-specified n-gram size.
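Since the CSV layout is left open, the sketch below shows just one possible structure: one row per n-gram with the columns n, ngram, and frequency. The function name save_models, its folder parameter, and the column names are all assumptions made for illustration. The models argument is assumed to be a dictionary that maps each n to that language's relative frequency dictionary.

```python
import csv
from pathlib import Path


def save_models(language, models, folder="."):
    """Write all n-gram models for one language to a single CSV file."""
    out_path = Path(folder) / f"py4_ind_1_{language}.csv"
    # newline="" is recommended by the csv module to avoid blank rows on Windows.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["n", "ngram", "frequency"])  # hypothetical header
        for n, model in models.items():
            for gram, freq in model.items():
                writer.writerow([n, gram, freq])
    return out_path
```

A long format like this keeps models of every size in one file and is straightforward to read back with csv.reader, at the cost of repeating the n column on every row.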

Sample Output

Use the values in Table 14.8 below to test your program's n-gram analysis and visualization.

Table 14.8 Test Cases
Case	n-gram size
1	1
2	3

Ensure your program’s output matches the provided samples exactly. This includes all characters, white space, and punctuation. In the samples, user input is highlighted like this for clarity, but your program should not highlight user input in this way.

Case 1 Sample Output

$ python3 py4_ind_1_username.py
Enter the n-gram size to plot (1-5): 1

Case 2 Sample Output

$ python3 py4_ind_1_username.py
Enter the n-gram size to plot (1-5): 3

Table 14.9 Deliverables
Deliverables	Description
py4_ind_1_username.pdf	Flowchart(s) for this task.
py4_ind_1_username.py	Your completed Python code.
py4_ind_1_dutch.csv, py4_ind_1_english.csv, py4_ind_1_french.csv, py4_ind_1_german.csv, py4_ind_1_italian.csv, py4_ind_1_spanish.csv	The generated n-gram CSV files for each language.
