Working with Text Files (pt. 2)#

Goals of this lecture#

In the previous lecture, we discussed the basics of opening a .txt file, as well as reading and writing to that file.

In this lecture, we’ll talk about what you can do with a text file once you’ve already opened it. Because text files are read in as strings, you’ll see many echoes of our previous lecture on working with strings.

  • Finding a target str in a file.

  • Counting the number of words in a file.

    • Counting how many times each word occurs.

    • Finding the most frequent word in a text.

Finding a target str#

One common use case is searching a large volume of text for a particular substring.

  • Where in the text does this sub-string occur?

  • What is the text surrounding one of its occurrences?

Note that this is not too far afield from a search engine like Google!

Our sample text#

To start, we’ll use a .txt file of Hamlet, by William Shakespeare. The .txt file was retrieved from the Project Gutenberg Corpus online, and should be credited as such.

The file is included in the lectures GitHub repository under the data directory.

First, let’s use readlines() to extract each line of the play as a separate item in a list.

with open("data/hamlet.txt") as f:
    book = f.readlines()

Inspecting the text#

## This is just the title
book[0]
'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK\n'
## Partial list of characters in play
for line in book[5:12]:
    l = line.replace("\n", "")
    print(l)
Dramatis Personae

  Claudius, King of Denmark.
  Marcellus, Officer.
  Hamlet, son to the former, and nephew to the present king.
  Polonius, Lord Chamberlain.

Check-in#

How could we check how many lines are in the .txt file?

### Your code here

Solution#

### Number of entries in list
len(book)
4934

Finding a sample str#

One of the most famous lines in Hamlet reads:

To be, or not to be- that is the question…

Suppose we wanted to find the str "that is the question" in the book, and return the line number (at least in this .txt file).

How could we go about that?

Solution: enumerate#

  • Use enumerate to iterate through each line of the play.

  • For each line, check if some target_str occurs in that line.

  • If it does, use break to stop iterating, and record which line it is.

target_str = "that is the question"
for index, line in enumerate(book):
    if target_str in line:
        break
print("Line: {x}".format(x = line.replace("\n", "")))
print("Line number: {x}".format(x = index))
Line:   Ham. To be, or not to be- that is the question:
Line number: 2048
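
One caveat with the loop above: if target_str never occurs, the loop finishes without reaching break, and index and line simply hold the last line of the file. Below is a minimal sketch of one way to guard against that, using Python's for ... else clause. The function name find_first and the short sample_lines list are made up for illustration; they stand in for the real book list.

```python
# `sample_lines` is a made-up stand-in for `book` (an assumption for illustration).
sample_lines = [
    "To be, or not to be- that is the question:\n",
    "Whether 'tis nobler in the mind to suffer\n",
]

def find_first(lines, target_str):
    """Return the index of the first line containing target_str, or None."""
    for index, line in enumerate(lines):
        if target_str in line:
            break
    else:
        # The else clause of a for loop runs only if the loop never hit `break`.
        return None
    return index

print(find_first(sample_lines, "nobler"))  # 1
print(find_first(sample_lines, "Yorick"))  # None
```

Returning None makes the "not found" case explicit, so later code can check for it instead of silently printing the wrong line.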

Check-in: Finding the next \(N\) lines#

What if we wanted to return the next \(N\) (e.g., 5) lines after this target string?

  • To do this, we just need to add another variable: keep_lines, which tells us how many lines to return, counting the line that contains the target.

  • Then, once we’ve retrieved the index of our target_str, we can slice between that index and index + keep_lines.

Try implementing this algorithm yourself first.

Hint: The code can be mostly the same as before (i.e., use enumerate, etc.).

### Your code here

Solution#

target_str = "that is the question"
keep_lines = 5 ### New variable to track
for index, line in enumerate(book):
    if target_str in line:
        break
# Retrieve keep_lines lines, starting at the target line
targets = book[index: index+ keep_lines]
## Now, print out each of those lines
for i in targets:
    print(i.replace("\n",""))
  Ham. To be, or not to be- that is the question:
    Whether 'tis nobler in the mind to suffer
    The slings and arrows of outrageous fortune
    Or to take arms against a sea of troubles,
    And by opposing end them. To die- to sleep-
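
The same slicing idea can also return text on both sides of a match, which addresses the earlier question about the text surrounding an occurrence. This is only a sketch: context_window and its before/after parameters are hypothetical names, and sample_lines is a made-up stand-in for book.

```python
def context_window(lines, index, before=0, after=0):
    """Return the line at `index` plus `before` lines ahead of it and `after` lines past it."""
    start = max(0, index - before)  # clamp so a negative start doesn't wrap around
    end = index + after + 1         # slicing past the end of a list is safe
    return [line.rstrip("\n") for line in lines[start:end]]

# Made-up stand-in for `book`.
sample_lines = ["one\n", "two\n", "three\n", "four\n", "five\n"]

print(context_window(sample_lines, 2, before=1, after=1))  # ['two', 'three', 'four']
print(context_window(sample_lines, 0, before=3, after=1))  # ['one', 'two']
```

The max(0, ...) clamp matters because a negative slice start would count from the end of the list and return the wrong lines.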

Check-in: What if target_str occurs multiple times?#

What if we were looking for a more common target_str, e.g., one that occurred multiple times?

  1. What problems do you see with our previous approach (e.g., using break once we find target_str)?

  2. How might you solve this problem?

### Your answer here

Solution#

Problem: If we break as soon as we find the target_str, we’ll only ever find the first instance.

Solution: Instead, we can track the indices of each occurrence of target_str. Then, we return a list of these indices when we finish checking the entire book.

target_str = "the question"
## Track a list of indices
line_indices = []
for index, line in enumerate(book):
    if target_str in line:
        ## Rather than breaking, we 
        ## add index to line_indices
        line_indices.append(index)

Checking line_indices#

print(line_indices)
[192, 1658, 2048, 3354]
for l in line_indices:
    line = book[l].replace("\n", "")
    print("Line {x}: {y}".format(x = l, y = line))
Line 192:     That was and is the question of these wars.
Line 1658:     went to cuffs in the question.
Line 2048:   Ham. To be, or not to be- that is the question:
Line 3354:     Will not debate the question of this straw.

Check-in: Other considerations#

These exercises really only scratch the surface of searching a file. Here are some other issues for consideration and discussion.

How might you address:

  1. Issues of case: e.g., what if question is spelled "Question", not "question"?

  2. Situations where a target_str spans multiple lines?

  3. Mismatch in punctuation, e.g., a misplaced ,?

  4. A partial match, e.g., if \(90\%\) of the characters match?

Note: These are challenging issues! And each of them likely has multiple solutions.

### Discussion
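
As a starting point for that discussion, here is a rough sketch of one possible approach to each issue. None of these is the only answer. The two-line sample_lines list, the normalize helper, and the misspelled query string are all made up for illustration, and the 0.9 threshold simply mirrors the 90% figure above.

```python
import string
from difflib import SequenceMatcher

# Made-up stand-in for `book`: the famous line split across two lines,
# with "Question" capitalized.
sample_lines = [
    "To be, or not to be- that is\n",
    "the Question: whether 'tis nobler\n",
]

def normalize(text):
    """Lowercase the text and collapse any run of whitespace to one space."""
    return " ".join(text.lower().split())

# 1. Case: lowercase both the text and the target before comparing.
print("the question" in sample_lines[1].lower())  # True

# 2. Multiple lines: join the lines into one string, then search that.
joined = normalize(" ".join(sample_lines))
print("that is the question" in joined)  # True

# 3. Punctuation: delete punctuation characters from the text before searching.
no_punct = joined.translate(str.maketrans("", "", string.punctuation))
print("to be or not to be that is" in no_punct)  # True

# 4. Partial match: SequenceMatcher.ratio() returns a similarity between 0 and 1.
ratio = SequenceMatcher(None, "that is the question", "that is teh question").ratio()
print(ratio >= 0.9)  # True
```

Each step trades precision for recall: the more we normalize, the more matches we find, including some we may not have wanted.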

Counting Words#

Another very common use case is simply counting words.

  • How many words are there overall?

  • How many unique words are used?

  • How many times does each word occur?

  • What is the most frequent word?

Caveat: what is a word?#

The question of what defines a word is surprisingly complex.

  • First, languages have very different morphological systems. So even conceptually, it’s not always clear what makes a word “a word” in a given language.

  • Second, languages have very different writing systems.

    • Some languages (like English, Spanish, etc.) have spaces between words in their written form.

    • Other languages (like Classical Latin, Chinese, etc.) do not typically use spaces between words in their written form.

Many conceptual definitions and tools for identifying words are rooted in English specifically, but those definitions and tools don’t always generalize––languages can be very different.
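
A small illustration of why whitespace-based tools don't generalize: split works as a rough word separator for English, but returns a single item for a sentence written without spaces. The example sentences are my own, not from the lecture data.

```python
english = "I like cats"
chinese = "我喜欢猫"  # roughly "I like cats", written without spaces

print(english.split())  # ['I', 'like', 'cats']
print(chinese.split())  # ['我喜欢猫']
```

Segmenting text like the second example into words requires a dedicated tool rather than a simple split.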

NLP is (historically) English-centric#

Historically, work in Natural Language Processing has been quite English-centric.

  • Often, English was seen as the “default language”––to such a degree that researchers didn’t always mention that they were working on English specifically!

  • However, this is starting to change.

This will be important to keep in mind as we discuss identifying and counting words in written English text.

How many words?#

The first question that might occur to us is how many words are in a book.

To do this, we could:

  • read the book in as one long str.

  • Use the split function to separate this long str by spaces, into a list of words.

  • Count the number of items in this list.

Using split: a review#

sentence = "To be or not to be, that is the question"
sentence.split(" ")
['To', 'be', 'or', 'not', 'to', 'be,', 'that', 'is', 'the', 'question']
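
One detail worth knowing: split(" ") and split() (with no argument) behave differently. The sketch below uses a made-up string with a double space and a newline to show the difference.

```python
messy = "To be,  or not\nto be"  # two spaces after the comma, plus a newline

# split(" ") cuts at every single space, so the double space yields an
# empty string, and the newline is left inside a "word".
print(messy.split(" "))  # ['To', 'be,', '', 'or', 'not\nto', 'be']

# split() with no argument cuts at any run of whitespace (spaces, tabs, newlines).
print(messy.split())     # ['To', 'be,', 'or', 'not', 'to', 'be']
```

For text read from a file, split() with no argument is usually the safer choice.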

Using split for Hamlet#

# First, read in as string
with open("data/hamlet.txt", "r") as f:
    book_str = f.read()
# We should also clean up all those *newline* characters.
book_str = book_str.replace("\n", " ")
# To make it easier for later, we can also turn it into lowercase
book_str = book_str.lower()
# Now, use split to separate into words
book_words = book_str.split()
book_words[0:5]
['the', 'tragedy', 'of', 'hamlet,', 'prince']
# How many items in list?
len(book_words)
32724

How many unique words?#

Above, we calculated how many word tokens were in the book.

  • This means that the word “the” will be counted every time it occurs.

  • Instead, let’s calculate the number of unique word types.

Using set#

The set function will turn a list into a set object, which contains only the unique elements in that list.

my_list = ["the", "dog", "is", "the", "best"]
set(my_list)
{'best', 'dog', 'is', 'the'}

Check-in#

Use the set function to calculate how many unique words are in this book.

### Your code here

Solution#

We already have the book_words list, so we can just convert this to a set.

words_set = set(book_words)
len(words_set)
7251

How many times does each word occur?#

We might also want to know how many times each word occurs.

  • For example, perhaps “the” occurs \(>1000\) times, whereas “question” occurs only ~\(10\) times.

  • Ideally, we would store this in a dict:

    • Each key represents a word.

    • Each value represents how many times that word occurred in Hamlet.

How might we go about this?

First pass: counting each word#

As a first pass, let’s use the following approach:

  • First, create a dict to store our words.

  • Then, iterate through our list of words.

  • if a given word is not in our dict, add an entry for it (and set the value to 1).

  • if a given word is in our dict, increase its value by 1.

word_counts = {}
for w in book_words:
    if w not in word_counts:
        word_counts[w] = 1
    else:
        word_counts[w] += 1
# How many times does "the" occur?
word_counts['the']
1095
# How many times does "king" occur?
word_counts['king']
43

Check-in#

Any issues with this first pass approach?

Hint: One issue could have to do with punctuation…

### Your code here

Solution#

One problem that you might’ve noticed is that a word at the end of a sentence has no space between it and the period (e.g., question.), so it gets stored as a separate key from the same word without a period.

  • This will under-count certain words.

To resolve this, we can replace all periods with an empty string before adding a word to our dict.

word_counts = {}
for w in book_words:
    w_no_period = w.replace(".", "")
    if w_no_period not in word_counts:
        word_counts[w_no_period] = 1
    else:
        word_counts[w_no_period] += 1
# How many times does "king" occur?
word_counts['king']
162
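
Periods are not the only punctuation that sticks to words: commas, exclamation marks, and parentheses cause the same problem. Here is a sketch of one more general approach, using string.punctuation from the standard library together with collections.Counter. The sample_words list is a made-up stand-in for book_words, so the counts below are not the Hamlet counts.

```python
from collections import Counter
import string

# Made-up stand-in for `book_words`.
sample_words = ["the", "king,", "king.", "king", "(the", "king!"]

# strip() removes the listed characters from both ends of each word only,
# so an internal apostrophe (as in "o'er") is kept.
cleaned = [w.strip(string.punctuation) for w in sample_words]
cleaned = [w for w in cleaned if w]  # drop tokens that were only punctuation

# Counter builds the word -> count mapping in one step.
word_counts = Counter(cleaned)
print(word_counts["king"])  # 4
print(word_counts["the"])   # 2
```

Note that stripping from the ends also removes a leading apostrophe, so a word like 'tis would be counted as tis; whether that is acceptable depends on the question being asked.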

Which word is most common?#

Now that we have a dict representing how many times each word occurs, we can calculate which word is most common.

Check-in: Which word do you think is most frequent in Hamlet?

Finding the most frequent word#

As always, there are multiple ways to do this.

But one simple approach is to:

  • Use a for loop to iterate through all items() in the dict.

  • Track the key_with_highest_value we’ve seen so far.

  • Once the for loop is done, inspect key_with_highest_value.

key_with_highest_value = None
max_count = 0
for word, count in word_counts.items():
    # If this word frequency > max_count
    if count > max_count:
        # Set new "highest word" to this word
        key_with_highest_value = word
        max_count = count
## Now, inspect which word was most frequent
key_with_highest_value
'the'

Other approaches#

There are many different approaches you could take to solving this problem. Some are more generalizable (but also more complicated) than what I’ve shown here.

  • You can sort the dictionary by value (see the lecture on advanced dictionary operations).

  • You could use the max function with dict.get as your key parameter (see below).

# Another approach
max(word_counts, key = word_counts.get)
'the'
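
If we want the top few words rather than only the single most frequent one, sorting by value works. The sketch below uses a small made-up dictionary rather than the real Hamlet counts, and shows both a plain sorted call and Counter.most_common from the standard library.

```python
from collections import Counter

# Made-up counts for illustration (not the real Hamlet counts).
word_counts = {"the": 5, "king": 3, "question": 1, "and": 4}

# Sort (word, count) pairs by count, largest first, and keep the first two.
top_two = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)[:2]
print(top_two)  # [('the', 5), ('and', 4)]

# Counter offers the same thing as a built-in method.
print(Counter(word_counts).most_common(2))  # [('the', 5), ('and', 4)]
```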

Conclusion#

This concludes our brief introduction to working with text files.

  • The material in this lecture is a precursor to basic Natural Language Processing techniques.

  • The material in the previous lecture covers the basics of interacting with files (opening a file, using read and write, etc.).