Working with Text Files (pt. 2)#
Goals of this lecture#
In the previous lecture, we discussed the basics of opening a .txt file, as well as reading and writing to that file.
In this lecture, we’ll talk about what you can do with a text file once you’ve already opened it. Because text files are read in as strings, you’ll see many echoes of our previous lecture on working with strings.
Finding a target str in a file.
Counting the number of words in a file.
Counting how many times each word occurs.
Finding the most frequent word in a text.
Finding a target str#
One common use case is searching a large volume of text to return a particular sub-string.
Where in the text does this sub-string occur?
What is the text surrounding one of its occurrences?
Note that this is not too far afield from a search engine like Google!
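As a minimal sketch of those two questions (where does the sub-string occur, and what surrounds it?), the example below uses the built-in str.find on a single string. The names text, target, and window are illustrative assumptions, and the short string stands in for a much larger document.

```python
# Hypothetical sample text standing in for a much larger document.
text = "To be, or not to be- that is the question: Whether 'tis nobler in the mind to suffer"
target = "that is the question"

# str.find returns the index of the first match, or -1 if there is none.
position = text.find(target)

if position != -1:
    # Keep up to `window` characters on either side of the match.
    window = 10
    start = max(0, position - window)
    end = position + len(target) + window
    context = text[start:end]
    print("Found at character:", position)
    print("Context:", context)
else:
    print("Target not found")
```

The rest of this section works line by line instead, which makes it easier to report a line number rather than a character offset.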
Our sample text#
To start, we’ll use a .txt file of Hamlet, by William Shakespeare. The .txt file was retrieved from the Project Gutenberg Corpus online, and should be credited as such.
The file is included in the lectures GitHub repository under the data directory.
First, let’s use readlines() to extract each line of the play as a separate item in a list.
with open("data/hamlet.txt") as f:
    book = f.readlines()
Inspecting the text#
## This is just the title
book[0]
'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK\n'
## Partial list of characters in play
for line in book[5:12]:
    l = line.replace("\n", "")
    print(l)
Dramatis Personae
Claudius, King of Denmark.
Marcellus, Officer.
Hamlet, son to the former, and nephew to the present king.
Polonius, Lord Chamberlain.
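Each item returned by readlines() keeps its trailing newline, and some lines in a file are blank. The sketch below shows one possible way to tidy such a list with rstrip and a filter. The sample_lines list is a made-up stand-in for book, so that the example runs without the data file.

```python
# Hypothetical stand-in for a few items of `book` (note the blank line).
sample_lines = [
    "Dramatis Personae\n",
    "\n",
    "  Claudius, King of Denmark.\n",
]

# rstrip("\n") removes only the trailing newline and keeps leading spaces.
stripped = [line.rstrip("\n") for line in sample_lines]

# Drop lines that are empty or contain only whitespace.
non_blank = [line for line in stripped if line.strip()]

for line in non_blank:
    print(line)
```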
Check-in#
How could we check how many lines are in the .txt file?
### Your code here
Solution#
### Number of entries in list
len(book)
4934
Finding a sample str#
One of the most famous lines in Hamlet reads:
To be, or not to be- that is the question…
Suppose we wanted to find the str "that is the question" in the book, and return the line number (at least in this .txt file).
How could we go about that?
Solution: enumerate#
Use enumerate to iterate through each line of the play.
For each line, check if some target_str occurs in that line.
If it does, use break to stop iterating, and record which line it is.
target_str = "that is the question"
for index, line in enumerate(book):
    if target_str in line:
        break
print("Line: {x}".format(x = line.replace("\n", "")))
print("Line number: {x}".format(x = index))
Line: Ham. To be, or not to be- that is the question:
Line number: 2048
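One thing to watch with the loop above: if target_str never occurs, the loop simply runs to the end, and index and line are left pointing at the last line of the file. A possible way to make the "not found" case explicit is to wrap the search in a function. The name find_first_line and the sample_lines list are assumptions for illustration, not part of the lecture code.

```python
def find_first_line(lines, target_str):
    """Return the index of the first line containing target_str, or None."""
    for index, line in enumerate(lines):
        if target_str in line:
            # Returning here plays the same role as `break`.
            return index
    # We only get here if no line contained the target.
    return None


# Hypothetical stand-in for `book`.
sample_lines = [
    "Enter Hamlet.\n",
    "Ham. To be, or not to be- that is the question:\n",
    "Whether 'tis nobler in the mind to suffer\n",
]

print(find_first_line(sample_lines, "that is the question"))
print(find_first_line(sample_lines, "no such phrase"))
```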
Check-in: Finding the next \(N\) lines#
What if we wanted to return the next \(N\) (e.g., 5) lines, starting from this target string?
To do this, we just need to add another variable:
keep_lines, which tells us how many lines we want to return (counting the line that contains the target).
Then, once we’ve retrieved the index of our target_str, we can slice between that index and index + keep_lines.
Try implementing this algorithm yourself first.
Hint: The code can be mostly the same as before (i.e., use enumerate, etc.).
### Your code here
Solution#
target_str = "that is the question"
keep_lines = 5 ### New variable to track
for index, line in enumerate(book):
    if target_str in line:
        break
# Retrieve keep_lines lines, starting at the target line
targets = book[index: index + keep_lines]
## Now, print out each of those lines
for i in targets:
    print(i.replace("\n",""))
Ham. To be, or not to be- that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles,
And by opposing end them. To die- to sleep-
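The same slicing idea can also return lines before the target, which is one way to answer "what is the text surrounding an occurrence?". The sketch below assumes a hypothetical helper named context_window. The one subtlety is that a negative slice start would count from the end of the list, so the start is clamped at 0.

```python
def context_window(lines, index, before=1, after=2):
    """Return lines from `before` lines above `index` to `after` lines below it."""
    # Clamp the start at 0: a negative start would count from the end of the list.
    start = max(0, index - before)
    # Slicing past the end of a list is safe; Python stops at the last item.
    return lines[start: index + after + 1]


# Hypothetical stand-in for `book`.
sample_lines = [
    "Ham. To be, or not to be- that is the question:\n",
    "Whether 'tis nobler in the mind to suffer\n",
    "The slings and arrows of outrageous fortune\n",
]

# Ask for more context than exists on either side of the first line.
window = context_window(sample_lines, 0, before=2, after=1)
for line in window:
    print(line.rstrip("\n"))
```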
Check-in: What if target_str occurs multiple times?#
What if we were looking for a more common target_str, e.g., one that occurred multiple times?
What problems do you see with our previous approach (e.g., using break once we find target_str)?
How might you solve this problem?
### Your answer here
Solution#
Problem: If we break as soon as we find the target_str, we’ll only ever find the first instance.
Solution: Instead, we can track the indices of each occurrence of target_str. Then, we return a list of these indices when we finish checking the entire book.
target_str = "the question"
## Track a list of indices
line_indices = []
for index, line in enumerate(book):
    if target_str in line:
        ## Rather than breaking, we
        ## add index to line_indices
        line_indices.append(index)
Checking line_indices#
print(line_indices)
[192, 1658, 2048, 3354]
for l in line_indices:
    line = book[l].replace("\n", "")
    print("Line {x}: {y}".format(x = l, y = line))
Line 192: That was and is the question of these wars.
Line 1658: went to cuffs in the question.
Line 2048: Ham. To be, or not to be- that is the question:
Line 3354: Will not debate the question of this straw.
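For reference, the "collect every matching index" loop can also be written as a list comprehension. This is only an alternative spelling of the same logic; sample_lines below is a hypothetical stand-in for book.

```python
# Hypothetical stand-in for `book`.
sample_lines = [
    "That was and is the question of these wars.\n",
    "Something is rotten in the state of Denmark.\n",
    "Will not debate the question of this straw.\n",
]
target_str = "the question"

# Keep the index of every line that contains the target.
line_indices = [
    index for index, line in enumerate(sample_lines) if target_str in line
]
print(line_indices)
```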
Check-in: Other considerations#
These exercises really only scratch the surface of searching a file. Here are some other issues for consideration and discussion.
How might you address:
Issues of case: e.g., what if question is spelled "Question", not "question"?
Situations where a target_str spans multiple lines?
Mismatch in punctuation, e.g., a misplaced ,?
A partial match, e.g., if \(90\%\) of the characters match?
Note: These are challenging issues! And each of them likely has multiple solutions.
### Discussion
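To seed the discussion, here is one possible sketch touching each of the four issues, using only the standard library. It is not the only (or necessarily the best) answer. The helper names normalize and is_close_match are assumptions, and the \(90\%\) threshold is applied to difflib's similarity ratio, which is related to, but not the same as, "percent of characters matching".

```python
import string
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase text and delete ASCII punctuation."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower()


def is_close_match(a, b, threshold=0.9):
    """Return True if difflib's similarity ratio is at least `threshold`."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Hypothetical lines where the target is split across a line break.
lines = ["To be, or not to be- that is\n", "the Question:\n"]

# Case and punctuation: compare normalized versions of both strings.
print(normalize("the Question:") == normalize("the question"))

# Multi-line targets: join the lines into one string before searching.
joined = " ".join(line.strip() for line in lines)
print("that is the question" in normalize(joined))

# Partial matches: tolerate a small typo in the query.
print(is_close_match("that is the question", "that is the questoin"))
```

Joining the lines loses the line numbers, so a fuller solution would need to map character offsets back to lines.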
Counting Words#
Another very common use case is simply counting words.
How many words are there overall?
How many unique words are used?
How many times does each word occur?
What is the most frequent word?
Caveat: what is a word?#
The question of what defines a word is surprisingly complex.
First, languages have very different morphological systems. So even conceptually, it’s not always clear what makes a word “a word” in a given language.
Second, languages have very different writing systems.
Some languages (like English, Spanish, etc.) have spaces between words in their written form.
Other languages (like Classical Latin, Chinese, etc.) do not typically use spaces between words in their written form.
Many conceptual definitions and tools for identifying words are rooted in English specifically, but those definitions and tools don’t always generalize, because languages can be very different.
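A small illustration of why whitespace-based tools don't generalize: splitting on spaces works for the English sentence below, but returns the Chinese sentence as a single item. The two example sentences are my own (both mean roughly "I like cats") and are not from the lecture.

```python
english = "I like cats"
chinese = "我喜欢猫"  # roughly "I like cats", written without spaces

# Whitespace splitting finds three items in the English sentence...
print(english.split())

# ...but treats the whole Chinese sentence as one item.
print(chinese.split())
```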
NLP is (historically) English-centric#
Historically, work in Natural Language Processing has been quite English-centric.
Often, English was seen as the “default language”, to such a degree that researchers didn’t always mention that they were working on English specifically!
However, this is starting to change.
Researchers like Emily Bender have pushed scholars to name the language they’re working on.
Word segmentation is increasingly recognized as an important problem.
This will be important to keep in mind as we discuss identifying and counting words in written English text.
How many words?#
The first question that might occur to us is how many words are in a book.
To do this, we could:
read the book in as one long str.
Use the split function to separate this long str by spaces, into a list of words.
Count the number of items in this list.
Using split: a review#
sentence = "To be or not to be, that is the question"
sentence.split(" ")
['To', 'be', 'or', 'not', 'to', 'be,', 'that', 'is', 'the', 'question']
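One detail worth knowing before applying this to a whole book: split(" ") and split() behave differently. With an explicit " ", every single space is a separator, so repeated spaces produce empty strings and newlines are left inside words. With no argument, any run of whitespace (spaces, tabs, newlines) counts as one separator. A small sketch, using a made-up string:

```python
# Made-up string with a double space and a newline.
messy = "To be  or not\nto be"

# Explicit separator: empty strings appear, and "\n" stays inside a word.
by_space = messy.split(" ")
print(by_space)

# No argument: splits on any run of whitespace.
by_whitespace = messy.split()
print(by_whitespace)
```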
Using split for Hamlet#
# First, read in as string
with open("data/hamlet.txt", "r") as f:
    book_str = f.read()
# We should also clean up all those *newline* characters.
book_str = book_str.replace("\n", " ")
# To make it easier for later, we can also turn it into lowercase
book_str = book_str.lower()
# Now, use split to separate into words
book_words = book_str.split()
book_words[0:5]
['the', 'tragedy', 'of', 'hamlet,', 'prince']
# How many items in list?
len(book_words)
32724
How many unique words?#
Above, we calculated how many word tokens were in the book.
This means that the word “the” will be counted every time it occurs.
Instead, let’s calculate the number of unique word types.
Using set#
The set function will turn a list into a set object, which contains only the unique elements in that list.
my_list = ["the", "dog", "is", "the", "best"]
set(my_list)
{'best', 'dog', 'is', 'the'}
Check-in#
Use the set function to calculate how many unique words are in this book.
### Your code here
Solution#
We already have the book_words list, so we can just convert this to a set.
words_set = set(book_words)
len(words_set)
7251
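To make the token/type distinction concrete, the sketch below counts both for a short made-up sentence and computes their ratio. The variable names are illustrative. Since a set has no guaranteed order, sorted is used to view its contents in a stable order.

```python
# Made-up sentence, already lowercased.
tokens = "the dog is the best dog".split()

# Types are the unique tokens.
types = set(tokens)

print("Tokens:", len(tokens))
print("Types:", len(types))

# Share of tokens that are distinct types.
type_token_ratio = len(types) / len(tokens)
print("Ratio:", type_token_ratio)

# Sets are unordered, so sort before displaying.
print(sorted(types))
```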
How many times does each word occur?#
We might also want to know how many times each word occurs.
For example, perhaps “the” occurs \(>1000\) times, whereas “question” occurs only ~\(10\) times.
Ideally, we would store this in a dict:
Each key represents a word.
Each value represents how many times that word occurred in Hamlet.
How might we go about this?
First pass: counting each word#
As a first pass, let’s use the following approach:
First, create a dict to store our words.
Then, iterate through our list of words.
If a given word is not in our dict, add an entry for it (and set the value to 1).
If a given word is in our dict, increase its value by 1.
word_counts = {}
for w in book_words:
    if w not in word_counts:
        word_counts[w] = 1
    else:
        word_counts[w] += 1
# How many times does "the" occur?
word_counts['the']
1095
# How many times does "king" occur?
word_counts['king']
43
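The if/else counting pattern above is common enough that Python offers shorter spellings. The sketch below shows two alternatives on a made-up word list: dict.get with a default of 0, and collections.Counter from the standard library. Both should give the same counts as the explicit loop.

```python
from collections import Counter

# Made-up word list standing in for `book_words`.
sample_words = "to be or not to be".split()

# Option 1: dict.get returns 0 when the word hasn't been seen yet.
counts_get = {}
for w in sample_words:
    counts_get[w] = counts_get.get(w, 0) + 1

# Option 2: Counter does the counting in one call.
counts_counter = Counter(sample_words)

print(counts_get)
print(counts_counter["to"])
```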
Check-in#
Any issues with this first pass approach?
Hint: One issue could have to do with punctuation…
### Your code here
Solution#
One problem that you might’ve noticed is that words occurring at the end of a sentence don’t have a space between the word and a period (e.g., question.).
This will under-count certain words.
To resolve this, we can remove all periods (by replacing them with an empty string) before adding a word to our dict.
word_counts = {}
for w in book_words:
    w_no_period = w.replace(".", "")
    if w_no_period not in word_counts:
        word_counts[w_no_period] = 1
    else:
        word_counts[w_no_period] += 1
# How many times does "king" occur?
word_counts['king']
162
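Periods are not the only punctuation attached to words: commas, colons, and hyphens cause the same under-counting (recall 'hamlet,' in the first few items of book_words). One possible extension, sketched below on a made-up list, strips any leading or trailing ASCII punctuation using string.punctuation. Note that this also strips the apostrophe from a word like 'tis, which may or may not be what you want.

```python
import string

# Made-up tokens of the kind produced by splitting on whitespace.
sample_words = ["hamlet,", "king.", "question:", "'tis", "be-", "king", "--"]

# strip() removes the listed characters from both ends only,
# so punctuation inside a word is left alone.
cleaned = [w.strip(string.punctuation) for w in sample_words]

# A token made only of punctuation becomes "", so drop empty strings.
cleaned = [w for w in cleaned if w]

print(cleaned)
```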
Which word is most common?#
Now that we have a dict representing how many times each word occurs, we can calculate which word is most common.
Check-in: Which word do you think is most frequent in Hamlet?
Finding the most frequent word#
As always, there are multiple ways to do this.
But one simple approach is to:
Use a for loop to iterate through all items() in the dict.
Track the key_with_highest_value we’ve seen so far.
Once the for loop is done, inspect key_with_highest_value.
key_with_highest_value = None
max_count = 0
for word, count in word_counts.items():
    # If this word frequency > max_count
    if count > max_count:
        # Set new "highest word" to this word
        key_with_highest_value = word
        max_count = count
## Now, inspect which word was most frequent
key_with_highest_value
'the'
Other approaches#
There are many different approaches you could take to solving this problem. Some are more generalizable (but also more complicated) than what I’ve shown here.
You can sort the dictionary by value (see the lecture on advanced dictionary operations).
You could use the max function with dict.get as your key parameter (see below).
# Another approach
max(word_counts, key = word_counts.get)
'the'
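If you want the top few words rather than just the single most frequent one, two standard-library options are sketched below: sorted with a key function, and Counter.most_common. The sample_counts dict is a made-up stand-in for word_counts.

```python
from collections import Counter

# Made-up counts standing in for `word_counts`.
sample_counts = {"the": 5, "king": 3, "question": 1, "be": 3}

# Option 1: sort (word, count) pairs by count, largest first.
by_count = sorted(sample_counts.items(), key=lambda item: item[1], reverse=True)
print(by_count[:2])

# Option 2: Counter.most_common(n) returns the n most frequent pairs.
top_two = Counter(sample_counts).most_common(2)
print(top_two)
```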
Conclusion#
This concludes our brief introduction to working with text files.
The material in this lecture is a precursor to basic Natural Language Processing techniques.
The material in the previous lecture covers the basics of interacting with files (opening a file, using read and write, etc.).