Programming Assignment 5 & 6

Spellcheck

Due:
Part I: 11:59pm, Wednesday, April 8
Part II: 11:59pm, Wednesday, April 15

Overview

In a two-part assignment, we will design and implement a fully functional spell-checking program. The first part of the assignment will be designing a class to help manage a collection of all "official" English words (we will provide a data file). The second part will be to design a program that leads a user through a dialogue for spell-checking a document of his or her choice. The software will highlight apparently misspelled words, give the user a range of replacement options, and then write the corrected version of the document back out to a file.

Key Techniques

File input and output.
Classes

Collaboration Policy

For this assignment, you are allowed to work with one other student if you wish (in fact, we suggest that you do so). If any student wishes to have a partner but has not been able to locate one, please let the instructor know so that we can match up partners.

Please make sure you adhere to the policies on academic integrity in this regard.

Outward Behavior

By the time all the pieces are in place, the completed program should work as follows. It begins by prompting the user for the name of a file which provides the complete list of words in the language, a file which is the document to be spell-checked, and a filename to which the corrected document should be written.

At this point, the program should begin spell-checking the document line-by-line, word-by-word. Furthermore, for each word that is not in the language, the program should alert the user and ask how to deal with the word. The user should be given the option to ignore the warning, to enter a replacement, or to be able to select a replacement from one or more suggested options (more about this later).

After the entire session, the corrected text should be saved to disk under the output filename that the user provided at the start.

Example

To give a more concrete view of the goal, we provide the following sample of a potential spell-checking session. Admittedly, this example is a very polished one which demonstrates lots of intricate features of our working software. This format can serve as a goal, though you do not have to match it precisely. However, you will have to implement the general functionality shown here.

For this example we begin with an original document, demo.txt, with the following content:

This is a tesk of the "best" spell-checking prograg
in missouri --- if not in the entire USA.
(Howevr, it is only a small tess.)

Still, wild success wil require a hugh, all-out effort.
How'd it go?

A run of our spell-checking program then looks like:

Enter the name of the language file: English.txt
Enter the name of the document to spellcheck: demo.txt
Enter the name for saving the corrected version: demoNew.txt

The word: tesk on line 1 is not in the language.

This is a tesk of the "best" spell-checking prograg
          ====

a) Accept
r) Replace
1) teschermacherite
2) teskere

Option: r
Enter your replacement: test

The word: spell-checking on line 1 is not in the language.

This is a test of the "best" spell-checking prograg
                             ==============

a) Accept
r) Replace
1) spell-caught
2) spell-free

Option: a

The word: prograg on line 1 is not in the language.

This is a test of the "best" spell-checking prograg
                                            =======

a) Accept
r) Replace
1) prograde
2) program

Option: 2

The word: missouri on line 2 is not in the language.

in missouri --- if not in the entire USA.
   ========

a) Accept
r) Replace
1) missounds
2) Missouri

Option: 2

The word: Howevr on line 3 is not in the language.

(Howevr, it is only a small tess.)
 ======

a) Accept
r) Replace
1) However
2) Howf

Option: 1

The word: tess on line 3 is not in the language.

(However, it is only a small tess.)
                             ====

a) Accept
r) Replace
1) teslas
2) Tess

Option: r
Enter your replacement: test

The word: wil on line 5 is not in the language.

Still, wild success wil require a hugh, all-out effort.
                    ===

a) Accept
r) Replace
1) Wikstroemia
2) Wilberforce

Option: r
Enter your replacement: will

The word: hugh on line 5 is not in the language.

Still, wild success will require a hugh, all-out effort.
                                   ====

a) Accept
r) Replace
1) huggle
2) Hugh

Option: r
Enter your replacement: huge

The word: How'd on line 6 is not in the language.

How'd it go?
=====

a) Accept
r) Replace
1) How
2) How's

Option: a
Done spellchecking. File Saved

Upon completion of the program, the file demoNew.txt should have the following contents.

This is a test of the "best" spell-checking program
in Missouri --- if not in the entire USA.
(However, it is only a small test.)

Still, wild success will require a huge, all-out effort.
How'd it go?

Part I: A Language Helper

The overall goal is a very complex task with many required features. There is an ongoing dialogue with the user as well as many intricate issues involving the management of the terms in the language in comparison to words in the document. This makes it easy to get lost when trying to program it. You will do much better if you organize your efforts into clearly defined subtasks. For this reason, we are requiring that you do the following.

Many of the subtasks are related to comparing words presumably from the user's document to the larger set of words that are considered part of the language. We will be providing a file, English.txt, which represents the "official" set of words which are considered part of the language. This file has one "word" per line and is already alphabetized by standard dictionary order.

However, the language file contains both capitalized and uncapitalized words. A word that is capitalized in the language file is only legitimate if capitalized in the document (e.g., 'Missouri' is okay but 'missouri' is not). A word that is uncapitalized in the language file can be used in the document in either capitalized or uncapitalized form (e.g., 'This' and 'this' are both legitimate although 'this' is the only one literally in the language file).
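
To make the capitalization rule concrete, here is a minimal sketch written as a standalone function. The name is_legitimate and the tiny word set are our own stand-ins for illustration (they are not part of the required interface), and we assume "capitalized" means that only the first letter is uppercase.

```python
def is_legitimate(word, language_words):
    """Return True if word is acceptable under the capitalization rule.

    language_words is assumed to be a set holding the words exactly as
    they are written in the language file.
    """
    if word in language_words:
        return True
    # A capitalized document word may match an uncapitalized entry,
    # so lowercase only the first letter and look again.
    if word[:1].isupper():
        return (word[0].lower() + word[1:]) in language_words
    return False


language_words = {"this", "Missouri"}   # tiny stand-in for English.txt
for candidate in ["this", "This", "Missouri", "missouri", "Missourri"]:
    print(candidate, is_legitimate(candidate, language_words))
```

With this stand-in set, the first three words are reported as legitimate and the last two are not.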

You should encapsulate these issues by developing a separate class, LanguageHelper. This class should minimally support the following three behaviors.

__init__(self, languagefile)
The initialization should take care of reading the raw file of words and entering those words into an internal list that will be used by the other behaviors.
__contains__(self, word)
The parameter is a verbatim word (presumably from a user's document). This method should determine whether or not that word is considered a legitimate part of the language, returning True if the word is contained in the language and False otherwise.

This special method is used by Python to support the in syntax, such as 'Missouri' in language, presuming that language is an instance of our LanguageHelper class.

Note as well that the implementation of this method should rely upon the aforementioned distinction between capitalized and uncapitalized words. So with our given English.txt wordlist, it should be that this, This and Missouri are contained in the language, yet missouri and Missourri are not contained.
getSuggestions(self, word)
Given a word presumably typed by the user, but which was not spelled correctly, this method should return a list of suggested words for replacement. Doing a good job at offering suggestions is actually the toughest part of writing a good spell-checker. For this assignment, we are going to suggest the following simple rule (which admittedly is not usually very helpful). We will leave it as extra credit to design a more intelligent rule.

The language file is written so that words are alphabetized in typical dictionary order. When you are processing a misspelled word, use a loop to determine where the misspelled word would have been placed if it had been a legitimate word. Generally, this lets you determine two real words of the dictionary which bracket the misspelled word. You should offer those two words as suggestions (in the special case where the misspelled word would be at the very beginning or end of the language, you may offer just a single suggestion).

Furthermore, if the misspelled word begins with a capital letter (e.g., Howevr), the reported suggestions should be presented as capitalized, even if the nearest underlying words from the language are uncapitalized. In spirit, if the user typed a capital letter when misspelling a word, they would be likely to want the correctly spelled word to be capitalized as well.
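
The bracketing rule might be sketched as follows. This is only an illustration under stated assumptions: the function name bracketing_suggestions is hypothetical, the words are passed in as an already sorted list rather than stored in the class, and "dictionary order" is approximated by comparing lowercase forms (the real file may order punctuation differently).

```python
def bracketing_suggestions(word, words):
    """Return the one or two entries of words that would surround word.

    words is assumed to be a list already sorted in dictionary order,
    which this sketch approximates by comparing lowercase forms.
    """
    key = word.lower()
    index = 0
    # Walk forward until reaching the first entry that does not sort before the word.
    while index < len(words) and words[index].lower() < key:
        index += 1

    neighbors = []
    if index > 0:
        neighbors.append(words[index - 1])   # entry just before the gap
    if index < len(words):
        neighbors.append(words[index])       # entry just after the gap

    # Mirror the capitalization of the misspelled word.
    if word[:1].isupper():
        neighbors = [w[0].upper() + w[1:] for w in neighbors]
    return neighbors


sample = ["how", "however", "howf", "prograde", "program"]
print(bracketing_suggestions("Howevr", sample))
print(bracketing_suggestions("prograg", sample))
```

A plain loop like this examines the list one entry at a time; that matches the rule as described, though a binary search would find the same position with far fewer comparisons.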

Testing Your Code

As a matter of principle, you are required to test your LanguageHelper class. Write this class in its own file and then include code which tests your class at the bottom of the source code.

Make sure that your test includes many interesting cases, demonstrating both the check of containment for words that are included and are not included, and a variety of calls to getSuggestions (see examples from the earlier sample of a complete spell-checker).
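
One possible layout for test code at the bottom of a file is sketched here. The LanguageHelper shown is a deliberately tiny stand-in so that the example runs by itself (it omits getSuggestions and is not a model solution), and the function name run_tests and the temporary two-word file are our own assumptions.

```python
import os
import tempfile


class LanguageHelper:
    """Deliberately tiny stand-in so that the tests below can run."""

    def __init__(self, languagefile):
        with open(languagefile) as source:
            self._words = [line.rstrip("\n") for line in source]

    def __contains__(self, word):
        # Exact match, or a capitalized form of an uncapitalized entry.
        return word in self._words or (
            word[:1].isupper() and word[:1].lower() + word[1:] in self._words)


def run_tests():
    """Check containment using a small temporary word list."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        handle.write("Missouri\nthis\n")
        path = handle.name
    try:
        language = LanguageHelper(path)
        assert "this" in language          # the in syntax calls __contains__
        assert "This" in language
        assert "Missouri" in language
        assert "missouri" not in language
        assert "Missourri" not in language
    finally:
        os.remove(path)
    return True


# This block runs only when the file is executed directly,
# not when another program imports the class.
if __name__ == "__main__":
    run_tests()
    print("All LanguageHelper tests passed.")
```

Placing the tests behind the __name__ check means that a program which imports LanguageHelper does not trigger them.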

Submitting Part I

You must officially submit your solution to Part I of the assignment by the first due date. Please submit your source code, LanguageHelper.py, which must include the LanguageHelper class as well as code which tests various words and demonstrates that your class works correctly. If you worked as a pair, please make this clear in the email and briefly describe the contributions of each person in the effort.

Part II: User Dialogue

The final goal is to create a complete, working spell-checker which provides a user dialogue as shown in the earlier example, for spell-checking a given document. This program should make use of the existing support of the LanguageHelper class.

In addition to instantiating the helper based on the file of English words, the complete program should open the user's document and then proceed to analyze it on a line-by-line, word-by-word basis. One of the first challenges will be in determining what constitutes a word. Though we typically use split() as a rough guide for breaking a line into words, that does not really work for typical English prose. Many of the resulting pieces would involve leading or trailing punctuation which would throw off our spell-checking when compared to the legitimate words of the language. Moreover, even if we were able to strip away the punctuation, we would want to make sure that we keep it there when replacing a misspelled word.

Rather than relying on split, we want you to determine the word-by-word breakdown as you go using the following rules. Assuming some current index for starting the search for a word,

Find the index of the next alphabetic character (i.e., satisfies isalpha()). Consider this index the start of the next word.
From there, step forward until finding the next whitespace character (i.e., satisfies isspace()) or until reaching the end of the line.
Now step backward until encountering some alphabetic character. Consider this index the end of the word.

By using these rules, unnecessary punctuation will be stripped away from the left and right ends of the words, while meaningful punctuation within the interior of the word remains, such as hyphens and apostrophes.
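
The three rules above can be sketched as a single scanning function. The name find_word_spans and the choice to return inclusive (start, end) index pairs are assumptions made for illustration; your program may track the indices differently.

```python
def find_word_spans(line):
    """Return a list of (start, end) index pairs, with end inclusive."""
    spans = []
    index = 0
    while index < len(line):
        # Rule 1: advance to the next alphabetic character.
        while index < len(line) and not line[index].isalpha():
            index += 1
        if index == len(line):
            break
        start = index
        # Rule 2: step forward to the next whitespace or the end of the line.
        while index < len(line) and not line[index].isspace():
            index += 1
        # Rule 3: step backward to the nearest alphabetic character.
        end = index - 1
        while not line[end].isalpha():
            end -= 1
        spans.append((start, end))
    return spans


line = '(Howevr, it is only a small tess.)'
print([line[start:end + 1] for start, end in find_word_spans(line)])
```

Because the backward step in rule 3 can never pass the alphabetic character found in rule 1, every span contains at least one letter.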

Each word should be checked against the true language, and if it is not included there, the user should be prompted for directions. Any changes specified by the user should be carefully tracked so that the corrected version of each line can be written to a new file.
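
As a rough sketch of tracking a change within a line, the two helpers below swap out a word given its start and end indices and build the marker line shown beneath a flagged word in the sample session. The names replace_span and underline are hypothetical, and the end index is assumed to be inclusive.

```python
def replace_span(line, start, end, replacement):
    """Return line with the characters start..end (inclusive) swapped out."""
    return line[:start] + replacement + line[end + 1:]


def underline(start, end):
    """Return a marker string that lines up beneath line[start..end]."""
    return " " * start + "=" * (end - start + 1)


line = "Still, wild success wil require a hugh, all-out effort."
start, end = 20, 22                      # the span covering "wil"
print(line)
print(underline(start, end))

line = replace_span(line, start, end, "will")
# The replacement may be longer or shorter than the original word,
# so resume the search just past the replacement in the updated line.
next_index = start + len("will")
print(line)
```

Since a replacement can change the length of the line, any indices computed before the change are no longer reliable for the text that follows it.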

Submitting Part II

You must officially submit your solution to Part II of the assignment by the second due date. Please submit your source code, Spell.py, and for continuity another copy of LanguageHelper.py. The helper may be the identical code you had submitted for Part I. Alternatively, it may be that you found problems with your original code while doing the second part. In that case, you may resubmit your revised version here, although this will not change your grade for the first submission.

You should submit an additional readme file at this stage detailing your continued efforts.

Please see details regarding the submission process from the general programming web page, as well as a discussion of the late policy.

Hints

Watch out for the effect of newlines when lines are read from the dictionary or document.
Make sure to close your files when you're done with them.
Many of the methods of the Python string class which we have not previously emphasized will be quite useful for this assignment. Most notable are: isalpha(), isdigit(), islower(), isspace(), isupper(), rstrip(). Type help(str) in a Python interpreter for more details.
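
The first two hints can be illustrated together. In this small sketch, the function name read_words is our own; it strips the trailing newline from each line and relies on a with statement so that the file is closed automatically.

```python
def read_words(filename):
    """Return the lines of a word-list file with newlines removed."""
    # The with statement closes the file when the block ends,
    # even if an error occurs while reading.
    with open(filename) as source:
        return [line.rstrip("\n") for line in source]


# Without stripping, a line read from the file still carries its newline,
# so it does not compare equal to the bare word.
print("Missouri\n" == "Missouri")
print("Missouri\n".rstrip("\n") == "Missouri")
```

Passing "\n" to rstrip removes only newlines; calling rstrip() with no argument would remove any other trailing whitespace as well.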

Files You Will Need

We will provide a file, English.txt, which lists over 364,000 correctly spelled English words. The words may involve a combination of uppercase and lowercase letters, as discussed above. The file has one word per line and words are alphabetized as they might appear in a standard English dictionary.

To ease your program development, you can make it appear as if this file is in your own directory by typing the following command.

ln -sf /export/mathcs/home/faculty/echambe5/Public/English.txt .

This doesn't really copy the file (given that it is very big, it seems unnecessary for everyone to have their own copy). But it creates what is called a symbolic link to this file within your directory.

Grading Standards

The two parts of the assignment will be graded separately. I expect commenting and docstrings to be provided as part of your code. Please include the names of both partners in a comment at the top of any files you submit.

Extra Credit

The rule that we used for producing suggested corrections for a misspelled word was not actually a very good approach. Rarely was the intended word right next to the mistake when alphabetized because the error may have occurred on an early letter.

A much better rule is to try to find words that are really in the language that are "nearby" the mistaken word, measured by the number of changes that would have to be made to get from one word to the other. This is typically called the edit distance between two words.

More concretely, consider the following types of edits which might get from a misspelled word back to a legitimate word.

Change a single letter to some other letter
(e.g. converting flexable to flexible)
Delete one character from the word.
(e.g. converting unneccessary to unnecessary)
Add one letter to the word.
(e.g., converting writen to written)
Take two neighboring characters and invert them
(e.g., converting wierd to weird)

As extra credit, add a new method, getGoodSuggestions to the LanguageHelper class which returns a list of all legitimate language words which are precisely one edit away from the mistaken word (we want you to still implement the original getSuggestions method for the required assignment, so that a botched extra credit attempt does not jeopardize your main grade).

The challenge will be in implementing this efficiently enough so that you can produce these suggestions without any significant delay for the user. There are several possible approaches. One is to write a method to check whether any two words are within one edit of each other. Then you could iterate this test between the mistaken word and each of the 364,000 words in the language file.
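
A pairwise check of this kind might look like the following sketch. The function name within_one_edit is hypothetical, characters are compared verbatim (so capitalization counts as a difference), and the four edit types are the ones listed earlier.

```python
def within_one_edit(a, b):
    """Return True if exactly one edit turns a into b."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    # Skip over the common prefix to find the first difference.
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Either one changed letter, or two neighboring letters inverted.
        if a[i + 1:] == b[i + 1:]:
            return True
        return (i + 1 < len(a)
                and a[i] == b[i + 1] and a[i + 1] == b[i]
                and a[i + 2:] == b[i + 2:])
    # Lengths differ by one: the longer word has one extra character at i.
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    return shorter[i:] == longer[i + 1:]


words = ["test", "desk", "program", "weird"]   # stand-in for the language
print([w for w in words if within_one_edit("tesk", w)])
```

Each individual check is quick, but this approach repeats it for every word in the language file on every misspelling.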

Another approach is to instead take the mistaken word, and generate all possible strings which are one edit away from it, and then for each of those strings see if it turns out to be contained as a legitimate word of the language. Notice that for a typical 7-letter word, there are only 7 ways to delete a character, 6 ways to invert two neighboring characters, 7*25 ways to change a single letter to some other letter, and 8*26 ways to insert a new character, because there are 8 possible slots and 26 possible letters to put in each slot. Okay, depending on how capitalization is managed, perhaps 52 possible characters, or more if allowing hyphen or apostrophe. But still, this seems like a far smaller number of things to check than trying to compare the mistaken word to each of the 364,000 other language words. If using this approach, make sure to get rid of any apparent duplicate suggestions (as there may have been two different edits which result in the same word).
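
The generate-and-check approach could be sketched as below. This is an illustration rather than a required design: the function names one_edit_candidates and good_suggestions are hypothetical, the alphabet is assumed to be the 26 lowercase letters, and the language is represented by a plain set instead of a LanguageHelper instance.

```python
import string


def one_edit_candidates(word, alphabet=string.ascii_lowercase):
    """Return the set of all strings exactly one edit away from word."""
    # Every way of cutting the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    changes = [left + letter + right[1:]
               for left, right in splits if right for letter in alphabet]
    inserts = [left + letter + right
               for left, right in splits for letter in alphabet]
    # A set discards duplicates produced by different edits.
    candidates = set(deletes + swaps + changes + inserts)
    candidates.discard(word)   # some edits reproduce the original word
    return candidates


def good_suggestions(word, language_words):
    """Return the candidates that appear in language_words, sorted."""
    return sorted(c for c in one_edit_candidates(word) if c in language_words)


language_words = {"test", "desk", "task", "program"}   # tiny stand-in
print(good_suggestions("tesk", language_words))
```

Handling capitalized words, hyphens, or apostrophes would require a larger alphabet or some extra case handling, which this sketch leaves out.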

Even though this extra credit challenge involves the LanguageHelper class, you may feel free to attempt it by either deadline. Just make sure to point out in a readme file that you have done this and where it was submitted. Also, show use of the new method in unit tests.

Michael Goldwasser

Last modified: Tuesday, 28 November 2006