Creating a Text Complexity Algorithm

Jesse Berkowitz

I am taking part in the labs portion of Lambda School, and our team is working with Story Squad, an organization that helps children improve their creative writing abilities. The project we are working on allows children to expand upon a story by submitting their own extension directly to an app.

From a data science perspective, it is our job to take in either a PDF or URL, process it with an image recognition model, and evaluate the text using text complexity algorithms. From there, we will match students with similar scores into teams, where they will compete against each other in a multiplayer game.

I worked with my team to implement the Google Vision API, which takes in either a PDF or a URL from the backend. If a PDF is passed in, a PDF-reading function converts it into local image files; each image file is read individually and compiled into a dictionary object, which then produces a string of text. If a URL is passed in, the URL is sent directly to the Google Vision API and a string is returned.
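As a rough sketch of that flow (the helper names and the pdf2image dependency below are my own illustration, not our production code), the two input paths might look something like this:

from google.cloud import vision
from pdf2image import convert_from_path  # assumed PDF-to-image library, purely illustrative

client = vision.ImageAnnotatorClient()

def text_from_url(url: str) -> str:
    # Point Vision directly at a hosted image and request full document text
    image = vision.Image()
    image.source.image_uri = url
    response = client.document_text_detection(image=image)
    return response.full_text_annotation.text

def text_from_pdf(path: str) -> str:
    # Convert each PDF page to a local image, OCR it, and join the page texts
    texts = []
    for i, page in enumerate(convert_from_path(path)):
        filename = f"page_{i}.png"
        page.save(filename)
        with open(filename, "rb") as f:
            image = vision.Image(content=f.read())
        response = client.document_text_detection(image=image)
        texts.append(response.full_text_annotation.text)
    return " ".join(texts)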

There are many different ways to analyze text. The rest of this article covers the methods I used to analyze and score strings of text.

To find spellchecked words, I imported a Python spellcheck package called AutoCorrect. The spell method is a built-in function of AutoCorrect.

from autocorrect import spell  # older autocorrect versions expose spell() directly; newer ones use Speller()

def spellcheck(input_str: str) -> str:
    # Run the whole string through the spellchecker and return the corrected text
    textcorrected = spell(input_str)
    return textcorrected

I created a custom tokenization function using spaCy. It was important to keep periods in the tokens so that some of the later functions (such as average sentence length) would work, so I manually allowed them through the regex.

import re
import spacy

nlp = spacy.load("en_core_web_sm")  # any English spaCy model will work here

def tokenize(input_str: str) -> list:
    # Keep letters, digits, spaces, and periods; drop everything else
    tokens = re.sub(r'[^a-zA-Z 0-9 \.]', '', input_str)
    tokens = tokens.lower().split()
    STOP_WORDS = nlp.Defaults.stop_words
    arr = []
    for token in tokens:
        if token not in STOP_WORDS:
            arr.append(token)
    return arr

Once all the tokens, or individual words of the text, have been returned, I can combine both functions in a third function called spellchecked_words.

This function first tokenizes the input string, runs spellcheck on the string, and finds how many words in the original string did not appear in the spellchecked version, giving us the total number of spellchecked words.

def spellchecked_words(input_str: str) -> int:
    # Count how many words changed when the text was spellchecked
    arr = []
    words1 = tokenize(input_str)
    words2 = tokenize(spellcheck(input_str))
    for word in words1:
        if word not in words2:
            arr.append(word)
    return len(arr)

The efficiency function ties all of these previous functions together. It takes the total number of original words, subtracts the spellchecked words, and divides that number by the original word count to find the percentage of words that did not need to be corrected.

def efficiency(input_str: str) -> float:
    # Fraction of words that did not need to be corrected
    original = len(tokenize(input_str))
    difference = original - spellchecked_words(input_str)
    percentage = difference / original
    return percentage

The descriptiveness function first spellchecks the child’s text, then checks for parts of speech using spaCy’s part-of-speech feature. It compares the number of action and descriptive words, such as verbs, adjectives, and adverbs, to the number of nouns and proper nouns, in order to come up with a descriptiveness score. The more of these words relative to the nouns, the higher the score. For example, “a cat jumped up a tall tree” would receive a higher score than “a cat jumped up a tree”.

def descriptiveness(input_str: str) -> float:
    # Ratio of verbs, adjectives, and adverbs to nouns and proper nouns
    input_str2 = spellcheck(input_str)
    doc = nlp(input_str2)
    x = [token.pos_ for token in doc]
    count = 0
    count2 = 0
    for part_of_speech in x:
        if part_of_speech == "PROPN" or part_of_speech == "NOUN":
            count += 1
        elif part_of_speech == "VERB" or part_of_speech == "ADJ" or \
                part_of_speech == "ADV":
            count2 += 1
    return count2 / count

The next function finds the percentage of unique words relative to total words.

def unique_words(input_str: str) -> float:
    # Percentage of tokens that are unique (the set drops duplicates)
    arr = []
    arr2 = set()
    words = tokenize(input_str)
    for word in words:
        arr.append(word)
        arr2.add(word)
    x = len(arr2) / len(arr)
    return x

It walks through the tokens, appending each word to an array and also adding it to a set (which naturally excludes duplicates), then divides the length of the set by the length of the array to find what percentage of the overall words are unique, which is effectively a test for repetition.

The next function used in the final evaluation measures average sentence length.

def avg_sentence_length(input_str: str) -> float:
    # Words per sentence, scaled down by 10 to keep the score near the 0-1 range
    arr = []
    words = tokenize(input_str)
    count = 0
    for word in words:
        if '.' in word:
            count += 1
    for word in words:
        arr.append(word)
    x = len(arr) / 10
    return x / count

By allowing periods to remain in the text through the regex, I was able to count words that contain a period, then divide the overall number of words in the text by that count to find the average sentence length. The result is divided by ten to keep the function from having too much of an effect on the final evaluate score. The goal is to have all final scores fall between 0 and 1, to avoid having to standardize them later on.

Finally, I checked the average length of words. This serves as a rough check of the child’s vocabulary, assuming that longer words, in general, imply a stronger vocabulary. There are ways to use spaCy’s interpretation of vocabulary, but I was interested in using as many of my own functions as possible for this project.

def avg_len_words(input_str: str) -> float:
    # Average word length, scaled down by 10
    arr = []
    words = tokenize(input_str)
    for word in words:
        x = len(word)
        arr.append(x)
    y = (sum(arr) / len(arr)) / 10
    return y

This function walks through all the words in the string, inserts their individual lengths into an array, then takes the sum of the array divided by the length of the array to find the average word size. The result is also divided by 10 to keep the function from having too much sway over the final evaluate function.

The evaluate function uses unique words, average length of words, average sentence length, descriptiveness, and spellchecking efficiency to calculate the overall text complexity score.

def evaluate(input_str: str) -> float:
    # Equal-weighted combination of the five component scores
    score = (
        (.2 * unique_words(input_str)) +
        (.2 * avg_len_words(input_str)) +
        (.2 * avg_sentence_length(input_str)) +
        (.2 * efficiency(input_str)) +
        (.2 * descriptiveness(input_str))
    )
    return score
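Since every weight is 0.2, the result is simply the mean of the five component scores. As a quick sanity check (the sample string below is just an example), the whole pipeline can be run on any piece of text:

sample = "The small grey cat quickly climbed the tall oak tree. It sat proudly on a thin branch."
print(evaluate(sample))  # a float that should land roughly between 0 and 1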

So far, our DS team has created an image recognition model, created a text complexity model, and connected these models to AWS through various endpoints. The main challenge now is to store the text complexity scores in a database and use a statistical model, or create another algorithm, to match users with other users who have similar scores. In some cases, the number of players in the database will not divide evenly into teams, so the matchmaking process may present a significant challenge.
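One simple approach worth sketching (this is only an illustration, not our final matchmaking logic, and the make_teams helper is hypothetical) is to sort players by score and group neighbors, folding any leftover players into the last full team:

def make_teams(scores: dict, team_size: int = 2) -> list:
    # scores maps user_id -> text complexity score
    ranked = sorted(scores, key=scores.get)  # user ids ordered by score
    teams = [ranked[i:i + team_size] for i in range(0, len(ranked), team_size)]
    # If the final group comes up short, merge it into the previous team
    if len(teams) > 1 and len(teams[-1]) < team_size:
        teams[-2].extend(teams.pop())
    return teams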

I am very interested in Natural Language Processing, and in particular how it can be used in statistical analysis. I think creating this text complexity model from scratch, as well as learning about spaCy’s many built-in features, will be very beneficial for me in my career and career search. It will also be quite useful in improving accuracy of predictive models in a variety of fields.
