Matt Eding Python & Data Science Blog


Discrete Differential Evolution Text Summarizer

Part 1

I took it upon myself to implement a new text summarizer in Python. While looking into NLP tasks, I stumbled upon evolutionary algorithms and thought it would be awesome to program one from scratch.

For those of you who are unfamiliar with evolutionary algorithms, they try to emulate natural selection in the real world. They start with a random population, from which new offspring are created by mixing data from members of the parent population. Then, using a fitness function, you select the best individuals from both the parents and the offspring to propagate to the next generation. A mutation step can also occur, altering the data individuals carry to keep the population from becoming stagnant (stuck in local minima). This process repeats until the population converges on an optimal solution.
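
As a rough sketch of that loop (the helpers passed in are hypothetical placeholders for the problem-specific operators, not the summarizer built in this series):

import numpy as np

# Minimal sketch of a generic evolutionary loop. The helper functions
# (random_population, crossover, mutate, fitness) are hypothetical placeholders.
def evolve(fitness, random_population, crossover, mutate, generations=100):
    pop = random_population()                       # random starting population
    for _ in range(generations):
        offspring = mutate(crossover(pop))          # mix parent data, then perturb it
        candidates = np.concatenate([pop, offspring])
        scores = np.array([fitness(c) for c in candidates])
        pop = candidates[np.argsort(scores)[-len(pop):]]  # keep the fittest
    return max(pop, key=fitness)                    # best individual found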

Some nomenclature used in this writing:

  • discrete differential evolution (dde): The algorithm component used to create new offspring.
  • document (doc): A collection of tokenized text. We will use a collection of sentences.
  • population (pop): The group of individuals for each generation.
  • chromosome (chrom): An individual in the population. Here, a candidate clustering that assigns each sentence to a cluster (see the small example after this list).
  • gene: A single unit of data in a chromosome. Here, each gene is the cluster label given to one sentence.
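
To make the chromosome and gene concrete, here is a small hypothetical example (a five-sentence document partitioned into three clusters):

import numpy as np

# chrom[i] is the cluster label assigned to sentence i
chrom = np.array([0, 2, 0, 1, 2])
# sentences 0 and 2 form cluster 0, sentence 3 forms cluster 1,
# and sentences 1 and 4 form cluster 2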

So how is this applied to text summarization? Well, first you have to turn the text into numerical data to work with. This is achieved by splitting the text into sentences and then recording the set of distinct words that appears in each one. In my implementation we use Scikit-Learn's CountVectorizer and cast the result to a bool dtype to reflect that we only care whether a word appears, rather than how many times it shows up.

from sklearn.feature_extraction.text import CountVectorizer

# boolean matrix: rows are sentences, columns are distinct non-stop words
count_vec = CountVectorizer(stop_words='english')
document = (count_vec.fit_transform(sentence_tokens)
                     .toarray()
                     .astype(bool))
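
Here sentence_tokens is just the original text split into sentences. One way to produce it (an assumption on my part; any sentence splitter will do) is NLTK:

import nltk   # may require nltk.download('punkt') on first use

text = ("Python is a dynamic language. "
        "C++ is a compiled language.")
sentence_tokens = nltk.sent_tokenize(text)
# ['Python is a dynamic language.', 'C++ is a compiled language.']
# each row of document then corresponds to one sentence,
# and each column to a distinct non-stop word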

Using this document, we can compare the similarity of any two sentences with Jaccard similarity. This is defined as the size of the intersection divided by the size of the union of two sets. We leverage NumPy's bitwise operators for these set operations.

import numpy as np

def jaccard_similarity(a, b):
    # a and b are boolean word-presence vectors for two sentences
    intersection = np.sum(a & b)
    union = np.sum(a | b)
    return intersection / union

As an example to illustrate its use with unique words from sentences:
Sentence A: Python is a dynamic language.
Sentence B: C++ is a compiled language.
Intersection: {is, a, language}
Union: {python, is, a, dynamic, language, c++, compiled}
Jaccard: 3 / 7 ≈ 0.429
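
The same number falls out of jaccard_similarity if we encode both sentences as boolean word-presence vectors over that seven-word union (the column order is arbitrary here):

# columns: python, is, a, dynamic, language, c++, compiled
sent_a = np.array([1, 1, 1, 1, 1, 0, 0], dtype=bool)
sent_b = np.array([0, 1, 1, 0, 1, 1, 1], dtype=bool)
print(jaccard_similarity(sent_a, sent_b))   # 0.4285...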

Our goal is to partition the sentences in the document into clusters, and then pick a representative sentence from each cluster to use in the summary. To achieve this, we will introduce cohesion (closeness within a cluster) and separation (how far apart different clusters are). From the papers I researched, we can define them as follows:

import itertools

# to maximize: average pairwise similarity of sentences within each cluster
def cohesion(chrom, doc):
    total = 0
    for p in np.unique(chrom):
        sents = doc[chrom == p]  # sentences assigned to cluster p
        for sent_i, sent_j in itertools.combinations(sents, r=2):
            total += jaccard_similarity(sent_i, sent_j) / len(sents)
    return total

# to minimize: average pairwise similarity of sentences in different clusters
def separation(chrom, doc):
    total = 0
    for p, q in itertools.combinations(np.unique(chrom), r=2):
        sents_p = doc[chrom == p]
        sents_q = doc[chrom == q]
        for sent_i, sent_j in itertools.product(sents_p, sents_q):
            total += jaccard_similarity(sent_i, sent_j) / (len(sents_p) * len(sents_q))
    return total

We aim to strike a balance between the two functions above. To do so, we combine them into a single fitness score with the help of the sigmoid function.

def sigmoid(x):
    # logistic function squashing x into the range (0, 1)
    return 1 / (1 + np.exp(-x))

# to maximize
def cohesion_separation(chrom, doc):
    coh = cohesion(chrom, doc)
    sep = separation(chrom, doc)
    return (1 + sigmoid(coh)) ** sep
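
Putting the pieces together, here is a toy example of scoring a single chromosome; the four-sentence document and the cluster assignment below are made up purely for illustration:

# toy document: 4 sentences x 5 distinct words (boolean word presence)
doc = np.array([[1, 1, 0, 0, 0],
                [1, 0, 1, 0, 0],
                [0, 0, 0, 1, 1],
                [0, 0, 1, 1, 1]], dtype=bool)

# candidate chromosome: sentences 0-1 in cluster 0, sentences 2-3 in cluster 1
chrom = np.array([0, 0, 1, 1])

print(cohesion(chrom, doc))              # within-cluster similarity (higher is better)
print(separation(chrom, doc))            # between-cluster similarity (lower is better)
print(cohesion_separation(chrom, doc))   # combined fitness to maximize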

Now that we have the prerequisite tools, we will delve into the evolutionary algorithm in Part 2, so stay tuned for more!

For the full implementation feel free to look at my EvolveSumm repository on GitHub.


School Performance & Household Income

Coming from a background in education, I decided to investigate whether school performance could be predicted from the income of the community. I know that even within a single district, funding can vary substantially, as can student achievement. This achievement gap is a concern I decided to tackle with data science regression techniques. I figured that if my findings indicated that wealthier communities perform better on California state testing, then the CA Department of Education could make better-informed decisions about the allocation of funding.

My approach was to use 2018 Smarter Balanced Assessment Consortium test data obtained from the CA Assessment of Student Performance and Progress. I then paired each school's zip code to median household income using United States Zip Codes.

Once all the data was collected, I decided on Root Mean Square Error (RMSE) as my metric of model performance. After feature engineering, I trained and validated models using Linear, Ridge, LASSO, and Polynomial Regression. Comparing the results, I selected Linear Regression even though it performed marginally worse than Ridge, since it is the simpler model.
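
For those curious, a minimal sketch of that comparison looks roughly like the following; the synthetic make_regression data is only a stand-in for the engineered school and income features, and the hyperparameters are illustrative rather than the ones I actually tuned:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# placeholder data standing in for the real feature matrix and target
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

models = {
    'linear': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'lasso': Lasso(alpha=0.1),
    'polynomial': make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

for name, model in models.items():
    # scikit-learn reports RMSE as a negated score, so flip the sign
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring='neg_root_mean_squared_error').mean()
    print(f'{name}: RMSE = {rmse:.2f}')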

Despite my best attempts to eke out the best results while keeping Durbin-Watson, Kurtosis, Skew, and Condition Number in check, my model's predictions of income are roughly 30% off from the true median income. So while there is some predictive power, I concluded that the Department of Education should probably not use this as a metric for how to fund schools, as in some cases it could actually compound achievement gap issues by underfunding some schools that perform poorly, and vice versa.

For more details of my implementation, please visit my repository.