Back from my teaching days, I had written some rudimetry code to assist in making
tile patterns
for my algebra students to make predictions and write equations. Recently I had decided
to revisit this idea in a more robust manner and it eventually evolved into a Flask app
where you can find a working version deployed on Heroku.
Color-coding exposes the pattern’s structure.
It has definitely been a fun and interesting experience making this come to fruition.
My plan was to first make a Python script to parse a text file that contains a
template for the tile pattern. I quickly settled on using “.” for a unit tile,
“-“ or “|” for a linear tile, and “O” for a quadratic tile as I felt that these
were easy to type and correspond visually with actual algebra tiles. For example the pattern above would have a text template of:
.
||O
.
Algebra tiles are a nice aide for a wide range of math topics.The small squares, rectangles, and large squares correspond to 1, x, and x2 respectively.
Luckily I had recently researched into sparse matrices and it didn’t take that long
to realize that block matrices would be the perfect way to take the blueprint
components. While I could’ve used np.block, I didn’t want to figure out what the
dimensions of the null matrices (blank spaces) and use np.zeros(shape=?).
Instead sparse.bmat abstracts this away by letting me simply use None as a
placeholder for the null matrices, and it would automatically calculate the
appropriate dimensions. Great! Now I can have the following mappings:
"." -> np.array(scalar)
"-" -> np.array(vector)
"|" -> np.array(vector).T
"O" -> np.ones(square_matrix)
Really cool stuff, but there is still a minor caveat–
I currently have not implemented tiles to align where side lengths would not align
nicely. My plan is to eventually scan rows/columns, find if it needs to be a
unit or variable length based on the “biggest” tile in the given row/column,
and use slicing on square arrays of 0’s to set a restricted subset to 1’s.
Now it was time to make this a viable tool for other teachers to use. I had never
used argparse before and wanted to make a CLI to facilitate usage. While I had
known about "__main__" in Python for quite some time, it was interesting to find out
about __main__.py as a way to make a package into a script! Another point of
interest was the ability to have const and default values for a single flag to
assume as its value when the flag was present without a value or when the flag was
absent respectively. This allowed for nice actions for saving an image to the CWD vs
an image pop-up while also giving a user the ability to specify a target directory.
>>> python -m tileapp -h
usage: tileapp [-h] [-bw] [-cm COLORMAP] [-o [DIR]] [-p PREFIX] [-v]
infile dim [dim ...]
Tile pattern parser from txt to png
positional arguments:
infile filepath with pattern to parse
dim dimensions to form pattern
optional arguments:
-h, --help show this help message and exit
-a, --alpha transparency of the colors used in png output;
set to 0 for b/w png
-cm COLORMAP, --colormap COLORMAP
colormap used to differentiate tile parts; see
https://matplotlib.org/tutorials/colors/colormaps.html
-o [DIR], --outdir [DIR]
destination file for png output; if omitted, png is
popup; if not arg, save png to cwd
-p PREFIX, --prefix PREFIX
prefix used for png output; use alongside outdir
-v, --verbose print to stdout the array used for png creation
Still not satisfied that there would be much of an audience of math teachers who
would use the then current incarnation as is, I began developing a Flask app
to make anyone lacking programming knowledge able to reap the benefits of my work.
For this I had to make a decision regarding how to store the generated images
to display to the user. Initially I was thinking about taking advantage of
SQLite’s in in-memory capabilities since
the use case of this application does not need persitent storage. But then again
I thought that while this could work, it seemed like over engineering for a single
image generated at a time. It was time to reformulate my approach.
Most Python programmers are familiar with open(outfile, 'wb') and it was time to
put working with files into overdrive. Right off the bat, I decided a io.BytesIO would
be the perfect choice since they allow working with image files directly as an object
rather than needed to write to disk. Alas things are never as simple as intially
thought; HTML was designed to work with plain text. After some sluthing, I stumbled upon
someone’s proposed workaround using a module that I have never looked twice at–base64.
In a nutshell, Base64 encodes bytes into text
which makes it ideal to act as a medium between the binary images and my web app.
Interface for producing a tile pattern. Need a different figure? Different color?Easy, just change the parameters and you’re good to go!
Despite needing to make the app more exciting with CSS/Bootstrap, those were
secondary to getting my now fully functional app out to the public via Heroku.
In the past, Heroku had never cooperated with me, and I was unsurprised when it
invariably threw errors at me. Yet this time I was determined to get it to work
and finally got it working. My Achilles heel?–thinking that gunicorn was embedded
within Heroku and omitting it from the requirements.txt. As the saying goes: “Fool
me once, shame on you; fool me twice, shame on me.”
Now all the remains is spicing up the front-end of the app. But that will have to
wait until another day. In the meantime if you are interested in looking at the
code source visit my GitHub.
A sparse matrix is a matrix that has a value of 0 for most elements.
If the ratio of Number of Non-Zero (NNZ) elements to the size is less than 0.5, the matrix is sparse.
While this is the mathematical definition, I will be using the term sparse for matrices with only NNZ elements and dense for matrices with all elements.
Storing information about all the 0 elements is inefficient, so we will assume unspecified elements to be 0.
Using this scheme, sparse matrices can perform faster operations and use less memory than its corresponding dense matrix representation, which is especially important when working with large data sets in data science.
Today we will investigate all of the different implementations provided by the SciPy sparse package.
This implementation is modeled after np.matrix opposed to np.ndarray, thus is restricted to 2-D arrays and having quirks like A * B doing matrix multiplication instead of element-wise multiplication.
Different sparse formats have their strengths and weaknesses.
A good starting point is looking at formats that are efficient for constructing these matrices.
Typically you would start with one of these forms and then convert to another when ready to do calculations.
Coordinate Matrix
Perhaps the simplest sparse format to understand is the COOrdinate (COO) format.
This variant uses three subarrays to store the element values and their coordinate positions.
The savings on memory consumption is quite substantial as the matrix size increases.
Managing data in a sparse structure is a fixed cost unlike the case for dense matrices.
The overhead incurred from needing to manage the subarrays is becomes negligible as data grows making it a great choice for some datasets.
A word of caution: only use sparse arrays if they are sufficiently sparse enough; it would be counterproductive storing a mostly nonzero array using several subarrays to keep track of position and data.
In[11]:defmemory_usage(coo):...:# data memory and overhead memory...:coo_mem=(sum(obj.nbytesforobjin[coo.data,coo.row,coo.col])...:+sum(obj.__sizeof__()forobjin[coo,coo.data,coo.row,coo.col]))...:print(f'Sparse: {coo_mem}')...:mtrx=coo.todense()...:mtrx_mem=mtx.nbytes+mtrx.__sizeof__()...:print(f'Dense: {mtrx_mem}')In[12]:memory_usage(coo)# Sparse: 480# Dense: 448In[13]:coo.resize(100,100)In[14]:memory_usage(coo)# Sparse: 480# Dense: 80112
Dictionary of Keys Matrix
Dictionary Of Keys (DOK)
is very much like COO except that it subclasses dict to store coordinate-data information as key-value pairs.
Since it uses a hash table as storage, identifying values at any given location has constant lookup time.
Use this format if you need the functionality that come with builtin dictionaries, but be mindful that hash tables hog much more memory than arrays.
In[15]:dok=sparse.dok_matrix((10,10))In[16]:dok[(3,7)]=42# store value 42 at coordinate (3, 7)In[17]:dok[(9,5)]# zero elements are accessibleOut[17]:0.0In[18]:dok.keys()|dok.transpose().keys()# union of key viewsOut[18]:{(3,7),(7,3)}In[19]:isinstance(dok,dict)Out[19]:True
Note: be careful of potential problems using the methods inherited from dict; they don’t always behave.
In[20]:out_of_bounds=(999,999)In[21]:dok[out_of_bounds]=1# works as expectedIndexError:Indexoutofbounds.In[22]:dok.setdefault(out_of_bounds)# silently ignored...In[23]:dok.toarray()# ...until nowValueError:rowindexexceedsmatrixdimensionsIn[24]:dok.pop(out_of_bounds)# fix issue by removing bad pointIn[25]:sparse.dok_matrix.fromkeys([...,...,...])# don't get me startedTypeError:__init__()missing1requiredpositionalargument:'arg1'
Linked List Matrix
The most flexible format to insert data is through usage of LInked List (LIL) matrices.
Data can be set via indexing and slicing syntax of NumPy to quickly populate the matrix.
In my opinion, LIL is the coolest sparse format for constructing sparse matrices from scratch.
LIL stores information in lil.rows where each list represents a row index and the elements inside the list match columns.
In a parallel array, lil.data, the NNZ values are stored.
But unlike other sparse formats, these subarrays cannot be explicitly passed to the constructor; LIL matrices must be made from either an empty state or from existing matrices, dense or sparse.
Below is an illustration of various techniques used to build up a LIL matrix.
In[26]:lil=sparse.lil_matrix((6,5),dtype=int)In[27]:lil[(0,-1)]=-1# set individual pointIn[28]:lil[3,(0,4)]=[-2]*2# set two pointsIn[29]:lil.setdiag(8,k=0)# set main diagonalIn[30]:lil[:,2]=np.arange(lil.shape[0]).reshape(-1,1)+1# set entire columnIn[31]:lil.toarray()Out[31]:array([[8,0,1,0,-1],[0,8,2,0,0],[0,0,3,0,0],[-2,0,4,8,-2],[0,0,5,0,8],[0,0,6,0,0]])
So what’s the drawback…? Well, it utilizes jagged arrays under the hood which requires np.dtype(object).
This costs a lot more memory than a rectangular array, so if the data is big enough, you may be forced to work with COO instead of LIL.
In short, LIL is mostly offered as a convenience, albeit an awesome one at that.
As an aside, Linked List Matrix is a misnomer since it does not use linked lists behind the scenes!
LIL actually uses Python’s list which is a dynamic array, so it should really be called a List of Lists Matrix, in spite of what the documentation says.
(A missed opportunity to christen it as LOL…)
In[34]:sparse.lil.__doc__# module docstringOut[34]:'LInked List sparse matrix class\n'
Compressed Sparse Matrices
The formats described earlier are great for building sparse matrices, but they aren’t as computationally performant than more specialized forms.
The reverse is true for compressed sparse matrix family, which should be treated as read-only rather than write-only.
These are more difficult to understand, but with a little patience their structure can be grokked.
Compressed Sparse Row/Column
The Compressed Sparse Row/Column (CSR and CSC) formats are designed for computation in mind.
Adjacent pairs of index pointers determine two things.
First, their position in the pointer array is the row number.
Second, these values represent the [start:stop] slice of the indices array, and their difference is the NNZ elements in each row.
Using the pointers, look up the indices to determine the column for each element in the data.
CSC works exactly the same as CSR but has column based index pointers and row indices instead.
Below is a diagram of the same data in this format.
Notice for this particular case, CSC is slightly more compact with two fewer index pointers.
As promised, the compressed formats are indeed faster than their COO counterpart.
For a modest-sized matrix, we see a 2x speed gain vs COO and 60x speedup vs dense!
In[44]:csr.resize(1000,1000)In[45]:%timeitcsr@csr# 111 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)In[46]:coo=csr.tocoo()# another way to convertIn[47]:%timeitcoo@coo# 251 µs ± 8.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)In[48]:arr=csr.toarray()In[49]:%timeitarr@arr# order of magnitude slower!# 632 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Block Sparse Row
Block Sparse Row (BSR) is like CSR but stores sub-matrices rather than scalar values at locations.
This implementation requires all the sub-matrices to have the same shape, but there are more generalized constructs with block matrices that relax this constraint.
These matrices do not have their unique data structure in SciPy, but can be indirectly made via the sparse.bmat constructor function.
In[56]:A=np.arange(8).reshape(2,4)# can use dense arraysIn[57]:T=np.tri(5,4)In[58]:L=[[8]*4]*2# can use listsIn[59]:I=sparse.identity(4)# can use sparse arraysIn[60]:Z=sparse.coo_matrix((2,3))# zeros to create column gapIn[61]:sp.bmat([[A,Z,L],...:[None,None,I],...:[T,None,None]],dtype=int)Out[61]:<11x11sparsematrixoftype'<class 'numpy.int64'>'with33storedelementsinCOOrdinateformat>In[62]:_.toarray()# ipython previous outputOut[62]:array([[0,1,2,3,0,0,0,8,8,8,8],[4,5,6,7,0,0,0,8,8,8,8],[0,0,0,0,0,0,0,1,0,0,0],[0,0,0,0,0,0,0,0,1,0,0],[0,0,0,0,0,0,0,0,0,1,0],[0,0,0,0,0,0,0,0,0,0,1],[1,0,0,0,0,0,0,0,0,0,0],[1,1,0,0,0,0,0,0,0,0,0],[1,1,1,0,0,0,0,0,0,0,0],[1,1,1,1,0,0,0,0,0,0,0],[1,1,1,1,0,0,0,0,0,0,0]])
Diagonal Matrix
Perhaps the most specialized of the formats to store sparse data is the DIAgonal (DIA) variant.
It is best suited for data that appears along the diagonals of a matrix.
The data is stored in an array of shape (offsets) x (width) where the offsets dictate the location of each row in the data array along diagonal.
Offsets are below or above the main diagonal when negative or positive respectively. Note that if a row in the data matrix is cutoff, the excess elements can assume any value (but they must have placeholders).
In[67]:dia.data.ravel()[9:12]=0# replace cutoff dataIn[68]:dia.dataOut[68]:array([[1,2,3,4,5],[6,7,8,9,0],[0,0,13,14,15]])In[69]:dia.toarray()# same array repr as earlierOut[69]:array([[1,0,13,0,0],[0,2,0,14,0],[0,0,3,0,15],[6,0,0,4,0],[0,7,0,0,5],[0,0,8,0,0],[0,0,0,9,0]])
Specialized Functions
In addition to the multitude of formats, there is a plethora of functions specialized just for sparse matrices.
Use these functions whenever possible rather than their NumPy counterparts, otherwise speed performances will be compromised. Even worse, the resulting calculations could be incorrect!
Looking into the details of these are left as an exercise to the avid reader.
Other Libraries
SciPy is not the only resource for working with sparse structures in the Python ecosystem.
While most appear to use the SciPy package internally, they have all made it their own.
I will present several libraries that I find most compelling, but this is not supposed to be the end all be all.
Pandas
Data science today wouldn’t be what it is without Pandas, so it doesn’t come as a surprise that it supports sparse variants of its data structures.
A really neat feature is that inferred elements do not have to be forms of 0!
In[70]:importpandasaspdIn[71]:ss=pd.SparseSeries.from_coo(dia.tocoo())In[72]:ss# uses a MultiIndex for the coordinatesOut[72]:0012131123142234153063441745528639dtype:Sparse[int64,0]BlockIndexBlocklocations:array([0],dtype=int32)Blocklengths:array([12],dtype=int32)In[73]:data=dict(A=[np.nan,1,2],B=[np.nan,3,np.nan])In[74]:sdf=pd.DataFrame(data).to_sparse()In[75]:type(sdf).mro()# class inheritance hierarchyOut[75]:[pandas.core.sparse.frame.SparseDataFrame,pandas.core.frame.DataFrame,pandas.core.generic.NDFrame,pandas.core.base.PandasObject,pandas.core.base.StringMixin,pandas.core.accessor.DirNamesMixin,pandas.core.base.SelectionMixin,object]In[76]:sdtype=pd.SparseDtype(object,fill_value='e')# not restricted to null valuesIn[77]:pd.SparseArray(list('abcdeeeeeeee'),dtype=sdtype)Out[77]:[a,b,c,d,e,e,e,e,e,e,e,e]Fill:eIntIndexIndices:array([0,1,2,3],dtype=int32)
Scikit-Learn
The machine learning powerhouse, Scikit-Learn, supports sparse matrices in many areas.
This is important since big data thrives on sparse matrices (assuming enough sparsity).
After all, who wouldn’t want to have performance gains from these number-crunching algorithms?
It hurts having to wait on CPU intensive SVMs, not to mention discovering some data won’t fit into working memory!
Scikit-Learn’s term-document matrices produced by text vectorizers result in CSR matrices.
This is crucial for NLP since most words are used sparingly if at all.
Naively using a dense format might otherwise cause speed bottlenecks and lots of wasted memory.
In[78]:fromsklearn.feature_extraction.textimportCountVectorizerIn[79]:bow=CountVectorizer().fit_transform(['demo'])In[80]:sparse.isspmatrix(bow)Out[80]:TrueIn[81]:sparse.save_npz('bag_of_words.npz',bow)# store for future use
Other areas where Scikit-Learn has the ability to output sparse matrices include:
sklearn.preprocessing.OneHotEncoder
sklearn.preprocessing.LabelBinarizer
sklearn.feature_extraction.DictVectorizer
Moreover there are utilities that play well with sparse matrices such as scalers, a handful of decompositions, some pairwise distances, train-test-split, and many estimators can predict and/or fit sparse matrices.
In short embrace their usage whenever possible to make your machine learning models more efficient.
PyData Sparse
As another implementation, PyData’s sparse library provides an interface like np.ndarray instead of np.matrix, permitting creation of multidimensional sparse arrays.
The caveat is that as of the writing of this article, only COO and DOK formats are supported.
In[82]:importsparseassp# avoid name clobbering with scipy.sparseIn[83]:sarr=sp.random((3,4,2),density=0.2)# 3-D sparse arrayIn[84]:sarrOut[84]:<COO:shape=(3,4,2),dtype=float64,nnz=4,fill_value=0.0>In[85]:sarr+=1# not possible in scipy.sparseIn[86]:sarrOut[86]:<COO:shape=(3,4,2),dtype=float64,nnz=4,fill_value=1.0># fill_value updates!In[87]:sarr.todense()Out[87]:array([[[1.,1.],[1.,1.],[1.,1.],[1.,1.]],[[1.,1.],[1.86024163,1.],[1.37233162,1.1114997],[1.,1.]],[[1.,1.],[1.,1.16850612],[1.,1.],[1.,1.]]])
In SciPy, logical operators are not directly implemented, but AND (&) and OR (|) can be emulated by constraining the dtype to bool:
Unfortunately NOT (~) is impossible since it would make a sparse matrix into a dense one (theoretically self - 1).
Until now, that is.
As seen earlier, sparse will dynamically update the fill value to accommodate current states.
Hopefully this article has enlightened how to use sparse data structures properly so you can go forth and use them with confidence for future projects.
Knowing the pros and cons of each format (including dense) will aid in selecting the optimal one for a given task.
Be mindful that while sparse matrices are are great tool, they are not necessarily a replacement for arrays.
If a matrix is not sufficiently sparse, the multitude of storage arrays behind the scenes will actually take up more resources than a regular dense array would.
Furthermore if you need to regularly mutate an array, perform computations in between, and display output, then sparsity simply isn’t worth the trouble.
But all these concerns aside, hopefully sparse matrices can help “lighten” your load.
Comparison of Matrix Formats
COO
DOK
LIL
CSR
CSC
BSR
DIA
Dense
indexing
no
yes
yes
yes
yes
no†
no
yes
“write-only”
yes
yes
yes
no
no
no
no
yes
“read-only”
no
no
no
yes
yes
yes
yes
yes
low memory‡
yes
no
no
yes
yes
yes
yes
no
PyData sparse
yes
yes
no
no
no
no
no
n/a
† BSR raises NotImplementedError: rather than explicitly raising TypeError: 'xxx_matrix' object is not subscriptable. So maybe in the future it will support indexing.
‡ Assuming enough NNZ.
In[92]:exit()# the end :D
Code used to create the above animations is located at my GitHub.