Tilepattern

21 May 2019

Tile Pattern App

Back from my teaching days, I had written some rudimetry code to assist in making tile patterns for my algebra students to make predictions and write equations. Recently I had decided to revisit this idea in a more robust manner and it eventually evolved into a Flask app where you can find a working version deployed on Heroku.

Pattern Fig. 2
Color-coding exposes the pattern’s structure.

It has definitely been a fun and interesting experience making this come to fruition. My plan was to first make a Python script to parse a text file that contains a template for the tile pattern. I quickly settled on using “.” for a unit tile, “-“ or “|” for a linear tile, and “O” for a quadratic tile as I felt that these were easy to type and correspond visually with actual algebra tiles. For example the pattern above would have a text template of:

 .
||O
 .

Algebra tiles are a nice aide for a wide range of math topics. The small squares, rectangles, and large squares correspond to 1, x, and x² respectively.

Luckily I had recently researched into sparse matrices and it didn’t take that long to realize that block matrices would be the perfect way to take the blueprint components. While I could’ve used np.block, I didn’t want to figure out what the dimensions of the null matrices (blank spaces) and use np.zeros(shape=?). Instead sparse.bmat abstracts this away by letting me simply use None as a placeholder for the null matrices, and it would automatically calculate the appropriate dimensions. Great! Now I can have the following mappings:

"." -> np.array(scalar)
"-" -> np.array(vector)
"|" -> np.array(vector).T
"O" -> np.ones(square_matrix)

Really cool stuff, but there is still a minor caveat– I currently have not implemented tiles to align where side lengths would not align nicely. My plan is to eventually scan rows/columns, find if it needs to be a unit or variable length based on the “biggest” tile in the given row/column, and use slicing on square arrays of 0’s to set a restricted subset to 1’s.

Now it was time to make this a viable tool for other teachers to use. I had never used argparse before and wanted to make a CLI to facilitate usage. While I had known about "__main__" in Python for quite some time, it was interesting to find out about __main__.py as a way to make a package into a script! Another point of interest was the ability to have const and default values for a single flag to assume as its value when the flag was present without a value or when the flag was absent respectively. This allowed for nice actions for saving an image to the CWD vs an image pop-up while also giving a user the ability to specify a target directory.

>>> python -m tileapp -h
usage: tileapp [-h] [-bw] [-cm COLORMAP] [-o [DIR]] [-p PREFIX] [-v]
               infile dim [dim ...]

Tile pattern parser from txt to png

positional arguments:
  infile                filepath with pattern to parse
  dim                   dimensions to form pattern

optional arguments:
  -h, --help            show this help message and exit
  -a, --alpha           transparency of the colors used in png output; 
                        set to 0 for b/w png
  -cm COLORMAP, --colormap COLORMAP
                        colormap used to differentiate tile parts; see
                        https://matplotlib.org/tutorials/colors/colormaps.html
  -o [DIR], --outdir [DIR]
                        destination file for png output; if omitted, png is
                        popup; if not arg, save png to cwd
  -p PREFIX, --prefix PREFIX
                        prefix used for png output; use alongside outdir
  -v, --verbose         print to stdout the array used for png creation

Still not satisfied that there would be much of an audience of math teachers who would use the then current incarnation as is, I began developing a Flask app to make anyone lacking programming knowledge able to reap the benefits of my work. For this I had to make a decision regarding how to store the generated images to display to the user. Initially I was thinking about taking advantage of SQLite’s in in-memory capabilities since the use case of this application does not need persitent storage. But then again I thought that while this could work, it seemed like over engineering for a single image generated at a time. It was time to reformulate my approach.

Most Python programmers are familiar with open(outfile, 'wb') and it was time to put working with files into overdrive. Right off the bat, I decided a io.BytesIO would be the perfect choice since they allow working with image files directly as an object rather than needed to write to disk. Alas things are never as simple as intially thought; HTML was designed to work with plain text. After some sluthing, I stumbled upon someone’s proposed workaround using a module that I have never looked twice at–base64. In a nutshell, Base64 encodes bytes into text which makes it ideal to act as a medium between the binary images and my web app.

Custom Tile Pattern Interface for producing a tile pattern. Need a different figure? Different color? Easy, just change the parameters and you’re good to go!

Despite needing to make the app more exciting with CSS/Bootstrap, those were secondary to getting my now fully functional app out to the public via Heroku. In the past, Heroku had never cooperated with me, and I was unsurprised when it invariably threw errors at me. Yet this time I was determined to get it to work and finally got it working. My Achilles heel?–thinking that gunicorn was embedded within Heroku and omitting it from the requirements.txt. As the saying goes: “Fool me once, shame on you; fool me twice, shame on me.”

Now all the remains is spicing up the front-end of the app. But that will have to wait until another day. In the meantime if you are interested in looking at the code source visit my GitHub.

Sparse Matrices

25 Apr 2019

Data Structures - Sparse Matrices

Introduction
Construction Matrices
Compressed Sparse Matrices
1. Compressed Sparse Row/Column
2. Block Sparse Row
Diagonal Matrix
Specialized Functions
Other Libraries
Final Thoughts

Introduction

A sparse matrix is a matrix that has a value of 0 for most elements. If the ratio of Number of Non-Zero (NNZ) elements to the size is less than 0.5, the matrix is sparse. While this is the mathematical definition, I will be using the term sparse for matrices with only NNZ elements and dense for matrices with all elements.

Sparse vs Dense

Storing information about all the 0 elements is inefficient, so we will assume unspecified elements to be 0. Using this scheme, sparse matrices can perform faster operations and use less memory than its corresponding dense matrix representation, which is especially important when working with large data sets in data science.

Today we will investigate all of the different implementations provided by the SciPy sparse package. This implementation is modeled after np.matrix opposed to np.ndarray, thus is restricted to 2-D arrays and having quirks like A * B doing matrix multiplication instead of element-wise multiplication.

In [0]: from scipy import sparse

In [1]: import numpy as np

In [2]: spmatrix = sparse.random(10, 10)

In [3]: spmatrix
Out[3]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
        with 1 stored elements in COOrdinate format>

In [4]: spmatrix.nnz / np.product(spmatrix.shape)  # sparsity
Out[4]: 0.01

Construction Matrices

Different sparse formats have their strengths and weaknesses. A good starting point is looking at formats that are efficient for constructing these matrices. Typically you would start with one of these forms and then convert to another when ready to do calculations.

Coordinate Matrix

Perhaps the simplest sparse format to understand is the COOrdinate (COO) format. This variant uses three subarrays to store the element values and their coordinate positions.

COO Matrix

In [5]: row = [1, 3, 0, 2, 4]

In [6]: col = [1, 4, 2, 3, 3]

In [7]: data = [2, 5, 9, 1, 6]

In [8]: coo = sparse.coo_matrix((data, (row, col)), shape=(6, 7))

In [9]: print(coo)  # coordinate-value format
# (1, 1)        2
# (3, 4)        5
# (0, 2)        9
# (2, 3)        1
# (4, 3)        6

In [10]: coo.todense()  # coo.toarray() for ndarray instead
Out[10]:
matrix([[0, 0, 9, 0, 0, 0, 0],
        [0, 2, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 5, 0, 0],
        [0, 0, 0, 6, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]])

The savings on memory consumption is quite substantial as the matrix size increases. Managing data in a sparse structure is a fixed cost unlike the case for dense matrices. The overhead incurred from needing to manage the subarrays is becomes negligible as data grows making it a great choice for some datasets.

A word of caution: only use sparse arrays if they are sufficiently sparse enough; it would be counterproductive storing a mostly nonzero array using several subarrays to keep track of position and data.

In [11]: def memory_usage(coo):
    ...:    # data memory and overhead memory
    ...:    coo_mem = (sum(obj.nbytes for obj in [coo.data, coo.row, coo.col])
    ...:               + sum(obj.__sizeof__() for obj in [coo, coo.data, coo.row, coo.col]))
    ...:    print(f'Sparse: {coo_mem}')
    ...:    mtrx = coo.todense()
    ...:    mtrx_mem = mtx.nbytes + mtrx.__sizeof__()
    ...:    print(f'Dense: {mtrx_mem}')

In [12]: memory_usage(coo)
# Sparse: 480
# Dense: 448

In [13]: coo.resize(100, 100)

In [14]: memory_usage(coo)
# Sparse: 480
# Dense: 80112

Dictionary of Keys Matrix

Dictionary Of Keys (DOK) is very much like COO except that it subclasses dict to store coordinate-data information as key-value pairs. Since it uses a hash table as storage, identifying values at any given location has constant lookup time. Use this format if you need the functionality that come with builtin dictionaries, but be mindful that hash tables hog much more memory than arrays.

In [15]: dok = sparse.dok_matrix((10, 10))

In [16]: dok[(3, 7)] = 42  # store value 42 at coordinate (3, 7)

In [17]: dok[(9, 5)]  # zero elements are accessible
Out[17]: 0.0

In [18]: dok.keys() | dok.transpose().keys()  # union of key views
Out[18]: {(3, 7), (7, 3)}

In [19]: isinstance(dok, dict)
Out[19]: True

Note: be careful of potential problems using the methods inherited from dict; they don’t always behave.

In [20]: out_of_bounds = (999, 999)

In [21]: dok[out_of_bounds] = 1  # works as expected
IndexError: Index out of bounds.

In [22]: dok.setdefault(out_of_bounds)  # silently ignored...

In [23]: dok.toarray()  # ...until now
ValueError: row index exceeds matrix dimensions

In [24]: dok.pop(out_of_bounds)  # fix issue by removing bad point

In [25]: sparse.dok_matrix.fromkeys([..., ..., ...])  # don't get me started
TypeError: __init__() missing 1 required positional argument: 'arg1'

Linked List Matrix

The most flexible format to insert data is through usage of LInked List (LIL) matrices. Data can be set via indexing and slicing syntax of NumPy to quickly populate the matrix. In my opinion, LIL is the coolest sparse format for constructing sparse matrices from scratch.

LIL Matrix

LIL stores information in lil.rows where each list represents a row index and the elements inside the list match columns. In a parallel array, lil.data, the NNZ values are stored. But unlike other sparse formats, these subarrays cannot be explicitly passed to the constructor; LIL matrices must be made from either an empty state or from existing matrices, dense or sparse. Below is an illustration of various techniques used to build up a LIL matrix.

In [26]: lil = sparse.lil_matrix((6, 5), dtype=int)

In [27]: lil[(0, -1)] = -1  # set individual point

In [28]: lil[3, (0, 4)] = [-2] * 2  # set two points

In [29]: lil.setdiag(8, k=0)  # set main diagonal

In [30]: lil[:, 2] = np.arange(lil.shape[0]).reshape(-1, 1) + 1  # set entire column

In [31]: lil.toarray()
Out[31]:
array([[ 8,  0,  1,  0, -1],
       [ 0,  8,  2,  0,  0],
       [ 0,  0,  3,  0,  0],
       [-2,  0,  4,  8, -2],
       [ 0,  0,  5,  0,  8],
       [ 0,  0,  6,  0,  0]])

So what’s the drawback…? Well, it utilizes jagged arrays under the hood which requires np.dtype(object). This costs a lot more memory than a rectangular array, so if the data is big enough, you may be forced to work with COO instead of LIL. In short, LIL is mostly offered as a convenience, albeit an awesome one at that.

In [32]: lil.rows
Out[32]:
array([list([0, 2, 4]), list([1, 2]), list([2]), list([0, 2, 3, 4]),
       list([2, 4]), list([2])], dtype=object)

In [33]: lil.data[:, np.newaxis]  # expose jagged structure
Out[33]:
array([[list([8, 1, -1])],
       [list([8, 2])],
       [list([3])],
       [list([-2, 4, 8, -2])],
       [list([5, 8])],
       [list([6])]], dtype=object)

As an aside, Linked List Matrix is a misnomer since it does not use linked lists behind the scenes! LIL actually uses Python’s list which is a dynamic array, so it should really be called a List of Lists Matrix, in spite of what the documentation says. (A missed opportunity to christen it as LOL…)

In [34]: sparse.lil.__doc__  # module docstring
Out[34]: 'LInked List sparse matrix class\n'

Compressed Sparse Matrices

The formats described earlier are great for building sparse matrices, but they aren’t as computationally performant than more specialized forms. The reverse is true for compressed sparse matrix family, which should be treated as read-only rather than write-only. These are more difficult to understand, but with a little patience their structure can be grokked.

Compressed Sparse Row/Column

The Compressed Sparse Row/Column (CSR and CSC) formats are designed for computation in mind.

CSR Matrix

In [35]: indptr = np.array([0, 2, 3, 3, 3, 6, 6, 7])

In [36]: indices = np.array([0, 2, 2, 2, 3, 4, 3])

In [37]: data = np.array([8, 2, 5, 7, 1, 2, 9])

In [38]: csr = sparse.csr_matrix((data, indices, indptr))

In [39]: csr.todense()
Out[39]:
matrix([[8, 0, 2, 0, 0],
        [0, 0, 5, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 7, 1, 2],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 9, 0]])

Adjacent pairs of index pointers determine two things. First, their position in the pointer array is the row number. Second, these values represent the [start:stop] slice of the indices array, and their difference is the NNZ elements in each row. Using the pointers, look up the indices to determine the column for each element in the data.

CSC works exactly the same as CSR but has column based index pointers and row indices instead. Below is a diagram of the same data in this format. Notice for this particular case, CSC is slightly more compact with two fewer index pointers.

CSC Matrix

In [40]: csc = sparse.csc_matrix(csr)  # convert format

In [41]: csc.indptr
Out[41]: array([0, 1, 1, 4, 6, 7], dtype=int32)

In [42]: csc.indices
Out[42]: array([0, 0, 1, 4, 4, 6, 4], dtype=int32)

In [43]: csc.data
Out[43]: array([8, 2, 5, 7, 1, 9, 2], dtype=int64)

As promised, the compressed formats are indeed faster than their COO counterpart. For a modest-sized matrix, we see a 2x speed gain vs COO and 60x speedup vs dense!

In [44]: csr.resize(1000, 1000)

In [45]: %timeit csr @ csr
# 111 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [46]: coo = csr.tocoo()  # another way to convert

In [47]: %timeit coo @ coo
# 251 µs ± 8.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [48]: arr = csr.toarray()

In [49]: %timeit arr @ arr  # order of magnitude slower!
# 632 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Block Sparse Row

Block Sparse Row (BSR) is like CSR but stores sub-matrices rather than scalar values at locations.

In [50]: ones = np.ones((2, 3), dtype=int)

In [51]: data = np.array([ones + i for i in range(4)])

In [52]: indices = [1, 2, 2, 0]

In [53]: indptr = [0, 2, 3, 4]

In [54]: bsr = sparse.bsr_matrix((data, indices, indptr))

In [55]: bsr.todense()
Out[55]:
matrix([[0, 0, 0, 1, 1, 1, 2, 2, 2],
        [0, 0, 0, 1, 1, 1, 2, 2, 2],
        [0, 0, 0, 0, 0, 0, 3, 3, 3],
        [0, 0, 0, 0, 0, 0, 3, 3, 3],
        [4, 4, 4, 0, 0, 0, 0, 0, 0],
        [4, 4, 4, 0, 0, 0, 0, 0, 0]])

This implementation requires all the sub-matrices to have the same shape, but there are more generalized constructs with block matrices that relax this constraint. These matrices do not have their unique data structure in SciPy, but can be indirectly made via the sparse.bmat constructor function.

In [56]: A = np.arange(8).reshape(2, 4)  # can use dense arrays

In [57]: T = np.tri(5, 4)

In [58]: L = [[8] * 4] * 2  # can use lists

In [59]: I = sparse.identity(4)  # can use sparse arrays

In [60]: Z = sparse.coo_matrix((2, 3))  # zeros to create column gap

In [61]: sp.bmat([[   A,    Z,    L],
    ...:          [None, None,    I],
    ...:          [   T, None, None]], dtype=int)
Out[61]:
<11x11 sparse matrix of type '<class 'numpy.int64'>'
        with 33 stored elements in COOrdinate format>

In [62]: _.toarray()  # ipython previous output
Out[62]:
array([[0, 1, 2, 3, 0, 0, 0, 8, 8, 8, 8],
       [4, 5, 6, 7, 0, 0, 0, 8, 8, 8, 8],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])

Diagonal Matrix

Perhaps the most specialized of the formats to store sparse data is the DIAgonal (DIA) variant. It is best suited for data that appears along the diagonals of a matrix.

DIA matrix

In [63]: data = np.arange(15).reshape(3, -1) + 1

In [64]: offsets = np.array([0, -3, 2])

In [65]: dia = sparse.dia_matrix((data, offsets), shape=(7, 5))

In [66]: dia.toarray()
Out[66]:
array([[ 1,  0, 13,  0,  0],
       [ 0,  2,  0, 14,  0],
       [ 0,  0,  3,  0, 15],
       [ 6,  0,  0,  4,  0],
       [ 0,  7,  0,  0,  5],
       [ 0,  0,  8,  0,  0],
       [ 0,  0,  0,  9,  0]])

The data is stored in an array of shape (offsets) x (width) where the offsets dictate the location of each row in the data array along diagonal. Offsets are below or above the main diagonal when negative or positive respectively. Note that if a row in the data matrix is cutoff, the excess elements can assume any value (but they must have placeholders).

In [67]: dia.data.ravel()[9:12] = 0  # replace cutoff data

In [68]: dia.data
Out[68]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9,  0],
       [ 0,  0, 13, 14, 15]])

In [69]: dia.toarray()  # same array repr as earlier
Out[69]:
array([[ 1,  0, 13,  0,  0],
       [ 0,  2,  0, 14,  0],
       [ 0,  0,  3,  0, 15],
       [ 6,  0,  0,  4,  0],
       [ 0,  7,  0,  0,  5],
       [ 0,  0,  8,  0,  0],
       [ 0,  0,  0,  9,  0]])

Specialized Functions

In addition to the multitude of formats, there is a plethora of functions specialized just for sparse matrices. Use these functions whenever possible rather than their NumPy counterparts, otherwise speed performances will be compromised. Even worse, the resulting calculations could be incorrect!

inherited methods
- scipy.sparse.spmatrix.mean
- scipy.sparse.spmatrix.getcol
- scipy.sparse.spmatrix.getmaxprint
- …
general functions
- scipy.sparse.load_npz
- scipy.sparse.isspmatrix_coo
- scipy.sparse.hstack
- …
linear algebra
- scipy.sparse.linalg.svds
- scipy.sparse.linalg.inv
- scipy.sparse.linalg.norm
- …
graph algorithms
- scipy.sparse.csgraph.dijkstra
- scipy.sparse.csgraph.minimum_spanning_tree
- scipy.sparse.csgraph.connected_components
- …

Looking into the details of these are left as an exercise to the avid reader.

Other Libraries

SciPy is not the only resource for working with sparse structures in the Python ecosystem. While most appear to use the SciPy package internally, they have all made it their own. I will present several libraries that I find most compelling, but this is not supposed to be the end all be all.

Pandas

Data science today wouldn’t be what it is without Pandas, so it doesn’t come as a surprise that it supports sparse variants of its data structures. A really neat feature is that inferred elements do not have to be forms of 0!

In [70]: import pandas as pd

In [71]: ss = pd.SparseSeries.from_coo(dia.tocoo())

In [72]: ss  # uses a MultiIndex for the coordinates
Out[72]:
0  0     1
   2    13
1  1     2
   3    14
2  2     3
   4    15
3  0     6
   3     4
4  1     7
   4     5
5  2     8
6  3     9
dtype: Sparse[int64, 0]
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([12], dtype=int32)

In [73]: data = dict(A=[np.nan, 1, 2], B=[np.nan, 3, np.nan])

In [74]: sdf = pd.DataFrame(data).to_sparse()

In [75]: type(sdf).mro()  # class inheritance hierarchy
Out[75]:
[pandas.core.sparse.frame.SparseDataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.generic.NDFrame,
 pandas.core.base.PandasObject,
 pandas.core.base.StringMixin,
 pandas.core.accessor.DirNamesMixin,
 pandas.core.base.SelectionMixin,
 object]

In [76]: sdtype = pd.SparseDtype(object, fill_value='e')  # not restricted to null values

In [77]: pd.SparseArray(list('abcdeeeeeeee'), dtype=sdtype)
Out[77]:
[a, b, c, d, e, e, e, e, e, e, e, e]
Fill: e
IntIndex
Indices: array([0, 1, 2, 3], dtype=int32)

Scikit-Learn

The machine learning powerhouse, Scikit-Learn, supports sparse matrices in many areas. This is important since big data thrives on sparse matrices (assuming enough sparsity). After all, who wouldn’t want to have performance gains from these number-crunching algorithms? It hurts having to wait on CPU intensive SVMs, not to mention discovering some data won’t fit into working memory!

Scikit-Learn’s term-document matrices produced by text vectorizers result in CSR matrices. This is crucial for NLP since most words are used sparingly if at all. Naively using a dense format might otherwise cause speed bottlenecks and lots of wasted memory.

In [78]: from sklearn.feature_extraction.text import CountVectorizer

In [79]: bow = CountVectorizer().fit_transform(['demo'])

In [80]: sparse.isspmatrix(bow)
Out[80]: True

In [81]: sparse.save_npz('bag_of_words.npz', bow)  # store for future use

Other areas where Scikit-Learn has the ability to output sparse matrices include:

sklearn.preprocessing.OneHotEncoder
sklearn.preprocessing.LabelBinarizer
sklearn.feature_extraction.DictVectorizer

Moreover there are utilities that play well with sparse matrices such as scalers, a handful of decompositions, some pairwise distances, train-test-split, and many estimators can predict and/or fit sparse matrices. In short embrace their usage whenever possible to make your machine learning models more efficient.

PyData Sparse

As another implementation, PyData’s sparse library provides an interface like np.ndarray instead of np.matrix, permitting creation of multidimensional sparse arrays. The caveat is that as of the writing of this article, only COO and DOK formats are supported.

In [82]: import sparse as sp  # avoid name clobbering with scipy.sparse

In [83]: sarr = sp.random((3, 4, 2), density=0.2)  # 3-D sparse array

In [84]: sarr
Out[84]: <COO: shape=(3, 4, 2), dtype=float64, nnz=4, fill_value=0.0>

In [85]: sarr += 1  # not possible in scipy.sparse

In [86]: sarr
Out[86]: <COO: shape=(3, 4, 2), dtype=float64, nnz=4, fill_value=1.0>  # fill_value updates!

In [87]: sarr.todense()
Out[87]:
array([[[1.        , 1.        ],
        [1.        , 1.        ],
        [1.        , 1.        ],
        [1.        , 1.        ]],

       [[1.        , 1.        ],
        [1.86024163, 1.        ],
        [1.37233162, 1.1114997 ],
        [1.        , 1.        ]],

       [[1.        , 1.        ],
        [1.        , 1.16850612],
        [1.        , 1.        ],
        [1.        , 1.        ]]])

In SciPy, logical operators are not directly implemented, but AND (&) and OR (|) can be emulated by constraining the dtype to bool:

In [88]: class LogicalSparse(sparse.coo_matrix):  # scipy COO
    ...:    def __init__(self, *args, **kwargs):
    ...:        super().__init__(*args, dtype=bool, **kwargs)  # leverage existing base class
    ...:
    ...:    def __and__(self, other):  # self & other
    ...:        return self.multiply(other)
    ...:
    ...:    def __or__(self, other):  # self | other
    ...:        return self + other

Unfortunately NOT (~) is impossible since it would make a sparse matrix into a dense one (theoretically self - 1). Until now, that is. As seen earlier, sparse will dynamically update the fill value to accommodate current states.

In [89]: mask = sp.eye(100, dtype=bool)

In [90]: mask
Out[90]: <COO: shape=(100, 100), dtype=bool, nnz=100, fill_value=False>

In [91]: ~mask
Out[91]: <COO: shape=(100, 100), dtype=bool, nnz=100, fill_value=True>

Final Thoughts

Hopefully this article has enlightened how to use sparse data structures properly so you can go forth and use them with confidence for future projects. Knowing the pros and cons of each format (including dense) will aid in selecting the optimal one for a given task. Be mindful that while sparse matrices are are great tool, they are not necessarily a replacement for arrays. If a matrix is not sufficiently sparse, the multitude of storage arrays behind the scenes will actually take up more resources than a regular dense array would. Furthermore if you need to regularly mutate an array, perform computations in between, and display output, then sparsity simply isn’t worth the trouble. But all these concerns aside, hopefully sparse matrices can help “lighten” your load.

Comparison of Matrix Formats

	COO	DOK	LIL	CSR	CSC	BSR	DIA	Dense
indexing	no	yes	yes	yes	yes	no†	no	yes
“write-only”	yes	yes	yes	no	no	no	no	yes
“read-only”	no	no	no	yes	yes	yes	yes	yes
low memory‡	yes	no	no	yes	yes	yes	yes	no
PyData sparse	yes	yes	no	no	no	no	no	n/a

† BSR raises NotImplementedError: rather than explicitly raising TypeError: 'xxx_matrix' object is not subscriptable. So maybe in the future it will support indexing.

‡ Assuming enough NNZ.

In [92]: exit()  # the end :D

Code used to create the above animations is located at my GitHub.

Older Newer

Matt Eding Python & Data Science Blog:
About Archive Feed