I am thrilled to announce that “C++ Fundamentals” is finally ready! Francesco Zoffoli and I have been working on this for quite a while now.
Hit the ground running with C++, the language that powers tech giants around the globe.
If you’re a developer looking to learn a new powerful language, or are familiar with C++ but want to update your knowledge with the modern paradigms of C++11, C++14, and C++17, then this book is for you.
The answer is simple: knowing C++ can pay off for software developers more than any other language or skill. When looking at average salary by engineering skill, C++ typically comes out on top.
There is more: C++ is one of the few modern languages that can actually teach you computer engineering. Learning C++ will make you understand what is really going on behind the scenes. Also, once you have learned C++, all the other languages will become easier to learn.
C++ Fundamentals begins by introducing you to the C++ syntax. You will study the semantics of variables, along with their advantages and trade-offs, and see how they can be best used to write safe and efficient code. With the help of this course, you’ll be able to compile fully working C++ programs and understand how variables, references, and pointers can be used to manipulate the state of a program. You will then explore functions and classes — the features that C++ offers to organize a program — and use them to solve more complex problems. You’ll also understand common pitfalls and modern best practices, especially the ones that diverge from the C++98 guideline.
As you advance through the chapters, you’ll study the advantages of generic programming and write your own templates to make generic algorithms that work with any type. This C++ course will help you to fully exploit standard containers and understanding how to pick the appropriate container for each problem. You will even work with a variety of memory management tools in C++.
By the end of this book, you will not only be able to write efficient code, but will also be equipped to improve the readability, performance, and maintainability of your programs using standard algorithms.
C++ Fundamentals is currently sold on the Packt website and on the Amazon US, UK, IT, ES, and DE stores.
The basic Nearest Neighbor (NN) algorithm is simple and can be used for classification or regression. NN is a non-parametric approach, and the intuition behind it is that similar examples \(x^t\) should have similar outputs \(r^t\).
Given a training set, all we need to do to predict the output for a new example \(x\) is to find the “most similar” example \(x^t\) in the training set.
A slight variation of NN is k-NN: given an example \(x\) whose output we want to predict, we find the k nearest samples in the training set. The basic Nearest Neighbor algorithm does not handle outliers well because it has high variance: its predictions can vary a lot depending on which examples happen to appear in the training set. The k Nearest Neighbor algorithm mitigates this problem.
To do classification, after finding the \(k\) nearest samples, we take the most frequent of their labels. For regression, we can take the mean or median of the k neighbors, or we can solve a linear regression problem on the neighbors.
Nonparametric methods are still subject to underfitting and overfitting, just like parametric methods. In this case, 1-nearest neighbors is overfitting since it reacts too much to the outliers. High \(k\), on the other hand, would underfit. As usual, cross-validation can be used to select the best value of \(k\).
The very word “nearest” implies a distance metric. How do we measure the distance from a query point \(x^i\) to an example point \(x^j\)?
Typically, distances are measured with a Minkowski distance or \(L^p\) norm, defined as:
\[L^p(x^i, x^j) = \left( \sum_d |x_d^i - x_d^j |^p \right)^{\frac{1}{p}}\]

With \(p = 2\) this is the Euclidean distance and with \(p = 1\) it is the Manhattan distance. With Boolean attribute values, the number of attributes on which the two points differ is called the Hamming distance.
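As a quick illustration (a sketch of the formula above, not the implementation used later in the post; the function name is mine):

```python
def minkowski_distance(xi, xj, p):
    # L^p norm between two equal-length vectors xi and xj
    return sum(abs(a - b) ** p for a, b in zip(xi, xj)) ** (1.0 / p)
```

With p = 2 this reduces to the Euclidean distance: minkowski_distance([0, 0], [3, 4], 2) gives 5.0, while p = 1 gives the Manhattan distance 7.0.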
For our purposes we will adopt the Euclidean distance, and since our dataset has two attributes we can use the following function, where \(x^t = (x_t, y_t)\).
@staticmethod
def __euclidean_distance(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)
Instead of computing an average of the \(k\) neighbors, we can compute a weighted average of the neighbors. A common way to do this is to weight each of the neighbors by a factor of \(1/d\), where \(d\) is its distance from the test example. The weighted average of neighbors \(x_1 , \dots , x_k\) is then \(\left(\sum_{t=1}^k (1/d_t)\, r^t\right) / \left(\sum_{t=1}^k (1/d_t)\right)\), where \(d_t\) is the distance of the \(t\)-th neighbor.
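A minimal sketch of this \(1/d\) weighting (the function name and the neighbor representation are mine, not from the post's implementation):

```python
def weighted_average(neighbors):
    # neighbors: list of (output r, distance d) pairs; each neighbor is
    # weighted by 1/d, so closer neighbors contribute more to the result
    num = sum(r / d for r, d in neighbors)
    den = sum(1.0 / d for r, d in neighbors)
    return num / den
```

For example, with neighbors (10, 1.0) and (20, 2.0) the closer neighbor dominates and the result is pulled toward 10.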
For our implementation, we chose to use a weighted distance following a paper^{1} which proposes another improvement to basic k-NN, where the weights of the nearest neighbors are assigned based on a Gaussian distribution.
@staticmethod
def gaussian(dist, sigma=1):
    return 1./(math.sqrt(2.*math.pi)*sigma)*math.exp(-dist**2/(2*sigma**2))
Given a training set, we need first to store it as we will use it at prediction time. Clearly, k cannot be bigger than the training set itself.
class kNN(object):
    def __init__(self, x, y, k, weighted=False):
        assert k <= len(x), "k cannot be greater than training_set length"
        self.__x = x
        self.__y = y
        self.__k = k
        self.__weighted = weighted
Predicting the output for a new example \(x\) is conceptually trivial. All we need to do is compute the distance between \(x\) and every example in the training set and select the \(k\) closest ones.
Depending on whether we are doing classification or regression, we treat those \(k\) examples differently. In this case we will do regression, so our prediction will just be the average of the samples.
def predict(self, test_set):
    predictions = []
    for i, j in test_set.values:
        distances = []
        for idx, (l, m) in enumerate(self.__x.values):
            dist = self.__euclidean_distance(i, j, l, m)
            distances.append((self.__y[idx], dist))
        distances.sort(key=operator.itemgetter(1))
        v = 0
        total_weight = 0
        for n in range(self.__k):
            weight = self.gaussian(distances[n][1])
            if self.__weighted:
                v += distances[n][0] * weight
            else:
                v += distances[n][0]
            total_weight += weight
        if self.__weighted:
            predictions.append(v / total_weight)
        else:
            predictions.append(v / self.__k)
    return predictions
If we are happy with an implementation that takes \(O(N)\) time per query, then that is the end of the story. If not, there are possible optimizations using indexes based on additional data structures, e.g. k-d trees or hash tables, which I might write about in the future.
k Nearest Neighbor estimation was proposed sixty years ago, but because of the need for large memory and computation, the approach was not popular for a long time. With advances in parallel processing and with memory and computation getting cheaper, such methods have recently become more widely used. Unfortunately, it can still be quite computationally expensive when it comes to large training dataset as we need to compute the distance for each sample. Some indexing (e.g. k-d tree) may reduce this cost.
Also, when we consider low-dimensional spaces and we have enough data, NN works very well in terms of accuracy, as we have enough nearby data points to get a good answer. As the number of dimensions rises, the algorithm performs worse: the distance measure becomes less and less meaningful when the dimensionality of the data increases significantly (the so-called curse of dimensionality).
On the other hand, k-NN is quite robust to noisy training data, especially when a weighted distance is used.
To test our k-NN implementation we will perform experiments using a version of the automobile dataset from the UC Irvine Repository. The problem will be to predict the miles per gallon (mpg) of a car, given its displacement and horsepower. Each example in the dataset corresponds to a single car.
Number of Instances: 291 in the training set, 100 in the test set
Number of Attributes: 2 continuous input attributes, one continuous output
Attribute Information:
1. displacement: continuous
2. horsepower: continuous
3. mpg: continuous (output)
The following is an extract of the dataset:
displacement,horsepower,mpg
307,130,18
350,165,15
318,150,18
304,150,16
302,140,17
429,198,15
454,220,14
440,215,14
455,225,14
First, we read the data using pandas.
import pandas
training_data = pandas.read_csv("auto_train.csv")
x = training_data.iloc[:,:-1]
y = training_data.iloc[:,-1]
test_data = pandas.read_csv("auto_test.csv")
x_test = test_data.iloc[:,:-1]
y_test = test_data.iloc[:,-1]
Using the data in the training set, we predicted the output for each example in the test set, for \(k = 1\), \(k = 3\), and \(k = 20\), and report the squared error on the test set. As we can see, the test error goes down as \(k\) increases.
from kNN import kNN
from sklearn.metrics import mean_squared_error

for k in [1, 3, 20]:
    classifier = kNN(x, y, k)
    pred_test = classifier.predict(x_test)
    test_error = mean_squared_error(y_test, pred_test)
    print("Test error with k={}: {}".format(k, test_error * len(y_test)/2))
Test error with k=1: 2868.0049999999997
Test error with k=3: 2794.729999999999
Test error with k=20: 2746.1914125
Using weighted k-NN we obtained better performance than with simple k-NN.
from kNN import kNN
from sklearn.metrics import mean_squared_error

for k in [1, 3, 20]:
    classifier = kNN(x, y, k, weighted=True)
    pred_test = classifier.predict(x_test)
    test_error = mean_squared_error(y_test, pred_test)
    print("Test error with k={}: {}".format(k, test_error * len(y_test)/2))
Test error with k=1: 2868.005
Test error with k=3: 2757.3065023859417
Test error with k=20: 2737.9437262401907
This is how the full implementation looks after putting all the parts together.
You can find the whole source code and the dataset used here: https://github.com/amallia/kNN
#!/usr/bin/env python
import math
import operator


class kNN(object):
    def __init__(self, x, y, k, weighted=False):
        assert k <= len(x), "k cannot be greater than training_set length"
        self.__x = x
        self.__y = y
        self.__k = k
        self.__weighted = weighted

    @staticmethod
    def __euclidean_distance(x1, y1, x2, y2):
        return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

    @staticmethod
    def gaussian(dist, sigma=1):
        return 1./(math.sqrt(2.*math.pi)*sigma)*math.exp(-dist**2/(2*sigma**2))

    def predict(self, test_set):
        predictions = []
        for i, j in test_set.values:
            distances = []
            for idx, (l, m) in enumerate(self.__x.values):
                dist = self.__euclidean_distance(i, j, l, m)
                distances.append((self.__y[idx], dist))
            distances.sort(key=operator.itemgetter(1))
            v = 0
            total_weight = 0
            for n in range(self.__k):
                weight = self.gaussian(distances[n][1])
                if self.__weighted:
                    v += distances[n][0] * weight
                else:
                    v += distances[n][0]
                total_weight += weight
            if self.__weighted:
                predictions.append(v / total_weight)
            else:
                predictions.append(v / self.__k)
        return predictions
Sarma, T. Hitendra et al. An improvement to k-nearest neighbor classifier. 2011 IEEE Recent Advances in Intelligent Computational Systems (2011): 227-231. ↩
I finally found some time to do some machine learning. It is something I have always wanted to start practicing, as it is pretty clear that it is the future of complex problem solving. Indeed, for some tasks, we do not have an algorithm we can write and execute, so we make it up from the data.
Machine learning uses the theory of statistics in building mathematical models.^{1} - Ethem Alpaydin
A typical example of a problem ML tries to solve is classification. It can be expressed as the ability, given some input data, to assign a 'class label' to a sample.
To make things clearer, let's consider an example. Imagine we performed an analysis on samples of objects and collected their specs. Now, given this information, we would like to know whether an object is a window glass (from a vehicle or building) or not a window glass (containers, tableware, or headlamps). Unfortunately, we do not have a formula which, given these values, provides us with the answer.
Someone who has handled glass might be able to tell just by looking at or touching it whether it is window glass or not. That is because they have acquired experience by looking at many examples of different kinds of glass. That is exactly what happens with machine learning. We say that we 'train' the algorithm to learn from known examples.
We provide a ‘training set’ where we specify both the input specs of the class and its category. The algorithm goes through the examples, learns the distinctive features of a window glass and so it can infer the class of a given uncategorized example.
We will use a dataset titled ‘Glass Identification Database’, created by B. German from Central Research Establishment Home Office Forensic Science Service. The original dataset classified the glass into 7 classes: 4 types of window glass classes, and 3 types of non-window glass classes. Our version treats all 4 types of window glass classes as one class, and all 3 types of non-window glass classes as one class.
Every row is an example and contains 11 attributes as listed below.
The following is an extract of the dataset:
1,1.51824,12.87,3.48,1.29,72.95,0.6,8.43,0,0,1
2,1.51832,13.33,3.34,1.54,72.14,0.56,8.99,0,0,1
3,1.51747,12.84,3.5,1.14,73.27,0.56,8.55,0,0,1
...
196,1.52315,13.44,3.34,1.23,72.38,0.6,8.83,0,0,2
197,1.51848,13.64,3.87,1.27,71.96,0.54,8.32,0,0.32,1
198,1.523,13.31,3.58,0.82,71.99,0.12,10.17,0,0.03,1
199,1.51905,13.6,3.62,1.11,72.64,0.14,8.76,0,0,1
200,1.52213,14.21,3.82,0.47,71.77,0.11,9.57,0,0,1
One of the simplest yet effective algorithms that should be tried for the classification problem is Naive Bayes. It is a probabilistic method based on Bayes' theorem with naive independence assumptions between the input attributes.
We define C as the class we are analyzing and x as the input data, or observation. The equation below, which is Bayes' theorem, gives the probability of class C given the observation x. This is equal to the probability of class C (without considering the input) multiplied by the probability of the observation given class C, divided by the probability of the observation.
P(C) is also called the ‘prior probability’ because it is the knowledge we have as to the value of C before looking at the observables x. We also know that P(C = 0) + P(C = 1) = 1.
P(x | C) is called the class likelihood, which is the probability that an event belonging to C has the associated observation value x. In statistical inference, we make a decision using the information provided by a sample. In this case, we assume that the sample is drawn from some distribution that obeys a known model, for example, Gaussian. Part of this task is to generate the Gaussian that describes our data, so we can use the probability density function to compute the probability for a given attribute ^{2}. As already mentioned, every attribute will be treated as independent from the others.
Finally, P(x), also called the evidence, is the probability that an observation x is seen, regardless of the class C of the example.
\[P(C|x) = \frac{P(C) \cdot P(x|C)}{P(x)}\]

The above equation gives the 'posterior probability', which is the probability of class C after having seen the observation x.
At this point, given the posterior probability of several classes, we are able to decide which one is the most likely. It is interesting to notice that the denominator would be the same for all the classes, so we can simplify the calculation by comparing only the numerator of the Bayes’ theorem.
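To make the decision rule concrete, here is a tiny sketch with two classes and a single Gaussian attribute, comparing only the numerators \(P(C)\,P(x|C)\). The priors and the mean/variance values are made-up numbers for illustration, not taken from the glass dataset:

```python
import math

def gaussian_pdf(x, mean, variance):
    # probability density of a Gaussian with the given mean and variance
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

priors = {0: 0.5, 1: 0.5}                # hypothetical P(C) for each class
params = {0: (0.0, 1.0), 1: (3.0, 1.0)}  # hypothetical (mean, variance) per class
x = 2.5                                  # the observation

# unnormalized posteriors: P(C) * P(x|C); the shared denominator P(x) is dropped
scores = {c: priors[c] * gaussian_pdf(x, *params[c]) for c in priors}
predicted = max(scores, key=scores.get)  # class with the largest numerator
```

Since 2.5 is much closer to the mean of class 1, the comparison picks class 1 without ever computing P(x).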
First things first, we want to read our dataset so we can perform analysis on it. It is a CSV file, so we could use the csv Python library, but I personally prefer to use something more powerful like pandas.
Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. ^{3}
pandas.read_csv will read our CSV file into a DataFrame, which is a two-dimensional tabular data structure with labeled axes. In this way, our dataset will be damn easy to manipulate.
I also decided to label my columns so everything will be much clearer.
ATTR_NAMES = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]
FIELD_NAMES = ["Num"] + ATTR_NAMES + ["Class"]
data = pandas.read_csv(args.filename, names=FIELD_NAMES)
Now that we have our dataset in memory we want to split it into two parts: the training set and the test set. The former will be used to train our ML model, while the latter to check how accurate the model is.
The following code will split the data by dividing the dataset into chunks (based on blocks_num) and choose as test set the chunk at position test_block, which is also removed from the training set.
If nothing is provided apart from the dataset, the function will just use the same data for both training and test sets.
def split_data(data, blocks_num=1, test_block=0):
    blocks = numpy.array_split(data, blocks_num)
    test_set = blocks[test_block]
    if blocks_num > 1:
        del blocks[test_block]
    training_set = pandas.concat(blocks)
    return training_set, test_set
Estimating the P(C) of a given training sample is pretty straightforward. Prior probabilities are based on previous experience, in this case, the percentage of a class in the dataset.
We want to count the frequency of each class and get the ratio by dividing by the number of examples. The code to do so is extremely concise, also because pandas library makes the calculation of frequencies trivial.
def __prior(self):
    counts = self.__training_set["Class"].value_counts().to_dict()
    # stored as a set of (class, prior) pairs
    self.__priors = {(k, v / self.__n) for k, v in counts.items()}
To calculate the 'pdf' (probability density function) we need to know what the distribution that describes our data looks like. To do that, we compute the mean and the variance (or equivalently the standard deviation) of each attribute for every single class. Since we have 9 attributes and 2 classes in our dataset, we end up with 18 mean-variance pairs.
Again, for this task we can use the helper functions provided by pandas: we select the column of interest and call its mean() and std() methods.
def __calculate_mean_variance(self):
    self.__mean_variance = {}
    for c in self.__training_set["Class"].unique():
        filtered_set = self.__training_set[
            self.__training_set["Class"] == c]
        m_v = {}
        for attr_name in ATTR_NAMES:
            m_v[attr_name] = [
                filtered_set[attr_name].mean(),
                math.pow(filtered_set[attr_name].std(), 2),
            ]
        self.__mean_variance[c] = m_v
The function to compute the ‘pdf’ is just a static method that takes as input the value of the attribute and the description of the Gaussian (mean and variance) and returns a probability according to the ‘pdf’ equation.
@staticmethod
def __calculate_probability(x, mean, variance):
    exponent = math.exp(-(math.pow(x - mean, 2) / (2 * variance)))
    return (1 / (math.sqrt(2 * math.pi * variance))) * exponent
Now that we have everything in place, it is time to predict our classes.
Basically, what the following code does is iterate through the test set and, for each sample, calculate the probability of every class using Bayes' theorem. The only difference is that we use log probabilities, since the probabilities for each class given an attribute value are small and could underflow.
So it becomes: \(\log[p(x|C) \cdot P(C)] = \log P(C) + \sum_{i=1}^{9} \log p(x_i|C)\)
def predict(self):
    predictions = {}
    for _, row in self.__test_set.iterrows():
        results = {}
        for k, v in self.__priors:
            p = 0
            for attr_name in ATTR_NAMES:
                prob = self.__calculate_probability(
                    row[attr_name],
                    self.__mean_variance[k][attr_name][0],
                    self.__mean_variance[k][attr_name][1])
                if prob > 0:
                    p += math.log(prob)
            results[k] = math.log(v) + p
        # pick the class with the highest log-probability;
        # ties are broken by taking the largest class label
        best = max(results.values())
        predictions[int(row["Num"])] = max(
            k for k, v in results.items() if v == best)
    return predictions
As a result, we need to take as a prediction the class with the highest probability. If two or more classes end up having the same probability we decided to take the class which comes earlier in reverse alphabetical order, but this was not really needed for the given dataset.
Once we obtain the predictions, we can compare them to the class values present in the test dataset and calculate the ratio of correct predictions over the total. This measure is called accuracy and allows us to estimate the quality of the ML model used.
def calculate_accuracy(test_set, predictions):
    correct = 0
    for _, t in test_set.iterrows():
        if t["Class"] == predictions[t["Num"]]:
            correct += 1
    return (correct / len(test_set)) * 100.0
In our tests, we obtained a 90% accuracy using the same dataset for both training and test.
Now that we know how to perform a prediction, let’s look at the data again. Does it really make any sense to train an algorithm on something and then test it on the same data? Probably not. We want to have two different sets then, but this is not always possible when you do not have enough data.
Our example dataset contains 200 records. Ideally, we would like to squeeze as much out of it as we can and test on all 200 samples, but then we would not have anything left to train the model.
The way ML people do this is called cross validation. The dataset is divided into chunks (as shown before), say 5, and the model is trained on 4 of the 5 chunks while the remaining chunk is used for testing. This operation is repeated as many times as the number of chunks, so that the test is performed on every chunk. Finally, the accuracy values collected over the repetitions are averaged.
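The fold bookkeeping can be sketched with a small pure-Python helper (the function name is mine, not part of the post's code; a real run would plug split_data and the classifier into this loop):

```python
def k_fold_indices(n, k):
    # yield (train_indices, test_indices) pairs for k-fold cross validation;
    # every example appears in exactly one test fold
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size
```

For 200 records and 5 folds, each iteration trains on 160 examples and tests on the remaining 40; the per-fold accuracies are then averaged.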
Again, even using 5-fold cross validation we obtained the same accuracy equal to 90%.
A Zero-R classifier simply predicts the majority class (the class that is most frequent in the training set). Sometimes a not-very-intelligent learning algorithm can achieve high accuracy on a particular task simply because the task is easy. For example, it can achieve high accuracy on a 2-class problem if the dataset is very imbalanced.
Running a Zero-R classifier on our dataset just as a comparison with Naive Bayes, it achieved 74.5% accuracy.
Here is the trivial implementation:
class zero_r_classifier(object):
    def __init__(self, training_set, test_set):
        self.__test_set = test_set
        classes = training_set["Class"].value_counts().to_dict()
        self.__most_freq_class = max(classes, key=classes.get)

    def predict(self):
        predictions = {}
        for _, row in self.__test_set.iterrows():
            predictions[int(row["Num"])] = self.__most_freq_class
        return predictions
Comparing the Zero-R accuracy with the Naive Bayes one, we realized that our model is quite accurate when compared to such a simplistic baseline.
One of the most popular Python libraries implementing several ML algorithms, such as classification, regression, and clustering, is scikit-learn. The library also has a Gaussian Naive Bayes classifier implementation and its API is fairly easy to use. You can find the documentation and some examples here: http://scikit-learn.org/…/sklearn.naive_bayes.GaussianNB.html
This implementation is definitely not production ready, even though it produces the same predictions as scikit-learn, since what happens under the hood is the same. On the other hand, it has not been over-engineered, as its scope was only to play with Naive Bayes. Anyway, most of the time a simple implementation is easier to read and more instructive. You can find the whole source code and the dataset used here: https://github.com/amallia/GaussianNB
Ethem Alpaydin. 2014. Introduction to Machine Learning. The MIT Press. ↩
This post instead is about compression of monotone non-decreasing integer lists using Elias-Fano encoding. It may sound like a niche algorithm, something that solves an infrequent problem, but that is not the case. Inverted indexes ^{1}, the most common data structure used by search engines to index their data, are made of lists of increasing integers corresponding to the documents of the collection. I might write about inverted indexes in a more comprehensive way in the future if this is a topic of your interest; in that case, please let me know with a comment.
Elias-Fano encoding was proposed independently by Peter Elias and Robert Mario Fano in the 70s, but its usefulness has been rediscovered recently. The Elias-Fano representation is an elegant encoding scheme that represents a monotone non-decreasing sequence of n integers from the universe \([0 \dots m)\) in \(2n + n\lceil\log{\frac{m}{n}}\rceil\) bits, while supporting constant-time access to the i-th element.
If we compare the Elias-Fano space requirement with the theoretical lower bound, we realize that this structure is close to the bound, which is why it has been dubbed a quasi-succinct index^{2}.
In the Elias-Fano representation each integer is first binary encoded using \(\lceil\log{m}\rceil\) bits. The binary representation of each element is split in two: the higher part, consisting of the first (left to right) \(\lceil\log{n}\rceil\) bits, and the lower part with the remaining \(\lceil\log{m} - \log{n}\rceil = \lceil\log{m/n}\rceil\) bits. The concatenation of the lower parts of all the elements is stored explicitly and trivially takes \(n\lceil\log{m/n}\rceil\) bits. The higher parts, instead, are stored in unary, specifically in a bit-vector of size \(n + m/2^{\lceil\log{m/n}\rceil} \approx 2n\) bits. It is constructed starting from an empty bit-vector: we add a 0 as a stop bit for each possible value representable with the bits of the higher part, and we add a 1 for each value actually present, positioning it before the corresponding stop bit. This makes it clearer why we use exactly 2n bits: one bit set to 1 for each of the n elements and one 0 bit for each of the possible distinct values obtainable with \(\lceil\log{n}\rceil\) bits. Finally, the Elias-Fano representation is the bit-vector resulting from the concatenation of the higher and the lower parts.
As an example, let's take the sorted list {2,3,5,7,11,13,24} as shown in Figure 1. In this case we know that m (the universe of the list) is 24, and to represent all the elements in fixed-length binary we need 5 bits per element.
Then we want to split the binary representation of each element into two parts, the higher and the lower. Since we have 7 elements in total, we will use 3 bits for the higher part and 2 for the lower one, as explained previously. If we consider 2 => 0b00010, we will have 000 and 10 respectively.
We repeat this process for every element of the list and we concatenate all the lower parts together.
Regarding the higher bits, since we use 3 bits per element we can imagine having \(2^3\) buckets, with a counter associated to each bucket corresponding to the cardinality of that bucket. For 2 we will increment the 000 bucket. To the same bucket goes 3, while 5 will increment 001, and so on and so forth. There might be cases where a bucket's counter is zero, as it is for 100 in Figure 1.
Finally, we use unary encoding to represent the buckets’ counters, specifically we append as many 1-bits as the counter value of each bucket followed by a 0-bit.
In the case of the 000 bucket we will add 2 set bits and an unset one to separate it from the following bucket.
The final Elias-Fano encoding is obtained by concatenating higher and lower bits just obtained.
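The whole construction above can be sketched in a few lines of Python (the function and variable names are mine; this is a didactic sketch working on bit strings, not an efficient bit-packed implementation):

```python
import math

def elias_fano_encode(values, universe):
    # values: sorted, non-decreasing integers in [0, universe)
    n = len(values)
    total_bits = max(1, math.ceil(math.log2(universe)))    # bits per element
    low_bits = max(1, math.ceil(math.log2(universe / n)))  # width of the lower part
    high_bits = total_bits - low_bits                      # width of the higher part
    # concatenation of the lower parts, stored verbatim
    lows = "".join(format(v & ((1 << low_bits) - 1), "0{}b".format(low_bits))
                   for v in values)
    # one bucket per possible value of the higher part
    buckets = [0] * (1 << high_bits)
    for v in values:
        buckets[v >> low_bits] += 1
    # unary-encoded bucket counters: c ones followed by a 0 stop bit
    highs = "".join("1" * c + "0" for c in buckets)
    return highs + lows
```

Running it on the example list (with the universe taken as 25, one past the largest element) reproduces the higher bits 110110101000100 followed by the lower bits 10110111110100.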
Now, we show how to get an element given the information we have. Interestingly, with this type of encoding, we can have random access for both Access and NextGEQ operations in nearly constant time.
Access(i) is the operation of retrieving the element at position i from the original list of elements.
To get the lower part we can simply jump to the corresponding bits, since we know the length stored for each element. To compute the higher part we need to perform select_1(i) - i, where select_1(i) is defined as the operation that returns the position of the i-th set bit; there are techniques to perform it in nearly constant time ^{3}.
Another interesting operation is NextGEQ(x), which returns the next integer of the sequence that is greater than or equal to x. We retrieve the position p by performing select_0(hx) − hx, where hx is the bucket to which x belongs. At this point, we start scanning the elements from position p and stop at the first one greater than or equal to x. The scan traverses at most the size of the bucket.
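Conceptually, NextGEQ computes the same answer as a binary search over the decoded sequence; the point of Elias-Fano is that it reaches that answer without decompressing the list. A sketch of the semantics only (not of the bit-level procedure; the function name is mine):

```python
import bisect

def next_geq(sorted_values, x):
    # smallest element of the sequence that is >= x, or None if there is none
    i = bisect.bisect_left(sorted_values, x)
    return sorted_values[i] if i < len(sorted_values) else None
```

For the example list, next_geq([2, 3, 5, 7, 11, 13, 24], 6) returns 7.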
Elias-Fano is a very effective encoding, as it allows random access to the sequence, without decoding it, in constant time. As highlighted in the academic literature ^{4}, Elias-Fano demonstrates its power particularly in list intersection, outperforming other forms of compression.
I would like to point out my Golang implementation (https://github.com/amallia/go-ef) of Elias-Fano, which is still in early stage. Feel free to get involved in the development.
A very good implementation is the one from Facebook present in Folly (https://github.com/facebook/folly/blob/master/folly/experimental/EliasFanoCoding.h).
Leave me a comment if you have written your own implementation and I will be more than happy to add it to the list.
Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM Comput. Surv. 38, 2, Article 6 (July 2006). ↩
Sebastiano Vigna. 2013. Quasi-succinct indices. In Proceedings of the sixth ACM international conference on Web search and data mining (WSDM ‘13). ACM, New York, NY, USA, 83-92. ↩
Sebastiano Vigna. 2008. Broadword implementation of rank/select queries. In Proceedings of the 7th international conference on Experimental algorithms (WEA’08), Catherine C. McGeoch (Ed.). Springer-Verlag, Berlin, Heidelberg, 154-168. ↩
Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, and Steven Swanson. 2017. An Experimental Study of Bitmap Compression vs. Inverted List Compression. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ‘17). ACM, New York, NY, USA, 993-1008. ↩
A bitmap, also referred to as bit-vector or bit-array, is a sequence of 0s and 1s which typically encodes a more complex object.
A common example of this is a set of numbers where each element is indicated by a set bit in a bitmap of length equal to the greatest element plus one (as we count from zero), commonly referred to as the universe. As an example, the set {3,5,21,4,23,12} can be represented as 101000000001000000111000, where - counting right to left - we have a 1-bit at the positions corresponding to the elements of the initial set.
The importance of bitmaps is irrefutable, which is why I recently started investigating the most effective techniques used to compress them. Reducing their memory usage means being able to store more data or, possibly, fit them in a lower level of the cache hierarchy, which immediately translates to faster access.
The technique I would like to discuss lays the foundation for more complex ones, which I will try to cover in a future blog post. The most important property of the following compression algorithm is the ability to query the bitmap without fully decompressing it. Considering the set from the previous example, this is extremely appealing, as we would be able to tell whether an element i is present just by looking at the bit at position i.
The compression I am going to present falls into the category of data structures called succinct data structures, which allow efficient query operations while using an amount of space that is close to the information-theoretic lower bound.
Now we split the bitmap into fixed-length blocks. In the previous example the bitmap was 24 bits long; if we split it into blocks of 3 bits each, we obtain eight distinct blocks.
The idea is to code each block independently from the others, using a pair of values <C_{i},O_{i}> for the i-th block. The first element of the pair is the cardinality of the block, also referred to as population count or just popcount; the second is the offset into the table that contains all the distinct permutations (combinations) of the bits in that block ^{1}.
Let’s say we want to encode the first block, 101. Calculating C is trivial: we just count the number of bits set to 1. Most modern CPUs can also do this in hardware (I will come back to this topic again in the future), but for now we can rely on the following naive implementation.
Now let’s imagine we have a table containing all the \(\binom{3}{2} = 3\) ordered combinations of bits for the previous block. If we iterate over the rows of this table and stop when we reach the entry that matches our block, we will have computed the offset for that block. In our example, the offset of the block in the following table is 1.
0 | 011 |
1 | 101 |
2 | 110 |
In this way we can encode our block with the two integers C = 2 and O = 1.
Whenever we want to decode a block from the given C and O, we select the appropriate table of combinations using C and then read index O of that table to retrieve the original representation.
In this setting, if we were only interested in the i-th bit, we would still decode the entire block and then apply a proper mask to extract it.
So far we have established that we need to store a pair of integers for each block, that is, for each of the \(m/b\) blocks, where m is the original bitmap length and b is the fixed block size. The population count of a block cannot be greater than the block size itself, since there cannot be more than b set bits in a block, so each C coefficient can be stored in \(\lceil\log_2(b+1)\rceil\) bits. The offsets, on the other hand, are indexes into a table whose size depends on two factors: the block size and the number of set bits in the block. The former is the same for every block, but the population count can vary.
There are two lucky cases where the cardinality alone gives us enough information to infer the offset: C = 0 (an all-zeros block) and C = b (an all-ones block), since in both cases the table of combinations contains a single entry.
In these two cases we can store the offset implicitly, without sacrificing any extra space. For all the other cases we can store the offset in \(\lceil\log_2\binom{b}{C}\rceil\) bits.
For instance, the original bitmap used in the previous example takes 24 bits of actual data in its uncompressed form. To store the cardinality of each block we need 2 bits per block, for a total of 16 bits. The blocks containing all zeros or all ones are encoded implicitly, while the blocks with C = 1 and C = 2 need 2 additional bits each. This sums up to 20 bits to represent our uncompressed 24-bit bitmap, a saving of 4 bits, or ~16% of the initial size.
At this point we know how to encode and decode a block; what is still missing is a way to generate the lookup table of combinations. The answer is that we don’t generate it: the ordered binary combinations are computed on the fly ^{2}.
For small blocks a precomputed table would actually be doable, and probably also convenient, but as the block gets bigger it simply becomes infeasible. What is needed is an algorithm that computes the offset of a given block, and that can go back from the offset to the original block representation, in a reasonably efficient way.
The computation of the offset aims to find the index of a block, given its size and population count, in the table listing all the possible combinations; moreover, it needs to be deterministic and free of the overhead of an actual table. The algorithm iterates over every bit in the block: if the bit is unset, it moves to the next one (the remaining block size decreases but the cardinality does not vary); if it is set, it increases the offset by \(\binom{n-1}{count}\) (now both the remaining block size and the cardinality decrease by one), where n is the number of bits still to process, including the current one, and count is the number of set bits still to encounter.
Now we need to reverse the encoding process. If \(offset \geq \binom{n-1}{count}\), then the current bit of the block was a 1: we subtract \(\binom{n-1}{count}\) from the offset and decrement both n and count; otherwise it was a 0 and we decrement only n. At every iteration the remaining block size decreases by one. We can stop when we have processed the whole block, or when count reaches 0, as that means the remaining bits are all unset.
Since most of the time we are interested in a single bit of the block, and blocks can be quite long and therefore slow to decode, we can apply two optimizations. The first is to stop the iteration as soon as we reach the given position; the second is to perform a binary search instead of the linear scan described above. Combining the two is ideal: a linear scan is still faster when the position we are interested in falls within the first \(\log_2(block\_size)\) elements.
I feel this is a nice and elegant way to compress a bitmap while keeping the ability to decode a block in constant time, as the on-the-fly decoding only depends on the block size, which is fixed. I am also sure that further improvements for faster decoding are possible with the use of SIMD instructions.
Feel free to get in touch if you want to share any feedback or have any ideas about the topic and would like to dig more into it.
Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. 2002. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms (SODA ‘02). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 233-242. ↩
Gonzalo Navarro and Eliana Providel. 2012. Fast, small, simple rank/select on bitmaps. In Proceedings of the 11th international conference on Experimental Algorithms (SEA’12), Ralf Klasing (Ed.). Springer-Verlag, Berlin, Heidelberg, 295-306. ↩