IntSet Development
- Written by:
- Athan Clark
- Date:
- Keywords:
- IntSet, JavaScript, Bit Masking
Abstract
On 2023-11-21, I developed a set container for JavaScriptbigint
s. This allows for unions, intersections, and other set operations, but in a relatively fast manner. This blog post describes my development methodology, the issues I faced, and how I overcame them. Table of Contents
tl;dr, Don’t use more than 16k values, and don’t have a variance greater than 65k between the minimum and maximum values.
Preamble
The purpose of IntSet is to act as a container for bigint
s by utilizing bit masks and bitwise operators. This would improve the runtime performance and reduce memory use compared to traditional methods, assuming the contents of the sets are “relatively close” to one another.
“Relatively close”, meaning the spread of integers within the set, or the range of values (the difference between the minimum value and maximum value) is lower than 65k. The reason why is discussed in the benchmarks.
Traditionally, if you have a group of values you need to retain, the everyday programmer would choose something simple like an array and manually check for uniqueness, while others would look to a tree-like structure like a binary tree, or better, a red-black tree, to contain the data. These are easy solutions, and doesn’t require much thought, but the performance of these structures matters when dealing with high throughput (either data or operations). Arrays grant O(1) insertion, but harbor O(n) for search and deletion. A red-black tree will grant O(log n) for all three operations in the worst case, but we can do better.
This blog post is an account of my development experience for this library, and a slight technical interlude on how it works. The audiance reading this should be familiar with basic set theory, JavaScript, and bits.
Before I go any further into the implementation, I’d like to describe the bit masking and how it works.
Bit Masking
Assume you have some non-negative real integer (natural number) x ∈ ℕ. Also, assume you have a bit space (where you can freely write and read bits at certain indicies), represented as array of bits with length n ∈ ℕ. For instance, a 32-bit integer consumes 32 bits of space, and in that circumstance, n = 32.
If x < n, then x can be represented as a single bit flipped in the space provided by n; x = 0 is the first bit, x = 1 is the second bit, x = 2 is the third bit, and so on. If you were to turn the bit space into a (little-endian / right-to-left) integer m ∈ ℕ, and if x’s bit were the only bit flipped in the bit space, then m = 2x.
As an example, let’s say x = 0, and n = 8. In this case, the value of the first bit is 1, and the value of the of the rest of the bits would be 0. If we turn this bit space into a integer m, it would equal 1 – in other words, 20 = 1. You can try this in JavaScript – the bit space would be written (using binary notation) as 0b00000001
, and JavaScript will tell us this value is actually just 1
(our m).
const m = 0b00000001;
console.log(m);
// → 1
Or, let’s say that x = 7, and n = 8 again. In that instance, n’s 8th bit is 1, and the rest are 0’s - it would look like 0b10000000
in JavaScript. Again, if we were to evaluate this, it would return 128
, which is the same as 27 = 128.
const m = 0b10000000;
console.log(m);
// → 128
Bitwise Or and Union
Now lets imagine the union of these two examples – we’ll “zip” the bit spaces together, and if either of each compared bit has a value of 1, we’ll retain it. Doing this with our previous examples of 0b00000001
and 0b10000000
would return 0b10000001
, which when evaluated in JavaScript would reveal 129
:
const m = 0b10000001;
console.log(m);
// → 129
What we mean by this bit space is “both the values of x = 0 and x = 7 are present in the bitspace n = 8”.
Fortunately, JavaScript (and in fact, most CPU architectures) implement this bitwise union through the bitwise “OR” operator |
. We can try running 128 | 1
and 0b10000000 | 0b00000001
in JavaScript, and we’ll receive the result 129
:
console.log(128 | 1);
// → 129
console.log(0b10000000 | 0b00000001);
// → 129
What does this imply?
Given a contiguous bitspace with a size n ∈ ℕ, all natural numbers less than the bitspace size (∀x ∈ ℕ such that x < n) can be represented in the bitspace via the presence of its xth bit.
Furthermore, given a set A of natural numbers less than the bitspace size (A = {x|x < n}), where each member of the set is referred to as x1…xq (i.e., there are q elements in A), the bitspace can be represented as:
mA = 2x1 ∪ 2x2 ∪ … ∪ 2xq
Where mA is the whole integer that represents the set A, and ∪ is the bitwise or operator |
. These implications are expanded on the other operators as well:
Bitwise Operator | Set Operator |
---|---|
| / OR | ∪ / Union |
& / AND | ∩ / Intersection |
^ / XOR | ⊕ / Symmetric Difference |
With these primitive operations, the difference operation can be defined as follows:
X ∖ Y = X ∪ (X⊕Y)
Unbounded Data
Now that we have a mechanism to create sets of natural numbers up to size n ∈ ℕ, we need one that could be (mostly) unbounded.
A first stab – Arrays
Initially, I imagined am array of m’s to be the most intuitive solution – the bit space representation of a set A of any value x ∈ ℕ is an array ψ of length pψ = ⌈y/n⌉ elements, each of which holds an m, with bit size of n. The value of an ith bit in a jth bitspace mj is calculated as follows:
Note, ⌈...⌉ is the ceiling operation.
i + (j×n)
Where j is the (zero-based) index in the array that contains mj.
Implementations of setwise union, intersection, and the like would be performed through zip-with – where in the case of union and symmetric difference, the elements not present in the other are simply retained as-is – the resulting array between sets with arrays ψ and ϕ would be the maximum of pψ and pϕ.
Issues with Arrays
The critical issue with this solution is the case where the set only holds large values. In that circumstance, there will be wasted 0-valued m elements denoting orders of magnitude, up until the relevant ip value – essentially, this makes the size of the set linear with respect to the maximum value of y/n regardless of the count of elements contained within the set, and likewise the setwise uperations would be O(p).
This is unacceptable, and a better solution exists.
A better stab – Maps
Rather than store empty m values, we can sparsely store m values if they are greater than 0 by utilizing the built-in Map
object in ECMAScript 6 – Now, we can store the m values with the ip index as its key:
ip ↦ m
This causes for a great deal of efficiency where x values are relatively near to each other. Particularly, if the average difference between x values are below n, then they’ll (likely) be stored in the same m value, regardless of how large the x values are.
Implementation
The code lives in my personal GitHub, and can be inspected very easily - it’s only a few hundred lines.
I tried to be thorough with the tests and benchmarks, however the latter are a bit difficult to parse, but they did help me choose a good default n value. I am confident with the default settings of IntSet
, but if you’d like to customize n, it can be supplied as an argument to new IntSet(n)
.
Tests
The tests are built using property-based testing techniques, namely via a fuzz tester library for JavaScript called fast-check. The following invariants are tested:
Name | Expression |
---|---|
Existence | ∀ X ∈ IntSet, x ∈ ℕ. X ∪ {x} ∋ x |
Non-Existence | ∀ X ∈ IntSet, x ∈ ℕ. X ∖ {x} ∌ x |
Union Commutativivty | ∀ X ∈ IntSet, Y ∈ IntSet. X ∪ Y = Y ∪ X |
Union Identity | ∀ X ∈ IntSet. ∅ ∪ X = X |
Union Associativity | ∀ X ∈ IntSet, Y ∈ IntSet, Z ∈ IntSet. (X∪Y) ∪ Z = X ∪ (Y∪Z) |
Intersection Commutativivty | ∀ X ∈ IntSet, Y ∈ IntSet. X ∩ Y = Y ∩ X |
Intersection Absorption | ∀ X ∈ IntSet. X ∩ X = ∅ |
Intersection Associativity | ∀ X ∈ IntSet, Y ∈ IntSet, Z ∈ IntSet. (X∩Y) ∩ Z = X ∩ (Y∩Z) |
Symmetric Difference Commutativivty | ∀ X ∈ IntSet, Y ∈ IntSet. X ⊕ Y = Y ⊕ X |
Symmetric Difference Identity | ∀ X ∈ IntSet. ∅ ⊕ X = X |
Symmetric Difference Absorption | ∀ X ∈ IntSet. X ⊕ X = ∅ |
Symmetric Difference Associativity | ∀ X ∈ IntSet, Y ∈ IntSet, Z ∈ IntSet. (X⊕Y) ⊕ Z = X ⊕ (Y⊕Z) |
Difference Identity | ∀ X ∈ IntSet. X ∖ ∅ = X |
Difference Absorption | ∀ X ∈ IntSet. X ∖ X = ∅ |
Difference / Union Intersection Equivalence | ∀ X ∈ IntSet, Y ∈ IntSet. X ∖ Y = Y ⊕ X |
Difference / Union Symmetric Difference Equivalence | ∀ X ∈ IntSet, Y ∈ IntSet. X ∪ (X⊕Y) = X ∖ Y |
I’ve deemed this to be a pretty thorough test suite. If you feel like more properties should be represented in the test suite, please feel free to file a bug report.
Note: IntSet in the above contexts is defined as the set of all
IntSet
s.The empty set is defined as
∅ ∈ IntSet Such that ∀x ∈ ℕ. x ∉ ∅
Benchmarks
The benchmark suite is, in my opinion, sub-standard. I originally built it to try and find a good default value for n; my assumption was that 64 would be a good value, but I wanted to verify.
I have two forms of benchmarking suites - the first uses Benchmark.js to find a “fastest” version of the union
function, and the second takes a heap snapshot of the union
function of the same two sets, and also measures the approximate total amount of data stored in that resulting set. The data being generated are two sets, each of which are filled with 216 random bigint
s, with maximum value y of 28 ≤ y ≤ 232, and implementations of n varying between 25 ≤ n ≤ 215. The following figures show the results:
Interpretations
A few observations can be made by these figures. First, let’s start with Operations Per Second:
- Sets maximize speed for random numbers generated up to 28 when n ≥ 28.
- Sets maximize speed for random numbers generated up to 212 when n ≥ 212.
- Before the speed is maximized, the approach to obtaining higher speed at a particular max random value comes at a curve.
- The entire space past 220 for maximum size is pretty flat - no drastic speed increases.
Secondly, Heap Used:
- The dominating factor is the maximum value of the generated numbers, not n.
- n’s increase does have a minor factor past n ≥ 212, especially in the extreme cases of y = 32.
Lastly, Total Size:
- Total size appears to gain exponential growth past y ≥ 24, and can be directly observed with an increase in n.
Unfortunately, the last figure doesn’t give us a great deal of insight to the (potential) total amount of memory used, but rather a philosophical perspective, which is why I included the “Heap Used” figure as well.
Given that concern, I’ve opted to reduce the chart’s ranges a bit for these next figures - n will not exceed 212, and the max number generated will not exceed 224.
Further Interpretations
From the latter two figures, a reasonable conclusion can be observed – space consumed by sets are relatively constant up until the maximum number generated becomes greater than 216. This, however, may be indirectly related to the fact that there are only 216 numbers in each set; smaller generated values would likely exhaust the sample space.
Additionally, although the assumed spatial consumtion of the sets (last figure) is assumed to increase as n increases, we can see that for cases where the maximum generated number is greater than 216, the heap is actually larger for smaller values of n. I assume this to be due to the operational overhead, i.e. callbacks and the like, because of the need to “mesh” more values together.
Conclusion
There are a few key points to receive from this investigation:
- Potential gains from this library are drastically diminished when the set contents can be larger than 216, both in terms of speed and space.
- For values generated up to 212, I recommend a default n value of 28, due to the clef in the Reduced Operations Per Second at x = 8, y = 12.
- For values generated up to 216, I recommend an n value of 212.
- For larger values, maybe consider a different library.
Considerations
A few potential deficiencies should also be identified:
- I only benchmarked
union
, not the other set-wise operations. - I only benchmarked sets that have 216 elements, not fewer ones.
- I used
Math.random()
to generate the random numbers – there may be a lack of uniformity in the random numbers being generated.
However, I don’t want to address those at this time. I hope you enjoyed this exploration! It was very insightful for me.