Building a shared library in C and using it in a Python program


Figure 1

How do old-time languages such as C, Fortran and others survive in a world with Python, Ruby and so on?

There is plenty of legacy code still around that needs maintaining, of course. And there are (will always be?) a few specific applications where low-level code is needed. But one of the great things about software is building upon old stuff using new tools, which brings us to our topic today: building a shared library containing some of our C stuff and using it in nice and comfy Python. Figure 1 shows an example of what we can achieve by using graphical tools available in Python to improve our existing code’s text-based output. More on that later on.

For our purposes, we consider shared libraries as a collection of compiled objects condensed into a single file, which may then be called by other software. This is, of course, a simplification. A longer discussion about shared and static libraries can be found in [1].



Trees, part IV – Benchmarking Red-black and AVL trees

In our previous installments we implemented two of the most well-known self-balancing binary search trees: AVL and Red-black trees.

We had a few classes on AVL trees in our basic data structures & algorithms class back in college, which made their implementation far less of a challenge than the Red-black tree. For the latter, besides the fundamental guidance of CLRS, I had to do quite some googling to get it working. While googling I noticed there were quite a lot of questions about which tree (AVL or RB) was “better” in some sense, be it insertion time, search time, deletion time, etc. Most textbooks and articles dismiss this question just by stating the constant factors in the two trees’ worst-case heights, as we briefly mentioned in the past installment. If you’re anything like me, however, you’ll want to see some comparisons where the trees are actually tested. So I decided to do some simple benchmarking to test those theoretical worst cases. Here’s what I found out.

First off, we need at least two cases: worst and average case. As we know from the previous installments, the worst possible case for BST insertion is when you are inserting continuously increasing or decreasing values, e.g. 1, 2, 3, 4, … . In this case, a pure BST would behave exactly like a (doubly) linked list, while self-balancing trees should spread out the node distribution. The worst possible searches would be for the top or bottom values, i.e. those close to the end of the “list”: a pure BST would have to traverse the entire list (O(n) time), while self-balancing trees should enjoy k·log(n) time for some factor k.

What would an “average case” look like? Hard to say; it depends on what is average for your application. It might just be the case that sequences are the average case. Since we can’t define a “universal” average case, and for the sake of simplicity, we will define the average case as a sequence of random numbers drawn from C’s rand() function (one might argue that this is actually the “best” case, since in the long run the BST will “naturally” become quite reasonably balanced, but let’s not get picky about terminology).
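To make the two cases concrete, here is a minimal Python sketch (not the benchmark code from this series, which is written in C) that inserts the same keys into a plain, non-balancing BST in sorted and in shuffled order and compares the resulting heights. The function name bst_height, the list-based node layout, the seed and n = 1000 are all assumptions made for illustration.

```python
import random

def bst_height(keys):
    """Insert keys into a plain (non-balancing) BST; return its height in nodes."""
    root = None          # each node is a list: [key, left, right]
    height = 0
    for key in keys:
        if root is None:
            root = [key, None, None]
            height = 1
            continue
        node, depth = root, 1
        while True:
            side = 1 if key < node[0] else 2   # duplicates go to the right
            if node[side] is None:
                node[side] = [key, None, None]
                height = max(height, depth + 1)
                break
            node, depth = node[side], depth + 1
    return height

n = 1000
sorted_height = bst_height(range(n))       # worst case: 0, 1, 2, ...

random.seed(42)                            # arbitrary fixed seed
random_height = bst_height(random.sample(range(n), n))

print("sorted input:  height", sorted_height)
print("shuffled input: height", random_height)
```

With sorted input the tree degenerates into a chain with one node per level, so its height equals n; with shuffled input the height should stay far smaller.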

Figure 1: average case, search and insert


Trees, part III – Red-black tree

In our last installment on trees, we studied and implemented the AVL tree. The AVL tree is one of many self-balancing binary search trees, a special kind of BST that enforces sub-linear operation costs by maintaining tree height close to the theoretical minimum of log2(n). This is usually done by what is called tree rotation, which is basically moving around tree nodes (and updating some special node properties).

As you can see in the Wikipedia page¹, AVL trees guarantee that the tree height is strictly less than approximately 1.44 log2(n), while Red-black trees have a slightly worse bound of approximately 2 log2(n); thus, in the worst case, AVL trees offer shorter search paths than Red-black trees. On the other hand, AVL trees are more rigidly balanced, so updates tend to trigger more rebalancing work: an AVL deletion may need O(log(n)) rotations, whereas a Red-black tree needs at most 2 rotations per insertion and at most 3 per deletion. So either one may be your tree of choice depending on the application: if search time is critical but data doesn’t get updated too often, an AVL tree will perform better; whereas a Red-black tree will perform better in scenarios where data is constantly being changed.
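To get a feel for what those two factors mean in practice, the following short Python sketch evaluates both approximate bounds for a couple of tree sizes. The function name height_bounds is an assumption; the factors 1.44 and 2 are the ones quoted above.

```python
import math

def height_bounds(n):
    """Approximate worst-case heights for n nodes: (AVL, Red-black)."""
    log2n = math.log2(n)
    return 1.44 * log2n, 2 * log2n

for n in (1_000, 1_000_000):
    avl, rb = height_bounds(n)
    print(f"n = {n}: AVL height < ~{avl:.1f}, Red-black height <= ~{rb:.1f}")
```

The gap between the two bounds grows only logarithmically with n, which is why the difference in search times is worth measuring rather than assuming.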

Self-balancing BSTs add some kind of property to tree nodes that make way for tree balancing: with AVL trees, it was the “balance factor”. With Red-black trees, a “color” property is added to each node. This leads us to the Red-black tree properties:

1. Every node is either red or black
2. Every leaf is black
3. If a node is red, then both its children are black
4. Every path from a node to any of its descendant leaves contains the same number of black nodes
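The four properties above can be checked mechanically. Below is a minimal Python sketch of such a checker, not the implementation from the full post. The Node class, the string colors and the function name black_height are assumptions, and None is used to stand for the black leaves of property 2.

```python
class Node:
    def __init__(self, key, color, left=None, right=None):
        self.key = key
        self.color = color      # "red" or "black"
        self.left = left
        self.right = right

def black_height(node):
    """Return the number of black nodes on every path from node down to a
    leaf, or -1 if the subtree violates property 1, 3 or 4."""
    if node is None:
        return 1                                 # property 2: leaves are black
    if node.color not in ("red", "black"):
        return -1                                # property 1
    if node.color == "red":
        for child in (node.left, node.right):
            if child is not None and child.color == "red":
                return -1                        # property 3
    left = black_height(node.left)
    right = black_height(node.right)
    if left == -1 or right == -1 or left != right:
        return -1                                # property 4
    return left + (1 if node.color == "black" else 0)

valid = Node(2, "black", Node(1, "red"), Node(3, "red"))
red_red = Node(2, "black", Node(1, "red", Node(0, "red")), Node(3, "red"))
uneven = Node(2, "black", Node(1, "black"), None)

print(black_height(valid), black_height(red_red), black_height(uneven))
```

A result of -1 means at least one property is broken somewhere in the subtree; any other value is the common black count of all root-to-leaf paths.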


Ruby DSL & metaprogramming, part II

In the previous installment we built a simple text generator using some Ruby metaprogramming tricks. It was still far from being our desired context-free grammar (CFG) generator, though, since it lacked many CFG prerequisites. Most flagrantly, we had no rule recursion and only one production (rule definition) per rule. Here’s what a script that uses both features would look like:

  noun 'dog', 'bus'
  verb 'barked', 'parked'
  preposition 'at'

rule 'phrase'
  opt 'The', noun, verb, preposition, 'a', noun
  opt 'Here goes some', phrase, 'recursion.'
  opt 'Meet me', preposition, 'the station.'

grammar phrase: 10

The dictionary section is just as we left it. Let’s see what changed in the rule section.


Ruby DSL & metaprogramming, part I

I’ve been working with Ruby for nearly a year now, which means I’m starting to feel the urge to tell people how awesome the language is. One of the most interesting aspects of Ruby to me is metaprogramming, which it seems to have quite a vocation for.

I have had a fondness for automata and formal language theory since college. One of the topics I particularly like is text generation (if you haven’t already, check out the excellent SCIgen and the Dada engine), so I thought that building a context-free grammar (CFG)-like text generator in Ruby would be a nice little exercise and an opportunity to use some of the language’s coolest features. I also implemented one of those in Java several years ago, and it was a mess, so I was curious as to how much of an improvement Ruby would offer.

Suppose the following script:

dictionary 'noun', 'dog', 'bus'
dictionary 'verb', 'barked', 'parked'
dictionary 'preposition', 'at'

rule 'phrase', 'noun', 'verb', 'preposition', 'noun'

codex 'phrase'

We’d like dictionary to store some words according to their classes, and rule to define a specific ordering of words. For now let’s not worry about codex (it’s just a collection of rules).
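Before getting to the Ruby side, it may help to pin down the behaviour we are after. The following is a language-neutral illustration written in Python, not the Ruby DSL this post builds: dictionary stores words by class, rule stores an ordering of classes, and a hypothetical generate helper (an assumption, not part of the script above) fills a rule with random words.

```python
import random

words = {}   # word class -> list of words
rules = {}   # rule name  -> ordered list of word classes

def dictionary(word_class, *entries):
    """Store some words under their class."""
    words[word_class] = list(entries)

def rule(name, *word_classes):
    """Define a specific ordering of word classes."""
    rules[name] = list(word_classes)

def generate(name):
    """Pick one random word per class, in the order given by the rule."""
    return " ".join(random.choice(words[c]) for c in rules[name])

dictionary('noun', 'dog', 'bus')
dictionary('verb', 'barked', 'parked')
dictionary('preposition', 'at')

rule('phrase', 'noun', 'verb', 'preposition', 'noun')

print(generate('phrase'))   # e.g. "dog barked at bus"
```

Note how close the calls already look to the script above; the difference is that here they are ordinary function calls on global state, with nothing DSL-specific going on.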

At this point the seasoned programmer is mentally sketching some kind of text parser. It’s an okay solution, but isn’t there something nicer we can do? Well, there is: DSLs! In fact, Ruby is quite an excellent tool for building a DSL, and many famed Ruby-powered applications such as RSpec (and many others) define some kind of DSL.


Trees, Part II: AVL Tree

Masters classes started a few weeks ago, taking their toll on my productivity here. Sorry about that!

So we (pardon the nosism, but I think it sounds less egocentric than writing “I” all the time) hinted at AVL trees back in our Trees, Part I post. Specifically, we learned that:

a binary search tree (BST) provides O(h) time search, insert and delete operations (h is the tree height).

O(h) time doesn’t sound very good: if h is close to n, we’ll have the same linear-time performance as a linked list. What if there were a way to bound the tree height to some sub-linear factor? As it turns out, there are several ways to do so, and the general idea of somehow keeping the tree height limited to a certain factor of the number of elements it holds is called height balancing. Ergo we’ll want to look into (height-)balanced, or self-balancing, binary search trees (BBSTs).


Figure: AVL tree

Since each node in a binary search tree has at most two children, the best (i.e. smallest) tree height we can achieve is about log2 n (n being the number of elements in the tree). Several self-balancing BSTs have been developed over the years. It seems that up there in the US, college professors tend to prefer the red-black tree when studying BBSTs, whilst over here AVL is preferred. In any case, the AVL tree was the first BBST ever devised, so we’ll adopt it as our BBST model.

AVL trees (named after their two Soviet inventors, Adelson-Velsky and Landis) use a series of rotations to keep the tree balanced. To keep track of when a certain subtree rooted at some node needs to be rotated, we maintain (or calculate) a balance factor variable for each node, which is the difference between the heights of the node’s left and right subtrees, i.e.:

balance_factor(n) = n.left_child.height - n.right_child.height
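As a quick illustration of the formula, here is a minimal Python sketch that computes balance factors by recalculating heights on demand. The Node class and function names are assumptions, an empty subtree is assumed to have height 0, and the is_avl check relies on the standard AVL invariant that every balance factor must be -1, 0 or 1.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def height(node):
    """Height in nodes; an empty subtree has height 0 (assumed convention)."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    """Left subtree height minus right subtree height."""
    return height(node.left) - height(node.right)

def is_avl(node):
    """True if every node in the subtree has a balance factor in {-1, 0, 1}."""
    if node is None:
        return True
    return (abs(balance_factor(node)) <= 1
            and is_avl(node.left) and is_avl(node.right))

balanced = Node(2, Node(1), Node(3))
chain = Node(1, right=Node(2, right=Node(3)))   # inserted as 1, 2, 3

print(balance_factor(balanced), is_avl(balanced))
print(balance_factor(chain), is_avl(chain))
```

Recomputing heights like this costs O(n) per call, so a real implementation would more likely store the height (or the balance factor itself) in each node and update it along the insertion path.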


Shortest path, part I – Dijkstra’s algorithm

Now that we have a way to represent graphs, we can discuss one of the most important problems in graph theory: the shortest path problem (SPP). More or less formally, we’ll define SPP as:

Given a weighted graph G(V,E), find the sequence P = {v0, v1, v2, …, v(n-1)}, vi ∈ V, from vertex v0 to vertex v(n-1), such that the list of edges EP = {(v0,v1), (v1,v2), … (v(n-2), v(n-1))} exists and the summation of costs of all elements e ∈ EP is the smallest possible.

In other words, find the least expensive (ergo “shortest”) path between two vertices.

A naive approach is to run BFS starting at vertex A and stop when it reaches vertex B. However, BFS doesn’t look at the edge costs: it calculates the path with the fewest edges, not the path with the least total cost.
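A tiny example makes the difference visible. The Python sketch below uses a made-up three-vertex graph (the dict-of-dicts layout, the vertex names and the costs are all assumptions for illustration) in which the direct edge from A to B is more expensive than the detour through C.

```python
from collections import deque

# Hypothetical directed graph: graph[u][v] is the cost of edge (u, v).
graph = {
    "A": {"B": 10, "C": 2},
    "C": {"B": 3},
    "B": {},
}

def bfs_path(graph, start, goal):
    """Return the path with the fewest edges from start to goal, or None."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            break
        for v in graph[u]:
            if v not in prev:          # first visit fixes the predecessor
                prev[v] = u
                queue.append(v)
    if goal not in prev:
        return None
    path, node = [], goal
    while node is not None:            # walk predecessors back to start
        path.append(node)
        node = prev[node]
    return path[::-1]

def path_cost(graph, path):
    """Sum the edge costs along a path."""
    return sum(graph[u][v] for u, v in zip(path, path[1:]))

found = bfs_path(graph, "A", "B")
print(found, path_cost(graph, found))
print(["A", "C", "B"], path_cost(graph, ["A", "C", "B"]))
```

BFS settles for the single-edge path even though the two-edge path is cheaper, which is exactly the gap a cost-aware algorithm has to close.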

Although not necessarily the fastest, Dijkstra’s algorithm is probably the most popular way to solve the shortest path problem due to its simplicity and elegance. The algorithm relies heavily on priority queues, so make sure to take a look at that if you haven’t already.


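dist[source] = 0
for each vertex v in G
      if v != source
            dist[v] = infinity            // no known path to v yet
      prev[v] = -1                        // no predecessor yet
      PQ.add(v, dist[v])
while PQ is not empty
      u = PQ.pop()                        // vertex with the smallest dist
      for each neighbor v of u
            alt = dist[u] + length(u, v)
            if alt < dist[v]              // found a shorter path to v
                  dist[v] = alt
                  prev[v] = u
                  PQ.decreasePriority(v, alt)   // keep the queue in sync with dist
return prev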
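For readers who want something runnable, here is a minimal Python sketch of the same idea. It is not the implementation from the full post: it uses Python's heapq module, which has no decrease-priority operation, so instead it pushes a new entry whenever a distance improves and skips stale entries when they are popped. The dict-of-dicts graph layout and the example costs are assumptions, every vertex must appear as a key, and edge costs are assumed to be non-negative.

```python
import heapq

def dijkstra(graph, source):
    """Return (dist, prev) for shortest paths from source.

    graph[u][v] is the non-negative cost of edge (u, v).
    """
    dist = {v: float("inf") for v in graph}
    prev = {v: None for v in graph}
    dist[source] = 0
    pq = [(0, source)]                     # entries are (distance, vertex)
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue                       # stale entry, u was already improved
        for v, length in graph[u].items():
            alt = d + length
            if alt < dist[v]:              # found a shorter path to v
                dist[v] = alt
                prev[v] = u
                heapq.heappush(pq, (alt, v))
    return dist, prev

# Hypothetical example graph.
graph = {
    "A": {"B": 10, "C": 2},
    "C": {"B": 3},
    "B": {},
}

dist, prev = dijkstra(graph, "A")
print(dist)
print(prev)
```

Following prev backwards from the target vertex until reaching None reconstructs the path itself.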


Trees – Part I


We used trees to build the heap data structure before, but we didn’t bother with the theory behind trees, which are themselves both abstract and concrete data structures. There’s a huge range of material to cover, so I’ll split this into several posts.

In this first post we’ll cover the basic theory and implement a binary search tree (BST), which provides O(h) time search, insert and delete operations (h is the tree height). First, the basics:

Trees are graphs with a few extra properties and interpretations/conventions.

  • Trees have a height (the length of the longest branch), and each node has a depth (its distance to the root).
  • The uppermost level consists of at most one node (the tree root).
  • All nodes may have children.
  • There are no edges other than parent-child edges.

Trees are classified according to some of the properties above and some others we’ll mention later. Most commonly, there is a constraint on the maximum number of children per node, e.g. the binary tree limits children to 2 per node.
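As a preview of where this is going, here is a minimal Python sketch of a binary search tree with insert and search. The Node class and function names are assumptions, and duplicate keys are simply ignored. Both operations follow a single branch from the root downwards, which is where the O(h) cost comes from.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None     # subtree with smaller keys
        self.right = None    # subtree with larger keys

def insert(root, key):
    """Insert key and return the (possibly new) root of the subtree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root              # equal keys are ignored

def search(root, key):
    """Return the node holding key, or None if it is absent."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root

def height(root):
    """Height in nodes; an empty tree has height 0 (assumed convention)."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

root = None
for key in (8, 3, 10, 1, 6):
    root = insert(root, key)

print(search(root, 6) is not None, search(root, 7) is not None, height(root))
```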


Graph

Mathematically, a graph is a set of vertices and edges, thus a graph G is usually written as G(V,E). Besides linking vertices in the graph, edges can also carry a specific value which may be interpreted as cost, weight, distance etc.

graph viewed with BurgerGFX

In computer science, we’re interested in the (abstract) data structure used to implement the graph mathematical concept. Let’s first discuss the basic elements in a graph – vertices and edges:

typedef struct vertex {
 unsigned long id;
 int status;      /* marked or unmarked */
 double x,y;      /* 2D coordinates, used for drawing */
 void* data;      /* user payload of any type */
} vertex;

Vertices should be able to hold any kind of data, so we’ll just throw in a void pointer for that. Other than that we have an id, status (marked or unmarked – more on that later) and 2D coordinates so we can draw the vertices somewhere.

typedef struct edge {
 vertex *from, *to;   /* the two vertices this edge links */
 int cost;            /* weight, distance, cost etc. */
} edge;

Edges consist of just pointers to the vertices they link and an optional value used as weight, distance, cost etc. Strictly speaking we could use a void pointer for that value as well, as long as we also defined a comparison function. But let’s save the hassle and just use an integer instead – most algorithms will be fine with that.

Continue reading “Graph”

Heap & Priority Queues

Priority queues (PQs) are abstract data types that work much like regular queues and stacks, except that the popping order depends on each element’s priority instead of the order in which the elements were pushed (FIFO or LIFO).

The naïve way of implementing a PQ consists of using an unsorted list or array and searching for the highest-priority element at each pop, which takes O(n) time. There are several more efficient implementations, of which the most common is the heap.

Heaps are complete binary trees (i.e. all levels except possibly the last are filled, and the last one is filled from left to right) that work as PQs by maintaining the following property: a child node never has a higher priority than its parent, i.e. for any node A with children B and C, priority(B) <= priority(A) && priority(C) <= priority(A). Note that there is no assumed relation between siblings or cousins.

max-heap and corresponding array.

Each element of a heap has two pieces of information: a key and a value, hence we call them key-value (KV) pairs. The key identifies the specific element, and the value determines the element’s priority within the heap. Heaps can be min-heaps (low value = high priority) or max-heaps (high value = high priority).
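To show how the heap property and the array layout fit together, here is a minimal Python sketch of an array-backed max-heap of KV pairs. It is not the implementation from the full post; the function names and the (value, key) tuple layout are assumptions. It relies on the usual array mapping, where the children of the element at index i sit at indices 2i + 1 and 2i + 2.

```python
def heap_push(heap, key, value):
    """Add a KV pair, then sift it up until its parent has a value >= its own."""
    heap.append((value, key))
    i = len(heap) - 1
    while i > 0:
        parent = (i - 1) // 2
        if heap[parent][0] >= heap[i][0]:
            break                              # heap property holds
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent

def heap_pop(heap):
    """Remove and return the (key, value) pair with the highest value."""
    top = heap[0]                              # raises IndexError if empty
    last = heap.pop()
    if heap:
        heap[0] = last                         # move last element to the root
        i, n = 0, len(heap)
        while True:                            # sift it down
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and heap[child][0] > heap[largest][0]:
                    largest = child
            if largest == i:
                break
            heap[i], heap[largest] = heap[largest], heap[i]
            i = largest
    value, key = top
    return key, value

heap = []
for key, value in (("a", 3), ("b", 7), ("c", 5), ("d", 1)):
    heap_push(heap, key, value)

order = [heap_pop(heap) for _ in range(4)]
print(order)
```

Both operations touch at most one node per tree level, so they take O(log n) time instead of the O(n) scan of the naïve approach.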
