Author Archives: juangallostra

Procedural Racetrack Generation

This post explains my personal take (in Python) of the work done by Gustavo Maciel and explained in this article.


I have been thinking for some time to develop a game that mimics the mechanics of vector racer (nothing new here). When I finally started to work on it, I thought it would also be a good opportunity to give a try at procedurally generating racetracks.

I mainly wanted to achieve two things:

  1. Procedurally generate racetrack layouts
  2. Implement a drawing system generic enough that could draw recognisable racetracks from the layouts obtained in 1.

Figure 1: Procedurally generated racetracks. (Clickbait: As you may notice, some corners do not have kerbs… Keep reading to find out why!)

A quick search led me to Gustavo’s article, linked above, and after skimming through it decided to give it a go.

Racetrack Generation

The track generation algorithm is well explained in Gustavo’s article. I will just present the general idea, but if you want to dive into the details of the algorithm I suggest reading the original post. The image below shows some of the obtained layouts obtained from my implementation.


Figure 2: Layouts of some of the procedurally generated racetracks.

The main steps followed to obtain the track layout are:

  1. Generate a set of random points (white dots in Figure 1).
  2. Get their convex hull (red lines in Figure 1).
  3. Compute the midpoint of each pair of points in the convex hull and displace it by a random amount.
  4. Given both a minimum angle and a minimum distance threshold, push apart the set of points resulting from 2 and 3 to try to guarantee that: a) there are not two adjacent points whose distance is less than the minimum distance threshold and b) there is no set of three adjacent points where the angle of the vectors that connect them is less than the minimum angle threshold. (blue lines in Figure 1).
  5. Obtain the final layout by interpolating all the points obtained after step 4 with splines forcing them to pass through all the input points (black curve in Figure 1).

There are quite a few parameters (more than I would like) that can be modified and tuned to alter the results, and after playing a bit with them I was quite happy with the obtained layouts. Some of them are presented in Figure 2.

Drawing the Racetrack

Since tracks are procedurally generated, the track rendering system has to be generic enough to be able to handle a wide variety of layouts. This made the rendering interesting pretty quickly, and ended up being harder than expected. My results are far from ideal, but anyway, here’s what I came up with. Any suggestions on how to improve it are more than welcome!

Growing the Track

This is the easiest part. To obtain something that begins to look like a racetrack we can simply get a predefined number of evenly spaced points (I chose 1000 points) from the curve obtained at the end of the previous section. Then, at the position of each of these points we simply draw a circle of a given radius (20 pixels in my case).

If the points are close enough there will be no sign of the individual circles and we will have a constant-width racetrack with smooth corners. The result does indeed already look like a racetrack (at least, that is what I like to think). Some results after this stage are presented in the Figure below.


Figure 2: Partial racetrack drawing.

Placing Kerbs

This is where it began to get interesting. My approach is probably not the optimal one, so I am open to give a try at any suggestions on how to improve it.

The first thing we have to do is to detect where the corners of the racetrack are. A solution that balances complexity and good results is to use the set of points from which the splines were computed (points connected with blue lines in Figures 1 and 3)  to roughly estimate where the corners of the track are.

Given three consecutive points we can compute the angle between the two vectors that connect the point in the middle with both extremes. If this angle is inside a range, we add it to the list of corners requiring kerbs. We can play with this range to increase or reduce the amount of kerbs that will be placed in the track. Figures 3 and 4 show results obtained for different angle ranges. One drawback of this approach is that we will miss any corner in the spline which does not satisfy the detection method.


Figure 3: Complete racetrack with kerbs. Note that all the dots connected with blue lines have resulted in a corner with a kerb in the final racetrack.


After this step we will have the list of single points, representing a corner in the track, that require kerbs. We have the location of the corner but, before drawing anything, we should know the length of the corner (where does it start and where does it end). To simplify, I decided to set a fixed offset that represents the length of the corner and assume that the points indicating the corner location are the center of the corner. With this approach all the corners will be assumed to have the same length, which is not true but will make our lifes easier.

Now, since we have the guarantee that the points selected in the previous step are part of the spline, we can find their index in the set of points that we used to draw the circles that resulted in the basic racetracks shown in Figure 2. Once having this indices we can, for each corner, build the list of points from the spline that represent the full corner. This list is made by the points inside the range [corner center – offset, corner center + offset]. I had to play a bit with the offset value, but I ended finding a value I was satisfied with.

Figure 4: Complete racetrack with kerbs. Note that, in this case, not all the dots connected with blue lines have resulted in a corner with a kerb in the final racetrack.

We now have, for each corner, a list of points that indicate its location and length. It is time to draw the kerbs. The main restriction here is that kerbs should outline the corners and be located at the limits of the track.  I thought that the best approach would be to draw a simple building block and place it several times along the corner. After a bit of trial and error I ended up with the following tile (althoug it cannot be seen, there is another white stripe at the right):


Now we just need a position and orientation to place them in the track. Getting the orientation is as simple as computing the angle between the vector obtained from two consecutive points in the corner and the horizontal axis. Instead of two consecutive points I defined yet another offset n and, for each point in the corner, computed the angle between (i, i+n) and the horizontal axis.

As for the location, first we should compute the unitary vector perpendicular to the vector used to determine the orientation. Multiplying this vector by the radius of the circles used to draw the track gives us the displacement from the track center to the place where the kerb should be drawn. With this values we are set to draw the kerb.

To prevent the tiles from ovelapping I repated this process for every 4th (another parameter to be adjusted) point in the list of corner points.

I did the same for the indicator of the track start, but this time placing it in the middle of the track and perpendicular to the track direction. The tile I used is shown below.


Results and Code

The images below show some example racetracks obtained with the algorithm explained above. The code (a bit messy) can be found here.

Results showing the underlying process





Final Racetracks





If you have any doubts or suggestions, do not hesitate to leave a comment below!


Lane detection via Hough Transform

This work was developed in conjunction with Carlos Trapiello


During my Master’s degree I was lucky enough to work on several projects which turned out to be both interesting and fun. One of these projects was to develop a simple road lane detector via the Hough transform. When I first read about the Hough transform I was truly amazed by the simplicity and cleverness of the idea. My hope is that, after reading this post, we will both agree on that.

Although the final code is available on GitHub it is not a piece of ( warning: MATLAB) code I am proud of. This is why we will mainly focus on the algorithm and not on its implementation.

Jump to the end of the post if you want to see how the algorithm performs.


Before diving into the actual Hough transform and how can it be used to detect lanes, it is worth stopping for a second to think about the general picture and the different parts that will be included in our solution to get the best results we can. Since we already know which method we will be using to detect lanes, we can use this information to design and implement a pre-processing step that takes a raw frame and:

  1. Prepares it to be fed to the Hough transform.
  2. Tries to get rid of any  irrelevant information and unwanted noise that may affect negatively the performance of the Hough transform.

This will be the first step of our solution. After that, the pre-processed image will be fed to the Hough transform and, hopefully, lanes will be detected. This two-step process (pre-processing + Hough transform) will be repeated iteratively until the end of the video. A part from that, we can expect some sort of relation/continuity between consecutive frames. This two facts can be used in our favor. It enables us to include, in the lane detection calculations for the current frame, information of where lanes were located in past frames. This should improve, at least in theory, the accuracy of our algorithm. Hence, the last step of our lane detection algorithm will be a tracking system that helps in deciding, from all the detected lines, which are the actual road lanes.


Figure 1. Example input frame. We will use this image to illustrate the different steps of the algorithm.

In the light of the above, we decided that our lane detector would consist in the three stages already listed:

  1. Frame pre-processing
  2. Lane detection via Hough transform
  3. Lane tracking
Figure 1: Lane detection algorithm outline.

Figure 1: Lane detection algorithm outline.

The rest of the post provides an explanation of each of this three stages. We will start with the Hough transform and see how it works, how can we use it to detect lanes and which is its output and expected input. Once we know the expected input of the Hough transform, we will move on to the pre-processing stage and see how can we convert a raw frame into something the Hough transform can work with. Last but not least, we will figure out one of many ways in which a tracking system can be implemented to make use of past lane location information.

Hough Tansform

As with many Computer Vision tasks, the main problem we face for successfully implementing our lane detection algorithm is getting the computer to make sense of the information contained in the image which is relevant to achieve the desired goal. In other words, we should devise a way of extracting from the image those features that give us the information needed to solve the task at hand.

Since road lanes are indeed lines, it makes sense to think that the information we need to extract from the image in this particular case are lines. Once the lines present in our image, as well as their position, have been detected it will be easier to decide which of those lines are the actual road lanes. Hence, the features we have to get from the video frames are lines. We need a method that detects lines in an image.

The Hough transform is based on a rather simple but genius idea. You will probably know that a straight line can be characterized by the equation:

y = m·x+n

Where m and n are the parameters of the straight line, m being the slope and n the y-intercept (the value of y at the point where the line intercepts the y-axis). Given the values of m and  we are able to know all the points that lie in the line. Its as easy as insterting the specific values of x and y in the equation and checking if the equality is satisfied. If it is satisfied, the point is part of the line. If it is not, the point is not part of the line. As simple as that. What we just said is that any line can be characterized with this two parameters, m and n.

The Hough transform uses this basic idea to detect lines. The basic procedure of the algorithm (which can be further optimised) is as follows: Given a set of interest points from an image, it takes two at a time and assumes they are part of a straight line. It computes m and n for that match and places a vote for them in the parameter space. This procedure is repeated for all possible matches. After that it checks which pairs of parameters have more than votes, where is a threshold value, and returns those that meet the criteria as the detected lines.

A simplified version of the algorithm with a small number of points and a discretised parameter space is ilustrated (poorly) in the image below.


Simplified visual version of the Hough Transform.

This approach has just one flaw, which is that vertical lines have an infinite slope (m equals inifnity). To solve this, instead of using m and n, parameters are represented in polar coordinates.

The basic idea explained above can be optimised by only checking, for each point, a subset of all its possible matches. Since we are looking for lines, which have a continuous nature, it only makes sense to check matches in a small neighbourhood.

The two images seen below are examples of the output obtained when applying the Hough Transform to two already pre-processed frames of a dash cam video. The white squares mark the pairs of parameters above the defined threshold.


Hough Transform output.


Hough Transform output.

In our case, the input of the Hough Transform will be a black and white image where white pixels are the interest points that will be processed by the algorithm.

As a final note, I just wanted to point out that the Hough transform is not only restricted to detect straight lines. It can be used to detect more general shapes. If interested, you can find more here.

Okay, now we know what the Hough Transform is and how it works. What we should cover now is how we prepare a raw frame so that it can be fed to the algorithm.

Frame Pre-processing

When developing a Computer Vision algorithm (with the permission of machine learning), it is normally required to implement a first pre-processing stage that takes a raw video frame or image as input and transforms it in such a way that it becomes a suitable input for the core algorithm. At least from my experience, image pre-processing is not a mechanical task but rather an applied art. Although a series of general criteria exist, much is left to the developer when it comes to finding the optimal pre-processing procedure for the task at hand.

Four our particular case, we have already seen that the Hough Transform will perform at its best when its input is a binarized image that only contains the edges of the figures found in the original image. Our pre-processing step should then aim to produce this kind of image that can be used as the input of the Hough Transform, being the binarization and edge detection the key steps of the process. The pre-processing routine we developed consisted in the 5 steps listed below:

  1. Grayscaling: The image is transformed from RGB to grayscale space so it is easier to work with.
  2. Region of Interest (ROI) selection: There are some regions of the image were we can be almost a 100% sure that road lanes will not be found. For this specific application, the top half of the frames will rarely contain road lanes (see Figure 1). This is why a Region Of Interest (ROI) is defined and selected and only this region will be further processed.
  3. De-noising: A median filter is applied to remove noise and reduce sharpness. This will hopefully improve the results of the edge detector.
  4. Binarization: The ROI is binarized to a black and white image. The key parameter here is the threshold value that determines if a pixel will be mapped to black or white.
  5. Edge detection: A Sobel operator is used to detect the edges in the image. There many other methods to perform edge detection, but the Sobel operator is efficient and yields good enough results.

This pre-processing routine applied to each frame is exemplified in the images below.


Original frame.


Region of interest.


Denoised ROI.


Binarized ROI.


Detected edges.

After the edge detection has been performed the resulting image is ready to be processed with the Hough Transform. After applying the Hough transform we will have a set of lines that have been detected in the frame and we will have to, somehow, decide which of them are the actual road lanes. This where a tracking system becomes handy.

Lane tracking

As previously explained, the main idea behind the lane mark tracking system is to help in the decision of selecting which line should be chosen as the line that best represents the lane mark for a given frame k. (The method explained below is just one of many. There are other methods, such as Kalman filtering, which would probably yield best results.)

We implemented a really simple method – with many disadvantages -. Keeping a record of which was the line selected as the lane mark on frame k − 1 and boldly assuming this line is actually the real lane mark. This information is then used to predict where will be the lane mark of frame k. Once the prediction of the lane mark position of frame k is computed then it is checked which is the line among all the obtained candidate lines of frame k that best fits the prediction. The set of candidate lines is the set of lines obtained after performing the Hough transform on the frame.

This line that best ressembles the predicted lane mark is then selected as the actual lane mark of frame k and will be used to predict the lane mark position of frame k + 1. Since, as already stated, we are using the straight lane mark assumption it has been decided to predict the position of the lane mark on frame k with the line that was selected as the lane mark on frame k − 1. Once knowing which is the predicted lane mark of frame k the best fit for that same frame is calculated bySelection_055

The line that satisfies the above condition will be selected then as the line which best represents the road lane. It has to be pointed out here that, since there are two lane marks to be detected (left and right) and the Hough transforms returns the set of all detected lines in the frame there is a previous step that has to be done before applying the lane tracking algorithm. It consits in dividing the set of lines in two subsets: left lane candidates and right lane candidates. The filtering method that we applied is, for each candidate line


Where the notation used is the same as the one used in the expression that computes the best fit line. Note that, since there is no equal sign in any of the two expressions the previous filtering method automatically discards vertical lines. It is true that this method will yield wrong results if both lanes point in the same direction. However, due to camera perspective this is not a common phenomenon and to keep things simple we skipped handling this special case. Maybe it would have been better to divide lines in left and right candidates based on x,y coordinates. Another drawback of this simple method that only checks for the past frame is that if one estimation is completely off it is hard to drive the system back to correctly estimating road lanes.


There is one last thing we have to address. The the tracking system works with the information provided from past frames but at frame 0 there is no past information. Hence, the tracking system needs some sort of initialization when it is executed for the first time. It basically needs a guess of where the road lines are supposed to be to get things rolling. We did not overthink it too much. We assumed lanes met at the top of the ROI and at the center of the frame. We also assumed that both lanes started at the bottom of the ROI. The left lane, at 1/4 of the frame width and the right lane at 3/4 of the frame width.


Tracking system initialization. The blue lines are the guess the system makes of road lanes’ position in its first iteration.


By applying the simple steps explained above we can get a simple lane detection and tracking system running. It is by no means perfect and can be improved in many ways, but it is enough to get us started in the topic. Sample results in different situations are presented in the videos below:

  • Straight line:

  • Turn:

  • Turn with all detected lines:

  • Tunnel exit (notice how the change in contrast makes the algorithm go crazy):

  • Tunnel exit with all detected lines:


To see the performance of the algorithm in real time I decided to test how well it performed on a driving simulator. On the left screen you can see me uncautiously driving and, on the right screen, the performance of the algorithm in real time. I captured the game window and used the frames to feed the algorithm. Enjoy!


Final notes

My current work is making me have to deal pretty intensively with Autodesk’s Forge. I plan on writting about it in the near future. If you are also interested in the topic let me know!

Augmented Reality with Python and OpenCV (part 2)

The code of this project can be found here.

For those of you that have found this post before part 1 or that want to refresh what we have done up to this point, here you can catch up with the current state of the project so far. For the rest, we’ll keep going from where we left it. I’ve been really busy lately, so sorry if this second part is not as detailed as the first one.

We left the project in a state where we were able to estimate the homography between our reference surface and a frame that contained that same surface in an arbitrary position. To do so we were using RANSAC. Well,  how do we keep building our augmented reality application from there?

Pose estimation from a plane

What we should achieve to project our 3D models in the frame is, as we have already said, to extend our homography matrix. We have to be able to project not only points contained in the reference surface plane (z-coordinate is 0), which is what we can do now, but any point in the reference space (with a z-coordinate different than 0).

If we go back to the first paragraphs of the section Homography Estimation, on part 1, we reached the conclusion that the 3×3 homography matrix was the product of the camera calibration matrix (A) by the external calibration matrix – which is an homogeneous transformation-. We dropped the third column (R3) of the homogeneous transformation because the z-coordinate of all the points we wanted to map was 0 (since all of them were contained in the reference surface plane). Figure 18 shows again the last steps of how we obtained the final expression of the homography matrix.


Figure 18: Derivation of the homography matrix. Source: F.Moreno

This meant that  we were left with the equation presented in Figure 19.


Figure 19: Components of the homography matrix. Source: F. Moreno

However, we now want to project points whose z-coordinate is different than 0, which is what will allow us to project 3D models. To do so we should find a way to compute, from what we know, the value of R3. Why? Because once z is different than 0 we can no longer drop R3 from the transformation (see Figure 18) since it is needed to map the z-value of the point we want to project. The problem of extending our transformation from 2D to 3D will be solved then when we find a way to obtain the value of R3 (see Figure 18, again). But, how can we get the value of R3?

We have already estimated the homography (H) , so its value is known. Furthermore, either by looking up the camera parameters or with a bit of common sense, we can easily know or make an educated guess of the values of the camera calibration matrix (A). Remember that the camera calibration matrix was:


Figure 20: Camera calibration matrix. Source: F. Moreno

Here you can find a nice article (part 3 of a recommended series) that talks in detail about the camera calibration matrix and each of its values, and you can even play with them. Since all I am building is a prototype application I just made an educated guess of the values of this matrix. When it comes to the projection of the optical center, I just set u0 and v0 to be half the resolution of the frames I am capturing with OpenCV (u0=320 and v0=240). As for the focal length, this article provides some useful information on how to estimate the focal length of a webcam or a cellphone camera. I set fu and fv to the same value, and found that a focal length of 800 worked quite well for me. You may have to adjust these parameters to your actual set up or even go on calibrating your camera.

Now that we have estimates of both the homography matrix H and our camera calibration matrix A, we can easily recover R1, R2 and by multiplying the inverse of by H:


Figure 21: Recovering the external calibration matrix values from the estimated homography and the camera calibration matrix. Source: F.Moreno

Were the values of G1, G2 and G3 can be regarded as:


Figure 22: Extracting the values of the external calibration matrix. Source: F.Moreno

Now, since the external calibration matrix [R1 R2 R3 t] is an homogeneous transformation that maps points amongst two different reference frames we can be sure that [R1 R2 R3] have to be orthonormal. Hence, theoretically we can compute R3 as:


Figure 23: Computation of R3. Source: F. Moreno

Unluckily for us, getting the value of R3 is not as simple as that. Since we obtained G1, G2 and G3 from estimations of A and there is no guarantee that [R1=G1 R2=G2 R3=G1xG2] will be orthonormal. The problem is then to get a pair of vectors that are close to G1 and G2 (since G1 and G2 are estimates of the real values of R1 and R2) and that are orthonormal. Once this pair of vectors has been found (R1′ and R2′) then it will be true that R3 = R1’xR2′, so finding the value of R3 will be trivial. There are many ways in which we can find such a basis, an I will explain on of them. Its main benefit, from my point of view, is that it does not directly include any angle-related computation and that, once you get the hang of it, it is quite straightforward.

Finally and before diving into the explanation, the fact that the vectors we are looking for have to be close to G1 and G2 and not just any orthonormal basis in the same plane as G1 and G2 is important in understanding why some of the next steps are required so make sure you understand it before moving on. I will try my best in explaining the process by which we will get this new basis but if don’t succeed in doing so do not hesitate to tell me and I will try to rephrase the explanation and make it clearer. It will be useful to have at hand Figure 24 since it provides visual information that can help in understanding the process. Note that what I am calling G1 and G2 are called R1 and R2 respectively in Figure 24. Let’s go for it!

We start with the reasonable assumption that, since G1 and G2 are estimates of the real R1 and R2 (which are orthonormal), the angle between G1 and G2 will be approximately 90 degrees (in the ideal case it will be exactly 90 degrees). Furthermore, the modulus of each of this vectors will be close to 1. From G1 and G2 we can easily compute an orthogonal basis -this meaning that the angle between the basis vectors will exactly be 90 degrees- that will be rotated approximately 45 degrees clockwise with respect to the basis formed by G1 and G2. This basis is the one formed by c=G1+G2 and  d = c x p = (G1+G2) x (G1 x G2) in Figure 24. If the vectors that form this new basis (c,d) are made unit vectors and rotated 45 degrees counterclockwise (note that once the vectors have been transformed into unit vectors – v / ||v|| – rotating the basis is as easy as d’ = c / ||c|| + d / ||d|| and  c’ = c / ||c|| – d / ||d||), guess what? We will have an orthogonal basis which is pretty close to our original basis (G1G2). If we normalize this rotated basis we will finally get the pair of vectors we were looking for. You can see this whole process on Figure 24.


Figure 24: Normalization of [R1 R2 R3] to guarantee that they are orthonormal. Source: F.Moreno

Once this basis (R1′, R2′) has been obtained it is trivial to get the value of R3 as the cross product of R1′ and R2′.  This was tough, but we are all set now to finally obtain the matrix that will allow us to project 3D points into the image. This matrix will be the product of the camera calibration matrix by [R1′ R2′ R3 t] (where t has been updated as shown in Figure 24). So, finally:

3D projection matrix = A · [R1′ R2′ R3 t]

Note that this 3D projection matrix will have to be computed for each new frame. With numpy we can, in a few lines of code, define a function that computes it for us:


def projection_matrix(camera_parameters, homography):
 From the camera calibration matrix and the estimated homography
 compute the 3D projection matrix
# Compute rotation along the x and y axis as well as the translation
homography = homography * (-1)
rot_and_transl =, homography)
col_1 = rot_and_transl[:, 0]
col_2 = rot_and_transl[:, 1]
col_3 = rot_and_transl[:, 2]
# normalise vectors
l = math.sqrt(np.linalg.norm(col_1, 2) * np.linalg.norm(col_2, 2))
rot_1 = col_1 / l
rot_2 = col_2 / l
translation = col_3 / l
# compute the orthonormal basis
c = rot_1 + rot_2
p = np.cross(rot_1, rot_2)
d = np.cross(c, p)
rot_1 = / np.linalg.norm(c, 2) + d / np.linalg.norm(d, 2), 1 / math.sqrt(2))
rot_2 = / np.linalg.norm(c, 2) - d / np.linalg.norm(d, 2), 1 / math.sqrt(2))
rot_3 = np.cross(rot_1, rot_2)
# finally, compute the 3D projection matrix from the model to the current frame
projection = np.stack((rot_1, rot_2, rot_3, translation)).T
return, projection)


Note that the sign of the homography matrix is changed in the first line of the function. I will let you think why this is required.

As a summary, let me shortly recap our thought process to estimate the 3D matrix projection.

  1. Derive the mathematical model of the projection (image formation). Conclude that, at this point, everything is an unknown.
  2. Heuristically estimate the homography via keypoint matching and RANSAC. -> H is no longer unknown.
  3. Estimate the camera calibration matrix. -> is no longer unknown.
  4. From the estimations of the homography and the camera calibration matrix along with the mathematical model derived in 1, compute the values of G1, G2 and t.
  5. Find an orthonormal basis in the plane (R1′, R2′) that is similar to (G1,G2), compute R3 from it and update the value of t.



Figure 25: Thought process to recover the 3D projection matrix.

Model projection


Figure 26: Fox projection.

We are now reaching the final stages of the project. We already have all the required tools needed to project our 3D models. The only thing we have to do now is get some 3D figures and project them!

I am currently only using simple models in Wavefront .obj format. Why OBJ format? Because I found them easy to process and render directly with bare Python without having to make use of other libraries such as OpenGL. The problem with complex models is that the amount of processing they require is way more than what my computer can handle. Since I want my application to be real-time, this limits the complexity of the models I am able to render.

I downloaded several (low poly) 3D models format from such as this fox. Quoting Wikipedia, a .obj file is a geometry definition file format. If you download it and open the .obj file with your favorite text editor you will get an idea on how the model’s geometry is stored. And if you want to know more, Wikipedia has a nice in-detail explanation.

The code I used to load the models is based on this OBJFileLoader script that I found on Pygame’s website. I stripped out any references to OpenGL and left only the code that loads the geometry of the model. Once the model is loaded we just have to implement a function that reads this data and projects it on top of the video frame with the projection matrix we obtained in the previous section. To do so we take every point used to define the model and multiply it by the projection matrix. One this has been done, we only have to fill with color the faces of the model. The following function can be used to do so:

def render(img, obj, projection, model, color=False):
    vertices = obj.vertices
    scale_matrix = np.eye(3) * 3
    h, w = model.shape

    for face in obj.faces:
        face_vertices = face[0]
        points = np.array([vertices[vertex - 1] for vertex in face_vertices])
        points =, scale_matrix)
        # render model in the middle of the reference surface. To do so,
        # model points must be displaced
        points = np.array([[p[0] + w / 2, p[1] + h / 2, p[2]] for p in points])
        dst = cv2.perspectiveTransform(points.reshape(-1, 1, 3), projection)
        imgpts = np.int32(dst)
        if color is False:
            cv2.fillConvexPoly(img, imgpts, (137, 27, 211))
            color = hex_to_rgb(face[-1])
            color = color[::-1] # reverse
            cv2.fillConvexPoly(img, imgpts, color)

    return img


There are two things to be highlighted from the previous function:

  1.  The scale factor: Since we don’t know the actual size of the model with respect to the rest of the frame, we may have to scale it (manually for now) so that it haves the desired size. The scale matrix allows us to resize the model.
  2.  I like the model to be rendered on the middle of the reference surface frame. However, the reference frame of the models is located at the center of the model. This means that if we project directly the points of the OBJ model in the video frame our model will be rendered on one corner of the reference surface. To locate the model in the middle of the reference surface we have to, before projecting the points on the video frame, displace the x and y coordinates of all the model points by half the width and height of the reference surface.
  3. There is an optional color parameter than can be set to True. This is because some models also have color information that can be used to color the different faces of the model. I didn’t test it enough and setting it to True might result in some problems. It is not 100% guaranteed this feature will work.

And that’s all!


Here you can find some links to videos that showcase the current results. As always, there are many things that can be improved but overall we got it working quite well.


Final notes


Figure 27: Done. Let’s ship the code!

Especially for Linux users, make sure that your OpenCV installation has been compiled with FFMPEG support. Otherwise, capturing video will fail. Pre-built OpenCV packages such as the ones downloaded via pip are not compiled with FFMPEG support, which means that you will have to build it manually.

As usual, you can find the code of this project on GitHub. I would have liked to polish it a bit more and add a few more functionalities, but this will have to wait. I hope the current state of the code is enough to get you started.

The might not work directly as-is (you should change the model, tweak some parameters, etc.), but with some tinkering you can surely make it work! I’ll try to help in any issues you find along the way!




Augmented reality with Python and OpenCV (part 1)

You may (or may not) have heard of or seen the augmented reality Invizimals video game or the Topps 3D baseball cards. The main idea is to render in the screen of a tablet, PC or smartphone a 3D model of a specific figure on top of a card according to the position and orientation of the card. 


Figure 1: Invizimal augmented reality cards. Source: 

Well, this past semester I took a course in Computer Vision where we studied some aspects of projective geometry and thought it would be an entertaining project to develop my own implementation of a card based augmented reality application. I warn you that we will need a bit of algebra to make it work but I’ll try to keep it as light as possible. To make the most out of it you should be comfortable working with different coordinate systems and transformation matrices.


First, this post does not pretend to be a tutorial, a comprehensive guide or an explanation of the Computer Vision techniques involved and I will just mention the essentials required to follow the post. However, I encourage you to dig deeper in the concepts that will appear along the way.

Secondly, do not expect some professional looking results. I did this just for fun and there are plenty of decisions I made that could have been done better. The main idea is to develop a proof of concept application.


With that said, here it goes my take on it.

Where do we start?

Looking at the project as a whole may make it seem more difficult than it really is. Luckily for us, we will be able to divide it into smaller parts that, when combined one on top of another, will allow us to have our augmented reality application working. The question now is, which are these smaller chunks that we need? 

Let’s take a closer look into what we want to achieve. As stated before, we want to project in a screen a 3D model of a figure whose position and orientation matches the position and orientation of some predefined flat surface. Furthermore, we want to do it in real time, so that if the surface changes its position or orientation the projected model does so accordingly.

To achieve this we first have to be able to identify the flat surface of reference in an image or video frame. Once identified, we can easily determine the transformation from the reference surface image (2D) to the target image (2D). This transformation is called homography. However, if what we want is to project a 3D model placed on top of the reference surface to the target image we need to extend the previous transformation to handle cases were the height of the point to project in the reference surface coordinate system is different than zero. This can be achieved with a bit of algebra. Finally, we should apply this transformation to our 3D model and draw it on the screen. Bearing the previous points in mind our project can be divided into:

          1.  Recognize the reference flat surface.

          2.  Estimate the homography.

          3.  Derive from the homography the transformation from the reference surface coordinate system to the target image coordinate system.

          4.  Project our 3D model in the image (pixel space) and draw it.

AR - Page 1(1)

Figure 2: Overview of the whole process that brings to life our augmented reality application.

The main tools we will use are Python and OpenCV because they are both open source, easy to set up and use and it is fast to build prototypes with them. For the needed algebra bit I will be using numpy.

Recognizing the target surface

From the many possible techniques that exist to perform object recognition I decided to tackle the problem with a feature based recognition method. This kind of methods, without going into much detail, consist in three main steps: feature detection or extraction, feature description and feature matching.

Feature extraction

Roughly speaking, this step consists in first looking in both the reference and target images for features that stand out and, in some way, describe part the object to be recognized. This features can be later used to find the reference object in the target image. We will assume we have found the object when a certain number of positive feature matches are found between the target and reference images. For this to work it is important to have a reference image where the only thing seen is the object (or surface, in this case) to be found.  We don’t want to detect features that are not part of the surface. And, although we will deal with this later, we will use the dimensions of the reference image when estimating the pose of the surface in a scene.

For a region or point of an image to be labeled as feature it should fulfill two important properties: first of all, it should present some uniqueness at least locally. Good examples of this could be corners or edges. Secondly, since we don’t know beforehand which will be, for example, the orientation, scale or brightness conditions of this same object in the image where we want to recognize it a feature should, ideally, be invariant to transformations; i.e, invariant against scale, rotation or brightness changes. As a rule of thumb, the more invariant the better.

Blank Diagram - Page 1

Figure 3: On the left, features extracted from the model of the surface I will be using. On the right, features extracted from a sample scene. Note how corners have been detected as interest points in the rightmost image.

Feature description

Once features have been found we should find a suitable representation of the information they provide. This will allow us to look for them in other images and also to obtain a measure of how similar two detected features are when being compared. This is were descriptors roll in.  A descriptor provides a representation of the information given by a feature and its surroundings. Once the descriptors have been computed the object to be recognized can then be abstracted to a feature vector,  which is a vector that contains the descriptors of the keypoints found in the image with the reference object.

This is for sure a nice idea, but how can it actually be done? There are many algorithms that extract image features and compute its descriptors and, since I won’t go into much more detail (a whole post could be devoted only to this) if you are interested in knowing more take a look at SIFT, SURF, or Harris. The one we will be using was developed at the OpenCV Lab and it is called ORB (Oriented FAST and Rotated BRIEF). The shape and values of the descriptor depend on the algorithm used and, in our case,  the descriptors obtained will be binary strings.

With OpenCV, extracting features and its descriptors via the ORB detector is as easy as:

img = cv2.imread('scene.jpg',0)

# Initiate ORB detector
orb = cv2.ORB_create()

# find the keypoints with ORB
kp = orb.detect(img, None)

# compute the descriptors with ORB
kp, des = orb.compute(img, kp)

# draw only keypoints location,not size and orientation
img2 = cv2.drawKeypoints(img, kp, img, color=(0,255,0), flags=0)

Feature matching

Once we have found the features of both the object and the scene were the object is to be found and computed its descriptors it is time to look for matches between them. The simplest way of doing this is to take the descriptor of each feature in the first set, compute the distance to all the descriptors in the second set and return the closest one as the best match (I should state here that it is important to choose a way of measuring distances suitable with the descriptors being used. Since our descriptors will be binary strings we will use Hamming distance). This is a brute force approach, and more sophisticated methods exist.

For example, and this is what we will be also using, we could check that the match found as explained before is also the best match when computing matches the other way around, from features in the second set to features in the first set. This means that both features match each other. Once the matching has finished in both directions we will take as valid matches only the ones that fulfilled the previous condition. Figure 4 presents the best 15 matches found using this method.

Another option to reduce the number of false positives would be to check if the distance to the second to best match is below a certain threshold.  If it is, then the match is considered valid.


Figure 4: Closest 15 brute force matches found between the reference surface and the scene

Finally, after matches have been found, we should define some criteria to decide if the object has been found or not. For this I defined a threshold on the minimum number of matches that should be found. If the number of matches is above the threshold, then we assume the object has been found. Otherwise we consider that there isn’t enough evidence to say that the recognition was successful.

With OpenCV all this recognition process can be done in a few lines of code:

cap = cv2.imread('scene.jpg', 0)    
model = cv2.imread('model.jpg', 0)
# ORB keypoint detector
orb = cv2.ORB_create()              
# create brute force  matcher object
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  
# Compute model keypoints and its descriptors
kp_model, des_model = orb.detectAndCompute(model, None)  
# Compute scene keypoints and its descriptors
kp_frame, des_frame = orb.detectAndCompute(cap, None)
# Match frame descriptors with model descriptors
matches = bf.match(des_model, des_frame)
# Sort them in the order of their distance
matches = sorted(matches, key=lambda x: x.distance)

if len(matches) > MIN_MATCHES:
    # draw first 15 matches.
    cap = cv2.drawMatches(model, kp_model, cap, kp_frame,
                          matches[:MIN_MATCHES], 0, flags=2)
    # show result
    cv2.imshow('frame', cap)
    print "Not enough matches have been found - %d/%d" % (len(matches),


On a final note and before stepping into the next step of the process I must point out that, since we want a real time application, it would have been better to implement a tracking technique and not just plain recognition. This is due to the fact that object recognition will be performed in each frame independently without taking into account previous frames that could add valuable information about the location of the reference object. Another thing to take into account is that, the easier to found the reference surface the more robust detection will be. In this particular sense, the reference surface I’m using might not be the best option, but it helps to understand the process.

Homography estimation

Once we have identified the reference surface in the current frame and have a set of valid matches we can proceed to estimate the homography between both images. As explained before, we want to find the transformation that maps points from the surface plane to the image plane (see Figure 5). This transformation will have to be updated each new frame we process.


Figure 5: Homography between a plane and an image. Source: F. Moreno.

How can we find such a transformation? Since we have already found a set of matches between both images we can certainly find directly by any of the existing methods (I advance we will be using RANSAC) an homogeneous transformation that performs the mapping, but let’s get some insight into what we are doing here (see Figure 6). You can skip the following part (and continue reading after Figure 10) if desired, since I will only explain the reasoning behind the transformation we are going to estimate.

What we have is an object (a plane in this case) with known coordinates in the, let’s say, World coordinate system and we take a picture of it with a camera located at a certain position and orientation with respect to the World coordinate system. We will assume the camera works following the pinhole model, which roughly means that the rays passing through a 3D point p and the corresponding 2D point u intersect at c, the camera center. A good resource if you are interested in knowing more about the pinhole model can be found here.


Figure 6: Image formation assuming a camera pinhole model.  Source: F. Moreno.

Although not entirely true, the pinhole model assumption eases our calculations and works well enough for our purposes. The u, v coordinates (coordinates in the image plane) of a point p expressed in the Camera coordinate system if we assume a pinhole camera can be computed as (the derivation of the equation is left as an exercise to the reader):


Figure 7: Image formation assuming a pinhole camera model. Source: F. Moreno.

Where the focal length is the distance from the pinhole to the image plane, the projection of the optical center is the position of the optical center in the image plane and k is a scaling factor. The previous equation then tells us how the image is formed. However, as stated before, we know the coordinates of the point p in the World coordinate system and not in the Camera coordinate system, so we have to add another transformation that maps points from the World coordinate system to the Camera coordinate system. The transformation that tells us the coordinates in the image plane of a point p in the World coordinate system is then:


Figure 8: Computation of the projection matrix. Source: F. Moreno.

Luckily for us, since the points in the reference surface plane do always have its z coordinate equal to 0 (see Figure 5) we can simplify the transformation that we found above. It can be easily seen that the product of the z coordinate and the third column of the projection matrix will always be 0 so we can drop this column and the z coordinate from the previous equation. By renaming the calibration matrix as A and taking into account that the external calibration matrix is an homogeneous transformation:


Figure 9: Simplification of the projection matrix. Source: F. Moreno.

From Figure 9 we can conclude that the homography between the reference surface and the image plane, which is the matrix we will estimate from the previous matches we found is:


Figure 10: Homography between the reference surface plane and the target image plane. Source: F. Moreno.

There are several methods that allow us to estimate the values of the homography matrix, and you maight be familiar with some of them. The one we will be using is RANdom SAmple Consensus (RANSAC).  RANSAC is an iterative algorithm used for model fitting in the presence of a large number of outliers, and Figure 12 ilustrates the main outline of the process. Since we cannot guarantee that all the matches we have found are actually valid matches we have to consider that there might be some false matches (which will be our outliers) and, hence, we have to use an estimation method that is robust against outliers. Figure 11 illustrates the problems we could have when estimating the homography if we considered that there were no outliers.


Figure 11: Homography estimation in the presence of outliers. Source: F. Moreno.


Figure 12: RANSAC algorithm outline. Source: F. Moreno.

As a demonstration of how RANSAC works and to make things clearer, assume we had the following set of points for which we wanted to fit a line using RANSAC:


Figure 13: Initial set of points. Source: F. Moreno

From the general outline presented in Figure 12 we can derive the specific process to fit a line using RANSAC (Figure 14).


Figure 14: RANSAC algorithm to fit a line to a set of points. Source: F. Moreno.

A possible outcome of running the algorithm presented above can be seen in Figure 15. Note that the first 3 steps of the algorithm are only shown for the first iteration (indicated by the bottom right number), and from that on only the scoring step is shown.


Figure 15: Using RANSAC to fit a line to a set of points. Source: F. Moreno.

Now back to our use case, homography estimation. For homography estimation the algorithm is presented in Figure 16. Since it is mainly math, I won’t go into details on why 4 matches are needed or on how to estimate H. However, if you want to know why and how it’s done, this is a good explanation of it.


Figure 16: RANSAC for homography estimation. Source: F. Moreno.

Before seeing how OpenCV can handle this for us we should  discuss one final aspect of the algorithm, which is what does it mean that a match is consistent with H. What this mainly means is that if after estimating an homography we project into the target image the matches that were not used to estimate it then the projected points from the reference surface should be close to its matches in the target image. How close they should be to be considered consistent is up to you.

I know it has been tough to reach this point, but thankfully there is a reward. In OpenCV estimating the homography with RANSAC is as easy as:

# assuming matches stores the matches found and 
# returned by bf.match(des_model, des_frame)
# differenciate between source points and destination points
src_pts = np.float32([kp_model[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst_pts = np.float32([kp_frame[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
# compute Homography
M, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)


Where 5.0 is the threshold distance to determine if a match is consistent with the estimated homography. If after estimating the homography we project the four corners of the reference surface on the target image and connect them with a line we should expect the resulting lines to enclose the reference surface in the target image. We can do this by:

# Draw a rectangle that marks the found model in the frame
h, w = model.shape
pts = np.float32([[0, 0], [0, h - 1], [w - 1, h - 1], [w - 1, 0]]).reshape(-1, 1, 2)
# project corners into frame
dst = cv2.perspectiveTransform(pts, M)  
# connect them with lines
img2 = cv2.polylines(img_rgb, [np.int32(dst)], True, 255, 3, cv2.LINE_AA) 
cv2.imshow('frame', cap)


which results in:


Figure 17: Projected corners of the reference surface with the estimated homography.

I think this is enough for today. On the next post we will see how to extend the homography we already estimated to project not only points in the reference surface plane but any 3D point from the reference surface coordinate system to the target image. We will then use this method to compute in real time, for each video frame, the specific projection matrix and then project in a video stream a 3D model of our choice from an .obj file. What you can expect at the end of the next post is something similar to what you can see in the gif below:


As always, I will upload the whole code of the project as well as some 3D models to GitHub for you to test when publishing part 2.

Part 2 is already here!

Converting images to ASCII art (Part 2)

Today’s post is the second and last on how to convert images to ASCII art (in case you missed it and want to quickly catch up, here’s part 1). We already have a grayscale version of the image we want to convert and we have also developed a method that allows the user to adjust the contrast of the image. What we have to implement now is the mapping from the grayscale version of the input image to the ASCII art image. This is what, without further ado, we will discuss below.


Execution flow of the program as explained in part 1.

Mapping pixels to ASCII characters

Since what defines the grey level of a pixel in a grayscale image is its intensity value, we have to find a way that allows us to relate or match ASCII characters and intensity values.

It is obvious then that the inputs of the mapping will be intensity values and the outputs ASCII characters. I hope you agree in that a good way to do this mapping is to:

  1. Associate an intensity value to each of the 95 printable ASCII characters. The “darker” the character, the lower its associated intensity value (remember that 0 is black and 255 white). This means that the intensity value associated to the character “#” should be closer to 0 than the intensity value of the character “-” since “#” is “darker” than “-”. We’ll see what “darker” means in a moment. Note that this step has to be done only once and that it can be done at the initialization of the program.
  2. Iterate over the pixels of the image and, for each pixel’s intensity, look up which is the intensity value from the table created in step 1 that is closer to the current pixel’s intensity.
  3. Once this match is found we simply have to retrieve the ASCII character associated to the intensity and substitute the pixel for the ASCII character.

A simplified version of this process can be seen in the Figure below.


Simple example of the desired mapping.

The pseudocode for the transformation will be then:

For i <= Image Height
	For j <= Image Width                     
                Retrieve grayscale intensity at (i,j) -> G(i,j)
		Find min(abs(G(i,j) – ASCII intensity table))
		Get the associated ASCII character
		Draw the ASCII character at (i,j)


Bearing this in mind, what we have to do now before implementing the actual mapping is to develop a method by which we can associate an intensity to each printable ASCII character.

Assigning intensities to ASCII characters

The method I chose for assigning an intensity value to each printable ASCII character is explained below, but let me introduce first an important consideration. This is that, to obtain a good and aesthetic result, the proportions of the original image have to be maintained on the generated ASCII image. To achieve this I decided that each drawn character, independently of its size (“X” needs more space to be drawn that ”.”) would have the same amount of space (a square patch of cxc pixels) assigned in the output image. And that the amount of space assigned to all characters would be that of the character that needs more space to be drawn. By doing this we guarantee that proportions are maintained through the transformation (although the image increases it size by a factor of c2. We’ll see how to solve this later.) We can obtain the size of the patch via:

private static SizeF GetGeneralSize()
        SizeF generalsize = new SizeF(0, 0);
        for (int i = 32; i <= 126; i++) // Iterate through contemplated characters calculating necessary width
           char c = Convert.ToChar(i);
           // Create a dummy bitmap just to get a graphics object
           Image img = new Bitmap(1, 1);
           Graphics drawing = Graphics.FromImage(img);

           // Measure the string to see its dimensions using the graphics object
           SizeF textSize = drawing.MeasureString(c.ToString(), System.Drawing.SystemFonts.DefaultFont);
           // Update, if necessary, the max width and height
           if (textSize.Width > generalsize.Width)
               generalsize.Width = textSize.Width;
           if (textSize.Height > generalsize.Height)
               generalsize.Height = textSize.Height;
           // free all resources
        // Make sure generalsize is made of integers 
        generalsize.Width = ((int)generalsize.Width);
        generalsize.Height = ((int)generalsize.Height);
        // and that size defines a square to maintain image proportions
        if (generalsize.Width > generalsize.Height)
            generalsize.Height = generalsize.Width;
            generalsize.Width = generalsize.Height;

        return generalsize;


Where generalsize will hold the size of the patch. Now, knowing the occupied space in the output image of any character also eases our way of implementing a method that associates an intensity to any character. We can define a ratio that measures the greyness of each character by dividing the number of black pixels over the above computed patch size.

Character greyness = (amount of black pixels)/( c2)

Once we have the ratio computed for all printable ASCII characters we just have to remap the values to the range 0-255 so that they can be matched with pixel intensities. The following two functions compute this ratio and perform the above-mentioned mapping. The first one computes the ratio of each character:

private static double GetWeight(char c, SizeF size)
        var CharImage = HelperMethods.DrawText(c.ToString(), Color.Black, Color.White, size);

        Bitmap btm = new Bitmap(CharImage);
        double totalsum = 0;

        for (int i = 0; i < btm.Width; i++)
            for (int j = 0; j < btm.Height; j++)
                totalsum = totalsum + (btm.GetPixel(i, j).R 
                                    + btm.GetPixel(i, j).G 
                                    + btm.GetPixel(i, j).B)/3;
        // Weight = (sum of (R+G+B)/3 for all pixels in image) / Area. (Where Area = Width*Height )
        return totalsum / (size.Height * size.Width);


And this second one, once the all weights have been computed, maps them to the desired range:

private static List LinearMap(List characters)
   double max = characters.Max(c => c.Weight);
   double min = characters.Min(c => c.Weight);
   double range = 255;
   // y = mx + n (where y c (0-255))
   double slope = range / (max - min);
   double n = -min * slope;
   foreach(WeightedChar charactertomap in characters)
	   charactertomap.Weight = slope * charactertomap.Weight + n;
   return characters;


Finally, we can define another function that will be called when the program starts that performs all these tasks and returns a list of all the printable characters with its associated weights already mapped to the range 0-255. The class WeightedChar stores the character, its weight and the image of the character above a white patch of cxc pixels.

public static List GenerateFontWeights() // Collect chars, their Images and weights in a list of WeightedChar
	List WeightedChars = new List();

	SizeF commonsize = GetGeneralSize(); // Get standard size (nxn square), which will be common to all CharImages

	for (int i = 32; i <= 126; i++) // Iterate through Chars
		var forweighting = new WeightedChar(); // New object to hold Image, Weight and Char of new character
		char c = Convert.ToChar(i);
		if (!char.IsControl(c))
			forweighting.Weight = GetWeight(c, commonsize); // Get character weight
			forweighting.Character = c.ToString(); // Get character representation (the actual char)
			forweighting.CharacterImage = (Bitmap)HelperMethods.DrawText(c.ToString(), Color.Black, Color.White, commonsize); // Get character Image

	WeightedChars = LinearMap(WeightedChars); // Linearly map character weights to be in the range 0-255 -> mapping linearly from: MinCalcWeight - MaxCalcWeight to 0-255; 
											  // This is done to be able to directly map pixels to characters
	return WeightedChars;

One character per pixel?

Now, there is at least one problem with the method we have developed thus far. It is that, for a mxn image, the resulting ASCII image dimensions will be, in pixels, mxnxc2, which might be many times bigger than the original image. As a consequence, when the image gets squeezed to fit in the screen the characters might not be seen clearly. This is, from my point of view, something that should not be left ignored, since the whole point is that the characters are visible.

Luckily, this can be easily solved by, instead of performing a one to one map, mapping a neighbourhood of pxq pixels to a single character by, instead of using the intensity of one pixel as the input for the mapping, using the average intensity of the pxq neighbourhood. This is totally a subjective point as the first mapping (one to one) will obviously maintain more details of the original image. So, why not let the user play with this parameters and choose the ones he prefers?


Updated program GUI to allow the user select the pxq neighbourhood size. I’m by no means a graphical designer, but I tried my best to make it look good.

Taking this point into consideration, the final function that generates the ASCII image can be easily defined as shown below:

public static Image Convert2ASCII(Image BW_Image, int? w, int? h, List<WeightedChar> characters, out List<List<string>> ResultText)

	 *  1- Get target Image size (w=Width,h=Height)
	 *  2- Create Result Image with white background and size W = w*character_image_width
	 *                                                        H = h*character_image_height
	 *  3- Create empty string to hold the text   
	 *  4- for (int j=0;j=target_Image_Height;j++) --> ALL ROWS 
	 *       5- Create row text string
	 *       for (int i=0;i=target_Image_Width;i++) --> ALL COLUMNS
	 *          6- Get target pixel weight
	 *          7- Get closest weight from character list
	 *          8- Paste character image in position w = i*character_image_width
	 *                                               h = j*character_image_height
	 *            ¡¡ Be careful with coordinate system when placing Images !!
	 *          9- Add (string)character to row text string
	 *       10- Add row text string to text holding string
	 *  11 - return resulting Image & Text

	if (w == null & h == null)
		w = 1;
		h = 1;
	int width = (int)w;
	int height = (int)h;

	// Be careful when Image.Width/widh or Image.Height/height are not integer values (coherent approximations)
	Image ResultImage = new Bitmap(BW_Image.Width * characters[0].CharacterImage.Width/width, BW_Image.Height * characters[0].CharacterImage.Height/height);
	Graphics drawing = Graphics.FromImage(ResultImage);
	ResultText = new List<List<string>> { };
	Bitmap BlackAndWhite = (Bitmap)BW_Image;

	for (int j = 0; j < BW_Image.Height; j+=height) // ROW
		List<string> RowText = new List<string> { };
		for (int i=0; i <BW_Image.Width; i+=width) // COLUMN
			double targetvalue = 0;

			for (int nj=j; nj<j+height; nj++)
				for (int ni = i; ni < i+width; ni++)
					// Sum all the pixels in neighbourhood and average
						targetvalue += BlackAndWhite.GetPixel(ni, nj).R;
					catch (Exception)
						targetvalue += 128;
			targetvalue /= (width * height);
			WeightedChar closestchar = characters.Where(t=>Math.Abs(t.Weight-targetvalue)==characters.Min(e => Math.Abs(e.Weight - targetvalue))).FirstOrDefault();
			drawing.DrawImage(closestchar.CharacterImage, (i/width) * closestchar.CharacterImage.Width, (j/height) * closestchar.CharacterImage.Height);
	return (Image)ResultImage;    


The following Figure shows the obtained results for different neighbourhood sizes.


The bigger the size of the neighbourhood, the more visible the characters are. However, less detail is preserved.

Final notes

There are many improvements that could be done to make the program more efficient and more functionalities could be implemented. However, I will let those improvements to you, and what has been explained here should be enough to get you started.

Finally, any positive/constructive criticism is welcome and, for those interested, this version of the project can be found here and the code can also be found here.

Converting images to ASCII art (Part 1)

In the last post we discussed a simple method for procedurally generating 2D terrain. In today’s post, we will also generate an image that will be the final output of our program, but with a totally different purpose and method. Today we will try to generate ASCII art from images.  For those of you who are not familiar with ASCII art, what we will achieve will look (more or less) like the image below.


Original image (left) and output image (right).

Not as impressive as you thought? If still not yet convinced (I promise this point will be revisited later on) take a look at the image at full resolution and stare at it after moving a few feet away from the screen. And yes, it is only made of ASCII characters! This can be quickly checked by zooming in any section of the image.


Zooming a region of the image until characters are visible.

It should be clear from the previous example that the output image has two main characteristics: it is composed solely of ASCII characters and it looks like the original image. This is our main goal. However, we will implement and discuss some extra features or alternative implementations along the way in order to improve the results. Finally, before we embark I would like to point out three things:

  1. The ideas that will be explained can be extended to any kind of process were the resulting image is a composition of other images and were the aim is to resemble a target or input image.
  1. The post is about the algorithms and ideas that make the program work and not about a detailed explanation of the whole program.
  2. The example code presented is written in C#. However, the main ideas can be easily implemented in other programming languages. (Disclaimer: I am currently learning C# and this was my first project – done with learning purposes -. Sorry if the code isn’t as clean or efficient as it could be or if the project architecture isn’t the ideal one. Any suggestions on how to improve it are welcome. I am not teaching how to code in C#.)


One of the (first) things I tend to do when developing software is to devise which are the steps the program should take in order to achieve the desired outcome. It is not the only method to develop a general scheme of what the program has to do, but in this case we will work our way backwards. Starting from the ASCII art image we will try to guess from where did it come and how it was obtained.

Clearly there has been a point in the process of generating the ASCII art image where each pixel or region of pixels from the original image has been mapped to a certain character. Since the final output is the ASCII image we won’t be too wrong if we assume that this is the last step of our process.


Last step of the process.

Now we have to deduce if the image used as input for the mapping is the original image or if there has been some pre-processing applied to it before this mapping step. Since the ASCII art image is made only of black characters over a white background, which from a certain distance looks like a grayscale image, a good guess is that a grayscale version of the original image is the image used as a reference for the mapping.


Getting closer to know the full the process.

Since a grayscale version of the original image can be easily obtained from the original image the final scheme of the process is:


Overview of the process.

A more general way of defining the “Convert to grayscale” step would be “Image pre-processing”. This would include, in the case we wanted to perform more operations than just a grayscale conversion, all the different operations and transformations applied to the image before the final mapping of the image to ASCII characters.

Image pre-processing

Grayscale conversion

There are many methods that can be applied to obtain a grayscale version of an RGB image and I will only focus on the one I decided to use. What I applied is a conversion that preserves the luminance (or luminous intensity) of the original image in the grayscale image. Each pixel’s intensity or grey value is simply the weighted sum of the values of its RGB components computed as follows:

I = 0.299·R+0.587·G+0.114·B ≈ 0.3·R+0.59·G+0.11·B

It is not the most efficient way, but in C# this can be easily achieved with:

public static Bitmap Grayscale(Image image)
Bitmap btm = new Bitmap(image);
      for (int i = 0; i < btm.Width; i++)
      	for (int j = 0; j < btm.Height; j++)
            	int ser = (int)(btm.GetPixel(i, j).R*0.3 + btm.GetPixel(i, j).G*0.59 + btm.GetPixel(i, j).B*0.11);

                    btm.SetPixel(i, j, Color.FromArgb(ser, ser, ser));
       return btm;


Were we are simply traversing all the pixels in the original image, computing the weighted sum and then assigning the result as the pixel’s new intensity. Now, we could keep going from here, but since the final results depend largely on the grayscale version of the original image, wouldn’t it be nice if we allowed the user to have some control over the grayscale conversion? One way to do this is to let the user adjust the contrast of the original image and when done, convert it to grayscale. We could let the user adjust directly the contrast of the grayscale image but I preferred to develop a simple interface that only displays two images: the original and the one made of ASCII characters which is why I decided to implement the contrast adjustment on the original image.


Updated program execution flow adding the contrast adjustment step.

Contrast adjustment

So, what is the contrast of an image and how can we modify it? Basically, the contrast of an image is the difference between the maximum and minimum pixel intensity, the “amount” of separation between its darkest and brightest areas. It is then determined by the colour and brightness of the different objects or shapes that appear in the image. This means that the higher the contrast, the bigger the difference between pixel intensities, and the easier it will be to recognize the different objects in the image (due to this bigger range of intensities). This can be appreciated in the image below.


Contrast variation. The contrast increases from bottom to top on the left column and from top to bottom on the right column. Source: Wikipedia.

Once knowing this it’s time to tackle the big question: How can we modify (increase or decrease) the contrast of an image?

We will achieve this by applying to the image a pixel by pixel operation (as we did when grayscaling) that modifies each pixel’s intensity with the objective of expanding or contracting the range of intensity values or luminance an image has. The value of the output image at a particular pixel will depend only on the value of the pixel located at that same position on the original image. This kind of operations are called in Computer Vision point operations.


Schematic of a point operation.

Note that this implies that two pixels that have the same intensity in the original image will be mapped to the same, and usually different from the original, intensity value in the output image. The transformation is then a single variable function

g(x,y) = T( f(x,y) )

Where g(x,y) is the intensity of the pixel (x,y) in the output image, f(x,y) is the intensity of the original image at the same position and T is the applied transformation. The pixel position in the image doesn’t play any role at all and its intensity is the only interesting property when varying the contrast. Point operations can be plotted in the same way single variable functions are. On the x axis, we will have the range of values the input property can take (I) and, on the y axis, the mapped values that result from the transformation (I’). We will make use of this to derive intuitively our contrast adjustment transformation. As an example, how would a transformation that didn’t modify the image at all look like?


Point operations identity transformation.

Yep, that’s right. Each intensity value in the original image would be mapped to its same value in the output image. Recall now that what we want to achieve when increasing the contrast is to make wider the range of intensities in the image. Any ideas on the shape of the transformation that would achieve this? You probably guessed right. If we increase the slope of the line presented above, say double it, any considered range of intensities in the input image will be mapped to a range in the output image twice as big. And if we decrease the slope, say to a half, any considered range of intensities in the input image will half its size in the output image. This can be appreciated in the image below.


Deriving our contrast adjustment transformation.

We are getting close… Note that all the lines presented above have a point in common, the origin in this case. This is nice because if we apply the transformation directly to the original image, without creating a copy of it, having a fixed point will allow us to recover the original image. This wouldn’t be possible without having a fixed point. However, is the origin the best fixed point we can get? I don’t think so. Note that increasing the contrast in this way will result in many intensities being mapped to 255, the maximum allowed intensity in images that store each channel’s intensity with 8 bits. I hope that you agree in that it would be better to divide this region of “out of bounds” intensities between the max and min allowed intensities. This would make the loss of information symmetric. This results in choosing 127.5 as the fixed point of the transformation. Then our contrast transformation so far looks like

I’ = (I-127.5)·c + 127.5

where 127.5 can be rounded to 127 or 128.


Finally, our contrast adjustment transformation.

Not that difficult eh? See how in the previous figure half of the values “out of bounds” get mapped to 0 and the other half to 255?  The value of c will be the one that decides how much we increase or decrease the contrast. Its value can be decided in many ways. In my case I am using a quadratic function to determine the value of c. A scrollbar in the GUI of the program reads a value between -100 and 100. This value is then mapped to the range [0,2] and squared, so that c finally varies between 0 and 4. Finally, note that once knowing the value of c the mapping of intensities can be already computed so storing the transformations in a lookup table will improve the efficiency and speed of this step.


Example of the implemented contrast adjustment in action. For anyone who’s wondering, yeah, on the bottom right image you can see a nrf24L01+ module connected to a Pyboard.


The code below, which is in charge of varying the contrast should now be easy to follow. It is based on this BitBank’s  stackoverflow answer.

public unsafe static void AdjustContrast(Bitmap bmp, double contrast)
      	byte[] contrast_lookup_table = new byte[256];
            double newValue = 0;
            double c = (100.0 + contrast) / 100.0;

            c *= c;

            for (int i = 0; i < 256; i++)
            	newValue = (double)i;
                  newValue /= 255.0;
                  newValue -= 0.5;
                  newValue *= c;
                  newValue += 0.5;
                  newValue *= 255;

                  if (newValue < 0)
                  	newValue = 0;
                  if (newValue > 255)
                  	newValue = 255;
                  contrast_lookup_table[i] = (byte)newValue;

            var bitmapdata = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height),
            System.Drawing.Imaging.ImageLockMode.ReadWrite, System.Drawing.Imaging.PixelFormat.Format32bppArgb);

            int PixelSize = 4;

            for (int y = 0; y < bitmapdata.Height; y++)
            	byte* destPixels = (byte*)bitmapdata.Scan0 + (y * bitmapdata.Stride);
                  for (int x = 0; x < bitmapdata.Width; x++)
                  	destPixels[x * PixelSize] = contrast_lookup_table[destPixels[x * PixelSize]]; // B
                  	destPixels[x * PixelSize + 1] = contrast_lookup_table[destPixels[x * PixelSize + 1]]; // G
                  	destPixels[x * PixelSize + 2] = contrast_lookup_table[destPixels[x * PixelSize + 2]]; // R


I think that’s enough for today. We now have finished the pre-processing of the image (contrast adjustment + grayscaling) and on the next post we’ll see how to finally get the ASCII image. If you enjoyed it so far be sure to stay tuned for part 2!

PS: If you want to play with the code or take a look at it, it is available hereI wanted to upload the final program (.exe) somewhere so you can download it and play with it. However, I have no idea on the best way to achieve this so it is easy for you to get it. If you have any ideas please tell me. In the meantime, you can reach me and ask for it via e-mail at

Update: You can download the compiled program now here. (Thanks to Campbell for pointing out that this could be easily done via Github releases.)

PS2: I know I haven’t revisited yet the point I made at the beginning but I will, for sure, revisit it on the next post.

Landscape generation using midpoint displacement

Today I will present how to implement in Python a simple yet effective algorithm for proceduraly generating 2D landscapes. It is called Midpoint Displacement (or Diamond-square algorithm, which seems less intuitive to me) and, with some tweaking it can also be used for creating rivers, lighting strikes or (fake) graphs. The final output may look like the following image.


Example terrain generated with the presented algorithm.

Algorithm overview

The main idea of the algorithm is as follows: Begin with a straight line segment, compute its midpoint and displace it by a bounded random value. This displacement can be done either by:

  1. Displacing the midpoint in the direction perpendicular to the line segment.
  2. Displacing only the y coordinate value of the midpoint.

Different methods for displacing the midpoint.

This first iteration will result in two straight line segments obtained from the displacement of the midpoint of the original segment.  The same process of computing and displacing the midpoint can be then applied to each of this new two segments and it will result in four straight line segments. Then we can repeat the process for each of this four segments to obtain eight and so on. The process can be repeated iteratively or recursively as many times as desired or until the segments cannot be reduced more (for graphical applications this limit would be two pixel’s width segments). The following image may help to clarify what I just said.


From top to bottom, successive iterations of the algorithm.

And that’s it! This is the core idea of the midpoint displacement algorithm. In pseudocode it looks like:

Initialize segment
While iterations < num_of_iterations and segments_length > min_length:
	For each segment:
		Compute midpoint
		Displace midpoint
		Update segments
        Reduce displacement bounds

However, before implementing the algorithm we should dig deeper in some of the concepts that have arisen so far. These are mainly:

  • How much should we displace the midpoint?
  • How much should the displacement bounds be reduced after each iteration?

How much should we displace the midpoint?

Sadly, there is no general answer for this question because it greatly depends on two aspects:

  1. The application the algorithm is being used for
  2. The desired effect

Since in this post our scope is terrain generation I will limit the explanation to the effects that this parameter has in this area. However, the ideas that I will explain now can be extrapolated to any other application where the Midpoint Displacement algorithm may be used. As I see it, there are two key considerations that should be taken into account when deciding the initial displacement value.

First of all, we should consider which is the desired type of terrain. Hopefully it makes sense to say that the bigger the mountains we want to generate the bigger the initial displacement value should be and viceversa. With a bit of trial and error it is easy to get an idea of the average profiles generated by different values and how do they look. The point here is that bigger mountains need bigger initial displacement values.

Secondly,  the overall dimensions (width and height) of the generated terrain. The initial displacement should be regarded as a value which depends on the generated terrain dimensions.  What I want to say is that an initial displacement of 5 may be huge when dealing with a 5×5 image but will hardly be noticed in a 1500×1500 image.


The same initial displacement value may be to big for a certain image size but may suit well another image size.

How much should the bounds be reduced after each iteration?

Well, the answer again depends on which is the desired output. It should be intuitive that the smaller the displacement reduction the more jagged the obtained profile will be and viceversa. The two extremes are no displacement reduction at all and setting the displacement to 0 after the first iteration. This two cases can be observed in the figure below.


On the left: no displacement reduction. On the right: Displacement reduction to 0 after first iteration.

Somewhere in between is the displacement reduction that will yield the desired output. There are plenty of ways to reduce the displacement bounds each iteration (linearly, exponentially, logarithmically, etc.) and I encourage you to try different ones and see how the results vary.

What I did was define a standard displacement reduction of 1/2, which means that the displacement is reduced by half each new iteration,  and a displacement decay power such that the displacement reduction is

displacement_reduction = 1/(2^i)


displacement_bounds(k+1) = displacement_bounds(k)*displacement_reduction

were is the current iteration and k+1 the next iteration. We can then talk about the obtained terrain profiles in terms of this decay power i. Below you can see how the algorithm performs for different decay powers.


Obtained profiles for different values of  the displacement decay powers.

Bear in mind that the two factors we just saw, the bounds reduction and initial displacement are related one to the other and that they do not affect the output independently. Smaller initial displacements may look good with smaller decay powers and viceversa. Here we have talked about some guidelines that may help when deciding which values to use but there will be some trial and error until the right parametres for the desired output are found. Finally, the number of iterations is another factor that also affects the output in relation with the initial displacement and the bounds reduction.

Python implementation

Finally it is time to, with all the ideas explained above, code our 2D terrain generator. For this particular implementation I have decided to:

  • Displace only the y coordinate of the midpoints (Second of the two displacement methods explained at the begining).
  • Use symmetric bounds with respect to zero for the displacement (if b is the upper bound then –b will be the lower bound.)
  • Choose the displacement value to be either the upper bound or the lower bound, but never allow values in between.
  • Reduce the bounds after each iteration by multiplying the current bounds by 1/(2^i)

We will have three functions: one that will generate the terrain, one that will draw the generated terrain and one that will handle the above processes.

Before implementing the functions we should first import the modules that we will use:

import os                             # path resolving and image saving
import random                         # midpoint displacement
from PIL import Image, ImageDraw      # image creation and drawing
import bisect                         # working with the sorted list of points

Terrain generation

For the terrain generation we need a function that, given a straight line segment returns the profile of the terrain. I have decided to provide as inputs the initial segment and displacement, the rate of decay or roughness of the displacement and the number of iterations:

# Iterative midpoint vertical displacement
def midpoint_displacement(start, end, roughness, vertical_displacement=None,
    Given a straight line segment specified by a starting point and an endpoint
    in the form of [starting_point_x, starting_point_y] and [endpoint_x, endpoint_y],
    a roughness value > 0, an initial vertical displacement and a number of
    iterations > 0 applies the  midpoint algorithm to the specified segment and
    returns the obtained list of points in the form
    points = [[x_0, y_0],[x_1, y_1],...,[x_n, y_n]]
    # Final number of points = (2^iterations)+1
    if vertical_displacement is None:
        # if no initial displacement is specified set displacement to:
        #  (y_start+y_end)/2
        vertical_displacement = (start[1]+end[1])/2
    # Data structure that stores the points is a list of lists where
    # each sublist represents a point and holds its x and y coordinates:
    # points=[[x_0, y_0],[x_1, y_1],...,[x_n, y_n]]
    #              |          |              |
    #           point 0    point 1        point n
    # The points list is always kept sorted from smallest to biggest x-value
    points = [start, end]
    iteration = 1
    while iteration <= num_of_iterations:
        # Since the list of points will be dynamically updated with the new computed
        # points after each midpoint displacement it is necessary to create a copy
        # of the state at the beginning of the iteration so we can iterate over
        # the original sequence.
        # Tuple type is used for security reasons since they are immutable in Python.
        points_tup = tuple(points)
        for i in range(len(points_tup)-1):
            # Calculate x and y midpoint coordinates:
            # [(x_i+x_(i+1))/2, (y_i+y_(i+1))/2]
            midpoint = list(map(lambda x: (points_tup[i][x]+points_tup[i+1][x])/2,
                                [0, 1]))
            # Displace midpoint y-coordinate
            midpoint[1] += random.choice([-vertical_displacement,
            # Insert the displaced midpoint in the current list of points         
            bisect.insort(points, midpoint)
            # bisect allows to insert an element in a list so that its order
            # is preserved.
            # By default the maintained order is from smallest to biggest list first
            # element which is what we want.
        # Reduce displacement range
        vertical_displacement *= 2 ** (-roughness)
        # update number of iterations
        iteration += 1
    return points

The initial line segment is specified by the coordinates of the points where it begins and ends. Both are a list in the form:

point = [x_coordinate, y_coordinate]

And the output is a list of lists containing all the points that should be connected to obtain the terrain profile in the form:

points = [[x_0, y_0], [x_1, y_1], ..., [x_n, y_n]]

Terrain drawing

For the graphical output we need a function that returns an image of the drawn terrain and that takes as inputs at least the profile generated by the midpoint displacement algorithm. I have also included as inputs the desired width and height of the image and the colors it should use for painting. What I did for drawing several layers of terrain was start with the layer in the background and draw each new layer on top of the previous one.

For drawing each layer I first infer the value of every value in the range (0, image width) based on the assumption that the known points, the ones obtained from the midpoint displacement, are connected with straight lines. Once knowing the value of each value in the range (0, image width) I traverse all the values iteratively and for each x value draw a line from its value to the bottom of the image.

def draw_layers(layers, width, height, color_dict=None):
    # Default color palette
    if color_dict is None:
        color_dict = {'0': (195, 157, 224), '1': (158, 98, 204),
                      '2': (130, 79, 138), '3': (68, 28, 99), '4': (49, 7, 82),
                      '5': (23, 3, 38), '6': (240, 203, 163)}
        # len(color_dict) should be at least: # of layers +1 (background color)
        if len(color_dict) < len(layers)+1:
            raise ValueError("Num of colors should be bigger than the num of layers")

    # Create image into which the terrain will be drawn
    landscape ='RGBA', (width, height), color_dict[str(len(color_dict)-1)])
    landscape_draw = ImageDraw.Draw(landscape)
    # Draw the sun
    landscape_draw.ellipse((50, 25, 100, 75), fill=(255, 255, 255, 255))
    # Sample the y values of all x in image
    final_layers = []
    for layer in layers:
        sampled_layer = []
        for i in range(len(layer)-1):
            sampled_layer += [layer[i]]
            # If x difference is greater than 1
            if layer[i+1][0]-layer[i][0] > 1:
                # Linearly sample the y values in the range x_[i+1]-x_[i]
                # This is done by obtaining the equation of the straight
                # line (in the form of y=m*x+n) that connects two consecutive
                # points
                m = float(layer[i+1][1]-layer[i][1])/(layer[i+1][0]-layer[i][0])
                n = layer[i][1]-m*layer[i][0]
                r = lambda x: m*x+n  # straight line
                for j in range(layer[i][0]+1, layer[i+1][0]):  # for all missing x
                    sampled_layer += [[j, r(j)]]  # Sample points
        final_layers += [sampled_layer]

    final_layers_enum = enumerate(final_layers)
    for final_layer in final_layers_enum:
        # traverse all x values in the layer
        for x in range(len(final_layer[1])-1):
            # for each x value draw a line from its y value to the bottom
            landscape_draw.line((final_layer[1][x][0], height-final_layer[1][x][1],
                                 final_layer[1][x][0], height),

    return landscape

The PIL module (and mostly all the modules that allow working with images) sets the origin of coordinates on the top left corner of the image. Also, the x value increases when moving right and the value when moving down. The values that have to be passed to the function that draws the lines have to be expressed in this system of reference and that is why for drawing the desired line its y values have to be transformed from our reference system (origin lower left) to PIL’s reference system.


Differences between PIL’s reference system and the one we have been using so far.

With these two functions we are now able to actually compute and draw our 2D proceduraly generated terrain.

Our main function

The final step is to define our main function. This function will compute the profiles of the desired number of layers, draw them and save the obtained terrain as a .png image:

def main():
    width = 1000  # Terrain width
    height = 500  # Terrain height
    # Compute different layers of the landscape
    layer_1 = midpoint_displacement([250, 0], [width, 200], 1.4, 20, 12)
    layer_2 = midpoint_displacement([0, 180], [width, 80], 1.2, 30, 12)
    layer_3 = midpoint_displacement([0, 270], [width, 190], 1, 120, 9)
    layer_4 = midpoint_displacement([0, 350], [width, 320], 0.9, 250, 8)

    landscape = draw_layers([layer_4, layer_3, layer_2, layer_1], width, height)'\\testing.png')

To call the main() function when we run the program we finally add the lines:

if __name__ == "__main__":

And we’re done! Now it’s your turn to code your own terrain generator and play with its different parametres for modifying the results (you can also change the colors). If you have any doubts do not hesitate to contact me. You can find the whole code at github.

As a bonus, some more terrain images obtained with the previous code:


Midpoint Displacement 2D generated landscapes.