Monday, February 22, 2016

Building a kernel for Ubuntu/Debian

I regularly need to compile and modify kernels on Ubuntu servers.
There are a lot of different tutorials on the web, each one giving a different method.

Here is what I need when building a new kernel:
  • I want to build packages (deb here, so the result works on both Ubuntu and Debian). The methods that propose to only build the vmlinux and initrd images and copy them by hand are not easily maintainable.
  • I want to keep the configuration simple: copy what works and explicitly modify only what I need.
  • I want debug symbols and the kernel source, to install kprobes via the perf tool user interface.

Here is the script I use to do what I want.
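A condensed sketch of it follows; the kernel version, the local suffix and the package list are assumptions to adapt to your setup:

#!/bin/bash
# Kernel-build sketch for Ubuntu/Debian: produces .deb packages,
# including debug symbols. Version and suffix are examples.
set -e

VERSION=4.4.2          # kernel version to build
SUFFIX=-custom         # appended to the kernel version and package names

# Tools needed to build the kernel and its .deb packages
sudo apt-get install -y build-essential fakeroot bc \
     libncurses5-dev libssl-dev

# Fetch and unpack the kernel source
wget "https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-$VERSION.tar.xz"
tar xf "linux-$VERSION.tar.xz"
cd "linux-$VERSION"

# Copy what works: start from the running kernel's configuration
cp "/boot/config-$(uname -r)" .config
# Keep debug symbols (for perf and kprobes)
./scripts/config --enable DEBUG_INFO
# Take the default value for any new option...
make olddefconfig
# ...and modify explicitly what I need
make menuconfig

# Build the source and binary .deb packages (linux-image, linux-headers
# and, since CONFIG_DEBUG_INFO is set, linux-image-*-dbg)
make -j"$(nproc)" deb-pkg LOCALVERSION="$SUFFIX"

# The packages end up in the parent directory
ls ../linux-*.deb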

I have not yet found a way to build the linux-tools package (all my previous attempts failed).
If you know how to do it, please share it in the comments.

Happy kernel hacking!

Wednesday, January 20, 2016

Hooking into the kernel: real-time code execution at kernel level

It has been a long time since I last wrote a blog post here.
Today I will explain how to execute a piece of code every time a given location in the kernel is reached, for example to compute in-kernel statistics.

Linux Kernel Inspection Mechanisms:

There are several mechanisms inside the kernel that allow instrumentation and debugging (see this talk for a complete view of the tools and mechanisms available on Linux).
Today, we will dive into tracepoints, one of the most used mechanisms.

Tracepoints are points of interest placed at specific kernel code locations by kernel developers. A tracepoint is composed of a name, which identifies the kind of event it represents, alongside meta-data, which are elements specific to this event.
When a tracepoint is hit by a task, a tool can capture the hit (an event) with all the meta-data, the time (with nanosecond precision), which task hit it, and on which core. It is also possible to capture the call-chain that led to the tracepoint (i.e., the sequence of nested function calls leading to the event).

Examples of tracepoints are:

  • Scheduling events: we can capture when a task is scheduled and another descheduled, on which core, and the time (in nanoseconds) at which the event happens. It is also possible to get the call-chain of the descheduled task, helping to answer questions like: was the task descheduled due to a preemption by the scheduler, or due to a blocking action (and if so, which one, e.g., waiting for an I/O operation, sleeping on a lock, etc.)?
  • Syscall enter/exit: one can capture each time a task issues a specific syscall, and when it returns from the syscall. We can thus know, for example, how many times a task issues a syscall, at which frequency, or how much time it spends inside the syscall on average.
Tools like perf or ftrace allow one to record a trace of all tracepoint hits with the associated meta-data. These tools are really useful to understand, for example, which code-paths are executed, in which context, and at which frequency.
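For instance, here is a minimal perf session recording every context switch on the system:

# Record every sched_switch event, on all cores, for 5 seconds
sudo perf record -e sched:sched_switch -a -- sleep 5
# Print each recorded event with its timestamp, task and core
sudo perf script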
A well-known example of tracepoint usage is FlameGraphs. FlameGraphs visualize how much time a set of tasks (e.g., all the threads of a process) spends in each code-path.

Unfortunately, tools like perf and ftrace only allow offline analysis: we cannot, for example, compute statistics about tracepoints in real-time (i.e., each time a tracepoint is hit). This is not a problem for performance troubleshooting, but it is not suited to dynamic tools that need real-time in-kernel statistics to make decisions.

For the rest of the article, let us assume that our goal is to compute statistics about task scheduling.

Our own kernel module:

To achieve our goal, we will need to execute our own function each time a tracepoint is hit. To this end, we will need to get our hands dirty and write our own kernel module.
Below is a template of such a module, with the Makefile to build it.
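This is a minimal sketch rather than a full implementation: the probes only count hits, and the probe prototypes shown match Linux 4.3 and later (check include/trace/events/sched.h for your version). The Makefile is given in the trailing comment.

/*
 * tp_template.c - hook the sched_switch and sched_wakeup tracepoints.
 *
 * Each probe receives a leading void *data argument, followed by the
 * arguments declared in TP_PROTO in include/trace/events/sched.h.
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/sched.h>
#include <linux/string.h>
#include <linux/tracepoint.h>

static unsigned long nb_switches, nb_wakeups;

static void probe_sched_switch(void *data, bool preempt,
                               struct task_struct *prev,
                               struct task_struct *next)
{
        nb_switches++; /* a real module would compute statistics here */
}

static void probe_sched_wakeup(void *data, struct task_struct *p)
{
        nb_wakeups++;
}

static struct tracepoint_entry {
        const char *name;       /* name of the tracepoint */
        void *probe;            /* function executed on each hit */
        struct tracepoint *tp;  /* filled at init time */
} tracepoints_table_interests[] = {
        { .name = "sched_switch", .probe = probe_sched_switch },
        { .name = "sched_wakeup", .probe = probe_sched_wakeup },
};

/* Called once per kernel tracepoint: remember the ones we care about. */
static void lookup_tracepoints(struct tracepoint *tp, void *ignore)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(tracepoints_table_interests); i++)
                if (!strcmp(tp->name, tracepoints_table_interests[i].name))
                        tracepoints_table_interests[i].tp = tp;
}

static int __init tp_template_init(void)
{
        int i;

        for_each_kernel_tracepoint(lookup_tracepoints, NULL);

        for (i = 0; i < ARRAY_SIZE(tracepoints_table_interests); i++) {
                if (!tracepoints_table_interests[i].tp) {
                        pr_warn("tracepoint %s not found, skipping\n",
                                tracepoints_table_interests[i].name);
                        continue;
                }
                tracepoint_probe_register(tracepoints_table_interests[i].tp,
                                          tracepoints_table_interests[i].probe,
                                          NULL);
        }
        return 0;
}

static void __exit tp_template_exit(void)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(tracepoints_table_interests); i++)
                if (tracepoints_table_interests[i].tp)
                        tracepoint_probe_unregister(
                                tracepoints_table_interests[i].tp,
                                tracepoints_table_interests[i].probe, NULL);

        /* wait until in-flight probe calls are done before unloading */
        tracepoint_synchronize_unregister();

        pr_info("%lu context switches, %lu wakeups\n",
                nb_switches, nb_wakeups);
}

module_init(tp_template_init);
module_exit(tp_template_exit);
MODULE_LICENSE("GPL"); /* required to use the tracepoint API */

/*
 * Makefile (kbuild) to build the module:
 *
 *   obj-m := tp_template.o
 *
 *   all:
 *           make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
 *   clean:
 *           make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
 */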

This example shows how to capture the sched_switch and sched_wakeup tracepoints. I made the code simple enough to be understood without any additional explanation. However, if you have any questions, please ask in the comments and I will try to answer.
To capture a new tracepoint, you have to create a function named probe_{tracepoint} and add an entry to the tracepoints_table_interests table. The prototypes of the tracepoint functions are available in the files stored under include/trace/events/ in the kernel source code (see here).
Be careful when developing your module: your code will be executed inside the kernel, which means that you can easily crash or freeze your machine (it happened to me with dynamic memory allocation). I advise you to use a virtual machine (like kvm/qemu) to test your module (this will be the subject of a future article).

Please post links to your module in the comments, I am interested in seeing what can be made with tracepoints.

Existing alternatives:

There are several alternatives that avoid creating your own module.
DTrace and SystemTap are two tools allowing you to execute code on each tracepoint hit.
The first one is not available on Linux (or is considered very unstable); the second automatically generates a module from a script written in a dedicated language.
One of the main advantages of SystemTap is that if your script compiles, it will not crash the kernel.
However, the scripting language is limited, and if you target performance, a carefully crafted module written in C will be more efficient than a SystemTap-generated module.
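For illustration, here is a small SystemTap script (a sketch; the file name switches.stp is hypothetical) that hooks the same sched_switch tracepoint as our module and counts context switches per second:

global switches

# executed on every hit of the sched_switch tracepoint
probe kernel.trace("sched_switch") { switches++ }

# print and reset the counter every second
probe timer.s(1) {
    printf("%d context switches/s\n", switches)
    switches = 0
}

Run it with stap switches.stp (root privileges required); SystemTap compiles the script into a module, loads it, and prints the output.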

A more interesting alternative is eBPF. This system, part of the kernel, allows you to run your code inside a process virtual machine (like the JVM), with good performance and the assurance that it will not crash or freeze the kernel. However, this mechanism is still experimental and is not available before Linux 4.0.

Finally, it is interesting to know that you are not limited to tracepoints (which are inserted by kernel developers): you can dynamically insert probes (to run your code on each hit) anywhere you want inside the kernel: these are kprobes. If one of you is willing to create a module template that installs hooks on kprobes, I will be happy to add a note about it in this article.

I hope you will find this article useful. Happy kernel hacking!

Sunday, November 8, 2015

Clang optimization bug

While refining my C++ skills, I encountered a Clang optimization bug.
The original post was on Stack Overflow and I also submitted a bug report.

The code is essentially the following:
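(This is a minimal reconstruction consistent with the output below; the string literal "abcd" and the timing boilerplate are assumptions. The key point is that both loops reassign the same variable ret.)

#include <cstdio>
#include <cstring>
#include <chrono>

static const char str[] = "abcd";
static const int iterations = 1000000000;

int main() {
    using namespace std::chrono;
    size_t ret = 0;

    auto t0 = steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        ret = strlen(str);      // loop removed: strlen(str) is constant-folded
    auto t1 = steady_clock::now();
    std::printf("Strlen %lld ret=%zu\n",
                (long long)duration_cast<microseconds>(t1 - t0).count(), ret);

    ret = 0;
    auto t2 = steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        ret = sizeof(str);      // constant too, yet an empty loop remains
    auto t3 = steady_clock::now();
    std::printf("Sizeof %lld ret=%zu\n",
                (long long)duration_cast<microseconds>(t3 - t2).count(), ret);

    return 0;
}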

The result of this source code is unexpected:
clang++ -O3 -Wall -o sizeof_vs_strlen sizeof_vs_strlen.cpp

Strlen 36 ret=4
Sizeof 62092396 ret=5

Indeed, when you execute the code, you observe that the first loop is removed, but the second one is not. This is unexpected: since sizeof is known at compile-time, you would expect Clang to also remove the second loop.

The next step is to look at the generated assembly code:

mov    %edx,%r14d
shl    $0x20,%r14
mov    $0x3b9aca01,%ecx
xchg   %ax,%ax
add    $0xffffffed,%ecx // 0x400ad0
jne    0x400ad0 <main+192>
mov    %eax,%eax
or     %rax,%r14

Clang leaves an empty loop (the add/jne pair) that is totally useless, and this is the bug.
According to Mats Petersson, the bug is related to the reuse of the same ret variable.

I will keep you updated when the bug is fixed.

Friday, September 25, 2015

Backup Overleaf git repository

I often use Overleaf when I need to collaboratively write an academic document. But for obvious safety reasons, I like to back up the document on a regular basis to a personal remote server.

Overleaf offers a git repository for any document. The git repository URL can be retrieved by clicking on "Share" in the top menu bar. A git commit is created each time you save a new version of your document ("Versions" in the top menu bar).

I mirror the git repository on a remote server and synchronize it each night. To create the mirror repository, you need to git clone the Overleaf repository on your remote server:

git clone --mirror overleaf_repository_url directory_name.git

If your remote server needs to use a proxy to reach the internet, you can use the following command (instead of the previous one):

https_proxy=https://cache_url git clone --mirror overleaf_repository_url directory_name.git

The next step is to add a cron task that synchronizes your git repository with the Overleaf repository every night. This cron entry updates the repository each night at midnight:

# Open cron:
crontab -e

# Without proxy (copy the line at the end of the file):
0 0 * * * cd directory_name.git && git fetch -q --all -p

# With proxy
0 0 * * * cd directory_name.git && https_proxy=https://cache_url git fetch -q --all -p
# Save the file

You can repeat the process (clone, then crontab) for each of your Overleaf repositories. Do not forget to create git commits by creating new versions of your document on Overleaf (the last non-versioned modifications do not seem to be saved in the git repository).

Happy writing!

Sunday, September 20, 2015

Managing my research sources with Zotero

As a new PhD student, I read a lot of papers. I like to keep notes on what I find interesting about a paper, along with a short summary of it (e.g., context, problem, solution).

To manage all my notes, I use Zotero. Zotero is a cross-platform solution enabling me to collect and organize my readings.
My method is pretty straightforward:

  1. Print out the paper and read it: I find reading on paper easier and it allows me to be more focused.
  2. Search for the paper's metadata on Google: I generally go to the ACM library.
  3. Add the paper's metadata to Zotero using the browser extension. For example, when you go to an ACM web page, the extension adds a small icon inside the URL bar. When you click on it, it automatically adds the paper's metadata to Zotero.
  4. Download and add the paper's PDF to the bibliographic entry inside Zotero (to avoid searching the paper all the time).
  5. Write my notes inside Zotero: the notes are attached to the bibliographic entry.
  6. Automatically synchronize the metadata, the notes and the PDF to my Zotero account: this way, I can access my bibliography from any device.
With Zotero, I can easily export an entry in several formats (e.g., BibTeX). I also have the possibility to synchronize my account with Overleaf. This way, I can cite a paper without needing to export the entry and copy it into a BibTeX file.

Wednesday, September 2, 2015

Master Thesis: Improving the Performance of Multi-Tier Applications on Multicore Architectures

During the last year, I was an intern inside the ERODS team, under the supervision of Renaud Lachaize and Vivien Quéma.

My research subject was the evaluation of the impact of different task placement strategies for multi-tier applications deployed on a single multicore machine.
The idea was to understand how to execute multi-tier applications (e.g., on how many cores? on which cores?) on a modern multicore machine (e.g., with 48 or 64 cores). I answered some of the following questions:

  • What are the parameters (machine topology, memory affinity, multi-tier architecture, etc.) that have the most impact on the performance of the application?
  • For a given multi-tier application, which tier is the performance bottleneck (i.e., the one that slows down the entire application), and why?
  • Why do some task placement strategies perform better than others?

Thanks to this study, we understood the importance of the different factors and concluded that there is no "one-size-fits-all" solution: the best task placement strategy depends on too many independent factors.
Besides, we also observed real performance improvements, even without using all available resources (a 500% performance improvement while using only 40% of the available CPU resources).

These conclusions motivate the need for a system that automatically and dynamically selects the best task placement strategy for any multi-tier application deployed on a multicore machine.
The design of such a tool is described in the report, and its implementation is planned as future work.

My Master thesis is available here.


In this report, we are interested in the performance of multi-tier applications deployed on a single modern multicore machine. This context brings new challenges that need to be studied. The contribution made during this Master internship is twofold. First, we evaluate the impact of task placement strategies (i.e., how the application is executed on the machine) on three different use cases. The results show that the performance of the best strategies depends on several factors, such as the application architecture and the workload mix. We also manually analyze the performance problems in each use case.
Second, given the difficulty of understanding performance problems and the fact that the best strategies are never the same, we propose the design of a system that would dynamically mitigate the performance problems of multi-tier applications. We also make preliminary validations of some key aspects of the proposed design.

Friday, August 28, 2015

Docker experiment: packaging Phoenix-2.0

I have wanted to try Docker for a long time. In this article, I explain how I created a container for Phoenix-2.0.

Docker and Reproducible Research

I will not present Docker in detail; there are plenty of descriptions available.
To summarize, Docker allows one to containerize applications, making it possible to isolate them, control their allocated resources, and ship them easily.

One of my side interests is reproducible research, and Docker seems to be a way to implement it [1].
As a starting point, I wanted to package all the code and data-sets of a benchmark suite, to be able to at least run experiments inside an (almost) controlled software environment.

Phoenix-2.0
Phoenix-2.0 is an implementation of the MapReduce programming model especially designed to execute efficiently on multicore architectures.
I personally use Phoenix-2.0 and its provided examples in my day-to-day research to evaluate the performance of applications running on multicore machines.

Detailed information about Phoenix-2.0 is available on the official repository.

Constructing the Container

To construct a Docker container, you need to create a Dockerfile.
A Dockerfile describes, step by step, which packages to install, which files to download, and which commands to execute.
The Dockerfile for Phoenix-2.0 performs the following steps (a sketch is given after the list):

  1. Install git to get the official repository, wget to get the data-sets and the tools to compile C applications.
  2. Clone the official repository.
  3. Build Phoenix-2.0.
  4. Upload a file containing the list of data-set URLs to the container.
  5. Launch wget to download all the URLs from the uploaded file.
  6. Un-archive the data-sets.
  7. Delete the archive.
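Here is a rough sketch of such a Dockerfile; the upstream repository URL, the name of the URL-list file (datasets.txt), and the archive format are assumptions:

FROM debian:stable

# 1. git to clone the repository, wget for the data-sets,
#    build-essential to compile C applications
RUN apt-get update && apt-get install -y git wget build-essential

# 2-3. Clone the official repository and build Phoenix-2.0
RUN git clone https://github.com/kozyraki/phoenix.git /phoenix && \
    make -C /phoenix/phoenix-2.0

# 4. Upload the list of data-set URLs to the container
COPY datasets.txt /data/datasets.txt

# 5-7. Download the data-sets, un-archive them, delete the archives
RUN cd /data && wget -i datasets.txt && \
    for f in *.tar.gz; do tar xzf "$f" && rm "$f"; done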
The Dockerfile and the list of URLs are available on GitHub at

Building and Uploading the Container

Building the container is quite simple. You have to get the Dockerfile and the associated files, and run docker build (downloading the data-sets can take time).

git clone
cd docker-phoenix-2.0.git
docker build -t ghugo/phoenix-2.0 .

Once built, you can upload the image to the public DockerHub or to a private registry.
I used the information available in the official documentation to upload the image to the public DockerHub.
This way, everyone can download the container without needing to build the whole thing (see the next section).

Running the Container
You can either build the image yourself, or retrieve it from the public DockerHub:

docker pull ghugo/phoenix-2.0

For example, to run the word_count example, run:
docker run ghugo/phoenix-2.0 /phoenix/phoenix-2.0/tests/word_count/word_count /data/word_count_datafiles/word_100MB.txt

More information is available on the Docker repository.

Conclusion
Without any previous experience building Docker containers, it took me around 2 hours to create the image and upload it to the public DockerHub.
I was surprised by the simplicity and speed of the process. I plan to containerize more applications and benchmark suites in the future.

Yet, some questions remain regarding Docker and its usage for reproducible research:
  • What about the overhead? Docker adds a layer between the operating system and the application that is not present when the application runs without being containerized. A research report by IBM finds that the overhead is negligible, except for network-intensive applications. However, the authors indicate that it should be "considered on a case-by-case basis".
  • What about measuring performance? The current container is based on Debian stable, without any additional tools except the ones mentioned earlier (e.g., build-essential). It would be interesting to build a base image containing all the required tools (e.g., top, iotop, vmstat, sar) allowing a user to assess performance.
  • What about reproducible-research tools? Again, the container does not contain any executables helping to perform reproducible experiments (which tools? which scripts?).

[1] Carl Boettiger. 2015. An introduction to Docker for reproducible research. SIGOPS Oper. Syst. Rev. 49, 1 (January 2015), 71-79. DOI=10.1145/2723872.2723882