Overview of Netzob

Netzob has been initiated by security auditors of AMOSSYS and the CIDre research team of Supélec to address the reverse engineering of communication protocols.

Originaly, the development of Netzob has been initiated to support security auditors and evaluators in their activities of modeling and simulating undocumented protocols. The tool has then been extended to allow smart fuzzing of unknown protocol.

The following picture depicts the main modules of Netzob:

Architecture of Netzob

Architecture of Netzob

  • Import module: Data import is available in two ways: either by leveraging the channel-specific captors (currently network and IPC – Inter-Process Communication), or by using specific importers (such as PCAP files, structured files and OSpy files).
  • Protocol inference modules: The vocabulary and grammar inference methods constitute the core of Netzob. It allows both passive and active reverse engineering of communication flows through automated and manuals mechanisms.
  • Simulation module: Given vocabulary and grammar models previously inferred, Netzob can understand and generate communication traffic between multiple actors. It can act as either a client, a server or both.
  • Export module: This module permits to export an inferred model of a protocol in formats that are understandable by third party software or by a human. Current work focuses on export format compatible with main traffic dissectors (Wireshark and Scapy) and fuzzers (Peach and Sulley).

And here is a screenshot of the main graphical interface:

The following sections will describe in more details the available mechanisms.

Import and capture data

The first step in the inferring process of a protocol in Netzob is to capture and to import messages as samples. There are different methods to retrieve messages depending of the communication channel used (files, network, IPC, USB, etc.) and the format (PCAP, hex, raw binary flows, etc.).

The figure below describes the multiple communication channels and therefore possible sniffing point’s Netzob aims at addressing.

Multiple communication flows arround an application

Multiple communication flows arround an application

The current version (version 0.4) of Netzob deals with the following data sources :

  • Live network communications
  • Captured network communications (PCAPs)
  • Inter-Process Communications (IPCs)
  • Text and binary files
  • API flows through oSpy file format support

Otherwise, if you plan to reverse a protocol implemented over an supported communication channel, Netzob’s can manipulates any communications flow through an XML representation. Therefore, this situation only requires a specific development to capture the targeted flow and to save it using a compatible XML.

Importing data from an unknown communication channel using the XML definition

Importing data from an unknown communication channel using the XML definition

Inferring message format and state machine with Netzob

The vocabulary of a communication protocol defines all the words which are integrated in it. For example, the vocabulary of a malware’s communication protocol looks like a set of possible commands : {“attack www.google.fr”, “dnspoison this.dns.server.com”, “execute ‘uname -a’”, ...}. Another example of a vocabulary is the set of valids words in the HTTP protocol : { “GET /images/logo.png HTTP/1.1 ...”, “HTTP/1.1 200 OK ...”, ...}.

Netzob’s vocabulary inferring process has been designed in order to retrieve the set of all possible words used in a targeted protocol and to identify their structures. Indeed words are made of different fields which are defined by their value and types. Hence a word can be described using the structure of its fields.

We describe the learning process implemented in Netzob to semi-automatically infer the vocabulary and the grammar of a protocol. This process, illustrated in the following picture, is performed in three main steps:

  1. Clustering messages and partitioning these messages in fields.
  2. Characterizing message fields and abstracting similar messages in symbols.
  3. Inferring the transition graph of the protocol.
The main functionalities

The main functionalities

Step 1: clustering Messages and Partitioning in Fields

To discover the format of a symbol, Netzob supports different partitioning approaches. In this article we describe the most accurate one, that leverages sequence alignment processes. This technique permits to align invariants in a set of messages. The Needleman-Wunsh algorithm performs this task optimally. Needleman-Wunsh is particularly effective on protocols where dynamic fields have variable lengths (as shown on the following picture).

Sequence alignment with Needleman-Wunsh algorithm

Sequence alignment with Needleman-Wunsh algorithm

When partitioning and clustering processes are done, we obtain a relevant first approximation of the overall message formats. The next step consists in determining the characteristics of the fields.

If the size of those fields is fixed, as in TCP and IP headers, it is preferable to apply a basic partitioning, also provided by Netzob. Such partitioning works by aligning each message by the left, and then separating successive fixed columns from successive dynamic columns.

To regroup aligned messages by similarity, the Needleman-Wunsh algorithm is used in conjunction with a clustering algorithm. The applied algorithm is UPGMA.

Step 2 : characterization of Fields

The field type identification partially derives from the partitioning inference step. For fields containing only invariants, the type merely corresponds to the invariant value. For other fields, the type is automatically materialized, in first approximation, with a regular expression, as shown on next figure. This form allows to easily validate the data conformity with a specific type. Moreover, Netzob offers the possibility to visualize the definition domain of a field. This helps to manually refine the type associated with a field.

Characterization of field type

Characterization of field type

Some intra-symbol dependencies are automatically identified. The size field, present in many protocol formats, is an example of intra-symbol dependency. A search algorithm has been designed to look for potential size fields and their associated payloads. By extension, this technique permits to discover encapsulated protocol payloads.

Environmental dependencies are also identified by looking for specific values retrieved during message capture. Such specific values consist of characteristics of the underlying hardware, operating system and network configuration. During the dependency analysis, these characteristics are searched in various encoding.

Step 3: inferring the Transition Graph of the Protocol

The third step of the learning process discovers and extracts the transition graph from a targeted protocol (also called the grammar). More formally, the grammar of a communication protocol defines the set of valid sentences which can be produced by a communication. A sentence is a sorted set of words which may be received or emmited by a protocol handler. An exemple of a simple sentence is :

["attack www.google.fr", "attack has failed", "attack www.kernel.org", "root access granted."]

which can be described using the following simple automata with S0 the initial state :

Schema of a simple grammar

Schema of a simple grammar

The learning process step is achieved by a set of active experiments that stimulate a real client or server implementation using successive sequences of input symbols and analyze its responses.

In Netzob, the automata used to represent or model a communication protocol is an extended version of a Mealy automata which includes semi-stochastic transitions, contextualized and parametrized inputs and outputs. The first academic presention of this model is included in a dedicated scientific paper provided in the documentation section.

The model is inferred through a dedicated active process which consists in stimulating an implementation and to analyze its responses. In this process, we use the previously infered vocabulary to discover and to learn the grammar of the communication protocol. Each stimulation is computed following an extension of the Angluin L algorithm*.

Protocol simulation

One of our main goal is to generate realistic network traffic from undocummented protocols. Therefore, we have implemented a dedicated module that, given vocabulary and grammar models previously infered, can simulate a communication protocol between multiple bots and masters. Besides their use of the same model, each actors is independent from the others and is organized around three main stages.

The first stage is a dedicated library that reads and writes from the network channel. It also parses the flow in messages according to previous protocols layers. The second stage uses the vocabulary to abstract received messages into symbols and vice-versa to specialize emitted symbols into messages. A memory buffer is also available to manage dependency relations. The last stage implements the grammar model and computes which symbols must be emitted or received according to the current state and time.

Smart fuzzing with Netzob

A typical example of dynamic vulnerability analysis is the robustness tests. It can be used to reveal software programming errors which can leads to software security vulnerabilities. These tests provide an efficient and almost automated solution to easily identify and study exposed surfaces of systems. Nevertheless, to be fully efficient, the fuzzing approaches must cover the complete definition domain and combination of all the variables which exist in a protocol (IP adresses, serial numbers, size fields, payloads, message identifer, etc.). But fuzzing typical communication interface requires too many test cases due to the complex variation domains introduced by the semantic layer of a protocol. In addition to this, an efficient fuzzing should also cover the state machine of a protocol which also brings another huge set of variations. The necessary time is nearly always too high and therefore limits the efficiency of this approach.

With all these contraints, achieving robustness tests on a target is feasible only if the expert has access to a specially designed tool for the targeted protocol. Hence the emergence of a large number of tools to verify the behavior of an application on one or more communication protocols. However in the context of proprietary communications protocols for which no specifications are published, fuzzers do not provide optimal results.

Netzob helps the security evaluator by simplifying the creation of a dedicated fuzzer for a proprietary or undocumented protocol. It allows the expert to execute a semi-automated inferring process to create a model of the targeted protocol. This model can afterward be refined by the evaluator. Finally, the created model is included in the fuzzing module of Netzob which considers the vocabulary and the grammar of the protocol to generate optimized and specific test cases. Both mutation and generation are available for fuzzing.

Export protocol model

The following export formats are currently provided by Netzob:

  • XML format
  • human readable (Wireshark like)
  • Peach fuzzer export: this allows to combine efficiency of Peach Fuzzer on previously undocummented protocols.

Besides, you can write your own exporter to manipulate the inferred protocol model in your favorite tool.