Bencoding

Author: Arvid Norberg, arvid@rasterbar.com
Version: 1.0.0

Bencoding is a common representation in bittorrent used for dictionary, list, int and string hierarchies. It's used to encode .torrent files and some messages in the network protocol. libtorrent also uses it to store settings, resume data and other state between sessions.

Strings in bencoded structures do not necessarily represent text. Strings are raw byte buffers of a certain length. If a string is meant to be interpreted as text, it is required to be UTF-8 encoded. See BEP 3.
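As a rough illustration of the wire format described in BEP 3, the sketch below encodes strings and integers by hand. encode_string and encode_int are hypothetical helpers written for this example; they are not libtorrent functions, and real code would use bencode() instead.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration only; not libtorrent functions.

// A bencoded string is "<length>:<bytes>". The payload is copied verbatim,
// so it may hold any byte value, including '\0'.
std::string encode_string(std::string const& bytes)
{
    return std::to_string(bytes.size()) + ":" + bytes;
}

// A bencoded integer is "i<decimal>e".
std::string encode_int(std::int64_t value)
{
    return "i" + std::to_string(value) + "e";
}

// Usage: lists are "l...e" and dictionaries are "d...e" around their items.
void encode_examples()
{
    assert(encode_string("spam") == "4:spam");
    assert(encode_int(42) == "i42e");

    // Three raw bytes with an embedded NUL: the length prefix, not a
    // terminator, says where the string ends.
    std::string const raw("a\0b", 3);
    assert(encode_string(raw).size() == 5);

    std::string const dict = "d" + encode_string("age") + encode_int(30) + "e";
    assert(dict == "d3:agei30ee");
}
```

Because the length is stored up front, a decoder never has to scan for a terminator, which is why strings can carry arbitrary binary data such as piece hashes.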

There are two mechanisms to decode bencoded buffers in libtorrent.

The most flexible one is bdecode(), which returns a structure represented by entry. Once a buffer has been decoded with this function, the buffer can be discarded, because the entry does not contain any references back to it. This means that bdecode() actually copies all the data out of the buffer and into its own hierarchy, which makes this function potentially expensive if you're parsing large amounts of data.

Another consideration is that bdecode() is a recursive parser. For this reason, in order to avoid DoS attacks by triggering a stack overflow, there is a recursion limit. This limit is a sanity check to make sure it doesn't run the risk of busting the stack.
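To make the role of a recursion limit concrete, here is a minimal sketch of a recursive well-formedness check that refuses to descend past a fixed depth. The names skip_element, is_well_formed and max_depth, and the value 100, are assumptions made for this example; libtorrent's own parser and its actual limit are not shown here.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical limit and function names, chosen for this sketch only.
int const max_depth = 100;

// Consumes one bencoded element starting at p. Returns false on malformed
// input or when nesting exceeds max_depth. Each container recurses one level.
// Simplification: dictionary keys are not checked to be strings.
bool skip_element(char const*& p, char const* end, int depth = 0)
{
    if (depth > max_depth || p == end) return false;
    if (*p == 'i')
    {
        ++p;
        if (p != end && *p == '-') ++p;
        char const* digits = p;
        while (p != end && std::isdigit(static_cast<unsigned char>(*p))) ++p;
        if (p == digits || p == end || *p != 'e') return false;
        ++p;
        return true;
    }
    if (*p == 'l' || *p == 'd')
    {
        ++p;
        while (p != end && *p != 'e')
            if (!skip_element(p, end, depth + 1)) return false;
        if (p == end) return false;
        ++p; // consume the closing 'e'
        return true;
    }
    if (std::isdigit(static_cast<unsigned char>(*p)))
    {
        std::ptrdiff_t len = 0;
        while (p != end && std::isdigit(static_cast<unsigned char>(*p)))
        {
            len = len * 10 + (*p - '0');
            if (len > end - p) return false; // longer than the buffer
            ++p;
        }
        if (p == end || *p != ':') return false;
        ++p;
        if (len > end - p) return false;
        p += len;
        return true;
    }
    return false;
}

bool is_well_formed(std::string const& buf)
{
    char const* p = buf.data();
    char const* end = p + buf.size();
    return skip_element(p, end) && p == end;
}

void depth_examples()
{
    assert(is_well_formed("d3:fooli1ei2eee"));
    // 200 nested lists are syntactically valid but exceed the depth limit.
    assert(!is_well_formed(std::string(200, 'l') + std::string(200, 'e')));
}
```

Without the depth check, an input consisting of nothing but 'l' bytes would make the function recurse once per byte, so the attacker rather than the program would decide how much stack is used.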

The second mechanism is lazy_bdecode(), which returns a bencoded structure represented by lazy_entry. This function builds a tree that points back into the original buffer. The returned lazy_entry will not be valid once the buffer it was parsed out of is discarded.

Not only is this function more efficient because of less memory allocation and data copying, the parser is also not recursive, which means it probably performs a little bit better and can allow a higher nesting depth limit on the structures it's parsing.

invalid_encoding

Declared in "libtorrent/bencode.hpp"

thrown by bdecode() if the provided bencoded buffer does not contain valid encoding.

struct invalid_encoding: std::exception
{
   virtual const char* what () const throw();
};

type_error

Declared in "libtorrent/entry.hpp"

thrown by any accessor function of entry if the accessor function requires a type different than the actual type of the entry object.

struct type_error: std::runtime_error
{
   type_error (const char* error);
};

entry

Declared in "libtorrent/entry.hpp"

The entry class represents one node in a bencoded hierarchy. It works as a variant type: it can be either a list, a dictionary (std::map), an integer or a string.

class entry
{
   data_type type () const;
   entry (list_type const&);
   entry (integer_type const&);
   entry (dictionary_type const&);
   entry (string_type const&);
   entry (data_type t);
   void operator= (string_type const&);
   void operator= (entry const&);
   void operator= (integer_type const&);
   void operator= (lazy_entry const&);
   void operator= (dictionary_type const&);
   void operator= (list_type const&);
   const integer_type& integer () const;
   const string_type& string () const;
   const dictionary_type& dict () const;
   string_type& string ();
   list_type& list ();
   dictionary_type& dict ();
   integer_type& integer ();
   const list_type& list () const;
   void swap (entry& e);
   entry& operator[] (std::string const& key);
   const entry& operator[] (std::string const& key) const;
   entry& operator[] (char const* key);
   const entry& operator[] (char const* key) const;
   entry const* find_key (char const* key) const;
   entry* find_key (char const* key);
   entry* find_key (std::string const& key);
   entry const* find_key (std::string const& key) const;
   std::string to_string () const;

   enum data_type
   {
      int_t,
      string_t,
      list_t,
      dictionary_t,
      undefined_t,
   };

   mutable boost::uint8_t m_type_queried:1;
};

type()

data_type type () const;

returns the concrete type of the entry

entry()

entry (list_type const&);
entry (integer_type const&);
entry (dictionary_type const&);
entry (string_type const&);

constructors directly from a specific type. The content of the argument is copied into the newly constructed entry

entry()

entry (data_type t);

construct an empty entry of the specified type. see data_type enum.

operator=()

void operator= (string_type const&);
void operator= (entry const&);
void operator= (integer_type const&);
void operator= (lazy_entry const&);
void operator= (dictionary_type const&);
void operator= (list_type const&);

copies the structure of the right hand side into this entry.

string() dict() integer() list()

const integer_type& integer () const;
const string_type& string () const;
const dictionary_type& dict () const;
string_type& string ();
list_type& list ();
dictionary_type& dict ();
integer_type& integer ();
const list_type& list () const;

The integer(), string(), list() and dict() functions are accessors that return the respective type. If the entry object isn't of the type you request, the accessor will throw type_error (which derives from std::runtime_error). You can ask an entry for its type through the type() function.

If you want to create an entry you give it the type you want it to have in its constructor, and then use one of the non-const accessors to get a reference which you then can assign the value you want it to have.
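The pattern of choosing the type in the constructor, assigning through a non-const accessor, and getting an exception from the wrong accessor can be sketched with a tiny stand-in class. mini_entry below is hypothetical and much simpler than the real entry: it holds only an integer or a string and throws a plain std::runtime_error.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// A deliberately tiny stand-in for a variant node, NOT libtorrent's entry.
struct mini_entry
{
    enum data_type { int_t, string_t };

    explicit mini_entry(data_type t) : m_type(t), m_int(0) {}

    data_type type() const { return m_type; }

    // Each accessor checks the stored type before handing out a reference.
    long long& integer()
    {
        if (m_type != int_t) throw std::runtime_error("not an integer");
        return m_int;
    }
    std::string& string()
    {
        if (m_type != string_t) throw std::runtime_error("not a string");
        return m_str;
    }

private:
    data_type m_type;
    long long m_int;
    std::string m_str;
};

bool wrong_accessor_throws()
{
    mini_entry e(mini_entry::string_t);             // pick the type up front
    e.string() = "http://tracker.example/announce"; // assign via the accessor
    assert(e.type() == mini_entry::string_t);
    try { e.integer(); }                            // wrong type requested
    catch (std::runtime_error const&) { return true; }
    return false;
}
```

Checking type() before calling an accessor avoids the exception entirely, which is usually preferable when the data comes from an untrusted source.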

The typical code to get info from a torrent file will then look like this:

entry torrent_file;
// ...

// throws if this is not a dictionary
entry::dictionary_type const& dict = torrent_file.dict();
entry::dictionary_type::const_iterator i;
i = dict.find("announce");
if (i != dict.end())
{
        std::string tracker_url = i->second.string();
        std::cout << tracker_url << "\n";
}

The following code is equivalent, but a little bit shorter:

entry torrent_file;
// ...

// throws if this is not a dictionary
if (entry* i = torrent_file.find_key("announce"))
{
        std::string tracker_url = i->string();
        std::cout << tracker_url << "\n";
}

To make it easier to extract information from a torrent file, the class torrent_info exists.

swap()

void swap (entry& e);

swaps the content of this with e.

operator[]()

entry& operator[] (std::string const& key);
const entry& operator[] (std::string const& key) const;
entry& operator[] (char const* key);
const entry& operator[] (char const* key) const;

All of these functions require the entry to be a dictionary. If it isn't, they will throw libtorrent::type_error.

The non-const versions of the operator[] will return a reference to either the existing element at the given key or, if there is no element with the given key, a reference to a newly inserted element at that key.

The const version of operator[] will only return a reference to an existing element at the given key. If the key is not found, it will throw libtorrent::type_error.

find_key()

entry const* find_key (char const* key) const;
entry* find_key (char const* key);
entry* find_key (std::string const& key);
entry const* find_key (std::string const& key) const;

These functions require the entry to be a dictionary. If it isn't, they will throw libtorrent::type_error.

They will look for an element at the given key in the dictionary. If the element cannot be found, they will return 0. If an element with the given key is found, they return a pointer to it.
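The difference between a lookup that returns a pointer or 0 and a non-const operator[] that inserts can be seen on a plain std::map. The free function find_key and the typedef dict_t below are hypothetical and only mimic the behaviour described here; they do not use libtorrent's entry.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a dictionary; values are ints to keep it short.
typedef std::map<std::string, int> dict_t;

// Returns a pointer to the value stored at key, or 0 if there is none.
// The dictionary is never modified.
int const* find_key(dict_t const& d, std::string const& key)
{
    dict_t::const_iterator i = d.find(key);
    return i == d.end() ? 0 : &i->second;
}

void lookup_example()
{
    dict_t d;
    d["piece length"] = 16384;

    assert(*find_key(d, "piece length") == 16384);
    assert(find_key(d, "announce") == 0); // missing key: null, nothing inserted
    assert(d.size() == 1);

    d["announce"];                        // non-const operator[] inserts
    assert(d.size() == 2);
}
```

A pointer-returning lookup is the safer choice when probing for optional keys, since it neither throws nor grows the dictionary as a side effect.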

to_string()

std::string to_string () const;

returns a pretty-printed string representation of the bencoded structure, with JSON-style syntax

enum data_type

Declared in "libtorrent/entry.hpp"

name          value  description
int_t         0
string_t      1
list_t        2
dictionary_t  3
undefined_t   4

m_type_queried
in debug mode this is set to false by bdecode to indicate that the program has not yet queried the type of this entry, and should not assume that it has a certain type. This is asserted in the accessor functions. This does not apply if exceptions are used.

pascal_string

Declared in "libtorrent/lazy_entry.hpp"

this is a string that is not NULL-terminated. Instead it comes with a length, specified in bytes. This is particularly useful when parsing bencoded structures, because strings are not NULL-terminated internally, and requiring NULL termination would require copying the string.

see lazy_entry::string_pstr().

struct pascal_string
{
   pascal_string (char const* p, int l);
   bool operator< (pascal_string const& rhs) const;

   int len;
   char const* ptr;
};

pascal_string()

pascal_string (char const* p, int l);

construct a string pointing to the characters at p of length l characters. No NULL termination is required.

operator<()

bool operator< (pascal_string const& rhs) const;

lexicographical comparison of strings. Order is consistent with memcmp.
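One way to get an ordering consistent with memcmp for strings that carry a length instead of a terminator is sketched below. The struct pstr and the function less_than are hypothetical look-alikes; the actual implementation of pascal_string::operator< may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>

// Hypothetical look-alike of a pointer-plus-length string.
struct pstr
{
    char const* ptr;
    int len;
};

// Compares the common prefix with memcmp, then lets the shorter string
// sort first when one is a prefix of the other.
bool less_than(pstr const& a, pstr const& b)
{
    int const common = std::min(a.len, b.len);
    int const r = std::memcmp(a.ptr, b.ptr, common);
    if (r != 0) return r < 0;
    return a.len < b.len;
}

void compare_examples()
{
    pstr const abc = { "abc", 3 };
    pstr const abd = { "abd", 3 };
    pstr const ab  = { "ab", 2 };
    assert(less_than(abc, abd));
    assert(less_than(ab, abc));   // a prefix sorts before the longer string
    assert(!less_than(abc, abc));

    // Views into one bencoded buffer: no copies and no terminators needed.
    char const buf[] = "4:spam3:egg";
    pstr const spam = { buf + 2, 4 };
    pstr const egg  = { buf + 8, 3 };
    assert(less_than(egg, spam));
}
```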

len
the number of characters in the string.
ptr
the pointer to the first character in the string. The string is not NULL-terminated; consult the len field to know how many characters follow.

lazy_entry

Declared in "libtorrent/lazy_entry.hpp"

this object represents a node in a bencoded structure. It is a variant type whose concrete type is one of:

  1. dictionary (maps strings -> lazy_entry)
  2. list (sequence of lazy_entry, i.e. heterogeneous)
  3. integer
  4. string

There is also a none type, which is used for uninitialized lazy_entries.

struct lazy_entry
{
   entry_type_t type () const;
   void construct_int (char const* start, int length);
   boost::int64_t int_value () const;
   char const* string_ptr () const;
   char const* string_cstr () const;
   pascal_string string_pstr () const;
   std::string string_value () const;
   int string_length () const;
   lazy_entry const* dict_find_string (char const* name) const;
   lazy_entry* dict_find (char const* name);
   lazy_entry const* dict_find (char const* name) const;
   pascal_string dict_find_pstr (char const* name) const;
   std::string dict_find_string_value (char const* name) const;
   boost::int64_t dict_find_int_value (char const* name, boost::int64_t default_val = 0) const;
   lazy_entry const* dict_find_int (char const* name) const;
   lazy_entry const* dict_find_list (char const* name) const;
   lazy_entry const* dict_find_dict (char const* name) const;
   std::pair<std::string, lazy_entry const*> dict_at (int i) const;
   int dict_size () const;
   lazy_entry* list_at (int i);
   lazy_entry const* list_at (int i) const;
   std::string list_string_value_at (int i) const;
   pascal_string list_pstr_at (int i) const;
   boost::int64_t list_int_value_at (int i, boost::int64_t default_val = 0) const;
   int list_size () const;
   std::pair<char const*, int> data_section () const;
   void swap (lazy_entry& e);

   enum entry_type_t
   {
      none_t,
      dict_t,
      list_t,
      string_t,
      int_t,
   };
};

type()

entry_type_t type () const;

tells you which specific type this lazy entry has. See entry_type_t. The type determines which subset of member functions are valid to use.

construct_int()

void construct_int (char const* start, int length);

start points to the first decimal digit. length is the number of digits.

int_value()

boost::int64_t int_value () const;

requires the type to be an integer. return the integer value

string_ptr()

char const* string_ptr () const;

the string is not null-terminated! use string_length() to determine how many bytes are part of the string.

string_cstr()

char const* string_cstr () const;

this will return a null-terminated string. To do so, it will write to the source buffer!
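Why a C string cannot be produced from a bencoded buffer without either copying or modifying the buffer is easiest to see on a small example. terminate_in_place below is a hypothetical helper, and whether string_cstr() works exactly this way is an assumption; the sketch only shows the general consequence of writing a terminator into the source buffer.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: turns the string at [offset, offset + length) into a
// C string by writing a terminator right after it, inside the same buffer.
char const* terminate_in_place(char* buffer, int offset, int length)
{
    buffer[offset + length] = '\0'; // overwrites whatever followed the string
    return buffer + offset;
}

void terminate_example()
{
    char buf[] = "4:spami42e";      // string "spam" followed by integer 42
    char const* s = terminate_in_place(buf, 2, 4);
    assert(std::strcmp(s, "spam") == 0);
    assert(buf[6] == '\0');         // the 'i' that started "i42e" is gone
}
```

If the raw bencoded bytes are needed afterwards, for instance to hash or store them, string_ptr() together with string_length(), or string_value(), leaves the buffer untouched.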

string_pstr()

pascal_string string_pstr () const;

if this is a string, returns a pascal_string representing the string value.

string_value()

std::string string_value () const;

if this is a string, returns the string as a std::string. (which requires a copy)

string_length()

int string_length () const;

if the lazy_entry is a string, returns the length of the string, in bytes.

dict_find() dict_find_string()

lazy_entry const* dict_find_string (char const* name) const;
lazy_entry* dict_find (char const* name);
lazy_entry const* dict_find (char const* name) const;

if this is a dictionary, look for a key name, and return a pointer to its value, or NULL if there is none.

dict_find_pstr() dict_find_string_value()

pascal_string dict_find_pstr (char const* name) const;
std::string dict_find_string_value (char const* name) const;

if this is a dictionary, look for a key name whose value is a string. If such a key exists, its value is returned as a pascal_string or a std::string, respectively. Otherwise an empty string is returned.

dict_find_int_value() dict_find_int()

boost::int64_t dict_find_int_value (char const* name, boost::int64_t default_val = 0) const;
lazy_entry const* dict_find_int (char const* name) const;

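if this is a dictionary, look for a key name whose value is an int. dict_find_int returns a pointer to the value if such a key exists, otherwise NULL. dict_find_int_value returns the integer value itself, or default_val if there is no such key.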

dict_find_list() dict_find_dict()

lazy_entry const* dict_find_list (char const* name) const;
lazy_entry const* dict_find_dict (char const* name) const;

these functions require that this is a dictionary. (this->type() == dict_t). They look for an element with the specified name in the dictionary. dict_find_dict only finds dictionaries and dict_find_list only finds lists. if no key with the corresponding value of the right type is found, NULL is returned.

dict_at()

std::pair<std::string, lazy_entry const*> dict_at (int i) const;

if this is a dictionary, return the key value pair at position i from the dictionary.

dict_size()

int dict_size () const;

requires that this is a dictionary. return the number of items in it

list_at()

lazy_entry* list_at (int i);
lazy_entry const* list_at (int i) const;

requires that this is a list. return the item at index i.

list_string_value_at() list_pstr_at()

std::string list_string_value_at (int i) const;
pascal_string list_pstr_at (int i) const;

these functions require this to have the type list. (this->type() == list_t). list_string_value_at returns the string at index i. list_pstr_at returns a pascal_string of the string value at index i. if the element at i is not a string, an empty string is returned.

list_int_value_at()

boost::int64_t list_int_value_at (int i, boost::int64_t default_val = 0) const;

this function requires this to have the type list. (this->type() == list_t). returns the integer value at index i. If the element at i is not an integer, default_val is returned, which defaults to 0.

list_size()

int list_size () const;

if this is a list, return the number of items in it.

data_section()

std::pair<char const*, int> data_section () const;

returns a pointer into the source buffer where this entry has its bencoded data, together with the length of that data in bytes

swap()

void swap (lazy_entry& e);

swap values of this and e.

enum entry_type_t

Declared in "libtorrent/lazy_entry.hpp"

name      value  description
none_t    0
dict_t    1
list_t    2
string_t  3
int_t     4

bdecode() bencode()

Declared in "libtorrent/bencode.hpp"

template<class InIt> entry bdecode (InIt start, InIt end);
template<class OutIt> int bencode (OutIt out, const entry& e);
template<class InIt> entry bdecode (InIt start, InIt end, int& len);

These functions will encode data to bencoded or decode bencoded data.

If possible, lazy_bdecode() should be preferred over bdecode().

The entry class is the internal representation of the bencoded data and it can be used to retrieve information. An entry can also be built by the program and given to bencode() to encode it into the OutIt iterator.

The OutIt and InIt are iterators (OutputIterator and InputIterator respectively). They are templates and are usually instantiated as ostream_iterator, back_insert_iterator or istream_iterator. These functions will assume that the iterator refers to a character (char). So, if you want to encode entry e into a buffer in memory, you can do it like this:

std::vector<char> buffer;
bencode(std::back_inserter(buffer), e);
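The reason an output-iterator template works with vectors, strings and streams alike can be sketched with a much smaller function. write_string below is hypothetical and only emits a single bencoded string; returning the number of bytes written is a choice made for this example, not a statement about what bencode() returns.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical function, not part of libtorrent: writes one bencoded string
// through any output iterator over char and returns the byte count.
template <class OutIt>
int write_string(OutIt out, std::string const& s)
{
    std::string const prefix = std::to_string(s.size()) + ":";
    int written = 0;
    for (char c : prefix) { *out++ = c; ++written; }
    for (char c : s) { *out++ = c; ++written; }
    return written;
}

void write_example()
{
    std::vector<char> buffer;
    // back_inserter grows the vector as bytes are written, so the caller
    // does not need to know the encoded size in advance.
    int const n = write_string(std::back_inserter(buffer), "announce");
    assert(n == 10);
    assert(std::string(buffer.begin(), buffer.end()) == "8:announce");
}
```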

If you want to decode a torrent file from a buffer in memory, you can do it like this:

std::vector<char> buffer;
// ...
entry e = bdecode(buffer.begin(), buffer.end());

Or, if you have a raw char buffer:

const char* buf;
// ...
entry e = bdecode(buf, buf + data_size);

Now we just need to know how to retrieve information from the entry.

If bdecode() encounters invalid encoded data in the range given to it, it will throw invalid_encoding.

operator<<()

Declared in "libtorrent/entry.hpp"

inline std::ostream& operator<< (std::ostream& os, const entry& e);

prints the bencoded structure to the ostream as a JSON-style structure.

lazy_bdecode()

Declared in "libtorrent/lazy_entry.hpp"

int lazy_bdecode (char const* start, char const* end
   , lazy_entry& ret, error_code& ec, int* error_pos = 0
   , int depth_limit = 1000, int item_limit = 1000000);

This function decodes bencoded data.

Whenever possible, lazy_bdecode() should be preferred over bdecode(). It is more efficient and more secure. It supports having constraints on the amount of memory that is consumed by the parser.

lazy refers to the fact that it doesn't copy any actual data out of the bencoded buffer. It builds a tree of lazy_entry which has pointers into the bencoded buffer. This makes it very fast and efficient. On top of that, it is not recursive, which saves a lot of stack space when parsing deeply nested trees.

However, in order to protect against potential attacks, depth_limit controls how many levels deep the tree is allowed to get and item_limit controls how many items it is allowed to contain. With a recursive parser, a few thousand levels would be enough to exhaust the thread's stack and terminate the process.

The item_limit protects against very large structures that are not necessarily deep. Each bencoded item in the structure causes the parser to allocate some amount of memory; this amount is constant regardless of how much data is actually stored in the item. One potential attack is to create a bencoded list of hundreds of thousands of empty strings, which would cause the parser to allocate a significant amount of memory, perhaps more than is available on the machine, and effectively cause a denial of service. The default item limit is set as a reasonable upper limit for desktop computers. Very few torrents have more items in them. The limit corresponds to about 25 MB, which might be a bit much for embedded systems.

start and end define the bencoded buffer to be decoded. ret is the lazy_entry which is filled in with the whole decoded tree. ec is a reference to an error_code which is set to describe the error encountered in case the function fails. error_pos is an optional pointer to an int, which will be set to the byte offset into the buffer where an error occurred, in case the function fails.
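How a non-recursive scan can enforce both a depth limit and an item limit, and report the offset of a failure, is sketched below. The function scan and the scan_error enum are hypothetical and far simpler than lazy_bdecode(): no tree is built, and integer digits and dictionary key types are not validated.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical names; this is not libtorrent's parser or its error codes.
enum scan_error { no_error, malformed, depth_exceeded, item_limit_exceeded };

// Scans one bencoded element without recursion. depth counts the lists and
// dictionaries currently open, items counts every element seen so far.
// error_pos is only meaningful when the result is not no_error.
scan_error scan(std::string const& buf, int depth_limit, int item_limit
    , int& error_pos)
{
    int depth = 0;
    int items = 0;
    std::size_t i = 0;
    do
    {
        error_pos = static_cast<int>(i);
        if (i >= buf.size()) return malformed;   // input ended too early
        char const c = buf[i];
        if (c == 'e')                            // closes a list or dictionary
        {
            if (depth == 0) return malformed;
            --depth;
            ++i;
            continue;
        }
        if (++items > item_limit) return item_limit_exceeded;
        if (c == 'l' || c == 'd')
        {
            if (++depth > depth_limit) return depth_exceeded;
            ++i;
        }
        else if (c == 'i')
        {
            std::size_t const e = buf.find('e', i);
            if (e == std::string::npos) return malformed;
            i = e + 1;
        }
        else if (std::isdigit(static_cast<unsigned char>(c)))
        {
            std::size_t len = 0;
            for (; i < buf.size()
                && std::isdigit(static_cast<unsigned char>(buf[i])); ++i)
            {
                len = len * 10 + (buf[i] - '0');
                if (len > buf.size()) return malformed;
            }
            if (i >= buf.size() || buf[i] != ':') return malformed;
            ++i;                                 // skip ':'
            if (len > buf.size() - i) return malformed;
            i += len;
        }
        else return malformed;
    } while (depth > 0);
    return no_error;
}

void scan_examples()
{
    int pos = 0;
    assert(scan("d3:fooli1ei2eee", 10, 100, pos) == no_error);
    // A flat list of many empty strings trips the item limit, not the depth.
    assert(scan("l0:0:0:0:0:e", 10, 3, pos) == item_limit_exceeded);
    assert(pos == 5);
}
```

Since nesting is tracked in a counter rather than on the call stack, the depth limit here bounds the shape of the accepted data instead of guarding against a crash, while the item limit bounds the number of nodes a tree-building parser would have to allocate.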