The data: BSON documents

MongoDB stores the data of its user collections and of its system collections in one and only one binary format: BSON.

JSON equivalent, followed by its serialized form in hexadecimal byte values:

{
      0x3e = 62 (byte size of this document) as int32
      3e 00 00 00
  "_id" : 7,
      (datatype 0x01 = double), (cstring "_id\0"), (7 as a double floating-point value)
      01   5f 69 64 00   00 00 00 00 00 00 1c 40
  "instr" : "XYZ 3m",
      (datatype 0x02 = string), (cstring "instr\0"), ((not-cstring!) string length 0x07 as an int32), (byte array "XYZ 3m" (6 bytes) + an extra null byte)
      02   69 6e 73 74 72 00   07 00 00 00   58 59 5a 20 33 6d 00
  "hval" : 904.72,
      (datatype 0x01 = double), (cstring "hval\0"), (904.72 as a double)
      01   68 76 61 6c 00   f6 28 5c 8f c2 45 8c 40
  "ts" : ISODate("2019-07-21T01:12:15.348Z")
      (datatype 0x09 = UTC datetime), (cstring "ts\0"), (UTC milliseconds since the Unix epoch as an int64)
      09   74 73 00   f4 1e 16 12 6c 01 00 00
}
      0x00 terminator that ends the document's list of elements
      00

N.b. the top-level document has no type byte of its own. The trailing 0x00 only marks the end of its elements.
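
The byte layout of the example above can be checked by hand. The following is a minimal Python sketch, not a full BSON parser: it only handles the three types that appear in the example (double, string, UTC datetime), and the names DOC, read_cstring and decode_flat_document are made up for this illustration.

```python
import struct
from datetime import datetime, timedelta, timezone

# The 62 bytes from the example, one element per line.
DOC = bytes.fromhex(
    "3e000000"
    "01" "5f696400" "0000000000001c40"
    "02" "696e73747200" "07000000" "58595a20336d00"
    "01" "6876616c00" "f6285c8fc2458c40"
    "09" "747300" "f41e16126c010000"
    "00"
)

def read_cstring(buf, pos):
    """Return (text, position just after the terminating null byte)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("utf-8"), end + 1

def decode_flat_document(buf):
    """Decode a document that only uses double, string and datetime."""
    (size,) = struct.unpack_from("<i", buf, 0)
    assert size == len(buf) and buf[-1] == 0
    pos, out = 4, {}
    while buf[pos] != 0:                      # 0x00 ends the element list
        type_byte = buf[pos]
        key, pos = read_cstring(buf, pos + 1)
        if type_byte == 0x01:                 # 8-byte little-endian double
            (value,) = struct.unpack_from("<d", buf, pos)
            pos += 8
        elif type_byte == 0x02:               # int32 length, bytes, null
            (n,) = struct.unpack_from("<i", buf, pos)
            value = buf[pos + 4:pos + 4 + n - 1].decode("utf-8")
            pos += 4 + n
        elif type_byte == 0x09:               # int64 ms since the Unix epoch
            (ms,) = struct.unpack_from("<q", buf, pos)
            epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
            value = epoch + timedelta(milliseconds=ms)
            pos += 8
        else:
            raise ValueError(f"type 0x{type_byte:02x} not handled in this sketch")
        out[key] = value
    return out

doc = decode_flat_document(DOC)
```

Note how the decoder never needs to look ahead: the type byte tells it whether a size field follows the key, and from there the end of the value is known.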

Please see bsonspec.org if you want to know the full specification.

The main point for database users is:

  • BSON is a variable-length format that packs the key-value pairs one after the other in tuples of type + key name [+ byte length when not a fixed-length type] + value data.
  • Correct deserialization requires consuming the datatype value, the keyname, and a size field if the datatype uses one. Once you have that you can calculate the offset to the end of the value data.
  • The start of BSON data is always assumed to be an object - never an array or scalar by itself. MongoDB won't store anything except a document in the collection structure.
    • mongodump's output *.bson files likewise just start with a document. The next document begins in the byte immediately after the end of the previous one. E.g. imagine there is a mongodump-created file xyz.bson with tens of thousands of documents, each of size 25kB +/- 1kB. The first four bytes might be, say, the int32 value 26031. The 26031st byte will be a null byte. The 26032nd to 26035th bytes will be the int32 value of the next document's byte size. Let's say it is 25938. The (26031 + 25938 =) 51969th byte will be the final null byte of the second document. The next four bytes will be the size of the next document, etc. As long as the dump is valid you can skip from the head to the tail of each document without parsing its content, until eventually there are four bytes holding the int32 value n, where n is exactly the remaining length of the file starting from the beginning of this last document's int32 size field.
  • For brevity no nested objects or arrays were shown in the example above, but they are just another key-value tuple in this serialization/deserialization algorithm. Type value 0x03 or 0x04 indicates an (embedded) object or array respectively; after the key name, the next four bytes are an int32 holding the total size (this size field and the final null byte included).
  • Arrays are packed just like objects, including having keys, but the spec insists the keys must be "0", "1", "2", ... etc. To be honest the keys seem totally superfluous to me.
  • Key names consume space in every document, even when all the documents in a collection have exactly the same ones. Use short key names to save space.
  • Except for the required ["0", "1", "2", ...] in arrays the format places no expectations/assumptions about which fields are included, or in which order they are serialized.
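
The skip-through described for mongodump files can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the whole dump is already in memory as bytes, and the function name iter_document_spans and the tiny hand-built test documents are invented for the example.

```python
import struct

def iter_document_spans(buf):
    """Yield (start, end) offsets of each top-level document in buf.

    Only the int32 size prefix and the final null byte are inspected;
    the element data in between is never parsed.
    """
    pos = 0
    while pos < len(buf):
        if pos + 4 > len(buf):
            raise ValueError(f"truncated size field at offset {pos}")
        (size,) = struct.unpack_from("<i", buf, pos)
        end = pos + size
        # The smallest valid document is 5 bytes: the int32 size plus the null.
        if size < 5 or end > len(buf) or buf[end - 1] != 0:
            raise ValueError(f"corrupt document at offset {pos}")
        yield pos, end
        pos = end

# Two tiny documents to stand in for a dump file.
EMPTY = bytes.fromhex("0500000000")                              # {}
A_IS_1 = bytes.fromhex("0c000000" "10" "6100" "01000000" "00")   # {"a": 1} as int32

dump = A_IS_1 + EMPTY + A_IS_1
spans = list(iter_document_spans(dump))
```

For a real file such as xyz.bson you would pass in the bytes read from disk; for very large dumps, reading four bytes and seeking forward would avoid loading everything at once.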

In practice BSON is the encoding of MongoDB and isn't used in any other software as popular as MongoDB yet. Nonetheless the BSON specification is one thing and MongoDB is another. There are some extra requirements that MongoDB places on any BSON object it will store in a collection:

  • 16MB maximum size. The BSON specification places no upper limit on the size of data it encodes but the MongoDB database server and the drivers do.
  • "_id" field: Excluding the oplog, every document saved to a collection will have an "_id" field value. It is the primary key value. An _id value of the ObjectId() type will be given automatically by the driver (not the db server) if none is specified by the user code above the driver API before an insert.

BSON is also used by MongoDB to package commands and results being sent to and from the server using the wire protocol. The BSON command documents being sent by clients will have the command name as the first key (this is a fixed expectation of the mongod/mongos server nodes) and don't need an _id key as they don't represent a collection document. They might have a collection document embedded, e.g. { "insert": "mycollection", "documents": [ { "_id": 999, .... } ], ... }.
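
To make the "command name first" rule concrete, here is a rough Python sketch that encodes a small insert command as BSON. It is a toy encoder written for this article, not a driver API: it supports only strings, int32 values, embedded documents and arrays, it produces just the BSON body (no wire-protocol message header), and the command shown is a cut-down version with a single one-field document.

```python
import struct

def encode_document(doc):
    """Encode a dict whose values are str, int (as int32), dict or list."""
    body = b"".join(encode_element(k, v) for k, v in doc.items())
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"

def encode_element(key, value):
    name = key.encode("utf-8") + b"\x00"
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        return b"\x02" + name + struct.pack("<i", len(data)) + data
    if isinstance(value, int) and not isinstance(value, bool):
        return b"\x10" + name + struct.pack("<i", value)
    if isinstance(value, dict):
        return b"\x03" + name + encode_document(value)
    if isinstance(value, list):
        # Arrays are packed as documents keyed "0", "1", "2", ...
        as_doc = {str(i): item for i, item in enumerate(value)}
        return b"\x04" + name + encode_document(as_doc)
    raise TypeError(f"{type(value).__name__} not handled in this sketch")

# Python dicts keep insertion order, so "insert" is serialized first.
command = {"insert": "mycollection", "documents": [{"_id": 999}]}
payload = encode_document(command)
```

The first element after the four size bytes is the string element whose key is "insert", which is where the server looks for the command name. The embedded array also shows the "0" key that the array packing requires.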

Datatypes

JSON only supports the datatypes that a Javascript tokenizer will handle. (If you are a Javascript programmer you are no doubt thinking 'But what about other types such as Date?' Surprise: these are not covered by the JSON specification.)

  • String (UTF-8; the null terminator belongs to the BSON encoding, not to JSON)
  • Number (JSON does not specify the binary format)
  • Boolean
  • Null
  • Object
  • Array
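
The missing Date type is easy to see from any language's JSON library. As a small illustration using Python's standard json module (the variable names are just for this example), serializing a datetime fails outright, and the usual workaround is a string convention that comes back as plain text:

```python
import json
from datetime import datetime, timezone

ts = datetime(2019, 7, 21, 1, 12, 15, 348000, tzinfo=timezone.utc)

try:
    json.dumps({"ts": ts})
    date_supported = True
except TypeError:
    # The json module has no JSON type to map a datetime onto.
    date_supported = False

# Workaround: agree on a text format yourself, here ISO 8601.
as_text = json.dumps({"ts": ts.isoformat()})
round_trip = json.loads(as_text)["ts"]   # comes back as a str, not a datetime
```

Nothing in the JSON text marks the value as a date, so every reader has to know the convention in advance.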

BSON extends this list with the datatypes that almost any database would need:

  • Number types:
    • int32
    • int64
    • uint64
    • Double (8-byte IEEE 754-2008 format)
    • Decimal (16-byte IEEE 754-2008 format)
  • Datetime (without timezone, i.e. always assumed to be UTC)
  • Timestamp
  • Generic binary data
  • ObjectID (MongoDB uses this type. Some other databases would call it a GUID.)
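
The difference between the number types shows up directly in the bytes. As a rough sketch (the helper encode_number is invented for this illustration, and decimal128 is left out because Python's struct module has no format for it), the same value 7 can be encoded three ways:

```python
import struct

def encode_number(value, kind):
    """Return (type byte, payload bytes) for one of three BSON number types."""
    if kind == "int32":
        return 0x10, struct.pack("<i", value)
    if kind == "int64":
        return 0x12, struct.pack("<q", value)
    if kind == "double":
        return 0x01, struct.pack("<d", value)
    raise ValueError(f"unknown kind {kind!r}")

for kind in ("int32", "int64", "double"):
    type_byte, payload = encode_number(7, kind)
    print(f"{kind:6}  type 0x{type_byte:02x}  {payload.hex(' ')}")
```

The int32 form takes four bytes and the other two take eight. The double form is the byte pattern seen for "_id" : 7 in the example at the top of this section.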

BSON also includes these (in my opinion) exotic-for-a-database-system datatypes:

  • Min key
  • Max key
  • Javascript code (As a UTF8 string, i.e. not in a compiled, runtime-executable format.)
  • A few special 'binary types':
    • Function (as a compiled, runtime-executable format????)
    • UUID
    • MD5
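
These special binary types are all the one binary datatype (0x05) with a one-byte subtype in front of the data. A minimal Python sketch of a UUID element, assuming the subtype numbers listed at bsonspec.org (0x04 for UUID, 0x05 for MD5); the helper name and the sample UUID are invented for the illustration:

```python
import struct
import uuid

BINARY_TYPE = 0x05     # BSON datatype for binary data
SUBTYPE_UUID = 0x04    # subtype byte for a UUID
SUBTYPE_MD5 = 0x05     # subtype byte for an MD5 digest

def encode_binary_element(key, data, subtype):
    """Pack: type byte, cstring key, int32 data length, subtype byte, raw bytes."""
    return (
        bytes([BINARY_TYPE])
        + key.encode("utf-8") + b"\x00"
        + struct.pack("<i", len(data))
        + bytes([subtype])
        + data
    )

u = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
element = encode_binary_element("u", u.bytes, SUBTYPE_UUID)
```

The int32 length counts only the data bytes, not the subtype byte, so a UUID element always carries the length 16.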