KIWI GPS is a navigation system built into many cars. In order to use a particular map you have to put the map DVD into the DVD drive in your car and the GPS system will load it up and use it. Besides map data, the DVD also includes sound data used for voice directions. Many car manufacturers also augment the data with their own extensions.

The contents of the DVD can be viewed and explored using a program named Kiwi Format Explorer, or KFE for short. Here’s what KFE looks like:

It just so happened that I wanted to replace the standard voice-overs with … Gruntz monologues! I’d love to hear those little buggers tell me to turn left or to take the second exit on a roundabout up ahead. So basically … it was solely for shits and giggles. Unfortunately, KFE only lets you preview the stored voice-overs – it won’t allow you to edit or replace them. At that moment I thought to myself – surely, in order to play back those voice-overs, KFE has to decode them first. And if I managed to reverse the decoding algorithm, perhaps I could design an encoder and encode my own samples as well.

The following specification of the sound data in KIWI GPS was reversed specifically for the system used in Volvo cars. It may or may not apply to other car brands. In the case of Volvo, all the sound data is stored on the DVD in a file named VOICE001.ME. To produce the specification, KFE was reversed to see exactly how it reads the archive and how it decodes the sound data before playing it back.


The VOICE001.ME file consists of five sections laid out one after another:

  1. Language descriptors
  2. Record descriptors
  3. Unknown section #1
  4. Unknown section #2
  5. Sound data section

Unknown sections #1 and #2 are referenced by the language descriptors in the file. However, I couldn’t force KFE to actually use them, so I had nothing to reverse here. For that reason their functionality remains unknown.

A few notes on data types:

  • all multi-byte integer values are big endian (i.e. the most significant byte comes first).
  • all fields representing data sizes and offsets are halved in the file itself, so when reading them you have to remember to multiply them by two. They’re marked with the prefixes “sizeHalf” and “offsetHalf” respectively.
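As a quick illustration of these conventions (the helper names here are mine, not taken from KFE), reading such fields from a raw byte buffer might look like this:

```c
#include <stdint.h>

/* Reads a big-endian 16-bit value from a byte buffer. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Reads a big-endian 32-bit value from a byte buffer. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* "sizeHalf"/"offsetHalf" fields store half the real value,
 * so they have to be doubled after reading. */
static uint32_t read_half16(const uint8_t *p)
{
    return (uint32_t)read_be16(p) * 2;
}

static uint64_t read_half32(const uint8_t *p)
{
    return (uint64_t)read_be32(p) * 2;
}
```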

1. Language descriptors

The first section of the VOICE001.ME file contains descriptors for languages – they can be thought of as “containers” for records.

// size: LanguageDescriptors.sizeHalfSection*2 or 0x02 + 0x40*languagesCount
struct LanguageDescriptors
{
	uint16_t sizeHalfSection; // (0x00) the (half) size of this section
	LanguageDescriptor languageDescriptors[languagesCount]; // (0x02) the number of languages is implied from the field sizeHalfSection (see below)
};

// size: 0x40 (64)
struct LanguageDescriptor
{
	uint32_t offsetHalfRecordDescriptors; // (0x00) the (half) offset to the list of record descriptions
	uint16_t sizeHalfRecordDescriptors; // (0x04) the (half) size of the list of record descriptions
	uint32_t offsetHalfUnknownSection1Data; // (0x06) the (half) offset to the part of the unknown section #1 corresponding to the language
	uint16_t sizeHalfUnknownSection1Data; // (0x0A) the (half) size of the language data in the unknown section #1
	uint32_t offsetHalfUnknownSection2Data; // (0x0C) the (half) offset to the part of the unknown section #2 corresponding to the language
	uint16_t sizeHalfUnknownSection2Data; // (0x10) the (half) size of the language data in the unknown section #2
	uint32_t offsetHalfSoundData; // (0x12) the (half) offset to the part of the sound data section corresponding to the language
	uint32_t sizeHalfSoundData; // (0x16) the (half) size of the sound data for the language
	uint8_t languageId; // (0x1A)
	uint8_t voiceId; // (0x1B)
	uint8_t padding[36]; // (0x1C) zero padding reserved for extensions
};

In order to get the number of languages based on the field LanguageDescriptors.sizeHalfSection, you have to multiply it by two (to get the real size), subtract two (bytes reserved for the field sizeHalfSection) and divide the result by 64 (the size of a single language descriptor):

languagesCount = (sizeHalfSection*2 - 2)/64
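For example, with a hypothetical sizeHalfSection of 0x81 the real section size is 0x102 bytes, which holds (0x102 - 2)/64 = 4 language descriptors:

```c
#include <stdint.h>

/* Number of language descriptors implied by the section's half-size:
 * double the half-size, subtract the 2 bytes of the size field itself,
 * divide by the 64-byte descriptor size. */
static uint32_t languagesCount(uint16_t sizeHalfSection)
{
    return ((uint32_t)sizeHalfSection * 2 - 2) / 64;
}
```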

The field LanguageDescriptor.voiceId seemed to have one of two values – 0x00 for male sounds and 0x80 for female ones. This may not be the rule however.

In KFE this section is called the Voice Distribution Header, the list of record descriptors for a particular language is called the General-purpose Voice Offset Table and the sound data for the language is called the General-purpose Real-Voice Data Table.

2. Record descriptors

A record represents a single voice-over played by the GPS.

// size: 0x02 + recordsCount*0x12
struct RecordDescriptors
{
  uint16_t recordsCount; // (0x00)
  RecordDescriptor recordDescriptors[recordsCount]; // (0x02)
};

// size: 0x12 (18)
struct RecordDescriptor
{
  uint16_t voiceLanguageId; // (0x00)
  uint32_t voiceDataId; // (0x02)
  uint32_t offsetHalfSoundData; // (0x06) the (half) offset to the record sound data, relative to the beginning of the language sound data in the sound data section
  uint32_t sizeHalfSoundData; // (0x0A) the (half) size of the record sound data
  uint8_t reproductionMethod; // (0x0E)
  uint8_t numberOfChannels; // (0x0F)
  uint16_t bitRate; // (0x10)
};

The field RecordDescriptor.offsetHalfSoundData represents the relative offset to the record sound data. The absolute offset to the sound data may be calculated by adding the relative offset to the base offset of the language sound data:

offsetSoundData = LanguageDescriptor.offsetHalfSoundData*2 + RecordDescriptor.offsetHalfSoundData*2
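In code this might look as follows (a sketch using minimal stand-in structs with only the relevant fields; the field names match the structs above):

```c
#include <stdint.h>

/* Minimal views of the two descriptors – only the fields needed here. */
struct LangView { uint32_t offsetHalfSoundData; };
struct RecView  { uint32_t offsetHalfSoundData; };

/* Absolute file offset of a record's sound data: the language's base
 * offset plus the record's relative offset, both doubled. */
static uint64_t recordSoundOffset(const struct LangView *lang,
                                  const struct RecView *rec)
{
    return (uint64_t)lang->offsetHalfSoundData * 2
         + (uint64_t)rec->offsetHalfSoundData * 2;
}
```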

3. Record sound data + ADPCM decoder

The record sound data is a variation of ADPCM encoding. It consists of 16-byte blocks, where the first byte is the block header and the remaining 15 bytes encode 60 bytes of PCM data. PCM samples are 16-bit, so a single 16-bit sample is encoded in just 4 bits. As a consequence, the decoded PCM data stream is always a multiple of 60 bytes long.

Before we delve into the algorithm itself, let’s introduce a few concepts (I came up with such names that represent their function in the algorithm as closely as possible):

  • each 4 bits representing a 16-bit PCM sample will be called a nibble. Each nibble is signed and may represent values from -8 to 7.
  • the decoding algorithm stores the last two decoded PCM samples in the form of a shift register. Each history sample is 32-bit. I’ll refer to them as “history A” and “history B” (the former being the more recently decoded sample). The history samples are initialized to 0 before decoding the record sound data.
  • each 16-byte block of ADPCM data defines two values: a magnitude and a coefficients index. The magnitude defines the size of the quantization step, while the coefficients index defines how the history samples A and B affect the decoded samples.

The first byte of the ADPCM block (the block header) stores the values for the magnitude and the index:

magnitude = 0x0C - (blockHeader >> 4);
index = blockHeader & 0x03;

In practice the magnitude is a value from the range [0; 12] inclusive and the index is 0, 1, 2 or 3.

The remaining 15 bytes are filled with nibbles – the first PCM sample is represented by the 4 most significant bits of the second byte of the block, the second PCM sample by the 4 least significant bits of the same byte, and so on. Here is a pseudo-C function which decodes a 4-bit nibble into a 16-bit PCM sample:

int32_t histA; // history A
int32_t histB; // history B

// arrays of coefficients
uint8_t coefA[4] = { 0x00, 0x3C, 0x73, 0x62 };
uint8_t coefB[4] = { 0x00, 0x00, 0x34, 0x37 };

// the nibble argument is assumed to be of the form: 0000xxxx
int16_t decodeNibble(int8_t nibble, uint8_t magnitude, uint8_t index)
{
  int32_t histBuffer;
  int32_t postNibble;

  postNibble = (nibble ^ 0x08) - 0x08; // a trick to sign-extend the 4-bit nibble
  postNibble = (int16_t)(postNibble << magnitude); // applying the magnitude to the nibble

  histBuffer = histA * coefA[index];
  if (histBuffer < 0)
    histBuffer += 63;
  histBuffer /= 64;

  postNibble += histBuffer;

  histBuffer = histB * coefB[index];
  if (histBuffer < 0)
    histBuffer += 63;
  histBuffer /= 64;

  postNibble -= histBuffer;

  // updating the history shift register
  histB = histA;
  histA = postNibble;

  // clamping the calculated PCM sample to 16 bits
  if (postNibble < -32768)
    postNibble = -32768;
  else if (postNibble > 32767)
    postNibble = 32767;

  return (int16_t)postNibble;
}

A few words on the decoding algorithm:

  • the resulting PCM sample is calculated using the nibble and history samples A and B.
  • the amount by which the nibble affects the resulting sample depends on the magnitude defined in the block – a nibble may represent values as small as -8, -7, …, 6, 7 and as large as -32768, -28672, …, 24576, 28672.
  • the index selects a pair of coefficients. The history A sample is multiplied by the first coefficient and added to the resulting PCM sample, while the history B sample is multiplied by the second coefficient and subtracted from it. This part of the algorithm acts as a prediction filter that compensates for quantization errors.
  • the calculated 32-bit PCM sample is input into the history shift register.
  • before outputting the sample it is clamped to fit 16 bits.
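Putting the pieces together, decoding one whole 16-byte block might look like the sketch below. The decodeNibble routine and coefficient tables are repeated from the listing above so the example compiles on its own; decodeBlock itself is my wrapper, not a function from KFE:

```c
#include <stdint.h>

static int32_t histA, histB; // decoder history, reset to 0 per record

static const uint8_t coefA[4] = { 0x00, 0x3C, 0x73, 0x62 };
static const uint8_t coefB[4] = { 0x00, 0x00, 0x34, 0x37 };

/* decodeNibble from the listing above, restated for a standalone build. */
static int16_t decodeNibble(int8_t nibble, uint8_t magnitude, uint8_t index)
{
    int32_t s = (nibble ^ 0x08) - 0x08; // sign-extend the 4-bit nibble
    s = (int16_t)(s << magnitude);      // apply the quantization step

    int32_t h = histA * coefA[index];
    if (h < 0) h += 63;
    s += h / 64;

    h = histB * coefB[index];
    if (h < 0) h += 63;
    s -= h / 64;

    histB = histA; // the history keeps the unclamped 32-bit value
    histA = s;

    if (s < -32768) s = -32768; // clamp the output to 16 bits
    else if (s > 32767) s = 32767;
    return (int16_t)s;
}

/* Decodes one 16-byte ADPCM block into 30 signed 16-bit PCM samples. */
static void decodeBlock(const uint8_t block[16], int16_t pcmOut[30])
{
    uint8_t magnitude = 0x0C - (block[0] >> 4); // from the block header
    uint8_t index = block[0] & 0x03;

    for (int i = 0; i < 15; i++)
    {
        uint8_t b = block[1 + i];
        // the high nibble comes first, then the low nibble
        pcmOut[2 * i]     = decodeNibble((b >> 4) & 0x0F, magnitude, index);
        pcmOut[2 * i + 1] = decodeNibble(b & 0x0F, magnitude, index);
    }
}
```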

4. ADPCM encoder

This is more of a supplement to the specification. In order to patch the VOICE001.ME file with custom sounds, one has to reverse the decoding process and provide an encoder. In general this is not a trivial task – encoding is in fact lossy compression, where the quality of the resulting sound depends heavily on the magnitudes and indexes chosen for the ADPCM blocks.

One could probably come up with heuristics for choosing the right values. I’m not that smart though, so I went full brute force instead. There are 13 different magnitude values and 4 possible index values, which gives 52 configurations for a single block – not bad at all. So the outer loop of the brute-force algorithm iterates through all the magnitude/index configurations. For each configuration I iterate through all PCM samples in the block and fit the best nibble for each (i.e. the one for which the decoding process gives back a value closest to the original sample). Once the nibbles are fitted for a given configuration, I sum up the total error for the block, reset the history to its previous values and proceed with the next configuration. The configuration with the smallest total error is the winner, and we move on to the next block in the stream.
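The search described above can be sketched roughly as follows. This is my own illustration, not the exact code from my encoder: decodeStep mirrors the decoder’s arithmetic for an already sign-extended nibble, and the error metric (sum of squared differences) is one reasonable choice among several:

```c
#include <stdint.h>

/* Coefficient tables from the decoder. */
static const uint8_t encCoefA[4] = { 0x00, 0x3C, 0x73, 0x62 };
static const uint8_t encCoefB[4] = { 0x00, 0x00, 0x34, 0x37 };

/* One step of the decoder's arithmetic for a sign-extended nibble
 * (-8..7). Returns the unclamped 32-bit sample. */
static int32_t decodeStep(int32_t nib, uint8_t mag, uint8_t idx,
                          int32_t hA, int32_t hB)
{
    int32_t s = (int16_t)(nib << mag);
    int32_t h = hA * encCoefA[idx];
    if (h < 0) h += 63;
    s += h / 64;
    h = hB * encCoefB[idx];
    if (h < 0) h += 63;
    s -= h / 64;
    return s;
}

/* Brute-force encodes 30 PCM samples into one 16-byte ADPCM block by
 * trying all 13*4 magnitude/index configurations and, for each sample,
 * every nibble value. histA/histB carry the decoder history across
 * blocks and end up reflecting the winning configuration. */
static void encodeBlock(const int16_t pcm[30], uint8_t block[16],
                        int32_t *histA, int32_t *histB)
{
    uint64_t bestErr = UINT64_MAX;
    int32_t bestHA = *histA, bestHB = *histB;

    for (uint8_t mag = 0; mag <= 12; mag++)
    for (uint8_t idx = 0; idx < 4; idx++)
    {
        int32_t hA = *histA, hB = *histB; // reset history per configuration
        uint64_t err = 0;
        uint8_t nibbles[30];

        for (int i = 0; i < 30; i++)
        {
            int64_t bestDiff = INT64_MAX;
            int32_t bestOut = 0;
            int bestNib = 0;
            for (int nib = -8; nib <= 7; nib++) // fit the best nibble
            {
                int32_t out = decodeStep(nib, mag, idx, hA, hB);
                int32_t cl = out < -32768 ? -32768 : out > 32767 ? 32767 : out;
                int64_t d = (int64_t)cl - pcm[i];
                if (d < 0) d = -d;
                if (d < bestDiff) { bestDiff = d; bestNib = nib; bestOut = out; }
            }
            nibbles[i] = (uint8_t)(bestNib & 0x0F);
            hB = hA;
            hA = bestOut; // the history keeps the unclamped value
            err += (uint64_t)bestDiff * (uint64_t)bestDiff;
        }

        if (err < bestErr)
        {
            bestErr = err;
            bestHA = hA;
            bestHB = hB;
            block[0] = (uint8_t)(((0x0C - mag) << 4) | idx); // rebuild header
            for (int i = 0; i < 15; i++)
                block[1 + i] = (uint8_t)((nibbles[2 * i] << 4) | nibbles[2 * i + 1]);
        }
    }

    *histA = bestHA; // commit the winning configuration's history
    *histB = bestHB;
}
```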

Obviously it’s a slow process, requiring backtracking and a lot of decoding attempts to find the best configuration. In my tests, without any optimizations, the brute-force encoder ran at almost 11 000 000 samples per minute. Assuming sounds are 4 seconds long at a sample rate of 22050 Hz, that’s about 125 records per minute. Not too shabby, knowing how slow brute-force algorithms can get.

You can inspect the encoder source code in my GitHub repository here: <TODO>

5. Final words

Reversing the decoder and designing the encoder took me well over a month. Designing the encoder was really fun. Unfortunately I’m not really into signal processing, so I may be missing the big picture here – unable to correctly interpret what the decoder is actually doing or to implement the encoder efficiently.

Now back to my original idea of replacing the sounds with Gruntz voice-overs … ironically, I had more problems cutting and arranging those sound samples in any sensible way than I had reversing the KIWI codec, so I eventually gave up. But you are free to do it yourself, or at least design the samples! In the repository hosting my KIWI codec source code I’ve left transcriptions of all the original VOICE001.ME voice-overs, which might help you with that; you can get them here: <TODO>. And if you decide to take on the challenge, let me know!

Cheers