CPP Memory Structure of DATA with Pointer Casting

Introduction:

C++ is often described as a mid-level programming language: it offers low-level facilities (access to the OS, system, and hardware) as well as high-level abstractions (language features that hide many low-level operations behind a single statement), and it does not provide garbage collection (automatic reclaiming of memory that is no longer in use).

One of the low-level facilities in C++ is the pointer. A pointer is simply a variable that holds a memory address, whether that address belongs to another variable, an array, a function, an OOP object, or manually allocated memory, just to name a few.
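As a small illustration of a pointer holding the address of another variable, here is a minimal sketch. The function name "StoreThroughPointer" is made up for this example and is not part of any library.

```cpp
// HYPOTHETICAL HELPER: WRITES NewValue INTO THE int THAT
// THE POINTER REFERS TO.
void StoreThroughPointer(int *Target, int NewValue) {

  // NEVER DEREFERENCE A NULL POINTER
  if (Target != nullptr) {

    // THE * OPERATOR ACCESSES THE MEMORY AT THE STORED ADDRESS
    *Target = NewValue;

  }
}
```

Calling it with the address of a variable, for example StoreThroughPointer(&Number, 42), changes "Number" itself, because the pointer and the variable refer to the same memory.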

In machine language, the language that a processor understands and can interpret, there are no concepts of variables, functions, arrays, OOP objects, and so forth. To access data in machine language, we need to work directly with memory addresses. Here's an Assembly Language example:

; MOVE A VALUE INTO THE RAX REGISTER

  mov rax, 0x4A4F5921

; MOVE VALUE IN RAX (64-bit register) TO MEMORY
; ADDRESS 0xFF01DFABFADECA69.

  mov [0xff01dfabfadeca69], rax

From low-level assembly to high-level programming languages, memory addresses of data are represented through an abstraction layer that simplifies and hides the underlying complexities by replacing memory addresses with names. The compiler (together with the assembler and linker) converts those names to memory locations when the program is built, so developers can focus more on the concepts of their solution instead of the underlying complexities.

Here's how to use variables in Assembly Language (NASM example):

section .data

  ; CREATE A 4-BYTE INTEGER VARIABLE AND DEFINE A VALUE

    Msg1: dd 0x4A4F5921

  ; CREATE A 4-BYTE VARIABLE TO STORE DATA IN

    Msg2: dd 0x0

section .text

  ; LOAD 4 BYTES FROM MSG1 AND STORE THEM IN MSG2
  ; (EAX IS THE 32-BIT REGISTER, MATCHING THE 4-BYTE DD)

    mov eax, [Msg1]
    mov [Msg2], eax

C++ provides many facilities to developers to help them produce powerful solutions without having to access memory directly. However, there are times when such access is necessary, and C++ provides developers with the tools to accomplish such tasks.

The Joyous Challenge:

Understanding the underlying complexities gives us in-depth knowledge that allows us to make more informed decisions in our code.

One such complexity is how data is stored in memory, especially on a variety of different processors which could store data in the Little Endian or Big Endian formats.

The Joyous challenge is to print the byte and bit structures of a given data type.

The Joyous Solution:

C++ builds many safety measures into most of its Standard Library. However, when working with the lower-level facilities in C++, we work without those safety nets and must be diligent about providing our own. We do not want to crash the application, cause memory leaks, or allow access to memory outside the scope of our algorithm or data, among other security issues.

For our solution, we will be using pointers, type conversions, and templates.

Coding the Solution:

The smallest unit of memory that can be addressed on most modern processors is called a "byte", which generally consists of 8 bits. However, the C++ standard only guarantees that a byte has at least 8 bits (the exact count is given by CHAR_BIT in <climits>), so it can be bigger on specialized processors. This blog assumes you are coding on a modern processor, like an Intel, AMD, or ARM, which all have 8-bit bytes.

An 8-bit byte holds an integer value: -128 to 127 in decimal (the base 10 number system) when treated as signed, or 0 to 255 when treated as unsigned. C++ has multiple byte-sized "data types", such as: char, signed char, unsigned char, std::int8_t, std::uint8_t, and std::byte. Please note: std::int8_t and std::uint8_t (from <cstdint>) are optional in the standard, and std::byte (from <cstddef>) was only added in C++17, so not all compilers or compiler versions may provide them. Please check your compiler documentation for further information.
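To check these ranges on your own compiler, the standard headers <climits> and <limits> can report them. The following is a minimal sketch; the three function names are invented for this example.

```cpp
#include <climits>
#include <limits>

// NUMBER OF BITS IN ONE BYTE ON THIS PLATFORM (AT LEAST 8)
int BitsPerByte() {
  return CHAR_BIT;
}

// SMALLEST VALUE A SIGNED BYTE CAN HOLD
int SmallestSignedByte() {
  return std::numeric_limits<signed char>::min();
}

// LARGEST VALUE AN UNSIGNED BYTE CAN HOLD
int LargestUnsignedByte() {
  return std::numeric_limits<unsigned char>::max();
}
```

On a platform with 8-bit bytes, the last two functions should return -128 and 255, matching the ranges described above.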

To visually see how data is accurately stored in memory, we need to read it byte by byte. We can do this with pointers in C++.

The first hurdle to overcome is the pointer's data type, which must normally be the same as the data type whose memory we want to access. For example, if the data type is an int (an integer), then the pointer must also have the data type of int, as seen in the following example:

int main() {

    int Value { 127 };
    // char *PValue { &Value }; // ERROR - NOT AN INT TYPE
    int *PValue { &Value }; // CORRECT - BOTH ARE INT TYPES
    
}

In the example above, when we read the memory pointed to by the "PValue" pointer, we receive all the bytes defined by the size of an "int" type, which is 4 bytes on most modern 32-bit and 64-bit platforms.

To create a pointer whose data type differs from that of the data (in our case, a byte-sized pointer that accesses the 4-byte int variable), we need to cast the pointer to another type, which tells the compiler to reinterpret the same bit pattern as a different data type. C++ offers several ways to cast a pointer, but we will only look at two of them in this blog post: reinterpret_cast and static_cast (for complete details on these two type conversion operators, please see the links provided at the end of this post).
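To make the two operators concrete, here is a small sketch showing both routes to a byte pointer. The function names are hypothetical. Note that static_cast cannot convert an int pointer to an unsigned char pointer directly; it has to go through a void pointer first.

```cpp
// ROUTE 1: reinterpret_cast CONVERTS THE POINTER TYPE DIRECTLY
const unsigned char *FirstByteReinterpret(const int &Value) {
  return reinterpret_cast<const unsigned char *>(&Value);
}

// ROUTE 2: static_cast NEEDS AN INTERMEDIATE void POINTER
const unsigned char *FirstByteStatic(const int &Value) {

  // ANY OBJECT POINTER CONVERTS IMPLICITLY TO A void POINTER
  const void *Untyped { &Value };

  return static_cast<const unsigned char *>(Untyped);
}
```

Both functions return the same address, the first byte of "Value". The rest of this post uses reinterpret_cast because it states the intent in a single step.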

As mentioned earlier, processors either store data in the Little Endian or Big Endian format. These formats determine how the bytes that make up the data are stored in memory. For example, we have an unsigned 32-bit integer that contains the following data:

0x0A0B0C0D

In the Little Endian format, the data would be stored as:

 03   02   01   00  <-- MEMORY OFFSET
[0A] [0B] [0C] [0D] <-- BYTE DATA

In the Big Endian format, the data would be stored as:

 03   02   01   00  <-- MEMORY OFFSET
[0D] [0C] [0B] [0A] <-- BYTE DATA

On a side note, these diagrams list the memory offsets from right to left, so offset 00 (the lowest address) is on the right.
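The two layouts above can also be worked out with shifts and masks, without touching memory at all. This is only a sketch for a 32-bit value; both function names are invented for illustration.

```cpp
#include <cstdint>

// BYTE A LITTLE-ENDIAN MACHINE STORES AT THE GIVEN OFFSET.
// ASSUMPTION: Offset IS IN THE RANGE 0 TO 3.
unsigned int LittleEndianByteAt(std::uint32_t Value, unsigned int Offset) {
  return (Value >> (8 * Offset)) & 0xFFu;
}

// BYTE A BIG-ENDIAN MACHINE STORES AT THE GIVEN OFFSET.
// ASSUMPTION: Offset IS IN THE RANGE 0 TO 3.
unsigned int BigEndianByteAt(std::uint32_t Value, unsigned int Offset) {
  return (Value >> (8 * (3 - Offset))) & 0xFFu;
}
```

For 0x0A0B0C0D, LittleEndianByteAt returns 0x0D at offset 0 and 0x0A at offset 3, while BigEndianByteAt returns 0x0A at offset 0 and 0x0D at offset 3.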

To determine which data storage format a processor uses, we can create an unsigned integer and assign it a simple value like '7' (0x00000007 in hexadecimal). Then we can check whether the first byte of the memory the data uses (the byte at the lowest address) is '7' or not. If the byte value is '7', then the processor uses the Little Endian format. If it is not a '7', then the processor uses the Big Endian format. Here is one way to do it:

#include <iostream>

void CheckEndian() {

  // THE VALUE SHOULD RANGE FROM 1 TO 255 SINCE WE
  // ONLY WANT TO READ A BYTE VALUE
  const unsigned int ValueToCheckAgainst { 7 };
  
  // A POINTER TO NON-CONST DATA CAN NOT POINT AT A CONST
  // VARIABLE, SO WE COPY THE VALUE INTO A NON-CONST ONE.
  // (A "const unsigned char *" WOULD ALSO WORK.)
  unsigned int Value { ValueToCheckAgainst };
  
  // CREATE A POINTER AND REINTERPRET ITS CAST (bit
  // pattern)
  unsigned char *PointerToValue
      { reinterpret_cast<unsigned char *>(&Value) };
  
  // CHECK THE FIRST BYTE
  std::cout << (PointerToValue[0] == ValueToCheckAgainst
               ? "This system is little-endian\n"
               : "This system is big-endian\n");
}

int main() {

  CheckEndian();
  
}

(Screenshot from Android phone using the CxxDroid - C++ Compiler IDE)

In the above demo, we use the "reinterpret_cast" pointer type conversion operator to access the 4-byte integer data as an array of bytes, as demonstrated with the "PointerToValue[0]" array access. However, this is the critical area of code where we need to make sure we don't try to access data outside those 4 bytes.
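One way to keep the access inside those bytes is to copy them into a fixed-size container first. The sketch below swaps in std::memcpy and std::array for the pointer cast; it is an alternative technique, not the method used in the rest of this post, and the names "CopyBytes" and "IsLittleEndian" are invented here.

```cpp
#include <array>
#include <cstring>

// COPY THE BYTES OF AN unsigned int INTO AN ARRAY OF THE SAME SIZE
std::array<unsigned char, sizeof(unsigned int)> CopyBytes(unsigned int Value) {

  std::array<unsigned char, sizeof(unsigned int)> Bytes {};

  // memcpy COPIES EXACTLY sizeof(Value) BYTES, NO MORE
  std::memcpy(Bytes.data(), &Value, sizeof(Value));

  return Bytes;
}

// SAME CHECK AS CheckEndian(), BUT RETURNING A bool
bool IsLittleEndian() {
  return CopyBytes(7u)[0] == 7;
}
```

Because the array knows its own size, its at() member function can be used in place of the square brackets to get a bounds-checked read.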

As the following code example shows, we can access a variable's memory as an array as well.

#include <iostream>

int main() {

  unsigned int Value { 7 };
  unsigned int *PValue { &Value };
  
  std::cout << PValue[0] << "\n";
}

(Screenshot from Android phone using the CxxDroid - C++ Compiler IDE)

Caution should be taken in the code above, as any index value greater than zero accesses memory outside the range of the original data, which is undefined behavior. When we access the data's memory at the byte level, valid index values run from zero to one less than the size of the data in bytes, so caution is still needed to make sure we do not access memory outside the bounds of the data.
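Since C++ will not check the index for us, we can provide our own safety net. Here is a minimal sketch of a bounds-checked byte read; the function name "ByteAt" and the choice to throw std::out_of_range are assumptions made for this example.

```cpp
#include <cstddef>
#include <stdexcept>

// HYPOTHETICAL HELPER: READ ONE BYTE OF Value, REFUSING ANY
// INDEX THAT FALLS OUTSIDE THE VARIABLE'S MEMORY
unsigned char ByteAt(const unsigned int &Value, std::size_t Index) {

  if (Index >= sizeof(Value)) {
    throw std::out_of_range("byte index outside of the data");
  }

  return reinterpret_cast<const unsigned char *>(&Value)[Index];
}
```

With a 4-byte unsigned int, index values 0 to 3 return a byte and anything larger throws instead of reading stray memory.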

To display the individual byte values in memory that the data is comprised of, with the most significant byte first on a Little Endian system, we need to start at the last byte. C++ offers an operator that provides us with the number of bytes a data type is comprised of, called "sizeof".

Here's one way of displaying the bytes in memory that make up the data for a data type:

#include <iostream>

int main() {

  unsigned int Value { 0x4A4F5921 };
  const unsigned int SizeOfValue { sizeof( Value ) };
  
  unsigned char *bytes = reinterpret_cast<unsigned char *>(&Value);
  
  for (int i = SizeOfValue - 1; i >= 0; i--) {
  
    std::cout << bytes[i] << " ";
    
  }
}

(Screenshot from Android phone using the CxxDroid - C++ Compiler IDE)

Looking at the last screenshot above, the message prints as "J O Y !" instead of "74 79 89 33" as expected. The reason this happened is that std::cout treats the char data types as characters, so each byte is printed as its ASCII character instead of its number. This is automatic.

To prevent C++ from printing ASCII characters and have it print the numerical values instead, we need to convert each byte value (not the pointer) to a larger integer type before printing it. Here's one solution.

#include <iostream>

int main() {

  unsigned int Value { 0x4A4F5921 };
  const unsigned int SizeOfValue { sizeof( Value ) };
  
  unsigned char *bytes = reinterpret_cast<unsigned char *>(&Value);
  
  for (int i = SizeOfValue - 1; i >= 0; i--) {
  
    std::cout << static_cast<unsigned int>(bytes[i]) << " ";
    
  }
}

(Screenshot from Android phone using the CxxDroid - C++ Compiler IDE)
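The difference between the last two demos comes down to which type reaches the output stream. As a rough sketch, the two hypothetical helpers below write a single byte to a string stream, once as-is and once after static_cast.

```cpp
#include <sstream>
#include <string>

// AN unsigned char IS WRITTEN AS A CHARACTER
std::string PrintedAsCharacter(unsigned char Byte) {
  std::ostringstream Out;
  Out << Byte;
  return Out.str();
}

// AN unsigned int IS WRITTEN AS A NUMBER
std::string PrintedAsNumber(unsigned char Byte) {
  std::ostringstream Out;
  Out << static_cast<unsigned int>(Byte);
  return Out.str();
}
```

For the byte 0x4A, the first function returns "J" and the second returns "74".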

The code above, while it works exactly how we want it to, is written in a poor coding style. If you're interested in having a career in software development, learning good coding etiquette early on will help you on your career path.

To correct the etiquette of the code above and to make the solution scalable, it would be best to move the code in the main function to its own function. Here's an example:

#include <iostream>


void PrintDataBytes(unsigned int &Value) {

  const unsigned int SizeOfValue { sizeof( Value ) };
  unsigned char *bytes { reinterpret_cast<unsigned char *>(&Value) };
  
  for (int i = SizeOfValue - 1; i >= 0; i--) {
  
    std::cout << static_cast <unsigned int>(bytes[i]) << " ";
    
  }
}


int main() {

  unsigned int Value1 { 0x4A4F5921 };
  PrintDataBytes(Value1);

}

Why would it matter to take the extra time to use proper coding etiquette on a simple test demo?

Scalability.

Scalability is a major reason to make the extra effort to use proper coding etiquette. Having the logic in its own function allows us to reuse it over and over again, and it makes the code more readable, as well as easier for others to understand.

In "The Joyous Solution" section above, I mentioned using templates as part of the solution. Templates allow us to write the function once and pass any data type or class to it in its parameters. Without templates, we would have to create the same function for each data type we might possibly want to pass as a parameter.

On a side note, while we only write one function for all data types, thanks to templates, the compiler itself will duplicate the function as many times as the number of different data types passed to the function. If you're one who concerns themselves with file sizes, you could potentially see an increase in those numbers the more data types you pass to the function.

Adding a template to our previous code above is very difficult and scary. The requirements are, you add a short template statement above the function and then change the "Value" data type to the one defined by the template. For example:

#include <iostream>


template <typename AnyDataType>
void PrintDataBytes(AnyDataType &Value) {

  const unsigned int SizeOfValue { sizeof( Value ) };
  
  unsigned char *bytes { reinterpret_cast<unsigned char *>(&Value) };
  
  for (int i = SizeOfValue - 1; i >= 0; i--) {
  
      std::cout << static_cast <unsigned int>(bytes[i]) << " ";
      
  }
}


int main() {

  unsigned int Value1 { 0x4A4F5921 };
  PrintDataBytes(Value1);
  
}

That code addition was exhausting, but through perseverance, it was completed.

Well... Almost completed.

We use decimal (base-10) numbers in many areas of our lives, and we can use them in coding. However, they do not represent the base-2 number system very well.

Hexadecimal, a base-16 number system, is much easier to mentally convert to binary than decimal, because each hexadecimal digit maps to exactly four bits, making it an easier representation.
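Each hexadecimal digit stands for exactly four bits, which is what makes the mental conversion easy. Here is a small sketch of that mapping; the function name "HexDigitToBinary" is made up for this example.

```cpp
#include <string>

// CONVERT ONE HEXADECIMAL DIGIT (0 TO 15) INTO ITS 4 BINARY DIGITS
std::string HexDigitToBinary(unsigned int Digit) {

  std::string Bits;

  // WALK FROM THE HIGHEST OF THE 4 BITS DOWN TO THE LOWEST
  for (int Bit = 3; Bit >= 0; --Bit) {
    Bits += ((Digit >> Bit) & 1u) ? '1' : '0';
  }

  return Bits;
}
```

The byte 0x4A is simply the digit 4 ("0100") followed by the digit A ("1010"), giving 01001010.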

We can use the stream manipulators from the <iomanip> header to quickly and easily display the byte values in hexadecimal. Here's an example:

#include <iostream>
#include <iomanip>

template <typename AnyDataType>
void PrintDataBytes(AnyDataType &Value) {

  const unsigned int SizeOfValue { sizeof( Value ) };
  unsigned char *bytes { reinterpret_cast<unsigned char *>(&Value) };
  
  for (int i = SizeOfValue - 1; i >= 0; i--) {
  
    std::cout << std::hex           // PRINT HEXADECIMAL DIGITS
              << std::uppercase     // PRINT IN UPPERCASE LETTERS
              << std::setfill('0')  // PAD DIGITS WITH ZERO
              << std::setw(2)       // SET THE HEX WIDTH TO 2 CHARACTERS
              << static_cast <unsigned int>(bytes[i])
              << " ";
  }
}


int main() {

  unsigned int Value1 { 0x4A4F5921 };
  PrintDataBytes(Value1);
  
}

(Screenshot from Android phone using the CxxDroid - C++ Compiler IDE)
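One thing to be aware of: std::hex, std::uppercase, and std::setfill stay active on the stream after the function returns, so any number printed later would also appear in hexadecimal. Only std::setw resets itself after each item. The sketch below shows one way to put the stream back the way it was; the function names are hypothetical.

```cpp
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// WRITE ONE BYTE AS 2 HEX DIGITS, THEN RESTORE THE STREAM SETTINGS
void WriteHexByte(std::ostream &Out, unsigned int Byte) {

  // REMEMBER THE CURRENT FORMAT FLAGS AND FILL CHARACTER
  const std::ios_base::fmtflags SavedFlags { Out.flags() };
  const char SavedFill { Out.fill() };

  Out << std::hex << std::uppercase
      << std::setfill('0') << std::setw(2)
      << Byte;

  // PUT EVERYTHING BACK THE WAY WE FOUND IT
  Out.flags(SavedFlags);
  Out.fill(SavedFill);
}

// DEMO: A HEX BYTE FOLLOWED BY A NORMAL DECIMAL NUMBER
std::string HexThenDecimal(unsigned int Byte, int Number) {
  std::ostringstream Out;
  WriteHexByte(Out, Byte);
  Out << ' ' << Number;
  return Out.str();
}
```

HexThenDecimal(0x4A, 255) returns "4A 255". Without the two restore lines, the 255 would come out as "FF".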

Speaking of the base-2 (binary) number system, C++ offers a great way to display integer data as binary numbers: std::bitset, from the <bitset> header. Here's an example:

#include <iostream>
#include <iomanip>
#include <bitset>

template <typename AnyDataType>
void PrintDataBits(AnyDataType &Value) {

  const unsigned int SizeOfValue { sizeof( Value ) };
  unsigned char *bytes { reinterpret_cast<unsigned char *>(&Value) };

  for (int i = SizeOfValue - 1; i >= 0; --i) {

    // PRINT EACH BYTE AS 8 BINARY DIGITS
    std::cout << std::bitset<8>{bytes[i]}
              << "  ";

  }

  std::cout << "\n\n";
}


template <typename AnyDataType>
void PrintDataBytes(AnyDataType &Value) {

  const unsigned int SizeOfValue { sizeof( Value ) };
  unsigned char *bytes { reinterpret_cast<unsigned char *>(&Value) };

  for (int i = SizeOfValue - 1; i >= 0; i--) {
    
    std::cout << std::hex           // PRINT HEXADECIMAL DIGITS
              << std::uppercase     // PRINT IN UPPERCASE LETTERS
              << std::setfill('0')  // PAD DIGITS WITH ZERO
              << std::setw(2)       // SET THE HEX WIDTH TO 2 CHARACTERS
              << static_cast <unsigned int>(bytes[i])
              << " ";

  }
  
  std::cout << "\n";
} 


int main() {
  
  unsigned int Value1 { 0x4A4F5921 };
  PrintDataBytes(Value1);
  PrintDataBits(Value1);

}

(Screenshot from Android phone using the CxxDroid - C++ Compiler IDE)
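Because the functions are templates, they are not limited to unsigned int. As a rough sketch of that flexibility, here is a variant that returns the hexadecimal text as a string instead of printing it. The name "DataBytesAsHex" is invented for this example, and the parameter is a const reference so that constants and temporary values can be passed as well.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

template <typename AnyDataType>
std::string DataBytesAsHex(const AnyDataType &Value) {

  const unsigned char *Bytes
      { reinterpret_cast<const unsigned char *>(&Value) };

  std::ostringstream Out;
  Out << std::hex << std::uppercase << std::setfill('0');

  // WALK FROM THE LAST BYTE DOWN TO THE FIRST, AS BEFORE
  for (std::size_t i { sizeof(Value) }; i > 0; --i) {

    // setw ONLY APPLIES TO THE NEXT ITEM, SO IT IS SET EVERY TIME
    Out << std::setw(2)
        << static_cast<unsigned int>(Bytes[i - 1])
        << ' ';

  }

  return Out.str();
}
```

The same function accepts a char, a double, or a struct. Returning a string also leaves std::cout untouched, and it makes the result easy to compare in a test.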

In Closing:

C++ provides an exceptional number of tools to create solutions with low-level to high-level parts, ranging from very safe methods to extremely unsafe ones. Whichever path a developer chooses to take when creating a solution in C++, it is important to learn, understand, and remember the underlying systems so we can craft better solutions.

You may have noticed I never used "using namespace std", nor "std::endl" in any of the code above.

The "using namespace std" directive is best avoided, as it pulls every name in the Standard Library into the global namespace, which can cause naming collisions and other issues. It's also a poor coding practice.

As for "std::endl": it writes a newline and then flushes the output stream. That extra flush is not critical in simple demos like this, so a plain "\n" is used instead.