Starting Address of Functions in C



What’s there to C?


In the post about function pointers, there was a mention of the function’s name being the function’s starting address. It was also mentioned that we would take a look at how that happens. You will have to bear with me a little bit, since we will be taking a brief look at the code generated by three of the four stages of the C compilation process. We will be skipping only the preprocessing step.

 

First things first

Let us briefly examine the memory layout for a program when it is being executed. Ignoring some compiler-specific sections, we can say that for each program, memory will be divided into five sections, namely, stack, heap, data, bss/comm, and text. How these sections are used is as follows:

 

Stack – all local variables will be on the stack and each function/block will have its own “stack frame”

Heap – any memory that is allocated by calls to malloc/calloc/realloc (dynamically allocated memory) will be on the heap.

bss/comm – conventionally, all uninitialized global variables will be in this section

Data – all initialized global variables will be in this section

Text – the actual code that forms a program (all the statements in a function) will be in the text section.

It is pretty obvious that, if the statements in a program are going to be in some part of memory, there will be addresses associated with the statements. And there will be one statement that will be the first statement in every function. The address of that first statement of a function will be the starting address of that function.

In the next section we will take a look at the following:

  • X86 assembly code that is generated by the compiler (output of compilation, the second stage after preprocessing)
  • disassembled object code (output of assembling, the third stage of compilation)
  • disassembled executable image (output of linking, the fourth stage of compilation) of a program.

An important point to understand is that whenever a label is created in C or assembly code (a label is a symbol followed by a colon (:)), it basically represents a memory address.

 

The next point of interest is what a function call in C will look like in assembly. If we are looking at the X86 64-bit architecture, a function call will become the callq instruction (the call instruction, with the suffix q for the 64-bit version). The operand to this instruction has to be the address of the function being called.

In the world of ARM 64 architecture, the instruction would be bl (branch and link) and again the operand has to be the address of the function being called.

 

The output after the compilation stage

First, a look at a simple C program.

 

#include <stdio.h>

 

int func (int a, int b)

{

  int ret = a + b;

  return ret;

}

 

int main ()

{

  int i = 33, j = 44, res;

 

  res = func (i, j);

  return 0;

}

 

Below is a slightly truncated version of the X86 assembly code generated by the compiler. Remember, right now we are only interested in the labels and not the assembly code.

 

In this truncated output, we can see 6 labels – func, .LFB0, .LFE0, main, .LFB1 and .LFE1. Of these, the labels starting with .LF have been generated by the compiler, whereas func and main come from the C source file.

 

Also, note that between the labels func and .LFB0 (also between main and .LFB1) there is no assembly code and so the addresses will be the same for both labels.

 

So now it is clear that the names of the functions func and main have become labels. Now let us find out the actual addresses (relocatable address) of these functions.

 

Also, note that one single statement in C will be made up of multiple assembly statements. For example, the statement int ret = a + b in the function func has become 4 statements: the two movl loads from -20(%rbp) and -24(%rbp), the addl, and the movl that stores the result at -4(%rbp).

 

func:

.LFB0:

  pushq %rbp

  movq  %rsp, %rbp

  movl  %edi, -20(%rbp)

  movl  %esi, -24(%rbp)

  movl  -20(%rbp), %edx

  movl  -24(%rbp), %eax

  addl  %edx, %eax

  movl  %eax, -4(%rbp)

  movl  -4(%rbp), %eax

  popq  %rbp

  ret

.LFE0:

main:

.LFB1:

  pushq %rbp

  movq  %rsp, %rbp

  subq  $16, %rsp

  movl  $33, -4(%rbp)

  movl  $44, -8(%rbp)

  movl  -8(%rbp), %edx

  movl  -4(%rbp), %eax

  movl  %edx, %esi

  movl  %eax, %edi

  call  func

  movl  %eax, -12(%rbp)

  movl  $0, %eax

  leave

  ret

.LFE1:

 

And what does the object file created by the assembler look like?

The next stage in the compilation process is assembling. The assembler takes the assembly code output of the compilation stage (shown above) and translates it to the machine statements or instructions. The output of the assembler is called object code.

These instructions will have the format opcode <operand list>. Some instructions do not have any operands while others have one or more operands.

The output is in binary format and is not human-readable. We use a tool called objdump (there are other tools also) to convert the object code into a human-readable version. The output that we get when we use a tool like objdump is referred to as disassembled code.

The output of running the objdump command on the object code of our program is shown below. Again it is a slightly truncated version.

 

A brief explanation of the disassembled code is below.

First, the line containing <func>:. This is just the label corresponding to the function, and its offset is 0 (the offset of the first instruction, on the next line). There is a similar line with a label for the main function.

 

All the other lines start with a hexadecimal number followed by a ‘:’ and one or more hexadecimal numbers. After that comes the disassembled assembly code.

 

The first hexadecimal number is the offset of the statement/instruction.

 

The next bunch of hexadecimal number(s) are the object code representing the statements of the assembly code generated from the original C program.

 

And if we compare the hexadecimal numbers corresponding to the assembly statements, we can see that 55 is the object code (or, more appropriately, the opcode) for the assembly statement “push %rbp”.

 

As already mentioned, a function call in the 64-bit X86 architecture is the callq instruction. So in the main function, the callq instruction has to be present with the address of func as its operand.

 

Though there is a callq 3f statement at offset 3a, there is nothing that indicates that it is actually a call to the function func. The four operand bytes of that instruction are all zero, so the disassembler simply shows the offset of the next instruction (3f) as the target. That is because this code has not gone through the process of linking. But we can see that in the text section of memory, the offset of func is 0x0 and the offset of main is 0x1a.

0000000000000000 <func>:

   0:   55                      push   %rbp

   1: 48 89 e5             mov    %rsp,%rbp

   4: 89 7d ec             mov    %edi,-0x14(%rbp)

   7: 89 75 e8             mov    %esi,-0x18(%rbp)

   a: 8b 55 ec             mov    -0x14(%rbp),%edx

   d: 8b 45 e8             mov    -0x18(%rbp),%eax

   10: 01 d0                 add    %edx,%eax

   12: 89 45 fc             mov    %eax,-0x4(%rbp)

   15: 8b 45 fc             mov    -0x4(%rbp),%eax

   18: 5d                       pop    %rbp

   19: c3                       retq  

 

 

000000000000001a <main>:

  1a: 55                                         push    %rbp

  1b: 48 89 e5                              mov     %rsp,%rbp

  1e: 48 83 ec 10                         sub      $0x10,%rsp

  22: c7 45 fc 21 00 00 00         movl    $0x21,-0x4(%rbp)

  29: c7 45 f8 2c 00 00 00         movl    $0x2c,-0x8(%rbp)

  30: 8b 55 f8                               mov    -0x8(%rbp),%edx

  33: 8b 45 fc                               mov    -0x4(%rbp),%eax

  36: 89 d6                                   mov    %edx,%esi

  38: 89 c7                                   mov    %eax,%edi

  3a: e8 00 00 00 00                  callq  3f <main+0x25>

  3f: 89 45 f4                              mov    %eax,-0xc(%rbp)

  42: b8 00 00 00 00                 mov    $0x0,%eax

  47: c9                                       leaveq

  48: c3                                       retq

 

And finally, look at the disassembled executable code

 

The executable code is the output of the linking stage of the compilation process. This is again in binary format and is disassembled using the same objdump command.

The look and feel of the executable is very similar to the object code that we looked at above.

The main difference is that all the labels and statements now have a relocatable address instead of offsets. Thus func:, the label corresponding to the function func, now has the relocatable address 0x4004ad.

Again the disassembled code below is a (hugely) truncated version of the disassembly of the executable code. Please note that even after the linking is completed and we get the executable file, the addresses of the functions and the statements in the functions are not absolute addresses. Those addresses will be available only when the program is executing.

As we saw already in the previous section, what is of interest to us is the callq instruction.

Now the callq instruction, at relocatable address 0x4004e7, has the hexadecimal value 0x4004ad as its operand, and 0x4004ad is indeed the relocatable address of the function func.

 

00000000004004ad <func>:

 

  4004ad:  55                        push   %rbp

  4004ae:  48 89 e5             mov    %rsp,%rbp

  4004b1:  89 7d ec             mov    %edi,-0x14(%rbp)

  4004b4:  89 75 e8             mov    %esi,-0x18(%rbp)

  4004b7:  8b 55 ec             mov    -0x14(%rbp),%edx

  4004ba:  8b 45 e8             mov    -0x18(%rbp),%eax

  4004bd:  01 d0                  add    %edx,%eax

  4004bf:  89 45 fc              mov    %eax,-0x4(%rbp)

  4004c2:  8b 45 fc             mov    -0x4(%rbp),%eax

  4004c5:  5d                       pop    %rbp

  4004c6:  c3                       retq  

 

00000000004004c7 <main>:

  4004c7:  55                                         push   %rbp

  4004c8:  48 89 e5                              mov    %rsp,%rbp

  4004cb:  48 83 ec 10                         sub    $0x10,%rsp

  4004cf:  c7 45 fc 21 00 00 00          movl   $0x21,-0x4(%rbp)

  4004d6:  c7 45 f8 2c 00 00 00         movl   $0x2c,-0x8(%rbp)

  4004dd:  8b 55 f8                               mov    -0x8(%rbp),%edx

  4004e0:  8b 45 fc                               mov    -0x4(%rbp),%eax

  4004e3:  89 d6                                   mov    %edx,%esi

  4004e5:  89 c7                                   mov    %eax,%edi

  4004e7:  e8 c1 ff ff ff                       callq  4004ad <func>

  4004ec:  89 45 f4                              mov    %eax,-0xc(%rbp)

  4004ef:  b8 00 00 00 00                  mov    $0x0,%eax

  4004f4:  c9                                        leaveq

  4004f5:  c3                                        retq  

  4004f6:  66 2e 0f 1f 84 00 00        nopw   %cs:0x0(%rax,%rax,1)

  4004fd:  00 00 00

 

So the name of a function does indeed give its starting address.

 

NOTE:

When using gcc/clang compilers (on Linux/UNIX platforms) the following commands can be used to generate the different outputs shown above (assembly, object file and executable). Assuming that the C source file is called pgm.c, the various commands would be:

gcc -S pgm.c – This will generate pgm.s file

gcc -c pgm.s – This will generate pgm.o file (you can use pgm.c instead of pgm.s)

gcc pgm.o -o pgm – This will generate an executable called pgm (you can use pgm.c instead of pgm.o)

 

Again on Linux/UNIX platforms, the program to use for disassembling the .o file and the executable file is objdump. You can use this program to disassemble the object code/executable and also to see all the global symbols (variables and functions). Note that this program will just print its output on the screen.

 

objdump -d pgm – disassemble the code section of the executable file pgm.

objdump -t pgm – will print the global symbols from the symbol table.

 

 


Author

  • Venu Kolathur

    Venu Kolathur is Chief Architect and Co-Founder at Vayavya Labs and has over 38 years of industry & academic experience. He is responsible for product technology road-map, and design strategies.