Raphael S. Carvalho's Programming Blog

raphael.scarv@gmail.com

"A programmer that cannot debug effectively is blind."

Wednesday, July 16, 2014

Craziness

Everything I have done so far in my life has something to do with Eggs with Bacon. How about yours?

The Famous Eggs with Bacon.
 

Friday, December 27, 2013

ZFS Adjustable Replacement Cache (ARC) Analysis on OSv

Find further info about OSv at: http://osv.io

:: Overview ::
- The size of the ZFS ARC is allowed to grow beyond the target size, but arc_reclaim_thread
must eventually wake up to reduce the size back to the target.
I initially thought it wouldn't be working, since commit '29cf134' partially disabled its
functionality; however, running 'osv zfs' later shows that the ARC size is really reduced to
conform to the target.

- The initial ARC target should be set to 1/8 of all memory, then reduced under eventual
memory pressure (Tomek has already touched this, and Glommer suggested a similar
approach). arc_init() gets the system memory through kmemsize(), which currently always returns 0,
thus the ARC comes up with its own number when setting the target size (16MB); a small sketch of the intended behavior follows this overview.

- Another important detail is that the L2ARC (level-2 ARC) is currently disabled, which means a
performance penalty depending on the workload.
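
A minimal sketch of the target-size initialization described in the second point above; the 1/8 ratio and the 16MB fallback come from this post, and the helper name is only illustrative, not the actual arc_init() code:
 #include <stdint.h>

 #define MB (1024ULL * 1024ULL)

 uint64_t kmemsize(void);   /* system memory size; currently always returns 0 on OSv */

 static uint64_t arc_initial_target(void)
 {
     uint64_t physmem = kmemsize();

     if (physmem == 0)
         return 16 * MB;    /* no memory info available: fall back to a fixed 16MB target */

     return physmem / 8;    /* otherwise start the target at 1/8 of all memory */
 }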

:: For memory pressure ::
- Knowing that arc_reclaim_thread() is working and relies on arc_adjust() to resize the lists,
we know that arc_shrink() would work on OSv.

arc_shrink() reduces the ARC target size by doing the following:
* First, to_free is calculated as follows: to_free = arc_target_size >> arc_shrink_shift (5).
* Then it guarantees that to_free will not reduce the target size to a value lower than the
minimum target size.
  If subtracting to_free still leaves the target size above the minimum, the target size is reduced by
to_free (which means shrinking the ARC target by about 3.125%).
  If not, the target size is set to the minimum target size.
* Finally, arc_adjust() is called to do the actual work.
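
Putting the steps above together, a minimal sketch of that logic looks like the following (the variable names are mine, not the exact ones in the ZFS source):
 #include <stdint.h>

 #define ARC_SHRINK_SHIFT 5      /* shrink by target >> 5, i.e. about 3.125% */

 extern uint64_t arc_target_size;      /* current ARC target size */
 extern uint64_t arc_min_target_size;  /* lower bound for the target */
 extern void arc_adjust(void);         /* does the actual eviction work */

 void arc_shrink(void)
 {
     uint64_t to_free = arc_target_size >> ARC_SHRINK_SHIFT;

     if (arc_target_size - to_free > arc_min_target_size)
         arc_target_size -= to_free;            /* reduce the target by ~3.125% */
     else
         arc_target_size = arc_min_target_size; /* clamp to the minimum target */

     arc_adjust();   /* evict buffers until the ARC conforms to the new target */
 }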

:: ZFS ARC performance on Cassandra ::
- The results below show that the ARC miss ratios are really high in both cases. The ZFS ARC
performs well on small workloads, but on larger ones the performance isn't
the same.

[raphaelsc@muninn bin]$ ./cassandra-stress -d 192.168.122.89 -n 10000000
total,interval_op_rate,interval_key_rate,latency/95th/99th,elapsed_time
225940,22594,22594,1.5,3.0,33.2,10
512367,28642,28642,1.5,2.5,69.4,20
762547,25018,25018,1.5,2.6,93.7,30
1029819,26727,26727,1.5,2.5,93.7,40
1269269,23945,23945,1.5,2.7,93.4,50

(gdb) osv zfs
:: ZFS TUNABLES ::
    zil_replay_disable=0
    zfs_nocacheflush=0
    zfs_prefetch_disable=0
:: ARC SIZES ::
    Actual ARC Size: 64839968
    Target size of ARC: 16777216
    Min Target size of ARC: 16777216
    Max Target size of ARC: 16777216
    Target size of MRU: 15728640
:: ARC EFFICIENCY ::
Total ARC accesses: 63962
    ARC hits: 51622 (80.71%)
        ARC MRU hits: 18842 (36.50%)
            Ghost Hits: 1811
        ARC MFU hits: 32306 (62.58%)
            Ghost Hits: 970
    ARC misses: 12340 (19.29%)
:: L2ARC ::
    Actual L2ARC Size: 0
Total L2ARC accesses: 0
    L2ARC hits: 0 (nan%)
    L2ARC misses: 0 (nan%)

[raphaelsc@muninn bin]$ ./cassandra-stress -d 192.168.122.89 -n 10000000
total,interval_op_rate,interval_key_rate,latency/95th/99th,elapsed_time
208736,20873,20873,1.7,3.8,27.0,10
424091,21535,21535,1.7,3.5,102.4,20
624038,19994,19994,1.7,3.6,102.4,30
871778,24774,24774,1.7,3.4,76.9,40
1048259,17648,17648,1.6,3.2,111.4,50
1307851,25959,25959,1.6,3.1,76.9,60
1564253,25640,25640,1.6,3.0,571.1,70
1814642,25038,25038,1.6,2.8,74.7,80
2066720,25207,25207,1.6,2.8,40.1,91
2264887,19816,19816,1.6,2.9,40.0,101

(gdb) osv zfs
:: ZFS TUNABLES ::
    zil_replay_disable=0
    zfs_nocacheflush=0
    zfs_prefetch_disable=0
:: ARC SIZES ::
    Actual ARC Size: 143722352
    Target size of ARC: 16777216
    Min Target size of ARC: 16777216
    Max Target size of ARC: 16777216
    Target size of MRU: 15728640
:: ARC EFFICIENCY ::
Total ARC accesses: 226173
    ARC hits: 158569 (70.11%)
        ARC MRU hits: 54671 (34.48%)
            Ghost Hits: 6017
        ARC MFU hits: 85117 (53.68%)
            Ghost Hits: 3033
    ARC misses: 67604 (29.89%)
:: L2ARC ::
    Actual L2ARC Size: 0
Total L2ARC accesses: 0
    L2ARC hits: 0 (nan%)
    L2ARC misses: 0 (nan%)

Monday, October 21, 2013

Booting and File systems

Arch: x86
Firmware: BIOS
Partitioning scheme: MBR-based
Platform: Linux
Bootloader: Syslinux-based

The BIOS comes to life when the computer is switched on.
The BIOS looks for a boot sector in the first sector of the respective disk.
The last two bytes of that sector must match the boot signature (0x55AA).
If they do, the BIOS loads the first sector of the disk into main memory and starts executing it.

The code in that first sector goes through each entry in the partition table, attempting to find the bootable partition.
If one is found, it loads the first sector of that partition using the CHS address in the respective entry.
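
For reference, here is a rough sketch of that partition-table walk, written in C for readability (the real first-stage code is 16-bit assembly); the layout follows the classic MBR format, the field names are mine:
 #include <stdint.h>

 struct mbr_partition_entry {
     uint8_t  bootable;       /* 0x80 = bootable, 0x00 = not bootable */
     uint8_t  chs_first[3];   /* CHS address of the partition's first sector */
     uint8_t  type;           /* partition type id */
     uint8_t  chs_last[3];    /* CHS address of the partition's last sector */
     uint32_t lba_first;      /* LBA of the first sector */
     uint32_t sector_count;
 } __attribute__((packed));

 /* The partition table starts at offset 446 of the 512-byte boot sector. */
 struct mbr_partition_entry *find_bootable(uint8_t boot_sector[512])
 {
     struct mbr_partition_entry *table =
         (struct mbr_partition_entry *)(boot_sector + 446);

     for (int i = 0; i < 4; i++) {
         if (table[i].bootable == 0x80)
             return &table[i];   /* next: load this partition's first sector */
     }
     return 0;                   /* no bootable partition found */
 }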

This first sector is the bootloader. It's usually hardcoded with the location of the second stage.
The second stage has file system drivers, modules, etc.
The second stage determines which file system driver should be used for this partition based on the magic number in the superblock.
Each file system driver has such a discovery function.
When a matching driver is found, its file system operations structure is hooked to the mount point (partition).

Bootloaders usually have implicit absolute paths for configuration files, e.g. /boot/config.
Such files inform the location of the kernel image and initramfs.

The kernel has its own file system drivers embedded in its image.
When mounting devices, it goes through the list of available drivers and performs the following for each entry:
- Check if the driver supports the file system installed in the device.
- If so, hook this driver to the device.
- If not, go to the next entry.

After that, all operations on that device (mount point) use the operations provided by the driver hooked at mount time.
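
A minimal sketch of that probe-and-hook loop might look like this (the structures and function names are illustrative assumptions, not any particular kernel's API):
 #include <stddef.h>

 struct block_device;
 struct fs_operations;

 struct fs_driver {
     const char *name;
     /* returns nonzero if the magic number in the superblock matches */
     int (*probe)(struct block_device *dev);
     const struct fs_operations *ops;
 };

 extern struct fs_driver *fs_drivers[];   /* NULL-terminated list of available drivers */

 /* Walk the driver list and hook the first matching driver to the device. */
 const struct fs_operations *mount_device(struct block_device *dev)
 {
     for (size_t i = 0; fs_drivers[i] != NULL; i++) {
         if (fs_drivers[i]->probe(dev))
             return fs_drivers[i]->ops;   /* later operations on this mount point use these ops */
     }
     return NULL;                         /* no driver recognized the file system */
 }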

Saturday, September 28, 2013

Does 'processor's size' limit the size of main memory?

The processor doesn't limit the size of main memory; it limits the addressable space.

"Does it mean something else also?"
It means a lot of things.
- Register size.
- Addressing.
- Data Alignment.

On x86-32, for example, ESP (the stack pointer) is a 32-bit register that points to the top of the stack, and it is used implicitly by the processor in stack-related operations (PUSH, POP). Registers such as ESP basically work as an offset into the addressable space.

Nowadays, addresses issued by your program go through a translation process in the MMU (on computers equipped with an MMU), but I will not discuss that here as it has nothing to do with the main purpose of this topic. It's possible to have an address bus whose width is lower/higher than the size of the processor (for instance, PAE relies on having a higher number of addressing lines =]).

Yes, the address bus must be at least compatible with the word size of the processor. Otherwise, how would we send all bits of an address to the memory controller on load/store operations?

x86 real mode is an interesting example of an addressable space larger than the word size of the processor.
There, you had segment:offset addresses where the segment was multiplied by 16 (shifted left by 4) and the result added to the offset. The generated address was then sent to the memory controller over the address bus. Even though real-mode processors had registers of at most 16 bits, the address bus was made of 20 addressing lines.
Up to 1 megabyte of physical memory could be accessed.
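
Here is a tiny worked example of that calculation (plain C just to show the arithmetic; the segment and offset values are arbitrary):
 #include <stdint.h>
 #include <stdio.h>

 int main(void)
 {
     uint16_t segment = 0xF000;
     uint16_t offset  = 0xFFF0;

     /* the segment is shifted left by 4 (multiplied by 16) and added to the offset,
      * producing a 20-bit physical address */
     uint32_t physical = ((uint32_t)segment << 4) + offset;

     printf("%04x:%04x -> 0x%05x\n", segment, offset, physical);   /* prints 0xFFFF0 */
     return 0;
 }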

The following sentence will probably help you:
"A microprocessor will typically have a number of addressing lines equal to the base-two logarithm of its physical addressing space" http://en.wikipedia.org/wiki/A20_line

There is an interesting approach used by compilers when certain operations aren't natively supported by the underlying processor. For example, 32-bit processors don't support 64-bit data values, but some compilers circumvent that by emulating 64-bit operations (load/store, arithmetic, branch).

Suppose we will run the following snippet of code under a 32-bit processor:
 long long int a, b; // 64-bit values (even on 32-bit processors).
 a += b; // Add b to a; store the result into a.
How would that be possible if 32-bit processors cannot operate on data wider than 32 bits?
As I said above, the compiler emulates such operations. It does that by using multiple instructions (steps).
Yes, it will be slower, but that's the only way of dealing with data wider than what the processor supports.

On a 32-bit processor, if you're adding one 64-bit value to another, the addition must be done in parts, since 64-bit operands aren't natively supported by 32-bit processors.

The assembly code corresponding to the above snippet would look something like the following:
 # eax:ebx will be used to store a.
 # ecx:edx will be used to store b.
 add ebx, edx; # add the low halves first (storing the result into ebx)
 # note that adc is used instead of add:
 # there may be a carry left over by the previous addition,
 # so the next addition must take it into account.
 adc eax, ecx; # then add the high halves plus the carry (storing the result into eax)
Yeah, it's expensive (from both a resource and a performance standpoint, since several general-purpose registers are used and multiple steps are required to get the operation done) and boring (personal opinion =P), but nevertheless, how would we do it otherwise?

Hope it will help you,
Raphael S. Carvalho.

Friday, September 6, 2013

x86 security mechanisms

Hi folks,

Unless the kernel of your Operating System uses the "follow the bouncing kernel" scheme, the kernel and the user environment share the same address space. So the following doubt may arise: if the user is sharing its address space with the kernel, which security policies/mechanisms are used to ensure safety? That's what I will be addressing here, so fasten your seat belt :)

The MMU (Memory Management Unit) uses both paging and segmentation as protection mechanisms.
I will focus on paging here, so if you want to understand how paging relates to segmentation, take a look at:

The MMU performs checks based on the CPL (Current Privilege Level)[1].
Each mapped page has a corresponding page entry that tells the hardware which operations are allowed and who can access it.
When user programs are being executed, the CPL is always 3, thus pages marked as system-only aren't available. There are a lot of flags stored in each 'page descriptor', but I will present the two most relevant to the purpose of this post.

[1]: CPL is an essential feature of the protection mechanism of the x86.
It prevents processes running at a less privileged level from doing things that would halt the entire system.
By the way, the CPL is stored in the bottom two bits of the CS register.
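
If you want to see it yourself, here is a tiny sketch that reads the CPL out of CS (GCC inline assembly, x86); run it in user space and you should get 3:
 #include <stdint.h>
 #include <stdio.h>

 static unsigned current_privilege_level(void)
 {
     uint16_t cs;
     __asm__("mov %%cs, %0" : "=r"(cs));   /* copy the CS selector into a register */
     return cs & 0x3;                      /* the CPL lives in the bottom two bits */
 }

 int main(void)
 {
     printf("CPL = %u\n", current_privilege_level());
     return 0;
 }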

Write and Present are flags that determine the current state of a page. If the write flag is set, write operations are allowed; otherwise, the page is read-only.
The present flag basically tells the MMU whether or not the underlying page is present in physical memory. If it is not, the page was probably swapped out to some other place due to some event, e.g. running out of memory.
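
As a concrete illustration, the bits below follow the 32-bit x86 page-table-entry layout (Present, Read/Write, and User/Supervisor, the last one being what marks a page as system-only); the helper function is just a sketch of the check the MMU performs in hardware:
 #include <stdint.h>
 #include <stdbool.h>

 #define PTE_PRESENT (1u << 0)   /* page is resident in physical memory */
 #define PTE_WRITE   (1u << 1)   /* writes allowed; otherwise read-only */
 #define PTE_USER    (1u << 2)   /* accessible at CPL 3; otherwise system-only */

 static bool user_can_write(uint32_t pte)
 {
     /* a CPL-3 write only succeeds when the page is present,
      * writable, and user-accessible */
     return (pte & PTE_PRESENT) && (pte & PTE_WRITE) && (pte & PTE_USER);
 }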

When the system traps into kernel level (e.g. on a syscall), the processor itself switches the mode from user to system (from CPL 3 to 0). From this moment on, pages marked as system-only are available.

Knowing that, we readily understand that processes running at CPL 3 (user programs) will never be able to access the kernel address space. Unless there is some exploitable code in the kernel itself, user space will not be able to peek into kernel space.

Hope you enjoyed it,
Raphael S. Carvalho.

Saturday, August 31, 2013

Endianness

Look carefully at the following snippet of code:

int c = 0xFFAABBCC;
printf("%02x\n", ((unsigned char *)&c)[0]);

If you aren't familiar, then you may be asking yourself: How does it work?

Each hexadecimal digit represents a nibble, that is, 4 bits; so 2 hexadecimal digits = 1 byte. 'int c' stores a 32-bit/4-byte value.
It's also worth mentioning that the '0x' prefix marks hexadecimal literals in the C language.

Computer memory is basically a sequence of 8-bit cells{1}, so it's not possible to store all the bytes of that value in a single cell. Why? An int stores a multi-byte value, and so it must span several memory cells.
Unfortunately, different architectures store such numbers in different ways.

* {1}: This may not be true in the real world! Google about NUMA systems.

x86 is a little-endian architecture, which means that the least-significant bytes are stored first.
Do you understand the meaning of most-significant byte and least-significant byte in the following hexadecimal value: 0xFFAABBCC?
It's just terminology to describe the significance of each byte of a multi-byte value.
0xFF is the most-significant byte, whereas 0xCC is the least-significant one.

So answer me the following: which byte of the variable c will be stored first in memory, the most-significant or the least-significant?
If you understood the content above, you know that it depends on the underlying arch.

On a little-endian arch, the bytes of 0xFFAABBCC will be stored in memory as follows (on a big-endian arch, 0xFF, the most-significant byte, would be stored first instead):

[0] = 0xCC
[1] = 0xBB
[2] = 0xAA
[3] = 0xFF


* [0] denotes a lower address than [3].

* An example of a big-endian arch is PPC.

- So let's figure out what each piece of the code means:

The first step of the code is this: ((unsigned char *)&c)
It basically takes the address of an integer variable, so we have a pointer to an integer. Then it converts the integer pointer into a character pointer by using an explicit cast.
From then on, it's a pointer to an 8-bit value.

((unsigned char *)&c)[0]: The second step basically gets the value pointed to by the character pointer.
As I told you, the least-significant bytes are stored first on little-endian archs, so 0xCC is the output. If it were [1] instead, then 0xBB would be output, since [1] references the second least-significant byte.
You can see how the bytes were individually stored in memory by looking at my description above.

If you got the content of this post, then you won't have any trouble writing code to check the endianness of your machine. If you're feeling adventurous, take it as an exercise =)

Hope you liked it,
Raphael S. Carvalho.

Saturday, July 20, 2013

*** X86 PROTECTION *** [ 1 / ??? ]

This is the first tutorial of a series about the protection mechanisms on the x86 arch.

* This text was mostly based on the Intel 80386 Programmer's Reference Manual.
NOTE: If something in this article is unclear to you, consult the manual itself.
NOTE: This article explains the protection mechanisms; if you aren't familiar with the address formation cycle (segmentation->paging), then you should go back and learn it before reading this article.
NOTE: Sentences surrounded by double quotes are explanations taken directly from the Intel Manual.

Today I will talk about the protection criteria on x86. Firstly, it's important to know that protection is essential to confine bugs.
It means that intended/unintended bugs in one procedure shouldn't damage others.
Protection on x86 was built with two important goals in mind: helping detect and helping identify bugs, so that they can be found and eliminated as quickly as possible.

Protection on x86 was made possible through the following mechanisms:
- Checking memory accesses.
- Checking instruction execution.


It's also important to know that the system developer has the power to decide which mechanisms will be used (according to the system design objectives).

Protection Overview:
-----
The Intel manual lists all aspects of protection as applied both to segmentation and paging.

Aspects of protection:
1. Type checking
2. Limit checking
3. Restriction of addressable domain
4. Restriction of procedure entry points
5. Restriction of instruction set


It's interesting to know that the protection hardware of the x86 is an integral part of the memory management hardware.
When a memory reference is made, this hardware must check that it conforms to the protection criteria defined by the system programmer.
Any memory access that doesn't conform to the criteria results in an exception, which must be handled by the exception mechanism.
An invalid memory access prevents the cycle from starting, thus protecting the system. You might ask yourself: wouldn't this protection mechanism result in a performance penalty, since there is a check on every memory access?
The answer is no: the address formation and the memory access check are performed concurrently.
NOTE: I won't talk about exception handling in this article. If you're interested, then you should go ahead and learn it yourself.

It's also important to clarify the meaning of privilege when applied to aspects of protection.

"The concept of "privilege" is central to several aspects of protection (numbers 3, 4, and 5 in the preceeding list)."
"Applied to procedures, privilege is the degree to which the procedure can be trusted not to make a mistake that might affect other procedures or data. Applied to data, privilege is the degree of protection that a data structure should have from less trusted procedures."

It's very easy to understand, so no additional note should be needed. I will finish this protection overview topic by emphasizing that this concept of privilege applies both to segment and to page protection.

Segment-level protection

-----
It's essential to know that checks are performed automatically by the CPU whenever you load a segment descriptor into a segment register, and also on every segment access.
Segment descriptors simply store info about a segment. "Segment registers hold the protection parameters of the currently addressable segments."

Besides, I suppose you know the difference between segmentation in real and protected mode. I won't present the differences in this article.

Descriptors and Protection Parameters
-----
All of these fields are set by the system programmer at the time the descriptor is created. As stated in the manual, the application programmer shouldn't be concerned with these details unless needed (e.g. for better understanding or for exploitation issues).

It's also very interesting that each segment register has an invisible portion for storing several parameters about the segment. The processor loads not only the base address, but also protection information (limit, type, and privilege level).
This way, on subsequent memory checks, the CPU can look at this invisible portion instead of retrieving the protection parameters from the descriptor on every segment access.
"Therefore, subsequent protection checks on the same segment do not consume additional clock cycles."



Hope you liked it,
Raphael S. Carvalho