What does the big data enterprise market look like in 2016? Is this a winner-take-all market where we will see certain companies dominate…

Various reports have pegged the big data market at around $40 billion in 2016 [1] [2]. Big data clearly has three leaders: Cloudera, Hortonworks and MapR, of which only Hortonworks has had its IPO. Its stock is trading at around $10 a share.

Hortonworks reported revenues of $46 million in 2014 [3]. Given the steady increase in its revenues over the years, it might touch $100 million in 2016. Hadoop is open source, meaning that companies can use it for free. So how does Hortonworks make money? One word: support. Hortonworks has more than 800 customers at the moment and provides them with 24/7 web and telephone support [4].

Cloudera, the first to lead the big data race, has the first-mover advantage, meaning more customers, most of whom might be reluctant to shift to a competitor. Cloudera has also raised about $1 billion, with a big chunk coming from Intel. It claimed more than $100 million in revenues in 2014 [5], way ahead of Hortonworks, and is expected to reach $300 million in 2016. Cloudera's revenue model is the same as that of Hortonworks: selling support.

MapR is quite different from the above two. It is dedicated to creating proprietary extensions to Hadoop while maintaining API compatibility, and at the same time provides extra products and capabilities that complement the Hadoop ecosystem. The strength of MapR is in its proprietary products like MapR FS, MapR DB and MapR Streams [6]. MapR FS is a distributed, reliable, high-performance, scalable, fully read/write POSIX filesystem. The Hadoop filesystem, HDFS, is nowhere close to MapR FS, which is one of the main reasons why customers prefer MapR. In 2014, MapR had about 700 paying customers [7]. MapR M5 had a price tag of $4000 per node per year [8], which means MapR might be making a lot more than Hortonworks or Cloudera.

Even if you take the combined market capitalization of all the above companies, it's nowhere close to the entire big data market. There are plenty of other players. Their expected big data revenues for 2013 were: Syncsort ($75 million), MarkLogic ($96 million), Opera Solutions ($124 million), Actian ($138 million), Pivotal ($300 million), PwC ($312 million), Accenture ($415 million), Palantir ($418 million), SAS ($480 million), Oracle ($491 million), Teradata ($518 million), SAP ($545 million), Dell ($652 million), HP ($869 million) and IBM ($1.37 billion) [9] [10].

More and more startups are turning up in big data, such as DataHero (raised $6.1 million in Series A funding), Tamr, Domo (valued at $2 billion), Arcadia Data, Looker, Kyvos Insights, Confluent (raised $24 million in Series B funding), AtScale and ThoughtSpot (raised $30 million in Series B funding).

The year 2016 might see more companies providing proprietary or open solutions to complement big data or even the Hadoop ecosystem. The open source nature of Hadoop may make it difficult to earn revenues, but there's absolutely no barrier for a new company or startup to enter the big data race. Big data is definitely going to see a proliferation of players and technologies in 2016!


Why is the size of int 4 bytes on most 64-bit systems?

I have written an extensive post detailing why the size of int is not fixed and what the C standard says about it. Have a look through it.

If you lack a basic understanding of the sizes of various data types, I suggest you take a look at: http://programmingbytes.tumblr.com/post/124121068726/what-is-sizeof-int

There are currently 5 main data models:

  • LP64
  • ILP64
  • LLP64
  • ILP32
  • LP32

Datatype (bits)   LP64   ILP64   LLP64   ILP32   LP32
char                 8       8       8       8      8
short               16      16      16      16     16
int                 32      64      32      32     16
long                64      64      32      32     32
long long            -       -      64       -      -
pointer             64      64      64      32     32

Depending on the data model used by your compiler, the size of primitive data types shall vary.

Have a read through: http://www.unix.org/version2/whatsnew/lp64_wp.html

What is the size of int in C?

Your C textbook might say sizeof(int) = 2, and another might say it's actually 4! Quite confusing, and yet both are wrong. If your textbook says it's 2, 4 or some other fixed value, it's either old or wrong. One way or the other, it's time to look for a new book!

The C standard mentions nothing about the exact size of int or other primitive data types. All it specifies is the minimum range of values each type must be able to hold; in the case of int, that is -32768 to 32767, which requires at least 16 bits or 2 bytes.

The keyword here is minimum. An int should be able to hold at least any number within -32768 to 32767, so it must be at least 2 bytes in size. But compilers are free to go beyond that minimum. On 32-bit and most 64-bit systems, you shall see sizeof(int) = 4.

Let's write a simple C program.

#include <stdio.h>

int main(void) {
    /* sizeof yields a size_t, so the matching format specifier is %zu */
    printf("Sizeof int = %zu\n", sizeof(int));
    printf("Sizeof int* = %zu\n", sizeof(int*));
    return 0;
}

On 16-bit systems, sizeof(int) is probably 2 bytes (16 bits). On 32-bit systems, it is probably 4 bytes (32 bits). On a 64-bit system you might expect 8 bytes (64 bits) because of the word size, but chances are sizeof(int) is still 4 bytes.

The interesting thing to note here is sizeof(int*), which usually matches the word size of your processor: 8 bytes on 64-bit systems and 4 bytes on 32-bit systems.

Another thing worth mentioning is sizeof(long) and sizeof(short). On your system, sizeof(int), sizeof(long) and sizeof(short) might even have the same value. All the C standard mentions is the minimum range of values:

short: -32768 to 32767

int: -32768 to 32767

long: -2147483648 to 2147483647

All the above are signed values. Since each of these ranges fits in 4 bytes, a compiler is allowed to make short, int and long all 4 bytes wide. In practice, most 64-bit Linux compilers (LP64) use 2, 4 and 8 bytes respectively.

A couple of things to remember:

  • A short int must not be larger than an int.
  • An int must not be larger than a long int.
  • A short int must be at least 16 bits long.
  • An int must be at least 16 bits long.
  • A long int must be at least 32 bits long.
  • A long long int must be at least 64 bits long.

If you need to understand why compilers choose these sizes, have a look through: http://www.unix.org/version2/whatsnew/lp64_wp.html

But sizeof(char) is fixed and is 1 byte. If you need bigger width char types, have a look at the wide character data type – wchar_t.

To fix this discrepancy, C99 introduced some fixed width integers like int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t. They are defined in <stdint.h>. Irrespective of the compiler or the machine you use, their size is fixed. Just don’t forget to turn on the C99 mode!

Why doesn’t C support function overloading?

I have heard this question over and over: why doesn't C support function overloading? Rather than quoting the C standard back to you, I thought I would take a practical approach.

Let’s write a simple C program.

#include <stdio.h>

void func() {
    printf("Hello World!\n");
}

int main() {
    func();
    return 0;
}

Compile it and get the binary. Those who are familiar with Linux will know the nm command, which lists the symbols in a binary. Let's run it on the binary; we get the output below.

0000000000601040 B __bss_start
0000000000601040 b completed.6335
0000000000601030 D __data_start
0000000000601030 W data_start
00000000004004a0 t deregister_tm_clones
0000000000400510 t __do_global_dtors_aux
0000000000600e08 t __do_global_dtors_aux_fini_array_entry
0000000000601038 D __dso_handle
0000000000600e18 d _DYNAMIC
0000000000601040 D _edata
0000000000601048 B _end
0000000000400604 T _fini
0000000000400530 t frame_dummy
0000000000600e00 t __frame_dummy_init_array_entry
0000000000400768 r __FRAME_END__
000000000040055d T func
0000000000601000 d _GLOBAL_OFFSET_TABLE_
                w __gmon_start__
0000000000400408 T _init
0000000000600e08 t __init_array_end
0000000000600e00 t __init_array_start
0000000000400610 R _IO_stdin_used
                w _ITM_deregisterTMCloneTable
                w _ITM_registerTMCloneTable
0000000000600e10 d __JCR_END__
0000000000600e10 d __JCR_LIST__
                w _Jv_RegisterClasses
0000000000400600 T __libc_csu_fini
0000000000400590 T __libc_csu_init
                U __libc_start_main@@GLIBC_2.2.5
000000000040056d T main
                U puts@@GLIBC_2.2.5
00000000004004d0 t register_tm_clones
0000000000400470 T _start
0000000000601040 D __TMC_END__

Please keep in mind that the output can vary with the compiler and its version.

If you look through the above output, you shall see the func function.

Now, let’s compile the same program using G++ compiler, and apply the nm command on the new binary.

0000000000601040 B __bss_start
0000000000601040 b completed.6335
0000000000601030 D __data_start
0000000000601030 W data_start
0000000000400560 t deregister_tm_clones
00000000004005d0 t __do_global_dtors_aux
0000000000600dd8 t __do_global_dtors_aux_fini_array_entry
0000000000601038 D __dso_handle
0000000000600de8 d _DYNAMIC
0000000000601040 D _edata
0000000000601048 B _end
00000000004006b4 T _fini
00000000004005f0 t frame_dummy
0000000000600dd0 t __frame_dummy_init_array_entry
0000000000400818 r __FRAME_END__
0000000000601000 d _GLOBAL_OFFSET_TABLE_
                w __gmon_start__
00000000004004d0 T _init
0000000000600dd8 t __init_array_end
0000000000600dd0 t __init_array_start
00000000004006c0 R _IO_stdin_used
                w _ITM_deregisterTMCloneTable
                w _ITM_registerTMCloneTable
0000000000600de0 d __JCR_END__
0000000000600de0 d __JCR_LIST__
                w _Jv_RegisterClasses
00000000004006b0 T __libc_csu_fini
0000000000400640 T __libc_csu_init
                U __libc_start_main@@GLIBC_2.2.5
000000000040062d T main
                U puts@@GLIBC_2.2.5
0000000000400590 t register_tm_clones
0000000000400530 T _start
0000000000601040 D __TMC_END__
000000000040061d T _Z4funcv

There is no func symbol! What really happened is that the G++ compiler changed the name of the func symbol to _Z4funcv. So when you overload functions in your C++ code, even though they have the same name in the source code, they have different symbol names in the binary. It's called name mangling. The mangled name encodes the parameter types, which is what keeps overloads apart; a C compiler emits the plain name, so two functions named func would collide at link time.

So if you write C++ code having a func() function, make it into a library, and try calling it from C, you shall get an "undefined reference to func" error. The reason is that in the C++ library the symbol is _Z4funcv instead of func. But you can solve this issue by putting the C++ code inside an extern "C" block.

#include <stdio.h>

extern "C" {
    void func() {
        printf("Hello World!\n");
    }
}

int main() {
    func();
    return 0;
}

Let's compile it using the G++ compiler and see the nm output.

0000000000601040 B __bss_start
0000000000601040 b completed.6335
0000000000601030 D __data_start
0000000000601030 W data_start
0000000000400560 t deregister_tm_clones
00000000004005d0 t __do_global_dtors_aux
0000000000600dd8 t __do_global_dtors_aux_fini_array_entry
0000000000601038 D __dso_handle
0000000000600de8 d _DYNAMIC
0000000000601040 D _edata
0000000000601048 B _end
00000000004006b4 T _fini
00000000004005f0 t frame_dummy
0000000000600dd0 t __frame_dummy_init_array_entry
0000000000400818 r __FRAME_END__
000000000040061d T func
0000000000601000 d _GLOBAL_OFFSET_TABLE_
                w __gmon_start__
00000000004004d0 T _init
0000000000600dd8 t __init_array_end
0000000000600dd0 t __init_array_start
00000000004006c0 R _IO_stdin_used
                w _ITM_deregisterTMCloneTable
                w _ITM_registerTMCloneTable
0000000000600de0 d __JCR_END__
0000000000600de0 d __JCR_LIST__
                w _Jv_RegisterClasses
00000000004006b0 T __libc_csu_fini
0000000000400640 T __libc_csu_init
                U __libc_start_main@@GLIBC_2.2.5
000000000040062d T main
                U puts@@GLIBC_2.2.5
0000000000400590 t register_tm_clones
0000000000400530 T _start
0000000000601040 D __TMC_END__

Yes! We now have the func symbol in the binary! We can now make it into a C++ library and shall have no problem calling it from C code.

Any function declared or defined within an extern "C" block has C linkage, meaning that you can't have overloaded functions within an extern "C" block (hope you can figure out why yourselves!)

If you look through some C++ libraries, you shall see code like the below.

#ifdef __cplusplus
extern "C" {
#endif

// All functions declared here.

#ifdef __cplusplus
}
#endif

The __cplusplus macro is defined by every C++ compiler, and hence the extern "C" block becomes visible only when the header is compiled using a C++ compiler.

The interesting fact is that the C++ standard doesn't mention how name mangling should be done or what algorithm to follow. It's left entirely up to the compiler designers.

Why do local variables in C have garbage values by default?

Local variables are those variables that are created within a function. Unless they are static, they go out of scope and are destroyed when the function finishes its execution. All local variables are stored on the stack. I hope you know what a stack is. Before a function begins execution, details like function arguments, the return address, local variables, etc. are pushed onto the stack. When the function finishes execution, they are popped off the stack.

Consider the below program.

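#include <stdio.h>

void f2(void) {
    int a = 10;          /* never read; it only occupies a stack slot */
}

void f3(void) {
    int b;               /* deliberately left uninitialized */
    printf("%d\n", b);   /* prints whatever already sits in b's slot */
}

void f1(void) {
    f2();
    f3();
}

int main(void) {
    f1();
    return 0;
}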

If you run the above program, you might get 10 as output (it might vary depending on your stack layout and compiler optimizations, since reading an uninitialized variable is undefined behavior). At first look, it looks like a garbage value, but it's actually the value of a in function f2(). You can change the value of a, and the output shall still be the value of a. That should be enough to tell you that it's not a "garbage value" like most people would call it.

Let's go deeper. First f1 is called, and its stack frame is pushed onto the stack. f1 calls f2, and f2's frame is pushed onto the stack. The stack now contains f2 and f1. After f2 finishes its execution, it's popped off the stack. Now f1 calls f3, and f3 is pushed onto the stack. The stack now contains f3 and f1. But f3 is stored in the exact location where f2 was stored earlier. Since the memory is not cleared before pushing f3, we get what we call a "garbage value", which is actually the data used by f2.

Segmentation Fault – Why does it occur?

#include <stdio.h>

int main() {
    int *a = NULL;
    printf("Value of a = %d\n", *a);
    return 0;
}

Those who are proficient in C can easily notice that the above C code shall result in a Segmentation Fault. And the reason? Dereferencing a NULL pointer.

For the less savvy person, a is a pointer with the value NULL. A pointer stores the address of some variable, which means that the value of a pointer is the address of some variable. Here a is storing NULL, which means it's pointing to nothing. In the next line, we use the * or dereferencing operator. It tries to get the value of the variable to which a is pointing. For example, if a has the address of b, then *a gives the value of b. But in this example, a is storing NULL, which on most operating systems means the address 0x0. Since we can't access that address, we can't get the value from that address. Segmentation Fault.

Unlike Java, which throws a NullPointerException, in C or C++ a Segmentation Fault is not easy to track down until you learn how to use a debugger like GDB. You won't get any helpful message pointing to the source of the problem.

A segmentation fault occurs due to an invalid access to a memory location. In other words, you are trying to access some memory location that you are not allowed to access. Another example: you create an array A[10]. If you try accessing the A[10] or A[11] element, it can result in a segmentation fault. The keyword here is "can". It's not necessary that a segmentation fault should occur, since C does no bounds checking. So if your program is allowed to access the memory address of A[10], it shall give you whatever value is stored at that address.

Segmentation Fault cannot be avoided. Every single C programmer shall be greeted with a Segmentation Fault at one point or another in their career. The only way forward is to understand why it happens. It may be as simple as missing out an '&' in a scanf() statement. Learn to use GDB or some other debugging tool. In all probability, the error shall be so simple that you shall understand it once you see the backtrace.

Why is it unsafe to use gets() in C?


Let's see the syntax of the gets() function.

char *gets(char *s);

Consider that s is a character array of size 10, i.e. s[10]. It can hold up to 9 characters plus the terminating null character.

If the user inputs a string with length less than 10, it can be stored in s. What happens when the user inputs a longer string, say 11 or 20 or 100 characters? The 11th character shall go to s[10] (array indices start with 0, i.e. our original array is from s[0] to s[9]). Obviously this is outside your array, but C does no bounds checking. So chances are the write succeeds, or in other words, overwrites the stack. A good compiler should warn about the use of gets(). But as humans, we all have a tendency to ignore the warnings that the compiler throws at us.

It shall keep overwriting the stack until gets() reaches an address to which you have no access. What happens then? On Linux, it's probably the dreaded Segmentation Fault, which crashes your program. But why not use this vulnerability for something better?

That's exactly what the Morris worm did. It exploited the finger daemon, in.fingerd, which used gets() to read into a 512-byte buffer. That let the worm overwrite the stack. But rather than overwriting the stack with random data, why not overwrite it with something specific?

The Morris worm overwrote the stack to modify the return address of the current activation record. So when the currently executing function returned, it gave control to some instructions on the stack (another modification!) that call execv(), which replaces the process with a shell. So now you have access to a shell on a remote machine, using which you can make a copy of the worm and transfer it to another machine! Voila!