Think about it

... Or not


Let’s Hadoop now

I’ve been playing with Hadoop for a few weeks now. Hadoop is an open source Apache project inspired by Google’s MapReduce and Google File System. It lets your application work with thousands of nodes and petabytes of data. It’s written in Java and runs on commodity hardware. I’m not going to write a tutorial, but I will tell you how I got started.

Here is a list of the few tutorials that I used:

Basically I followed Michael G. Noll’s tutorial, but I used the Cloudera packages for Debian/Ubuntu. A few things you need to know:

  • When you use the Cloudera packages, they automatically create the user hdfs and the group hadoop.
  • Hadoop starts a lot of servers, such as the namenode, and each one binds to the IP that its hostname resolves to. This means that if your namenode is called “namenode01” and that name is assigned to 127.0.0.1, the server that Java spawns will listen exclusively on that IP. It’s imperative to assign this hostname to the external IP.
  • The namenode is the single point of failure of HDFS, but one thing I didn’t get until I read the full Hadoop documentation is that the namenode holds the one file that maps every file block on the cluster. If you lose this file, it doesn’t matter how many nodes you have or how redundant they are: you will lose everything.
  • The config file /etc/hadoop/conf/masters doesn’t designate the master. It actually designates the secondary namenode, which is neither a slave of the namenode nor a backup for it. The master is the local machine you use to start Hadoop.
  • HDFS will use every byte available on the system. In the config file hdfs-site.xml you need to define “dfs.datanode.du.reserved” to reserve some space for the system.
  • Certain MapReduce jobs may die because they run out of memory. You can, and should, set appropriate Java options with “mapred.child.java.opts”.
  • You can use bin/start-dfs.sh to start HDFS and bin/start-mapred.sh to start the MapReduce service. (HDFS should start first, then MapReduce, and MapReduce should stop first, then HDFS.) You can also use bin/start-all.sh and bin/stop-all.sh.

The main problem I ran into was adding some redundancy to the namenode. On a small cluster, the namenode runs on the same server as the secondary namenode and the first datanode. Also, the namenode and secondary namenode consume a lot of resources, especially memory.

The namenode distributes data across your cluster and also takes care of the redundancy of that data. If one datanode goes down and you have the redundancy set to 2, the namenode will copy the affected data from the datanodes that still hold it to another available datanode to keep that redundancy. The namenode keeps a record of every action in two files, called fsImage and edits. fsImage is the current file namespace and edits contains every modification to that namespace. For performance reasons the namenode merges those two files only when it starts. This means that if you never restart your namenode, the edits file will grow large and at the next restart it will take some time to merge the two files. That’s why the secondary namenode exists: every hour (this is configurable) it contacts the namenode, retrieves those two files, merges them and sends the result back to the namenode. Merging those two files is resource intensive on a large cluster, which is why you should run this service on a different server than the namenode. The secondary namenode doesn’t back up anything, and it will not replace the namenode if the namenode fails.
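To make the fsImage/edits relationship concrete, here is a minimal toy sketch of a checkpoint. It is not Hadoop code: the dict-based snapshot, the tuple-based edit log and the function name `apply_edits` are all assumptions made up for illustration.

```python
# Toy model of a namenode checkpoint: not Hadoop code, just the idea.
# fsimage is modelled as a dict {path: [block ids]} and edits as a list
# of (operation, path, blocks) tuples. All names here are hypothetical.

def apply_edits(fsimage, edits):
    """Replay the edit log on top of a snapshot and return a new snapshot."""
    merged = dict(fsimage)  # never mutate the previous checkpoint
    for op, path, blocks in edits:
        if op == "add":
            merged[path] = list(blocks)
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError("unknown operation: %s" % op)
    return merged


fsimage = {"/logs/a.log": [1, 2], "/logs/b.log": [3]}
edits = [
    ("add", "/logs/c.log", [4, 5]),
    ("delete", "/logs/a.log", None),
]

# Roughly what a checkpoint does: merge, then start over with an
# empty edits file.
new_fsimage = apply_edits(fsimage, edits)
edits = []
print(sorted(new_fsimage))
```

The longer the edit list, the longer the replay takes, which is the reason the merge is done periodically instead of only at restart.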

You can tell the namenode to save those two files to several locations (another disk or a remote disk) by defining “dfs.name.dir” in hdfs-site.xml like so:

	<property>
		<name>dfs.name.dir</name>
		<value>/data/dfs/name/,/disk2/backup/name,/mnt/nfs/backup/name</value>
	</property>

In my case I configured my secondary namenode on a server with the same specification as the namenode. In its hdfs-site.xml I defined “dfs.name.dir” to use the remote backup, and I configured the masters/slaves files. If the namenode goes down, I still have to manually turn the secondary namenode into a namenode, but in that case everything is ready (Caution: you will need to move the IP of the namenode to the secondary namenode).

Finally, as a quick band-aid, I found out by reading the documentation that you can download those two files from the namenode via its web service. You can get fsImage with “http://NAMENODE:50070/getimage?getimage=1” and edits with “http://NAMENODE:50070/getimage?getedit=1”. I wrote a bash script to retrieve those files and back them up on each datanode at different times of the day, so at least you won’t lose everything.

I hope this will help you start setting up your Hadoop cluster. One last thing you could do is read the Hadoop documentation.
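The original bash script is not shown here, so the following is only a sketch of the same idea in Python. The two URLs are the ones quoted above and may differ between Hadoop versions; the function names, the timestamped file naming and the destination directory are assumptions.

```python
import os
import time
import urllib.request


def image_urls(namenode, port=50070):
    """Build the two download URLs quoted in the post for a given namenode."""
    base = "http://%s:%d/getimage" % (namenode, port)
    return {"fsimage": base + "?getimage=1", "edits": base + "?getedit=1"}


def backup(namenode, dest_dir, fetch=None):
    """Download fsimage and edits into dest_dir with a timestamp suffix."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET against the namenode.
        fetch = lambda url: urllib.request.urlopen(url, timeout=60).read()
    os.makedirs(dest_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    saved = []
    for name, url in sorted(image_urls(namenode).items()):
        path = os.path.join(dest_dir, "%s.%s" % (name, stamp))
        with open(path, "wb") as out:
            out.write(fetch(url))
        saved.append(path)
    return saved


# Hypothetical usage, e.g. from cron on each datanode:
# backup("NAMENODE", "/data/backup/name")
```

The `fetch` parameter is only there so the download step can be swapped out when trying the function without a running cluster.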


4 Basic features Netflix should add asap

Of all the online streaming services, Netflix is the one I use the most, and at the same time it’s the one that frustrates me the most. Here is my list of features that they should add asap:

A real search engine:

Right now you can do a basic search that returns movies, but it doesn’t return real results… What do I mean? Well, if you search for “robots” it will return a list of movies with “robots” in the title, and for each of those movies that is not available it will give you two similar titles that have little to nothing to do with what you searched for. There is no way to search by tag, by category or by date, and no way to order by rating. Searching by date is really important when I want to look for new content.

Watch preview/trailer:

I actually had to browse around until I found one title that had a trailer. Since they store every movie, why don’t they store a tiny piece of each? Or just a YouTube link to the actual movie trailer… Or why not partner with IMDB? I mean, right now that’s what I do manually anyway.

Order comments:

Let’s say I found a movie or a documentary. Netflix has lots of comments. You can rate the movie and rate the comments, but there is no way to order the comments by rating or by helpfulness. Amazon is doing a great job with that!

New arrival ?:

On my home page there are two “new” lists, “Recently Added” and “New Released”. As a developer I can see that both are redirected to the same script called “newReleases” but with different parameters. One page just gets a list ordered by date added (I guess) and the second one a list ordered by release date (I guess, because that doesn’t seem to be the case). Neither list has anything to do with what normal people consider new content, or, if they are supposed to show new content, those lists are buggy. In “Recently Added” I have a movie from 1993 and in “New Released” I have a movie from 2009.

Those are not big features from a UI standpoint, a product standpoint or a development standpoint. The search could use a Solr index with a cache layer. The trailer part could easily be plugged into IMDB. Ordering comments, however, can be quite complicated depending on how they are implemented. The last feature is the easiest to implement, since I think they are just ordering on the wrong field.


PHP, Node.JS, Mysql and Mongo

I’ve spent a few weeks comparing the performance of Apache/PHP vs Node.JS and MySQL vs MongoDB.

All tests were run on my local computer with ApacheBench:

- Intel Core 2 Duo E8500 @ 3.16GHz

- 8 GB of RAM

- Ubuntu 11.10 Desktop 64-bit

- Apache 2.2.20

- PHP 5.3.6

- Node.JS 0.6.6

- MongoDB 1.8.2

- MySQL 5.1.58

The first graph, comparing PHP and Node.JS, is based on a simple application that returns “Hello world”. You can see that Node.JS performs a bit better than Apache/PHP, which even fails to give any result in the high-concurrency tests (PS: a lot of requests actually succeeded, but too many of them failed to get an accurate reading). This test is just a brute-force demonstration of what Node.JS is capable of.

The second graph shows the performance of PHP/MySQL vs PHP/MongoDB vs Node.JS/MySQL vs Node.JS/MongoDB. This is a basic application that just retrieves 1 random row out of a table of 100,000 rows and returns “Good” on success. You can quickly see that Node.JS’s performance degrades when it has to connect to the database on every request, but if we keep the connection open, Node.JS performs better. I will have to do the same test with PHP and a persistent connection.

Obviously those Mongo vs MySQL tests were made for a specific use case. I didn’t test performance when retrieving more than one row, ordering a set of rows, querying a field without an index, or even scanning 100 million rows.

I did test writing to MongoDB compared to writing to MySQL. For example, writing 10,000 rows to MongoDB took 0.6 seconds, where the same code with MySQL took 7 minutes (I was inserting one row at a time). MongoDB ends up being faster here because it’s not actually writing to the disk right away.

Those technologies should be used accordingly. MongoDB is really fast at writing, so you could use it for storing user activity on a social network or logging page views. Node.JS is really fast at handling a request; some people even use Node.JS as a load balancer.
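The benchmark code is not reproduced here, so the sketch below only illustrates the “one row at a time” effect, using Python’s built-in sqlite3 module as a stand-in for MySQL. That swap is an assumption, as is the guess that a commit after every row explains much of the 7 minutes; the table and column names are made up.

```python
import sqlite3
import time


def make_db(path=":memory:"):
    # Hypothetical table; pass a file path instead of ":memory:" to see
    # the cost of syncing every commit to disk.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE hits (user_id INTEGER, page TEXT)")
    return conn


def insert_one_by_one(conn, rows):
    # One transaction per row: every commit waits for the storage engine.
    for row in rows:
        conn.execute("INSERT INTO hits VALUES (?, ?)", row)
        conn.commit()


def insert_batched(conn, rows):
    # A single transaction around all rows: one commit at the end.
    with conn:
        conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)


rows = [(i, "/page/%d" % i) for i in range(10000)]
for insert in (insert_one_by_one, insert_batched):
    conn = make_db()
    start = time.perf_counter()
    insert(conn, rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM hits").fetchone()[0]
    print("%s: %d rows in %.3fs" % (insert.__name__, count, elapsed))
    conn.close()
```

With an in-memory database the two timings will be close; the gap only shows up once each commit has to reach a real disk, which is the comparison that matters for the numbers above.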

Source code:


Hollywood needs to grow up

First of all, I’m going to go quickly through all the current solutions:

Cable:

Well, everybody knows that you pay $100 for maybe 5 channels that you would like to watch, and they package them with lots of garbage that you would never watch.

Digital TV:

Yes, you can still plug an antenna into recent TVs and get a digital signal. I live in Hollywood and I have only 2 working channels. I don’t know how to explain that, but it’s pretty bad.

Netflix:

I use it when I don’t know what to watch but I’m in the mood to watch something. I just turn it on and browse until I find something. Sometimes, if I know what I want, I’ll check to see if it’s available, but most of the time it’s not, at least not in streaming mode.

Hulu:

This one is a pretty good joke. You can either watch TV shows for free with lots of advertising, right after they are broadcast on TV, or you can pay a monthly fee and watch the same TV shows with the same amount of advertising. The only difference is that this time you can use your TV instead of your computer screen.

DVDs:

If you don’t care about quality, this is still a pretty good way to watch a movie. You need a DVD player which you can get for really cheap nowadays.

Bluray:

You need an up-to-date Bluray player, which is pretty hard to get if you didn’t buy a PS3 (PS: just buy a PS3). I had a player for 3 days until I couldn’t update it to watch a new movie, so I returned it. I could also use my computer, which has an HDMI output, with a Full HD cable (whatever that means…) and my Full HD TV. In theory it should work; in practice it’s a nightmare. I ended up buying software that removes the protection on the Bluray so I can actually watch it on my TV.

Theaters:

The best choice in my opinion, but I won’t pay $12 for just any kind of movie, and you always end up spending over $25.

iTunes:

Fairly cheap TV shows and movies, but stuck with DRM and Apple’s codec, which means that if I buy media on iTunes, I can’t watch it on my TV unless I have an Apple TV or I’m willing to plug my computer into my TV.

Amazon:

Same thing as iTunes: you need a device that can connect to it.

Ultraviolet:

UV is supposed to be the answer to the crisis, and I think the idea is pretty good. It’s really close to what I have in mind. However, it’s limited to newly acquired media and there are not a lot of devices that can work with it (actually just the Flixster app).

Illegal downloads/Streamings:

The Antichrist of Hollywood. There was a survey not so long ago saying that everybody under 25 has most likely downloaded at least one illegal movie, MP3 or TV show.

But let’s be honest, this is the most convenient way to consume media. It’s on demand, fairly cheap (you still have to pay for Internet) and has a large catalog. I’m not saying you should download, I’m just saying it exists and it’s pretty hard to ignore as a “solution”. Personally I hate streaming; I don’t even know how people consider it a viable solution.

The problem is that all the legal solutions are really inconvenient. We are supposed to be in the digital age, everything is supposed to be connected, but for some reason Hollywood doesn’t want to grow up. Today there is only one box under my TV: a Western Digital player which has access to Hulu, Netflix, my local network and even YouTube.

This box, however, doesn’t read DVDs or Blurays, so my whole collection of DVDs/Blurays is pretty useless. Right now I have to rip/encode those DVDs one by one so I can watch them on my TV, and this is still illegal. Indeed, in the US and most developed countries it is illegal to bypass a protection system such as the one installed on DVDs and Blurays. The thing I don’t understand is that at the beginning of the age of digital music it wasn’t illegal to rip a CD; software like iTunes even did it for you. Why does it need to be different with videos? You can even rip your old records, but it’s illegal to do so with a DVD.

I dream of a day when I can just type/scan the bar-codes of my DVD collection and make the movies available for download or streaming to any device that I have. I literally want iCloud for videos! iCloud is a new service from Apple that allows you to sync your local music library to Apple’s cloud storage. You can then access all your music from Apple devices (*the only inconvenience). Amazon and Google have similar products.

At the same time, Hollywood needs a new way to survey its viewers. There should be a website where I can go and officially vote for a TV show, or why not use a donation system like flattr.com, or the good old text messaging system to vote. Anything would be better than just surveying 300 people per city.


The Facebook effect

I’m sure that you all know how Facebook improves its product:

  • “Find” Idea
  • Implement
  • Deploy
  • “Check user reaction”
  • Possible rollback (but most likely not)

Facebook has been using this “process” since the beginning. We all know about it because they change their design, or some privacy control, or totally change one part of the product (Messages), or, not so long ago, changed the chat and you hated it. This process allows quick implementation of potentially great ideas, and Facebook is so big and still growing that they can afford to lose a few people, who will probably come back anyway since Facebook is/was a unique product.

So in some sense you can say that they care more about the product than about their users. Only in some sense, because the original point of this shortcut to production is to add value to the product.

The Facebook effect is when a user says something like “Wow, this new feature/design is really bad, please bring back the old one” and an hour, a few days or a week later the same user has totally forgotten why he wasn’t happy. I’m not quite sure if this is because the change was actually a genius idea, luck, or just the user getting tired of complaining… That is not the point. The point is that Facebook did it for a long time and got away with it.

The problem is that other companies are now envious. They think they can do the same thing without doing any tests, or checking if it’s a useful feature, or checking statistics, or even listening to their users after deployment. They think that they can copy Facebook’s process and that users will definitely react but get over it afterwards (this is actually an argument that I have heard). There is a saying like “You don’t have to create what the user wants, you have to create something that the user doesn’t know he wants”. This is quite dangerous when you already have a business; just look at Digg (Reddit on Digg) or the new design of Gizmodo (lost 15%).

Yes, there is a possibility that this could work, but there are more examples of it failing. Facebook is an exception. It’s what helped them grow bigger than MySpace, but now that there are new competitors like Google+, they cannot afford it except for tiny tweaks. Moreover, I’m not sure that Facebook decides on those features without at least some numbers backing them up. Their latest feature, Timeline, was in beta testing for months, and now that it’s out you actually have to opt in.

Anyway, there are different ways of improving your products without alienating users, such as A/B testing and statistics. You don’t have to bet your company each time you want to improve something.
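For readers who have not run an A/B test, here is a minimal sketch of the two pieces involved: assigning each user to a variant in a stable way, and comparing conversion rates afterwards. The function names, the experiment name and the variant labels are all hypothetical, and a real test would also need a significance check before drawing conclusions.

```python
import hashlib


def variant_for(user_id, experiment, variants=("old_design", "new_design")):
    """Deterministically map a user to one variant of an experiment."""
    key = ("%s:%s" % (experiment, user_id)).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The same user always lands in the same bucket for a given experiment.
    return variants[int(digest, 16) % len(variants)]


def conversion_rates(events):
    """events: iterable of (variant, converted) pairs -> {variant: rate}."""
    totals, wins = {}, {}
    for variant, converted in events:
        totals[variant] = totals.get(variant, 0) + 1
        wins[variant] = wins.get(variant, 0) + (1 if converted else 0)
    return {v: wins[v] / totals[v] for v in totals}


# Hypothetical usage: show the new chat to the users in "new_design" only,
# log whether each user kept using it, then compare the two rates.
print(variant_for(42, "chat-redesign"))
```

Because only part of the audience sees the change, a bad idea costs a fraction of the users instead of all of them, and the rollback decision is based on numbers.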


A new Mobile age in France

A few months ago, after a trip to France, I wrote US lags in broadband services. I explained how the French government stimulated the market by poking the competition with a stick or a carrot (whichever picture you prefer).

I talked about free.fr, one of the top 3 French ISPs, and how they remodeled the industry in 2002, and also about how they were working on their own phone network, which was supposed to launch at the beginning of 2012.

Well, that day was today, Tuesday the 10th: Free unveiled their new offer to the rest of the world. Actually just to France, since I didn’t see this news on any other tech blogs, which are probably busy with CES 2012.

I watched the press conference (in French) where founder Xavier Niel explained why they had to create a new phone operator and how his competitors were abusing their position. He listed a few points against them, such as:

  • Contracts of at least one year, and in general two years (even when not buying a new phone)
  • Long and really complex contracts (plenty of hidden costs)
  • Too many offers, and too complex
  • Fees to call European countries
  • Expensive text/media messages
  • Limited access to the Internet (just mail and web) and limited in quantity (1 GB)

A plan with unlimited nationwide calls, unlimited text messages (extra cost for media messages) and 1 GB of limited Internet costs between 49.99 and 85 euros in France today.

Free’s solutions to the previous problems are:

  • Month-to-month contract
  • One single-page contract
  • One unique plan
  • Unlimited calls to 40 destinations (including USA/Canada)
  • Unlimited text/media messages
  • Unlimited access to the Internet (VoIP, P2P, newsgroups, …) up to 3 GB

And all this for only 19.99 euros ($28), half the price of the cheapest competitor and a quarter of the price of the two major competitors. If you already have Free as your Internet provider, the price of the plan drops to 15.99 euros ($22).

A few months ago the French government decided that there was a need for a cheap plan for the lower class, because they saw the phone as a necessity. After talking to the 3 operators they came out with a plan of 50 minutes and 50 text messages for 10 euros ($14) per month. Well, Free decided that this was “bullshit” (literally) and now offers 1 hour and 60 text messages for 2 euros ($2.8) per month.

The only inconvenience with the Free offer, according to the first critics, is that they do not subsidize phones. You will have to buy the phone you want at full price, which for an iPhone 4S is around 629 euros. However, you have the option to pay over 12, 24 or 36 months. Even with this, Free is still way cheaper than its competitors.

Right after this event, competitors rushed to Twitter and other social networks to announce that they will have new plans really soon, but this time the customer will be the one to decide what the market price should really be.

I really wish the US would wake up and shake up its own market.


US lags in broadband services

I recently traveled to France for two weeks, and while there I also made a visit to my hometown in Reunion Island, which is located 6000 miles from Paris in the middle of the Indian Ocean.

I remembered that when I was a teenager back in 1998, I had a monthly 10-hour limit on dial-up Internet. Once you account for the fact that the modem was a US Robotics 33.6K modem, you’ll quickly understand how little 10 hours a month is. This was an era when you could find free AOL and CompuServe CDs for “free internet”, which wasn’t actually free for me as I still had to pay a phone fee to access those servers from where I was. Back then we paid 35 euros (around $43 at the time) for the 10 hours of Internet (the price was actually in francs). However, since I generally went over my plan, we ended up paying anywhere from $20 to $40 more each month. My family was poor, but my parents somehow understood the importance of the Internet. Even at Christmas I used to ask for additional Internet hours instead of a normal gift.

I remember using WinMX to download “stuff” (this was the beginning of MP3s/DivX) and I was so envious of those Americans with T1 and T5 lines that were all over WinMX!

A few years later a couple of new technologies were introduced, namely ISDN and DSL. These new technologies promised more bandwidth, but naturally, since I lived on an island in the middle of nowhere, bandwidth still had to be rationed as with modems. At the time Reunion was connected to the World Wide Web by satellite and we had a single Internet provider, Wanadoo (the old name of Orange and France Telecom). The price for 128K DSL was 56 euros ($62) and we still had time and bandwidth limits. ISDN was just as expensive.

In 2002, Reunion Island was finally connected to the SAFE (South Africa Far East cable) which is an optical fiber submarine communications cable linking Melkbosstrand, South Africa to Penang, Malaysia. I now had 35 hours of 128K DSL Internet and I paid something around 65 euros ($85).

Sometime in  2004, I think, I finally had 512K of unlimited Internet for the same price as the 128K DSL. In 2004 I left Reunion to go to Bordeaux, France and when I arrived I was blown away by the Internet market, which was booming!

In France there were many more Internet providers such as Numericable, Orange, Free, Neuf Cegetel to name a few, who all offered different technologies such as Satellite, Cable and ADSL.

At that time, in 2004, Free.fr, another Internet provider, was a new competitor with really aggressive prices! To begin with, they offered free dial-up Internet to anyone. Free.fr then started offering unlimited ADSL in addition to unlimited phone time (to landlines) for 29.99 euros ($40), which was a lot cheaper than other providers. Other competitors started to follow them and a law was introduced against Orange. It should be noted that, for a short period, Orange was forbidden to compete, as they pretty much had a monopoly on the market and the French government wanted to motivate other competitors to jump into the race. I was livid that this law had passed because, as an Orange customer, I knew I wouldn’t see my monthly bill go down, and thus I was eventually forced to switch to another provider, Free.

Today in France you can get unlimited Internet with 28 Mbit/s ADSL, unlimited free calls to more than 200 destinations (landlines only, though mobile calls are cheap), 128 TV channels, and a personal Internet box that lets you record TV and offers games and other services such as personal cloud storage, all for only 29.99 euros ($45). I should note that the law cited above regarding Orange has now finally, and I might add thankfully, been repealed.

There are now only 3 major competitors in the market, but there are also a number of smaller providers that rent networks from the bigger providers.

Going back to my homeland, Reunion Island, I cannot overstate how impressed I was with the telecommunication market and the plethora of services they now provide! There are multiple providers (at least 5) who offer the same services as the companies in France for almost the same price (taxes are higher in Reunion Island), and all that in only 5 years… There is even free Wifi, which covers part of the island.

In 2008 I moved to San Francisco and couldn’t believe how far behind the US lagged in telecommunications. I had AT&T DSL, which is the worst Internet line I have ever used, and the speed wasn’t anything special.

Today I live in Los Angeles, the capital of the media, and I even work for a leading company which creates, publishes, and distributes media, Break Media. At home I have Time Warner Internet Cable with Power Boost (20Mbps) for $57, which only covers Internet. I don’t get phone, TV or anything else for that matter. On top of that, my choice of Internet providers is very limited: I can choose AT&T with U-Verse for $75 (24Mbps, still no phone or TV included, and before tax) or Time Warner.

Additionally, even if you want to compare the mobile (cellphone) markets, France and Reunion Island come out ahead. There I can get customized plans which are fitted to my particular needs. For example, I don’t call anyone, so I don’t need 500 minutes per month, but I use the Internet and text a lot. There is no plan for this in the US today (Virgin Mobile does this, but not on every phone). The cheapest plan I found was with T-Mobile at $70 (for 500 minutes, unlimited text, unlimited Internet and tax included), and it throttles the bandwidth once you’ve reached a threshold.

With Verizon or AT&T you can get 450 Minutes, unlimited Text,  2GB of Internet for $90 (no tax included).

In France, with Orange, I can get unlimited text, unlimited Internet and 60 to 300 minutes for 25 euros to 56 euros ($36 to $80 tax included) respectively. What’s more, most plans even include mobile TV. Other providers have even better offers, as Orange is generally more expensive. As I stated earlier, there are only 3 major providers (Orange, SFR, Bouygues) and many MVNOs (mobile virtual network operators), but Free.fr is also going to begin providing mobile services next year and is expected to undercut the market just as they did with Internet services.

So now, sitting in Los Angeles, I’m truly envious of France, and I’m even envious of my island… It’s really sad knowing that the place I did so much to escape now has better telecommunication services at a better price than the US.

What happened here? Why did the US miss the boom?

I personally have no idea, since I wasn’t here during the past 7 years. Maybe the country is too big? The infrastructure too old?

But why did my island succeed where America did not? There are practically no high-tech jobs over there in Reunion Island, which is one of the main reasons why I left (the major sectors are tourism and sugar). GDP per capita of my island in 2007 was $23,501, and GDP per capita of the US in 2007 was $43,170. There is less of a need for Internet in Reunion Island than there is here in the United States.

France doesn’t offer high speed Internet everywhere, though, as there are still some areas with 1024k ADSL, but they are actively working on it. The French government is pushing providers to fix any coverage gaps, helping them with investments and fining them if they do not comply.

I’m not saying the US government should do something like the French government; after all, that is against free markets. I’m actually looking at investors and entrepreneurs. There are a lot of investors who fund new 2.0 companies (Color, Path), but I’m thinking that maybe they should first help lay down a new network infrastructure so they are on par with their European counterparts. It may be harder to deploy a new network infrastructure in the US, but the benefits are clearly worth it. Consumer choice in telecommunication services helps the market by forcing providers to innovate continually, and it benefits consumers, who can choose the providers that best meet their needs. It’s a win-win for everyone involved.

Tags: usa internet economy reunion att orange nostalgia


Gearman and Cygwin

Gearman is designed to distribute tasks to multiple servers. The first example that comes to my mind is an online video encoder. There will be two servers, a web server and an encoder. The user uploads a video to the web server, which sends an encoding task to the other server. The task runs asynchronously and the web server is able to retrieve information about this task.

In this post I will show you how to install Gearman on Windows with Cygwin. There are multiple tutorials on how to use Gearman.

I spent a few days trying, unsuccessfully, to compile the latest Gearman version, 0.24. However, version 0.14 works like a charm.
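The video encoder scenario can be sketched with nothing but Python’s standard library. This is not the Gearman API: an in-process queue and a thread stand in for the job server and the encoder machine, and the names `submit`, `encoder_worker` and `status`, as well as the upload path, are made up for the illustration.

```python
import queue
import threading

jobs = queue.Queue()
status = {}  # job id -> "queued" | "running" | "done"


def submit(job_id, video_path):
    """What the web server does: hand the task over and return at once."""
    status[job_id] = "queued"
    jobs.put((job_id, video_path))


def encoder_worker():
    """What the encoder does: pull tasks and process them one by one."""
    while True:
        item = jobs.get()
        if item is None:  # sentinel used to stop the worker
            jobs.task_done()
            break
        job_id, video_path = item
        status[job_id] = "running"
        # A real worker would run the encoder on video_path here.
        status[job_id] = "done"
        jobs.task_done()


worker = threading.Thread(target=encoder_worker, daemon=True)
worker.start()
submit("job-1", "/uploads/cat.avi")
jobs.join()             # wait until the queued task has been processed
print(status["job-1"])  # prints "done"
jobs.put(None)
worker.join()
```

With Gearman the queue lives in a separate daemon and the worker runs on another server, but the split between submitting a task and checking on it later is the same.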

Installing Cygwin

First you need to install Cygwin. While doing that you will be prompted to add packages, and you need to add these: (A tutorial on how to use Cygwin)

  • gcc
  • make
  • libuuid1-devel
  • libiconv
  • wget

Gearman requires libevent, an event notification library which you cannot get through Cygwin, so you will need to compile it.

Getting libevent

First, download the latest libevent release (2.0.14 for me).

Unpack the package somewhere.

Open a Cygwin shell and cd to the unpacked library.

Run the following commands:

  • ./configure (configure might report errors about required packages that you still need to install with Cygwin)
  • make
  • make install

Installing Gearman

First, download the Gearman 0.14 package, which you can find here https://launchpad.net/gearmand/+download

Unpack it somewhere.

Cd to the unpacked directory in the Cygwin shell.

Run the following commands:

  • ./configure (configure might report errors about required packages that you still need to install with Cygwin)
  • make
  • make install

Now Gearman should be ready to use! You can run gearman.exe --help for a list of commands.

Please note that there is no PECL extension for Gearman on Windows, so you might want to use the PEAR package instead (pear install Net_Gearman-alpha).

PS: This is obviously not meant for production servers.

Tags: gearman php windows cygwin developer