Normalization, Standardization and Rescaling

This is a copy/paste of an interesting FAQ. I really loved reading the article and thought of reposting it in a better-formatted manner for readability. Courtesy: First, some definitions. Rescaling: “Rescaling” a vector means to add or subtract a constant and then multiply or divide by a constant, as you would do to change the units of measurement of the data, for example to convert a temperature from Celsius to Fahrenheit.
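The definition of rescaling above can be sketched in a few lines of Python. This is only an illustration: the function names `rescale` and `celsius_to_fahrenheit` are made up for this example, and the shift of 160/9 is simply the usual formula F = C × 9/5 + 32 rewritten in the "add first, then multiply" order used in the definition.

```python
def rescale(values, shift=0.0, scale=1.0):
    """Rescale a vector: add a constant to every element, then multiply by a constant."""
    return [(v + shift) * scale for v in values]


def celsius_to_fahrenheit(celsius):
    # F = C * 9/5 + 32 is the same as (C + 160/9) * 9/5,
    # i.e. "add a constant, then multiply by a constant".
    return rescale(celsius, shift=160.0 / 9.0, scale=9.0 / 5.0)


def fahrenheit_to_celsius(fahrenheit):
    # C = (F - 32) * 5/9: subtract a constant, then multiply by a constant.
    return rescale(fahrenheit, shift=-32.0, scale=5.0 / 9.0)


if __name__ == "__main__":
    temps_c = [0.0, 37.0, 100.0]
    temps_f = celsius_to_fahrenheit(temps_c)
    # Rounded for display, since the arithmetic is floating point.
    print([round(t, 2) for t in temps_f])
    print([round(t, 2) for t in fahrenheit_to_celsius(temps_f)])
```

Converting to Fahrenheit and back returns the original values up to floating-point rounding, which is what one would expect from a pure change of units.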

Zinda ho tum !

One of the Bollywood movies I have always loved watching is ZNMD. Here is a collection of shayari recited by Farhan Akhtar (Imran) in ZNMD, compiled from SoundCloud. Apne Hone Par Mujhko Yaqeen Aa Gaya (The poem comes after the trio’s deep-sea dive) Pighle neelam sa behta ye sama, neeli neeli si khamoshiyan, na kahin hai zameen na kahin aasmaan, sarsaraati hui tehniyaan pattiyaan,

Converting large CSVs to nested data structures using Apache Spark

What is Apache Spark? Apache Spark brings fast, in-memory data processing to Hadoop. Elegant and expressive development APIs in Scala, Java, and Python allow data workers to efficiently execute streaming, machine learning or SQL workloads for fast iterative access to datasets. Quick start guide. Problem Statement / Task: To read a lot of really big CSVs (~GBs) from Hadoop HDFS, clean them, convert them to a nested data structure and update MongoDB with the result using Apache Spark.
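To make "convert them to a nested data structure" concrete, here is a minimal sketch of the nesting step in plain Python, without Spark, HDFS or MongoDB. Everything in it is an assumption made for illustration: the column names (`customer_id`, `order_id`, `amount`), the helper `nest_rows` and the sample data are hypothetical. On a cluster the same grouping would be expressed as a Spark group-by over the parent key rather than an in-memory dictionary.

```python
import csv
import io
from collections import defaultdict


def nest_rows(rows, key, child_field="items"):
    """Group flat rows by `key` and return one nested document per key value."""
    grouped = defaultdict(list)
    for row in rows:
        row = dict(row)          # copy so the caller's row is not modified
        parent = row.pop(key)    # the parent key moves to the top level
        grouped[parent].append(row)
    return [{key: parent, child_field: children} for parent, children in grouped.items()]


# Hypothetical flat CSV: one line per order, customer repeated on every line.
sample = io.StringIO(
    "customer_id,order_id,amount\n"
    "c1,o1,10\n"
    "c1,o2,5\n"
    "c2,o3,7\n"
)

docs = nest_rows(csv.DictReader(sample), key="customer_id", child_field="orders")

if __name__ == "__main__":
    for doc in docs:
        print(doc)
```

Each resulting dictionary has the shape of a document that could be stored in a document database: one record per customer, with that customer's orders as a nested list.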

Ponmudi - 2

If you happen to be in Trivandrum with a bike and you love to ride, Ponmudi is one place you should go. Sharing some photos taken by Saurab Devanandan during the trip. The Ponmudi! Bullets :) Me posing. Nabeel posing. Nabeel posing again. Saurab being the one taking pics, his bike posing. ISRO has some office on top of Ponmudi (must be fun working here).

Data science and unix command line

Note: This article applies only to those who code. I have seen many people struggling with MS Excel while trying to make sense of the data in a large CSV file. I don’t blame them, because most people I have met ignore the standard Unix command-line tools simply because they never cared about the command line. When the data is big (anything above 0.5 GB) and we are trying to figure out even, say, the column names of a CSV file, MS Excel will get stuck and we will see an MS Windows “Not Responding” message.
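Finding the column names does not require loading the whole file: only the first line has to be read. The sketch below shows that idea in Python, in the same spirit as running `head -n 1` on the file from the command line. The function name `csv_column_names` is hypothetical, and the sketch assumes the first row of the file is the header.

```python
import csv


def csv_column_names(path, delimiter=","):
    """Return the header row of a CSV file without reading the rest of it."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        # next() pulls only the first record; the remaining data is never read.
        return next(reader, [])
```

Because the reader is lazy, the cost depends on the length of the header line and not on the size of the file, so this stays fast even for multi-gigabyte CSVs.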