PHP Map/Reduce-like implementation (DRAFT)

November 19th, 2009

This is a first attempt at a Map/Reduce implementation written purely in PHP.

Why does it need to be purely in PHP? For me, this question last came up while I was implementing an Information Extraction engine.

First of all, PHP is a scripting language, and the cheapest hosting that provides PHP/MySQL costs 2-3 dollars a month and is sometimes even free. It is easier to get PHP running on a server somewhere than other languages or technologies.

Secondly, PHP is very simple and lightweight, and applications written in it are easy to run and deploy.

Thirdly, it is easier for other people to use and run a PHP framework than a framework in another language.

The PHP Map/Reduce-like implementation is based on the lightweight PHP Multithreading [multithreading article] engine.

What was changed or improved in order to get PHP Map/Reduce?
First of all, Map/Reduce should run not as several threads on one machine but across a distributed network. Using a database and a message broker solves this problem.
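One way to picture the database part of this is a shared table that acts as a task queue which workers on different machines poll. The sketch below is only an illustration under stated assumptions: the `tasks` table and the `addTask` and `claimTask` functions are hypothetical names, not taken from the project, an in-memory SQLite database accessed through PDO stands in for MySQL, and the message broker is left out entirely.

```php
<?php
// Hypothetical sketch: a database table used as a shared task queue.
// Table, column and function names are assumptions made for illustration.

$db = new PDO('sqlite::memory:'); // stand-in for a shared MySQL server
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    worker TEXT
)");

// Register a unit of work (for example, one chunk of the input).
function addTask(PDO $db, string $payload): void
{
    $stmt = $db->prepare('INSERT INTO tasks (payload) VALUES (?)');
    $stmt->execute([$payload]);
}

// Try to claim the oldest pending task for the given worker.
// Returns the task row, or null if nothing could be claimed.
function claimTask(PDO $db, string $worker): ?array
{
    $db->beginTransaction();
    $row = $db->query(
        "SELECT id, payload FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
    )->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        $db->commit();
        return null;
    }
    // The status check in WHERE means the update only succeeds
    // if no other worker has claimed this row in the meantime.
    $upd = $db->prepare(
        "UPDATE tasks SET status = 'running', worker = ? WHERE id = ? AND status = 'pending'"
    );
    $upd->execute([$worker, (int) $row['id']]);
    $claimed = $upd->rowCount() === 1;
    $db->commit();
    return $claimed ? $row : null;
}

addTask($db, 'chunk-1');
addTask($db, 'chunk-2');
$a = claimTask($db, 'worker-a'); // gets chunk-1
$b = claimTask($db, 'worker-b'); // gets chunk-2
$c = claimTask($db, 'worker-c'); // nothing left to claim
```

In a real deployment each worker would run this claim step against the same database server from its own machine; a message broker could then notify workers about new tasks instead of having them poll the table.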

JobTracker -

The NameNode and DataNode are located in one place, which is a database; in the future it could be replaced with some kind of “BigTable-like implementation on PHP and MySQL”.

TaskTracker

<picture> Structure and architecture

Step by step description

+ link to sf.net, reviewed code

Ways to run:

1. Similar to PHP Multithreading: everything on one machine, with one main process and several computation threads.
2. Distributed MapReduce
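To make the first mode concrete, here is a minimal single-process sketch of the map, group and reduce phases applied to a word count. It is not the project's code: the function names `mapWords`, `groupByKey`, `reduceCounts` and `runJob` are hypothetical, and everything runs sequentially in one process rather than in separate computation threads.

```php
<?php
// Hypothetical sketch of the three Map/Reduce phases in a single process.
// All function names are assumptions made for illustration.

// Map: turn one input line into a list of [word, 1] pairs.
function mapWords(string $line): array
{
    $pairs = [];
    foreach (str_word_count(strtolower($line), 1) as $word) {
        $pairs[] = [$word, 1];
    }
    return $pairs;
}

// Group: collect all values that share the same key.
function groupByKey(array $pairs): array
{
    $groups = [];
    foreach ($pairs as [$key, $value]) {
        $groups[$key][] = $value;
    }
    return $groups;
}

// Reduce: collapse the values of one key into a single count.
function reduceCounts(string $word, array $counts): int
{
    return array_sum($counts);
}

// Run the whole job sequentially on one machine.
function runJob(array $lines): array
{
    $pairs = [];
    foreach ($lines as $line) {
        $pairs = array_merge($pairs, mapWords($line));
    }
    $result = [];
    foreach (groupByKey($pairs) as $word => $counts) {
        $result[$word] = reduceCounts((string) $word, $counts);
    }
    ksort($result); // sort by word so the output order is stable
    return $result;
}

$counts = runJob(['PHP is simple', 'PHP is cheap']);
print_r($counts);
```

In the distributed mode the same map and reduce functions could stay unchanged, while the loop in `runJob` would be replaced by workers that read their input from, and write their pairs to, the shared database.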

PHP Map/Reduce Limitations:

PHP Map/Reduce mainly uses a database as storage for input, output, and intermediate data. Databases are known to limit how much information they can store compared with the huge MapReduce installations of Google, Hadoop, etc. But building storage similar to BigTable[] or HDFS[] is another topic.

As noted above, the NameNode and DataNode are located in one place, the database, which could in the future be replaced with some kind of “BigTable-like implementation on PHP and MySQL”. Why not? If we consider the article
http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
there is much to be improved in the area of big distributed data storage.

Author: Anton Vedeshin Categories: Articles Tags:
