A Map/Reduce-like implementation in PHP (DRAFT)
This is a first attempt at a Map/Reduce implementation written purely in PHP.
Why does it need to be pure PHP? The question last came up for me while I was implementing an Information Extraction engine.
First of all, PHP is a scripting language, and the cheapest hosting that provides PHP/MySQL costs 2-3 dollars a month and is sometimes even free. It is easier to get PHP running on a server than most other languages or technologies.
Secondly, PHP is simple and lightweight, and applications written in it are easy to run and deploy.
Thirdly, it is easier for other people to use and run a PHP framework than a framework in another language.
The lightweight PHP Multithreading [multithreading article] engine was used as the basis for the PHP Map/Reduce-like implementation.
What was changed or improved to get PHP Map/Reduce?
First of all, Map/Reduce should run not as several threads on one machine but across a distributed network. Using a database and a message broker solves this problem: machines that share no memory coordinate through storage they can all reach.
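The coordination idea can be sketched with a task table that every worker machine polls. This is a minimal illustration, not the project's code: the `tasks` table, the `enqueue_task` and `claim_task` helpers, and the worker names are all hypothetical, an in-memory SQLite database (via PDO) stands in for the shared MySQL server, and the message broker is left out entirely.

```php
<?php
// Assumption: in-memory SQLite stands in for the shared MySQL server.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE tasks (
    id      INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'pending',
    worker  TEXT
)");

/** JobTracker side (hypothetical helper): queue one task per input chunk. */
function enqueue_task(PDO $db, string $payload): void
{
    $db->prepare("INSERT INTO tasks (payload) VALUES (?)")->execute([$payload]);
}

/** TaskTracker side (hypothetical helper): claim one pending task, or null if none is left. */
function claim_task(PDO $db, string $worker): ?array
{
    while (true) {
        $row = $db->query(
            "SELECT id, payload FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        )->fetch(PDO::FETCH_ASSOC);
        if ($row === false) {
            return null; // nothing left to do
        }
        // The status check in WHERE makes the update a no-op if another
        // worker claimed the same row between our SELECT and UPDATE.
        $update = $db->prepare(
            "UPDATE tasks SET status = 'running', worker = ? WHERE id = ? AND status = 'pending'"
        );
        $update->execute([$worker, $row['id']]);
        if ($update->rowCount() === 1) {
            return $row; // claim succeeded
        }
        // Lost the race: loop and try the next pending task.
    }
}

enqueue_task($db, 'chunk-1');
enqueue_task($db, 'chunk-2');

$first  = claim_task($db, 'worker-a');
$second = claim_task($db, 'worker-b');
$third  = claim_task($db, 'worker-a'); // queue is empty by now
```

The conditional `UPDATE` is what lets several machines poll the same table without handing one task to two workers. A real deployment would also need timeouts for tasks whose worker died, which this sketch ignores.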
JobTracker -
NameNode+DataNode: both live in one place, a database. In the future it could be replaced with some kind of “BigTable-like implementation on PHP and MySQL”.
TaskTracker
<picture> Structure and architecture
Step by step description
+ link to sf.net, reviewed code
Ways to run:
1. Similar to PHP Multithreading: everything on one machine, with one main process and several computation threads.
2. Distributed MapReduce
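Both modes share the same map, shuffle, and reduce steps; they differ only in where the steps execute. The sketch below runs a word count in a single process to show those steps. The function names (`map_words`, `shuffle_pairs`, `reduce_counts`, `run_job`) are made up for this example and are not the framework's API, and no threads are started.

```php
<?php
/** Map step: emit a [word, 1] pair for every word in one input record. */
function map_words(string $line): array
{
    $pairs = [];
    foreach (preg_split('/\W+/', strtolower($line), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $pairs[] = [$word, 1];
    }
    return $pairs;
}

/** Shuffle step: group all emitted values by their key. */
function shuffle_pairs(array $pairs): array
{
    $groups = [];
    foreach ($pairs as [$key, $value]) {
        $groups[$key][] = $value;
    }
    return $groups;
}

/** Reduce step: collapse the values of one key into a single result. */
function reduce_counts(string $key, array $values): int
{
    return array_sum($values);
}

/** Run the whole job sequentially in the current process. */
function run_job(array $records): array
{
    $pairs = [];
    foreach ($records as $record) {
        // In mode 1 each map call could go to a computation thread,
        // in mode 2 to a remote worker; here it simply runs inline.
        foreach (map_words($record) as $pair) {
            $pairs[] = $pair;
        }
    }
    $result = [];
    foreach (shuffle_pairs($pairs) as $key => $values) {
        // PHP turns numeric string keys into ints, so cast back to string.
        $result[$key] = reduce_counts((string) $key, $values);
    }
    ksort($result);
    return $result;
}

$counts = run_job(['the quick fox', 'The lazy dog', 'the end']);
print_r($counts);
```

Because each `map_words` call depends only on its own record, the calls can be moved to threads or to other machines without changing the map or reduce code.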
PHP Map/Reduce Limitations:
PHP Map/Reduce mainly uses a database as storage for input, output, and intermediate data. Databases limit how much data can be stored compared with the huge MapReduce installations of Google, Hadoop, etc. But building BigTable[]-like or HDFS[]-like storage is another topic.
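To make the storage role concrete, here is a rough sketch of intermediate key/value pairs kept in a database table and read back for the reduce phase. The `intermediate` table layout and the `emit_intermediate` and `reduce_job` helpers are assumptions made for this example, and in-memory SQLite (via PDO) again stands in for MySQL.

```php
<?php
// Assumption: in-memory SQLite stands in for the MySQL storage.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE intermediate (
    job_id INTEGER NOT NULL,
    k      TEXT NOT NULL,
    v      TEXT NOT NULL
)");
$db->exec("CREATE INDEX idx_intermediate ON intermediate (job_id, k)");

/** Map side (hypothetical helper): store one emitted key/value pair. */
function emit_intermediate(PDO $db, int $jobId, string $key, string $value): void
{
    $db->prepare("INSERT INTO intermediate (job_id, k, v) VALUES (?, ?, ?)")
       ->execute([$jobId, $key, $value]);
}

/** Reduce side (hypothetical helper): group one job's rows by key and reduce each group. */
function reduce_job(PDO $db, int $jobId, callable $reducer): array
{
    $stmt = $db->prepare("SELECT k, v FROM intermediate WHERE job_id = ? ORDER BY k");
    $stmt->execute([$jobId]);

    $groups = [];
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
        $groups[$row['k']][] = $row['v'];
    }

    $output = [];
    foreach ($groups as $key => $values) {
        $output[$key] = $reducer((string) $key, $values);
    }
    return $output;
}

emit_intermediate($db, 1, 'php', '1');
emit_intermediate($db, 1, 'mysql', '1');
emit_intermediate($db, 1, 'php', '1');
emit_intermediate($db, 2, 'php', '1'); // belongs to another job

$totals = reduce_job($db, 1, function (string $key, array $values): int {
    return count($values);
});
```

Note that `fetchAll` pulls every intermediate row of the job into memory at once. That is fine for a sketch, but it is exactly the kind of size limit described above.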
The NameNode and DataNode roles are both played by a single database. In the future it could be replaced with some kind of “BigTable-like implementation on PHP and MySQL”. Why not? Judging by the article
http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
there is much to be improved in the area of big distributed data storage.