This is a first attempt at a Map/Reduce implementation written purely in PHP.
Why is a pure-PHP implementation needed? For me the question last came up when I was implementing an Information Extraction engine.
First of all, PHP is a scripting language, and the cheapest hosting that provides PHP/MySQL costs 2-3 dollars a month and is sometimes even free. It is easier to get PHP running on a server than other languages/technologies.
Secondly, PHP is very simple and lightweight, and applications written in it are easy to run and deploy.
Thirdly, other people find it easier to use and run a PHP framework than frameworks in other languages.
The lightweight PHP Multithreading engine [multithreading article] was used as the basis for the PHP Map/Reduce-like implementation.
What was changed/improved in order to get PHP Map/Reduce?
First of all, Map/Reduce should work not as several threads on one machine, but across a distributed network. Using a database and a message broker solves this problem.
JobTracker -
NameNode+DataNode are located in one place, which is a database; in the future it could be replaced with some kind of “BigTable similar implementation on PHP and MySQL”
TaskTracker
<picture> Structure and architecture
Step by step description
+ link to sf.net, reviewed code
Run modes:
1. Similar to PHP Multithreading: everything on one machine, with one main process and several computation threads.
2. Distributed MapReduce
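To make the first run mode more concrete, here is a minimal single-process sketch of the map, group and reduce phases, using a word count as the task. It is not taken from the PHP Map/Reduce source: the function names mapWords, reduceCounts and runMapReduce are hypothetical, and everything runs in one PHP process without threads or a database.

```php
<?php
// Hypothetical helper names; not taken from the PHP Map/Reduce source.

/** Map step: emit a [word, 1] pair for every word in one input record. */
function mapWords(string $line): array
{
    $pairs = [];
    foreach (str_word_count(strtolower($line), 1) as $word) {
        $pairs[] = [$word, 1];
    }
    return $pairs;
}

/** Reduce step: combine all values emitted for one key. */
function reduceCounts(string $word, array $counts): int
{
    return array_sum($counts);
}

/** Run map, group the pairs by key, then reduce, all in one process. */
function runMapReduce(array $records, callable $map, callable $reduce): array
{
    $grouped = [];
    foreach ($records as $record) {
        foreach ($map($record) as [$key, $value]) {
            $grouped[$key][] = $value;
        }
    }
    $result = [];
    foreach ($grouped as $key => $values) {
        $result[$key] = $reduce((string) $key, $values);
    }
    ksort($result); // stable, alphabetical output order
    return $result;
}

$counts = runMapReduce(['PHP is simple', 'PHP is cheap'], 'mapWords', 'reduceCounts');
print_r($counts);
```

In the distributed mode, the idea would be that each call to the map or reduce function becomes a task handed to a worker, while the grouped intermediate data lives in the database instead of a PHP array.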
PHP Map/Reduce Limitations:
PHP Map/Reduce mainly uses a database as storage for the input/output/intermediate data. Databases are known to limit the size of stored information compared to the huge MapReduce installations of Google, Hadoop, etc. But building a BigTable[]- or HDFS[]-like storage is another topic.
NameNode+DataNode are located in one place, which is a database; in the future it could be replaced with some kind of “BigTable similar implementation on PHP and MySQL”. Why not? Considering the article
http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
there is much to be improved in the area of big distributed data storage.
I have just written an article about a cross-platform PHP Multithreading engine: http://tiny.cc/lc2J1
This article could have appeared approximately 1.5 years ago, but at that time I had neither the free time nor the ability to publish anything connected with it, because parts of the source code described here were used in several commercial projects.
The main problem in implementing a PHP multithreading engine is shared memory, whose implementation usually varies from one OS to another. Threads use this shared memory to talk to each other, get tasks and commands, give back results and use other shared input/output resources if needed. PHP can be used on at least two platforms (Linux and Windows, or even more [1]), which differ in memory management, structure, etc. from the programmer's point of view. This lowers the chances of making a multiplatform PHP multithreading engine.
Several quite strong attempts in the direction of PHP Multithreading are: PHP Threader – Multithreading-like Functionality in PHP [12], Emulate threads using separate HTTP requests [13] and Improved Thread Simulation Class for PHP [14].
At first glance the first one [12] seems complete and fine, but a closer look (analysis of the code and examples) reveals some serious drawbacks, namely:
What about the second one [13]? The author claims: “This package provides an alternative solution that consists in sending multiple HTTP requests to the same Web server on which PHP is running”. The third one [14] is also based on HTTP requests. The approach is reasonable, but both solutions lack thread management, job distribution, etc. They do the first part (starting several threads and getting responses from them) very well, but shared memory, full thread control, distribution of jobs/tasks, etc. are missing. It is also questionable whether they would work with huge amounts of data.
Other previous articles on the topic of PHP Multithreading (see [2],[3],[4]) mostly provide information about using, for example, forking (see [8] for pcntl_fork) on Linux or curl (see [9]) on Windows, etc. They still do not provide a full Multithreading solution.
Additionally, there have been a lot of attempts to implement PHP Multithreading [10], [11].
Reading these articles, several questions arise straight away. How would we track the life-cycle of threads? What happens if some of the threads hang or crash unexpectedly without any notice?
Inspired by: Multi-threading strategies in PHP [5] and Communicating with threads in PHP [6].
As a result, some ideas came up on how to improve the code and concept provided in the articles above [5] and [6], and make the algorithm more automated, universal, clean and less complicated.
What improvements will be added?
Improvement I: Shared memory
Any database (MySQL, Postgres, Oracle, etc.) could be used as the shared memory.
We need only two tables (please see a picture below): one for messaging/tasking/commands called cmd, and another for tracking the life-cycle of threads called threads.
Table: Cmd
cmd_id – just a primary key
proc_id – ID of a thread
cmd – command given (for example: calculate, exit, etc.)
param – additional parameters needed for the command, usually a serialized PHP object
result – stored after the command was done, usually a serialized PHP object.
done – flag for the Main thread showing whether the command/task was done; it helps to collect results and to reassign the task to another thread if the current one is not responding.
datestamp – just the date and time
Table: Threads
threads_id – just a primary key
proc_id – ID of a thread
last_beat – last timestamp when Thread was alive
busy – flag, is it busy or not
state – parameter that represents state (for example exit, ready, etc.)
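The two tables described above could be created roughly as follows. This is only a sketch under assumptions: it uses PDO with an in-memory SQLite database so that it runs anywhere, the column types and defaults are my guesses based on the field descriptions, and createSchema is a hypothetical helper name.

```php
<?php
// Assumption: PDO with in-memory SQLite; column types are guesses
// derived from the field descriptions, not the original DDL.
function createSchema(PDO $db): void
{
    $db->exec("CREATE TABLE cmd (
        cmd_id    INTEGER PRIMARY KEY,           -- just a primary key
        proc_id   INTEGER NOT NULL,              -- ID of the target thread
        cmd       VARCHAR(32) NOT NULL,          -- e.g. calculate, exit
        param     TEXT,                          -- serialized PHP value
        result    TEXT,                          -- serialized PHP value
        done      INTEGER NOT NULL DEFAULT 0,    -- 0 = pending, 1 = done
        datestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )");
    $db->exec("CREATE TABLE threads (
        threads_id INTEGER PRIMARY KEY,          -- just a primary key
        proc_id    INTEGER NOT NULL UNIQUE,      -- ID of the thread
        last_beat  INTEGER NOT NULL,             -- Unix time of last sign of life
        busy       INTEGER NOT NULL DEFAULT 0,   -- 0 = idle, 1 = busy
        state      VARCHAR(16) NOT NULL DEFAULT 'ready'
    )");
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
createSchema($db);

// Register one thread and read its row back.
$db->exec('INSERT INTO threads (proc_id, last_beat) VALUES (1, ' . time() . ')');
$row = $db->query('SELECT proc_id, busy, state FROM threads')->fetch(PDO::FETCH_ASSOC);
print_r($row);
```

Storing last_beat as a Unix timestamp is a simplification that keeps the timeout arithmetic portable between database engines.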
Personally I used MSSQL Server, but the tables and commands are ANSI SQL compatible, which means there is no problem using other databases like MySQL, Postgres, Oracle, etc. (Further in the article you will see that the EZsql DB abstraction class [7] is used for communication with the database, so it is easy to change the DB engine. EZsql is not the best abstraction class, and of course you can use your own connector to access the database.)
MSSQL DDL of cmd and thread tables:
Example of data
Data shown is for application with 30 threads.
Cmd table, Threads table.
Improvement II: Message broker
Message/task/command brokering is done through the database, together with the following functions in the code:
Loop “THREADS main cycle”
and
and finally
Please see main.php and Thread.php for detailed information.
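For readers without the source at hand, the brokering idea can be sketched like this. It is not the code from main.php or Thread.php: the function names postCommand, fetchCommand and storeResult are hypothetical, and PDO with in-memory SQLite is assumed instead of MSSQL with EZsql (the LIMIT clause in particular is SQLite syntax).

```php
<?php
// Assumptions: PDO + in-memory SQLite instead of MSSQL/EZsql;
// all function names below are hypothetical.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE cmd (
    cmd_id INTEGER PRIMARY KEY, proc_id INTEGER, cmd TEXT,
    param TEXT, result TEXT, done INTEGER DEFAULT 0,
    datestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP)");

/** Main thread: queue a command for one worker and return its cmd_id. */
function postCommand(PDO $db, int $procId, string $cmd, $param): int
{
    $stmt = $db->prepare('INSERT INTO cmd (proc_id, cmd, param) VALUES (?, ?, ?)');
    $stmt->execute([$procId, $cmd, serialize($param)]);
    return (int) $db->lastInsertId();
}

/** Worker: pick up the oldest unfinished command addressed to it, if any. */
function fetchCommand(PDO $db, int $procId): ?array
{
    $stmt = $db->prepare(
        'SELECT cmd_id, cmd, param FROM cmd
         WHERE proc_id = ? AND done = 0 ORDER BY cmd_id LIMIT 1'
    );
    $stmt->execute([$procId]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        return null; // nothing to do
    }
    $row['param'] = unserialize($row['param']);
    return $row;
}

/** Worker: write the result back and flag the command as done. */
function storeResult(PDO $db, int $cmdId, $result): void
{
    $stmt = $db->prepare('UPDATE cmd SET result = ?, done = 1 WHERE cmd_id = ?');
    $stmt->execute([serialize($result), $cmdId]);
}

// One round trip: main posts a task, worker 7 computes and reports back.
$cmdId = postCommand($db, 7, 'calculate', ['a' => 2, 'b' => 3]);
$task = fetchCommand($db, 7);
storeResult($db, (int) $task['cmd_id'], $task['param']['a'] + $task['param']['b']);
$answer = unserialize(
    $db->query("SELECT result FROM cmd WHERE cmd_id = $cmdId")->fetchColumn()
);
```

In a real run the main thread and the workers would be separate PHP processes polling the same table; here both sides are called in sequence only to show the data flow.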
Improvement III: thread life-cycle management
We need to know which of the threads are ready for processing, which are busy, which have finished processing and are asking for termination, etc. The message broker is used to give “vital” commands to threads.
This is the job of the following functions:
Please see Thread.php.
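The heartbeat part of the life-cycle management can be illustrated with a small sketch. Again, this is not the code from Thread.php: beat and findStaleThreads are hypothetical names, the 30-second timeout is an arbitrary example value, and PDO with in-memory SQLite is assumed.

```php
<?php
// Assumptions: PDO + in-memory SQLite; beat() and findStaleThreads() are
// hypothetical names; the 30-second timeout is an arbitrary example value.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE threads (
    threads_id INTEGER PRIMARY KEY, proc_id INTEGER UNIQUE,
    last_beat INTEGER, busy INTEGER DEFAULT 0, state TEXT DEFAULT 'ready')");

/** Worker: record that it is alive, creating its row on the first beat. */
function beat(PDO $db, int $procId, int $now, bool $busy = false): void
{
    $update = $db->prepare('UPDATE threads SET last_beat = ?, busy = ? WHERE proc_id = ?');
    $update->execute([$now, (int) $busy, $procId]);
    if ($update->rowCount() === 0) {
        $insert = $db->prepare('INSERT INTO threads (proc_id, last_beat, busy) VALUES (?, ?, ?)');
        $insert->execute([$procId, $now, (int) $busy]);
    }
}

/** Main thread: list workers whose last beat is older than the timeout. */
function findStaleThreads(PDO $db, int $now, int $timeout = 30): array
{
    $stmt = $db->prepare(
        "SELECT proc_id FROM threads
         WHERE state <> 'exit' AND last_beat < ? ORDER BY proc_id"
    );
    $stmt->execute([$now - $timeout]);
    return array_map('intval', $stmt->fetchAll(PDO::FETCH_COLUMN));
}

$now = 1000;               // fixed clock so the example is deterministic
beat($db, 1, $now - 5);    // worker 1 reported 5 seconds ago
beat($db, 2, $now - 120);  // worker 2 has been silent for 2 minutes
$stale = findStaleThreads($db, $now);
print_r($stale);
```

Once a thread shows up as stale, the main thread can reassign its unfinished rows in cmd to another thread, which is what the done flag is for.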
The source code is well annotated and commented, so have a look inside.
Download: PHP_Multithreading_sourcecode_v1.0
One of the applications where the code was used was mass downloading of pictures and generating thumbnails from them.
Work volume: ~3000 JPEG pictures, 0.5-1.5 MB each
Hardware used: 1.4 GHz Pentium 4 processor, 2 GB RAM, IIS 5.5 etc.
Internet connection: 8/8 Mbit symmetric connection
Sequential download and resizing would take 5 hours.
The multithreaded solution with 20 threads took less than 3 minutes.
What distinguishes the PHP Multithreading engine proposed in this article?
Any comments, questions and suggestions about the article are highly appreciated.
The next step will be creating a PHP Map/Reduce-like implementation and hosting it on http://sourceforge.net/projects/phpmapreduce/.
[1] Supported platforms by PHP http://wiki.php.net/platforms
[2] Sonic server http://dev.pedemont.com/sonic/
[3] Process Forking with PHP http://www.electrictoolbox.com/article/php/process-forking/
[4] Multithreading in PHP with CURL http://www.ibuildings.nl/blog/archives/811-Multithreading-in-PHP-with-CURL.html
[5] Multi-threading strategies in PHP http://www.alternateinterior.com/2007/05/multi-threading-strategies-in-php.html
[6] Communicating with threads in PHP http://www.alternateinterior.com/2007/05/communicating-with-threads-in-php.html
[7] EZsql DB abstraction class http://www.woyano.com/jv/ezsql
[8] PHP Function pcntl_fork http://www.php.net/manual/en/function.pcntl-fork.php
[9] PHP library Curl http://www.php.net/curl
[10] Attempt to make PHP Multithreading Tutorial http://phpmultithreaddaemon.blogspot.com/2007/09/introduction.html
[11] Discussions about PHP Multithreading http://webforumz.com/php/12595-multithreaded-php.htm
[12] PHP Threader – Multithreading-like Functionality in PHP http://www.phpclasses.org/browse/package/4082.html
[13] Emulate threads using separate HTTP requests http://www.phpclasses.org/browse/package/3953.html
[14] Improved Thread Simulation Class for PHP http://w-shadow.com/blog/2008/05/24/improved-thread-simulation-class-for-php/