Note: this is a backup of the knol at http://knol.google.com/k/data-mining-through-cloud-computing
Since Google closed Knol, I am porting it here. It used to be the top-ranked SERP for "data mining cloud computing" and had 5,500 views on Knol; quite possibly it was one of the earliest frameworks for software interface design for the cloud (circa 2008).
Data mining through Cloud computing
Ohri Framework: Data Mining through Cloud Computing
Summary: data encryption and compression before data transfer to the cloud computing network; the open source analytical tool R in the cloud; a graphical interface like Rattle for results; billing at cost-plus pricing.
The Ohri Framework tries to create an economic alternative to proprietary data mining software by giving more value to the customer and utilizing the open source statistical package R, with the GUI Rattle, hosted in a cloud computing environment.
It is based on the following assumptions:
1) R is relatively inefficient at processing bigger file sizes on the same desktop configuration as other software such as SAS.
2) R has a steep learning curve, hence the need for the GUI Rattle.
3) R's enhanced need for computing resources is best met by a cloud computing, on-demand processing environment. This lets R scale up to whatever processing power it needs. Mainstream data mining software charges by CPU count for servers and is much more expensive on software costs alone.
4) Users with big data sizes face data hygiene issues when transporting data. This is addressed by encrypting and compressing both the raw CSV files and the processed CSV files before transport. CSV is used to ease consumption by internal data sources. PGP is recommended for both compression and encryption; compressing the data also cuts bandwidth transportation costs.
5) Pricing the service on a cost-plus model is recommended to encourage usage by subscribers.
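Assumption 4 can be sketched in code. In the framework proper a PGP tool would handle encryption; as a minimal stand-in (the file names, checksum step, and function name below are illustrative assumptions, not part of the original framework), this Python sketch gzip-compresses a CSV before transfer and records a SHA-256 checksum so the receiving end can verify the file arrived intact:

```python
import gzip
import hashlib
from pathlib import Path

def prepare_for_transfer(csv_path: str) -> tuple[str, str]:
    """Compress a CSV file and return (compressed_path, sha256_hex).

    In the full framework a PGP tool would also encrypt the compressed
    file before it leaves the local network; that step is omitted here
    because it requires an external key pair.
    """
    data = Path(csv_path).read_bytes()
    gz_path = csv_path + ".gz"
    with gzip.open(gz_path, "wb") as f:
        f.write(data)                       # compression cuts bandwidth cost
    digest = hashlib.sha256(Path(gz_path).read_bytes()).hexdigest()
    return gz_path, digest

# Example: a small CSV written locally, then prepared for upload.
Path("scores.csv").write_text("id,score\n1,0.91\n2,0.47\n")
gz_path, checksum = prepare_for_transfer("scores.csv")
print(gz_path, len(checksum))  # scores.csv.gz 64
```

The same checksum would be recomputed after the return transfer of the processed CSV, closing the data hygiene loop in both directions.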
Part of the reason mainstream software like SPSS and SAS continues to enjoy a profitable lead is:
1) Standardized language elements (the Data Step and Procedures)
2) Ease of learning SAS and SPSS through multiple tutorials
3) Output delivery to multiple destinations
4) Input from multiple data sources
But the most important reason is the sheer efficiency of this software, such as the SAS PDV (Program Data Vector), in reading large files. If common software like Microsoft Excel could load a 300 MB file that easily, it would make a significant dent. Large files are assumed to be used by larger license holders or bigger corporate users.
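The large-file point comes down to streaming: rather than loading a 300 MB file into memory at once (as a spreadsheet does), an efficient tool reads it record by record. The sketch below is an illustrative Python analogue of that idea, not part of the original framework; the column names and sample data are assumptions:

```python
import csv
import io

def sum_column_streaming(csv_text: str, column: str) -> float:
    """Sum one numeric column while reading row by row, so memory
    use stays flat no matter how large the input is (csv.DictReader
    pulls one line at a time from the underlying stream)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    total = 0.0
    for row in reader:          # only one row held in memory at a time
        total += float(row[column])
    return total

# A tiny stand-in for a multi-gigabyte file on disk.
sample = "id,amount\n1,10.5\n2,4.5\n3,25.0\n"
print(sum_column_streaming(sample, "amount"))  # 40.0
```

With a real file, `io.StringIO` would be replaced by `open(path)`, and the same flat memory profile holds.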
Cloud computing could help languages like R here. R is very good at advanced statistics, it is free, and its packages are peer reviewed. It also has little-known but very good GUIs (like Rattle). If you place the Rattle GUI in a cloud, it would use processing power on demand and output results. SAS won't do this because it charges by CPU count, and that is an idle asset (reaping rewards from programming done long ago). The expensive cost of data mining software threatens to create a new digital divide by restricting access to these tools for poor people. R can help here: with cloud computing you need not invest in distributed computing infrastructure; just give poor people in Asia access to a fast internet connection and a simple computer.
Thus you save on both hardware and software costs, and people pay only when they use the system. An additional cost is the fixed cost of the remote application built to support the framework, including transport bandwidth. A suggestion is to use compressed and encrypted data transfers to and from the remote cloud; software analogous to that from PGP.com would help here.
Users pay for bandwidth, plus cost and a small markup for the cloud hosting; economies of scale will ensue.
R's graphics system is superior to that of SAS or SPSS, and it can also be adapted to newer graphics front ends like Silverlight, Flex, or even Flash.
The little guy no longer needs to squeeze himself for big computing power. I could be totally wrong here, but it may be worth a shot. Companies like Zementis (www.zementis.com) and Kirix (www.kirix.com) are already offering data mining as a service (for scoring and analytics respectively).
The Ohri Framework thus tries to eliminate hardware costs (via R in the cloud), software costs (versus proprietary software), data hygiene costs (via encryption), and bandwidth costs (via compression), to give data mining on demand to the masses.
http://decisionstats.com/ohri/
Some time back, I had created a framework for data mining through on-demand cloud computing. This is the next version; it is free for all to use, with only authorship credit back to me. It tries to do away with the fixed server and desktop costs, as well as the fixed software costs, of the software used for data mining, statistics, and analytics, which carries huge per-CPU annual license fees.
The modified Ohri Framework tries to mash up the following:
0) HTTPS rather than HTTP
1) Encryption and compression software for data transfer (like PGP)
2) An open source stats package like R on a cloud computer (like Amazon EC2, or RightScale with Hadoop)
3) A GUI to make it easy to use (like the Rattle GUI and the PMML package)
4) An open source data mining package (like RapidMiner or Splunk)
5) RIA graphics (like Silverlight)
6) Secure output to cloud computing services (like Google Docs)
7) Billing priced at simple cost plus X% (where simple cost can be, say, 0.85 cents per instance-hour or more depending on usage, and X should not exceed 15%)
8) Open source sharing of all code to ensure community sandboxing
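Item 7's cost-plus billing can be sketched numerically. The rate and markup cap below are the document's own example figures (0.85 cents per instance-hour, at most 15%); the function name and rounding are illustrative assumptions:

```python
def cost_plus_bill(instance_hours: float,
                   base_rate: float = 0.0085,   # 0.85 cents per instance-hour, in dollars
                   markup: float = 0.15) -> float:
    """Cost-plus billing per item 7: pass through the raw cloud cost
    and add a markup of at most 15%."""
    if not 0 <= markup <= 0.15:
        raise ValueError("markup should not exceed 15%")
    return round(instance_hours * base_rate * (1 + markup), 4)

# 1,000 instance-hours at the base rate with the full 15% markup:
print(cost_plus_bill(1000))  # 9.775
```

Because the base cost is passed through transparently, any future price cut by the cloud provider flows directly to the subscriber, which is the point of cost-plus over flat licensing.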
The intention is to move fixed computing costs away from servers and desktops, to normal PCs (Ubuntu Linux) with browser (Firefox or Internet Explorer) access to secure data mining on demand: on-tap data mining for anyone in the world, without big license purchases and renewals (software expenses) or big hardware purchases (which become obsolete in 2-3 years).
All copyrights and brand names are respectively acknowledged. This is a non-commercial project and the views are personal only.
Activity for this knol
This week: 50 pageviews. Totals: 5,504 pageviews, 2 comments.
Comments
does it work for micro array expression data also? Who is the cloud service provider? is it a private cloud, public cloud or hybrid?
— chitra gupta, 04 May 2011
it is a framework. google, amazon, azure all are using some part of it now, but not all. it was written in 2008 and worth an update.
— Ajay Ohri, 04 May 2011