Original author(s) | Microsoft |
---|---|
Initial release | 2016 |
Written in | Python |
Platform | Windows, Linux |
Available in | R |
Website | docs |
RevoScaleR is a machine learning package in R created by Microsoft. It is available as part of Machine Learning Server, Microsoft R Client, and Machine Learning Services in Microsoft SQL Server 2016.
The package contains functions for creating linear model, logistic regression, random forest, decision tree and boosted decision tree, and K-means, in addition to some summary functions for inspecting and visualizing data.[1]
It has a Python package counterpart called revoscalepy. Another closely related package is MicrosoftML, which contains machine learning algorithms that RevoScaleR does not have, such as neural network and SVM.
In June 2021, Microsoft announced to open source the RevoScaleR and revoscalepy packages, making them freely available under the MIT License.[2]
Concepts
Many R packages are designed to analyze data that can fit in the memory of the machine and usually do not make use of parallel processing. RevoScaleR was designed to address these limitations. The functions in RevoScaleR orientate around three main abstraction concepts that users can specify to process large amount of data that might not fit in memory and exploit parallel resources to speed up the analysis.
Compute Contexts
A compute context refers to the location where the computation on the data happens. It could be "local" (on the client machine) or "remote" (on a data platform such as a SQL server, or Spark). Pushing the computation to a remote server allows people to take advantage of the greater compute resources that a remote machine may have. If the data being analyzed reside on the same machine, using a remote compute context also removes the need to pull data across the network onto the client machine.[3]
Data source
Data source defines where the data comes from. There are various data sources available in RevoScaleR, such as text data, Xdf data, in-SQL data, and a spark dataframe. People can wrap their data in a data source object and use that as run analytics in different compute context. Different data sources are available in different compute context. For example, if the compute context is set to SQL server, then the only data source one can use would be an in-SQL data source.
Analytics
Analytic functions in RevoScaleR takes in data source object, a compute context, and the other parameters needed to build the specific model, such as formula for the logistic regression or the number of trees in a decision tree. In addition to those parameters, one can also specify the level of parallelism, such as the size of the data chunk for each process or number of processes to build the model. However, parallelism is only available in non-express edition.
Limitations
The package is mostly meant to be used with a SQL server or other remote machines. To fully leverage the abstractions it uses to process a large dataset, one needs a remote server and non-Express free edition of the package. It cannot be easily installed such as by running "install.packages("RevoScaleR")" like most open source R packages. It's available only through Microsoft R Client, a distribution of R for data science, or Microsoft Machine Learning Server (stand-alone with no SQL server attached), or Microsoft Machine Learning Services (a SQL server services). However, one can still use the analytics functions in an Express, free version of the package.
See also
References
- โ "RevoScaleR package". Microsoft Corporation. Retrieved 2018-04-12.
- โ Looking to the future for R in Azure SQL and SQL Server - Microsoft SQL Server Blog
- โ "Compute context for script execution in Machine Learning Server". Microsoft Corporation. Retrieved 2018-04-12.