Microsoft offers Azure ML data import CLI, SDK for Snowflake, other databases

Microsoft has introduced a new integration that lets Snowflake and AWS S3 users bring their data into its Azure Machine Learning (ML) service for AI model training and development.

The integration is delivered through a new data import command line interface (CLI) and software development kit (SDK) that bring in data from repositories outside the platform, Amar Badal, senior manager for Azure Machine Learning, wrote in a blog post.

A CLI is a text-based user interface that can be used to query files, run programs and interact with a computer instance or server.

The CLI and SDK can be used, for example, to create a connection between a Snowflake instance and Azure ML, Badal wrote, adding that a data scientist could then query the connection to pull the required data into the machine learning service.

“If the scenario demands to import data on a schedule, one can use popular cron or recurrence patterns to define the frequency of import,” Badal wrote.

Cron is a utility that lets users schedule repetitive tasks to run at specified times by entering a set of commands in a CLI.
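
To make the cron patterns Badal mentions concrete, here is a minimal Python sketch of how a five-field cron expression (minute, hour, day of month, month, day of week) determines when a job fires. It is not Azure ML code. The function names are hypothetical, and as a simplification the sketch requires all five fields to match, whereas standard cron treats the two day fields as alternatives when both are restricted.

```python
from datetime import datetime


def _field_matches(field: str, value: int, low: int) -> bool:
    """Return True if one cron field (e.g. '*/15', '1-5', '0,30') admits value.

    `low` is the smallest legal value for the field (0 for minutes, 1 for days).
    """
    for part in field.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if base == "*":
            # '*' covers the whole range, so only the step can exclude a value.
            start, end = low, value
        elif "-" in base:
            lo_s, hi_s = base.split("-")
            start, end = int(lo_s), int(hi_s)
        else:
            start = end = int(base)
        if start <= value <= end and (value - start) % step_n == 0:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Check whether `when` falls on a five-field cron schedule.

    Simplifying assumption: every field must match (logical AND).
    """
    minute, hour, dom, month, dow = expr.split()
    # Cron counts Sunday as 0; Python's weekday() counts Monday as 0.
    cron_dow = (when.weekday() + 1) % 7
    return (
        _field_matches(minute, when.minute, 0)
        and _field_matches(hour, when.hour, 0)
        and _field_matches(dom, when.day, 1)
        and _field_matches(month, when.month, 1)
        and _field_matches(dow, cron_dow, 0)
    )
```

Under these assumptions, the expression "0 2 * * *" matches 02:00 every day, while "0 2 * * 1-5" matches 02:00 on weekdays only.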

New CLI and SDK to help data scientists

The new integration, according to dbInsights’ principal analyst Tony Baer, is aimed at helping data scientists improve productivity and shorten their product development or model training lifecycles.

Microsoft highlighted that the new features will eliminate the need for data scientists to communicate regularly with data engineering teams.

“Each import, whether scheduled or not, creates a unique version of the dataset which is in turn used in training jobs, giving data scientists the required traceability in scenarios that need retraining or for model audits,” Badal wrote.

Doug Henschen, principal analyst at Constellation Research, agreed that the new tools will be helpful for data scientists in particular, noting, “Any company running Snowflake on Azure will gain another, well-integrated option for doing data science with the data managed within Snowflake.”

Azure ML strategy different from Snowflake’s Snowpark

Microsoft’s integration takes a different approach to machine learning than Snowflake does, analysts said. Snowflake offers Snowpark, which is designed to let developers apply their preferred tools in a serverless manner to Snowflake’s virtual warehouse compute engine.

“Azure ML did not go the Snowpark route. Instead, Microsoft is saying go ahead and import data out of Snowflake and process it in our environment, rather than implementing Azure ML functions as user defined functions (UDFs) in Snowpark. That’s not without precedent, as Snowflake partner H2O has taken a similar tack,” Baer said.

Microsoft’s decision not to follow the Snowpark route could be largely attributed to Snowflake’s own strategy, according to Henschen.

“Snowflake created Snowpark to facilitate data science work, but it has largely left it to partners to provide the software and services needed to execute. Snowflake customers need partners like Microsoft to make the most of the data managed on the Snowflake platform,” Henschen said.

Alongside the new CLI and SDK integration, which is still in public preview, Microsoft has introduced a new lifecycle management feature for Azure ML’s managed datastore, dubbed the “Hosted on Behalf of,” or HOBO, datastore.

The offering, according to Badal, lets users manage the lifecycle of data imported from repositories such as Snowflake and AWS S3 via the new CLI and SDK integration.

“A policy to automatically delete an imported data asset if unused for 30 days by any job is set on every imported data asset in the AzureML managed datastore. All that one has to do is to set ‘azureml://datastores/workspacemanagedstore’ as the path when defining their import and rest will be handled by AzureML,” Badal wrote.
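
The 30-day rule Badal describes can be pictured with a short Python sketch. Everything here apart from the datastore path and the 30-day window quoted above is a hypothetical illustration: the `ImportedAsset` class, its fields, and the `due_for_auto_delete` function are assumptions for the example, not how Azure ML actually implements the policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Path quoted in Badal's post; everything else below is illustrative only.
MANAGED_STORE_PATH = "azureml://datastores/workspacemanagedstore"
RETENTION = timedelta(days=30)


@dataclass
class ImportedAsset:
    """Hypothetical record of an imported data asset."""
    name: str
    path: str
    last_used: datetime  # last time any job consumed the asset


def due_for_auto_delete(asset: ImportedAsset, now: datetime) -> bool:
    """Return True if the asset sits in the managed datastore and has gone
    unused for longer than the retention window (assumed rule)."""
    in_managed_store = asset.path.startswith(MANAGED_STORE_PATH)
    return in_managed_store and (now - asset.last_used) > RETENTION
```

In this sketch, an asset last used 44 days ago in the managed datastore would be flagged for deletion, while one used two weeks ago, or one stored at a different path, would not.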

This feature, according to Henschen, is crucial in the development and ongoing refresh and replacement of machine learning models.

The lifecycle management feature is also in public preview, Microsoft said.
