Tag: Application Insights

Continuous export for Azure Application Insights using Azure Data Explorer (Kusto)

Published July 6, 2022

If you use Azure, chances are that you’ve used Application Insights. Application Insights collects telemetry data for web applications and allows that telemetry to be queried, analyzed, or used to alert on anomalies. It’s backed by the Kusto engine, which makes it possible to query and aggregate substantial amounts of data very quickly. It is also relatively inexpensive. Depending on the nature of your data, however, you may find yourself bumping into one of its limitations. In most cases these limitations can be overcome by “upgrading” to AI’s big brother, Azure Data Explorer (ADX), which is more commonly known as Kusto.

This article outlines a pattern for continuously streaming data from Application Insights to Kusto.

Application Insights Limitations

Retention – Data in Application Insights is retained for 90 days at no charge and can be retained beyond that for a cost of $0.155 per GB per month. While the price is reasonable, there is a hard cap on retention of 2 years. Data older than 2 years is purged, so if longer retention is required, another solution is needed.

Backup – At present, there is no way to back up and restore Application Insights data. It is certainly possible to export this data to a variety of media, but there is no way to restore it.

Data limits – Application Insights can struggle when a large amount of data is requested. It is not possible in any circumstance to query more than 500,000 rows or 64 MB of compressed data. It is possible to implement paged queries to work around this limitation, but this can be problematic. Query timeouts are also limited to 100 seconds, and unlike the underlying Kusto engine itself, these limits are absolute.

Scale – Application Insights is a “one size fits all” service. It cannot be scaled either up or down. It is therefore not possible to overcome issues with query performance or service limits by adding power.

Schema – At present, Application Insights collects data into 10 different tables. The schema of these tables is fixed and cannot be changed. It is possible to use custom data in these tables, in fact many have columns of dynamic type for just this purpose. JSON data can be added into these columns and queried by the engine. This makes Application Insights highly flexible.

The downside of this flexibility is performance. Querying custom data requires the engine to parse data at runtime. The engine is incredibly efficient at doing this, but it cannot compare to more structured columns, particularly when querying massive amounts of data. The fixed nature of Application Insights precludes other approaches for improving query performance like materialized views, etc.
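As a rough illustration of the runtime parsing described above, a query against custom data in Application Insights typically looks something like the sketch below. The property name Plan is hypothetical; customDimensions is the dynamic column on the customEvents table.

```kusto
// Sketch only: "Plan" is a hypothetical custom property name.
customEvents
| where timestamp > ago(1d)
// The dynamic value is parsed and cast for every row at query time.
| extend Plan = tostring(customDimensions.Plan)
| summarize Events = count() by Plan
```

The extraction in the extend step is work that a structured, typed column would not need to repeat on every query.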

Moving to Kusto

If any of these limitations are an issue, you may wish to consider moving your Application Insights data into Azure Data Explorer, otherwise known as “Kusto”. Kusto is the engine behind all of Azure Monitor (which includes Application Insights and Log Analytics), and it employs the same query language.

When you use your own Kusto cluster, you have complete control over your data. A Kusto cluster contains one or more nodes and can be scaled automatically. Specifically, it solves the limitations inherent to Application Insights while maintaining familiarity with the same data types and query language (KQL). It addresses the AI limits in the following ways:

Retention – Kusto has advanced data retention and caching settings that can be set at both the database level and the table level. Retention can be set to unlimited if necessary.

Backup – Kusto can connect to external tables that are connected to Azure storage accounts or to SQL tables. Continuous export can be added to any Kusto tables so that the externalized data is always up to date. Data can be restored from these externalized sources, or by reingesting directly from them. Alternatively, AI data can be simultaneously streamed into Azure storage accounts, and this data can be ingested into Kusto for restoration.

Data limits – The default query limits in Kusto are the same as those found in Application Insights, but here they are soft limits. They can be overridden, and asynchronous operations can be used to circumvent them when necessary. In most cases however, by using data optimization strategies available to Kusto, these limits should be less important.

Scale – Kusto clusters can be as small as 1 node (for development – a single node cluster has no SLA), and as large as 1,000. Each node can be as small as 2 CPUs/14 GB RAM, and as large as 32 CPUs/128 GB RAM. There is no limit to the quantity of data that can be ingested.

Schema – This is where Kusto really shines. Data can be transformed natively at the time of ingestion using update policies. Custom Application Insights data can be extracted from the dynamic columns into more structured tables. This has the benefit of greatly enhancing performance. In addition, materialized views can be created to further enhance query performance, create pre-aggregated query targets, etc. These strategies can greatly improve query performance.
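As a sketch of the retention and caching settings mentioned above, the commands below set a retention period and a hot cache window on a table. The table name MyTelemetry and the periods shown are assumptions chosen purely for illustration.

```kusto
// Keep data in MyTelemetry (a hypothetical table) for about five years.
.alter-merge table MyTelemetry policy retention softdelete = 1825d recoverability = enabled

// Keep the most recent 30 days in hot cache for faster queries.
.alter table MyTelemetry policy caching hot = 30d
```

Run each command separately. The same kinds of policies can also be set at the database level, in which case tables without their own policy inherit them.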

By streaming Application Insights data into Kusto, you can continue to take advantage of the rich data collection capabilities of Application Insights, without being constrained by its storage limitations. In this scenario, AI acts as your telemetry collector, and Kusto your storage engine. The remainder of this article outlines how to do this.

Setting it all up

In our scenario, we are collecting data from 3 Application Insights tables: pageViews, customMetrics, and customEvents. To capture this data, we will use Diagnostic Settings, which transforms the table names to AppPageViews, AppMetrics, and AppEvents respectively. The entire process is shown below for reference:

Azure Monitor collects data from Application Insights as it arrives through Diagnostic settings. The data is then sent to an Azure Event Hub, as well as to an Azure Data Lake Gen2 (ADLG2) account for long term storage and recoverability. Azure Data Explorer (Kusto) ingests data directly from the Event Hub in near real time. Event Hub data is transformed and delivered to three staging tables through update policies and functions. In addition, external tables are connected to three containers in the storage account for diagnostic purposes or re-ingestion on demand.

Create an Event Hub and (optionally) a storage account

Data will be streamed continuously to an Event Hub and to an Azure Data Lake Gen 2 (ADLG2) account.

The Application Insights instance, the ADLG2 account, and the Event Hub namespace must all exist within the same Azure region. This is a limitation of the Azure Monitor service. The Kusto cluster can exist anywhere.

When creating the storage account, be sure to select the option for “Enable hierarchical namespace” from the Advanced page. This is what distinguishes an ordinary storage account from an ADLG2 account.

Configure Application Insights diagnostic settings

Many Azure services can stream usage data through their “Diagnostic Settings” option. In the case of Application Insights, all the collected data can be streamed. It should be noted, however, that the table names do not match those within the Application Insights logs. Instead, they are the same as those found in the Log Analytics workspace that backs the AI instance. In the example below, we are collecting data from the AppEvents, AppMetrics, and AppPageViews tables (customEvents, customMetrics, and pageViews in AI).

In this case we are sending data to an Event Hub and to an ADLG2 storage account. Each table will store its data in a separate container, and it is not possible to change that container.

Create the Kusto ingestion table and set up ingestion

The data stream to the Event Hub contains records from three different tables with different schemas. To accommodate this, we will create a temporary holding table, set up a policy to automatically distribute data in this table to three tables with different schemas, and then a retention policy to purge the holding table after distribution.

The holding table to receive Event Hub data will be named Pages_EventHub, and can be created from a Kusto query window using the following command:

.create table Pages_EventHub (records: dynamic)

This will create a table with one column named records which is of the dynamic data type. Event Hub data will land here.

Next, we create an ingestion mapping to match the incoming Event Hub JSON data to the holding table. This can be done from a query window using the following command:

.create table Pages_EventHub ingestion json mapping "RawRecordsMapping"
'['
' {"column": "records", "Properties":{"Path":"$.records"}}'
']'

When we define an ingestion, we will refer to this mapping by the name RawRecordsMapping. This mapping is a property of the holding table, and it will return the records path from the incoming JSON data and place it in the records column of the Pages_EventHub table.

Next, we define the data connection for the ingestion. To define a connection, navigate to your Kusto cluster, and open the Databases node, and then open the database that will receive the data. Finally, select Data connections, then Add data connection, and select Event Hub.

Give the connection a name, select the Event Hub namespace and hub, the default consumer group, and no compression. Use the table name and mapping created above and select JSON as the data format. When finished, save the data connection.

If data is flowing into the Event Hub, it should begin to appear in the ingestion table within a few minutes. A typical time lag is 5 minutes. Once confirmed, it’s time to create the destination tables and update policies.
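One way to confirm that data is arriving is to expand the records column and count rows by source table. This is a minimal sketch that assumes the Pages_EventHub table and mapping created above, and that each record carries a Type property as the functions later in this article expect.

```kusto
// Count the records received so far, grouped by their source table.
Pages_EventHub
| mv-expand records
| summarize Records = count() by Type = tostring(records.Type)
```

If everything is wired up correctly, the result should list AppPageViews, AppEvents, and AppMetrics, provided each has produced data.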

Create destination tables and update policies

We want to take data from the Event Hub and “reconstitute” it in Kusto. To do that, we will closely copy the data structure from the Log Analytics workspace that is connected to our Application Insights instance, leaving out some unnecessary system data. In our case, we will create three tables using the following Kusto commands (one at a time) in the Query window:

.create table pages_Staging_PageViews (TenantId: string, TimeGenerated: datetime, Id: string, Name: string, Url: string, DurationMs: real, PerformanceBucket: string, Properties: dynamic, Measurements: dynamic, 
OperationName: string, OperationId: string, ParentId: string, SyntheticSource: string, SessionId: string, UserId: string, UserAuthenticatedId: string, UserAccountId: string, AppVersion: string, AppRoleName: string, AppRoleInstance: string, ClientType: string, ClientModel: string, ClientOS: string, ClientIP: string, ClientCity: string, ClientStateOrProvince: string, ClientCountryOrRegion: string, ClientBrowser: string, ResourceGUID: string) 

.create table pages_Staging_Events (TenantId: string, TimeGenerated: datetime, Name: string, Properties: dynamic, Measurements: dynamic, OperationName: string, OperationId: string, ParentId: string, SyntheticSource: string, SessionId: string, UserId: string, UserAuthenticatedId: string, UserAccountId: string, AppVersion: string, AppRoleName: string, AppRoleInstance: string, ClientType: string, ClientModel: string, ClientOS: string, ClientIP: string, ClientCity: string, ClientStateOrProvince: string, ClientCountryOrRegion: string, ClientBrowser: string, ResourceGUID: string) 

.create table pages_Staging_Metrics (TenantId: string, TimeGenerated: datetime, Name: string, ItemCount: int, Sum: real, Min: real, Max: real, Properties: dynamic, OperationName: string, OperationId: string, ParentId: string, SyntheticSource: string, SessionId: string, UserId: string, UserAuthenticatedId: string, UserAccountId: string, AppVersion: string, AppRoleName: string, AppRoleInstance: string, ClientType: string, ClientModel: string, ClientOS: string, ClientIP: string, ClientCity: string, ClientStateOrProvince: string, ClientCountryOrRegion: string, ClientBrowser: string, ResourceGUID: string)

Next, we construct queries that fit the schemas of these three tables and filter the result for the appropriate type. These queries will then be used to create Kusto functions for each of the three tables. The commands to create the three functions, which contain our queries, can be found below.

.create-or-alter function fn_Pages_PageViewsIngest {
Pages_EventHub
| mv-expand records
| where records.Type == "AppPageViews"
| project 
TenantId = tostring(records.Properties.TenantId),
TimeGenerated = todatetime(records.['time']),
Id = tostring(records.Id),
Name = tostring(records.Name),
Url = tostring(records.Url),
DurationMs = toreal(records.DurationMs),
PerformanceBucket = tostring(records.PerformanceBucket),
Properties = todynamic(records.Properties),
Measurements = todynamic(records.Measurements),
OperationName = tostring(records.OperationName),
OperationId = tostring(records.OperationId),
ParentId = tostring(records.ParentId),
SyntheticSource = tostring(records.SyntheticSource),
SessionId = tostring(records.SessionId),
UserId = tostring(records.UserId),
UserAuthenticatedId = tostring(records.UserAuthenticatedId),
UserAccountId = tostring(records.UserAccountId),
AppVersion = tostring(records.AppVersion),
AppRoleName = tostring(records.AppRoleName),
AppRoleInstance = tostring(records.AppRoleInstance),
ClientType = tostring(records.ClientType),
ClientModel  = tostring(records.ClientModel), 
ClientOS  = tostring(records.ClientOS), 
ClientIP  = tostring(records.ClientIP),
ClientCity  = tostring(records.ClientCity), 
ClientStateOrProvince  = tostring(records.ClientStateOrProvince), 
ClientCountryOrRegion  = tostring(records.ClientCountryOrRegion), 
ClientBrowser  = tostring(records.ClientBrowser), 
ResourceGUID  = tostring(records.ResourceGUID)
}

.create-or-alter function fn_Pages_EventsIngest {
Pages_EventHub
| mv-expand records
| where records.Type == "AppEvents"
| project 
TenantId = tostring(records.Properties.TenantId),
TimeGenerated = todatetime(records.['time']),
Name = tostring(records.Name),
Properties = todynamic(records.Properties),
Measurements = todynamic(records.Measurements),
OperationName = tostring(records.OperationName),
OperationId = tostring(records.OperationId),
ParentId = tostring(records.ParentId),
SyntheticSource = tostring(records.SyntheticSource),
SessionId = tostring(records.SessionId),
UserId = tostring(records.UserId),
UserAuthenticatedId = tostring(records.UserAuthenticatedId),
UserAccountId = tostring(records.UserAccountId),
AppVersion = tostring(records.AppVersion),
AppRoleName = tostring(records.AppRoleName),
AppRoleInstance = tostring(records.AppRoleInstance),
ClientType = tostring(records.ClientType),
ClientModel  = tostring(records.ClientModel), 
ClientOS  = tostring(records.ClientOS), 
ClientIP  = tostring(records.ClientIP),
ClientCity  = tostring(records.ClientCity), 
ClientStateOrProvince  = tostring(records.ClientStateOrProvince), 
ClientCountryOrRegion  = tostring(records.ClientCountryOrRegion), 
ClientBrowser  = tostring(records.ClientBrowser), 
ResourceGUID  = tostring(records.ResourceGUID)
}

.create-or-alter function fn_Pages_MetricsIngest {
Pages_EventHub
| mv-expand records
| where records.Type == "AppMetrics"
| project 
TenantId = tostring(records.Properties.TenantId),
TimeGenerated = todatetime(records.['time']),
Name = tostring(records.Name),
ItemCount = toint(records.ItemCount),
Sum = toreal(records.Sum),
Min = toreal(records.Min),
Max = toreal(records.Max),
Properties = todynamic(records.Properties),
OperationName = tostring(records.OperationName),
OperationId = tostring(records.OperationId),
ParentId = tostring(records.ParentId),
SyntheticSource = tostring(records.SyntheticSource),
SessionId = tostring(records.SessionId),
UserId = tostring(records.UserId),
UserAuthenticatedId = tostring(records.UserAuthenticatedId),
UserAccountId = tostring(records.UserAccountId),
AppVersion = tostring(records.AppVersion),
AppRoleName = tostring(records.AppRoleName),
AppRoleInstance = tostring(records.AppRoleInstance),
ClientType = tostring(records.ClientType),
ClientModel  = tostring(records.ClientModel), 
ClientOS  = tostring(records.ClientOS), 
ClientIP  = tostring(records.ClientIP),
ClientCity  = tostring(records.ClientCity), 
ClientStateOrProvince  = tostring(records.ClientStateOrProvince), 
ClientCountryOrRegion  = tostring(records.ClientCountryOrRegion), 
ClientBrowser  = tostring(records.ClientBrowser), 
ResourceGUID  = tostring(records.ResourceGUID)
}
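Before attaching a function to an update policy, it can help to run it directly against whatever is currently in the holding table. The two queries below are a sketch of that check, using the page views function defined above. Run them separately.

```kusto
// Preview a few transformed rows from the holding table.
fn_Pages_PageViewsIngest()
| take 10

// List the output columns and types to compare with the destination table.
fn_Pages_PageViewsIngest()
| getschema
```

The column names and types returned by getschema should line up with the pages_Staging_PageViews definition, since the update policy expects the function output to match the destination table.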

With the three functions in place, we need to create an update policy that will use the results of a function to load a table whenever data is added to the holding table. For our pages_Staging_PageViews table, we run the following command to create the policy.

.alter table [@"pages_Staging_PageViews"] policy update @'[{"Source": "Pages_EventHub", "Query": "fn_Pages_PageViewsIngest()", "IsEnabled": "True", "IsTransactional": true}]'

This command adds an update policy to the pages_Staging_PageViews table. This update policy will be invoked whenever data is added to the Pages_EventHub table. It will execute the fn_Pages_PageViewsIngest function created above against this new data and load the result into the pages_Staging_PageViews table. The function itself filters out all data that did not originate from the original AppPageViews table and transforms the remainder to match the destination schema.

The commands for creating the policies on the other two tables are below:

.alter table [@"pages_Staging_Events"] policy update @'[{"Source": "Pages_EventHub", "Query": "fn_Pages_EventsIngest()", "IsEnabled": "True", "IsTransactional": true}]'

.alter table [@"pages_Staging_Metrics"] policy update @'[{"Source": "Pages_EventHub", "Query": "fn_Pages_MetricsIngest()", "IsEnabled": "True", "IsTransactional": true}]'
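To double-check that a policy was stored as intended, it can be read back from the table. The commands below are a small sketch using the three destination tables from this article. Run each one separately.

```kusto
// Show the update policy currently attached to each destination table.
.show table pages_Staging_PageViews policy update

.show table pages_Staging_Events policy update

.show table pages_Staging_Metrics policy update
```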

The last step is to add a retention policy to the Pages_EventHub table that will remove data automatically after it has been processed. This is an optional step and can be done at any point to conserve resources. A retention policy will remove ingested data after a defined time. Setting the period to 0 will delete the data shortly after all update policies have completed.

In this case the policy is added to the holding table by running the following command:

.alter-merge table Pages_EventHub policy retention softdelete = 0d recoverability = disabled

At this point, data should be flowing into the three destination tables shortly after it arrives through the event hub.
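A quick way to check the flow end to end is to look at recent row counts across the three destination tables. This is a sketch only; the one hour window is an arbitrary choice.

```kusto
// Recent row counts and the latest timestamp per destination table.
union withsource = SourceTable pages_Staging_PageViews, pages_Staging_Events, pages_Staging_Metrics
| where TimeGenerated > ago(1h)
| summarize Rows = count(), Latest = max(TimeGenerated) by SourceTable
```

A table that is missing from the result has received no rows in the window, which may simply mean that no telemetry of that type was produced.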

Connect external tables to the ADLG2 data (optional)

Earlier, we selected both an event hub and a storage account to receive data from Application Insights. The reason for the storage container is to provide an authoritative source of persisted data. Data in Application Insights expires by default after 90 days and cannot be retained any longer than 2 years. Data in Kusto can be persisted for an unlimited period, but it too can be configured to expire after a period of time. Storing the data in a storage account ensures permanency, and provides a location to re-ingest from should any disaster befall the Kusto data.

Kusto can be connected to external data sources as an external table. These sources can be a storage account, or SQL databases. While not strictly required, it is a good idea to create external tables connected to this data so that this data can be queried, and re-ingested with relative ease whenever necessary.

Connecting Kusto to ADLG2 storage is a two-step process. First you create a shared access signature, and then you create an external table in Kusto using that signature. A shared access signature can be created for the entire account, a container, or even a folder. Since we will be connecting to three different containers, we will create the signature at the account level. To do this, navigate to the storage account in Azure, and then select Shared access signature in the Security + networking section. Select Blob and File from Allowed services, and then Container and Object from Allowed resource types. Set an expiry date applicable to your situation. The external table will stop working once your expiry date has passed.

When ready, click the Generate SAS and connection string button, and the screen will appear as follows:

Make note of the Blob service SAS URL – it will be needed in the next step. It’s also a good idea to record these settings, as it’s not possible to go back and retrieve them later.

Capturing the three tables above to ADLG2 creates the following three containers in the storage account:

insights-logs-appevents
insights-logs-appmetrics
insights-logs-apppageviews

When creating the external tables below, the Blob service SAS URL values need to be modified to include these containers by adding them before the token in the URL. Therefore:

https://mystorageaccount.blob.core.windows.net/?sv=2021-06-08&……. becomes

https://mystorageaccount.blob.core.windows.net/insights-logs-appevents?sv=2021-06-08&……. and so on.

To create the external table in Kusto, navigate to a Kusto query window that is connected to the appropriate database. The following commands can be used to create the table, substituting the sample url with the ones from above:

.create-or-alter external table Pages_AppEvents_Ext (['time']:datetime,resourceId:string,ResourceGUID:guid,Type:string,ClientBrowser:string,ClientCity:string,ClientCountryOrRegion:string,ClientIP:string,ClientModel:string,ClientOS:string,ClientStateOrProvince:string,ClientType:string,IKey:guid,_BilledSize:int,OperationName:string,OperationId:guid,ParentId:guid,SDKVersion:string,SessionId:string,UserAccountId:string,UserAuthenticatedId:string,UserId:string,Properties:dynamic,Name:string,ItemCount:int) 
kind=blob 
dataformat=json
( 
   h@'https://mystorageaccount.blob.core.windows.net/insights-logs-appevents?******' 
)

.create-or-alter external table Pages_AppMetrics_Ext (['time']:datetime,resourceId:string,ResourceGUID:guid,Type:string,ClientBrowser:string,ClientCity:string,ClientCountryOrRegion:string,ClientIP:string,ClientModel:string,ClientOS:string,ClientStateOrProvince:string,ClientType:string,IKey:guid,_BilledSize:int,OperationName:string,OperationId:guid,ParentId:guid,SDKVersion:string,SessionId:string,UserAccountId:string,UserAuthenticatedId:string,UserId:string,Properties:dynamic,Name:string,Sum:int,Min:int,Max:int,ItemCount:int) 
kind=blob 
dataformat=json 
( 
    h@'https://mystorageaccount.blob.core.windows.net/insights-logs-appmetrics?******'
)

.create-or-alter external table Pages_AppPageViews_Ext (['time']:datetime,resourceId:string,ResourceGUID:guid,Type:string,ClientBrowser:string,ClientCity:string,ClientCountryOrRegion:string,ClientIP:string,ClientModel:string,ClientOS:string,ClientStateOrProvince:string,ClientType:string,IKey:guid,_BilledSize:int,OperationName:string,OperationId:guid,ParentId:guid,SDKVersion:string,SessionId:string,UserAccountId:string,UserAuthenticatedId:string,UserId:string,Properties:dynamic,Measurements:dynamic,Id:guid,Name:string,Url:string,DurationMs:int,PerformanceBucket:string,ItemCount:int) 
kind=blob 
dataformat=json 
( 
    h@'https://mystorageaccount.blob.core.windows.net/insights-logs-apppageviews?******'
)

Once created, the external tables can be queried like any other table. They can be used for data validation or reingestion as appropriate.
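For example, a validation query against the exported events might look like the sketch below. It assumes the Pages_AppEvents_Ext table defined above; the one day window and the top 10 cut-off are arbitrary.

```kusto
// Top event names exported to storage over the last day.
external_table("Pages_AppEvents_Ext")
| where ['time'] > ago(1d)
| summarize Events = count() by Name
| top 10 by Events
```

Queries against external tables read the underlying blobs at query time, so they will generally be slower than the same query against an ingested table.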

In Conclusion

Once the data is flowing, subsequent tables and update policies can be set up to further transform the data, and materialized views can be created to further optimize query performance. Moving Application Insights data into Kusto gives you the best of both worlds: the telemetry collection capabilities of Application Insights, and the big data power of Kusto. This approach is not limited to Application Insights either – it can be used with any Azure service that supports Azure Monitor with Diagnostic Settings.
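As a sketch of what such a materialized view could look like, the command below pre-aggregates daily page view counts from the staging table created earlier. The view name PageViewsDaily and the choice of aggregation are assumptions for illustration only.

```kusto
// Hypothetical view: daily page view counts per page name.
.create materialized-view PageViewsDaily on table pages_Staging_PageViews
{
    pages_Staging_PageViews
    | summarize Views = count() by Name, Day = bin(TimeGenerated, 1d)
}
```

Created this way, the view covers records ingested after its creation. Reports can then query PageViewsDaily instead of scanning the raw page view rows.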

Be aware, however, that this migration is a one-way street. Once the data is in Kusto, it can be retained for as long as you like, but it can’t be put back into the source.

Connect Power BI to Azure Monitor data using Direct Query with Azure Data Explorer

Published March 15, 2022

Man – that’s a boring title. But it’s accurate.

A few months ago, I posted an article outlining how to connect Power BI to Azure Application Insights and Azure Log Analytics (jointly referred to as Azure Monitor) with Direct Query. That article describes an approach that allows you to use a native Kusto connector to connect to the Azure Monitor instance as if it were an ADX cluster. This allows for Direct Query to be used, among other things. The option for connecting Power BI that is available through the Azure Monitor UI uses an html connector to query the respective APIs, and that connector doesn’t support Direct Query.

The problem with using this connector is that it’s a bit of a hack. At the time it was written, you needed to use the old Power BI driver for Kusto to make it work, and that approach isn’t simple. Over time, it stopped working altogether for Application Insights. The ADX connector has since been updated to support connection to Azure Log Analytics (but not Application Insights) and is therefore still valid.

There is, however, another way to achieve this by using your own ADX cluster. ADX clusters allow for “cross-cluster queries” that permit tables in a database in one cluster to be joined or unioned with tables in a completely different cluster. The same proxy addresses mentioned above can be used in one of these cross-cluster queries, and in this way, we just use the ADX cluster as an intermediary.

Everything that you need to know about this approach can be found in the support article “Query data in Azure Monitor using Azure Data Explorer”

To create a Power BI report that queries Azure Monitor data using Direct Query, first create a new report, and connect to data using the “Azure Data Explorer (Kusto) connector”. Enter the address of the cluster, and the name of a database within that cluster. The database itself doesn’t matter; it simply provides a scope for the query. Finally, you need to specify the query, and this is where the cross-cluster query comes into the picture. The query takes the following form:

cluster('ProxyURL').database('ResourceName').TableName

The Proxy URLs differ between Log Analytics and Application Insights. The two take the following forms:

Log Analytics:

https://ade.loganalytics.io/subscriptions/<subscription-id>/resourcegroups/<resource-group-name>/providers/microsoft.operationalinsights/workspaces/<workspace-name>

Application Insights:

https://ade.applicationinsights.io/subscriptions/<subscription-id>/resourcegroups/<resource-group-name>/providers/microsoft.insights/components/<ai-app-name>

The cross-cluster query for the table named “pageViews” in an Application Insights instance named “WhitePagesLogs” in a resource group named “MyResourceGroup” in the subscription “71a90792-474e-5e49-ab4e-da54baa26d5d” is therefore:

cluster('https://ade.applicationinsights.io/subscriptions/71a90792-474e-5e49-ab4e-da54baa26d5d/resourcegroups/MyResourceGroup/providers/microsoft.insights/components/WhitePagesLogs').database('WhitePagesLogs').pageViews

It is worth explicitly noting that the resource name appears twice in the query – once in the cluster address, and as the database name.
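The table reference can be followed by ordinary query operators. The sketch below uses the placeholder form of the Application Insights proxy URL shown earlier; the seven day window and the daily aggregation are assumptions for illustration.

```kusto
// Daily page view counts per page over the last week, read through the proxy.
cluster('https://ade.applicationinsights.io/subscriptions/<subscription-id>/resourcegroups/<resource-group-name>/providers/microsoft.insights/components/<ai-app-name>').database('<ai-app-name>').pageViews
| where timestamp > ago(7d)
| summarize Views = count() by bin(timestamp, 1d), name
```

Because this goes through the Application Insights proxy, the table and column names are the Application Insights ones (pageViews, timestamp, name) rather than the Log Analytics equivalents.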

When ready, the Get data dialog box should appear as follows:

If you want to use Direct Query, don’t forget to open the Advanced Options section, and select it here.

At this point, the report can be built, and it will behave as if it were connected to a normal ADX cluster. You can of course build more complex queries, etc., but you cannot build things like functions or materialized views, since you do not have administrative access to the engine behind Azure Monitor.

Compared to using the Power BI ADX connector directly, this approach has the advantage of being explicitly supported, and it also works with both Application Insights and Log Analytics. On the downside, there is a cost to running your own ADX cluster, although it is minimal. This cluster is simply acting as a gateway in this case, and therefore, a bare minimum of resources will suffice.

Automatically move your Application Insights data into Azure Data Explorer

Published November 30, 2020

Application Insights (AI) is a useful way of analyzing your application’s telemetry. Its lightning-fast queries make it ideal for analyzing historical data, but what happens when you start to bump into the limits? The default retention for data is 90 days, but that can be increased (for a fee) to 2 years. However, what happens when that’s not enough? If you query too much, or too often you may get throttled. When you start to bump into these limits, where can you go?

The answer lies in the fact that Application Insights is backed by Azure Data Explorer (ADX or Kusto). Moving your AI data to a full ADX cluster will allow you to continue using AI to collect data, and even to analyze recent data, but the ADX cluster can be sized appropriately and used when the AI instance won’t scale. The fact that it is using the same engine and query language as AI means that your queries can continue to work. This article describes a pattern for doing this.

Requirements

We’ll be working with several Azure components to create this solution. In addition to your AI instance, these components are:

Azure Data Explorer cluster
Azure Storage Account
Azure Event Hubs namespace and at least one Event Hub
Azure Event Grid

The procedure can be broken down into a series of steps:

Enable Continuous Export from AI
Create an Event Grid subscription in the storage account
Create an ADX database and ingestion table
Create an Ingestion rule in ADX
Create relevant query tables and update policies in the ADX database

Enable Continuous Export from Application Insights

AI will retain data for up to 2 years, but for archival purposes, it provides a feature called “Continuous Export”. When this feature is configured, AI will write out any data it receives to Azure blob storage in JSON format.

To enable this, open your AI instance, and scroll down to “Continuous Export” in the “Configure” section. Any existing exports will show here, along with the last time data was written. To add a new destination, select the “Add” button.

You will then need to select which AI data types to export. For this example, we will only be using Page Views, although multiple types can be selected.

Next, you need to select your storage account. First select the subscription (if different from your AI instance), and then select the storage account and container. You will need to know what data region the account is in. Once selected, save the settings.

Initially, the “Last Export” column will display “Never”, but once AI has collected some data, it will be written out to your storage container, and the “Last Export” column will display when that occurred. Once it has occurred, you should be able to open your storage account using Storage Explorer, and then the container to see the output. In the root of the container selected above, you’ll see a folder that is named with the AI Instance name, and the AI instance GUID.

Opening that folder, you’ll find a folder for each data type selected above (if there has been data for them). Each data type will be further organized into folders named for the day and the hour. Within these folders are multiple files with the .blob extension. These are multiline JSON files and can be downloaded and opened with a simple text editor.

The next step is to raise an event whenever new content is added to this storage container.

Create an Event Grid subscription in the storage account

Prior to this step, ensure that you have created, or have available an Event namespace, and an Event hub. You will connect to this hub in this step.

From the Azure portal, open the storage account and then select the “Events” node. Then click the “Event Subscription” button at the top.

On the following screen, you’ll need to provide a name and schema for the subscription. The name can be whatever you wish, and the schema should be “Event Grid Schema”. In the Topic Details section, you will provide a topic name which will pertain to all subscriptions for this storage account. In the “Event Types” section, you select the types of actions that will fire an event. For our purposes, all we want is “Blob Created”. With this selection, the event will fire every time a new blob is added to the container. Finally, under “Endpoint Details”, you will select “Event Hubs” from the dropdown, and then you click on “Select an endpoint” to select your Event Hub.

Once created, an event will fire any time a blob is created in this storage account. If you wish to restrict this to specific folders or containers, you can select the Filters tab, and create a subject filter to restrict it to specific file types, containers, etc. More information on Event Grid filters can be found here. In our case, we do not need a filter.
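If you do decide to filter, it can help to see what a subject filter actually compares. The Python sketch below only imitates the “subject begins with” and “subject ends with” checks locally; it is not Event Grid code. The subject layout (/blobServices/default/containers/<container>/blobs/<path>) is the one used for blob events, while the container and blob names are hypothetical. Case-insensitive matching is assumed as the default here.

```python
def subject_matches(subject, begins_with="", ends_with="", case_sensitive=False):
    """Return True when the subject passes both the prefix and suffix filters."""
    if not case_sensitive:
        subject, begins_with, ends_with = (
            subject.lower(), begins_with.lower(), ends_with.lower()
        )
    return subject.startswith(begins_with) and subject.endswith(ends_with)

# Hypothetical container and blob names, used only for illustration.
prefix = "/blobServices/default/containers/aiexport/blobs/"
export_blob = prefix + "myapp_0000/PageViews/2022-07-06/12/part1.blob"
other_blob = "/blobServices/default/containers/images/blobs/logo.png"

print(subject_matches(export_blob, begins_with=prefix, ends_with=".blob"))  # True
print(subject_matches(other_blob, begins_with=prefix, ends_with=".blob"))   # False
```

A prefix that names the container keeps events from other containers out of the pipeline, and a “.blob” suffix keeps out any other file types that land in the same container.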

When ready, click the “Create” button, and the Event subscription will be created. It can be monitored from the storage account and can also be monitored in the Event hub. As new blobs are added to the storage account, more events will fire.

Create an ADX database and ingestion table

From the Azure portal, navigate to your ADX cluster and either select a database or create a new one. Once the database has been created, you need to create at least one table to store the data. Ultimately, Kusto will ingest data from the blobs added above whenever they are added, and you need to do some mapping to get that to work properly. For debugging purposes, I find it useful to create an intermediate ADX table to receive data from the blobs, and then transform the data afterward.

In this case, the intermediate table will have a single column, Body, that will contain the entirety of each ingested record. To create this table, run the following KQL query on your database:

.create-merge table Ingestion (Body: dynamic)

The dynamic data type in ADX can work with JSON content, and each record will go there. For this to work, you also need to add a mapping to the table. The mapping can be very complex, but in our case, we’re doing a simple load in, so we’re matching the entire JSON record to the Body column in our database. To add this mapping, run the following KQL command:

.create table Ingestion ingestion json mapping "RawInput"
'['
' {"column": "Body", "Properties":{"Path":"$"}}'
']'

At this point, we are ready for an ingestion rule.

Create an Ingestion rule in ADX

From the Azure portal, open your ADX cluster, and select the “Databases” node in the “Data” section, then click on your database.

The setting that we need is “Data ingestion” in the resulting window. Selecting that takes you to the ingestion rules. Now you want to create a new connection by selecting the “Add data connection” button.

The first selection is the data connection type. The options are Event Hub, Blob storage, or IoT Hub. We need to select Blob storage. Both it and Event Hub will connect to an Event hub, but the difference is that with “Blob storage”, the contents of the blobs will be delivered, while selecting “Event Hub” will only deliver the metadata of the blob being added.

Once the type is selected, you give it a name, and choose the event grid to connect to (the one that you created above) and the event type. Next, you select “Manual” in the Resources creation section. Selecting “Automatic” will create a new Event Hub Namespace, Hub, and Event grid and you won’t have any control of the naming of these resources. Selecting Manual allows you to keep it under control. Select your event grid here.

Next, select the “Ingest properties” tab, and provide the table and mapping that you created above (which in our case was “RawInput”). Also, you need to select “MULTILINE JSON” as the data format.

Once these values are complete, press the Create button and the automatic ingestion will commence. Adding a new blob to the storage account will fire an event, which will cause ADX to load the contents of the blob into the Body column of the Ingestion table. This process can take up to 5 minutes after the event fires.

Create relevant query tables and update policies in the ADX database

Once ingestion happens, your “Ingestion” table should have records in it. Running a simple query in ADX using the table name should show several records with data in the “Body” column. Opening a record will show the full structure of the JSON contained within. If records with different schema are being imported, a query filter can be employed to limit the results to only those records.

For example, pageViews records exported from AI will always contain a JSON node named “view”. Filtering the Ingestion table with “where isnull(Body.view) == false” will return only pageView data.

This ingestion table can be queried in this manner moving forward, but for performance and usability reasons, it is better to “materialize” the views of this table. To do this, we create another table, and set an update policy on it that will add relevant rows to it whenever the ingestion table is updated.

The first step is to create the table. In our case, we want to replicate the schema of the pageViews table in Application Insights. This is because we want to be able to reuse any queries that we have already built against AI. All that should be necessary is to change the source of those queries to the ADX cluster/database. To create a table with (mostly) the same schema as the AI pageViews table, the following command can be executed in ADX:

 .create table pageViews (
    timestamp: datetime, 
    ['id']: string, 
    name: string, 
    url: string, 
    duration: real,
    performanceBucket: string, 
    customDimensions: dynamic, 
    customMeasurements: dynamic, 
    operation_Name: string, 
    operation_Id: string, 
    operation_ParentId: string, 
    operation_SyntheticSource: string,
    session_Id: string, 
    user_Id: string, 
    user_AuthenticatedId: string, 
    user_AccountId: string, 
    application_Version: string, 
    client_Type: string, 
    client_Model: string, 
    client_OS: string, 
    client_IP: string, 
    client_City: string, 
    client_StateOrProvince: string, 
    client_CountryOrRegion: string, 
    client_Browser: string
)

Once the table is created, we need to create a query against the Ingestion table that will return pageViews records in the schema of the new table. Without getting deep into the nuances of the KQL language, a query that will do this is below:

    Ingestion
    // Keep only records that carry a "view" node (page views)
    | where isnull(Body.view) == false
    | extend view = Body.view
    | mv-expand view
    // durationMetric.value divided by 10000 gives milliseconds
    | extend performancems = toreal(view.durationMetric.value) / 10000
    // Merge the array of single-key dimension objects into one property bag
    | extend a = trim_end("\\]",trim_start("\\[",tostring(Body.context.custom.dimensions)))
    | extend b = replace('"}','"',replace('{"','"',a))
    | extend c = todynamic(strcat('{',b,'}'))
    // Same merge for the custom metrics
    | extend d = trim_end("\\]",trim_start("\\[",tostring(Body.context.custom.metrics)))
    | extend e = replace('"}','"',replace('{"','"',d))
    | extend f = todynamic(strcat('{',e,'}'))
    | project
        timestamp = todatetime(Body.context.data.eventTime),
        id = tostring(Body.internal.data.id),
        name = tostring(view.name),
        url = tostring(view.url),
        duration = toreal(performancems),
        performanceBucket = case(
                performancems < 250, "<250ms",
                performancems < 500, "250ms-500ms",
                performancems < 1000, "500ms-1sec",
                performancems < 3000, "1sec-3sec",
                performancems < 7000, "3sec-7sec",
                performancems < 15000, "7sec-15sec",
                performancems < 30000, "15sec-30sec",
                performancems < 60000, "30sec-1min",
                performancems < 120000, "1min-2min",
                performancems < 300000, "2min-5min",
                ">=5min"
            ),
        customDimensions = todynamic(c),
        customMeasurements = todynamic(f),
        operation_Name = tostring(Body.context.operation.name),
        operation_Id = tostring(Body.context.operation.id),
        operation_ParentId = tostring(Body.context.operation.parentId),
        operation_SyntheticSource = tostring(Body.context.data.isSynthetic),
        session_Id = tostring(Body.context.session.id),
        user_Id = tostring(Body.context.user.anonId),
        user_AuthenticatedId = tostring(Body.context.user.authId),
        user_AccountId = tostring(Body.context.user.accountId),
        application_Version = tostring(Body.internal.data.documentVersion),
        client_Type = tostring(Body.context.device.type),
        client_Model = tostring(Body.context.device.deviceModel),
        client_OS = tostring(Body.context.device.osVersion),
        client_IP = tostring(Body.context.location.clientip),
        client_City = tostring(Body.context.location.city),
        client_StateOrProvince = tostring(Body.context.location.province),
        client_CountryOrRegion = tostring(Body.context.location.country),
        client_Browser = tostring(Body.context.device.browserVersion)

The “where isnull(Body.view) == false” statement above uniquely identifies records from the pageViews table. This is useful if multiple tables use the same Ingestion table.
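Two parts of this query are easier to follow outside of KQL: the string manipulation that merges custom dimensions, and the performance bucket lookup. The Python sketch below mirrors both under stated assumptions. It works on already-parsed objects rather than strings, it assumes custom dimensions arrive as an array of single-key objects (which is what the trim and replace steps imply), and it takes the division by 10000 in the query at face value as the conversion to milliseconds. The function names are mine, not part of any library.

```python
def merge_pairs(pairs):
    """Merge a list of single-key objects into one object (later keys win)."""
    merged = {}
    for item in pairs or []:
        merged.update(item)
    return merged

def to_milliseconds(raw_duration):
    """Mirror the query's conversion: raw duration value / 10000."""
    return raw_duration / 10000

# Upper bounds in milliseconds, in the same order as the case() statement.
BUCKETS = [
    (250, "<250ms"), (500, "250ms-500ms"), (1000, "500ms-1sec"),
    (3000, "1sec-3sec"), (7000, "3sec-7sec"), (15000, "7sec-15sec"),
    (30000, "15sec-30sec"), (60000, "30sec-1min"), (120000, "1min-2min"),
    (300000, "2min-5min"),
]

def performance_bucket(duration_ms):
    """Return the first bucket whose upper bound exceeds the duration."""
    for upper_bound, label in BUCKETS:
        if duration_ms < upper_bound:
            return label
    return ">=5min"

# Hypothetical sample values.
dimensions = [{"Category": "Blog"}, {"Author": "Someone"}]
print(merge_pairs(dimensions))
print(performance_bucket(to_milliseconds(2_400_000)))
```

The merge is the reason the query produces a single customDimensions property bag that can be addressed as customDimensions.Category, just as in Application Insights itself.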

Next, we need to create a function to encapsulate this query. When we add an update policy to the pageViews table, this function will run this query on any new records in the Ingestion table as they arrive. The output will be added to the pageViews table. To create the function, it’s a simple matter of wrapping the query from above in the code below and running the command:

.create-or-alter function pageViews_Expand() {
   Query to run
}

This creates a new function named pageViews_Expand. Now that the function has been created, we modify the update policy of the pageViews table to run it whenever new records are added to the Ingestion table, and its output will be added to the pageViews table. The command to do this can be seen below:

.alter table pageViews policy update @'[{"Source": "Ingestion", "Query": "pageViews_Expand()", "IsEnabled": "True", "IsTransactional": true}]'

After the next ingestion run, not only will you see records in the Ingestion table, but if there were page views, you should see the results show up in the pageViews table as well.
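Because the update policy is a JSON array embedded in a command string, it is easy to mistype a quote or a name. If you plan to repeat this for several tables, a small helper can generate the command for you. The Python sketch below is only an illustration: the helper name is hypothetical, and the table and function names are the ones used in this article (pageViews, Ingestion, pageViews_Expand).

```python
import json

def update_policy_command(target_table, source_table, function_name):
    """Build the .alter table ... policy update command text."""
    policy = [{
        "Source": source_table,           # table whose new rows trigger the policy
        "Query": f"{function_name}()",    # function that shapes the rows
        "IsEnabled": "True",
        "IsTransactional": True,
    }]
    return f".alter table {target_table} policy update @'{json.dumps(policy)}'"

command = update_policy_command("pageViews", "Ingestion", "pageViews_Expand")
print(command)
```

The generated text can then be pasted into the ADX query window. The same helper could be reused for other AI data types by passing a different target table and function.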

If you have data already in the Ingestion table that you want to bring in to the pageViews table, whether for testing or for historical purposes, you can use the .append command to load rows into the table from the function:

.append pageViews <| pageViews_Expand

Finally, if you don’t want to maintain data in the Ingestion table for very long, or not at all, you can set the retention policy on it. Data will be automatically purged from it at the end of the time limit. Setting the value to zero will purge the data immediately, and in that case, the Ingestion table simply becomes a conduit. To set the retention policy on the Ingestion table to 0, you can run the following command:

.alter-merge table Ingestion policy retention softdelete = 0d recoverability = disabled

There are several steps involved, but once everything is wired up, data should flow from Application Insights to Azure Data Explorer within a few minutes. This example only worked with the pageViews table, but any of the AI tables can be used, although of course their schemas will be different.

Exceed the 500,000 row limit in Application Insights and Log Analytics with Power BI

Published November 2, 2020

The combination of Power BI and Application Insights (AI)/Log Analytics (LA) is a powerful one. These tools provide a quick, convenient, and relatively cheap way to collect and analyze telemetry on a wide variety of applications. One drawback of AI/LA is that any data query will return a maximum of 500,000 rows, which can be quite constraining in some cases. This article describes a way to work around this limit.

In this example, we’ll be working with an Application Insights instance that is being populated by the WordPress Application Insights plugin – in fact, it’s the one used on this very blog. There are a couple of ways to connect Power BI Desktop to AI data. The Power Query code is downloadable directly from Application Insights, and you can also use the Azure Data Explorer proxy address as outlined in my post on the topic here. This approach will work for both methods, and for our purposes, we’ll be using the generated Power Query code approach.

To begin, access your Application Insights instance, and open the Logs window. If necessary, dismiss the “Queries” window that pops up. Next, form your query using Kusto Query Language (KQL). In our case, we want a simple dump of all rows in the “pageViews” table, so the query is simple – just pageViews.

Once we have the query the way that we want it, we select the Export button, and choose “Export to Power BI (M query)”. M is the name of the language that Power Query uses. Once chosen, a text file will be downloaded that contains the Power Query that we will need in Power BI Desktop.

At this point, we launch Power BI Desktop, and choose “Get Data”. Since we already have the query that we need, we will choose “Blank Query”.

Next, we name our query “Page Views”, and select the Advanced Editor. This is where we can paste in the query generated by Application Insights above. At this point, we open the file that was downloaded above, copy the contents, and paste them in this window (the top comments can be excluded).

Of note here is the value that will be automatically set for timespan. By default, this will be set to P1D, which means data will be retrieved only for the previous day. In our example above, we have changed it to show data for the past 365 days.

Selecting “Done” will load a preview of our data into Power Query. However, if we want to then load it into the data model, it will do so in a single pull, and we will be subject to the 500,000 row limit. What we need to do is break up our query into multiple queries, and Power Query lets us do this through the use of functions.

The first thing that we’ll need to do is to decide on how to segment the AI data. In our case, it is unlikely that we will have more than 500,000 page views per month, so if we performed one query per month, we should be able to retrieve all of our data. In order to do this, we need to go back to Application Insights, and form up a query that will return a list of year and month for our data. In our case, this query is:

pageViews
| where timestamp > now(-365d)
| summarize by 
    Year = datetime_part('Year',timestamp), 
    Month = datetime_part('Month',timestamp)

Note that the number of days in the where clause above should match the number of days in the larger query above. Next, export this query to Power BI, and create another query in Power Query. Leave the name as default for now. Selecting Done should return a list of years and months for your data. These values are all numbers, and Power Query recognizes them as such. However, we need to work with them as text later on, so we change their types to text.

Now we will return to our original query, and modify it so that it only returns data for a single month. Reopen the advanced editor and replace the query “pageViews” with:

pageViews | where datetime_part('Month',timestamp) == 10 and datetime_part('Year',timestamp) == 2020

The values chosen don’t matter, but they should return data. In the end, the edited code should look as follows:

Selecting Done, we verify that we have data restricted to the specified month. This is where the fun begins. We are now going to turn this query into a function. To do so, we right click on our pageViews query, and select “Create Function”.

We are then presented with a dialog box that asks if we want to create the function without parameters. We can go ahead and select “Create”. We are then prompted to name the function, and we’ll call it “GetViewsByMonthAndYear”. We now need to edit the function. To do so, with the function selected in the query pane, we select the Advanced Editor once again. We then dismiss the following warning, and then we edit the function in two places. First, we need to define two variables to be passed to the function, Month and Year, and then we add them to our query.

In the function declaration we add “Month as text” and “Year as text”. We then replace the explicit month and year that we originally queried for with these new variables, Month and Year. Our function code now appears as below:

Now we are ready to use our function. We select our query that contains the list of years and months, select the “Add Column” tab from the ribbon, and choose “Invoke Custom Function”. We give the new column a name “Views”, select our function from the dropdown, and then we select our column containing years and the column containing months to be passed to the function.

At this point, selecting “OK” will cause the function to be executed for each of the listed months. These are individual queries to AI, not one large one. Each query is still subject to the 500,000 row limit, but provided that no specific month exceeds that limit, all of the data will be returned.
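The segmentation idea is independent of Power Query, and it may be easier to see in a few lines of Python. The sketch below is an illustration only: it enumerates calendar months for the last 365 days instead of asking AI which months actually contain data (which is what the summarize query above does), and it uses a fixed end date so the output is reproducible. Each generated string has the same shape as the single-month query shown earlier.

```python
from datetime import date, timedelta

def months_between(start, end):
    """Return (year, month) pairs from start's month through end's month."""
    pairs = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        pairs.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return pairs

def month_query(year, month):
    """Build the KQL text that restricts pageViews to one month."""
    return (
        "pageViews | where datetime_part('Month',timestamp) == "
        f"{month} and datetime_part('Year',timestamp) == {year}"
    )

end = date(2020, 11, 2)             # fixed date, for a reproducible example
start = end - timedelta(days=365)
queries = [month_query(y, m) for y, m in months_between(start, end)]
print(len(queries))
print(queries[-1])
```

Each string corresponds to one call of the function in Power Query, so each one gets its own 500,000 row allowance. If a single month could exceed that, the same pattern works with a finer segment such as week or day.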

Initially, the data is returned as a single table per month, but selecting the expand icon at the right of the column header allows us to retrieve the row values. It’s also a good idea to turn off the “Use original column name as prefix” option.

Selecting OK at this point displays all of the appropriate column values. We can then remove the “Year” and “Month” columns, as well as the original Page Views table that we used to create the function. We also need to set the data types for all of our columns because Power Query is unable to detect them using this approach.

Renaming our combined query to Views gives us the following result:

We still have a single table, but there is no longer a 500,000 row limit. At this point, we can load the data into the model and build our report.

Connect to Application Insights and Log Analytics with Direct Query in Power BI

Published June 24, 2020

Application Insights (AI) and Log Analytics (LA) from Microsoft Azure provide easy and inexpensive ways to instrument applications. Using just an instrumentation key, any application can send operational data to AI which can then provide a rich array of tools to monitor the operation of the application. In fact, the blog that you are reading uses an Application Insights plugin for WordPress that registers each view of a page into an instance of AI in my Azure tenant.

Application Insights data can be queried directly in the Azure portal to provide rich insights. In addition, the data can be exported to Excel for further analysis, or it can be queried using Power Query in either Excel or Power BI. The procedure for using Power Query can be found in this article. That approach uses the Web connector in Power Query, which can be automatically refreshed on a regular basis. The Web connector does not, however, support Direct Query, so the latency of the data in this scenario will be limited by the refresh schedule configured in Power BI. Any features that depend on Direct Query (Aggregations, Automatic Page Refresh) will also not work.

If you’ve worked with AI or LA, and dropped down to the Query editor, you’ve been exposed to KQL – the Kusto Query Language. This is the language that is used by Azure Data Explorer (ADX), also known by its code name, “Kusto”. This is of course not a coincidence, as the Kusto engine powers both AI and LA.

Power BI contains a native connector for ADX, and you can configure an ADX cluster for yourself, populate it, and work with it in Power BI for both imported and Direct Query datasets. Given that ADX is what powers AI and LA, it should be possible to use this connector to query the data for AI and LA. It turns out that the introduction of a new feature known as the ADX proxy will allow us to do just that.

The ADX proxy is designed to allow the ADX user interface to connect to instances of AI and LA and run queries from the same screens as native ADX clusters. The entire process is described in the document Query data in Azure Monitor using Azure Data Explorer. The document explains the process, but what we are particularly interested in is the syntax used to express an AI or LA instance as an ADX cluster. Multiple variations are described in the document, but the ones that we are most interested in are here:

For LA: https://ade.loganalytics.io/subscriptions/<subscription-id>/resourcegroups/<resource-group-name>/providers/microsoft.operationalinsights/workspaces/<workspace-name>

For AI: https://ade.applicationinsights.io/subscriptions/<subscription-id>/resourcegroups/<resource-group-name>/providers/microsoft.insights/components/<ai-app-name>
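Filling in these templates is plain string substitution. As a minimal Python sketch, with hypothetical helper names and placeholder values that you would replace with your own, the two addresses can be built like this:

```python
def la_proxy_url(subscription_id, resource_group, workspace_name):
    """Build the ADX proxy address for a Log Analytics workspace."""
    return (
        "https://ade.loganalytics.io"
        f"/subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/providers/microsoft.operationalinsights/workspaces/{workspace_name}"
    )

def ai_proxy_url(subscription_id, resource_group, app_name):
    """Build the ADX proxy address for an Application Insights instance."""
    return (
        "https://ade.applicationinsights.io"
        f"/subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/providers/microsoft.insights/components/{app_name}"
    )

# Placeholder values for illustration only.
url = ai_proxy_url("00000000-0000-0000-0000-000000000000", "my-rg", "my-ai-app")
print(url)
```

The resulting string is what goes into the cluster field of the ADX connector.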

By substituting in your subscription ID, resource group name, and resource name, you can treat these resources as if they were ADX clusters, and query them in Power BI using Direct Query. As an example, a simple query on this blog can be formed using the ADX connector:

And the result will appear as:

The precise query is provided in the query section of the connector above.

Once the report is built, it can be deployed to the Power BI service, and refreshed using AAD credentials.

It is important to note that this method does NOT require you to configure an ADX cluster of your own. We are simply utilizing the cluster provided to all instances of AI and LA. We therefore do not have any control over performance levels, as we would have in a full ADX cluster. However, if the performance is adequate (and the queries are designed appropriately), this can be a good approach to work with AI and LA data that has low latency (near real time) requirements.
