
25th Telecommunications forum TELFOR 2017 Serbia, Belgrade, November 21-22, 2017.



Improving Schema Issue Advisor in the Azure SQL Database

Dejan Dundjerski, Stefan Lazić, Milo Tomašević, and Dragan Bojić

This paper is sponsored by Microsoft development center Serbia.
Dejan Dundjerski is with the Microsoft development center Serbia, Španskih Boraca 3, and School of Electrical Engineering, University of Belgrade, Bul. kralja Aleksandra 73, 11000 Belgrade, Serbia (e-mail: dejandu@microsoft.com).
Stefan Lazić is with the Microsoft development center Serbia, Španskih Boraca 3, 11000 Belgrade, Serbia (e-mail: stlazi@microsoft.com).
Milo Tomašević is with the School of Electrical Engineering, University of Belgrade, Bul. kralja Aleksandra 73, 11120 Belgrade, Serbia (e-mail: mvt@etf.bg.ac.rs).
Dragan Bojić is with the School of Electrical Engineering, University of Belgrade, Bul. kralja Aleksandra 73, 11120 Belgrade, Serbia (e-mail: bojic@etf.bg.ac.rs).

978-1-5386-3073-0/17/$31.00 ©2017 IEEE

Abstract: An analysis of the telemetry data in the Azure SQL database reveals that the most frequent anomalies are due to schema inconsistency errors, which impair the normal functioning of customer applications. We assumed that direct e-mail notifications to the customers could shorten the time to resolve the problems. The benefits were first validated by sending e-mail recommendations manually. Then, an improved schema advisor was implemented which detects anomalies and automatically sends the appropriate notifications to the customers. The implementation, which respects the feedback from the users, was accompanied by a continuous testing process as a modern way of developing a new feature. Evaluation over a prolonged period confirmed the expected benefits.

Keywords: Azure SQL database platform, Cloud services, Feature validation and testing, Schema issue advisor.

I. INTRODUCTION

While the traditional relational databases are still typically employed, the introduction of the cloud environment in recent years provides significant advantages to its users. Nowadays the existing on-premise solutions have been massively migrated to the cloud due to the lower cost of maintenance [1, 2]. In addition, companies which used to build complex software solutions that were installed on-premise, with development cycles of several years, now develop, deploy and publish new versions of software at a monthly pace [3]. This kind of approach allows the companies to validate the design decisions and get feedback from the customers at a faster pace.

At the same time, software in the cloud can be used and combined with machine learning and expert systems. This particularly applies to the possibilities of automatic performance troubleshooting and auto-tuning of relational databases, which were initially introduced in the Azure SQL database [4]. Collecting the telemetry data that tracks the exceptions and errors as they occur in the customer database enables the detection of potential problems. It was also recognized that the customers are usually unaware of the problems for a prolonged period of time. Therefore, our intention was to propose an improved advisor that would automatically generate the appropriate notifications. In this way, the users would be instantly informed about recently detected errors in the database which directly affect the functioning of an application. The proposed approach of automatic, prompt alerting of the users should improve the time to detect and mitigate schema inconsistencies and issues.

Before realization of this idea, a rough validation of the proposed feature was performed by manual notifications. After that, the main challenge was how to shorten the development cycle from the idea to realization. An improved, agile process of feature development through integrated implementation, testing and evaluation in a continuous loop was carried out. The end result should be an improved schema issue advisor based upon the customer feedback.

The rest of this paper is organized as follows. Section 2 briefly describes the Azure SQL database platform. The telemetry pipeline and database advisors are especially covered, as they are important components of the platform in the context of this work. Section 3 elaborates the statement of the problem. Motivation and validation of the proposed solution are described in Section 4. Section 5 gives an overview of the implementation process and continuous testing of the feature. Evaluation is presented and discussed in Section 6. Conclusions and highlights of the prospective future work are given in Section 7.

II. AZURE SQL DATABASE

Microsoft Azure SQL cloud-based database is one of the first database-as-a-service solutions [3]. It is a highly available and reliable system which alleviates the need for maintaining the complex servers that host SQL servers. Among other convenient features, these kinds of services provide telemetry that tracks resource utilization, encountered errors and other useful and relevant information for a SQL database.

A. Telemetry

Unlike an on-premise SQL server, the Azure SQL database hosts millions of databases across the globe. Much like the flight data recorder in a plane, all important parameters of the database are collected and
recorded [5]. In our approach, it is of vital importance to track the information about errors and exceptions that continuously occur inside the Azure SQL database.

The collected data are stored in a structured way in both cold and hot storage. Hot storage is used for scenarios like alerting, when it is crucial to have data with a low lag. Querying of hot storage should be completed in a restricted amount of time, providing near real-time results. On the other hand, cold storage contains much more data with a longer retention period, so it is often used for post-processing and various ad-hoc analyses. Due to the amount of data being stored, querying of cold storage is usually slower and cannot be used for near real-time processing.

B. Database advisors

Collecting the telemetry data from a very large number of different databases opens an important possibility. Even though numerous customers have different workloads, typical patterns can still be extracted and recognized as far as the usage of the database is concerned. By analyzing the telemetry data, those different workloads could be grouped according to their workload thumbprints. With the detection of specific issues, auto-tuning of the database is also possible for the cases where the appropriate corrective actions are available.

This approach not only alleviates the pressure on the customer and database administrator, but it also catches the issues caused by a bad deployment or existing application issues. Consequently, early detection of these issues allows the customers to focus on fixing the actual issue instead of building the monitoring and alerting logic intended to catch such situations. Two existing advisors that follow this kind of approach are Database Index Advisor and Database Parameterization Advisor [4].

III. PROBLEM STATEMENT

By analyzing the telemetry from the Azure SQL database service, we concluded that various errors occurring in the service could be classified into different groups. The largest class, which covers about 25% of the observed cases, consists of the errors caused by database schema inconsistencies. These issues usually happen either due to a change in the application logic or due to a change in the database schema. When the customer changes the application logic, it usually results in a set of queries targeting non-existing database objects. Otherwise, when an inconsistent state comes from the database side, it usually happens when the schema was unintentionally modified or when the deployment process between the application and the database is not orchestrated correctly. These issues affect the proper functioning of the end service which employs a relational database to preserve its state. Other classes, which represent the errors not related to the schema inconsistencies, are not that large since they cover a wide spectrum of issues.

Focusing on the schema inconsistencies revealed another useful insight obtained from the telemetry. Once the users change their application or the schema of the database, they do not immediately become aware of the problems that such a change caused. Either due to a lack of telemetry on the application side or due to bad programming practice, when the errors are caught without an appropriate report, the usual time to detect an error and resolve it from the customer perspective was about 2 to 3 days. This represents a potential problem that could cause the improper functioning of the application or its complete failure, depending on the application and business logic.

Related work described in [6] relies on the concept of machine learning for assistance in the detection of service issues. That study is focused on the machine learning models for classification of different issues, while less focus has been given to the interaction with the customer about the detected problems. By analyzing the different systems and solutions on the global market [7], we concluded that, to the best of our knowledge, there is no end-to-end solution that adequately addresses the stated problem of the schema issues which affect the business and application logic. Specifically, other cloud service providers use the telemetry data to analyze and recognize different patterns as a base for continuous improvement and optimization of the service they are hosting. Our goal is to go one step further and use the telemetry data not just to improve the system, but to improve the customer experience as well.

IV. VALIDATION OF THE CONCEPT

First, the observed pattern that notification about the errors can provide benefit to the customers had to be validated before actual implementation.

An initial model for detection of error anomalies was based on a simple query over the hot telemetry data source. It helped us to detect the databases with the highest error counts in the current hour compared to the average error count per hour during the previous 5 days. The baseline time period of 5 days was dictated by limitations of the hot telemetry data source. In this way, the top ten databases with the highest error counts were identified and then investigated manually.

For each of the ten databases, the dominant error with the highest count of occurrence was investigated. After that, the customer was informed with a manually sent message about the observed issue, including the number of errors, the predicted trend and potential clues on how it should be further examined from the application standpoint. The initial reply rate was quite high: about 60% of customers replied to the e-mail. They were satisfied with the notification and found such information very useful.

Further validation iterations included more than the top ten databases and, at the same time, different groups of detected errors were investigated as well. Different groups of errors had different reply rates, but the group of errors related to the schema inconsistencies stood out clearly throughout the entire validation period.

During the whole validation process, the manually sent e-mail messages evolved. Initially the e-mails included only a basic set of information. Since the customers asked for additional information, we added clearer and more detailed
insight in further iterations. This kind of feedback loop helped us to iterate on the initial solution much faster during the validation. After a couple of iterations, by means of direct communication with the customers, we verified our assumption that, without informing the users, it takes between 2 and 3 days to resolve the schema inconsistency.

V. IMPLEMENTATION AND TESTING

After we proved the concept of alleviating the query failures caused by inconsistency between application queries and database schema, the enhanced SQL Database Schema Advisor was implemented.

A. Implementation details

The proposed model takes preprocessed counts of errors in a 15-minute interval, classified by the actual error message and the database where the error occurred. The actual error message is chosen as the classification criterion to differentiate between errors with the same error code but with different error messages.

For the purpose of the database schema issue advisor, only errors caused by the schema inconsistencies are tracked. The following errors collected in this way cover 95% of all schema inconsistency errors:
- 201 – Procedure '…' expects parameter '…'.
- 207 – Invalid column name '…'.
- 208 – Invalid object name '…'.
- 2812 – Could not find stored procedure '…'.
- 8144 – Procedure '…' has too many arguments.

The remaining 5% of errors caused by the schema inconsistencies are very rare and do not produce significant anomalies. In order to reduce the amount of data being collected and processed, the focus was given only to the listed, most frequent errors.

The anomaly detection algorithm is based on a comparison between the counts of collected schema inconsistency errors during the baseline period and the monitoring period. The baseline period consists of 60 2-hour intervals (5 days in recent history) and is used as a reference period for comparison with the current 2-hour monitoring period. For a fair comparison to the total error count in the monitoring period, the error count in the baseline period is normalized to a 2-hour time interval. The drift period, which consists of 12 2-hour intervals (one day), allows long-lasting anomalies to settle so that they do not affect the baseline period error count when the anomaly reoccurs soon after it has been resolved from the customer perspective.

The 2-hour interval duration was chosen after a careful analysis of telemetry data in order to increase the probability that an anomaly lasts for some time and to avoid transient issues which are not as significant for the customers.

Once the monitoring is started, the anomaly algorithm is executed every 15 minutes, and the baseline, drift and monitoring periods are shifted in 15-minute increments simultaneously. When a schema inconsistency anomaly is detected for one database, this anomaly is further monitored and tracked continuously. After it is resolved from the customer perspective, it will take another 2 days until the system marks this anomaly as resolved and stops monitoring it. The results of telemetry analysis indicated that the average time to resolve the schema inconsistency issue is 2 days. If the schema inconsistency does not reoccur during the cool-down period, it is considered completely resolved. Otherwise, if it reoccurs, the system continues to monitor it until the customer completely fixes the issue.

The Schema issue advisor, which preserves preprocessed telemetry about the errors, implements the anomaly detection model and keeps the detected anomalies for each customer database, is based on one instance of the Azure SQL Database. Portal and e-mail notification channels existed in the expanded platform, so the model database was added as an additional data source to the existing notification channel.

B. Continuous testing

After the proposed concept was implemented on top of existing technologies that the Azure stack offers, continuous testing and evaluation were carried out. In the service world, it is very important to know what is happening with the provided service, so continuous validation is unavoidable. Once the customers start using the schema issue advisor, if there is a problem with their database, they will rely on notifications about the issue from the Azure SQL database platform. It might also happen that some part of the pipeline does not work properly, which prevents sending the notifications to the users. If a customer detects a problem with its database without being informed by the advisor, the customer would notice and report the lack of the notification that it used to receive. Another requirement in case of a pipeline failure is to know exactly which part of the pipeline failed so that the engineers can investigate it relatively quickly and fix the issue.

For this purpose, a dedicated task called runner was implemented. The periodic test executed by this runner takes one production Azure SQL database that serves for testing purposes. It repeatedly and continuously executes test queries against the test database that generate the schema inconsistency errors. Then it waits for a predefined period of time for the telemetry to become available and for the issue to be detected by the anomaly detection algorithm. In the end it checks whether the raw telemetry data source has the information about the errors, whether the anomaly detection model detected it and whether the appropriate notification appears.

This model of continuous testing and validation can be applied to any kind of pipeline or process that includes different mutually connected components. In a system based on multiple services, it is desirable to have a clear separation between different components in order to make the investigation process easier once a failure occurs.

VI. EVALUATION

During the period of 10 months, there were more than 200,000 detected anomalies published on the Azure portal.
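
To make concrete what one such detected anomaly stands for, the window comparison described in Section V.A can be sketched as follows. This is a minimal illustration under stated assumptions, not the production implementation. The paper gives the window sizes (60 baseline intervals, 12 drift intervals and one monitoring interval of 2 hours each, over 15-minute counts) and the normalization of the baseline to a 2-hour interval. It does not state the decision rule or the exact position of the drift period, so the placement of the drift period between baseline and monitoring, the parameters `ratio_threshold` and `min_errors`, and all function names below are hypothetical.

```python
from typing import Sequence, Tuple

BUCKET_MINUTES = 15
BUCKETS_PER_INTERVAL = 120 // BUCKET_MINUTES   # 8 buckets form one 2-hour interval
BASELINE_INTERVALS = 60                        # 5 days of 2-hour intervals
DRIFT_INTERVALS = 12                           # 1 day of 2-hour intervals
BASELINE_BUCKETS = BASELINE_INTERVALS * BUCKETS_PER_INTERVAL
DRIFT_BUCKETS = DRIFT_INTERVALS * BUCKETS_PER_INTERVAL
MONITORING_BUCKETS = BUCKETS_PER_INTERVAL      # the current 2-hour window


def split_windows(counts: Sequence[int]) -> Tuple[Sequence[int], Sequence[int]]:
    """Return (baseline, monitoring) slices of per-15-minute error counts.

    `counts` holds the counts for one (database, error message) pair,
    oldest first. The drift period is assumed to sit between the two
    windows and is skipped, so a recent long-lasting anomaly does not
    inflate the baseline.
    """
    needed = BASELINE_BUCKETS + DRIFT_BUCKETS + MONITORING_BUCKETS
    if len(counts) < needed:
        raise ValueError(f"need at least {needed} buckets, got {len(counts)}")
    recent = counts[-needed:]
    baseline = recent[:BASELINE_BUCKETS]
    monitoring = recent[-MONITORING_BUCKETS:]
    return baseline, monitoring


def is_anomaly(counts: Sequence[int],
               ratio_threshold: float = 3.0,   # assumed, not from the paper
               min_errors: int = 10) -> bool:  # assumed, not from the paper
    """Compare the monitoring count with the baseline normalized to 2 hours."""
    baseline, monitoring = split_windows(counts)
    baseline_per_interval = sum(baseline) / BASELINE_INTERVALS
    monitored = sum(monitoring)
    return monitored >= min_errors and monitored > ratio_threshold * baseline_per_interval
```

Calling `is_anomaly` every 15 minutes on the most recent counts corresponds to shifting all three periods in 15-minute increments. The `min_errors` floor is only one possible way to ignore databases with negligible error volume.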
These anomalies were detected on production databases utilized by different customer workloads and applications. The initial implementation of the advisor framework displayed this kind of information only on the Azure portal, without direct e-mail notifications to the customers. E-mail notifications to the customers were added later. In the last 5 months, when e-mails with recommendations were sent to the users, their number was more than 100,000 (slightly more than the number of notifications that were displayed on the Azure portal only).

Regarding the average time to resolve the issue, our results show an improvement from 19 hours and 52 minutes, for portal notification only, to 12 hours and 54 minutes for recommendations that included e-mail notifications, which is a significant overall improvement of 54% (the portal-only time is about 1.54 times the time with e-mail). If we consider only the recommendations for the users (9.4% of all schema advisor users) that saw the issue through portal notification first, and then saw the issues through e-mail or portal after the e-mail was introduced, the average time to resolve was significantly reduced from 2 days and 45 minutes to 8 hours and 31 minutes, which is an improvement of more than 3 times.

Fig. 1 shows the distribution of the counts of the resolved schema issue recommendations according to the time to resolve the issue, for the customers that saw the recommendations through portal notification (with and without e-mail notification) and for issues that lasted less than 2 days. There is a noticeable pile of recommendations without e-mail notification (portal only) for which the time to resolve is almost 24 hours (gray x). In contrast, visible piles of recommendations with e-mail notification can be recognized on the same histogram for the times below 2 hours, around 6 hours, between 8 and 16 hours, and 20 hours (black crosses).

Fig. 1. The time to resolve the schema issues in hours with and without e-mail notifications in a period of 2 days.

As for the issues resolved in less than 2 hours, more of them were fixed (about 2000 more, or 4.7% of the observed dataset) when the e-mail notification was sent compared to those fixed when only the portal notification was sent. This is an additional indication that the average time to resolve the schema issue is reduced.

However, if we consider the issues resolved between 2 and 6 hours, the issues resolved with portal notifications only slightly prevail (about 380 more). It can be explained by the fact that the active users of the schema issue advisor usually rely on this advisor to catch their issues after the deployment. So, in case of such an inconsistency, they perform the rollback of the problematic deployment immediately. If the issue is not that severe, it takes up to one working day (between 6 and 8 hours) to fix the issue and deploy a new version.

VII. CONCLUSION

Some advanced features such as telemetry and database advisors in the Azure SQL database enable an analysis of the users' workloads and of the anomalies that inevitably happen in software development. We tried to provide a better service to the customers by deep profiling of the errors and direct alerting of the customers with e-mail notifications.

The development of the improved schema issue advisor is an example of a modern agile development process in the environment of cloud services. The rapid iterative implementation process was supported by simultaneous and continuous testing.

The evaluation confirmed a significant decrease in the time to resolve the issue when the recommendation was sent through e-mail notification, for the customers that eventually saw the portal notification. The response rate during the testing phase was higher than the response rate after the schema issue advisor was deployed. This is explained by the fact that customers usually neglect the template-formatted e-mails from the platform.

Possible future improvement could be oriented towards experimentation with personalized e-mails in order to better attract the attention of the customers. Another promising avenue could be to build an expert system. Based on gathered telemetry and more sophisticated models, even more insightful recommendations can be sent to the customers.

REFERENCES
[1] A. Dhiman, Analysis of on-premise to cloud computing migration strategies for enterprises. Ph.D. Dissertation. Massachusetts Institute Of Technology, USA, 2011.
[2] T. Boillat and C. Legner, "Why do companies migrate towards cloud enterprise systems? A post-implementation perspective.", In Proceeding of the 16th Conference on Business Informatics (CBI). IEEE, Geneva, Switzerland, 2014, pp. 102–109.
[3] P. Bernstein, I. Cseri, N. Dani, N. Ellis, A. Kalhan, G. Kakivaya, D. Lomet, R. Manne, L. Novik, and T. Talius, "Adapting Microsoft SQL server for cloud computing", In Proceeding of the 27th International Conference on Data Engineering. IEEE, Hannover, Germany, 2011, pp. 1255–1263.
[4] S. Stein and C. Rabeler, Azure SQL Database advisor, Create Index recommendations, (2016). [Online]. Available: https://docs.microsoft.com/en-us/azure/sql-database/sql-database-advisor
[5] W. Lang, F. Bertsch, D. DeWitt, and N. Ellis, "Microsoft Azure SQL database telemetry", In Proceeding of the Sixth ACM Symposium on Cloud Computing. ACM, Hawaii, USA, 2015, pp. 189–194.
[6] V. Nair, A. Raul, S. Khanduja, V. Bahirwani, Q. Shao, S. Sellamanickam, and S. Dhulipalla, "Learning a hierarchical monitoring system for detecting and diagnosing service issues", In Proceeding of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, Sydney, Australia, 2015, pp. 2029–2038.
[7] Amazon Web Services, Amazon RDS for SQL Server, (2016). [Online]. Available: https://aws.amazon.com/rds/sqlserver/
