SQLIA remains an ongoing problem for organisations of every type, including private, government and business-hosted web applications across the world, as intruders exploit vulnerable web applications to pilfer protected data from the database, with damaging data security ramifications. Pilfered or leaked data can then be used in various forms of criminality, including extortion. The emergence of big data and cloud-hosted services poses a functional challenge to earlier SQLIA mitigation approaches, which rely on string lookup of SQLIA signatures. ML offers an alternative to string signature lookup, but it lacks a robust data set for training a classifier in advancing SQLIA mitigation; the few data sets that exist are obsolete. In this thesis, we examined SQLIA artefacts, namely valid web requests, SQL tokens and SQLIA signatures, to derive a pattern-driven data set for training a supervised learning model, applying AI techniques to mitigate SQLIA.
Applying ML to SQLIA problems requires a data set, which prompted the following research questions. Can a pattern-driven data set be extracted from expected web requests, SQL tokens and existing SQLIA signatures, and be used to train and validate a supervised learning model? And can such a pattern-driven data set be extracted from any web application type context for SQLIA mitigation? Below are the novel contributions across the thesis chapters, which address the research goal of training an ML model to mitigate SQLIA using a pattern-driven data set obtained from the very web application type to be protected.
• We reviewed the existing literature on SQLIA to establish the need for a change from on-premise mitigation, which includes source code scanning and some form of query comparison, to a SQLIA mitigation that is amenable to the emerging computing of growing big data and cloud-hosted services and that can predict SQLIA on web requests in transit to the back-end database.
• We presented an ontology in Chapter 4 and related publications [61], [62] for crafting a pattern-driven data set from the web application type using the R string API, together with a technique to encode the data set into the vectors required to train a supervised learning model in MAML studio. We trained a supervised learning classification algorithm with this pattern-driven data set and validated the trained model under various algorithms. We observed high scores on the statistical performance metrics, including under cross-validation. The success of this conceptual approach in Chapter 4 led to further work in Chapters 5 and 6, which employs string hashing vectorisation in place of manual encoding. We also demonstrated a proof of concept of how the proposal would be applied in a real-world web application.
• We further replaced, in Chapter 5 and related publication [63], the numeric encoding of features presented in Chapter 4 with hashing vectorisation to obtain the vector matrices used to train the classification algorithms. To answer the research question of whether the intended web application type can produce the artefacts for a pattern-driven data set, we implemented a web application that expects dictionary words as valid input, while elements of SQL tokens and SQLIA type signatures substituted at the SQLI hotspots are predicted as SQLIA positive. The pattern-driven data set is used to train a supervised learning model employing the TC LR and TC SVM classification algorithms, with the better-evaluated and cross-validated classifier selected to predict SQLIA. The selected, trained TC SVM model is exposed as a web service, which is then consumed in a web form for input validation and in a proxy API at the cloud SDN that intercepts web requests in transit to a back-end database for SQLIA analysis.
• We demonstrated in Chapter 6 and related publication [64] a more robust method of procuring a pattern-driven data set based on the web application type: we derived related member strings using FSA and SFA, as opposed to the R string API technique presented in Chapter 5. The web application in question expects dictionary words as valid input, while elements of SQL tokens and SQLIA types substituted at the SQLI hotspots are predicted as SQLIA positive. The pattern-driven data set is used to train a supervised learning model employing the TC LR and TC SVM classification algorithms, with the better-evaluated and cross-validated classifier selected to predict SQLIA. The trained TC SVM model is exposed as a web service, which is then consumed in a web form for input validation and in a proxy API at the cloud SDN that intercepts web requests for SQLIA analysis. We observed that, by using the SFA technique to derive related member strings, we could generate a pattern-driven data set with features of related member strings of any size to train a classifier for SQLIA mitigation in real-world applications.
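The two data-set steps summarised in the contributions above, deriving related member strings from an automaton and turning strings into vectors by hashing, can be illustrated with a minimal sketch. It is not the thesis implementation (which used the R string API, FSA/SFA and MAML studio); the toy linear automaton `TAUTOLOGY_STAGES`, the tokeniser, the SHA-256-based bucket index, the 32-feature width and all function names are assumptions made purely for illustration.

```python
import hashlib
import re
from itertools import product

# Hypothetical toy automaton for one SQLIA family (tautology). Each stage
# lists the alternative tokens accepted at that state, so every path through
# the stages is one related member string.
TAUTOLOGY_STAGES = [
    ["'", '"'],                   # quote that closes the original literal
    [" OR "],                     # boolean connector
    ["1=1", "'a'='a'", "2>1"],    # always-true comparison
    [" --", " #"],                # comment that truncates the query
]


def member_strings(stages):
    """Enumerate every string accepted by a linear automaton of token choices."""
    return ["".join(path) for path in product(*stages)]


def tokenize(text):
    """Split lower-cased text into words, numbers and single symbols."""
    return re.findall(r"[A-Za-z_]+|\d+|\S", text.lower())


def hash_vector(text, n_features=32):
    """Feature hashing: count each token in a bucket chosen by a stable hash."""
    vec = [0] * n_features
    for tok in tokenize(text):
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_features
        vec[index] += 1
    return vec


def build_dataset(valid_words, stages, n_features=32):
    """Label expected inputs 0 and automaton-derived member strings 1."""
    rows = [(hash_vector(w, n_features), 0) for w in valid_words]
    rows += [(hash_vector(s, n_features), 1) for s in member_strings(stages)]
    return rows


dataset = build_dataset(["apple", "river", "window"], TAUTOLOGY_STAGES)
print(len(dataset), "labelled rows of width", len(dataset[0][0]))
```

Adding alternatives to a stage, or adding stages, grows the set of member strings multiplicatively, which mirrors the observation that the automaton-based technique can yield a data set of any size. The resulting rows of (vector, label) pairs are in the general shape a two-class classifier expects, although a real pipeline would use far more features and a much larger vocabulary of valid inputs.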
We conclude in this thesis that a pattern-driven data set to train a classifier can be inferred from SQLIA types, SQL tokens and expected valid web requests of a web application type. Recent advances in AI platforms such as Azure ML provide the MAML studio functionality to build, train, validate and cross-validate an ML model, as presented in the thesis. The trained model is then exposed as a web service that is consumed at multiple layers (client form and cloud SDN proxy) for ongoing SQLIA prediction and prevention.