• No results found

Big Data Quality Assurance

N/A
N/A
Protected

Academic year: 2021

Share "Big Data Quality Assurance"

Copied!
32
0
0

Loading.... (view fulltext now)

Full text

(1)

Copyright

1

_____________________

Roger Clarke

Xamax Consultancy, Canberra

Visiting Professor in Computer Science, ANU and in Cyberspace Law & Policy, UNSW

Redefining R&D Needs for Australian Cyber Security Australian Centre for Cyber Security (ACCS)

16 November 2015

http://www.rogerclarke.com/EC/BDQA {.html, .pdf}

Big Data

(2)

Copyright

2

Vroom, Vroom

The 'Hype' Factor in Big Data

• Volume

• Velocity

• Variety

(3)

Copyright

3

Vroom, Vroom

The 'Hype' Factor in Big Data

• Volume • Velocity • Variety • Value • Veracity Laney 2001, Livingston 2013

(4)

Copyright

4

Opportunities in the Security Area

• Network traffic

• Open Source Intelligence

• Social media postings – public

• Social media postings – organisation-internal

...

• Streams of data from eObjects

(5)

Copyright

5

Use Categories for Big Data Analytics

Population Focus • Hypothesis Testing • Population Inferencing • Profile Construction • Individual Focus • Outlier Discovery

• Inferencing about Individuals

• Inconsistencies

(6)

Copyright

6

Data

A symbol, sign or measure that is accessible to a person or an artefact

Empirical Data represents a real-world phenomenon

Synthetic Data does not

Quantitative Data on a Ratio Scale

is suitable for powerful statistical techniques

Quantitative Data on Ordinal and Cardinal Scales

is suitable for less powerful techniques

Qualitative Data on a Nominal scale is subject to limited analytical processes

(7)

Copyright

7

Data Collection

Data Collection is:

• for a purpose

• selective

Data Collection processes are constrained by cost,

which inevitably compromises the quality of the data

Data may be compressed at or after collection, e.g.

through sampling, averaging and filtering of outliers

(8)

Copyright 8 Entity and Attributes Real World Abstract World Record: Entifier + Data-Items Record: Identifier + Data-Items Identity and Attributes Record: Nym + Data-Items Identity and Attributes m n m n 1 1 1 n n n

Data needs to be Associated with (Id)Entities

(9)

Copyright

9

Data Quality Factors

Assessable at time of collection

• D1 – Syntactic Validity

• D2 – Appropriate (Id)entity Association

• D3 – Appropriate Attribute Association

• D4 – Appropriate Attribute Signification

• D5 – Accuracy

• D6 – Precision

• D7 – Temporal Applicability

(10)

Copyright

10

Information Quality Factors

Assessable only at time of use

• I1 – Theoretical Relevance • I2 – Practical Relevance • I3 – Currency • I4 – Completeness • I5 – Controls • I6 – Auditability http://www.rogerclarke.com/EC/BDBR.html#Tab1

(11)

Copyright

11

Data Quality Falls Over Time

Data Integrity deteriorates, as a result of:

• Storage Medium Degradation

• Loss of Context

• Changes in Context

• Changes in Business Processes

(12)

Copyright

12

Data Quality Falls Over Time

Data Integrity deteriorates, as a result of:

• Storage Medium Degradation • Loss of Context

• Change of Context

• Changes in Business Processes

• Loss of Associated (Meta)Data, e.g.

• Provenance of the data

• The Scale against which it was measured • Valid Domain-Values when it was recorded

• Contextual Information to enable interpretation

(13)

Copyright 13

Key

Decision Quality

Factors

• Appropriateness of the Inferencing Technique • Data Meaning • Data Relevance • Transparency • Process • Criteria http://www.rogerclarke.com/EC/BDQF.html#DeQF

(14)

Copyright

14

Transparency

Accountability depends on clarity

about the Decision Process

and the Decision Criteria

In practice, Transparency is highly variable:

Manual decisions – Often poorly-documented • Algorithmic languages

Process & criteria explicit (or at least extractable)

Rule-based 'Expert Systems' software

Process implicit; Criteria implicit

'Neural Network' software

(15)

Copyright

2014-15 15

Transparency & Accountability:

The Quality of Published Research Data

• Of 18 microarray studies, only 2 were fully

reproducible using the archived data [27]

• Of 19 papers in population genetics,

• 30% of analyses could not be reproduced • 35% of datasets were incorrectly or

insufficiently described [9]

• Of 100 datasets in nonmolecular biology,

56% were incomplete, and 64% were

archived such that reuse was much impaired

Roche DG, Kruuk LEB, Lanfear R, Binning SA (2015) Public Data Archiving in Ecology and Evolution: How Well Are We Doing? PLoS Biol 13(11), 10 Nov 2015

(16)

Copyright

2014-15 16

Transparency & Accountability:

The State of Play with Research Publications

• " ... reproducibility of much research has become

questionable, if not impossible [because it is] shrouded by the opaque use of computers"

• "... new 'scopes' ... see new patterns in ... data ..." • Empirical articles need to be supported by:

the data, defined in open formatsthe data model

the code

Marwick B. (2015) 'How computers broke science – and what we can do to fix it' The Conversation,

9 November 2015, at https://theconversation.com/

(17)

Copyright

17

Data Scrubbing / Cleaning / Cleansing

Problems It Tries to Address • Missing Data

• Low and/or Degraded Data Quality • Failed and Spurious Record-Matches • Differing Definitions,

(18)

Copyright

18

Data Scrubbing / Cleaning / Cleansing

Problems It Tries to Address • Missing Data

• Low and/or Degraded Data Quality • Failed and Spurious Record-Matches • Differing Definitions,

Domains, Applicable Dates

How It Works

• Internal Checks

• Inter-Collection Checks

• Algorithmic / Rule-Based Checks

(19)

Copyright

19

Data Scrubbing / Cleaning / Cleansing

Problems It Tries to Address • Missing Data

• Low and/or Degraded Data Quality • Failed and Spurious Record-Matches • Differing Definitions,

Domains, Applicable Dates

How It Works

• Internal Checks

• Inter-Collection Checks

• Algorithmic / Rule-Based Checks

Checks against Reference Data – ??Its Implications

• Better Quality and More Reliable Inferences • Worse Quality and Less Reliable Inferences

(20)

Copyright 2014-15 20

Big Data

Applied to

Security

cf.

Security Risks

arising from

Big Data

Applications

http://www.rogerclarke.com/EC/SSACS.html#App1

(21)

Copyright

21

Contexts of Quality Issues

• A Single Data-Collection

• A Consolidated / Multi-Source Data-Collection

• Scrubbing / Cleaning / Cleansing

• Inferencing Processes

(22)

Copyright

22

Organisational Risks – Internal

Security Considerations

• More Copies lie around

• Consolidation creates Honeypots

• Honeypots attract Attackers

(23)

Copyright

2014-15 23

Hacking Team frequent flyers and locations they visit http://labs.rs/en/metadata/

(24)

Copyright

24

Organisational Risks – Internal

Security Considerations

• More Copies lie around

• Consolidation creates Honeypots

• Honeypots attract Attackers

• Some Attacks succeed

Resource Misallocation

• Negative impacts on ROI

or Public Policy outcomes

(25)

Copyright

25

Scenario – Insider Detection

The Minister gives terse instructions about whistleblowers

(Brutus, Judas Iscariot, Macbeth, Manning, Snowden, ...)

The agency:

• Increases intrusiveness and frequency of employee vetting • Lowers the threshold for positive vetting

• Exercises its powers to gain access to and consolidate:

• border movements • credit history • court records • LEA persons-of-interest lists • financial tracking alerts • all internal communications • social media postings • Applies big data analytics to the consolidated database

(26)

Copyright

26

Personal Risks

Implications for Organisations

• Outlier Discovery

• Inferencing about Individuals

• "A predermined model of infraction" "Probabilistic Cause cf. Probable Cause"

• Non-Human Accuser, Unclear Accusation,

Reversed Onus of Proof, Unchallengeable

(27)

Copyright

27

Personal Risks

Implications for Organisations

Discrimination ‘Unfair’ Discrimination Breaches of Trust • Data Re-Purposing • Data Consolidation • Data Disclosure Morale

(28)

Copyright

28

Organisational Risks – External

Public Civil Actions, e.g. in Negligence • Prosecution / Regulatory Civil Actions:

• Against the Organisation

(29)

Copyright

29

Organisational Risks – External

Public Civil Actions, e.g. in Negligence

Prosecution / Regulatory Civil Actions:

• Against the Organisation

• Against Directors

Public Disquiet / Complaints /

Customer Retention / Brand-Value

(30)

Copyright

30

Risk Management for Big Data Projects

1. Frameworks 2. Data Consolidation 3. Effective Anonymisation 4. Data Scrubbing 5. Decision-Making http://www.rogerclarke.com/EC/BDRM.html

(31)

Copyright

31

Research Opportunities

• Indicators and Contra-Indicators

for particular Data Analytic Techniques

• Scenario Analyses ==>> Case Studies

• Data Scrubbing against external reference-points

• Quality Audit Techniques

for Data Scrubbing and for Inferencing

• Transparency Mechanisms for rule-based,

neural-net and machine-learning analytics

• Integration into QA, TRA and SRMP processes

• Cognitive Load Management incl. anomaly

(32)

Copyright

32

_____________________

Roger Clarke

Xamax Consultancy, Canberra

Visiting Professor in Computer Science, ANU and in Cyberspace Law & Policy, UNSW

Redefining R&D Needs for Australian Cyber Security Australian Centre for Cyber Security (ACCS)

16 November 2015

http://www.rogerclarke.com/EC/BDQA {.html, .pdf}

Big Data

References

Related documents

Leeds Core PoP Pathe News Akamai Virgin Radio Bogons Logicalis UK Pipex BBC Datahop InTechnology INUK RM Education LINX multicast NHS Redstone Google Global Transit

6.7 LSPs be required to provide clarity about: timescales for implementation of functionality which will support SAP/Common Assessment Framework; about support for

Sessions focused on (a) the opportunities and challenges of using big data for policy making and the 2030 Agenda; (b) existing technologies and innovative ways big data is

The TEQSA Regulatory Risk Framework identifies three priority risk consequence areas drawing on the Threshold Standards and Objects of the TEQSA Act, which are key to considering

This thesis explores the convergences and divergences in the personal conceptualizations of play, space and idealized reader constructs as they are manifest in the

In order to understand the pressure profile over wet surface of the sphere and corresponding jetting flow field, four sphere impacting speed conditions are conducted in the hydro

Selected 16 wells did not have the water quality problem in case of physical and chemical properties except nitrate nitrogen content.. In the case of pH, all 16 wells have pH

For example, CICS Transaction Gateway for z/OS V8R0 can pass user security identity information (a distributed identity) from a J2EE client in WebSphere Application Server across