Reducing data scarcity - Future directions

8.3 Future directions

8.3.1 Reducing data scarcity

Medical data is far from scarce. Well-curated, annotated public data, unfortunately, still is. The machine learning community has worked for decades on ways to reduce the need for large, finely annotated datasets, and some of these methods have recently found their way into medical image analysis. Two main trends can be distinguished: (1) methods that work with coarser annotations, thereby alleviating the burden of outlining every finding or annotating every sample in a dataset, and (2) methods that stimulate sharing of medical data or facilitate training of models without explicit sharing.

• Coarse annotations

– Semi-supervised learning

In the introduction, I described the commonly used distinction between supervised and unsupervised learning. In supervised learning, the parameters of a model are fit to a dataset $D = \{x_n, y_n\}_{n=1}^{N}$ of input $x$ and output $y$ examples. In the past several decades, many variations on these concepts have been explored. Semi-supervised learning is one of these and refers to the setting where the parameters of the model are fit to two datasets, labeled data $D_L = \{x_n, y_n\}_{n=1}^{N}$ and unlabeled data $D_U = \{x_n\}_{n=1}^{N}$, where $D_L$ and $D_U$ are generally assumed to be sampled from the same data-generating function. A simple example of a semi-supervised learning algorithm is to use a clustering method to group data based on similar characteristics and to assign each group the most frequently occurring label among the annotated samples in that group.
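The cluster-then-label idea can be illustrated with a minimal sketch. The code below is not a method used in this thesis; it assumes a plain k-means clustering with caller-supplied initial centroids, and the two-dimensional feature vectors and the labels "benign" and "malignant" are hypothetical toy data.

```python
from collections import Counter

def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))

def kmeans(points, centroids, n_iter=10):
    """Plain k-means with caller-supplied initial centroids."""
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for pt in points:
            groups[nearest(pt, centroids)].append(pt)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else cen
            for g, cen in zip(groups, centroids)
        ]
    return centroids

def cluster_then_label(labeled, unlabeled, init_centroids):
    """Cluster all inputs, then give each cluster its most frequent known label."""
    all_points = [x for x, _ in labeled] + list(unlabeled)
    centroids = kmeans(all_points, init_centroids)
    votes = [Counter() for _ in centroids]
    for x, y in labeled:
        votes[nearest(x, centroids)][y] += 1
    # A cluster without any labeled sample gets no label (None).
    cluster_label = [v.most_common(1)[0][0] if v else None for v in votes]
    return [cluster_label[nearest(x, centroids)] for x in unlabeled]

# Hypothetical toy data: two labeled and four unlabeled feature vectors.
labeled = [((0.0, 0.1), "benign"), ((5.0, 5.2), "malignant")]
unlabeled = [(0.2, 0.0), (0.1, 0.3), (4.8, 5.1), (5.3, 4.9)]
predicted = cluster_then_label(
    labeled, unlabeled, init_centroids=[(0.0, 0.0), (5.0, 5.0)]
)
print(predicted)  # ['benign', 'benign', 'malignant', 'malignant']
```

Only two samples carry a label here, yet all four unlabeled samples receive one because they fall into the same clusters as the labeled ones.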

Using this concept, massive amounts of unannotated data can be used to improve the performance of learning algorithms. Additionally, the concept could be used in an 'online' setting. Imagine a product running some classifier on incoming, unannotated data. Each scan the system classifies, from every customer, is unlabeled, but can be added to the training data to improve the performance of the system. Related to this concept is active learning: a paradigm where algorithms query annotators for labels and try to extract as much information as possible from each new label, thereby minimizing annotation effort.
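One common way to decide which samples to send to an annotator is uncertainty sampling: query the samples about which the current model is least certain. The sketch below is only an illustration under that assumption; the function names and the classifier probabilities are hypothetical.

```python
import math

def entropy(p):
    """Binary entropy (in bits) of a predicted positive-class probability."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_queries(probabilities, budget):
    """Return indices of the `budget` samples the model is least certain about."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: entropy(probabilities[i]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical classifier outputs for five unlabeled scans.
pool_probs = [0.98, 0.52, 0.10, 0.45, 0.80]
to_annotate = select_queries(pool_probs, budget=2)
print(to_annotate)  # [1, 3]
```

The two scans with probabilities closest to 0.5 are selected, since a label for those is expected to be the most informative.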

– Weakly labeled data

In this thesis, I have worked with data that contained annotations in the form of contours around lesions or microcalcifications in images. We can see this, somewhat ad hoc, as an {image + contour-label}-in → {image-label}-out setting, where 'in' refers to the training phase and 'out' to the prediction phase. Annotating contours is time-consuming and, in our case, unnecessary, since CNNs simply work with patches. Rough centers of lesions would therefore be sufficient input.

An even faster way to annotate data would be to provide labels at the image level rather than for every finding in the scan. This effectively removes information and is especially complicated when multiple diseases are present in an image, but it can reduce the annotation time for large datasets. Working with annotations at the image level, rather than at the bounding-box or contour level, is referred to as weakly labeled learning. For clarity, this is different from, though closely related to, the slightly more complicated task of multiple instance learning (MIL). Working with weakly labeled data can be seen as an {image, image-label}-in → {image-label}-out system and MIL as an {image, image-label}-in → {contour-label}-out system.
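The difference between the two outputs can be made concrete with a small sketch. It assumes the standard MIL assumption that an image is positive if at least one of its patches is positive, so the image-level score is the maximum over patch scores. The patch scores and the threshold of 0.5 are hypothetical.

```python
def image_score(patch_scores):
    """Image-level score under the standard MIL assumption: max over patches."""
    return max(patch_scores)

def localize(patch_scores, threshold=0.5):
    """Indices of patches scoring above the threshold (finding-level output)."""
    return [i for i, s in enumerate(patch_scores) if s > threshold]

# Hypothetical per-patch scores of a model trained with image-level labels only.
patch_scores = [0.05, 0.10, 0.92, 0.30]

print(image_score(patch_scores) > 0.5)  # True: image-label out
print(localize(patch_scores))           # [2]: location of the finding out
```

In both cases training only needs the image-level label; the MIL-style output additionally points to the patch responsible for the decision.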

– Unstructured labels

The idea of working with weakly labeled images can be taken one step further by not using structured labels at all, and instead working with the unstructured natural language labels provided by doctors. Whether this removes or adds information is difficult to say: contours provide exact delineations and possibly the joint knowledge of a radiology report and an experienced annotator. A full report may contain some additional information not conveyed by the contour and also carry the sentiment of the radiologist. In any case, working with such reports effectively renders the annotator unnecessary, and systems learning from natural sentences could save time and money.

Interesting studies would investigate the amount of data that is needed to obtain a certain performance with all these types of annotations. This way, the benefits and cost of annotation can be carefully considered before starting a new project.

• Data sharing

– Differential privacy and privacy-preserving deep learning

As mentioned above, a potential downside of deep learning methods is that large troves of annotated training data are typically required to make the models work properly. Institutions are reluctant or unable to share their data, some to maintain a competitive edge over other institutes, but some also purely because of privacy laws. Even if data is shared and properly anonymized, it could still reveal sensitive information. Although deep neural networks are difficult to interpret, there is evidence that at least part of the training data can be traced back by a so-called 'model-inversion' attack [83], even if the attacker only has access to the model's inputs and outputs. This information can, for instance, stem from overfitting, where the model memorizes part of the training data. This has been shown to expose a patient's identity in a genomics project [84].

Differential privacy [68, 133, 2] algorithms aggregate statistics about datasets while revealing as little information as possible about the identity of individual samples.
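A standard building block for such algorithms is the Laplace mechanism, which adds noise calibrated to how much a single individual can change the statistic. The sketch below applies it to a simple count; it is a minimal illustration, not the method of the cited works. The patient ages, the value of epsilon and the function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Epsilon-differentially private count of records matching `predicate`."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one patient changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical patient ages held by one institution.
ages = [34, 61, 47, 58, 72, 29]
rng = random.Random(0)
noisy = private_count(ages, lambda a: a >= 50, epsilon=0.5, rng=rng)
print(noisy)  # the true count (3) plus random noise
```

A smaller epsilon gives stronger privacy but a noisier answer; the released value is useful in aggregate while saying little about whether any single patient is in the dataset.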

CHAPTER 8. GENERAL DISCUSSION

Building on these concepts, methods have been proposed [237] that distribute the training of deep neural networks while ensuring that privacy is preserved. This way, deep neural networks can be trained on small datasets at individual institutions and aggregated at a center that controls the training process, without any data being shared explicitly. This concept is referred to as private multi-party machine learning and had a dedicated workshop at a major machine learning conference in 2016.⁷
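The idea of aggregating locally trained models at a center can be sketched with simple weight averaging, in the style of federated averaging. This is a swapped-in illustration and not necessarily the scheme proposed in [237]; it also omits the noise or encryption a real privacy-preserving system would add. The one-parameter linear model and the hospital datasets are hypothetical.

```python
def local_update(w, data, lr=0.01, epochs=20):
    """A few gradient-descent steps on one institution's private (x, y) pairs."""
    for _ in range(epochs):
        # Gradient of the mean squared error of the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, institutions):
    """Each site trains locally; the center sees only weights, never the data."""
    total = sum(len(d) for d in institutions)
    local = [(local_update(w_global, d), len(d)) for d in institutions]
    # Average the local weights, weighted by local dataset size.
    return sum(w * n for w, n in local) / total

# Hypothetical private datasets, both generated from y = 3x.
hospital_a = [(1.0, 3.0), (2.0, 6.0)]
hospital_b = [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)]

w = 0.0
for _ in range(10):
    w = federated_round(w, [hospital_a, hospital_b])
print(w)  # approaches the shared underlying slope of 3
```

Neither hospital ever transmits its (x, y) pairs; only the scalar weight travels between the sites and the center.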

– Block chain technology

Block chain technology, a revolutionary new technology blazoned as the internet of the 21st century, lies at the foundation of recently emerging cryptocurrencies such as Bitcoin and Ethereum, which are taking the world by storm at the time of writing. It relies on an immutable, distributed ledger: an electronic record of transactions to which new transactions are appended and then shared with all copies of the ledger, which is owned by everyone and no one. Hacking of medical data is rife, and medical records sometimes sell for orders of magnitude more than credit card details. Additionally, patients have little control over their data.
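The append-only property of such a ledger comes from chaining cryptographic hashes: every block stores the hash of its predecessor, so changing an old transaction invalidates everything after it. The sketch below shows only this hash chain, under the assumption of a single local copy; distribution over many nodes, consensus and mining are left out, and the transaction fields are hypothetical.

```python
import hashlib
import json

def block_hash(index, prev_hash, transaction):
    """SHA-256 over the block's contents, serialized deterministically."""
    payload = json.dumps([index, prev_hash, transaction], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append(ledger, transaction):
    """Append a transaction, linking it to the hash of the previous block."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    index = len(ledger)
    ledger.append({
        "index": index,
        "prev_hash": prev_hash,
        "transaction": transaction,
        "hash": block_hash(index, prev_hash, transaction),
    })

def is_valid(ledger):
    """Check every stored hash and every link to the preceding block."""
    for i, block in enumerate(ledger):
        expected_prev = ledger[i - 1]["hash"] if i else "0" * 64
        if block["prev_hash"] != expected_prev:
            return False
        recomputed = block_hash(block["index"], block["prev_hash"], block["transaction"])
        if block["hash"] != recomputed:
            return False
    return True

ledger = []
append(ledger, {"patient": "p-001", "grants_access_to": "institute-A"})
append(ledger, {"patient": "p-002", "grants_access_to": "institute-B"})
print(is_valid(ledger))  # True

# Tampering with an old record is detected when the chain is re-verified.
ledger[0]["transaction"]["grants_access_to"] = "attacker"
print(is_valid(ledger))  # False
```

In a real block chain, many independent parties hold a copy of the ledger and run this kind of verification, which is what makes silent modification impractical.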

Using block chain technology, electronic health records (EHR) go into a ledger and are safely transmitted to other institutions.⁸ This way, patients will have more control over their data, with improved security and easier access for research institutes. The MedRec system [7] is a prototype developed by the MIT Media Lab that relies on the Ethereum [33] block chain. The miners of the system are medical researchers, who are rewarded with data instead of digital currency. This technology is already implemented in the Bowhead system,⁹ which analyzes blood and saliva samples that patients can sell anonymously and transmit through a block chain to research institutes.

In document Computer aided diagnosis of breast cancer in mammography using deep neural networks (Page 105-107)