Protecting Machine Learning Data from Cybercriminals

As any organisation in the sector will know, data is at the very core of Life Sciences. Whether an organisation is undertaking primary research, managing clinical trials, designing innovative technologies, or manufacturing products, data will be present that must be guarded from attacks. Unfortunately, these attacks have been on the upswing.

Looking at trends in the industry, the annual Verizon Data Breach Investigations Report (DBIR) highlights a significant uptick in reported incidents, with breaches growing by over seventy percent. While this reflects broader trends noted in the report, growth in the life sciences sector is greater than average: attackers are increasingly targeting the sector. Now that this attention has been focused, it is unlikely to be diverted unless the sector proves so resilient that attempts to extract data or funds prove fruitless.

The importance of this trend is amplified by attackers having become more aggressive where data theft is concerned, both in Life Sciences and elsewhere. The time between compromise and the exfiltration of data is shrinking, and attacks that previously only encrypted data for extortion purposes have shifted to a model where data theft and the threat of leaking that data are the standard mode of operation.

The means to address many of the underlying issues that leave organisations vulnerable are well known; good cyber hygiene and operating practices can protect against the majority of cyber incidents. What is less clear is how to address some of the emerging technology use cases within Life Sciences, in particular the increasing use of Machine Learning (ML) and AI models in compound research and development.

ML has been identified by the industry as a major source of innovation; the application of ML principles and practices to data analysis and compound design is reaping major benefits for the industry. That said, ML solutions come with their own set of cyber security challenges, and some of these challenges might not be immediately apparent to a cyber team unused to dealing with them.

Data security in machine learning (ML)

In an ML environment, data is used as both the trainer and the target for analysis. In both cases, the protection of data is key, but this protection goes much further than the traditional requirements for confidentiality seen in other industries.

Training, validation and test data are what is used to establish an ML model. These sets of carefully modelled and structured data are intended to give the ML model a comprehensive view of both its target outcomes and the operational data that can lead to those outcomes, then to validate that model, and finally to test it.

As such, these data sets are a make-or-break element in the establishment of your ML model; corrupted data can create a model which is inexplicably inaccurate, or which entirely misses matches that a properly trained model would catch.

Similarly, live data being examined by your operational ML model can only produce results as good as the data itself; consistency and accuracy are key for models. Data preparation for live analysis needs an appropriate level of rigour to ensure your ML can maximise its chances of achieving its goals.
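As an illustration of the kind of rigour meant here, the following is a minimal sketch of a consistency check applied to live records before they reach the model. The field names (`concentration_um`, `ph`) and their permitted ranges are hypothetical and chosen purely for the example; real checks would be defined by your data scientists.

```python
import math

# Hypothetical schema: field name -> (minimum, maximum) permitted value.
# These fields and ranges are illustrative assumptions, not a real standard.
EXAMPLE_SCHEMA = {
    "concentration_um": (0.0, 1000.0),
    "ph": (0.0, 14.0),
}


def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems found in one record (empty if it looks sane)."""
    problems = []
    for field, (low, high) in schema.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: wrong type")
        elif math.isnan(value) or not (low <= value <= high):
            problems.append(f"{field}: out of range")
    return problems
```

Records that fail such a check would typically be quarantined for review rather than silently dropped, so that repeated failures can be investigated as a possible sign of tampering.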

For all data cases, your data scientists will be expending considerable effort preparing data sets. The mechanics of this can be left to the experts, but cyber security needs to become involved when these data sets are completed; they need protection against everything from theft, because of the value of well-tuned data sets, through to malicious manipulation by attackers looking to sabotage your efforts.

Your teams need to be able to track data sets effectively; mechanisms need to be in place to ensure teams only work with the right data, and that the right data has been kept safe and secure from manipulation. Without this, your teams are open to issues ranging from accidental data alteration that disrupts model training through to deliberate data poisoning attacks that disrupt or entirely corrupt your ML environment.
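One common way to track data sets and detect manipulation is to record a cryptographic hash of every file when a data set is signed off, and to re-check those hashes before each use. The sketch below assumes the data set is a directory of files; the function names are illustrative, not taken from any particular tool.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data sets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its SHA-256 digest."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of_file(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(data_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Compare the directory against a stored manifest and list discrepancies."""
    current = build_manifest(data_dir)
    problems = []
    for name, expected in manifest.items():
        if name not in current:
            problems.append(f"missing: {name}")
        elif current[name] != expected:
            problems.append(f"modified: {name}")
    for name in current:
        if name not in manifest:
            problems.append(f"unexpected: {name}")
    return problems
```

A check like this is only as trustworthy as the manifest itself, so the manifest would need to be stored separately from the data, with restricted write access or a digital signature, for it to offer protection against a deliberate attacker rather than only accidental alteration.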

Model integrity

In ML deployments that operate in an online/Continual Learning model, businesses cannot always rely on data preparation to maintain previous performance. New data feeding into the model may spur changes which are better suited to a small subset of new data, but which fail to capture pattern matches that the model managed during its initial training. This performance drift can be caused by variations in available data, but could also be caused by deliberately manipulated data sets, potentially even manipulated by malicious insiders.

Regardless of the source of this data, it is important to note that data preparation is not sufficient to address this issue; once mature, ML models are typically all but impenetrable to their maintainers, making it difficult to know how data 'should' look. Further to this, data shaped to simply match previous data sets may remove the potential benefits of online processing, leaving the model at best maintaining state and at worst introducing overfitting risks that rob the model of the flexibility your business seeks to leverage.

To address this, ongoing monitoring is needed to detect this drift in operations such that appropriate action can be taken. This may take the form of model re-building, re-training with weighted data sets, or re-training with custom data sets designed to accommodate both the original training data and new data derived from recent operations.
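As a minimal sketch of such monitoring, one simple approach is to keep a fixed reference set drawn from the original training period and periodically re-score the live model against it, raising a flag when accuracy falls noticeably below the level recorded at deployment. The function names and the 0.05 tolerance below are assumptions made for illustration; suitable metrics and thresholds depend on the model and the business context.

```python
from typing import Callable, Hashable, Sequence, Tuple

Example = Tuple[object, Hashable]  # (features, expected label)


def accuracy(predict: Callable[[object], Hashable],
             examples: Sequence[Example]) -> float:
    """Fraction of reference examples the model currently labels correctly."""
    if not examples:
        raise ValueError("reference set must not be empty")
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


def drift_detected(predict: Callable[[object], Hashable],
                   reference_set: Sequence[Example],
                   baseline_accuracy: float,
                   tolerance: float = 0.05) -> bool:
    """True if accuracy on the fixed reference set has dropped by more than
    `tolerance` below the baseline recorded when the model was deployed."""
    return baseline_accuracy - accuracy(predict, reference_set) > tolerance
```

A flag raised by a check of this kind does not distinguish natural variation in the data from deliberate manipulation; it only signals that investigation, and possibly one of the re-training options above, is warranted.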

ML offers tremendous opportunities for life sciences and other sectors, but care needs to be taken to ensure that your cyber security team is aware of the challenges it brings and can support you appropriately. With the right training and awareness, your cyber team should have no issue working closely with your developers and data scientists to create and maintain appropriately secured ML models that bring significant benefits for your business.