How to learn in the presence of untrustworthy intermediaries

By Ari Karchmer

Alice, a secret hypothesis, and a hostile lab: a fictional story (allegedly)

Alice is a graduate student. She studies molecular biology, and her goal is to develop a model for the relationship between a molecule’s structure – i.e., its size, shape, or other attributes – and its activity: whether the molecule binds to some fixed protein (for example).

Being a superstar researcher, Alice has already developed a promising hypothesis model. In order to fine-tune it, she encodes her hypothesis into a sequence of experiments to run in her research group’s lab. The experiments will identify the best possible parameters for her model, and they may be adaptive in nature (i.e., performed sequentially, with each one depending on previous results).

A complication arises: what if Alice is concerned about having her hypothesis – and any expert knowledge used to influence that hypothesis – revealed? Her experiments are done in her university’s public lab, and in particular, a lab mate named Eve the eavesdropper will observe both the experiments and the results. The rest of this post is concerned with the following pertinent question:

How can Alice design her experiments so that she is guaranteed to obtain ideal model parameters, while preventing Eve from gaining any knowledge about Alice’s model, or even any expert knowledge used to build the model (like the hypothesis and experiment design)?

Covert Learning: how to learn with an untrusted intermediary

A recent research paper [1], coauthored with Prof. Ran Canetti, tackles this question. Beyond finding a suitable way to formalize the question in the first place, the work develops “Covert Learning” protocols that achieve the following high-level guarantees, described in the context of the above story:

  1. Learning: If Alice designs experiments according to the protocol, she can leverage them into ideal parameters for her model (with very high probability).
  2. Hypothesis-hiding: If Alice designs experiments according to the protocol, then any adversarial entity that observes all the experiments cannot obtain any information about the underlying hypothesis behind the experiments. Here, Alice might have some secret information (like the outcome of a sequence of coin flips) to help her.
  3. Concept-hiding: If Alice designs experiments according to the protocol, then any adversarial entity that observes all the experiments and the results of the experiments cannot obtain any information about the underlying natural phenomena that govern the “structure vs. activity” relationship.
  4. Verifiability: If Alice designs experiments according to the protocol, then even if an adversary changes the results to sabotage her model, Alice can identify when this has occurred and abort the process (with very high probability).

The purpose of this blog post is to illuminate how one might begin to think about the first two of these guarantees.

Learning, and why hypothesis-hiding is a natural goal

At the most basic level, we want Alice to handle the presence of untrustworthy intermediaries without losing her ability to conduct her research effectively. Indeed, the learning guarantee Alice obtains is that, as long as she follows the protocol, the chance that she fails to identify the ideal parameters for her model is low. This guarantee is essentially as strong as those achieved by typical algorithms from computational learning theory (which do not consider any untrustworthy intermediaries like Eve).

The novelty of our work lies in the ability to hide the hypothesis that determines the designed experiments. Let us gain a slightly more formal understanding of what a hypothesis really is in this context.

One of the prevailing frameworks for learning (in the computational sense) uses a parameter called a hypothesis class. The hypothesis class is essentially a set of possible condensed descriptions of whatever natural phenomenon is being modeled. In our running example of the bio lab, the natural phenomenon is the structure vs. activity relationship. Now, because there is a (possibly infinite) number of very complex structure vs. activity relationships, the hypothesis class is usually just a small subset of such relationships. That is to say, the true description may not be contained in the hypothesis class. Therefore, learning the hypothesis class means identifying the description within the class that best approximates the real-world observations.
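To make this concrete, here is a minimal toy sketch in Python (my own illustration, not code from [1]): the “phenomenon” is an unknown Boolean function of molecular features, the hypothesis class is the set of single-feature rules, and learning the class amounts to picking the rule with the lowest error on the observed data.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 6 binary structural features per molecule, plus an observed
# activity label that no single-feature rule explains perfectly.
X = rng.integers(0, 2, size=(200, 6))
noise = (rng.random(200) < 0.05).astype(int)
y = (X[:, 0] & X[:, 3]) ^ noise            # "ground truth" + a little noise

# Hypothesis class H: all rules of the form "active iff feature i equals v".
H = [(i, v) for i, v in itertools.product(range(6), [0, 1])]

def empirical_error(h):
    i, v = h
    return np.mean((X[:, i] == v).astype(int) != y)

# Learning the class = picking the member of H that best fits the data,
# even though the true relationship lies outside the class.
best = min(H, key=empirical_error)
print("best hypothesis:", best, "with error:", empirical_error(best))
```

Because the true relationship here involves two features, even the best single-feature rule is imperfect, which is exactly the sense in which the quality of the learned model is capped by the quality of the hypothesis class.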

The astute reader may now recognize that the choice of hypothesis class is crucial. Indeed, the accuracy of the resulting description of the real world is tethered to how “good” the hypothesis class is. “Good” in this sense refers to how close the descriptions contained in the hypothesis class are to the “ground truth.” Because of this, it’s clear that knowing a good hypothesis class is valuable information, and our hypothesis-hiding guarantee is well-motivated. To the best of our knowledge, all known “normal” learning algorithms fail to provide any meaningful hypothesis-hiding guarantee. In fact, some well-known examples exist that fully reveal their hypothesis (we will not dive into these in this post).

Formalizing hypothesis-hiding

The basic idea to formalize and achieve hypothesis-hiding is to force the learning algorithms to (ostensibly) conform to a single type. To elaborate, we propose to enforce the constraint that – no matter the chosen hypothesis class – the generated experiments all look alike. The fields of cryptography and secure multiparty computation already give us tools to formalize this. Specifically, concepts like the real vs. ideal world security paradigm and computational indistinguishability allow us to define hypothesis-hiding as the guarantee that (informally):

There exists an efficient algorithm, called a simulator (with access to random coin flips), such that for any hypothesis class H, the distribution over the output of the simulator and the distribution over the experiments selected by a learning algorithm for H are computationally indistinguishable.

This basically means that no practical adversary can distinguish a world where the class H is being learned from a world where some other class H* is being learned. More generally, we are requiring that the distribution of experiments chosen by a learning protocol for any specific class H is reproducible by the simulator (which has no knowledge of H). Without getting too technical, this paradigm ensures that no information about the underlying hypothesis class of the learning protocol is revealed to any practical adversary.
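In symbols (notation mine, condensed from the informal statement above): for a security parameter n, hypothesis-hiding asks for a probabilistic polynomial-time simulator Sim such that, for every hypothesis class H and every efficient distinguisher D,

$$\Big|\,\Pr\big[D(Q_H) = 1\big] \;-\; \Pr\big[D(\mathsf{Sim}(1^n)) = 1\big]\,\Big| \;\le\; \mathrm{negl}(n),$$

where $Q_H$ denotes the distribution over experiment transcripts generated by the learning protocol for $H$, and $\mathrm{negl}(n)$ is a negligible function. Note the quantifier order: a single simulator, which never sees $H$, must reproduce the experiment distribution for every class.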

So, how do you actually achieve hypothesis-hiding? 

One basic technique, which we use extensively in our recent work [1], is to produce “pseudorandom masks” for each of the experiments we desire to run. The “masks” are essentially a way to take any experiment and randomize it, so that it could plausibly have been generated to learn any other hypothesis class (rather than being tailored to a specific one). Thus, when viewed together, a set of masked experiments is indistinguishable from a set of truly random experiments.

The pseudorandom masks are enough to satisfy hypothesis-hiding. But what would be the point if we couldn’t simultaneously achieve the basic learning guarantee (recall: Alice needs to be able to learn the best description within the hypothesis class, using the experimental results)? Our solution is to choose the masks in such a way that they encode some secret information (one can think of this as the outcome of a sequence of coin flips executed in the privacy of Alice’s home). Then, we design a way for Alice to leverage this secret information to decode the results on the masked experiments into results on the unmasked experiments. The unmasked experiments, of course, are chosen to implement whatever learning algorithm Alice would like. For the more technical reader: we show how to generate the pseudorandom masks using a learning parity with noise (LPN) type of pseudorandom distribution, and then leverage the natural homomorphic properties of LPN to undo the masks. Since the unmasked experiments are chosen according to Alice’s desired learning algorithm, she can now process the results into the ideal model for the structure vs. activity relationship!
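For readers who want something runnable, here is a toy Python sketch of the LPN ingredient alone. To be clear, this is a simplified illustration and not the protocol from [1]; it only demonstrates the two properties the protocol exploits: LPN samples look random to anyone who does not hold the secret, yet they carry linear structure (plus sparse noise) that the secret holder can strip away.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, tau = 128, 512, 0.05   # secret length, number of samples, noise rate

# Alice's secret randomness (the "coin flips in the privacy of her home").
s = rng.integers(0, 2, size=n)

# LPN samples: uniformly random vectors a_i, each paired with <a_i, s> + e_i
# (mod 2), where e_i is a sparse noise bit. Under the LPN assumption, (A, b)
# is computationally indistinguishable from uniform bits to anyone without s.
A = rng.integers(0, 2, size=(m, n))
e = (rng.random(m) < tau).astype(int)
b = (A @ s + e) % 2

# Homomorphic structure: XORing two samples yields another valid sample,
# whose noise bit is the XOR of the original noise bits.
a_sum = A[0] ^ A[1]
b_sum = b[0] ^ b[1]
assert b_sum == (a_sum @ s + (e[0] ^ e[1])) % 2

# Only the holder of s can strip off the linear part <a_i, s>, leaving just
# the sparse noise, which is where decodable information can be hidden.
residual = (b + A @ s) % 2   # equals e, since addition = subtraction mod 2
print("fraction of noisy positions:", residual.mean())
```

In the actual protocol, the masked experiments are built from a distribution of this flavor, and this same linearity, together with Alice’s knowledge of the secret, is what lets her translate results on the masked experiments back into results on the experiments she intended to run.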

In Conclusion

I’ll end this blog post by elaborating on real-world applications of hypothesis-hiding learning algorithms. The story at the beginning of this piece, perhaps surprisingly, is not too different from what occurs in the real-world drug discovery industry. A recent trend in the industry involves delegating elements of the R&D process to “well-equipped and specialized third parties who can carry out the necessary biological experiments on behalf of the drug companies” [1]. This paradigm has greatly enhanced the efficiency of the drug discovery process (for more information, see [2] and the references therein). However, the outsourcing of experiments carries with it the inherent risk of exposing the experimental design (intellectual property). As mentioned in our work [1], much of the proprietary knowledge underpinning pharmaceutical science is now generated through outsourcing [2]. Hypothesis-hiding learning protocols offer a potential solution to this problem by guaranteeing that no intellectual property behind the experimental design is revealed to the contracted third party, let alone to any competing firms.

There are many more potential real-world applications of hypothesis-hiding learning protocols (and more generally, the full covert learning protocols), so we will conclude by merely listing a few:

  1. Other fields falling under the umbrella of outsourced automated scientific discovery, e.g., functional genomics, program synthesis, and VLSI testing.
  2. As attacks against machine learning as a service (MLaaS) platforms in the cloud, where interfaces to machine learning models are exposed to the public and at risk of being stolen.

Bibliography

[1] Ran Canetti and Ari Karchmer. Covert learning: How to learn with an untrusted intermediary. In Kobbi Nissim and Brent Waters, editors, Theory of Cryptography, pages 1–31, Cham, 2021. Springer International Publishing.

[2] David E. Clark. Outsourcing lead optimization: the eye of the storm. Drug Discovery Today, 16(3-4):147–157, 2011.

Ari Karchmer is a fourth-year PhD student at Boston University advised by Ran Canetti. His research interests are at the intersection of cryptography and computational learning theory, and theoretical computer science more broadly. You can visit his personal website at www.arikarchmer.com.
