Takeaways from ODSC Europe

Last week Martine Ros and I visited London for the Open Data Science Europe Conference. What a relief to be at a relatively non-commercial data science conference. We were able to catch up on some serious research.

My takeaways:

1) Ethical AI is essentially about methodology. About applying basic and widespread methodological concepts already known, practiced and institutionalized in other scientific disciplines.

2) Testing and tuning algorithms will be automated. Without all the manual tweaking, we have more time for better business analysis, applying methodology, data governance & data management.

3) Under the hood, machine learning techniques are still developing. Neural networks are not only mimicking human reasoning, but also memory and attention. Awesome.

I’ll explain them in a bit more detail.

1) Ethical AI is essentially about methodology

A lot of talks at ODSC Europe were related to ethical AI. Interpretability, explainability, reliability, safety and fairness are all hugely important, of course. However, sometimes it seems to me we are very busy reinventing the wheel.

An algorithm represents knowledge. Knowledge on ‘how to do things’ (for example: driving a car) or knowledge on ‘how things are’ (for example: the chance of an insurance claim being fraudulent or not). The first type of knowledge is sometimes called a skill. The second is called propositional knowledge. I will focus on this second type. Before we used algorithms, we used protocols or rule-based systems to represent knowledge. And of course, we still use them. Examples are medical protocols or complex rules for calculating the risk of an insurance claim. The difference is that algorithms can be made very adaptive and self-learning, while protocols and rules are often quite static. There are controlled processes for updating them. This makes it difficult to incorporate new insights in a timely manner.

A lot of those protocols and rule-based systems (not all of them) are based on scientific evidence. It is impossible for any human to understand all the science behind all decisions influenced by protocols or rules. This is why we have methodology in science. It enables trust. Trust in decisions based on evidence we are not always able to understand. This empirical scientific method has worked very well for centuries. (At least for a long time. Nowadays a serious number of people distrust scientists when outcomes are not in line with their own views and opinions, and trust lying politicians instead…)

During the years I’ve been working within the Utrecht University Hospital I’ve learned a lot about the way knowledge is derived in a medical scientific institute. How the quality systems controlling this process work. How the rights and safety of patients and participants are monitored. I’ve contributed myself to making the way data is collected and preserved part of this quality system. It is remarkable how much of the already established way of doing medical scientific research is about eliminating bias and about ensuring interpretability, reliability and fairness.

It seems to me the science part in data science does not really refer to applying empirical scientific methods to validate the trustworthiness of the analysis and the interpretation of the results. More likely it refers to inventors, the Gyro Gearloose type of scientist. If we want our algorithms to be trustworthy, we need to think like empirical scientists when we design how algorithms are created and monitored.

I would recommend that everyone, from any discipline, who is currently debating how to achieve trustworthy and ethical AI reads up on the history and philosophy of science and follows a course on research methodology and statistics. And that they inform themselves about how medical research is regulated to control the impact on both participants and the people affected by the outcomes of studies. The CCMO website is an excellent resource. Of course, there are differences, but the main concepts and ideas are applicable to how algorithms can be regulated.

2) Testing & Tuning will be automated within 10 years

Ironically, the work of a data scientist can easily be automated by... algorithms. This so-called AutoAI or AutoML is not even a complex algorithm, but simply a method of selecting the highest performing algorithm by looping through candidate algorithms and their tuning options. Methodology becomes even more important once the data scientist is automated. For example, we need to decide upfront what the algorithm should optimize for. Minimizing false positives? Or minimizing false negatives? And we need to define a metric for fairness, so the AutoAI can take that into account. This can’t be done without a thorough understanding of the problem we are solving.
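To make the 'looping through candidates' idea concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the toy fraud data, the two rule-style candidates, and the names `auto_select`, `fn_weight` and `fp_weight`. Real AutoML tools search far larger spaces and train on separate data, but the selection loop has the same shape. Note that the human still has to decide how expensive a false negative is compared to a false positive.

```python
# Toy validation data (entirely made up): (claim_amount, prior_claims) -> is_fraud
DATA = [
    ((1200, 0), 0), ((300, 1), 0), ((9500, 4), 1), ((7000, 0), 0),
    ((8800, 3), 1), ((450, 0), 0), ((6000, 5), 1), ((2000, 2), 0),
    ((9900, 1), 1), ((700, 3), 0),
]


def amount_rule(threshold):
    """Candidate 1: flag a claim when the amount reaches the threshold."""
    return lambda x: int(x[0] >= threshold)


def history_rule(threshold):
    """Candidate 2: flag a claim when the number of prior claims reaches the threshold."""
    return lambda x: int(x[1] >= threshold)


# Hypothetical search space: each candidate with the tuning options to try.
CANDIDATES = {
    "amount_rule": (amount_rule, [1000, 5000, 8000]),
    "history_rule": (history_rule, [1, 3, 5]),
}


def cost(model, data, fn_weight, fp_weight):
    """Weighted error count. The weights encode what we optimize for."""
    fn = sum(1 for x, y in data if y == 1 and model(x) == 0)
    fp = sum(1 for x, y in data if y == 0 and model(x) == 1)
    return fn_weight * fn + fp_weight * fp


def auto_select(data, fn_weight=1.0, fp_weight=1.0):
    """Loop over all candidates and options; keep the one with the lowest cost."""
    best = None
    for name, (factory, options) in CANDIDATES.items():
        for option in options:
            score = cost(factory(option), data, fn_weight, fp_weight)
            if best is None or score < best[0]:
                best = (score, name, option)
    return best


print(auto_select(DATA))                 # equal weights
print(auto_select(DATA, fp_weight=5.0))  # false positives are expensive
```

With equal weights the loop picks the amount rule at 5000. Make false positives five times as expensive and it picks the stricter threshold of 8000. The loop is trivial. Choosing the weights is not, and a fairness metric would have to be added to the cost function in the same deliberate way.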

AutoAI can’t automate the understanding and availability of data. There is a causal relationship between the trustworthiness of algorithms and the trustworthiness of the data the algorithm consumes. The way organizations care for their data assets will become even more important as the use of algorithms increases. Instead of regarding data as a side effect of applications, we should treat it as our primary asset and make sure data is understandable, accessible and usable without a lot of fixing downstream. This cannot be achieved within the scope of a single project. It requires major changes in how applications are architected and developed. Examples here in the Netherlands are Common Ground and Registratie aan de bron. The Dutch Tax Office (my current client) is also taking huge steps towards becoming data centric.

3) Under the hood

Most popular techniques in AI and data science have been known for decades. The only difference is that they are now widely available, closed or open source, for everybody to use. I was surprised and impressed by how they have evolved over the past few years. Neural nets are not only used for reasoning: recurrent neural nets are kind of mimicking memory, and transformers add something that can be explained as attention. The huge number of parameters to be trained requires huge training sets and quite a bit of compute. But not every big problem is solved with big data. There are very good algorithms available which don’t require thousands of parameters to be trained, for example Gaussian curves.
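For readers wondering what 'attention' means in practice, here is a minimal sketch of scaled dot-product attention, the basic operation inside transformers, in plain Python. The function names and the toy vectors are my own, and a real model learns its queries, keys and values instead of having them hard-coded.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output


# Toy example: the query resembles the first key, so the first value dominates.
weights, output = attention(
    query=[2.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights, output)
```

The weights express how much 'attention' the query pays to each input. Nothing more mysterious than a weighted average, but stacked and trained at scale it works remarkably well.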

Multi-agent systems are less futuristic than they were 13 years ago, when I had to incorporate the theory into my master’s thesis. Still quite far away, though. Michael Wooldridge was quite convincing that multi-agent systems are the future of AI… The only question is when our Teslas are going to communicate with each other…

To conclude: the sexiest job on earth is going to be automated soon. I suggest we start thinking about a successor. I would vote for methodologist or data architect. Because we are doomed without sound methodology and good quality data.