The Agency of Computer Vision Models as Optical Instruments
As a postdoc for REACT, I focussed on how images shape the memory of activism and memory in activism. As part of this focus, I hoped to demonstrate how newly developed distant reading and viewing techniques could shed new light on these two important aspects of the memory-activism nexus. For example, in our paper “Quantifying Iconicity in 940K Online Circulations of 26 Iconic Photographs” we used document embeddings to study the relationship between the 940,000 online circulations of 26 iconic photographs and the text surrounding them on the webpage. I applied different distant reading/viewing techniques in three other forthcoming articles.
During the research for these articles, I also became interested in the present-day implications of these ML techniques. Using concepts from STS (Latour) and visual culture studies (Crary), we started studying visual ML techniques as optical instruments. This research resulted in the article “The agency of computer vision models as optical instruments”, which was just published in the journal Visual Communication.
Industry and governments have deployed computer vision models to make high-stakes decisions in society. While they are often presented as neutral and objective, scholars have recognized that bias in these models might lead to the reproduction of racial, social, cultural and economic inequity. A growing body of work situates the provenance of bias in the collection and annotation of datasets that are needed to train computer vision models. This article moves from studying bias in computer vision models to the agency that is commonly attributed to them: the fact that they are universally seen as being able to make biased decisions. Building on the work of Bruno Latour and Jonathan Crary, the authors discuss computer vision models as agential optical instruments in the production of contemporary visuality. They analyse five interconnected research steps – task selection, category selection, data collection, data labelling and evaluation – of six widely cited benchmark datasets, published during a critical stage in the development of the field (2004–2020): Caltech 101, Caltech 256, PASCAL VOC, ImageNet, MS COCO and Google Open Images. First, they found that, despite all sorts of justifications, the selection of categories is not based on any general notion of visuality, but depends heavily upon perceived practical applications and the availability of downloadable images and, in conjunction with data collection, favours categories that can be unambiguously described by text. Second, the reliance on Flickr for data collection introduces a temporal bias in computer vision datasets. Third, by comparing aggregate accuracy rates and ‘human’ performance, the dataset papers introduce a false dichotomy between the agency of computer vision models and human observers. In general, the authors argue that the agency of datasets is produced by obscuring the power and subjective choices of their creators and the countless hours of highly disciplined labour of crowd workers.