Scuba is Facebook’s fast slice-and-dice data store. It stores thousands of tables in about 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second, with most response times under 1 second.
Lior Abraham et al., “Scuba: Diving into Data at Facebook”, 2013.
Interested parties like to explain the culture industry in technological terms. Its millions of participants, they argue, demand reproduction processes that inevitably lead to the use of standard processes to meet the same needs at countless locations […] In reality, a cycle of manipulation and retroactive need is unifying the system ever more tightly.
Theodor Adorno and Max Horkheimer, “The Culture Industry: Enlightenment as Mass Deception”, in Dialectic of Enlightenment, 1944.
Being able to iterate quickly on thousands of models will require being able to train and score models simultaneously. This approach allows Cisco (an H2O customer) to run 60,000 propensity-to-buy models every three months, or allows Google to not only have a model for every individual, but to have multiple models for every person based on the time of the day.
Alex Woodie, 9 February 2015
One important aspect of the “big data” revolution is how it has affected the media and culture industries. Note that I am not saying “digital culture” or “digital media”, because today all cultural industries create digital products that are disseminated online. This includes games, movies, music, TV shows, e-books, online advertising, apps, etc. So I don’t think we need to add the word “digital” anymore.
The companies that sell cultural goods and services online (for example, Amazon, Apple, Spotify, Netflix), organize and make searchable information and knowledge (Google), provide recommendations (Yelp, TripAdvisor), enable social communication and information sharing (Facebook, QQ, WhatsApp, Twitter, etc.) and media sharing (Instagram, Pinterest, YouTube, etc.) all rely on computational analysis of massive media data sets and data streams. This includes information about online behaviour (browsing pages, following links, sharing posts, “liking”, purchasing), traces of physical activity (posting on social media networks in a particular place at a particular time), records of interaction (online gameplay) and cultural “content” – songs, images, books, movies, messages and posts. Similarly, human–computer interaction – for example, using a voice-user interface – also depends on computational analysis of countless hours of data, in this case voice commands.
For example, to make its search service possible, Google continuously analyses the full content and mark-up of billions of web pages. It looks at every page on the web it can reach – the text, layout, fonts used, images and so on – using over 200 signals in total. To be able to recommend music, the streaming services analyse the characteristics of millions of songs. For example, The Echo Nest, which powers Spotify, has used its algorithms to analyse 36,774,820 songs by 3,230,888 artists. Spam detection involves analysis of the texts of numerous emails. Amazon analyses the purchases of millions of people to recommend books. Contextual advertising systems such as AdSense analyse the content of web pages in order to automatically select relevant ads for display on those pages. Video game companies capture the gaming actions of millions of players to optimize game design. Facebook algorithms analyse all updates by all your friends to automatically select which ones to show in your feed – and they do this for every one of Facebook’s 1.5 billion users. According to estimates, in 2014 Facebook was processing 600 terabytes of fresh data per day.
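To make concrete what even the most elementary form of such content analysis involves, here is a minimal, hypothetical sketch in Python: a web page and a few candidate ads are represented as TF-IDF word-frequency vectors, and the ad whose text is most similar to the page is selected. The page text, the ads and the matching scheme are all invented for illustration; a production system such as AdSense is incomparably more sophisticated.

```python
# A hypothetical sketch of contextual matching: represent a web page and
# candidate ads as TF-IDF vectors and pick the ad most similar to the page.
# Illustration only - not how any real advertising system actually works.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

page_text = "Review of the new noise-cancelling headphones: sound quality, battery life, comfort"
ads = {
    "headphones_sale": "Wireless noise-cancelling headphones, 30% off this week",
    "travel_deal":     "Cheap flights to Lisbon, book your summer holiday now",
    "cooking_class":   "Learn to cook Italian pasta with a professional chef",
}

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([page_text] + list(ads.values()))

# Cosine similarity between the page (row 0) and each ad (rows 1..n).
scores = cosine_similarity(matrix[0], matrix[1:]).flatten()
best_ad = list(ads)[scores.argmax()]
print(best_ad, scores)  # the headphones ad should score highest here
```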
The development of algorithms and software systems that make all this analysis possible is carried out by researchers in a number of academic fields, including machine learning, data mining, computer vision, music information retrieval, computational linguistics, natural language processing and other areas of computer science. The newer term “data scientist” refers to professionals with advanced computer science degrees who know contemporary algorithms and methods for data analysis (described by the overlapping umbrella terms “data mining”, “machine learning” and “AI”), as well as classical statistics. Using current technologies, they can implement the gathering, analysis, reporting and storage of big data. To speed up the progress of research, most top companies share many parts of their key code. For example, on 9 November 2015, Google open-sourced TensorFlow, the machine learning system that powers many of its data and media analysis services. Companies have also open-sourced their software systems for organizing massive datasets, such as Cassandra and Hive (both originally developed at Facebook).
The practices involved in the massive analysis of content and interaction data across media and culture industries were established between approximately 1995 (early web search engines) and 2010 (when Facebook reached 500 million users). Today they are carried out daily by every large media company – and increasingly in real time.
This is the new “big data” stage in the development of modern technological media. It follows on from previous stages such as mass reproduction (1500-), broadcasting (1920-) and the Web (1993-). Since the industry does not have a single term for referring to all of the practices described here, we can go ahead and coin a temporary name. Let’s call them media analytics.
To the best of my knowledge, media and communications scholars have yet to clearly describe this novel aspect of contemporary media. After around 2013, we start to see more discussion of the social and political issues around the use of large-scale consumer and social media data and automatic algorithms: data and law, data and privacy, data and labour, and so on. The events at the NYC-based Data & Society Institute offer many examples of such discussions, as did the Governing Algorithms conference at NYU in 2013 and the Digital Labor conference at the New School for Social Research in 2014. In terms of publications, the academic journal Big Data & Society, founded in 2014, is of central significance.
However, I have not yet seen these discussions or publications cover the idea I am proposing here – to think of media analytics as the primary determinant of the new condition of the culture industry, marking a new stage in media history. The algorithmic analysis of “cultural data” and the customization of cultural products are at work not only in a few visible areas that have already been discussed, such as Google Search and Facebook news feeds – they are also at work in all platforms and services where people share, purchase and interact with cultural goods and with each other. When Adorno and Horkheimer were writing Dialectic of Enlightenment, interpersonal interactions were not yet directly part of the culture industry. But in “software culture”, they too have become “industrialized” – organized by interfaces, using the conventions and tools of social networks and messaging apps, and influenced in certain ways by algorithms that process all interaction data and make decisions about what content, updates and information to show, and when to show it.
Why do I call it a “stage”, as opposed to just a trend or one element of the contemporary culture industry? Because in many cases, media analytics involves the automatic computational processing and analysis of every cultural artefact in a given industry (such as the music industry, as represented by music streaming services) and of every user interaction with services that hundreds of millions of people use daily (e.g., Facebook or Baidu). It’s the new logic of how media works and how it functions in society. In short, it is crucial at both a practical and a theoretical level. Any future discussion of media and communications, or of media and communications theory, has to start dealing with this situation.
The companies that are key players in “big media” data processing are newer ones that have developed with the web – Google, Amazon, eBay, Facebook, etc. – rather than older twentieth-century cultural industry players such as the movie studios or book publishers. Therefore, what has been analysed and optimized between 1995 and today is mostly distribution, marketing, advertising, discovery and recommendations – that is, how customers find, purchase and “use” cultural products. As I have already noted, the same computational paradigms are also implemented by social networks and online retailers. From this perspective, the users of these networks and services themselves become “products”. For example, Amazon’s algorithms analyse data about what goods people look at and what they purchase, and use this analysis to provide personal recommendations to each Amazon user. At the same time, Facebook algorithms analyse what people do on Facebook to select what content appears in each person’s news feed.
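As a hedged illustration of the principle behind “people who bought this also bought that” recommendations – not of Amazon’s actual production system – the following Python sketch counts how often pairs of items co-occur in invented purchase histories and recommends the items that most often accompany what a user already owns.

```python
# Toy item-to-item co-occurrence recommender. The purchase data and item
# names are invented; real retail recommenders are far more sophisticated.
from collections import Counter
from itertools import combinations

purchases = [
    {"novel_a", "novel_b", "cookbook"},
    {"novel_a", "novel_b"},
    {"cookbook", "kitchen_scale"},
    {"novel_b", "novel_c"},
]

# Count, for every ordered pair of items, how often they appear together.
co_counts = Counter()
for basket in purchases:
    for x, y in combinations(sorted(basket), 2):
        co_counts[(x, y)] += 1
        co_counts[(y, x)] += 1

def recommend(owned, k=3):
    """Rank items the user does not own by how often they co-occur with owned ones."""
    scores = Counter()
    for item in owned:
        for (a, b), n in co_counts.items():
            if a == item and b not in owned:
                scores[b] += n
    return [item for item, _ in scores.most_common(k)]

print(recommend({"novel_a"}))  # e.g. ['novel_b', 'cookbook']
```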
Media analytics is the key aspect of the “materiality” of media today. In other words: materiality now is not only about hardware, or databases, or media authoring, publishing and sharing software, as it was in the early 2000s. It is about technologies such as Hadoop, Storm and computing clusters, paradigms such as supervised machine learning, particular data analysis trends such as “deep learning”, and basic machine learning algorithms such as k-means, decision trees and kNN. Materiality is Facebook “scanning 100 billion rows per second” and Google processing 100+ petabytes of data per day (2014 estimate), and also automatically creating “multiple [predictive] models for every person based on the time of the day”.
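As a concrete example of the algorithmic side of this materiality, here is a minimal sketch of k-means – one of the basic algorithms just mentioned – applied to a toy “cultural dataset” of songs described by a few invented audio features. The features, the numbers and the three “styles” are all hypothetical; real music services work with far richer representations and catalogues of millions of tracks.

```python
# Minimal k-means demonstration on an invented song dataset:
# each row is a hypothetical song described by [tempo, energy, danceability].
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 300 hypothetical songs drawn from three loose "styles",
# so the clustering has some structure to recover.
songs = np.vstack([
    rng.normal([90, 0.3, 0.4], 0.1 * np.array([90, 0.3, 0.4]), (100, 3)),    # slow, quiet
    rng.normal([125, 0.7, 0.8], 0.1 * np.array([125, 0.7, 0.8]), (100, 3)),  # dance
    rng.normal([160, 0.9, 0.5], 0.1 * np.array([160, 0.9, 0.5]), (100, 3)),  # fast, loud
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(songs)
print(kmeans.cluster_centers_)   # one "average song" per cluster
print(kmeans.labels_[:10])       # cluster assignment for the first ten songs
```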
At this point you, the reader, may get impatient and wonder when I will actually deliver what critics and media theorists are supposed to deliver when they talk about contemporary life and in particular technology: a critique of what I am describing. Why am I not invoking “capitalism”, “commodity”, “fetishism” or “resistance”? Does not the media analytics paradigm represent another step in capitalism’s rationalization of everything? Where is my moralistic judgment?
None of this is coming. Why? Because, in contrast to what media commentators like to tell you, I believe that computing and data analysis technologies are neutral. They don’t come with built-in social and economic ideologies and effects, nor are they exclusively the tools of capitalism and profit-making. Exactly the same analytics algorithms (k-means cluster analysis, Principal Component Analysis, and so on) or mass data processing technologies (Cassandra, MongoDB, etc.) are used to analyse people’s behaviour in social networks, look for cures for cancer, spy on potential terrorists, select the ads that appear alongside your YouTube video, study the human microbiome, motivate people to lead healthy lifestyles, get more people to vote for a particular candidate during presidential elections (think Obama in 2012), and so on. They are used by both for-profit and not-for-profit organizations, by the United States, Russia, Brazil, China and everybody else, in many thousands of applications. They are used to control and to liberate, to create new knowledge and to limit what we know, to help find love and to encourage us to consume more.
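To underline the point about neutrality with a sketch rather than an assertion: the following hypothetical Python function runs the same pipeline – standardization, Principal Component Analysis, k-means clustering – on any numeric table, and never knows whether its rows describe social media users, tumour samples or songs. The function name and the toy data are my own invention.

```python
# The same analysis pipeline, indifferent to what its rows represent.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_anything(table, n_clusters=5, n_components=2):
    """Standardize, project to a few principal components, and cluster.

    The function never knows whether a row is a person, a gene, or a song.
    """
    reduced = PCA(n_components=n_components).fit_transform(
        StandardScaler().fit_transform(table))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(reduced)
    return reduced, labels

# The same call works on "user behaviour" and on "lab measurements" alike.
users = np.random.default_rng(1).random((1000, 20))    # e.g. clicks, likes, shares
samples = np.random.default_rng(2).random((200, 50))   # e.g. gene expression levels
print(cluster_anything(users)[1][:10])
print(cluster_anything(samples, n_clusters=3)[1][:10])
```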
This does not mean that the adoption of large-scale data processing and analysis across the culture industry does not change it in many significant ways. Nor does it mean that it is now any less of an “industry”, in the sense of having distinct forms of organization. On the contrary – some of the marketing and advertising techniques, the ways of interacting with customers, and the ways of presenting cultural products are very new, and they have all come to rely on large-scale media analytics in the last few years. The cultural (as opposed to economic or social) effects of these developments have not yet been systematically studied by either industry or academic researchers, but one thing is clear: the same data analysis methods and data-gathering techniques that are used in the culture industry can be used to research at least some of its cultural effects. Such analysis will gradually emerge, and we can already give it a name: computational media studies.