-
Latent Directions: A Simple Pathway to Bias Mitigation in Generative AI
Authors:
Carolina Lopez Olmos,
Alexandros Neophytou,
Sunando Sengupta,
Dim P. Papadopoulos
Abstract:
Mitigating biases in generative AI and, particularly in text-to-image models, is of high importance given their growing implications in society. The biased datasets used for training pose challenges in ensuring the responsible development of these models, and mitigation through hard prompting or embedding alteration, are the most common present solutions. Our work introduces a novel approach to ac…
▽ More
Mitigating biases in generative AI and, particularly in text-to-image models, is of high importance given their growing implications in society. The biased datasets used for training pose challenges in ensuring the responsible development of these models, and mitigation through hard prompting or embedding alteration, are the most common present solutions. Our work introduces a novel approach to achieve diverse and inclusive synthetic images by learning a direction in the latent space and solely modifying the initial Gaussian noise provided for the diffusion process. Maintaining a neutral prompt and untouched embeddings, this approach successfully adapts to diverse debiasing scenarios, such as geographical biases. Moreover, our work proves it is possible to linearly combine these learned latent directions to introduce new mitigations, and if desired, integrate it with text embedding adjustments. Furthermore, text-to-image models lack transparency for assessing bias in outputs, unless visually inspected. Thus, we provide a tool to empower developers to select their desired concepts to mitigate. The project page with code is available online.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Visual Context-Aware Person Fall Detection
Authors:
Aleksander Nagaj,
Zenjie Li,
Dim P. Papadopoulos,
Kamal Nasrollahi
Abstract:
As the global population ages, the number of fall-related incidents is on the rise. Effective fall detection systems, specifically in healthcare sector, are crucial to mitigate the risks associated with such events. This study evaluates the role of visual context, including background objects, on the accuracy of fall detection classifiers. We present a segmentation pipeline to semi-automatically s…
▽ More
As the global population ages, the number of fall-related incidents is on the rise. Effective fall detection systems, specifically in healthcare sector, are crucial to mitigate the risks associated with such events. This study evaluates the role of visual context, including background objects, on the accuracy of fall detection classifiers. We present a segmentation pipeline to semi-automatically separate individuals and objects in images. Well-established models like ResNet-18, EfficientNetV2-S, and Swin-Small are trained and evaluated. During training, pixel-based transformations are applied to segmented objects, and the models are then evaluated on raw images without segmentation. Our findings highlight the significant influence of visual context on fall detection. The application of Gaussian blur to the image background notably improves the performance and generalization capabilities of all models. Background objects such as beds, chairs, or wheelchairs can challenge fall detection systems, leading to false positive alarms. However, we demonstrate that object-specific contextual transformations during training effectively mitigate this challenge. Further analysis using saliency maps supports our observation that visual context is crucial in classification tasks. We create both dataset processing API and segmentation pipeline, available at https://github.com/A-NGJ/image-segmentation-cli.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
DID:RING: Ring Signatures using Decentralised Identifiers For Privacy-Aware Identity
Authors:
Dimitrios Kasimatis,
Sam Grierson,
William J. Buchanan,
Chris Eckl,
Pavlos Papadopoulos,
Nikolaos Pitropakis,
Craig Thomson,
Baraq Ghaleb
Abstract:
Decentralised identifiers have become a standardised element of digital identity architecture, with supra-national organisations such as the European Union adopting them as a key component for a unified European digital identity ledger. This paper delves into enhancing security and privacy features within decentralised identifiers by integrating ring signatures as an alternative verification metho…
▽ More
Decentralised identifiers have become a standardised element of digital identity architecture, with supra-national organisations such as the European Union adopting them as a key component for a unified European digital identity ledger. This paper delves into enhancing security and privacy features within decentralised identifiers by integrating ring signatures as an alternative verification method. This allows users to identify themselves through digital signatures without revealing which public key they used. To this end, the study proposed a novel decentralised identity method showcased in a decentralised identifier-based architectural framework. Additionally, the investigation assesses the repercussions of employing this new method in the verification process, focusing specifically on privacy and security aspects. Although ring signatures are an established asset of cryptographic protocols, this paper seeks to leverage their capabilities in the evolving domain of digital identities.
△ Less
Submitted 11 March, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
CrashCar101: Procedural Generation for Damage Assessment
Authors:
Jens Parslov,
Erik Riise,
Dim P. Papadopoulos
Abstract:
In this paper, we are interested in addressing the problem of damage assessment for vehicles, such as cars. This task requires not only detecting the location and the extent of the damage but also identifying the damaged part. To train a computer vision system for the semantic part and damage segmentation in images, we need to manually annotate images with costly pixel annotations for both part ca…
▽ More
In this paper, we are interested in addressing the problem of damage assessment for vehicles, such as cars. This task requires not only detecting the location and the extent of the damage but also identifying the damaged part. To train a computer vision system for the semantic part and damage segmentation in images, we need to manually annotate images with costly pixel annotations for both part categories and damage types. To overcome this need, we propose to use synthetic data to train these models. Synthetic data can provide samples with high variability, pixel-accurate annotations, and arbitrarily large training sets without any human intervention. We propose a procedural generation pipeline that damages 3D car models and we obtain synthetic 2D images of damaged cars paired with pixel-accurate annotations for part and damage categories. To validate our idea, we execute our pipeline and render our CrashCar101 dataset. We run experiments on three real datasets for the tasks of part and damage segmentation. For part segmentation, we show that the segmentation models trained on a combination of real data and our synthetic data outperform all models trained only on real data. For damage segmentation, we show the sim2real transfer ability of CrashCar101.
△ Less
Submitted 11 November, 2023;
originally announced November 2023.
-
Learning the What and How of Annotation in Video Object Segmentation
Authors:
Thanos Delatolas,
Vicky Kalogeiton,
Dim P. Papadopoulos
Abstract:
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation. Training a VOS model requires an abundance of manually labeled training videos. The de-facto traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects at each video frame. This annotation process, however, is tedious and time-consum…
▽ More
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation. Training a VOS model requires an abundance of manually labeled training videos. The de-facto traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects at each video frame. This annotation process, however, is tedious and time-consuming. To reduce this annotation cost, in this paper, we propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation. Unlike the traditional approach, we introduce an agent that predicts iteratively both which frame ("What") to annotate and which annotation type ("How") to use. Then, the annotator annotates only the selected frame that is used to update a VOS module, leading to significant gains in annotation time. We conduct experiments on the MOSE and the DAVIS datasets and we show that: (a) EVA-VOS leads to masks with accuracy close to the human agreement 3.5x faster than the standard way of annotating videos; (b) our frame selection achieves state-of-the-art performance; (c) EVA-VOS yields significant performance gains in terms of annotation time compared to all other methods and baselines.
△ Less
Submitted 11 November, 2023; v1 submitted 7 November, 2023;
originally announced November 2023.
-
The Devil is in the Details: Analyzing the Lucrative Ad Fraud Patterns of the Online Ad Ecosystem
Authors:
Emmanouil Papadogiannakis,
Nicolas Kourtellis,
Panagiotis Papadopoulos,
Evangelos P. Markatos
Abstract:
The online advertising market has recently reached the 500 billion dollar mark, and to accommodate the need to match a user with the highest bidder at a fraction of a second, it has moved towards a complex automated model involving numerous agents and middle men. Stimulated by potential revenue and the lack of transparency, bad actors have found ways to abuse it, circumvent restrictions, and gener…
▽ More
The online advertising market has recently reached the 500 billion dollar mark, and to accommodate the need to match a user with the highest bidder at a fraction of a second, it has moved towards a complex automated model involving numerous agents and middle men. Stimulated by potential revenue and the lack of transparency, bad actors have found ways to abuse it, circumvent restrictions, and generate substantial revenue from objectionable and even illegal content. To make matters worse, they often receive advertisements from respectable companies which have nothing to do with these illegal activities. Altogether, advertiser money is funneled towards unknown entities, supporting their objectionable operations and maintaining their existence.
In this project, we work towards understanding the extent of the problem and shed light on how shady agents take advantage of gaps in the ad ecosystem to monetize their operations. We study over 7 million websites and examine how state-of-the-art standards associated with online advertising are applied. We discover and present actual practices observed in the wild and show that publishers are able to monetize objectionable and illegal content and generate thousands of dollars of revenue on a monthly basis.
△ Less
Submitted 14 June, 2023;
originally announced June 2023.
-
P4L: Privacy Preserving Peer-to-Peer Learning for Infrastructureless Setups
Authors:
Ioannis Arapakis,
Panagiotis Papadopoulos,
Kleomenis Katevas,
Diego Perino
Abstract:
Distributed (or Federated) learning enables users to train machine learning models on their very own devices, while they share only the gradients of their models usually in a differentially private way (utility loss). Although such a strategy provides better privacy guarantees than the traditional centralized approach, it requires users to blindly trust a centralized infrastructure that may also b…
▽ More
Distributed (or Federated) learning enables users to train machine learning models on their very own devices, while they share only the gradients of their models usually in a differentially private way (utility loss). Although such a strategy provides better privacy guarantees than the traditional centralized approach, it requires users to blindly trust a centralized infrastructure that may also become a bottleneck with the increasing number of users. In this paper, we design and implement P4L: a privacy preserving peer-to-peer learning system for users to participate in an asynchronous, collaborative learning scheme without requiring any sort of infrastructure or relying on differential privacy. Our design uses strong cryptographic primitives to preserve both the confidentiality and utility of the shared gradients, a set of peer-to-peer mechanisms for fault tolerance and user churn, proximity and cross device communications. Extensive simulations under different network settings and ML scenarios for three real-life datasets show that P4L provides competitive performance to baselines, while it is resilient to different poisoning attacks. We implement P4L and experimental results show that the performance overhead and power consumption is minimal (less than 3mAh of discharge).
△ Less
Submitted 26 February, 2023;
originally announced February 2023.
-
Towards The Creation Of The Future Fish Farm
Authors:
Pavlos Papadopoulos,
William J Buchanan,
Sarwar Sayeed,
Nikolaos Pitropakis
Abstract:
A fish farm is an area where fish raise and bred for food. Fish farm environments support the care and management of seafood within a controlled environment. Over the past few decades, there has been a remarkable increase in the calorie intake of protein attributed to seafood. Along with this, there are significant opportunities within the fish farming industry for economic development. Determinin…
▽ More
A fish farm is an area where fish raise and bred for food. Fish farm environments support the care and management of seafood within a controlled environment. Over the past few decades, there has been a remarkable increase in the calorie intake of protein attributed to seafood. Along with this, there are significant opportunities within the fish farming industry for economic development. Determining the fish diseases, monitoring the aquatic organisms, and examining the imbalance in the water element are some key factors that require precise observation to determine the accuracy of the acquired data. Similarly, due to the rapid expansion of aquaculture, new technologies are constantly being implemented in this sector to enhance efficiency. However, the existing approaches have often failed to provide an efficient method of farming fish. This work has kept aside the traditional approaches and opened up new dimensions to perform accurate analysis by adopting a distributed ledger technology. Our work analyses the current state-of-the-art of fish farming and proposes a fish farm ecosystem that relies on a private-by-design architecture based on the Hyperledger Fabric private-permissioned distributed ledger technology. The proposed method puts forward accurate and secure storage of the retrieved data from multiple sensors across the ecosystem so that the adhering entities can exercise their decision based on the acquired data. This study demonstrates a proof-of-concept to signify the efficiency and usability of the future fish farm.
△ Less
Submitted 2 January, 2023;
originally announced January 2023.
-
FNDaaS: Content-agnostic Detection of Fake News sites
Authors:
Panagiotis Papadopoulos,
Dimitris Spithouris,
Evangelos P. Markatos,
Nicolas Kourtellis
Abstract:
Automatic fake news detection is a challenging problem in misinformation spreading, and it has tremendous real-world political and social impacts. Past studies have proposed machine learning-based methods for detecting such fake news, focusing on different properties of the published news articles, such as linguistic characteristics of the actual content, which however have limitations due to the…
▽ More
Automatic fake news detection is a challenging problem in misinformation spreading, and it has tremendous real-world political and social impacts. Past studies have proposed machine learning-based methods for detecting such fake news, focusing on different properties of the published news articles, such as linguistic characteristics of the actual content, which however have limitations due to the apparent language barriers. Departing from such efforts, we propose FNDaaS, the first automatic, content-agnostic fake news detection method, that considers new and unstudied features such as network and structural characteristics per news website. This method can be enforced as-a-Service, either at the ISP-side for easier scalability and maintenance, or user-side for better end-user privacy. We demonstrate the efficacy of our method using data crawled from existing lists of 637 fake and 1183 real news websites, and by building and testing a proof of concept system that materializes our proposal. Our analysis of data collected from these websites shows that the vast majority of fake news domains are very young and appear to have lower time periods of an IP associated with their domain than real news ones. By conducting various experiments with machine learning classifiers, we demonstrate that FNDaaS can achieve an AUC score of up to 0.967 on past sites, and up to 77-92% accuracy on newly-flagged ones.
△ Less
Submitted 13 December, 2022;
originally announced December 2022.
-
Transforming EU Governance: The Digital Integration through EBSI and GLASS
Authors:
Dimitrios Kasimatis,
William J Buchanan,
Mwarwan Abubakar,
Owen Lo,
Christos Chrysoulas,
Nikolaos Pitropakis,
Pavlos Papadopoulos,
Sarwar Sayeed,
Marc Sel
Abstract:
Traditionally, government systems managed citizen identities through disconnected data systems, using simple identifiers and paper-based processes, limiting digital trust and requiring citizens to request identity verification documents. The digital era offers a shift towards unique digital identifiers for each citizen, enabling a 'citizen wallet' for easier access to personal documents like acade…
▽ More
Traditionally, government systems managed citizen identities through disconnected data systems, using simple identifiers and paper-based processes, limiting digital trust and requiring citizens to request identity verification documents. The digital era offers a shift towards unique digital identifiers for each citizen, enabling a 'citizen wallet' for easier access to personal documents like academic records and licences, with enhanced security through digital signatures. The European Commission's initiative for a digital wallet for every EU citizen aims to improve mobility and integration, leveraging the European Blockchain Services Infrastructure (EBSI) for harmonised citizen integration. This paper discusses how EBSI and the GLASS project can advance governance and streamline access to identity documents.
△ Less
Submitted 19 April, 2024; v1 submitted 6 December, 2022;
originally announced December 2022.
-
The Hitchhiker's Guide to Facebook Web Tracking with Invisible Pixels and Click IDs
Authors:
Paschalis Bekos,
Panagiotis Papadopoulos,
Evangelos P. Markatos,
Nicolas Kourtellis
Abstract:
Over the past years, advertisement companies have used various tracking methods to persistently track users across the web. Such tracking methods usually include first and third-party cookies, cookie synchronization, as well as a variety of fingerprinting mechanisms. Facebook (FB) recently introduced a new tagging mechanism that attaches a one-time tag as a URL parameter (FBCLID) on outgoing links…
▽ More
Over the past years, advertisement companies have used various tracking methods to persistently track users across the web. Such tracking methods usually include first and third-party cookies, cookie synchronization, as well as a variety of fingerprinting mechanisms. Facebook (FB) recently introduced a new tagging mechanism that attaches a one-time tag as a URL parameter (FBCLID) on outgoing links to other websites. Although such a tag does not seem to have enough information to persistently track users, we demonstrate that despite its ephemeral nature, when combined with FB Pixel, it can aid in persistently monitoring user browsing behavior across i) different websites, ii) different actions on each website, iii) time, i.e., both in the past as well as in the future. We refer to this online monitoring of users as FB web tracking. We find that FB Pixel tracks a wide range of user activities on websites with alarming detail, especially on websites classified as sensitive categories under GDPR. Also, we show how the FBCLID tag can be used to match, and thus de-anonymize, activities of online users performed in the distant past (even before those users had a FB account) tracked by FB Pixel. In fact, by combining this tag with cookies that have rolling expiration dates, FB can also keep track of users' browsing activities in the future as well. Our experimental results suggest that 23% of the 10k most popular websites have adopted this technology, and can contribute to this activity tracking on the web. Furthermore, our longitudinal study shows that this type of user activity tracking can go as far back as 2015. Simply said, if a user creates for the first time a FB account today, FB could, under some conditions, match their anonymously collected past web browsing activity to their newly created FB profile, from as far back as 2015 and continue tracking their activity in the future.
△ Less
Submitted 28 March, 2023; v1 submitted 1 August, 2022;
originally announced August 2022.
-
YouTubers Not madeForKids: Detecting Channels Sharing Inappropriate Videos Targeting Children
Authors:
Myrsini Gkolemi,
Panagiotis Papadopoulos,
Evangelos P. Markatos,
Nicolas Kourtellis
Abstract:
In the last years, hundreds of new Youtube channels have been creating and sharing videos targeting children, with themes related to animation, superhero movies, comics, etc. Unfortunately, many of these videos are inappropriate for consumption by their target audience, due to disturbing, violent, or sexual scenes. In this paper, we study YouTube channels found to post suitable or disturbing video…
▽ More
In the last years, hundreds of new Youtube channels have been creating and sharing videos targeting children, with themes related to animation, superhero movies, comics, etc. Unfortunately, many of these videos are inappropriate for consumption by their target audience, due to disturbing, violent, or sexual scenes. In this paper, we study YouTube channels found to post suitable or disturbing videos targeting kids in the past. We identify a clear discrepancy between what YouTube assumes and flags as inappropriate content and channel, vs. what is found to be disturbing content and still available on the platform, targeting kids. In particular, we find that almost 60\% of videos that were manually annotated and classified as disturbing by an older study in 2019 (a collection bootstrapped with Elsa and other keywords related to children videos), are still available on YouTube in mid 2021. In the meantime, 44% of channels that uploaded such disturbing videos, have yet to be suspended and their videos to be removed. For the first time in literature, we also study the "madeForKids" flag, a new feature that YouTube introduced in the end of 2019, and compare its application to the channels that shared disturbing videos, as flagged from the previous study. Apparently, these channels are less likely to be set as "madeForKids" than those sharing suitable content. In addition, channels posting disturbing videos utilize their channel features such as keywords, description, topics, posts, etc., to appeal to kids (e.g., using game-related keywords). Finally, we use a collection of such channel and content features to train ML classifiers able to detect, at channel creation time, when a channel will be related to disturbing content uploads. These classifiers can help YouTube moderators reduce such incidences, pointing to potentially suspicious accounts without analyzing actual videos.
△ Less
Submitted 27 May, 2022;
originally announced May 2022.
-
Learning Program Representations for Food Images and Cooking Recipes
Authors:
Dim P. Papadopoulos,
Enrique Mora,
Nadiia Chepurko,
Kuan Wei Huang,
Ferda Ofli,
Antonio Torralba
Abstract:
In this paper, we are interested in modeling a how-to instructional procedure, such as a cooking recipe, with a meaningful and rich high-level representation. Specifically, we propose to represent cooking recipes and food images as cooking programs. Programs provide a structured representation of the task, capturing cooking semantics and sequential relationships of actions in the form of a graph.…
▽ More
In this paper, we are interested in modeling a how-to instructional procedure, such as a cooking recipe, with a meaningful and rich high-level representation. Specifically, we propose to represent cooking recipes and food images as cooking programs. Programs provide a structured representation of the task, capturing cooking semantics and sequential relationships of actions in the form of a graph. This allows them to be easily manipulated by users and executed by agents. To this end, we build a model that is trained to learn a joint embedding between recipes and food images via self-supervision and jointly generate a program from this embedding as a sequence. To validate our idea, we crowdsource programs for cooking recipes and show that: (a) projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results; (b) generating programs from images leads to better recognition results compared to predicting raw cooking instructions; and (c) we can generate food images by manipulating programs via optimizing the latent code of a GAN. Code, data, and models are available online.
△ Less
Submitted 30 March, 2022;
originally announced March 2022.
-
GLASS: A Citizen-Centric Distributed Data-Sharing Model within an e-Governance Architecture
Authors:
Owen Lo,
William J. Buchanan,
Sarwar Sayeed,
Pavlos Papadopoulos,
Nikolaos Pitropakis,
Christos Chrysoulas
Abstract:
E-governance is a process that aims to enhance a government's ability to simplify all the processes that may involve government, citizens, businesses, and so on. The rapid evolution of digital technologies has often created the necessity for the establishment of an e-Governance model. There is often a need for an inclusive e-governance model with integrated multiactor governance services and where…
▽ More
E-governance is a process that aims to enhance a government's ability to simplify all the processes that may involve government, citizens, businesses, and so on. The rapid evolution of digital technologies has often created the necessity for the establishment of an e-Governance model. There is often a need for an inclusive e-governance model with integrated multiactor governance services and where a single market approach can be adopted. e-Governance often aims to minimise bureaucratic processes, while at the same time including a digital-by-default approach to public services. This aims at administrative efficiency and the reduction of bureaucratic processes. It can also improve government capabilities, and enhances trust and security, which brings confidence in governmental transactions. However, solid implementations of a distributed data sharing model within an e-governance architecture is far from a reality; hence, citizens of European countries often go through the tedious process of having their confidential information verified. This paper focuses on the sinGLe sign-on e-GovernAnce Paradigm based on a distributed file-exchange network for security, transparency, cost-effectiveness and trust (GLASS) model, which aims to ensure that a citizen can control their relationship with governmental agencies. The paper thus proposes an approach that integrates a permissioned blockchain with the InterPlanetary File System (IPFS). This method demonstrates how we may encrypt and store verifiable credentials of the GLASS ecosystem, such as academic awards, ID documents and so on, within IPFS in a secure manner and thus only allow trusted users to read a blockchain record, and obtain the encryption key. This allows for the decryption of a given verifiable credential that stored on IPFS. This paper outlines the creation of a demonstrator that proves the principles of the GLASS approach.
△ Less
Submitted 16 March, 2022;
originally announced March 2022.
-
Who Funds Misinformation? A Systematic Analysis of the Ad-related Profit Routines of Fake News sites
Authors:
Emmanouil Papadogiannakis,
Panagiotis Papadopoulos,
Evangelos P. Markatos,
Nicolas Kourtellis
Abstract:
Fake news is an age-old phenomenon, widely assumed to be associated with political propaganda published to sway public opinion. Yet, with the growth of social media, it has become a lucrative business for Web publishers. Despite many studies performed and countermeasures proposed, unreliable news sites have increased in the last years their share of engagement among the top performing news sources…
▽ More
Fake news is an age-old phenomenon, widely assumed to be associated with political propaganda published to sway public opinion. Yet, with the growth of social media, it has become a lucrative business for Web publishers. Despite many studies performed and countermeasures proposed, unreliable news sites have increased in the last years their share of engagement among the top performing news sources. Stifling fake news impact depends on our efforts in limiting the (economic) incentives of fake news producers.
In this paper, we aim at enhancing the transparency around these exact incentives, and explore: Who supports the existence of fake news websites via paid ads, either as an advertiser or an ad seller? Who owns these websites and what other Web business are they into? We are the first to systematize the auditing process of fake news revenue flows. We identify the companies that advertise in fake news websites and the intermediary companies responsible for facilitating those ad revenues. We study more than 2,400 popular news websites and show that well-known ad networks, such as Google and IndexExchange, have a direct advertising relation with more than 40% of fake news websites. Using a graph clustering approach on 114.5K sites, we show that entities who own fake news sites, also operate other types of websites pointing to the fact that owning a fake news website is part of a broader business operation.
△ Less
Submitted 17 February, 2023; v1 submitted 10 February, 2022;
originally announced February 2022.
-
Leveraging Google's Publisher-specific IDs to Detect Website Administration
Authors:
Emmanouil Papadogiannakis,
Panagiotis Papadopoulos,
Evangelos P. Markatos,
Nicolas Kourtellis
Abstract:
Digital advertising is the most popular way for content monetization on the Internet. Publishers spawn new websites, and older ones change hands with the sole purpose of monetizing user traffic. In this ever-evolving ecosystem, it is challenging to effectively answer questions such as: Which entities monetize what websites? What categories of websites does an average entity typically monetize on a…
▽ More
Digital advertising is the most popular way for content monetization on the Internet. Publishers spawn new websites, and older ones change hands with the sole purpose of monetizing user traffic. In this ever-evolving ecosystem, it is challenging to effectively answer questions such as: Which entities monetize what websites? What categories of websites does an average entity typically monetize on and how diverse are these websites? How has this website administration ecosystem changed across time?
In this paper, we propose a novel, graph-based methodology to detect administration of websites on the Web, by exploiting the ad-related publisher-specific IDs. We apply our methodology across the top 1 million websites and study the characteristics of the created graphs of website administration. Our findings show that approximately 90% of the websites are associated each with a single publisher, and that small publishers tend to manage less popular websites. We perform a historical analysis of up to 8 million websites, and find a new, constantly rising number of (intermediary) publishers that control and monetize traffic from hundreds of websites, seeking a share of the ad-market pie. We also observe that over time, websites tend to move from big to smaller administrators.
△ Less
Submitted 10 February, 2022;
originally announced February 2022.
-
Ransomware: Analysing the Impact on Windows Active Directory Domain Services
Authors:
Grant McDonald,
Pavlos Papadopoulos,
Nikolaos Pitropakis,
Jawad Ahmad,
William J. Buchanan
Abstract:
Ransomware has become an increasingly popular type of malware across the past decade and continues to rise in popularity due to its high profitability. Organisations and enterprises have become prime targets for ransomware as they are more likely to succumb to ransom demands as part of operating expenses to counter the cost incurred from downtime. Despite the prevalence of ransomware as a threat t…
▽ More
Ransomware has become an increasingly popular type of malware across the past decade and continues to rise in popularity due to its high profitability. Organisations and enterprises have become prime targets for ransomware as they are more likely to succumb to ransom demands as part of operating expenses to counter the cost incurred from downtime. Despite the prevalence of ransomware as a threat towards organisations, there is very little information outlining how ransomware affects Windows Server environments, and particularly its proprietary domain services such as Active Directory. Hence, we aim to increase the cyber situational awareness of organisations and corporations that utilise these environments. Dynamic analysis was performed using three ransomware variants to uncover how crypto-ransomware affects Windows Server-specific services and processes. Our work outlines the practical investigation undertaken as WannaCry, TeslaCrypt, and Jigsaw were acquired and tested against several domain services. The findings showed that none of the three variants stopped the processes and decidedly left all domain services untouched. However, although the services remained operational, they became uniquely dysfunctional as ransomware encrypted the files pertaining to those services
△ Less
Submitted 7 February, 2022;
originally announced February 2022.
-
Incidents1M: a large-scale dataset of images with natural disasters, damage, and incidents
Authors:
Ethan Weber,
Dim P. Papadopoulos,
Agata Lapedriza,
Ferda Ofli,
Muhammad Imran,
Antonio Torralba
Abstract:
Natural disasters, such as floods, tornadoes, or wildfires, are increasingly pervasive as the Earth undergoes global warming. It is difficult to predict when and where an incident will occur, so timely emergency response is critical to saving the lives of those endangered by destructive events. Fortunately, technology can play a role in these situations. Social media posts can be used as a low-lat…
▽ More
Natural disasters, such as floods, tornadoes, or wildfires, are increasingly pervasive as the Earth undergoes global warming. It is difficult to predict when and where an incident will occur, so timely emergency response is critical to saving the lives of those endangered by destructive events. Fortunately, technology can play a role in these situations. Social media posts can be used as a low-latency data source to understand the progression and aftermath of a disaster, yet parsing this data is tedious without automated methods. Prior work has mostly focused on text-based filtering, yet image and video-based filtering remains largely unexplored. In this work, we present the Incidents1M Dataset, a large-scale multi-label dataset which contains 977,088 images, with 43 incident and 49 place categories. We provide details of the dataset construction, statistics and potential biases; introduce and train a model for incident detection; and perform image-filtering experiments on millions of images on Flickr and Twitter. We also present some applications on incident analysis to encourage and enable future work in computer vision for humanitarian aid. Code, data, and models are available at http://incidentsdataset.csail.mit.edu.
△ Less
Submitted 11 January, 2022;
originally announced January 2022.
-
Privacy-preserving and Trusted Threat Intelligence Sharing using Distributed Ledgers
Authors:
Hisham Ali,
Pavlos Papadopoulos,
Jawad Ahmad,
Nikolaos Pitropakis,
Zakwan Jaroucheh,
William J. Buchanan
Abstract:
Threat information sharing is considered as one of the proactive defensive approaches for enhancing the overall security of trusted partners. Trusted partner organizations can provide access to past and current cybersecurity threats for reducing the risk of a potential cyberattack - the requirements for threat information sharing range from simplistic sharing of documents to threat intelligence sh…
▽ More
Threat information sharing is considered as one of the proactive defensive approaches for enhancing the overall security of trusted partners. Trusted partner organizations can provide access to past and current cybersecurity threats for reducing the risk of a potential cyberattack - the requirements for threat information sharing range from simplistic sharing of documents to threat intelligence sharing. Therefore, the storage and sharing of highly sensitive threat information raises considerable concerns regarding constructing a secure, trusted threat information exchange infrastructure. Establishing a trusted ecosystem for threat sharing will promote the validity, security, anonymity, scalability, latency efficiency, and traceability of the stored information that protects it from unauthorized disclosure. This paper proposes a system that ensures the security principles mentioned above by utilizing a distributed ledger technology that provides secure decentralized operations through smart contracts and provides a privacy-preserving ecosystem for threat information storage and sharing regarding the MITRE ATT\&CK framework.
△ Less
Submitted 19 December, 2021;
originally announced December 2021.
-
Measuring the (Over)use of Service Workers for In-Page Push Advertising Purposes
Authors:
George Pantelakis,
Panagiotis Papadopoulos,
Nicolas Kourtellis,
Evangelos P. Markatos
Abstract:
Rich offline experience, periodic background sync, push notification functionality, network requests control, improved performance via requests caching are only a few of the functionalities provided by the Service Worker (SW) API. This new technology, supported by all major browsers, can significantly improve users' experience by providing the publisher with the technical foundations that would no…
▽ More
Rich offline experience, periodic background sync, push notification functionality, network requests control, improved performance via requests caching are only a few of the functionalities provided by the Service Worker (SW) API. This new technology, supported by all major browsers, can significantly improve users' experience by providing the publisher with the technical foundations that would normally require a native application. Albeit the capabilities of this new technique and its important role in the ecosystem of Progressive Web Apps (PWAs), it is still unclear what is their actual purpose on the web, and how publishers leverage the provided functionality in their web applications. In this study, we shed light in the real world deployment of SWs, by conducting the first large scale analysis of the prevalence of SWs in the wild. We see that SWs are becoming more and more popular, with the adoption increased by 26% only within the last 5 months. Surprisingly, besides their fruitful capabilities, we see that SWs are being mostly used for In-Page Push Advertising, in 65.08% of the SWs that connect with 3rd parties. We highlight that this is a relatively new way for advertisers to bypass ad-blockers and render ads on the user's displays natively.
△ Less
Submitted 29 March, 2022; v1 submitted 21 October, 2021;
originally announced October 2021.
-
Scaling up instance annotation via label propagation
Authors:
Dim P. Papadopoulos,
Ethan Weber,
Antonio Torralba
Abstract:
Manually annotating object segmentation masks is very time-consuming. While interactive segmentation methods offer a more efficient alternative, they become unaffordable at a large scale because the cost grows linearly with the number of annotated masks. In this paper, we propose a highly efficient annotation scheme for building large datasets with object segmentation masks. At a large scale, imag…
▽ More
Manually annotating object segmentation masks is very time-consuming. While interactive segmentation methods offer a more efficient alternative, they become unaffordable at a large scale because the cost grows linearly with the number of annotated masks. In this paper, we propose a highly efficient annotation scheme for building large datasets with object segmentation masks. At a large scale, images contain many object instances with similar appearance. We exploit these similarities by using hierarchical clustering on mask predictions made by a segmentation model. We propose a scheme that efficiently searches through the hierarchy of clusters and selects which clusters to annotate. Humans manually verify only a few masks per cluster, and the labels are propagated to the whole cluster. Through a large-scale experiment to populate 1M unlabeled images with object segmentation masks for 80 object classes, we show that (1) we obtain 1M object segmentation masks with an total annotation time of only 290 hours; (2) we reduce annotation time by 76x compared to manual annotation; (3) the segmentation quality of our masks is on par with those from manually annotated datasets. Code, data, and models are available online.
△ Less
Submitted 5 October, 2021;
originally announced October 2021.
-
Evaluating Tooling and Methodology when Analysing Bitcoin Mixing Services After Forensic Seizure
Authors:
Edward Henry Young,
Christos Chrysoulas,
Nikolaos Pitropakis,
Pavlos Papadopoulos,
William J Buchanan
Abstract:
Little or no research has been directed to analysis and researching forensic analysis of the Bitcoin mixing or 'tumbling' service themselves. This work is intended to examine effective tooling and methodology for recovering forensic artifacts from two privacy focused mixing services namely Obscuro which uses the secure enclave on intel chips to provide enhanced confidentiality and Wasabi wallet wh…
▽ More
Little or no research has been directed to analysis and researching forensic analysis of the Bitcoin mixing or 'tumbling' service themselves. This work is intended to examine effective tooling and methodology for recovering forensic artifacts from two privacy focused mixing services namely Obscuro which uses the secure enclave on intel chips to provide enhanced confidentiality and Wasabi wallet which uses CoinJoin to mix and obfuscate crypto currencies. These wallets were set up on VMs and then several forensic tools used to examine these VM images for relevant forensic artifacts. These forensic tools were able to recover a broad range of forensic artifacts and found both network forensics and logging files to be a useful source of artifacts to deanonymize these mixing services.
△ Less
Submitted 5 October, 2021;
originally announced October 2021.
-
GLASS: Towards Secure and Decentralized eGovernance Services using IPFS
Authors:
Christos Chrysoulas,
Amanda Thomson,
Nikolaos Pitropakis,
Pavlos Papadopoulos,
Owen Lo,
William J. Buchanan,
George Domalis,
Nikos Karacapilidis,
Dimitris Tsakalidis,
Dimitris Tsolis
Abstract:
The continuously advancing digitization has provided answers to the bureaucratic problems faced by eGovernance services. This innovation led them to an era of automation it has broadened the attack surface and made them a popular target for cyber attacks. eGovernance services utilize internet, which is currently a location addressed system where whoever controls the location controls not only the…
▽ More
The continuously advancing digitization has provided answers to the bureaucratic problems faced by eGovernance services. This innovation led them to an era of automation it has broadened the attack surface and made them a popular target for cyber attacks. eGovernance services utilize internet, which is currently a location addressed system where whoever controls the location controls not only the content itself, but the integrity of that content, and the access to that content. We propose GLASS, a decentralised solution which combines the InterPlanetary File System (IPFS) with Distributed Ledger technology and Smart Contracts to secure EGovernance services. We also create a testbed environment where we measure the IPFS performance.
△ Less
Submitted 17 September, 2021;
originally announced September 2021.
-
POW-HOW: An enduring timing side-channel to evade online malware sandboxes
Authors:
Antonio Nappa,
Panagiotis Papadopoulos,
Matteo Varvello,
Daniel Aceituno Gomez,
Juan Tapiador,
Andrea Lanzi
Abstract:
Online malware scanners are one of the best weapons in the arsenal of cybersecurity companies and researchers. A fundamental part of such systems is the sandbox that provides an instrumented and isolated environment (virtualized or emulated) for any user to upload and run unknown artifacts and identify potentially malicious behaviors. The provided API and the wealth of information inthe reports pr…
▽ More
Online malware scanners are one of the best weapons in the arsenal of cybersecurity companies and researchers. A fundamental part of such systems is the sandbox that provides an instrumented and isolated environment (virtualized or emulated) for any user to upload and run unknown artifacts and identify potentially malicious behaviors. The provided API and the wealth of information inthe reports produced by these services have also helped attackers test the efficacy of numerous techniques to make malware hard to detect.The most common technique used by malware for evading the analysis system is to monitor the execution environment, detect the presence of any debugging artifacts, and hide its malicious behavior if needed. This is usually achieved by looking for signals suggesting that the execution environment does not belong to a the native machine, such as specific memory patterns or behavioral traits of certain CPU instructions.
In this paper, we show how an attacker can evade detection on such online services by incorporating a Proof-of-Work (PoW) algorithm into a malware sample. Specifically, we leverage the asymptotic behavior of the computational cost of PoW algorithms when they run on some classes of hardware platforms to effectively detect a non bare-metal environment of the malware sandbox analyzer. To prove the validity of this intuition, we design and implement the POW-HOW framework, a tool to automatically implement sandbox detection strategies and embed a test evasion program into an arbitrary malware sample. Our empirical evaluation shows that the proposed evasion technique is durable, hard to fingerprint, and reduces existing malware detection rate by a factor of 10. Moreover, we show how bare-metal environments cannot scale with actual malware submissions rates for consumer services.
△ Less
Submitted 5 October, 2021; v1 submitted 7 September, 2021;
originally announced September 2021.
-
THEMIS: A Decentralized Privacy-Preserving Ad Platform with Reporting Integrity
Authors:
Gonçalo Pestana,
Iñigo Querejeta-Azurmendi,
Panagiotis Papadopoulos,
Benjamin Livshits
Abstract:
Online advertising fuels the (seemingly) free internet. However, although users can access most of the web services free of charge, they pay a heavy coston their privacy. They are forced to trust third parties and intermediaries, who not only collect behavioral data but also absorb great amounts of ad revenues. Consequently, more and more users opt out from advertising by resorting to ad blockers,…
▽ More
Online advertising fuels the (seemingly) free internet. However, although users can access most of the web services free of charge, they pay a heavy coston their privacy. They are forced to trust third parties and intermediaries, who not only collect behavioral data but also absorb great amounts of ad revenues. Consequently, more and more users opt out from advertising by resorting to ad blockers, thus costing publishers millions of dollars in lost ad revenues. Albeit there are various privacy-preserving advertising proposals (e.g.,Adnostic, Privad, Brave Ads) from both academia and industry, they all rely on centralized management that users have to blindly trust without being able to audit, while they also fail to guarantee the integrity of the per-formance analytics they provide to advertisers.
In this paper, we design and deploy THEMIS, a novel, decentralized and privacy-by-design ad platform that requires zero trust by users. THEMIS (i) provides auditability to its participants, (ii) rewards users for viewing ads, and (iii) allows advertisers to verify the performance and billing reports of their ad campaigns. By leveraging smart contracts and zero-knowledge schemes, we implement a prototype of THEMIS and early performance evaluation results show that it can scale linearly on a multi sidechain setup while it supports more than 51M users on a single-sidechain.
△ Less
Submitted 3 June, 2021;
originally announced June 2021.
-
Launching Adversarial Attacks against Network Intrusion Detection Systems for IoT
Authors:
Pavlos Papadopoulos,
Oliver Thornewill von Essen,
Nikolaos Pitropakis,
Christos Chrysoulas,
Alexios Mylonas,
William J. Buchanan
Abstract:
As the internet continues to be populated with new devices and emerging technologies, the attack surface grows exponentially. Technology is shifting towards a profit-driven Internet of Things market where security is an afterthought. Traditional defending approaches are no longer sufficient to detect both known and unknown attacks to high accuracy. Machine learning intrusion detection systems have…
▽ More
As the internet continues to be populated with new devices and emerging technologies, the attack surface grows exponentially. Technology is shifting towards a profit-driven Internet of Things market where security is an afterthought. Traditional defending approaches are no longer sufficient to detect both known and unknown attacks to high accuracy. Machine learning intrusion detection systems have proven their success in identifying unknown attacks with high precision. Nevertheless, machine learning models are also vulnerable to attacks. Adversarial examples can be used to evaluate the robustness of a designed model before it is deployed. Further, using adversarial examples is critical to creating a robust model designed for an adversarial environment. Our work evaluates both traditional machine learning and deep learning models' robustness using the Bot-IoT dataset. Our methodology included two main approaches. First, label poisoning, used to cause incorrect classification by the model. Second, the fast gradient sign method, used to evade detection measures. The experiments demonstrated that an attacker could manipulate or circumvent detection with significant probability.
△ Less
Submitted 26 April, 2021;
originally announced April 2021.
-
Practical Defences Against Model Inversion Attacks for Split Neural Networks
Authors:
Tom Titcombe,
Adam J. Hall,
Pavlos Papadopoulos,
Daniele Romanini
Abstract:
We describe a threat model under which a split network-based federated learning system is susceptible to a model inversion attack by a malicious computational server. We demonstrate that the attack can be successfully performed with limited knowledge of the data distribution by the attacker. We propose a simple additive noise method to defend against model inversion, finding that the method can si…
▽ More
We describe a threat model under which a split network-based federated learning system is susceptible to a model inversion attack by a malicious computational server. We demonstrate that the attack can be successfully performed with limited knowledge of the data distribution by the attacker. We propose a simple additive noise method to defend against model inversion, finding that the method can significantly reduce attack efficacy at an acceptable accuracy trade-off on MNIST. Furthermore, we show that NoPeekNN, an existing defensive method, protects different information from exposure, suggesting that a combined defence is necessary to fully protect private user data.
△ Less
Submitted 21 April, 2021; v1 submitted 12 April, 2021;
originally announced April 2021.
-
PyVertical: A Vertical Federated Learning Framework for Multi-headed SplitNN
Authors:
Daniele Romanini,
Adam James Hall,
Pavlos Papadopoulos,
Tom Titcombe,
Abbas Ismail,
Tudor Cebere,
Robert Sandmann,
Robin Roehm,
Michael A. Hoeh
Abstract:
We introduce PyVertical, a framework supporting vertical federated learning using split neural networks. The proposed framework allows a data scientist to train neural networks on data features vertically partitioned across multiple owners while keeping raw data on an owner's device. To link entities shared across different datasets' partitions, we use Private Set Intersection on IDs associated wi…
▽ More
We introduce PyVertical, a framework supporting vertical federated learning using split neural networks. The proposed framework allows a data scientist to train neural networks on data features vertically partitioned across multiple owners while keeping raw data on an owner's device. To link entities shared across different datasets' partitions, we use Private Set Intersection on IDs associated with data points. To demonstrate the validity of the proposed framework, we present the training of a simple dual-headed split neural network for a MNIST classification task, with data samples vertically distributed across two data owners and a data scientist.
△ Less
Submitted 14 April, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
Privacy and Trust Redefined in Federated Machine Learning
Authors:
Pavlos Papadopoulos,
Will Abramson,
Adam J. Hall,
Nikolaos Pitropakis,
William J. Buchanan
Abstract:
A common privacy issue in traditional machine learning is that data needs to be disclosed for the training procedures. In situations with highly sensitive data such as healthcare records, accessing this information is challenging and often prohibited. Luckily, privacy-preserving technologies have been developed to overcome this hurdle by distributing the computation of the training and ensuring th…
▽ More
A common privacy issue in traditional machine learning is that data needs to be disclosed for the training procedures. In situations with highly sensitive data such as healthcare records, accessing this information is challenging and often prohibited. Luckily, privacy-preserving technologies have been developed to overcome this hurdle by distributing the computation of the training and ensuring the data privacy to their owners. The distribution of the computation to multiple participating entities introduces new privacy complications and risks. In this paper, we present a privacy-preserving decentralised workflow that facilitates trusted federated learning among participants. Our proof-of-concept defines a trust framework instantiated using decentralised identity technologies being developed under Hyperledger projects Aries/Indy/Ursa. Only entities in possession of Verifiable Credentials issued from the appropriate authorities are able to establish secure, authenticated communication channels authorised to participate in a federated learning workflow related to mental health data.
△ Less
Submitted 30 March, 2021; v1 submitted 29 March, 2021;
originally announced March 2021.
-
The Rise and Fall of Fake News sites: A Traffic Analysis
Authors:
Manolis Chalkiadakis,
Alexandros Kornilakis,
Panagiotis Papadopoulos,
Evangelos P. Markatos,
Nicolas Kourtellis
Abstract:
Over the past decade, we have witnessed the rise of misinformation on the Internet, with online users constantly falling victims of fake news. A multitude of past studies have analyzed fake news diffusion mechanics and detection and mitigation techniques. However, there are still open questions about their operational behavior such as: How old are fake news websites? Do they typically stay online…
▽ More
Over the past decade, we have witnessed the rise of misinformation on the Internet, with online users constantly falling victims of fake news. A multitude of past studies have analyzed fake news diffusion mechanics and detection and mitigation techniques. However, there are still open questions about their operational behavior such as: How old are fake news websites? Do they typically stay online for long periods of time? Do such websites synchronize with each other their up and down time? Do they share similar content through time? Which third-parties support their operations? How much user traffic do they attract, in comparison to mainstream or real news websites? In this paper, we perform a first of its kind investigation to answer such questions regarding the online presence of fake news websites and characterize their behavior in comparison to real news websites. Based on our findings, we build a content-agnostic ML classifier for automatic detection of fake news websites (i.e. accuracy) that are not yet included in manually curated blacklists.
△ Less
Submitted 16 March, 2021;
originally announced March 2021.
-
User Tracking in the Post-cookie Era: How Websites Bypass GDPR Consent to Track Users
Authors:
Emmanouil Papadogiannakis,
Panagiotis Papadopoulos,
Nicolas Kourtellis,
Evangelos P. Markatos
Abstract:
During the past few years, mostly as a result of the GDPR and the CCPA, websites have started to present users with cookie consent banners. These banners are web forms where the users can state their preference and declare which cookies they would like to accept, if such option exists. Although requesting consent before storing any identifiable information is a good start towards respecting the us…
▽ More
During the past few years, mostly as a result of the GDPR and the CCPA, websites have started to present users with cookie consent banners. These banners are web forms where the users can state their preference and declare which cookies they would like to accept, if such option exists. Although requesting consent before storing any identifiable information is a good start towards respecting the user privacy, yet previous research has shown that websites do not always respect user choices. Furthermore, considering the ever decreasing reliance of trackers on cookies and actions browser vendors take by blocking or restricting third-party cookies, we anticipate a world where stateless tracking emerges, either because trackers or websites do not use cookies, or because users simply refuse to accept any.
In this paper, we explore whether websites use more persistent and sophisticated forms of tracking in order to track users who said they do not want cookies. Such forms of tracking include first-party ID leaking, ID synchronization, and browser fingerprinting. Our results suggest that websites do use such modern forms of tracking even before users had the opportunity to register their choice with respect to cookies. To add insult to injury, when users choose to raise their voice and reject all cookies, user tracking only intensifies. As a result, users' choices play very little role with respect to tracking: we measured that more than 75% of tracking activities happened before users had the opportunity to make a selection in the cookie consent banner, or when users chose to reject all cookies.
△ Less
Submitted 10 February, 2022; v1 submitted 17 February, 2021;
originally announced February 2021.
-
Asymmetric Private Set Intersection with Applications to Contact Tracing and Private Vertical Federated Machine Learning
Authors:
Nick Angelou,
Ayoub Benaissa,
Bogdan Cebere,
William Clark,
Adam James Hall,
Michael A. Hoeh,
Daniel Liu,
Pavlos Papadopoulos,
Robin Roehm,
Robert Sandmann,
Phillipp Schoppmann,
Tom Titcombe
Abstract:
We present a multi-language, cross-platform, open-source library for asymmetric private set intersection (PSI) and PSI-Cardinality (PSI-C). Our protocol combines traditional DDH-based PSI and PSI-C protocols with compression based on Bloom filters that helps reduce communication in the asymmetric setting. Currently, our library supports C++, C, Go, WebAssembly, JavaScript, Python, and Rust, and ru…
▽ More
We present a multi-language, cross-platform, open-source library for asymmetric private set intersection (PSI) and PSI-Cardinality (PSI-C). Our protocol combines traditional DDH-based PSI and PSI-C protocols with compression based on Bloom filters that helps reduce communication in the asymmetric setting. Currently, our library supports C++, C, Go, WebAssembly, JavaScript, Python, and Rust, and runs on both traditional hardware (x86) and browser targets. We further apply our library to two use cases: (i) a privacy-preserving contact tracing protocol that is compatible with existing approaches, but improves their privacy guarantees, and (ii) privacy-preserving machine learning on vertically partitioned data.
△ Less
Submitted 18 November, 2020;
originally announced November 2020.
-
A Privacy-Preserving Healthcare Framework Using Hyperledger Fabric
Authors:
Charalampos Stamatellis,
Pavlos Papadopoulos,
Nikolaos Pitropakis,
Sokratis Katsikas,
William J Buchanan
Abstract:
Electronic health record (EHR) management systems require the adoption of effective technologies when health information is being exchanged. Current management approaches often face risks that may expose medical record storage solutions to common security attack vectors. However, healthcare-oriented blockchain solutions can provide a decentralized, anonymous and secure EHR handling approach. This…
▽ More
Electronic health record (EHR) management systems require the adoption of effective technologies when health information is being exchanged. Current management approaches often face risks that may expose medical record storage solutions to common security attack vectors. However, healthcare-oriented blockchain solutions can provide a decentralized, anonymous and secure EHR handling approach. This paper presents PREHEALTH, a privacy-preserving EHR management solution that uses distributed ledger technology and an Identity Mixer (Idemix). The paper describes a proof-of-concept implementation that uses the Hyperledger Fabric's permissioned blockchain framework. The proposed solution is able to store patient records effectively whilst providing anonymity and unlinkability. Experimental performance evaluation results demonstrate the scheme's efficiency and feasibility for real-world scale deployment.
△ Less
Submitted 27 January, 2021; v1 submitted 18 November, 2020;
originally announced November 2020.
-
Review and Critical Analysis of Privacy-preserving Infection Tracking and Contact Tracing
Authors:
William J Buchanan,
Muhammad Ali Imran,
Masood Ur-Rehman,
Lei Zhang,
Qammer H. Abbasi,
Christos Chrysoulas,
David Haynes,
Nikolaos Pitropakis,
Pavlos Papadopoulos
Abstract:
The outbreak of viruses have necessitated contact tracing and infection tracking methods. Despite various efforts, there is currently no standard scheme for the tracing and tracking. Many nations of the world have therefore, developed their own ways where carriers of disease could be tracked and their contacts traced. These are generalized methods developed either in a distributed manner giving ci…
▽ More
The outbreak of viruses have necessitated contact tracing and infection tracking methods. Despite various efforts, there is currently no standard scheme for the tracing and tracking. Many nations of the world have therefore, developed their own ways where carriers of disease could be tracked and their contacts traced. These are generalized methods developed either in a distributed manner giving citizens control of their identity or in a centralised manner where a health authority gathers data on those who are carriers. This paper outlines some of the most significant approaches that have been established for contact tracing around the world. A comprehensive review on the key enabling methods used to realise the infrastructure around these infection tracking and contact tracing methods is also presented and recommendations are made for the most effective way to develop such a practice.
△ Less
Submitted 10 September, 2020;
originally announced September 2020.
-
Detecting natural disasters, damage, and incidents in the wild
Authors:
Ethan Weber,
Nuria Marzo,
Dim P. Papadopoulos,
Aritro Biswas,
Agata Lapedriza,
Ferda Ofli,
Muhammad Imran,
Antonio Torralba
Abstract:
Responding to natural disasters, such as earthquakes, floods, and wildfires, is a laborious task performed by on-the-ground emergency responders and analysts. Social media has emerged as a low-latency data source to quickly understand disaster situations. While most studies on social media are limited to text, images offer more information for understanding disaster and incident scenes. However, n…
▽ More
Responding to natural disasters, such as earthquakes, floods, and wildfires, is a laborious task performed by on-the-ground emergency responders and analysts. Social media has emerged as a low-latency data source to quickly understand disaster situations. While most studies on social media are limited to text, images offer more information for understanding disaster and incident scenes. However, no large-scale image datasets for incident detection exists. In this work, we present the Incidents Dataset, which contains 446,684 images annotated by humans that cover 43 incidents across a variety of scenes. We employ a baseline classification model that mitigates false-positive errors and we perform image filtering experiments on millions of social media images from Flickr and Twitter. Through these experiments, we show how the Incidents Dataset can be used to detect images with incidents in the wild. Code, data, and models are available online at http://incidentsdataset.csail.mit.edu.
△ Less
Submitted 20 August, 2020;
originally announced August 2020.
-
Privacy Preserving Passive DNS
Authors:
Pavlos Papadopoulos,
Nikolaos Pitropakis,
William J. Buchanan,
Owen Lo,
Sokratis Katsikas
Abstract:
The Domain Name System (DNS) was created to resolve the IP addresses of the web servers to easily remembered names. When it was initially created, security was not a major concern; nowadays, this lack of inherent security and trust has exposed the global DNS infrastructure to malicious actors. The passive DNS data collection process creates a database containing various DNS data elements, some of…
▽ More
The Domain Name System (DNS) was created to resolve the IP addresses of the web servers to easily remembered names. When it was initially created, security was not a major concern; nowadays, this lack of inherent security and trust has exposed the global DNS infrastructure to malicious actors. The passive DNS data collection process creates a database containing various DNS data elements, some of which are personal and need to be protected to preserve the privacy of the end users. To this end, we propose the use of distributed ledger technology. We use Hyperledger Fabric to create a permissioned blockchain, which only authorized entities can access. The proposed solution supports queries for storing and retrieving data from the blockchain ledger, allowing the use of the passive DNS database for further analysis, e.g. for the identification of malicious domain names. Additionally, it effectively protects the DNS personal data from unauthorized entities, including the administrators that can act as potential malicious insiders, and allows only the data owners to perform queries over these data. We evaluated our proposed solution by creating a proof-of-concept experimental setup that passively collects DNS data from a network and then uses the distributed ledger technology to store the data in an immutable ledger, thus providing a full historical overview of all the records.
△ Less
Submitted 14 August, 2020;
originally announced August 2020.
-
THEMIS: Decentralized and Trustless Ad Platform with Reporting Integrity
Authors:
Gonçalo Pestana,
Iñigo Querejeta-Azurmendi,
Panagiotis Papadopoulos,
Benjamin Livshits
Abstract:
Online advertising fuels the (seemingly) free internet. However, although users can access most websites free of charge, they need to pay a heavy cost on their privacy and blindly trust third parties and intermediaries that absorb great amounts of adrevenues and user data. This is one of the reasons users opt out from advertising by resorting ad blockers thatin turn cost publishers millions of dol…
▽ More
Online advertising fuels the (seemingly) free internet. However, although users can access most websites free of charge, they need to pay a heavy cost on their privacy and blindly trust third parties and intermediaries that absorb great amounts of adrevenues and user data. This is one of the reasons users opt out from advertising by resorting ad blockers thatin turn cost publishers millions of dollars in lost adrevenues. Existing privacy-preserving advertising approaches(e.g., Adnostic, Privad, Brave Ads) from both industry and academia cannot guarantee the integrity of the performance analytics they provide to advertisers, while they also rely on centralized management that users have to trust without being able to audit. In this paper, we propose THEMIS, a novel privacy-by-design ad platform that is decentralized and requires zero trust from users. THEMIS (i) provides auditability to all participants, (ii) rewards users for viewing ads, and (iii) allows advertisers to verify the performance and billing reports of their ad campaigns. To demonstrate the feasibility and practicability of our approach, we implemented a prototype of THEMIS using a combination of smart contracts and zero-knowledge schemes. Performance evaluation results show that during adreward payouts, THEMIS can support more than 51M users on a single-sidechain setup or 153M users ona multi-sidechain setup, thus proving that THEMIS scales linearly.
△ Less
Submitted 4 August, 2020; v1 submitted 10 July, 2020;
originally announced July 2020.
-
A Distributed Trust Framework for Privacy-Preserving Machine Learning
Authors:
Will Abramson,
Adam James Hall,
Pavlos Papadopoulos,
Nikolaos Pitropakis,
William J Buchanan
Abstract:
When training a machine learning model, it is standard procedure for the researcher to have full knowledge of both the data and model. However, this engenders a lack of trust between data owners and data scientists. Data owners are justifiably reluctant to relinquish control of private information to third parties. Privacy-preserving techniques distribute computation in order to ensure that data r…
▽ More
When training a machine learning model, it is standard procedure for the researcher to have full knowledge of both the data and model. However, this engenders a lack of trust between data owners and data scientists. Data owners are justifiably reluctant to relinquish control of private information to third parties. Privacy-preserving techniques distribute computation in order to ensure that data remains in the control of the owner while learning takes place. However, architectures distributed amongst multiple agents introduce an entirely new set of security and trust complications. These include data poisoning and model theft. This paper outlines a distributed infrastructure which is used to facilitate peer-to-peer trust between distributed agents; collaboratively performing a privacy-preserving workflow. Our outlined prototype sets industry gatekeepers and governance bodies as credential issuers. Before participating in the distributed learning workflow, malicious actors must first negotiate valid credentials. We detail a proof of concept using Hyperledger Aries, Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) to establish a distributed trust architecture during a privacy-preserving machine learning experiment. Specifically, we utilise secure and authenticated DID communication channels in order to facilitate a federated learning workflow related to mental health care data.
△ Less
Submitted 3 June, 2020;
originally announced June 2020.
-
Phishing URL Detection Through Top-level Domain Analysis: A Descriptive Approach
Authors:
Orestis Christou,
Nikolaos Pitropakis,
Pavlos Papadopoulos,
Sean McKeown,
William J. Buchanan
Abstract:
Phishing is considered to be one of the most prevalent cyber-attacks because of its immense flexibility and alarmingly high success rate. Even with adequate training and high situational awareness, it can still be hard for users to continually be aware of the URL of the website they are visiting. Traditional detection methods rely on blocklists and content analysis, both of which require time-cons…
▽ More
Phishing is considered to be one of the most prevalent cyber-attacks because of its immense flexibility and alarmingly high success rate. Even with adequate training and high situational awareness, it can still be hard for users to continually be aware of the URL of the website they are visiting. Traditional detection methods rely on blocklists and content analysis, both of which require time-consuming human verification. Thus, there have been attempts focusing on the predictive filtering of such URLs. This study aims to develop a machine-learning model to detect fraudulent URLs which can be used within the Splunk platform. Inspired from similar approaches in the literature, we trained the SVM and Random Forests algorithms using malicious and benign datasets found in the literature and one dataset that we created. We evaluated the algorithms' performance with precision and recall, reaching up to 85% precision and 87% recall in the case of Random Forests while SVM achieved up to 90% precision and 88% recall using only descriptive features.
△ Less
Submitted 13 May, 2020;
originally announced May 2020.
-
Computing multiple solutions of topology optimization problems
Authors:
Ioannis P. A. Papadopoulos,
Patrick E. Farrell,
Thomas M. Surowiec
Abstract:
Topology optimization problems often support multiple local minima due to a lack of convexity. Typically, gradient-based techniques combined with continuation in model parameters are used to promote convergence to more optimal solutions; however, these methods can fail even in the simplest cases. In this paper, we present an algorithm to perform a systematic exploratory search for the solutions of…
▽ More
Topology optimization problems often support multiple local minima due to a lack of convexity. Typically, gradient-based techniques combined with continuation in model parameters are used to promote convergence to more optimal solutions; however, these methods can fail even in the simplest cases. In this paper, we present an algorithm to perform a systematic exploratory search for the solutions of the optimization problem via second-order methods without a good initial guess. The algorithm combines the techniques of deflation, barrier methods and primal-dual active set solvers in a novel way. We demonstrate this approach on several numerical examples, observe mesh-independence in certain cases and show that multiple distinct local minima can be recovered.
△ Less
Submitted 12 January, 2021; v1 submitted 24 April, 2020;
originally announced April 2020.
-
Stop Tracking Me Bro! Differential Tracking Of User Demographics On Hyper-partisan Websites
Authors:
Pushkal Agarwal,
Sagar Joglekar,
Panagiotis Papadopoulos,
Nishanth Sastry,
Nicolas Kourtellis
Abstract:
Websites with hyper-partisan, left or right-leaning focus offer content that is typically biased towards the expectations of their target audience. Such content often polarizes users, who are repeatedly primed to specific (extreme) content, usually reflecting hard party lines on political and socio-economic topics. Though this polarization has been extensively studied with respect to content, it i…
▽ More
Websites with hyper-partisan, left or right-leaning focus offer content that is typically biased towards the expectations of their target audience. Such content often polarizes users, who are repeatedly primed to specific (extreme) content, usually reflecting hard party lines on political and socio-economic topics. Though this polarization has been extensively studied with respect to content, it is still unknown how it associates with the online tracking experienced by browsing users, especially when they exhibit certain demographic characteristics. For example, it is unclear how such websites enable the ad-ecosystem to track users based on their gender or age. In this paper, we take a first step to shed light and measure such potential differences in tracking imposed on users when visiting specific party-line's websites. For this, we design and deploy a methodology to systematically probe such websites and measure differences in user tracking. This methodology allows us to create user personas with specific attributes like gender and age and automate their browsing behavior in a consistent and repeatable manner. Thus, we systematically study how personas are being tracked by these websites and their third parties, especially if they exhibit particular demographic properties. Overall, we test 9 personas on 556 hyper-partisan websites and find that right-leaning websites tend to track users more intensely than left-leaning, depending on user demographics, using both cookies and cookie synchronization methods and leading to more costly delivered ads.
△ Less
Submitted 30 March, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
ZKSENSE: A Friction-less Privacy-Preserving Human Attestation Mechanism for Mobile Devices
Authors:
Iñigo Querejeta-Azurmendi,
Panagiotis Papadopoulos,
Matteo Varvello,
Antonio Nappa,
Jiexin Zhang,
Benjamin Livshits
Abstract:
Recent studies show that 20.4% of the internet traffic originates from automated agents. To identify and block such ill-intentioned traffic, mechanisms that verify the humanness of the user are widely deployed, with CAPTCHAs being the most popular. Traditional CAPTCHAs require extra user effort (e.g., solving mathematical puzzles), which can severely downgrade the end-user's experience, especially…
▽ More
Recent studies show that 20.4% of the internet traffic originates from automated agents. To identify and block such ill-intentioned traffic, mechanisms that verify the humanness of the user are widely deployed, with CAPTCHAs being the most popular. Traditional CAPTCHAs require extra user effort (e.g., solving mathematical puzzles), which can severely downgrade the end-user's experience, especially on mobile, and provide sporadic humanness verification of questionable accuracy. More recent solutions like Google's reCAPTCHA v3, leverage user data, thus raising significant privacy concerns. To address these issues, we present zkSENSE: the first zero-knowledge proof-based humanness attestation system for mobile devices. zkSENSE moves the human attestation to the edge: onto the user's very own device, where humanness of the user is assessed in a privacy-preserving and seamless manner. zkSENSE achieves this by classifying motion sensor outputs of the mobile device, based on a model trained by using both publicly available sensor data and data collected from a small group of volunteers. To ensure the integrity of the process, the classification result is enclosed in a zero-knowledge proof of humanness that can be safely shared with a remote server. We implement zkSENSE as an Android service to demonstrate its effectiveness and practicality. In our evaluation, we show that zkSENSE successfully verifies the humanness of a user across a variety of attacking scenarios and demonstrates 92% accuracy. On a two years old Samsung S9, zkSENSE's attestation takes around 3 seconds (when visual CAPTCHAs need 9.8 seconds) and consumes a negligible amount of battery.
△ Less
Submitted 7 September, 2021; v1 submitted 18 November, 2019;
originally announced November 2019.
-
The coin that never sleeps. The privacy preserving usage of Bitcoin in a longitudinal analysis as a speculative asset
Authors:
Emmanouil Karampinakis,
Michalis Pachilakis,
Panagiotis Papadopoulos,
Antonis Krithinakis,
Evangelos P. Markatos
Abstract:
Bitcoin is the first and undoubtedly most successful cryptocurrecny to date with a market capitalization of more than 100 billion dollars. Today, Bitcoin has more than 100,000 supporting merchants and more than 3 million active users. Besides the trust it enjoys among people, Bitcoin lacks of a basic feature a substitute currency must have: stability of value. Hence, although the use of Bitcoin as…
▽ More
Bitcoin is the first and undoubtedly most successful cryptocurrecny to date with a market capitalization of more than 100 billion dollars. Today, Bitcoin has more than 100,000 supporting merchants and more than 3 million active users. Besides the trust it enjoys among people, Bitcoin lacks of a basic feature a substitute currency must have: stability of value. Hence, although the use of Bitcoin as a mean of payment is relative low, yet the wild ups and downs of its value lure investors to use it as useful asset to yield a trading profit. In this study, we explore this exact nature of Bitcoin aiming to shed light in the newly emerged and rapid growing marketplace of cryptocurencies and compare the investmet landscape and patterns with the most popular traditional stock market of Dow Jones. Our results show that most of Bitcoin addresses are used in the correct fashion to preserve security and privacy of the transactions and that the 24/7 open market of Bitcoin is not affected by any political incidents of the offline world, in contrary with the traditional stock markets. Also, it seems that there are specific longitudes that lead the cryptocurrency in terms of bulk of transactions, but there is not the same correlation with the volume of the coins being transferred.
△ Less
Submitted 18 November, 2019; v1 submitted 6 November, 2019;
originally announced November 2019.
-
Filter List Generation for Underserved Regions
Authors:
Alexander Sjosten,
Peter Snyder,
Antonio Pastor,
Panagiotis Papadopoulos,
Benjamin Livshits
Abstract:
Filter lists play a large and growing role in protecting and assisting web users. The vast majority of popular filter lists are crowd-sourced, where a large number of people manually label resources related to undesirable web resources (e.g., ads, trackers, paywall libraries), so that they can be blocked by browsers and extensions. Because only a small percentage of web users participate in the ge…
▽ More
Filter lists play a large and growing role in protecting and assisting web users. The vast majority of popular filter lists are crowd-sourced, where a large number of people manually label resources related to undesirable web resources (e.g., ads, trackers, paywall libraries), so that they can be blocked by browsers and extensions. Because only a small percentage of web users participate in the generation of filter lists, a crowd-sourcing strategy works well for blocking either uncommon resources that appear on "popular" websites, or resources that appear on a large number of "unpopular" websites. A crowd-sourcing strategy will perform poorly for parts of the web with small "crowds", such as regions of the web serving languages with (relatively) few speakers. This work addresses this problem through the combination of two novel techniques: (i) deep browser instrumentation that allows for the accurate generation of request chains, in a way that is robust in situations that confuse existing measurement techniques, and (ii) an ad classifier that uniquely combines perceptual and page-context features to remain accurate across multiple languages. We apply our unique two-step filter list generation pipeline to three regions of the web that currently have poorly maintained filter lists: Sri Lanka, Hungary, and Albania. We generate new filter lists that complement existing filter lists. Our complementary lists block an additional 3,349 of ad and ad-related resources (1,771 unique) when applied to 6,475 pages targeting these three regions. We hope that this work can be part of an increased effort at ensuring that the security, privacy, and performance benefits of web resource blocking can be shared with all users, and not only those in dominant linguistic or economic regions.
△ Less
Submitted 24 January, 2020; v1 submitted 16 October, 2019;
originally announced October 2019.
-
VPN0: A Privacy-Preserving Decentralized Virtual Private Network
Authors:
Matteo Varvello,
Iñigo Querejeta Azurmendi,
Antonio Nappa,
Panagiotis Papadopoulos,
Goncalo Pestana,
Ben Livshits
Abstract:
Distributed Virtual Private Networks (dVPNs) are new VPN solutions aiming to solve the trust-privacy concern of a VPN's central authority by leveraging a distributed architecture. In this paper, we first review the existing dVPN ecosystem and debate on its privacy requirements. Then, we present VPN0, a dVPN with strong privacy guarantees and minimal performance impact on its users. VPN0 guarantees…
▽ More
Distributed Virtual Private Networks (dVPNs) are new VPN solutions aiming to solve the trust-privacy concern of a VPN's central authority by leveraging a distributed architecture. In this paper, we first review the existing dVPN ecosystem and debate on its privacy requirements. Then, we present VPN0, a dVPN with strong privacy guarantees and minimal performance impact on its users. VPN0 guarantees that a dVPN node only carries traffic it has "whitelisted", without revealing its whitelist or knowing the traffic it tunnels. This is achieved via three main innovations. First, an attestation mechanism which leverages TLS to certify a user visit to a specific domain. Second, a zero knowledge proof to certify that some incoming traffic is authorized, e.g., falls in a node's whitelist, without disclosing the target domain. Third, a dynamic chain of VPN tunnels to both increase privacy and guarantee service continuation while traffic certification is in place. The paper demonstrates VPN0 functioning when integrated with several production systems, namely BitTorrent DHT and ProtonVPN.
△ Less
Submitted 30 September, 2019;
originally announced October 2019.
-
No More Chasing Waterfalls: A Measurement Study of the Header Bidding Ad-Ecosystem
Authors:
Michalis Pachilakis,
Panagiotis Papadopoulos,
Evangelos P. Markatos,
Nicolas Kourtellis
Abstract:
In recent years, Header Bidding (HB) has gained popularity among web publishers, challenging the status quo in the ad ecosystem. Contrary to the traditional waterfall standard, HB aims to give back to publishers control of their ad inventory, increase transparency, fairness and competition among advertisers, resulting in higher ad-slot prices. Although promising, little is known about how this ad…
▽ More
In recent years, Header Bidding (HB) has gained popularity among web publishers, challenging the status quo in the ad ecosystem. Contrary to the traditional waterfall standard, HB aims to give back to publishers control of their ad inventory, increase transparency, fairness and competition among advertisers, resulting in higher ad-slot prices. Although promising, little is known about how this ad protocol works: What are HB's possible implementations, who are the major players, and what is its network and UX overhead? To address these questions, we design and implement HBDetector: a novel methodology to detect HB auctions on a website at real time. By crawling 35,000 top Alexa websites, we collect and analyze a dataset of 800k auctions. We find that: (i) 14.28% of top websites utilize HB. (ii) Publishers prefer to collaborate with a few Demand Partners who also dominate the waterfall market. (iii) HB latency can be significantly higher (up to 3x in median case) than waterfall.
△ Less
Submitted 26 September, 2019; v1 submitted 24 July, 2019;
originally announced July 2019.
-
YourAdvalue: Measuring Advertising Price Dynamics without Bankrupting User Privacy
Authors:
Michalis Pachilakis,
Panagiotis Papadopoulos,
Nikolaos Laoutaris,
Evangelos P. Markatos,
Nicolas Kourtellis
Abstract:
The Real Time Bidding (RTB) protocol is by now more than a decade old. During this time, a handful of measurement papers have looked at bidding strategies, personal information flow, and cost of display advertising through RTB. In this paper, we present YourAdvalue, a privacy-preserving tool for displaying to end-users in a simple and intuitive manner their advertising value as seen through RTB. U…
▽ More
The Real Time Bidding (RTB) protocol is by now more than a decade old. During this time, a handful of measurement papers have looked at bidding strategies, personal information flow, and cost of display advertising through RTB. In this paper, we present YourAdvalue, a privacy-preserving tool for displaying to end-users in a simple and intuitive manner their advertising value as seen through RTB. Using YourAdvalue, we measure desktop RTB prices in the wild, and compare them with desktop and mobile RTB prices reported by past work. We present how it estimates ad prices that are encrypted, and how it preserves user privacy while reporting results back to a data-server for analysis. We deployed our system, disseminated its browser extension, and collected data from 200 users, including 12000 ad impressions over 11 months.
By analyzing this dataset, we show that desktop RTB prices have grown 4.6X over desktop RTB prices measured in 2013, and 3.8X over mobile RTB prices measured in 2015. We also study how user demographics associate with the intensity of RTB ecosystem tracking, leading to higher ad prices. We find that exchanging data between advertisers and/or data brokers through cookie-synchronization increases the median value of displayed ads by 19%. We also find that female and younger users are more targeted, suffering more tracking (via cookie synchronization) than male or elder users. As a result of this targeting in our dataset, the advertising value (i) of women is 2.4X higher than that of men, (ii) of 25-34 year-olds is 2.5X higher than that of 35-44 year-olds, (iii) is most expensive on weekends and early mornings.
△ Less
Submitted 4 November, 2021; v1 submitted 24 July, 2019;
originally announced July 2019.
-
How to make a pizza: Learning a compositional layer-based GAN model
Authors:
Dim P. Papadopoulos,
Youssef Tamaazousti,
Ferda Ofli,
Ingmar Weber,
Antonio Torralba
Abstract:
A food recipe is an ordered set of instructions for preparing a particular dish. From a visual perspective, every instruction step can be seen as a way to change the visual appearance of the dish by adding extra objects (e.g., adding an ingredient) or changing the appearance of the existing ones (e.g., cooking the dish). In this paper, we aim to teach a machine how to make a pizza by building a ge…
▽ More
A food recipe is an ordered set of instructions for preparing a particular dish. From a visual perspective, every instruction step can be seen as a way to change the visual appearance of the dish by adding extra objects (e.g., adding an ingredient) or changing the appearance of the existing ones (e.g., cooking the dish). In this paper, we aim to teach a machine how to make a pizza by building a generative model that mirrors this step-by-step procedure. To do so, we learn composable module operations which are able to either add or remove a particular ingredient. Each operator is designed as a Generative Adversarial Network (GAN). Given only weak image-level supervision, the operators are trained to generate a visual layer that needs to be added to or removed from the existing image. The proposed model is able to decompose an image into an ordered sequence of layers by applying sequentially in the right order the corresponding removing modules. Experimental results on synthetic and real pizza images demonstrate that our proposed model is able to: (1) segment pizza toppings in a weaklysupervised fashion, (2) remove them by revealing what is occluded underneath them (i.e., inpainting), and (3) infer the ordering of the toppings without any depth ordering supervision. Code, data, and models are available online.
△ Less
Submitted 6 June, 2019;
originally announced June 2019.
-
The Blind Men and the Internet: Multi-Vantage Point Web Measurements
Authors:
Jordan Jueckstock,
Shaown Sarker,
Peter Snyder,
Panagiotis Papadopoulos,
Matteo Varvello,
Benjamin Livshits,
Alexandros Kapravelos
Abstract:
In this paper, we design and deploy a synchronized multi-vantage point web measurement study to explore the comparability of web measurements across vantage points (VPs). We describe in reproducible detail the system with which we performed synchronized crawls on the Alexa top 5K domains from four distinct network VPs: research university, cloud datacenter, residential network, and Tor gateway pro…
▽ More
In this paper, we design and deploy a synchronized multi-vantage point web measurement study to explore the comparability of web measurements across vantage points (VPs). We describe in reproducible detail the system with which we performed synchronized crawls on the Alexa top 5K domains from four distinct network VPs: research university, cloud datacenter, residential network, and Tor gateway proxy. Apart from the expected poor results from Tor, we observed no shocking disparities across VPs, but we did find significant impact from the residential VP's reliability and performance disadvantages. We also found subtle but distinct indicators that some third-party content consistently avoided crawls from our cloud VP. In summary, we infer that cloud VPs do fail to observe some content of interest to security and privacy researchers, who should consider augmenting cloud VPs with alternate VPs for cross-validation. Our results also imply that the added visibility provided by residential VPs over university VPs is marginal compared to the infrastructure complexity and network fragility they introduce.
△ Less
Submitted 21 May, 2019;
originally announced May 2019.
-
Keeping out the Masses: Understanding the Popularity and Implications of Internet Paywalls
Authors:
Panagiotis Papadopoulos,
Peter Snyder,
Dimitrios Athanasakis,
Benjamin Livshits
Abstract:
Funding the production of quality online content is a pressing problem for content producers. The most common funding method, online advertising, is rife with well-known performance and privacy harms, and an intractable subject-agent conflict: many users do not want to see advertisements, depriving the site of needed funding.
Because of these negative aspects of advertisement-based funding, payw…
▽ More
Funding the production of quality online content is a pressing problem for content producers. The most common funding method, online advertising, is rife with well-known performance and privacy harms, and an intractable subject-agent conflict: many users do not want to see advertisements, depriving the site of needed funding.
Because of these negative aspects of advertisement-based funding, paywalls are an increasingly popular alternative for websites. This shift to a "pay-for-access" web is one that has potentially huge implications for the web and society. Instead of a system where information (nominally) flows freely, paywalls create a web where high quality information is available to fewer and fewer people, leaving the rest of the web users with less information, that might be also less accurate and of lower quality. Despite the potential significance of a move from an "advertising-but-open" web to a "paywalled" web, we find this issue understudied.
This work addresses this gap in our understanding by measuring how widely paywalls have been adopted, what kinds of sites use paywalls, and the distribution of policies enforced by paywalls. A partial list of our findings include that (i) paywall use is accelerating (2x more paywalls every 6 months), (ii) paywall adoption differs by country (e.g. 18.75% in US, 12.69% in Australia), (iii) paywalls change how users interact with sites (e.g. higher bounce rates, less incoming links), (iv) the median cost of an annual paywall access is $108 per site, and (v) paywalls are in general trivial to circumvent.
Finally, we present the design of a novel, automated system for detecting whether a site uses a paywall, through the combination of runtime browser instrumentation and repeated programmatic interactions with the site. We intend this classifier to augment future, longitudinal measurements of paywall use and behavior.
△ Less
Submitted 7 May, 2020; v1 submitted 18 February, 2019;
originally announced March 2019.