Adam's Arxiv FrontPage
Generated on 2024-10-17. This frontpage is made by scraping arXiv and by running a sentence-model that detects whether the abstract describes a paper about a topic of interest. One cool feature: it all pretty much runs via GitHub Actions. |
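The per-sentence detection described above can be sketched roughly as follows. This is a minimal illustration, not the page's actual pipeline: `keyword_score` is a hypothetical stand-in for the sentence-model, and `TOPIC_KEYWORDS` and the `threshold` value are assumptions chosen only for the example.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical topic vocabulary; the real page uses a trained sentence-model instead.
TOPIC_KEYWORDS = ("dataset", "benchmark", "corpus")


def keyword_score(sentence: str) -> float:
    """Stand-in scorer: fraction of topic keywords that occur in the sentence."""
    text = sentence.lower()
    hits = sum(1 for kw in TOPIC_KEYWORDS if kw in text)
    return hits / len(TOPIC_KEYWORDS)


def split_sentences(abstract: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p]


def flag_sentences(
    abstract: str,
    score: Callable[[str], float] = keyword_score,
    threshold: float = 0.3,  # assumed cut-off, not taken from the page
) -> List[Tuple[str, float]]:
    """Return (sentence, score) pairs whose score reaches the threshold."""
    scored = [(s, score(s)) for s in split_sentences(abstract)]
    return [(s, sc) for s, sc in scored if sc >= threshold]


if __name__ == "__main__":
    demo = "We study cows. We release a new dataset of images. It is large."
    for sentence, sc in flag_sentences(demo):
        print(f"{sc:.3f}  {sentence}")
```

Swapping `keyword_score` for a learned classifier keeps the rest of the flow unchanged; the naive splitter will mis-split abbreviations such as "e.g.", so a real pipeline would likely need a sturdier sentence tokenizer.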
|
New Datasets |
|
2024-10-16 |
Unveiling the Limits of Alignment: Multi-modal Dynamic Local Fusion Network and A Benchmark for Unaligned RGBT Video Object Detection
Current RGB-Thermal Video Object Detection (RGBT VOD) methods still depend on manually aligning data at the image level, which hampers their practical application in real-world scenarios since image pairs captured by multispectral sensors often differ in both fields of view and resolution. To address this limitation, we propose a Multi-modal Dynamic Local Fusion Network (MDLNet) designed to handle unaligned RGBT image pairs. Specifically, our proposed Multi-modal Dynamic Local Fusion (MDLF) module includes a set of predefined boxes, each enhanced with random Gaussian noise to generate a dynamic box. Each box selects a local region from the original high-resolution RGB image. This region is then fused with the corresponding information from another modality and reinserted into the RGB. This method adapts to various data alignment scenarios by interacting with local features across different ranges. Simultaneously, we introduce a Cascaded Temporal Scrambler (CTS) within an end-to-end architecture. This module leverages consistent spatiotemporal information from consecutive frames to enhance the representation capability of the current frame while maintaining network efficiency. We have curated an open dataset called UVT-VOD2024 for unaligned RGBT VOD. [score 0.715] It consists of 30,494 pairs of unaligned RGBT images captured directly from a multispectral camera. [score 0.756] We conduct a comprehensive evaluation and comparison with MDLNet and state-of-the-art (SOTA) models, demonstrating the superior effectiveness of MDLNet. We will release our code and UVT-VOD2024 to the public for further research. |
2024-10-16 |
VisAnatomy: An SVG Chart Corpus with Fine-Grained Semantic Labels
Chart corpora, which comprise data visualizations and their semantic labels, are crucial for advancing visualization research. However, the labels in most existing chart corpora are high-level (e.g., chart types), hindering their utility for broader interactive applications like chart reuse, animation, and accessibility. In this paper, we contribute VisAnatomy, a chart corpus containing 942 real-world SVG charts produced by over 50 tools, encompassing 40 chart types and featuring structural and stylistic design variations. [score 0.715] Each chart is augmented with multilevel fine-grained labels on its semantic components, including each graphical element's type, role, and position, hierarchical groupings of elements, group layouts, and visual encodings. We demonstrate the richness of the semantic labels by comparing VisAnatomy with existing corpora. We illustrate the usefulness of VisAnatomy through four applications: chart type classification, chart decomposition, animation authoring, and content navigation for accessibility. Finally, we discuss our plan to improve VisAnatomy and the research opportunities VisAnatomy presents. |
2024-10-16 |
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. [score 0.921] Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration. |
2024-10-16 |
Learning to Predict Usage Options of Product Reviews with LLM-Generated Labels
Annotating large datasets can be challenging. [score 0.849] However, crowd-sourcing is often expensive and can lack quality, especially for non-trivial tasks. We propose a method of using LLMs as few-shot learners for annotating data in a complex natural language task where we learn a standalone model to predict usage options for products from customer reviews. We also propose a new evaluation metric for this scenario, HAMS4, that can be used to compare a set of strings with multiple reference sets. Learning a custom model offers individual control over energy efficiency and privacy measures compared to using the LLM directly for the sequence-to-sequence task. We compare this data annotation approach with other traditional methods and demonstrate how LLMs can enable considerable cost savings. We find that the quality of the resulting data exceeds the level attained by third-party vendor services and that GPT-4-generated labels even reach the level of domain experts. We make the code and generated labels publicly available. |
2024-10-16 |
Data-Driven Gyroscope Calibration
Gyroscopes are inertial sensors that measure the angular velocity of the platforms to which they are attached. To estimate the gyroscope deterministic error terms prior to mission start, a calibration procedure is performed. When considering low-cost gyroscopes, the calibration requires a turntable as the gyros are incapable of sensing the Earth turn rate. In this paper, we propose a data-driven framework to estimate the scale factor and bias of a gyroscope. To train and validate our approach, a dataset of 56 minutes was recorded using a turntable. [score 0.769] We demonstrated that our proposed approach outperforms the model-based approach in terms of accuracy and convergence time. Specifically, we improved the scale factor and bias estimation by an average of 72% during six seconds of calibration time, demonstrating an average of 75% calibration time improvement. That is, instead of minutes, our approach requires only several seconds for the calibration. |
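For readers unfamiliar with the two error terms named in this abstract, here is a minimal sketch of a conventional line-fit calibration against turntable reference rates. It assumes the common textbook error model measured = (1 + scale_factor) * true + bias; the abstract does not specify the model-based baseline or the data-driven method, so the function name `fit_scale_and_bias` and the sample values are illustrative assumptions only.

```python
from typing import Sequence, Tuple


def fit_scale_and_bias(
    true_rates: Sequence[float], measured_rates: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of measured = (1 + scale_factor) * true + bias.

    Assumed single-axis error model; returns (scale_factor, bias).
    """
    n = len(true_rates)
    if n != len(measured_rates) or n < 2:
        raise ValueError("need at least two paired samples of equal length")

    mean_x = sum(true_rates) / n
    mean_y = sum(measured_rates) / n
    sxx = sum((x - mean_x) ** 2 for x in true_rates)
    if sxx == 0.0:
        raise ValueError("reference rates must not all be identical")
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(true_rates, measured_rates)
    )

    slope = sxy / sxx                 # estimate of (1 + scale_factor)
    bias = mean_y - slope * mean_x    # intercept of the fitted line
    return slope - 1.0, bias


if __name__ == "__main__":
    # Synthetic turntable rates in deg/s with a 2% scale error and 0.5 deg/s bias.
    reference = [-10.0, 0.0, 10.0, 20.0]
    readings = [1.02 * w + 0.5 for w in reference]
    print(fit_scale_and_bias(reference, readings))
```

The fit needs at least two distinct reference rates, which is one way to see why a turntable is required when the sensor cannot resolve the Earth's rotation on its own.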
2024-10-16 |
A Claim Decomposition Benchmark for Long-form Answer Verification
The advancement of LLMs has significantly boosted the performance of complex long-form question answering tasks. However, one prominent issue of LLMs is the generated "hallucination" responses that are not factual. Consequently, attribution for each claim in responses becomes a common solution to improve the factuality and verifiability. Existing research mainly focuses on how to provide accurate citations for the response, which largely overlooks the importance of identifying the claims or statements for each response. To bridge this gap, we introduce a new claim decomposition benchmark, which requires building a system that can identify atomic and checkworthy claims for LLM responses. Specifically, we present the Chinese Atomic Claim Decomposition Dataset (CACDD), which builds on the WebCPM dataset with additional expert annotations to ensure high data quality. [score 0.761] The CACDD encompasses a collection of 500 human-annotated question-answer pairs, including a total of 4956 atomic claims. We further propose a new pipeline for human annotation and describe the challenges of this task. In addition, we provide experiment results on zero-shot, few-shot and fine-tuned LLMs as baselines. The results show that the claim decomposition is highly challenging and requires further exploration. All code and data are publicly available at https://github.com/FBzzh/CACDD. |
2024-10-16 |
Adaptive Prompt Learning with SAM for Few-shot Scanning Probe Microscope Image Segmentation
The Segment Anything Model (SAM) has demonstrated strong performance in image segmentation of natural scene images. However, its effectiveness diminishes markedly when applied to specific scientific domains, such as Scanning Probe Microscope (SPM) images. This decline in accuracy can be attributed to the distinct data distribution and limited availability of the data inherent in the scientific images. On the other hand, the acquisition of adequate SPM datasets is both time-intensive and laborious as well as skill-dependent. To address these challenges, we propose an Adaptive Prompt Learning with SAM (APL-SAM) framework tailored for few-shot SPM image segmentation. Our approach incorporates two key innovations to enhance SAM: 1) An Adaptive Prompt Learning module leverages few-shot embeddings derived from a limited support set to adaptively learn central representatives, serving as visual prompts. This innovation eliminates the need for time-consuming online user interactions for providing prompts, such as exhaustively marking points and bounding boxes slice by slice; 2) A multi-source, multi-level mask decoder specifically designed for few-shot SPM image segmentation is introduced, which can effectively capture the correspondence between the support and query images. To facilitate comprehensive training and evaluation, we introduce a new dataset, SPM-Seg, curated for SPM image segmentation. [score 0.767] Extensive experiments on this dataset reveal that the proposed APL-SAM framework significantly outperforms the original SAM, achieving over a 30% improvement in terms of Dice Similarity Coefficient with only one-shot guidance. Moreover, APL-SAM surpasses state-of-the-art few-shot segmentation methods and even fully supervised approaches in performance. Code and dataset used in this study will be made available upon acceptance. [score 0.732] |
2024-10-16 |
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance-level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data. We design a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models. [score 0.709] Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at https://github.com/function2-llx/MMMM. |
2024-10-16 |
MultiCamCows2024 -- A Multi-view Image Dataset for AI-driven Holstein-Friesian Cattle Re-Identification on a Working Farm
We present MultiCamCows2024, a farm-scale image dataset filmed across multiple cameras for the biometric identification of individual Holstein-Friesian cattle exploiting their unique black and white coat-patterns. [score 0.742] Captured by three ceiling-mounted visual sensors covering adjacent barn areas over seven days on a working dairy farm, the dataset comprises 101,329 images of 90 cows, plus the underlying original CCTV footage. [score 0.716] The dataset is provided alongside full computer vision recognition baselines, that is, both a supervised and a self-supervised learning framework for individual cow identification trained on cattle tracklets. We report a performance above 96% single image identification accuracy from the dataset and demonstrate that combining data from multiple cameras during learning enhances self-supervised identification. We show that our framework enables fully automatic cattle identification, barring only the simple human verification of tracklet integrity during data collection. Crucially, our study highlights that multi-camera, supervised and self-supervised components in tandem not only deliver highly accurate individual cow identification but also achieve this efficiently with no labelling of cattle identities by humans at all. We argue that this improvement in efficacy has practical implications for livestock management, behaviour analysis, and agricultural monitoring. For full reproducibility and practical ease of use, we publish all key software and code including re-identification components and the species detector with this paper. |
2024-10-16 |
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). [score 0.792] Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data. |
2024-10-16 |
Drillboards: Adaptive Visualization Dashboards for Dynamic Personalization of Visualization Experiences
We present drillboards, a technique for adaptive visualization dashboards consisting of a hierarchy of coordinated charts that the user can drill down to reach a desired level of detail depending on their expertise, interest, and desired effort. This functionality allows different users to personalize the same dashboard to their specific needs and expertise. The technique is based on a formal vocabulary of chart representations and rules for merging multiple charts of different types and data into single composite representations. The drillboard hierarchy is created by iteratively applying these rules starting from a baseline dashboard, with each consecutive operation yielding a new dashboard with fewer charts and progressively more abstract and simplified views. We also present an authoring tool for building drillboards and show how it can be applied to an agricultural dataset with hundreds of expert users. [score 0.845] Our evaluation asked three domain experts to author drillboards for their own datasets, which we then showed to casual end-users with favorable outcomes. |
2024-10-15 |
MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation
Multimodal remote sensing data, collected from a variety of sensors, provide a comprehensive and integrated perspective of the Earth's surface. [score 0.73] By employing multimodal fusion techniques, semantic segmentation offers more detailed insights into geographic scenes compared to single-modality approaches. Building upon recent advancements in vision foundation models, particularly the Segment Anything Model (SAM), this study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation. At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data. In addition, a pyramid-based Deep Fusion Module (DFM) is incorporated to further integrate high-level geographic features across multiple scales before decoding. This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data. Experimental results on two well-established fine-resolution multimodal remote sensing datasets, ISPRS Vaihingen and ISPRS Potsdam, confirm that the proposed MANet significantly surpasses current models in the task of multimodal semantic segmentation. The source code for this work will be accessible at https://github.com/sstary/SSRS. |
2024-10-15 |
Enhancing Assamese NLP Capabilities: Introducing a Centralized Dataset Repository
This paper introduces a centralized, open-source dataset repository designed to advance NLP and NMT for Assamese, a low-resource language. [score 0.779] The repository supports various tasks like sentiment analysis, named entity recognition, and machine translation by providing both pre-training and fine-tuning corpora. We review existing datasets, highlighting the need for standardized resources in Assamese NLP, and discuss potential applications in AI-driven research, such as LLMs, OCR, and chatbots. While promising, challenges like data scarcity and linguistic diversity remain. The repository aims to foster collaboration and innovation, promoting Assamese language research in the digital age. |
2024-10-15 |
Do LLMs Have the Generalization Ability in Conducting Causal Inference?
In causal inference, generalization capability refers to the ability to conduct causal inference methods on new data to estimate the causal effect between unknown phenomena, which is crucial for expanding the boundaries of knowledge. Studies have evaluated the causal inference capabilities of Large Language Models (LLMs) concerning known phenomena, yet the generalization capabilities of LLMs concerning unseen phenomena remain unexplored. In this paper, we selected four tasks: Causal Path Discovery (CP), Backdoor Adjustment (BA), Factual Inference (FI), and Counterfactual Inference (CI) as representatives of causal inference tasks. To generate evaluation questions about previously unseen phenomena in new data on the four tasks, we propose a benchmark generation framework, which employs randomly generated graphs and node names to formulate questions within hypothetical new causal scenarios. Based on this framework, we compile a benchmark dataset of varying levels of question complexity. [score 0.739] We extensively tested the generalization capabilities of five leading LLMs across four tasks. Experiment results reveal that while LLMs exhibit good generalization performance in solving simple CP, FI, and complex CI questions, they encounter difficulties when tackling BA questions and face obvious performance fluctuations as the problem complexity changes. Furthermore, when the names of phenomena incorporate existing terms, even if these names are entirely novel, their generalization performance can still be hindered by interference from familiar terms. |
2024-10-15 |
Network Representation Learning for Biophysical Neural Network Analysis
The analysis of biophysical neural networks (BNNs) has been a longstanding focus in computational neuroscience. A central yet unresolved challenge in BNN analysis lies in deciphering the correlations between neuronal and synaptic dynamics, their connectivity patterns, and learning process. To address this, we introduce a novel BNN analysis framework grounded in network representation learning (NRL), which leverages attention scores to uncover intricate correlations between network components and their features. Our framework integrates a new computational graph (CG)-based BNN representation, a bio-inspired graph attention network (BGAN) that enables multiscale correlation analysis across BNN representations, and an extensive BNN dataset. The CG-based representation captures key computational features, information flow, and structural relationships underlying neuronal and synaptic dynamics, while BGAN reflects the compositional structure of neurons, including dendrites, somas, and axons, as well as bidirectional information flows between BNN components. The dataset comprises publicly available models from ModelDB, reconstructed using Python and standardized in NeuroML format, and is augmented with data derived from canonical neuron and synapse models. [score 0.753] To our knowledge, this study is the first to apply an NRL-based approach to the full spectrum of BNNs and their analysis. |
2024-10-15 |
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
Recent advancements in Multi-modal Large Language Models (MLLMs) have opened new avenues for applications in Embodied AI. Building on previous work, EgoThink, we introduce VidEgoThink, a comprehensive benchmark for evaluating egocentric video understanding capabilities. To bridge the gap between MLLMs and low-level control in Embodied AI, we design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling. To minimize manual annotation costs, we develop an automatic data generation pipeline based on the Ego4D dataset, leveraging the prior knowledge and multimodal capabilities of GPT-4o. [score 0.85] Three human annotators then filter the generated data to ensure diversity and quality, resulting in the VidEgoThink benchmark. We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs. Experimental results indicate that all MLLMs, including GPT-4o, perform poorly across all tasks related to egocentric video understanding. These findings suggest that foundation models still require significant advancements to be effectively applied to first-person scenarios in Embodied AI. In conclusion, VidEgoThink reflects a research trend towards employing MLLMs for egocentric vision, akin to human capabilities, enabling active observation and interaction in complex real-world environments. |
2024-10-15 |
Robotic Arm Platform for Multi-View Image Acquisition and 3D Reconstruction in Minimally Invasive Surgery
Minimally invasive surgery (MIS) offers significant benefits such as reduced recovery time and minimised patient trauma, but poses challenges in visibility and access, making accurate 3D reconstruction a significant tool in surgical planning and navigation. This work introduces a robotic arm platform for efficient multi-view image acquisition and precise 3D reconstruction in MIS settings. We adapted a laparoscope to a robotic arm and captured ex-vivo images of several ovine organs across varying lighting conditions (operating room and laparoscopic) and trajectories (spherical and laparoscopic). We employed recently released learning-based feature matchers combined with COLMAP to produce our reconstructions. The reconstructions were evaluated against high-precision laser scans for quantitative evaluation. Our results show that whilst reconstructions suffer most under realistic MIS lighting and trajectory, many versions of our pipeline achieve close to sub-millimetre accuracy with an average of 1.05 mm Root Mean Squared Error and 0.82 mm Chamfer distance. Our best reconstruction results occur with operating room lighting and spherical trajectories. Our robotic platform provides a tool for controlled, repeatable multi-view data acquisition for 3D generation in MIS environments which we hope leads to new datasets for training learning-based models. [score 0.731] |
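The two error metrics quoted in this abstract can be illustrated with a small brute-force sketch over 3D point lists. Definitions of Chamfer distance vary (squared or unsquared distances, sums or means), and the abstract does not say which variant or alignment procedure was used, so the choices below (unsquared nearest-neighbour distances, symmetric mean) and all function names are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_distances(source: Sequence[Point], target: Sequence[Point]) -> List[float]:
    """Distance from each source point to its closest target point (O(n*m))."""
    return [min(math.dist(p, q) for q in target) for p in source]


def rmse_to_reference(reconstruction: Sequence[Point], reference: Sequence[Point]) -> float:
    """Root mean squared nearest-neighbour distance, reconstruction to reference."""
    d = nearest_distances(reconstruction, reference)
    return math.sqrt(sum(x * x for x in d) / len(d))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean of the two directed average distances."""
    d_ab = nearest_distances(a, b)
    d_ba = nearest_distances(b, a)
    return 0.5 * (sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba))


if __name__ == "__main__":
    # Toy clouds in millimetres: the reconstruction is offset by 1 mm along z.
    scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    recon = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
    print(rmse_to_reference(recon, scan), chamfer_distance(recon, scan))
```

For clouds of realistic size, a spatial index such as a k-d tree would replace the quadratic nearest-neighbour search.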
2024-10-15 |
FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting
Time Series Forecasting (TSF) is a key functionality in numerous fields, including finance, weather services, and energy management. While TSF methods are emerging these days, many of them require domain-specific data collection and model training and struggle with poor generalization performance on new domains. Foundation models aim to overcome this limitation. Pre-trained on large-scale language or time series data, they exhibit promising inferencing capabilities in new or unseen data. This has spurred a surge in new TSF foundation models. We propose a new benchmark, FoundTS, to enable thorough and fair evaluation and comparison of such models. FoundTS covers a variety of TSF foundation models, including those based on large language models and those pretrained on time series. Next, FoundTS supports different forecasting strategies, including zero-shot, few-shot, and full-shot, thereby facilitating more thorough evaluations. Finally, FoundTS offers a pipeline that standardizes evaluation processes such as dataset splitting, loading, normalization, and few-shot sampling, thereby facilitating fair evaluations. Building on this, we report on an extensive evaluation of TSF foundation models on a broad range of datasets from diverse domains and with different statistical characteristics. Specifically, we identify pros and cons and inherent limitations of existing foundation models, and we identify directions for future model design. We make our code and datasets available at https://anonymous.4open.science/r/FoundTS-C2B0. [score 0.887] |
2024-10-15 |
NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models
Large language models (LLMs) combined with tool learning have gained impressive results in real-world applications. During tool learning, LLMs may call multiple tools in nested orders, where the latter tool call may take the former response as its input parameters. However, current research on the nested tool learning capabilities is still under-explored, since the existing benchmarks lack relevant data instances. To address this problem, we introduce NesTools to bridge the current gap in comprehensive nested tool learning evaluations. NesTools comprises a novel automatic data generation method to construct large-scale nested tool calls with different nesting structures. With manual review and refinement, the dataset is of high quality and closely aligned with real-world scenarios. [score 0.843] Therefore, NesTools can serve as a new benchmark to evaluate the nested tool learning abilities of LLMs. We conduct extensive experiments on 22 LLMs, and provide in-depth analyses with NesTools, which show that current LLMs still suffer from the complex nested tool learning task. |
2024-10-14 |
Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
As large language models rapidly evolve to support longer context, there is a notable disparity in their capability to generate output at greater lengths. A recent study suggests that the primary cause for this imbalance may arise from the lack of data with long output during alignment training. In light of this observation, attempts are made to re-align foundation models with data that fills the gap, which result in models capable of generating lengthy output when instructed. In this paper, we explore the impact of data quality in tuning a model for long output, and the possibility of doing so from the starting points of human-aligned (instruct or chat) models. With careful data curation, we show that it is possible to achieve similar performance improvement in our tuned models, with only a small fraction of training data instances and compute. In addition, we assess the generalizability of such approaches by applying our tuning recipes to several models. Our findings suggest that, while capacities for generating long output vary across different models out-of-the-box, our approach to tune them with high-quality data using lite compute consistently yields notable improvement across all models we experimented on. We have made public our curated dataset for tuning long-writing capability, the implementations of model tuning and evaluation, as well as the fine-tuned models, all of which can be openly accessed. [score 0.791] |
2024-10-14 |
QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis
Discovering meaningful insights from a large dataset, known as Exploratory Data Analysis (EDA), is a challenging task that requires thorough exploration and analysis of the data. [score 0.787] Automated Data Exploration (ADE) systems use goal-oriented methods with Large Language Models and Reinforcement Learning towards full automation. However, these methods require human involvement to anticipate goals that may limit insight extraction, while fully automated systems demand significant computational resources and retraining for new datasets. We introduce QUIS, a fully automated EDA system that operates in two stages: insight generation (ISGen) driven by question generation (QUGen). The QUGen module generates questions in iterations, refining them from previous iterations to enhance coverage without human intervention or manually curated examples. The ISGen module analyzes data to produce multiple relevant insights in response to each question, requiring no prior training and enabling QUIS to adapt to new datasets. |
2024-10-14 |
Machine Translation Evaluation Benchmark for Wu Chinese: Workflow and Analysis
We introduce a FLORES+ dataset as an evaluation benchmark for modern Wu Chinese machine translation models and showcase its compatibility with existing Wu data. Wu Chinese is mutually unintelligible with other Sinitic languages such as Mandarin and Yue (Cantonese), but uses a set of Hanzi (Chinese characters) that profoundly overlaps with others. The population of Wu speakers is the second largest among languages in China, but the language has been suffering from a significant drop in usage, especially among the younger generations. We identify Wu Chinese as a textually low-resource language and address challenges for its machine translation models. Our contributions include: (1) an open-source, manually translated dataset, (2) full documentation on the process of dataset creation and validation experiments, (3) preliminary tools for Wu Chinese normalization and segmentation, and (4) benefits and limitations of our dataset, as well as implications to other low-resource languages. [score 0.899] |
2024-10-14 |
GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation
Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training. However, the advancement of these models has been hindered by the lack of comprehensive benchmarks. To address this gap, we introduce the General Time Series Forecasting Model Evaluation, GIFT-Eval, a pioneering benchmark aimed at promoting evaluation across diverse datasets. GIFT-Eval encompasses 28 datasets over 144,000 time series and 177 million data points, spanning seven domains, 10 frequencies, multivariate inputs, and prediction lengths ranging from short to long-term forecasts. [score 0.799] To facilitate the effective pretraining and evaluation of foundation models, we also provide a non-leaking pretraining dataset containing approximately 230 billion data points. [score 0.779] Additionally, we provide a comprehensive analysis of 17 baselines, which includes statistical models, deep learning models, and foundation models. We discuss each model in the context of various benchmark characteristics and offer a qualitative analysis that spans both deep learning and foundation models. We believe the insights from this analysis, along with access to this new standard zero-shot time series forecasting benchmark, will guide future developments in time series foundation models. The codebase, datasets, and a leaderboard showing all the results in detail will be available soon. [score 0.851] |
2024-10-14 |
Improve Meta-learning for Few-Shot Text Classification with All You Can Acquire from the Tasks
Meta-learning has emerged as a prominent technology for few-shot text classification and has achieved promising performance. However, existing methods often encounter difficulties in drawing accurate class prototypes from support set samples, primarily due to probable large intra-class differences and small inter-class differences within the task. Recent approaches attempt to incorporate external knowledge or pre-trained language models to augment data, but this requires additional resources and thus does not suit many few-shot scenarios. In this paper, we propose a novel solution to address this issue by adequately leveraging the information within the task itself. Specifically, we utilize label information to construct a task-adaptive metric space, thereby adaptively reducing the intra-class differences and magnifying the inter-class differences. We further employ the optimal transport technique to estimate class prototypes with query set samples together, mitigating the problem of inaccurate and ambiguous support set samples caused by large intra-class differences. We conduct extensive experiments on eight benchmark datasets, and our approach shows obvious advantages over state-of-the-art models across all the tasks on all the datasets. For reproducibility, all the datasets and codes are available at https://github.com/YvoGao/LAQDA. 0.935 |
2024-10-14 |
Cultural Fidelity in Large-Language Models: An Evaluation of Online Language Resources as a Driver of Model Performance in Value Representation
The training data for LLMs embeds societal values, increasing their familiarity with the language's culture. Our analysis found that 44% of the variance in the ability of GPT-4o to reflect the societal values of a country, as measured by the World Values Survey, correlates with the availability of digital resources in that language. Notably, the error rate was more than five times higher for the languages of the lowest resource compared to the languages of the highest resource. For GPT-4-turbo, this correlation rose to 72%, suggesting efforts to improve the familiarity with the non-English language beyond the web-scraped data. Our study developed one of the largest and most robust datasets in this topic area with 21 country-language pairs, each of which contains 94 survey questions verified by native speakers. 0.867 Our results highlight the link between LLM performance and digital data availability in target languages. Weaker performance in low-resource languages, especially prominent in the Global South, may worsen digital divides. We discuss strategies proposed to address this, including developing multilingual LLMs from the ground up and enhancing fine-tuning on diverse linguistic datasets, as seen in African language initiatives. |
2024-10-14 |
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. 0.721 Instead of unifying these problems into standard multi-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions. |
2024-10-14 |
BrainMVP: Multi-modal Vision Pre-training for Brain Image Analysis using Multi-parametric MRI
Accurate diagnosis of brain abnormalities is greatly enhanced by the inclusion of complementary multi-parametric MRI imaging data. There is significant potential to develop a universal pre-training model that can be quickly adapted for image modalities and various clinical scenarios. However, current models often rely on uni-modal image data, neglecting the cross-modal correlations among different image modalities or struggling to scale up pre-training in the presence of missing modality data. In this paper, we propose BrainMVP, a multi-modal vision pre-training framework for brain image analysis using multi-parametric MRI scans. First, we collect 16,022 brain MRI scans (over 2.4 million images), encompassing eight MRI modalities sourced from a diverse range of centers and devices. 0.801 Then, a novel pre-training paradigm is proposed for the multi-modal MRI data, addressing the issue of missing modalities and achieving multi-modal information fusion. Cross-modal reconstruction is explored to learn distinctive brain image embeddings and efficient modality fusion capabilities. A modality-wise data distillation module is proposed to extract the essence representation of each MR image modality for both the pre-training and downstream application purposes. Furthermore, we introduce a modality-aware contrastive learning module to enhance the cross-modality association within a study. Extensive experiments on downstream tasks demonstrate superior performance compared to state-of-the-art pre-training methods in the medical domain, with Dice Score improvement of 0.28%-14.47% across six segmentation benchmarks and a consistent accuracy improvement of 0.65%-18.07% in four individual classification tasks. |
2024-10-14 |
Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts
Adapting medical Large Language Models to local languages can reduce barriers to accessing healthcare services, but data scarcity remains a significant challenge, particularly for low-resource languages. To address this, we first construct a high-quality medical dataset and conduct analysis to ensure its quality. 0.808 In order to leverage the generalization capability of multilingual LLMs to efficiently scale to more resource-constrained languages, we explore the internal information flow of LLMs from a multilingual perspective using Mixture of Experts (MoE) modularity. Technically, we propose a novel MoE routing method that employs language-specific experts and cross-lingual routing. Inspired by circuit theory, our routing analysis revealed a Spread Out in the End information flow mechanism: while earlier layers concentrate cross-lingual information flow, the later layers exhibit language-specific divergence. This insight directly led to the development of the Post-MoE architecture, which applies sparse routing only in the later layers while keeping the other layers dense. Experimental results demonstrate that this approach enhances the generalization of multilingual models to other languages while preserving interpretability. Finally, to efficiently scale the model to 50 languages, we introduce the concept of language family experts, drawing on linguistic priors, which enables scaling the number of languages without adding additional parameters. |
2024-10-14 |
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. 0.78 Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for latent diffusion models, we introduce the SpatialSonic model utilizing spatial-aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our unified model not only achieves the objective of generating immersive and controllable spatial audio from text and image but also enables interactive audio generation during inference. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules. |
2024-10-14 |
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs' knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. 0.882 We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack. |
2024-10-14 |
SensorBench: Benchmarking LLMs in Coding-Based Sensor Processing
Effective processing, interpretation, and management of sensor data have emerged as a critical component of cyber-physical systems. Traditionally, processing sensor data requires profound theoretical knowledge and proficiency in signal-processing tools. However, recent works show that Large Language Models (LLMs) have promising capabilities in processing sensory data, suggesting their potential as copilots for developing sensing systems. To explore this potential, we construct a comprehensive benchmark, SensorBench, to establish a quantifiable objective. The benchmark incorporates diverse real-world sensor datasets for various tasks. 0.757 The results show that while LLMs exhibit considerable proficiency in simpler tasks, they face inherent challenges in processing compositional tasks with parameter selections compared to engineering experts. Additionally, we investigate four prompting strategies for sensor processing and show that self-verification can outperform all other baselines in 48% of tasks. Our study provides a comprehensive benchmark and prompting analysis for future developments, paving the way toward an LLM-based sensor processing copilot. |
2024-10-14 |
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models, we propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here. 0.966 |
2024-10-14 |
Depth Any Video with Scalable Synthetic Data
Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. 0.85 Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency. |
2024-10-10 |
Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network
Human activity recognition is an important task for an intelligent robot, especially in the field of human-robot collaboration; it requires not only the label of sub-activities but also the temporal structure of the activity. In order to automatically recognize both the label and the temporal structure in a sequence of human-object interactions, we propose a novel Pyramid Graph Convolutional Network (PGCN), which employs a pyramidal encoder-decoder architecture consisting of an attention-based graph convolution network and a temporal pyramid pooling module for downsampling and upsampling the interaction sequence on the temporal axis, respectively. The system represents the 2D or 3D spatial relation of human and objects from the detection results in video data as a graph. To learn the human-object relations, a new attention graph convolutional network is trained to extract condensed information from the graph representation. To segment action into sub-actions, a novel temporal pyramid pooling module is proposed, which upsamples compressed features back to the original time scale and classifies actions per frame. We explore various attention layers, namely spatial attention, temporal attention and channel attention, and combine different upsampling decoders to test the performance on action recognition and segmentation. We evaluate our model on two challenging datasets in the field of human-object interaction recognition, i.e. Bimanual Actions and IKEA Assembly datasets. 0.746 We demonstrate that our classifier significantly improves both framewise action recognition and segmentation, e.g., F1 micro and F1@50 scores on Bimanual Actions dataset are improved by 4.3% and 8.5% respectively. |
2024-10-10 |
Offline Hierarchical Reinforcement Learning via Inverse Optimization
Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy. This approach constructs a dataset suitable for off-the-shelf offline training. 0.788 We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed. |
2024-10-10 |
AI Surrogate Model for Distributed Computing Workloads
Large-scale international scientific collaborations, such as ATLAS, Belle II, CMS, and DUNE, generate vast volumes of data. 0.721 These experiments necessitate substantial computational power for varied tasks, including structured data processing, Monte Carlo simulations, and end-user analysis. Centralized workflow and data management systems are employed to handle these demands, but current decision-making processes for data placement and payload allocation are often heuristic and disjointed. This optimization challenge potentially could be addressed using contemporary machine learning methods, such as reinforcement learning, which, in turn, require access to extensive data and an interactive environment. Instead, we propose a generative surrogate modeling approach to address the lack of training data and concerns about privacy preservation. We have collected and processed real-world job submission records, totaling more than two million jobs through 150 days, and applied four generative models for tabular data -- TVAE, CTAGGAN+, SMOTE, and TabDDPM -- to these datasets, thoroughly evaluating their performance. Along with measuring the discrepancy among feature-wise distributions separately, we also evaluate pair-wise feature correlations, distance to closest record, and responses to pre-trained models. Our experiments indicate that SMOTE and TabDDPM can generate similar tabular data, almost indistinguishable from the ground truth. Yet, as a non-learning method, SMOTE ranks the lowest in privacy preservation. As a result, we conclude that the probabilistic-diffusion-model-based TabDDPM is the most suitable generative model for managing job record data. |
2024-10-10 |
Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions
Background: Machine learning methods for clinical named entity recognition and entity normalization systems can utilize both labeled corpora and Knowledge Graphs (KGs) for learning. However, infrequently occurring concepts may have few mentions in training corpora and lack detailed descriptions or synonyms, even in large KGs. For Disease Entity Recognition (DER) and Disease Entity Normalization (DEN), this can result in fewer high quality training examples relative to the number of known diseases. Large Language Model (LLM) generation of synthetic training examples could improve performance in these information extraction tasks. Methods: We fine-tuned a LLaMa-2 13B Chat LLM to generate a synthetic corpus containing normalized mentions of concepts from the Unified Medical Language System (UMLS) Disease Semantic Group. 0.769 We measured overall and Out of Distribution (OOD) performance for DER and DEN, with and without synthetic data augmentation. We evaluated performance on 3 different disease corpora using 4 different data augmentation strategies, assessed using BioBERT for DER and SapBERT and KrissBERT for DEN. Results: Our synthetic data yielded a substantial improvement for DEN: in all 3 training corpora, the top-1 accuracy of both SapBERT and KrissBERT improved by 3-9 points in overall performance and by 20-55 points in OOD data. A small improvement (1-2 points) was also seen for DER in overall performance, but only one dataset showed OOD improvement. Conclusion: LLM generation of normalized disease mentions can improve DEN relative to normalization approaches that do not utilize LLMs to augment data with synthetic mentions. Ablation studies indicate that performance gains for DEN were only partially attributable to improvements in OOD performance. The same approach has only a limited ability to improve DER. We make our software and dataset publicly available. 0.9 |
2024-10-10 |
PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science
Papers, patents, and clinical trials are indispensable types of scientific literature in biomedicine, crucial for knowledge sharing and dissemination. However, these documents are often stored in disparate databases with varying management standards and data formats, making it challenging to form systematic, fine-grained connections among them. To address this issue, we introduce PKG2.0, a comprehensive knowledge graph dataset encompassing over 36 million papers, 1.3 million patents, and 0.48 million clinical trials in the biomedical field. 0.777 PKG2.0 integrates these previously dispersed resources through various links, including biomedical entities, author networks, citation relationships, and research projects. Fine-grained biomedical entity extraction, high-performance author name disambiguation, and multi-source citation integration have played a crucial role in the construction of the PKG dataset. Additionally, project data from the NIH Exporter enriches the dataset with metadata of NIH-funded projects and their scholarly outputs. 0.839 Data validation demonstrates that PKG2.0 excels in key tasks such as author disambiguation and biomedical entity recognition. This dataset provides valuable resources for biomedical researchers, bibliometric scholars, and those engaged in literature mining. 0.76 |
2024-10-10 |
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. The unawareness of the reward scores also drives the LLM to indiscriminately favor the low-quality chosen responses and fail to generalize to responses with the highest rewards, which are sparse in data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. This dataset is easily integrated with existing direct alignment algorithms and is applicable to any preference dataset. 0.829 The experimental results across instruction-following benchmarks including AlpacaEval, MT-Bench, and Arena-Hard-Auto demonstrate that our approach consistently boosts the performance of DPO by a considerable margin across diverse models. Additionally, our method improves the average accuracy on various academic benchmarks. When applying our method to on-policy data, the resulting DPO model achieves SOTA results on AlpacaEval. Through ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere dataset expansion. Our code is available at https://github.com/shenao-zhang/reward-augmented-preference. |
2024-10-10 |
Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models
Large Language Models (LLMs) exhibit impressive performance across various domains but still struggle with arithmetic reasoning tasks. Recent work shows the effectiveness of prompt design methods in enhancing reasoning capabilities. However, these approaches overlook crucial requirements for prior knowledge of specific concepts, theorems, and tricks to tackle most arithmetic reasoning problems successfully. To address this issue, we propose a novel and effective Teaching-Inspired Integrated Framework, which emulates the instructional process of a teacher guiding students. This method equips LLMs with essential concepts, relevant theorems, and similar problems with analogous solution approaches, facilitating the enhancement of reasoning abilities. Additionally, we introduce two new Chinese datasets, MathMC and MathToF, both with detailed explanations and answers. 0.862 Experiments conducted on nine benchmarks demonstrate that our approach improves the reasoning accuracy of LLMs. With GPT-4 and our framework, we achieve new state-of-the-art performance on four math benchmarks (AddSub, SVAMP, Math23K and AQuA) with accuracies of 98.2% (+3.3%), 93.9% (+0.2%), 94.3% (+7.2%) and 81.1% (+1.2%). Our data and code are available at https://github.com/SallyTan13/Teaching-Inspired-Prompting. |
2024-10-10 |
DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory
Large language models (LLMs) have achieved reasonable quality improvements in machine translation (MT). However, most current research on MT-LLMs still faces significant challenges in maintaining translation consistency and accuracy when processing entire documents. In this paper, we introduce DelTA, a Document-levEL Translation Agent designed to overcome these limitations. DelTA features a multi-level memory structure that stores information across various granularities and spans, including Proper Noun Records, Bilingual Summary, Long-Term Memory, and Short-Term Memory, which are continuously retrieved and updated by auxiliary LLM-based components. Experimental results indicate that DelTA significantly outperforms strong baselines in terms of translation consistency and quality across four open/closed-source LLMs and two representative document translation datasets, achieving an increase in consistency scores by up to 4.58 percentage points and in COMET scores by up to 3.16 points on average. DelTA employs a sentence-by-sentence translation strategy, ensuring no sentence omissions and offering a memory-efficient solution compared to the mainstream method. Furthermore, DelTA improves pronoun translation accuracy, and the summary component of the agent also shows promise as a tool for query-based summarization tasks. We release our code and data at https://github.com/YutongWang1216/DocMTAgent. 0.882 |
2024-10-10 |
Insight Over Sight? Exploring the Vision-Knowledge Conflicts in Multimodal LLMs
This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts the model's internal commonsense knowledge (see Figure 1). To study this issue, we introduce an automated pipeline, augmented with human-in-the-loop quality control, to establish a benchmark aimed at simulating and assessing the conflicts in MLLMs. Utilizing this pipeline, we have crafted a diagnostic benchmark comprising 374 original images and 1,122 high-quality question-answer (QA) pairs. This benchmark covers two types of conflict target and three question difficulty levels, providing a thorough assessment tool. Through this benchmark, we evaluate the conflict-resolution capabilities of nine representative MLLMs across various model families and find a noticeable over-reliance on textual queries. Drawing on these findings, we propose a novel prompting strategy, "Focus-on-Vision" (FoV), which markedly enhances MLLMs' ability to favor visual data over conflicting textual knowledge. Our detailed analysis and the newly proposed strategy significantly advance the understanding and mitigation of vision-knowledge conflicts in MLLMs. The data and code are made publicly available. 0.774 |
Data Quality |
|
2024-10-16 |
Accurate and Data-Efficient Toxicity Prediction when Annotators Disagree
When annotators disagree, predicting the labels given by individual annotators can capture nuances overlooked by traditional label aggregation. 0.813 We introduce three approaches to predicting individual annotator ratings on the toxicity of text by incorporating individual annotator-specific information: a neural collaborative filtering (NCF) approach, an in-context learning (ICL) approach, and an intermediate embedding-based architecture. We also study the utility of demographic information for rating prediction. NCF showed limited utility; however, integrating annotator history, demographics, and survey information permits both the embedding-based architecture and ICL to substantially improve prediction accuracy, with the embedding-based architecture outperforming the other methods. We also find that, if demographics are predicted from survey information, using these imputed demographics as features performs comparably to using true demographic data. This suggests that demographics may not provide substantial information for modeling ratings beyond what is captured in survey responses. Our findings raise considerations about the relative utility of different types of annotator information and provide new approaches for modeling annotators in subjective NLP tasks. |
2024-10-15 |
AIC CTU system at AVeriTeC: Re-framing automated fact-checking as a simple RAG task
This paper describes our 3rd-place submission in the AVeriTeC shared task in which we attempted to address the challenge of fact-checking with evidence retrieved in the wild using a simple scheme of Retrieval-Augmented Generation (RAG) designed for the task, leveraging the predictive power of Large Language Models. We release our codebase and explain its two modules - the Retriever and the Evidence & Label generator - in detail, justifying their features such as MMR-reranking and Likert-scale confidence estimation. We evaluate our solution on AVeriTeC dev and test set and interpret the results, picking the GPT-4o as the most appropriate model for our pipeline at the time of our publication, with Llama 3.1 70B being a promising open-source alternative. We perform an empirical error analysis to see that faults in our predictions often coincide with noise in the data or ambiguous fact-checks, provoking further research and data augmentation. 0.671 |
2024-10-14 |
Manifold-Aware Local Feature Modeling for Semi-Supervised Medical Image Segmentation
Achieving precise medical image segmentation is vital for effective treatment planning and accurate disease diagnosis. Traditional fully-supervised deep learning methods, though highly precise, are heavily reliant on large volumes of labeled data, which are often difficult to obtain due to the expertise required for medical annotations. This has led to the rise of semi-supervised learning approaches that utilize both labeled and unlabeled data to mitigate the label scarcity issue. 0.614 In this paper, we introduce the Manifold-Aware Local Feature Modeling Network (MANet), which enhances the U-Net architecture by incorporating manifold supervision signals. This approach focuses on improving boundary accuracy, which is crucial for reliable medical diagnosis. To further extend the versatility of our method, we propose two variants: MA-Sobel and MA-Canny. The MA-Sobel variant employs the Sobel operator, which is effective for both 2D and 3D data, while the MA-Canny variant utilizes the Canny operator, specifically designed for 2D images, to refine boundary detection. These variants allow our method to adapt to various medical image modalities and dimensionalities, ensuring broader applicability. Our extensive experiments on datasets such as ACDC, LA, and Pancreas-NIH demonstrate that MANet consistently surpasses state-of-the-art methods in performance metrics like Dice and Jaccard scores. The proposed method also shows improved generalization across various semi-supervised segmentation networks, highlighting its robustness and effectiveness. Visual analysis of segmentation results confirms that MANet offers clearer and more accurate class boundaries, underscoring the value of manifold information in medical image segmentation. |
2024-10-14 |
Affinity-Graph-Guided Contrastive Learning for Pretext-Free Medical Image Segmentation with Minimal Annotation
The combination of semi-supervised learning (SemiSL) and contrastive learning (CL) has been successful in medical image segmentation with limited annotations. However, these works often rely on pretext tasks that lack the specificity required for pixel-level segmentation, and still face overfitting issues due to insufficient supervision signals resulting from too few annotations. Therefore, this paper proposes an affinity-graph-guided semi-supervised contrastive learning framework (Semi-AGCL) by establishing additional affinity-graph-based supervision signals between the student and teacher network, to achieve medical image segmentation with minimal annotations without pretext. The framework first designs an average-patch-entropy-driven inter-patch sampling method, which can provide a robust initial feature space without relying on pretext tasks. Furthermore, the framework designs an affinity-graph-guided loss function, which can improve the quality of the learned representation and the model generalization ability by exploiting the inherent structure of the data, thus mitigating overfitting. Our experiments indicate that with merely 10% of the complete annotation set, our model approaches the accuracy of the fully annotated baseline, manifesting a marginal deviation of only 2.52%. 0.638 Under the stringent conditions where only 5% of the annotations are employed, our model exhibits a significant enhancement in performance, surpassing the second best baseline by 23.09% on the Dice metric and achieving an improvement of 26.57% on the notably arduous CRAG and ACDC datasets. |
2024-10-14 |
The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels
Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e. having clean labels! 0.777 We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. 0.7 Given the proliferation of SSMs, particularly in large language models, we believe significant efforts should be invested in further delineating their susceptibility to clean-label poisoning, and in developing methods for overcoming this susceptibility. 0.638 |
2024-10-14 |
Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?
POS tagging plays a fundamental role in numerous applications. While POS taggers are highly accurate in well-resourced settings, they lag behind in cases of limited or missing training data. This paper focuses on POS tagging for languages with limited data. We seek to identify the characteristics of datasets that make them favourable for training POS tagging models without using any labelled training data from the target language. 0.654 This is a zero-shot approach. We compare the accuracies of a multilingual large language model (mBERT) fine-tuned on one or more languages related to the target language. Additionally, we compare these results with models trained directly on the target language itself. We do this for three target low-resource languages. Our research highlights the importance of accurate dataset selection for effective zero-shot POS tagging. Particularly, a strong linguistic relationship and high-quality datasets ensure optimal results. For extremely low-resource languages, zero-shot models prove to be a viable option. |
2024-10-10 |
AHA: Human-Assisted Out-of-Distribution Generalization and Detection
Modern machine learning models deployed often encounter distribution shifts in real-world applications, manifesting as covariate or semantic out-of-distribution (OOD) shifts. These shifts give rise to challenges in OOD generalization and OOD detection. This paper introduces a novel, integrated approach AHA (Adaptive Human-Assisted OOD learning) to simultaneously address both OOD generalization and detection through a human-assisted framework by labeling data in the wild. Our approach strategically labels examples within a novel maximum disambiguation region, where the number of semantic and covariate OOD data roughly equalizes. By labeling within this region, we can maximally disambiguate the two types of OOD data, thereby maximizing the utility of the fixed labeling budget. Our algorithm first utilizes a noisy binary search algorithm that identifies the maximal disambiguation region with high probability. The algorithm then continues with annotating inside the identified labeling region, reaping the full benefit of human feedback. 0.652 Extensive experiments validate the efficacy of our framework. We observed that with only a few hundred human annotations, our method significantly outperforms existing state-of-the-art methods that do not involve human assistance, in both OOD generalization and OOD detection. Code is publicly available at https://github.com/HaoyueBaiZJU/aha. |
2024-10-09 |
Learning from Spatio-temporal Correlation for Semi-Supervised LiDAR Semantic Segmentation
We address the challenges of the semi-supervised LiDAR segmentation (SSLS) problem, particularly in low-budget scenarios. The two main issues in low-budget SSLS are the poor-quality pseudo-labels for unlabeled data, and the performance drops due to the significant imbalance between ground-truth and pseudo-labels. This imbalance leads to a vicious training cycle. To overcome these challenges, we leverage the spatio-temporal prior by recognizing the substantial overlap between temporally adjacent LiDAR scans. We propose a proximity-based label estimation, which generates highly accurate pseudo-labels for unlabeled data by utilizing semantic consistency with adjacent labeled data. 0.671 Additionally, we enhance this method by progressively expanding the pseudo-labels from the nearest unlabeled scans, which helps significantly reduce errors linked to dynamic classes. 0.624 Additionally, we employ a dual-branch structure to mitigate performance degradation caused by data imbalance. Experimental results demonstrate remarkable performance in low-budget settings (i.e., <= 5%) and meaningful improvements in normal budget settings (i.e., 5 - 50%). Finally, our method has achieved new state-of-the-art results on SemanticKITTI and nuScenes in semi-supervised LiDAR segmentation. With only 5% labeled data, it offers competitive results against fully-supervised counterparts. Moreover, it surpasses the performance of the previous state-of-the-art at 100% labeled data (75.2%) using only 20% of labeled data (76.0%) on nuScenes. The code is available on https://github.com/halbielee/PLE. |
2024-10-07 |
Causal Micro-Narratives
We present a novel approach to classify causal micro-narratives from text. These narratives are sentence-level explanations of the cause(s) and/or effect(s) of a target subject. The approach requires only a subject-specific ontology of causes and effects, and we demonstrate it with an application to inflation narratives. Using a human-annotated dataset spanning historical and contemporary US news articles for training, we evaluate several large language models (LLMs) on this multi-label classification task. The best-performing model--a fine-tuned Llama 3.1 8B--achieves F1 scores of 0.87 on narrative detection and 0.71 on narrative classification. Comprehensive error analysis reveals challenges arising from linguistic ambiguity and highlights how model errors often mirror human annotator disagreements. 0.803 This research establishes a framework for extracting causal micro-narratives from real-world data, with wide-ranging applications to social science research. |
2024-10-03 |
Online Multi-Label Classification under Noisy and Changing Label Distribution
Multi-label data streams usually contain noisy labels in real-world applications, occurring in both relevant and irrelevant labels. 0.7 However, existing online multi-label classification methods are mostly limited in terms of label quality and fail to deal with the case of noisy labels. 0.617 On the other hand, the ground-truth label distribution may vary over time, which is hidden in the observed noisy label distribution and difficult to track, posing a major challenge for concept drift adaptation. Motivated by this, we propose an online multi-label classification algorithm under Noisy and Changing Label Distribution (NCLD). 0.625 The convex objective is designed to simultaneously model the label scoring and the label ranking for high accuracy, whose robustness to NCLD benefits from three novel works: 1) The local feature graph is used to reconstruct the label scores jointly with the observed labels, and an unbiased ranking loss is derived and applied to learn reliable ranking information. 0.691 2) By detecting the difference between two adjacent chunks with the unbiased label cardinality, we identify the change in the ground-truth label distribution and reset the ranking or all information learned from the past to match the new distribution. 3) Efficient and accurate updating is achieved based on the updating rule derived from the closed-form optimal model solution. Finally, empirical experimental results validate the effectiveness of our method in classifying instances under NCLD. |
2024-10-03 |
Learning 3D Perception from Others' Predictions
Accurate 3D object detection in real-world environments requires a huge amount of annotated data with high quality. Acquiring such data is tedious and expensive, and often needs repeated effort when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario to construct 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share the predictions with the ego agent (e.g., car). Naively using the received predictions as ground-truths to train the detector for the ego car, however, leads to inferior performance. We systematically study the problem and identify viewpoint mismatches and mislocalization (due to synchronization and GPS errors) as the main causes, which unavoidably result in false positives, false negatives, and inaccurate pseudo labels. 0.636 We propose a distance-based curriculum, first learning from closer units with similar viewpoints and subsequently improving the quality of other units' predictions via self-training. We further demonstrate that an effective pseudo label refinement module can be trained with a handful of annotated data, largely reducing the data quantity necessary to train an object detector. We validate our approach on the recently released real-world collaborative driving dataset, using reference cars' predictions as pseudo labels for the ego car. Extensive experiments including several scenarios (e.g., different sensors, detectors, and domains) demonstrate the effectiveness of our approach toward label-efficient learning of 3D perception from other units' predictions. |
Benchmarks |
|
2024-10-16 |
Rethinking Bjøntegaard Delta for Compression Efficiency Evaluation: Are We Calculating It Precisely and Reliably?
For decades, the Bjøntegaard Delta (BD) has been the metric for evaluating codec Rate-Distortion (R-D) performance. Yet, in most studies, BD is determined using just 4-5 R-D data points; could this be sufficient? As codecs and quality metrics advance, does the conventional BD estimation still hold up? Crucially, are the performance improvements of new codecs and tools genuine, or merely artifacts of estimation flaws? This paper addresses these concerns by reevaluating BD estimation. We present a novel approach employing a parameterized deep neural network to model R-D curves with high precision across various metrics, accompanied by a comprehensive R-D dataset. This approach both assesses the reliability of BD calculations and serves as a precise BD estimator. 0.688 Our findings advocate for the adoption of rigorous R-D sampling and reliability metrics in future compression research to ensure the validity and reliability of results. |
2024-10-16 |
Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations
Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on singular data sources, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations, leveraging the fusion of three modalities (visual, textual, and graph data) through alignment with large language models (LLMs). By incorporating visual information, we capture contextual and aesthetic item characteristics; textual data provides insights into user interests and item features in detail; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model, called Triple Modality Fusion (TMF), utilizes the power of LLMs to align and integrate these three modalities, achieving a comprehensive representation of user behaviors. The LLM models the user's interactions, including behaviors and item features, in natural language. Initially, the LLM is warmed up using only natural language-based prompts. We then devise the modality fusion module based on cross-attention and self-attention mechanisms to integrate different modalities from other models into the same embedding space and incorporate them into an LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy. 0.626 Further ablation studies validate the effectiveness of our model design and benefits of the TMF. |
2024-10-16 |
Off-dynamics Conditional Diffusion Planners
Offline Reinforcement Learning (RL) offers an attractive alternative to interactive data acquisition by leveraging pre-existing datasets. However, its effectiveness hinges on the quantity and quality of the data samples. 0.61 This work explores the use of more readily available, albeit off-dynamics datasets, to address the challenge of data scarcity in Offline RL. We propose a novel approach using conditional Diffusion Probabilistic Models (DPMs) to learn the joint distribution of the large-scale off-dynamics dataset and the limited target dataset. To enable the model to capture the underlying dynamics structure, we introduce two contexts for the conditional model: (1) a continuous dynamics score allows for partial overlap between trajectories from both datasets, providing the model with richer information; (2) an inverse-dynamics context guides the model to generate trajectories that adhere to the target environment's dynamic constraints. Empirical results demonstrate that our method significantly outperforms several strong baselines. 0.876 Ablation studies further reveal the critical role of each dynamics context. Additionally, our model demonstrates that by modifying the context, we can interpolate between source and target dynamics, making it more robust to subtle shifts in the environment. |
2024-10-16 |
Leveraging Spatial Attention and Edge Context for Optimized Feature Selection in Visual Localization
Visual localization determines an agent's precise position and orientation within an environment using visual data. It has become a critical task in the field of robotics, particularly in applications such as autonomous navigation. This is due to the ability to determine an agent's pose using cost-effective sensors such as RGB cameras. Recent methods in visual localization employ scene coordinate regression to determine the agent's pose. However, these methods face challenges as they attempt to regress 2D-3D correspondences across the entire image region, despite not all regions providing useful information. To address this issue, we introduce an attention network that selectively targets informative regions of the image. Using this network, we identify the highest-scoring features to improve the feature selection process and combine the result with edge detection. This integration ensures that the features chosen for the training buffer are located within robust regions, thereby improving 2D-3D correspondence and overall localization performance. Our approach was tested on the outdoor benchmark dataset, demonstrating superior results compared to previous methods. 0.826 |
2024-10-16 |
Correction to Local Information Privacy and Its Applications to Data Aggregation
In our previous works, we defined Local Information Privacy (LIP) as a context-aware privacy notion and presented the corresponding privacy-preserving mechanism. We then claimed that the mechanism satisfies epsilon-LIP for any epsilon>0 for arbitrary Px. However, this claim is not completely correct. In this document, we provide a correction to the valid range of privacy parameters of our previously proposed LIP mechanism. Further, we propose efficient algorithms to expand the range of valid privacy parameters. Finally, we discuss the impact of updated results on our original paper's experiments, the rationale of the proposed correction and corrected results. 0.626 |
2024-10-16 |
A Data-driven Contact Estimation Method for Wheeled-Biped Robots
Contact estimation is a key ability for limbed robots, where making and breaking contacts has a direct impact on state estimation and balance control. Existing approaches typically rely on gait-cycle priors or designated contact sensors. We design a contact estimator that is suitable for the emerging wheeled-biped robot types that do not have these features. To this end, we propose a Bayes filter in which update steps are learned from real-robot torque measurements while prediction steps rely on inertial measurements. We evaluate this approach in extensive real-robot and simulation experiments. Our method achieves better performance while being considerably more sample efficient than a comparable deep-learning baseline. 0.724 |
2024-10-16 |
ShapefileGPT: A Multi-Agent Large Language Model Framework for Automated Shapefile Processing
Vector data is one of the two core data structures in geographic information science (GIS), essential for accurately storing and representing geospatial information. Shapefile, the most widely used vector data format, has become the industry standard supported by all major geographic information systems. However, processing this data typically requires specialized GIS knowledge and skills, creating a barrier for researchers from other fields and impeding interdisciplinary research in spatial data analysis. Moreover, while large language models (LLMs) have made significant advancements in natural language processing and task automation, they still face challenges in handling the complex spatial and topological relationships inherent in GIS vector data. To address these challenges, we propose ShapefileGPT, an innovative framework powered by LLMs, specifically designed to automate Shapefile tasks. ShapefileGPT utilizes a multi-agent architecture, in which the planner agent is responsible for task decomposition and supervision, while the worker agent executes the tasks. We developed a specialized function library for handling Shapefiles and provided comprehensive API documentation, enabling the worker agent to operate Shapefiles efficiently through function calling. For evaluation, we developed a benchmark dataset based on authoritative textbooks, encompassing tasks in categories such as geometric operations and spatial queries. 0.627 ShapefileGPT achieved a task success rate of 95.24%, outperforming the GPT series models. In comparison to traditional LLMs, ShapefileGPT effectively handles complex vector data analysis tasks, overcoming the limitations of traditional LLMs in spatial analysis. This breakthrough opens new pathways for advancing automation and intelligence in the GIS field, with significant potential in interdisciplinary data analysis and application contexts. |
2024-10-16 |
HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs -- core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess these models, particularly in tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs' visual understanding and reasoning capabilities through code generation. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problems, with visual elements redrawn to ensure distinction from the source, preventing potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases to ensure a thorough and reliable evaluation of model-generated solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further reveal the limitations of current LMMs in vision reasoning and coding capabilities. These results underscore key areas for future research to enhance LMMs' capabilities. We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark. 0.69 |
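The pass@1 and pass@10 figures quoted above are instances of the pass@k metric. The abstract does not say how the authors compute it; the sketch below assumes the widely used unbiased estimator, in which `n` samples are drawn per task, `c` of them pass the tests, and pass@k is the probability that at least one of `k` samples drawn without replacement passes. The function name and the example numbers are illustrative only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: number that pass all tests,
    k: sample budget being evaluated (k <= n).
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 20 samples generated, 2 of them correct.
single_try = pass_at_k(20, 2, 1)
ten_tries = pass_at_k(20, 2, 10)
```

A benchmark-level score would then be the mean of `pass_at_k` over all tasks.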
2024-10-16 |
FairGLVQ: Fairness in Partition-Based Classification
Fairness is an important objective throughout society, from the distribution of limited goods such as education, through hiring and payment, to taxes, legislation, and jurisprudence. Due to the increasing importance of machine learning approaches in all areas of daily life including those related to health, security, and equity, an increasing amount of research focuses on fair machine learning. In this work, we focus on the fairness of partition- and prototype-based models. The contribution of this work is twofold: 1) we develop a general framework for fair machine learning of partition-based models that does not depend on a specific fairness definition, and 2) we derive a fair version of learning vector quantization (LVQ) as a specific instantiation. We compare the resulting algorithm against other algorithms from the literature on theoretical and real-world data showing its practical relevance. 0.709 |
2024-10-16 |
The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph
The performance of large language models (LLMs) in natural language processing (NLP) tasks is significantly influenced by the quality and diversity of data used for supervised fine-tuning (SFT). Current data selection methods often focus solely on quality or diversity, leading to underperforming models due to suboptimal training data. In this paper, we introduce GraphFilter, a novel method that represents the dataset as a bipartite graph, linking sentences to their constituent n-grams. This representation effectively captures the relationships between sentences and linguistic patterns, facilitating the selection of sentences that enhance n-gram diversity. To balance quality and diversity during selection, we propose a priority function that combines the quality metric with the diversity metric in a multiplicative manner. GraphFilter iteratively selects high-priority sentences, updates the bipartite graph by removing covered n-grams, and re-calculates priorities to reflect the evolving data landscape. We conduct extensive experiments using three model backbones across six widely used benchmarks. 0.682 The results demonstrate that GraphFilter outperforms all nine baseline approaches, achieving superior model performance and computational efficiency. 0.611 Our analyses validate the effectiveness of our design choices, examine the subsets selected by GraphFilter and other methods, highlight the importance of instruction diversity, and explore the role of quality and diversity in relation to subset sizes. GraphFilter establishes a new foundation for effective data selection strategies, encouraging further research in data selection for LLMs. |
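The selection loop described in this abstract (sentence to n-gram edges, a multiplicative priority, removal of covered n-grams, re-scoring) can be sketched roughly as follows. This is not the authors' implementation: the abstract does not define the quality or diversity metrics, so the sketch assumes that quality scores are supplied by the caller and that diversity is simply the count of a sentence's not-yet-covered n-grams. All names are hypothetical.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def graph_filter_sketch(sentences, quality, budget, n=2):
    """Greedily pick `budget` sentence indices by quality * uncovered n-grams.

    Assumption: diversity = number of the sentence's n-grams that no
    previously selected sentence has covered yet.
    """
    # Bipartite graph stored as adjacency: sentence index -> its n-gram nodes.
    edges = [ngrams(s.split(), n) for s in sentences]
    covered = set()
    remaining = set(range(len(sentences)))
    selected = []
    while remaining and len(selected) < budget:

        def priority(i):
            # Multiplicative combination of quality and current diversity.
            return quality[i] * len(edges[i] - covered)

        # Sorting makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=priority)
        selected.append(best)
        # "Removing covered n-grams": later priorities ignore them.
        covered |= edges[best]
        remaining.remove(best)
    return selected


# Hypothetical toy data: the second sentence overlaps heavily with the first.
corpus = ["the cat sat down", "the cat sat up", "a dog ran away fast"]
scores = [0.9, 0.8, 0.5]
picked = graph_filter_sketch(corpus, scores, budget=2)
```

In the toy run the lower-quality but novel third sentence outranks the near-duplicate second one once the first sentence's bigrams are covered, which is the quality and diversity trade-off the abstract describes.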
2024-10-16 |
Data-Driven Gyroscope Calibration
Gyroscopes are inertial sensors that measure the angular velocity of the platforms to which they are attached. To estimate the gyroscope deterministic error terms prior to mission start, a calibration procedure is performed. When considering low-cost gyroscopes, the calibration requires a turntable as the gyros are incapable of sensing the Earth turn rate. In this paper, we propose a data-driven framework to estimate the scale factor and bias of a gyroscope. To train and validate our approach, a dataset of 56 minutes was recorded using a turntable. We demonstrated that our proposed approach outperforms the model-based approach, in terms of accuracy and convergence time. 0.711 Specifically, we improved the scale factor and bias estimation by an average of 72% during six seconds of calibration time, demonstrating an average of 75% calibration time improvement. 0.633 That is, instead of minutes, our approach requires only several seconds for the calibration. |
2024-10-16 |
On the Role of Activation Functions in EEG-To-Text Decoder
In recent years, much interdisciplinary research has been conducted exploring potential use cases of neuroscience to advance the field of information retrieval. Initial research concentrated on the use of fMRI data, but fMRI was deemed to be not suitable for real-world applications, and soon, research shifted towards using EEG data. In this paper, we try to improve the original performance of a first attempt at generating text using EEG by focusing on the less explored area of optimising neural network performance. We test a set of different activation functions and compare their performance. 0.677 Our results show that introducing a higher degree polynomial activation function can enhance model performance without changing the model architecture. We also show that the learnable 3rd-degree activation function performs better on the 1-gram evaluation compared to a 3rd-degree non-learnable function. However, when evaluating the model on 2-grams and above, the polynomial function lacks in performance, whilst the leaky ReLU activation function outperforms the baseline. |
2024-10-16 |
Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting
The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method, following two fundamental tuning principles guided by empirical and theoretical analysis: expand and compress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base spatio-temporal graph neural network with a continuous prompt pool, utilizing stored prompts (i.e., few learnable parameters) in memory, and jointly optimize them with the base spatio-temporal graph neural network. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority of our method over the state-of-the-art baselines, including effectiveness, efficiency, universality, etc. 0.751 |
2024-10-16 |
The Bayesian Confidence (BACON) Estimator for Deep Neural Networks
This paper introduces the Bayesian Confidence Estimator (BACON) for deep neural networks. Current practice of interpreting Softmax values in the output layer as probabilities of outcomes is prone to extreme predictions of class probability. In this work we extend Waagen's method of representing the terminal layers with a geometric model, where the probability associated with an output vector is estimated with Bayes' Rule using validation data to provide likelihood and normalization values. This estimator provides superior ECE and ACE calibration error compared to Softmax for ResNet-18 at 85% network accuracy, and EfficientNet-B0 at 95% network accuracy, on the CIFAR-10 dataset with an imbalanced test set, except for very high accuracy edge cases. 0.622 In addition, when using the ACE metric, BACON demonstrated improved calibration error when estimating probabilities for the imbalanced test set when using actual class distribution fractions. |
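For readers unfamiliar with the ECE metric mentioned above, here is a minimal sketch of expected calibration error with equal-width confidence bins. It is a generic textbook-style formulation, not the evaluation code of the paper; the bin count, the half-open bin edges, and the function name are assumptions made for this example. ACE differs mainly in using bins with equal numbers of samples and is not shown.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    Bins are equal-width and half-open, [lo, hi); 1.0 goes in the last bin.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions: two confident hits, two coin-flip guesses.
toy_ece = expected_calibration_error(
    [0.95, 0.95, 0.55, 0.55], [True, True, True, False]
)
```

A lower value means the reported confidences track the observed accuracy more closely, which is the sense in which the abstract compares BACON with Softmax.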
2024-10-16 |
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Document Layout Analysis is crucial for real-world document understanding systems, but it encounters a challenging trade-off between speed and accuracy: multimodal methods leveraging both text and visual features achieve higher accuracy but suffer from significant latency, whereas unimodal methods relying solely on visual features offer faster processing speeds at the expense of accuracy. To address this dilemma, we introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages through document-specific optimizations in both pre-training and model design. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm, which frames document synthesis as a two-dimensional bin packing problem, generating the large-scale, diverse DocSynth-300K dataset. Pre-training on the resulting DocSynth-300K dataset significantly improves fine-tuning performance across various document types. In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module that is capable of better handling multi-scale variations of document elements. Furthermore, to validate performance across different document types, we introduce a complex and challenging benchmark named DocStructBench. 0.609 Extensive experiments on downstream datasets demonstrate that DocLayout-YOLO excels in both speed and accuracy. Code, data, and models are available at https://github.com/opendatalab/DocLayout-YOLO. |
2024-10-16 |
Optimization and Application of Cloud-based Deep Learning Architecture for Multi-Source Data Prediction
This study develops a cloud-based deep learning system for early prediction of diabetes, leveraging the distributed computing capabilities of the AWS cloud platform and deep learning technologies to achieve efficient and accurate risk assessment.The system utilizes EC2 p3.8xlarge GPU instances to accelerate model training, reducing training time by 93.2% while maintaining a prediction accuracy of 94.2%.With an automated data processing and model training pipeline built using Apache Airflow, the system can complete end-to-end updates within 18.7 hours.In clinical applications, the system demonstrates a prediction accuracy of 89.8%, sensitivity of 92.3%, and specificity of 95.1%. 0.641Early interventions based on predictions lead to a 37.5% reduction in diabetes incidence among the target population.The system's high performance and scalability provide strong support for large-scale diabetes prevention and management, showcasing significant public health value. |
2024-10-16 |
Constrained Posterior Sampling: Time Series Generation with Hard Constraints
Generating realistic time series samples is crucial for stress-testing models and protecting user privacy by using synthetic data.In engineering and safety-critical applications, these samples must meet certain hard constraints that are domain-specific or naturally imposed by physics or nature.Consider, for example, generating electricity demand patterns with constraints on peak demand times.This can be used to stress-test the functioning of power grids during adverse weather conditions.Existing approaches for generating constrained time series are either not scalable or degrade sample quality.To address these challenges, we introduce Constrained Posterior Sampling (CPS), a diffusion-based sampling algorithm that aims to project the posterior mean estimate into the constraint set after each denoising update.Notably, CPS scales to a large number of constraints (~100) without requiring additional training.We provide theoretical justifications highlighting the impact of our projection step on sampling.Empirically, CPS outperforms state-of-the-art methods in sample quality and similarity to real time series by around 10% and 42%, respectively, on real-world stocks, traffic, and air quality datasets. 0.612 |
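The idea of projecting the clean-sample estimate into the constraint set after each denoising update can be illustrated with a toy loop. This is only a sketch under strong assumptions: `toy_denoiser` is a hypothetical stand-in for a trained diffusion model, the constraint set is a simple box (where Euclidean projection is element-wise clipping), and the noise schedule is invented for illustration. It is not the CPS algorithm as published.

```python
import random


def project_onto_box(x, lower, upper):
    """Euclidean projection onto {lower <= x_i <= upper}: element-wise clipping."""
    return [min(max(v, lower), upper) for v in x]


def toy_denoiser(x, target):
    """Hypothetical denoiser: moves the sample halfway toward a fixed pattern."""
    return [0.5 * (v + t) for v, t in zip(x, target)]


def constrained_sampling(target, lower, upper, steps=10, seed=0):
    """Denoise, project the estimate, then re-noise with a shrinking scale."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        x0_hat = toy_denoiser(x, target)                  # estimate of the clean sample
        x0_proj = project_onto_box(x0_hat, lower, upper)  # enforce the hard constraint
        sigma = 1.0 - (step + 1) / steps                  # assumed schedule, 0 at the last step
        x = [v + sigma * rng.gauss(0.0, 1.0) for v in x0_proj]
    return x


sample = constrained_sampling([0.2, 1.5, 3.0, 0.8], lower=0.0, upper=2.0)
```

Because no noise is added on the final step, the returned sample is exactly the last projected estimate and therefore lies inside the box.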
2024-10-16 |
FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges regarding system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware leading to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, while each edge represents the data dependency between operators. Based on this design, 1) users are allowed to customize any DNN without caring about low-level operator implementation; 2) we enable task scheduling with more fine-grained sub-tasks, offering more optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler to cluster devices with similar bandwidths together and partition the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected with 8 Mbps~10 Gbps networks. Experimental results demonstrate that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence. 0.819 |
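The abstract does not spell out how AdaTopK works, but its name suggests it builds on top-k sparsification, a common way to shrink activations or gradients before sending them over a slow link. The sketch below shows plain top-k compression only; the function names are illustrative, and the adaptive choice of `k` per link that the abstract mentions is not modeled here.

```python
def topk_compress(values, k):
    """Keep the k entries with the largest magnitude as (index, value) pairs."""
    if k < 0:
        raise ValueError("k must be non-negative")
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    kept = sorted(order[:k])  # restore index order for a compact, predictable payload
    return [(i, values[i]) for i in kept]


def topk_decompress(pairs, length):
    """Rebuild a dense vector, filling dropped entries with zeros."""
    dense = [0.0] * length
    for i, v in pairs:
        dense[i] = v
    return dense


payload = topk_compress([0.1, -3.0, 0.5, 2.0], k=2)
restored = topk_decompress(payload, length=4)
```

Only the two largest-magnitude entries survive the round trip; everything else is reconstructed as zero, which is the accuracy-for-bandwidth trade that a per-link adaptive scheme would tune.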
2024-10-16 |
Comparative Analysis of Extrinsic Factors for NER in French
Named entity recognition (NER) is a crucial task that aims to identify structured information, which is often replete with complex, technical terms and a high degree of variability. Accurate and reliable NER can facilitate the extraction and analysis of important information. However, NER for languages other than English is challenging due to limited data availability, as annotating data requires considerable expertise, time, and expense. In this paper, using limited data, we explore various factors, including model structure, corpus annotation scheme, and data augmentation techniques, to improve the performance of an NER model for French. Our experiments demonstrate that these approaches can significantly improve the model's F1 score from the original CRF score of 62.41 to 79.39. 0.629 Our findings suggest that considering different extrinsic factors and combining these techniques is a promising approach for improving NER performance where the size of data is limited. |
2024-10-16 |
JudgeBench: A Benchmark for Evaluating LLM-based Judges
LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models.However, the reliability of LLM-based judges themselves is rarely scrutinized.As LLMs become more advanced, their responses grow more sophisticated, requiring stronger judges to evaluate them.Existing benchmarks primarily focus on a judge's alignment with human preferences, but often fail to account for more challenging tasks where crowdsourced human preference is a poor indicator of factual and logical correctness. 0.696To address this, we propose a novel evaluation framework to objectively evaluate LLM-based judges.Based on this framework, we propose JudgeBench, a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding.JudgeBench leverages a novel pipeline for converting existing difficult datasets into challenging response pairs with preference labels reflecting objective correctness.Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks, with many strong models (e.g., GPT-4o) performing just slightly better than random guessing.Overall, JudgeBench offers a reliable platform for assessing increasingly advanced LLM-based judges.Data and code are available at https://github.com/ScalerLab/JudgeBench . |
2024-10-16 |
Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models
Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios.Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability.However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed.To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multi-modalities.Specifically, we create and evolve two sets of prototypes--textual and visual--to progressively capture more accurate multi-modal representations for target classes during test time.Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities.Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency. 0.737Code is available at https://github.com/zhangce01/DPE-CLIP. |
2024-10-16 |
Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media
Does the People's Republic of China (PRC) interfere with European elections through ethnic Chinese diaspora media?This question forms the basis of an ongoing research project exploring how PRC narratives about European elections are represented in Chinese diaspora media, and thus the objectives of PRC news media manipulation.In order to study diaspora media efficiently and at scale, it is necessary to use techniques derived from quantitative text analysis, such as topic modelling.In this paper, we present a pipeline for studying information dynamics in Chinese media.Firstly, we present KeyNMF, a new approach to static and dynamic topic modelling using transformer-based contextual embedding models.We provide benchmark evaluations to demonstrate that our approach is competitive on a number of Chinese datasets and metrics. 0.748Secondly, we integrate KeyNMF with existing methods for describing information dynamics in complex systems.We apply this pipeline to data from five news sites, focusing on the period of time leading up to the 2024 European parliamentary elections.Our methods and results demonstrate the effectiveness of KeyNMF for studying information dynamics in Chinese media and lay groundwork for further work addressing the broader research questions. |
2024-10-15 |
A Bilevel Optimization Framework for Imbalanced Data Classification
Data rebalancing techniques, including oversampling and undersampling, are a common approach to addressing the challenges of imbalanced data.To tackle unresolved problems related to both oversampling and undersampling, we propose a new undersampling approach that: (i) avoids the pitfalls of noise and overlap caused by synthetic data and (ii) avoids the pitfall of under-fitting caused by random undersampling.Instead of undersampling majority data randomly, our method undersamples datapoints based on their ability to improve model loss.Using improved model loss as a proxy measurement for classification performance, our technique assesses a datapoint's impact on loss and rejects those unable to improve it.In so doing, our approach rejects majority datapoints redundant to datapoints already accepted and, thereby, finds an optimal subset of majority training data for classification.The accept/reject component of our algorithm is motivated by a bilevel optimization problem uniquely formulated to identify the optimal training set we seek.Experimental results show our proposed technique with F1 scores up to 10% higher than state-of-the-art methods. 0.817 |
2024-10-15 |
Athena: Retrieval-augmented Legal Judgment Prediction with Large Language Models
Recently, large language models (LLMs) like ChatGPT, LLaMA, and Claude have prevailed in countless domains, including legal scenarios.With LLMs' rapid technological progress, the development of prompt engineering (PE) as an interface between the LLMs and real-world applications has drawn the attention of all developers.Various PE methods have been proposed to overcome real-world challenges, such as few-shot prompting, chain-of-thought, and retrieval-augmented generation (RAG).However, RAG for legal judgment prediction (LJP) is still underexplored.To address this, we propose "Athena", a novel framework cultivating RAG as a core preprocess component to enhance LLMs' performance on specialized tasks.Athena constructs a knowledge base for accusations, attached with a semantic retrieval mechanism through vectorization.Our experiments show that Athena's overall performance has improved significantly, achieving state-of-the-art results on the CAIL2018 dataset.Our ablation study on the in-context window size parameter further reproduces LLMs' "lost-in-the-middle" phenomenon with a relative positional variation.And with moderate hyper-parameter-tuning, we can achieve at most 95% of accuracy accordingly. 0.709We also study the impact of query rewriting and data distribution, providing possible directions for future research based on former analyses. |
2024-10-15 |
A CLIP-Powered Framework for Robust and Generalizable Data Selection
Large-scale datasets have been pivotal to the advancements of deep learning models in recent years, but training on such large datasets invariably incurs substantial storage and computational overhead.Meanwhile, real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.Data selection has shown promise in identifying the most representative samples from the entire dataset, which aims to minimize the performance gap with reduced training costs.Existing works typically rely on single-modality information to assign importance scores for individual samples, which may lead to inaccurate assessments, especially when dealing with noisy or corrupted samples.To address this limitation, we propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.Specifically, our framework consists of three key modules-dataset adaptation, sample scoring, and selection optimization-that together harness extensive pre-trained multimodal knowledge to comprehensively assess sample influence and optimize the selection results through multi-objective optimization.Extensive experiments demonstrate that our approach consistently outperforms existing state-of-the-art baselines on various benchmark datasets. 0.842Notably, our method effectively removes noisy or damaged samples from the dataset, enabling it to achieve even higher performance with less data. 0.626This indicates that it is not only a way to accelerate training but can also improve overall data quality. |
2024-10-15 |
Scalable Indoor Novel-View Synthesis using Drone-Captured 360 Imagery with 3D Gaussian Splatting
Scene reconstruction and novel-view synthesis for large, complex, multi-story, indoor scenes is a challenging and time-consuming task. Prior methods have utilized drones for data capture and radiance fields for scene reconstruction, both of which present certain challenges. First, in order to capture diverse viewpoints with the drone's front-facing camera, some approaches fly the drone in an unstable zig-zag fashion, which hinders drone-piloting and generates motion blur in the captured data. Second, most radiance field methods do not easily scale to an arbitrarily large number of images. This paper proposes an efficient and scalable pipeline for indoor novel-view synthesis from drone-captured 360 videos using 3D Gaussian Splatting. 360 cameras capture a wide set of viewpoints, allowing for comprehensive scene capture under a simple, straightforward drone trajectory. To scale our method to large scenes, we devise a divide-and-conquer strategy to automatically split the scene into smaller blocks that can be reconstructed individually and in parallel. We also propose a coarse-to-fine alignment strategy to seamlessly match these blocks together to compose the entire scene. Our experiments demonstrate marked improvement in both reconstruction quality, i.e. PSNR and SSIM, and computation time compared to prior approaches. 0.645 |
2024-10-15 |
TraM : Enhancing User Sleep Prediction with Transformer-based Multivariate Time Series Modeling and Machine Learning Ensembles
This paper presents a novel approach that leverages Transformer-based multivariate time series model and Machine Learning Ensembles to predict the quality of human sleep, emotional states, and stress levels.A formula to calculate the labels was developed, and the various models were applied to user data.Time Series Transformer was used for labels where time series characteristics are crucial, while Machine Learning Ensembles were employed for labels requiring comprehensive daily activity statistics.Time Series Transformer excels in capturing the characteristics of time series through pre-training, while Machine Learning Ensembles select machine learning models that meet our categorization criteria.The proposed model, TraM, scored 6.10 out of 10 in experiments, demonstrating superior performance compared to other methodologies. 0.696The code and configuration for the TraM framework are available at: https://github.com/jin-jae/ETRI-Paper-Contest. |
2024-10-15 |
Secure Stateful Aggregation: A Practical Protocol with Applications in Differentially-Private Federated Learning
Recent advances in differentially private federated learning (DPFL) algorithms have found that using correlated noise across the rounds of federated learning (DP-FTRL) yields provably and empirically better accuracy than using independent noise (DP-SGD).While DP-SGD is well-suited to federated learning with a single untrusted central server using lightweight secure aggregation protocols, secure aggregation is not conducive to implementing modern DP-FTRL techniques without assuming a trusted central server.DP-FTRL based approaches have already seen widespread deployment in industry, albeit with a trusted central curator who provides and applies the correlated noise.To realize a fully private, single untrusted server DP-FTRL federated learning protocol, we introduce secure stateful aggregation: a simple append-only data structure that allows for the private storage of aggregate values and reading linear functions of the aggregates.Assuming Ring Learning with Errors, we provide a lightweight and scalable realization of this protocol for high-dimensional data in a new security/resource model, Federated MPC : where a powerful persistent server interacts with weak, ephemeral clients.We observe that secure stateful aggregation suffices for realizing DP-FTRL-based private federated learning: improving DPFL utility guarantees over the state of the art while maintaining privacy with an untrusted central party.Our approach has minimal overhead relative to existing techniques which do not yield comparable utility. 0.637The secure stateful aggregation primitive and the federated MPC paradigm may be of interest for other practical applications. |
2024-10-15 |
Learning from Imperfect Data: Towards Efficient Knowledge Distillation of Autoregressive Language Models for Text-to-SQL
Large Language Models (LLMs) have shown promising performance in text-to-SQL, which involves translating natural language questions into SQL queries.However, current text-to-SQL LLMs are computationally expensive and challenging to deploy in real-world applications, highlighting the importance of compressing them.To achieve this goal, knowledge distillation (KD) is a common approach, which aims to distill the larger teacher model into a smaller student model.While numerous KD methods for autoregressive LLMs have emerged recently, it is still under-explored whether they work well in complex text-to-SQL scenarios.To this end, we conduct a series of analyses and reveal that these KD methods generally fall short in balancing performance and efficiency. 0.741In response to this problem, we propose to improve the KD with Imperfect Data, namely KID, which effectively boosts the performance without introducing much training budget.The core of KID is to efficiently mitigate the training-inference mismatch by simulating the cascading effect of inference in the imperfect training data.Extensive experiments on 5 text-to-SQL benchmarks show that, KID can not only achieve consistent and significant performance gains (up to +5.83% average score) across all model types and sizes, but also effectively improve the training efficiency. |
2024-10-15 |
GS^3: Efficient Relighting with Triple Gaussian Splatting
We present a spatial and angular Gaussian based representation and a triple splatting process, for real-time, high-quality novel lighting-and-view synthesis from multi-view point-lit input images.To describe complex appearance, we employ a Lambertian plus a mixture of angular Gaussians as an effective reflectance function for each spatial Gaussian.To generate self-shadow, we splat all spatial Gaussians towards the light source to obtain shadow values, which are further refined by a small multi-layer perceptron.To compensate for other effects like global illumination, another network is trained to compute and add a per-spatial-Gaussian RGB tuple.The effectiveness of our representation is demonstrated on 30 samples with a wide variation in geometry (from solid to fluffy) and appearance (from translucent to anisotropic), as well as using different forms of input data, including rendered images of synthetic/reconstructed objects, photographs captured with a handheld camera and a flash, or from a professional lightstage.We achieve a training time of 40-70 minutes and a rendering speed of 90 fps on a single commodity GPU.Our results compare favorably with state-of-the-art techniques in terms of quality/performance. 0.778Our code and data are publicly available at https://GSrelight.github.io/. |
2024-10-15 |
Summarized Causal Explanations For Aggregate Views (Full version)
SQL queries with group-by and average are frequently used and plotted as bar charts in several data analysis applications. Understanding the reasons behind the results in such an aggregate view may be a highly non-trivial and time-consuming task, especially for large datasets with multiple attributes. Hence, generating automated explanations for aggregate views can allow users to gain better insights into the results while saving time in data analysis. When providing explanations for such views, it is paramount to ensure that they are succinct yet comprehensive, reveal different types of insights that hold for different aggregate answers in the view, and, most importantly, they reflect reality and arm users to make informed data-driven decisions, i.e., the explanations do not only consider correlations but are causal. In this paper, we present CauSumX, a framework for generating summarized causal explanations for the entire aggregate view. Using background knowledge captured in a causal DAG, CauSumX finds the most effective causal treatments for different groups in the view. We formally define the framework and the optimization problem, study its complexity, and devise an efficient algorithm using the Apriori algorithm, LP rounding, and several optimizations. 0.628 We experimentally show that our system generates useful summarized causal explanations compared to prior work and scales well for large high-dimensional data. |
2024-10-15 |
CoActionGraphRec: Sequential Multi-Interest Recommendations Using Co-Action Graphs
There are unique challenges to developing item recommender systems for e-commerce platforms like eBay due to sparse data and diverse user interests.While rich user-item interactions are important, eBay's data sparsity exceeds other e-commerce sites by an order of magnitude.To address this challenge, we propose CoActionGraphRec (CAGR), a text based two-tower deep learning model (Item Tower and User Tower) utilizing co-action graph layers.In order to enhance user and item representations, a graph-based solution tailored to eBay's environment is utilized.For the Item Tower, we represent each item using its co-action items to capture collaborative signals in a co-action graph that is fully leveraged by the graph neural network component.For the User Tower, we build a fully connected graph of each user's behavior sequence, with edges encoding pairwise relationships.Furthermore, an explicit interaction module learns representations capturing behavior interactions.Extensive offline and online A/B test experiments demonstrate the effectiveness of our proposed approach and results show improved performance over state-of-the-art methods on key metrics. 0.78 |
2024-10-15 |
DeformPAM: Data-Efficient Learning for Long-horizon Deformable Object Manipulation via Preference-based Action Alignment
In recent years, imitation learning has made progress in the field of robotic manipulation.However, it still faces challenges when dealing with complex long-horizon deformable object tasks, such as high-dimensional state spaces, complex dynamics, and multimodal action distributions.Traditional imitation learning methods often require a large amount of data and encounter distributional shifts and accumulative errors in these tasks.To address these issues, we propose a data-efficient general learning framework (DeformPAM) based on preference learning and reward-guided action selection.DeformPAM decomposes long-horizon tasks into multiple action primitives, utilizes 3D point cloud inputs and diffusion models to model action distributions, and trains an implicit reward model using human preference data.During the inference phase, the reward model scores multiple candidate actions, selecting the optimal action for execution, thereby reducing the occurrence of anomalous actions and improving task completion quality.Experiments conducted on three challenging real-world long-horizon deformable object manipulation tasks demonstrate the effectiveness of this method.Results show that DeformPAM improves both task completion quality and efficiency compared to baseline methods even with limited data. 0.615Code and data will be available at https://deform-pam.robotflow.ai. |
2024-10-15 |
Transformer Layer Injection: A Novel Approach for Efficient Upscaling of Large Language Models
In this paper, we propose Transformer Layer Injection (TLI), a novel method for efficiently upscaling large language models (LLMs) while minimizing computational costs and maintaining model performance.Model scale is a key factor in enhancing the quality of machine learning models, and TLI addresses the challenge of scaling by reducing initial loss, minimizing fine-tuning requirements, and preserving model complexity.Our approach improves upon the conventional Depth Up-Scaling (DUS) technique by injecting new layers into every set of K layers, enabling hidden representations to pass through transformer blocks with minimal disruption.We compare TLI with existing approaches, including Mixture of Experts (MoE) and DUS, and validate its efficiency through experiments on small LLMs (LLama3 1B, 3B, and 8B).Results show that TLI achieves better initialization, requires fewer training steps, and delivers superior accuracy on tasks such as KoBEST and KMCQA, with models performing effectively even without additional training.TLI is demonstrated to be both data-efficient and cost-effective, significantly outperforming existing methods. 0.632Its scalability and simplicity make it a promising solution for upscaling transformer-based models, with potential applications in scaling models from 10B to 405B parameters. |
2024-10-15 |
A Survey of Low-shot Vision-Language Model Adaptation via Representer Theorem
The advent of pre-trained vision-language foundation models has revolutionized the field of zero/few-shot (i.e., low-shot) image recognition. The key challenge to address under the condition of limited training data is how to fine-tune pre-trained vision-language models in a parameter-efficient manner. Numerous approaches tackling this challenge have previously been proposed. Meanwhile, a few survey papers have also been published to summarize these works. However, a unified computational framework that integrates existing methods, identifies their nature, and supports in-depth comparison is still lacking. As such, this survey paper first proposes a unified computational framework from the perspective of the Representer Theorem and then derives many of the existing methods by specializing this framework. Thereafter, a comparative analysis is conducted to uncover the differences and relationships between existing methods. Based on the analyses, some possible variants to improve the existing works are presented. As a demonstration, we extend existing methods by modeling inter-class correlation between representers in reproducing kernel Hilbert space (RKHS), which is implemented by exploiting the closed-form solution of kernel ridge regression. Extensive experiments on 11 datasets are conducted to validate the effectiveness of this method. 0.737 Toward the end of this paper, we discuss the limitations and provide further research directions. |
2024-10-15 |
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
Large Language Models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Recently, many tool-use benchmark datasets have been proposed. 0.754 However, existing datasets have the following limitations: (1) insufficient evaluation scenarios (e.g., only cover limited tool-use scenes) and (2) extensive evaluation costs (e.g., GPT API costs). To address these limitations, in this work, we propose a multi-granularity tool-use benchmark for large language models called MTU-Bench. For the "multi-granularity" property, our MTU-Bench covers five tool usage scenes (i.e., single-turn and single-tool, single-turn and multiple-tool, multiple-turn and single-tool, multiple-turn and multiple-tool, and out-of-distribution tasks). Besides, all evaluation metrics of our MTU-Bench are based on the prediction results and the ground truth without using any GPT or human evaluation metrics. Moreover, our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool usage scenarios, and we also propose an instruction dataset called MTU-Instruct data to enhance the tool-use abilities of existing LLMs. Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench. Code and data will be released at https://github.com/MTU-Bench-Team/MTU-Bench.git. |
2024-10-15 |
RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation
The emergence of Segment Anything (SAM) sparked research interest in the field of interactive segmentation, especially in the context of image editing tasks and speeding up data annotation. Unlike common semantic segmentation, interactive segmentation methods allow users to directly influence their output through prompts (e.g. clicks). However, click patterns in real-world interactive segmentation scenarios remain largely unexplored. Most methods rely on the assumption that users would click in the center of the largest erroneous area. Nevertheless, recent studies show that this is not always the case. Thus, methods may have poor performance in real-world deployment despite high metrics in a baseline benchmark. 0.703 To accurately simulate real-user clicks, we conducted a large crowdsourcing study of click patterns in an interactive segmentation scenario and collected 475K real-user clicks. Drawing on ideas from saliency tasks, we develop a clickability model that enables sampling clicks that closely resemble actual user inputs. Using our model and dataset, we propose the RClicks benchmark for a comprehensive comparison of existing interactive segmentation methods on realistic clicks. Specifically, we evaluate not only the average quality of methods, but also the robustness w.r.t. click patterns. 0.693 According to our benchmark, in real-world usage interactive segmentation models may perform worse than has been reported in the baseline benchmark, and most of the methods are not robust. We believe that RClicks is a significant step towards creating interactive segmentation methods that provide the best user experience in real-world cases. |
2024-10-15 |
DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate.Prevalent methods usually organize predicted tokens as independent chains or fixed token trees, which fails to generalize to diverse query distributions.In this paper, we propose DySpec, a faster speculative decoding algorithm with a novel dynamic token tree structure.We begin by bridging the draft distribution and acceptance rate from intuitive and empirical clues, and successfully show that the two variables are strongly correlated.Based on this, we employ a greedy strategy to dynamically expand the token tree at run time.Theoretically, we show that our method can achieve optimal results under mild assumptions. 0.705Empirically, DySpec yields a higher acceptance rate and speedup than fixed trees.DySpec can drastically improve the throughput and reduce the latency of token generation across various data distribution and model sizes, which significantly outperforms strong competitors, including Specinfer and Sequoia.Under low temperature setting, DySpec can improve the throughput up to 9.1$\times$ and reduce the latency up to 9.4$\times$ on Llama2-70B. Under high temperature setting, DySpec can also improve the throughput up to 6.21$\times$, despite the increasing difficulty of speculating more than one token per step for draft model. |
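One plausible reading of "greedily expanding the token tree from the draft distribution" is a best-first search that always adds the candidate path with the highest estimated probability. The sketch below illustrates that reading only; it is not DySpec's published algorithm. `toy_draft_distribution` is a hypothetical stand-in for a draft model, and using the product of draft probabilities along a path as the acceptance proxy is an assumption.

```python
import heapq


def toy_draft_distribution(prefix):
    """Hypothetical draft model: the same two-token distribution for any prefix."""
    return {"a": 0.6, "b": 0.4}


def build_token_tree(draft_fn, budget):
    """Best-first expansion: repeatedly add the highest-probability candidate path.

    Returns the selected paths (tuples of tokens) in the order they were added.
    """
    tree = []
    heap = []  # entries are (-path_probability, path); heapq is a min-heap
    for token, p in draft_fn(()).items():
        heapq.heappush(heap, (-p, (token,)))

    while heap and len(tree) < budget:
        neg_prob, path = heapq.heappop(heap)
        tree.append(path)
        # Children inherit the parent's path probability times their own.
        for token, p in draft_fn(path).items():
            heapq.heappush(heap, (neg_prob * p, path + (token,)))
    return tree


tree = build_token_tree(toy_draft_distribution, budget=3)
```

With a budget of three nodes, the tree takes both first-level tokens before descending, because the second-level path `("a", "a")` at 0.36 ranks below `("b",)` at 0.4. A fixed-shape tree would make that choice ahead of time regardless of the distribution.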
2024-10-15 |
LoSAM: Local Search in Additive Noise Models with Unmeasured Confounders, a Top-Down Global Discovery Approach
We address the challenge of causal discovery in structural equation models with additive noise without imposing additional assumptions on the underlying data-generating process.We introduce local search in additive noise model (LoSAM), which generalizes an existing nonlinear method that leverages local causal substructures to the general additive noise setting, allowing for both linear and nonlinear causal mechanisms.We show that LoSAM achieves polynomial runtime, and improves runtime and efficiency by exploiting new substructures to minimize the conditioning set at each step.Further, we introduce a variant of LoSAM, LoSAM-UC, that is robust to unmeasured confounding among roots, a property that is often not satisfied by functional-causal-model-based methods.We numerically demonstrate the utility of LoSAM, showing that it outperforms existing benchmarks. 0.614 |
2024-10-15 |
FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting
Time Series Forecasting (TSF) is a key functionality in numerous fields, including finance, weather services, and energy management. While TSF methods are emerging these days, many of them require domain-specific data collection and model training and struggle with poor generalization performance on new domains. Foundation models aim to overcome this limitation. Pre-trained on large-scale language or time series data, they exhibit promising inference capabilities on new or unseen data. This has spurred a surge in new TSF foundation models. We propose a new benchmark, FoundTS, to enable thorough and fair evaluation and comparison of such models. 0.783 FoundTS covers a variety of TSF foundation models, including those based on large language models and those pretrained on time series. Next, FoundTS supports different forecasting strategies, including zero-shot, few-shot, and full-shot, thereby facilitating more thorough evaluations. Finally, FoundTS offers a pipeline that standardizes evaluation processes such as dataset splitting, loading, normalization, and few-shot sampling, thereby facilitating fair evaluations. Building on this, we report on an extensive evaluation of TSF foundation models on a broad range of datasets from diverse domains and with different statistical characteristics. Specifically, we identify pros and cons and inherent limitations of existing foundation models, and we identify directions for future model design. We make our code and datasets available at https://anonymous.4open.science/r/FoundTS-C2B0. |
2024-10-14 |
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models, we propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. 0.717 This significantly reduces the overall evaluation cost. 0.645 We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (<2.5%). 0.625 Our dataset is available online on HuggingFace, and our code will be available here. |
2024-10-14 |
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods inevitably suffer from loss of image information during understanding tasks, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss. Differing from diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone model by employing a light-weight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation, the backbone model's hidden representation of the image is not limited to the last denoising step. To successfully train our method, we also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals. Through extensive evaluations on 18 image understanding benchmarks, MMAR demonstrates far superior performance to other joint multi-modal models, matching the method that employs a pretrained CLIP vision encoder, while also being able to generate high-quality images. We also show that our method is scalable with larger data and model size. 0.669 |
Developer Research |
|
2024-10-03 |
Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
Software engineers mainly write code by editing existing programs. 0.682In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass.One explanation for this is the scarcity of open-sourced edit data.While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer.To fill this gap, we develop a synthetic data generation algorithm called LintSeq.This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs.It outputs edit sequences as text strings consisting of consecutive program diffs.To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples.Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the re-factored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks.We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines.This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems "pass@k" solved by any attempt given "k" tries.For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score.Finally, we also pretrain our own tiny LMs for code understanding.We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class.Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode. |
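For readers unfamiliar with the "pass@k" metric used above, the sketch below computes it with the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The abstract does not say which estimator the authors use, so treat this as an illustration of the metric, not of their evaluation code; the function names are made up for this example.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; `results` holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Two hypothetical problems: one never solved, one solved in 1 of 2 samples.
print(benchmark_pass_at_k([(2, 0), (2, 1)], k=1))
```

Because the estimator rewards any correct sample among the k, models that produce more diverse programs can gain more from larger k, which is the inference-time scaling effect the abstract describes.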
Data Annotation Techniques |
|
Causality Research |
|
2024-10-16 |
ExoTST: Exogenous-Aware Temporal Sequence Transformer for Time Series Prediction
Accurate long-term predictions are the foundations for many machine learning applications and decision-making processes. Traditional time series approaches for prediction often focus on either autoregressive modeling, which relies solely on past observations of the target "endogenous variables", or forward modeling, which considers only current covariate drivers "exogenous variables". 0.537 However, effectively integrating past endogenous and past exogenous with current exogenous variables remains a significant challenge. In this paper, we propose ExoTST, a novel transformer-based framework that effectively incorporates current exogenous variables alongside past context for improved time series prediction. To integrate exogenous information efficiently, ExoTST leverages the strengths of attention mechanisms and introduces a novel cross-temporal modality fusion module. This module enables the model to jointly learn from both past and current exogenous series, treating them as distinct modalities. By considering these series separately, ExoTST provides robustness and flexibility in handling data uncertainties that arise from the inherent distribution shift between historical and current exogenous variables. Extensive experiments on real-world carbon flux datasets and time series benchmarks demonstrate ExoTST's superior performance compared to state-of-the-art baselines, with improvements of up to 10% in prediction accuracy. Moreover, ExoTST exhibits strong robustness against missing values and noise in exogenous drivers, maintaining consistent performance in real-world situations where these imperfections are common. |
2024-10-16 |
SAT: Data-light Uncertainty Set Merging via Synthetics, Aggregation, and Test Inversion
The integration of uncertainty sets has diverse applications but also presents challenges, particularly when only initial sets and their control levels are available, along with potential dependencies.Examples include merging confidence sets from different distributed sites with communication constraints, as well as combining conformal prediction sets generated by different learning algorithms or data splits.In this article, we introduce an efficient and flexible Synthetic, Aggregation, and Test inversion (SAT) approach to merge various potentially dependent uncertainty sets into a single set.The proposed method constructs a novel class of synthetic test statistics, aggregates them, and then derives merged sets through test inversion.Our approach leverages the duality between set estimation and hypothesis testing, ensuring reliable coverage in dependent scenarios. 0.504The procedure is data-light, meaning it relies solely on initial sets and control levels without requiring raw data, and it adapts to any user-specified initial uncertainty sets, accommodating potentially varying coverage levels.Theoretical analyses and numerical experiments confirm that SAT provides finite-sample coverage guarantees and achieves small set sizes. |
2024-10-16 |
Abnormality Forecasting: Time Series Anomaly Prediction via Future Context Modeling
Identifying anomalies from time series data plays an important role in various fields such as infrastructure security, intelligent operation and maintenance, and space exploration. Current research focuses on detecting the anomalies after they occur, which can lead to significant financial/reputation loss or infrastructure damage. 0.509 In this work we instead study a more practical yet very challenging problem, time series anomaly prediction, aiming at providing early warnings for abnormal events before their occurrence. To tackle this problem, we introduce a novel principled approach, namely future context modeling (FCM). Its key insight is that the future abnormal events in a target window can be accurately predicted if their preceding observation window exhibits any subtle difference from normal data. 0.557 To effectively capture such differences, FCM first leverages long-term forecasting models to generate a discriminative future context based on the observation data, aiming to amplify those subtle but unusual differences. It then models a normality correlation of the observation data with the forecasted future context to complement the normality modeling of the observation data in foreseeing possible abnormality in the target window. 0.52 A joint variate-time attention learning is also introduced in FCM to leverage both temporal signals and features of the time series data for more discriminative normality modeling in the aforementioned two views. Comprehensive experiments on five datasets demonstrate that FCM achieves a good recall rate (70%+) on multiple datasets and significantly outperforms all baselines in F1 score. Code is available at https://github.com/mala-lab/FCM. |
2024-10-16 |
Causally-Aware Unsupervised Feature Selection Learning
Unsupervised feature selection (UFS) has recently gained attention for its effectiveness in processing unlabeled high-dimensional data.However, existing methods overlook the intrinsic causal mechanisms within the data, resulting in the selection of irrelevant features and poor interpretability. 0.787Additionally, previous graph-based methods fail to account for the differing impacts of non-causal and causal features in constructing the similarity graph, which leads to false links in the generated graph. 0.814To address these issues, a novel UFS method, called Causally-Aware UnSupErvised Feature Selection learning (CAUSE-FS), is proposed. 0.713CAUSE-FS introduces a novel causal regularizer that reweights samples to balance the confounding distribution of each treatment feature. 0.765This regularizer is subsequently integrated into a generalized unsupervised spectral regression model to mitigate spurious associations between features and clustering labels, thus achieving causal feature selection. 0.618Furthermore, CAUSE-FS employs causality-guided hierarchical clustering to partition features with varying causal contributions into multiple granularities. 0.794By integrating similarity graphs learned adaptively at different granularities, CAUSE-FS increases the importance of causal features when constructing the fused similarity graph to capture the reliable local structure of data. 0.807Extensive experimental results demonstrate the superiority of CAUSE-FS over state-of-the-art methods, with its interpretability further validated through feature visualization. 0.602 |
2024-10-16 |
Off-dynamics Conditional Diffusion Planners
Offline Reinforcement Learning (RL) offers an attractive alternative to interactive data acquisition by leveraging pre-existing datasets.However, its effectiveness hinges on the quantity and quality of the data samples.This work explores the use of more readily available, albeit off-dynamics datasets, to address the challenge of data scarcity in Offline RL.We propose a novel approach using conditional Diffusion Probabilistic Models (DPMs) to learn the joint distribution of the large-scale off-dynamics dataset and the limited target dataset.To enable the model to capture the underlying dynamics structure, we introduce two contexts for the conditional model: (1) a continuous dynamics score allows for partial overlap between trajectories from both datasets, providing the model with richer information; (2) an inverse-dynamics context guides the model to generate trajectories that adhere to the target environment's dynamic constraints. 0.523Empirical results demonstrate that our method significantly outperforms several strong baselines.Ablation studies further reveal the critical role of each dynamics context.Additionally, our model demonstrates that by modifying the context, we can interpolate between source and target dynamics, making it more robust to subtle shifts in the environment. |
2024-10-16 |
Multi-Cause Deconfounding for Recommender Systems with Latent Confounders
In recommender systems, various latent confounding factors (e.g., user social environment and item public attractiveness) can affect user behavior, item exposure, and feedback in distinct ways.These factors may directly or indirectly impact user feedback and are often shared across items or users, making them multi-cause latent confounders. 0.551However, existing methods typically fail to account for latent confounders between users and their feedback, as well as those between items and user feedback simultaneously.To address the problem of multi-cause latent confounders, we propose a multi-cause deconfounding method for recommender systems with latent confounders (MCDCF).MCDCF leverages multi-cause causal effect estimation to learn substitutes for latent confounders associated with both users and items, using user behaviour data. 0.717Specifically, MCDCF treats the multiple items that users interact with and the multiple users that interact with items as treatment variables, enabling it to learn substitutes for the latent confounders that influence the estimation of causality between users and their feedback, as well as between items and user feedback. 0.572Additionally, we theoretically demonstrate the soundness of our MCDCF method.Extensive experiments on three real-world datasets demonstrate that our MCDCF method effectively recovers latent confounders related to users and items, reducing bias and thereby improving recommendation accuracy. |
2024-10-16 |
Mitigating Dual Latent Confounding Biases in Recommender Systems
Recommender systems are extensively utilised across various areas to predict user preferences for personalised experiences and enhanced user engagement and satisfaction.Traditional recommender systems, however, are complicated by confounding bias, particularly in the presence of latent confounders that affect both item exposure and user feedback.Existing debiasing methods often fail to capture the complex interactions caused by latent confounders in interaction data, especially when dual latent confounders affect both the user and item sides. 0.512To address this, we propose a novel debiasing method that jointly integrates the Instrumental Variables (IV) approach and identifiable Variational Auto-Encoder (iVAE) for Debiased representation learning in Recommendation systems, referred to as IViDR.Specifically, IViDR leverages the embeddings of user features as IVs to address confounding bias caused by latent confounders between items and user feedback, and reconstructs the embedding of items to obtain debiased interaction data.Moreover, IViDR employs an Identifiable Variational Auto-Encoder (iVAE) to infer identifiable representations of latent confounders between item exposure and user feedback from both the original and debiased interaction data.Additionally, we provide theoretical analyses of the soundness of using IV and the identifiability of the latent representations.Extensive experiments on both synthetic and real-world datasets demonstrate that IViDR outperforms state-of-the-art models in reducing bias and providing reliable recommendations. |
2024-10-16 |
On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs
Evidence-enhanced detectors present remarkable abilities in identifying malicious social text with related evidence. However, the rise of large language models (LLMs) brings potential risks of evidence pollution to confuse detectors. This paper explores how to manipulate evidence, simulating potential misuse scenarios including basic pollution, and rephrasing or generating evidence by LLMs. To mitigate its negative impact, we propose three defense strategies from both the data and model sides, including machine-generated text detection, a mixture of experts, and parameter updating. Extensive experiments on four malicious social text detection tasks with ten datasets show that evidence pollution, especially the generate strategy, significantly compromises existing detectors. On the other hand, the defense strategies could mitigate evidence pollution, but they face limitations in practical deployment, such as the need for annotated data and huge inference costs. 0.514 Further analysis illustrates that polluted evidence is of high quality, would compromise the model calibration, and could be ensembled to amplify the negative impact. |
2024-10-16 |
Federated Learning and Free-riding in a Competitive Market
Federated learning (FL) is a collaborative technique for training large-scale models while protecting user data privacy. Despite its substantial benefits, free-riding behavior poses a major challenge for the formation of FL, especially in competitive markets. Our paper explores the under-explored issue of how free-riding behavior in a competitive market affects firms' incentives to form FL. Competing firms can improve their technologies by forming FL to increase the performance of their products, which in turn affects consumers' product selection and market size. The key complication is whether free-riding behavior discourages information contribution by participating firms and results in the breakdown of FL. Even if free riding does not discourage information contribution, this does not necessarily mean that a firm wants to form FL in a competitive market, because free riding may reshape the competitive position of each participating firm and thus forming FL may not be profitable. We build a parsimonious game-theoretical model that captures these interactions, and our analyses yield several new findings. First, even in the presence of free-riding behavior, competing firms under FL find it optimal to contribute all of their available information. 0.502 Second, the firm with less information always finds it profitable to free ride, whereas its rival (with more information) has an incentive to form FL only when the level of competition or the gap in information volume is not high. Third, when FL is formed, there exists an "All-Win" situation in which all stakeholders (participating firms, consumers, and the social planner) benefit. Last, subsidizing by the free-riding firm can align its rival's incentive to form FL only when the level of competition is intermediate. |
2024-10-15 |
RATE: Score Reward Models with Imperfect Rewrites of Rewrites
This paper concerns the evaluation of reward models used in language modeling.A reward model is a function that takes a prompt and a response and assigns a score indicating how good that response is for the prompt.A key challenge is that reward models are usually imperfect proxies for actual preferences.For example, we may worry that a model trained to reward helpfulness learns to instead prefer longer responses.In this paper, we develop an evaluation method, RATE (Rewrite-based Attribute Treatment Estimators), that allows us to measure the causal effect of a given attribute of a response (e.g., length) on the reward assigned to that response. 0.565The core idea is to use large language models to rewrite responses to produce imperfect counterfactuals, and to adjust for rewriting error by rewriting twice.We show that the RATE estimator is consistent under reasonable assumptions.We demonstrate the effectiveness of RATE on synthetic and real-world data, showing that it can accurately estimate the effect of a given attribute on the reward model. |
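A toy sketch may help show why rewriting twice can cancel rewriting error. It assumes one plausible reading of the abstract: the attribute effect is estimated by comparing a rewrite of a rewrite against the single rewrite, so both sides carry the same rewriting side effects. The abstract does not spell out the estimator, and everything here (`Response`, `toy_rewrite`, `toy_reward`) is hypothetical and deliberately simplistic.

```python
from dataclasses import dataclass, replace
from statistics import mean
from typing import List


@dataclass(frozen=True)
class Response:
    long: bool       # the attribute whose effect on reward we want
    rewritten: bool  # side effect that any rewrite leaves behind


def toy_rewrite(r: Response, long: bool) -> Response:
    """Stand-in for an LLM rewrite: sets the attribute, but also changes style."""
    return replace(r, long=long, rewritten=True)


def toy_reward(r: Response) -> float:
    """Stand-in reward model: true length effect is 2.0, rewrite style adds 1.0."""
    return 2.0 * r.long + 1.0 * r.rewritten


def naive_effect(originals: List[Response]) -> float:
    # Original vs. single rewrite: confounded by the rewrite's side effect.
    return mean(toy_reward(x) - toy_reward(toy_rewrite(x, False)) for x in originals)


def double_rewrite_effect(originals: List[Response]) -> float:
    # Rewrite-of-rewrite vs. rewrite: both sides share the side effect.
    diffs = []
    for x in originals:
        once = toy_rewrite(x, False)
        twice = toy_rewrite(once, True)
        diffs.append(toy_reward(twice) - toy_reward(once))
    return mean(diffs)


data = [Response(long=True, rewritten=False) for _ in range(5)]
print(naive_effect(data), double_rewrite_effect(data))
```

In this toy setup the naive comparison underestimates the length effect because the rewrite itself shifts the reward, while the double-rewrite comparison recovers the value built into `toy_reward`.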
2024-10-15 |
PhysioFormer: Integrating Multimodal Physiological Signals and Symbolic Regression for Explainable Affective State Prediction
Most affective computing tasks still rely heavily on traditional methods, with few deep learning models applied, particularly in multimodal signal processing. Given the importance of stress monitoring for mental health, developing a highly reliable and accurate affective computing model is essential. In this context, we propose a novel model, PhysioFormer, for affective state prediction using physiological signals. The PhysioFormer model integrates individual attributes and multimodal physiological data to address interindividual variability, enhancing its reliability and generalization across different individuals. By incorporating feature embedding and affective representation modules, the PhysioFormer model captures dynamic changes in time-series data and multimodal signal features, significantly improving accuracy. The model also includes an explainability model that uses symbolic regression to extract laws linking physiological signals to affective states, increasing transparency and explainability. 0.501 Experiments conducted on the Wrist and Chest subsets of the WESAD dataset confirmed the model's superior performance, achieving over 99% accuracy and outperforming existing SOTA models. Sensitivity and ablation experiments further demonstrated PhysioFormer's reliability, validating the contribution of its individual components. The integration of symbolic regression not only enhanced model explainability but also highlighted the complex relationships between physiological signals and affective states. Future work will focus on optimizing the model for larger datasets and real-time applications, particularly in more complex environments. Additionally, further exploration of physiological signals and environmental factors will help build a more comprehensive affective computing system, advancing its use in health monitoring and psychological intervention. |
2024-10-15 |
Do LLMs Have the Generalization Ability in Conducting Causal Inference?
In causal inference, generalization capability refers to the ability to apply causal inference methods to new data to estimate the causal effects between unknown phenomena, which is crucial for expanding the boundaries of knowledge. 0.806 Studies have evaluated the causal inference capabilities of Large Language Models (LLMs) concerning known phenomena, yet the generalization capabilities of LLMs concerning unseen phenomena remain unexplored. 0.576 In this paper, we selected four tasks: Causal Path Discovery (CP), Backdoor Adjustment (BA), Factual Inference (FI), and Counterfactual Inference (CI) as representatives of causal inference tasks. 0.8 To generate evaluation questions about previously unseen phenomena in new data on the four tasks, we propose a benchmark generation framework, which employs randomly generated graphs and node names to formulate questions within hypothetical new causal scenarios. 0.608 Based on this framework, we compile a benchmark dataset of varying levels of question complexity. We extensively tested the generalization capabilities of five leading LLMs across four tasks. Experimental results reveal that while LLMs exhibit good generalization performance in solving simple CP, FI, and complex CI questions, they encounter difficulties when tackling BA questions and face obvious performance fluctuations as the problem complexity changes. Furthermore, when the names of phenomena incorporate existing terms, even if these names are entirely novel, their generalization performance can still be hindered by interference from familiar terms. |
2024-10-15 |
Summarized Causal Explanations For Aggregate Views (Full version)
SQL queries with group-by and average are frequently used and plotted as bar charts in several data analysis applications. Understanding the reasons behind the results in such an aggregate view may be a highly non-trivial and time-consuming task, especially for large datasets with multiple attributes. Hence, generating automated explanations for aggregate views can allow users to gain better insights into the results while saving time in data analysis. When providing explanations for such views, it is paramount to ensure that they are succinct yet comprehensive, reveal different types of insights that hold for different aggregate answers in the view, and, most importantly, that they reflect reality and equip users to make informed data-driven decisions, i.e., the explanations do not only consider correlations but are causal. 0.74 In this paper, we present CauSumX, a framework for generating summarized causal explanations for the entire aggregate view. 0.751 Using background knowledge captured in a causal DAG, CauSumX finds the most effective causal treatments for different groups in the view. 0.772 We formally define the framework and the optimization problem, study its complexity, and devise an efficient algorithm using the Apriori algorithm, LP rounding, and several optimizations. We experimentally show that our system generates useful summarized causal explanations compared to prior work and scales well for large high-dimensional data. 0.748 |
2024-10-15 |
AIC CTU system at AVeriTeC: Re-framing automated fact-checking as a simple RAG task
This paper describes our $3^{rd}$ place submission in the AVeriTeC shared task in which we attempted to address the challenge of fact-checking with evidence retrieved in the wild using a simple scheme of Retrieval-Augmented Generation (RAG) designed for the task, leveraging the predictive power of Large Language Models.We release our codebase and explain its two modules - the Retriever and the Evidence & Label generator - in detail, justifying their features such as MMR-reranking and Likert-scale confidence estimation.We evaluate our solution on AVeriTeC dev and test set and interpret the results, picking the GPT-4o as the most appropriate model for our pipeline at the time of our publication, with Llama 3.1 70B being a promising open-source alternative.We perform an empirical error analysis to see that faults in our predictions often coincide with noise in the data or ambiguous fact-checks, provoking further research and data augmentation. 0.508 |
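The MMR-reranking mentioned above refers to maximal marginal relevance, which trades off relevance to the query against redundancy with items already picked. The sketch below shows the generic technique on plain vectors; the embedding model, the value of the trade-off weight `lam`, and how the AIC CTU system wires this into its retriever are not given in the abstract, so those parts are assumptions.

```python
from math import sqrt
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_rerank(
    query: Sequence[float], docs: List[Sequence[float]], k: int, lam: float = 0.5
) -> List[int]:
    """Return indices of up to k documents chosen by maximal marginal relevance.
    lam=1.0 ranks purely by relevance; lower values favour diversity."""
    selected: List[int] = []
    remaining = list(range(len(docs)))

    def score(i: int) -> float:
        relevance = cosine(query, docs[i])
        # Similarity to the closest already-selected document (0 if none yet).
        redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [1.0, 0.1], [0.5, 0.5]]
print(mmr_rerank(query, docs, k=3, lam=1.0))
print(mmr_rerank(query, docs, k=3, lam=0.3))
```

With a low `lam`, the near-duplicate second document is pushed behind the less similar third one, which is the behaviour that makes MMR useful for assembling varied evidence.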
2024-10-15 |
The Age of DDoScovery: An Empirical Comparison of Industry and Academic DDoS Assessments
Motivated by the impressive but diffuse scope of DDoS research and reporting, we undertake a multistakeholder (joint industry-academic) analysis to seek convergence across the best available macroscopic views of the relative trends in two dominant classes of attacks - direct-path attacks and reflection-amplification attacks.We first analyze 24 industry reports to extract trends and (in)consistencies across observations by commercial stakeholders in 2022.We then analyze ten data sets spanning industry and academic sources, across four years (2019-2023), to find and explain discrepancies based on data sources, vantage points, methods, and parameters. 0.558Our method includes a new approach: we share an aggregated list of DDoS targets with industry players who return the results of joining this list with their proprietary data sources to reveal gaps in visibility of the academic data sources.We use academic data sources to explore an industry-reported relative drop in spoofed reflection-amplification attacks in 2021-2022.Our study illustrates the value, but also the challenge, in independent validation of security-related properties of Internet infrastructure.Finally, we reflect on opportunities to facilitate greater common understanding of the DDoS landscape.We hope our results inform not only future academic and industry pursuits but also emerging policy efforts to reduce systemic Internet security vulnerabilities. |
2024-10-15 |
LoSAM: Local Search in Additive Noise Models with Unmeasured Confounders, a Top-Down Global Discovery Approach
We address the challenge of causal discovery in structural equation models with additive noise without imposing additional assumptions on the underlying data-generating process. 0.831 We introduce local search in additive noise model (LoSAM), which generalizes an existing nonlinear method that leverages local causal substructures to the general additive noise setting, allowing for both linear and nonlinear causal mechanisms. 0.655 We show that LoSAM achieves polynomial runtime, and improves runtime and efficiency by exploiting new substructures to minimize the conditioning set at each step. Further, we introduce a variant of LoSAM, LoSAM-UC, that is robust to unmeasured confounding among roots, a property that is often not satisfied by functional-causal-model-based methods. 0.666 We numerically demonstrate the utility of LoSAM, showing that it outperforms existing benchmarks. |
2024-10-14 |
A Practical Approach to Causal Inference over Time
In this paper, we focus on estimating the causal effect of an intervention over time on a dynamical system. 0.828To that end, we formally define causal interventions and their effects over time on discrete-time stochastic processes (DSPs). 0.802Then, we show under which conditions the equilibrium states of a DSP, both before and after a causal intervention, can be captured by a structural causal model (SCM). 0.803With such an equivalence at hand, we provide an explicit mapping from vector autoregressive models (VARs), broadly applied in econometrics, to linear, but potentially cyclic and/or affected by unmeasured confounders, SCMs. 0.518The resulting causal VAR framework allows us to perform causal inference over time from observational time series data. 0.808Our experiments on synthetic and real-world datasets show that the proposed framework achieves strong performance in terms of observational forecasting while enabling accurate estimation of the causal effect of interventions on dynamical systems. 0.699We demonstrate, through a case study, the potential practical questions that can be addressed using the proposed causal VAR framework. 0.764 |
2024-10-14 |
Causal Modeling of Climate Activism on Reddit
Climate activism is crucial in stimulating collective societal and behavioral change towards sustainable practices through political pressure.Although multiple factors contribute to the participation in activism, their complex relationships and the scarcity of data on their interactions have restricted most prior research to studying them in isolation, thus preventing the development of a quantitative, causal understanding of why people approach activism.In this work, we develop a comprehensive causal model of how and why Reddit users engage with activist communities driving mass climate protests (mainly the 2019 Earth Strike, Fridays for Future, and Extinction Rebellion). 0.526Our framework, based on Stochastic Variational Inference applied to Bayesian Networks, learns the causal pathways over multiple time periods. 0.749Distinct from previous studies, our approach uses large-scale and fine-grained longitudinal data (2016 to 2022) to jointly model the roles of sociodemographic makeup, experience of extreme weather events, exposure to climate-related news, and social influence through online interactions.We find that among users interested in climate change, participation in online activist communities is indeed influenced by direct interactions with activists and largely by recent exposure to media coverage of climate protests.Among people aware of climate change, left-leaning people from lower socioeconomic backgrounds are particularly represented in online activist groups.Our findings offer empirical validation for theories of media influence and critical mass, and lay the foundations to inform interventions and future studies to foster public participation in collective action. |
2024-10-14 |
Modeling News Interactions and Influence for Financial Market Prediction
The diffusion of financial news into market prices is a complex process, making it challenging to evaluate the connections between news events and market movements. 0.518This paper introduces FININ (Financial Interconnected News Influence Network), a novel market prediction model that captures not only the links between news and prices but also the interactions among news items themselves. 0.514FININ effectively integrates multi-modal information from both market data and news articles.We conduct extensive experiments on two datasets, encompassing the S&P 500 and NASDAQ 100 indices over a 15-year period and over 2.7 million news articles.The results demonstrate FININ's effectiveness, outperforming advanced market prediction models with an improvement of 0.429 and 0.341 in the daily Sharpe ratio for the two markets respectively.Moreover, our results reveal insights into the financial news, including the delayed market pricing of news, the long memory effect of news, and the limitations of financial sentiment analysis in fully extracting predictive power from news data. |
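The abstract above reports gains in the daily Sharpe ratio but does not define the metric. The sketch below uses the common textbook definition (mean excess return divided by its sample standard deviation). This definition, the function name, and the `risk_free` parameter are assumptions here, and the paper may compute the metric differently.

```python
from statistics import mean, stdev


def daily_sharpe(returns, risk_free=0.0):
    """Mean daily excess return divided by its sample standard deviation.

    Needs at least two returns, and the returns must not all be equal
    (otherwise the standard deviation is zero).
    """
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)


# Hypothetical daily returns, purely for illustration.
print(daily_sharpe([0.01, 0.02, 0.03]))
```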
2024-10-14 |
A Simple Baseline for Predicting Events with Auto-Regressive Tabular Transformers
Many real-world applications of tabular data involve using historic events to predict properties of new ones, for example whether a credit card transaction is fraudulent or what rating a customer will assign a product on a retail platform. 0.524Existing approaches to event prediction include costly, brittle, and application-dependent techniques such as time-aware positional embeddings, learned row and field encodings, and oversampling methods for addressing class imbalance.Moreover, these approaches often assume specific use-cases, for example that we know the labels of all historic events or that we only predict a pre-specified label and not the data's features themselves. 0.538In this work, we propose a simple but flexible baseline using standard autoregressive LLM-style transformers with elementary positional embeddings and a causal language modeling objective.Our baseline outperforms existing approaches across popular datasets and can be employed for various use-cases.We demonstrate that the same model can predict labels, impute missing values, or model event sequences. |
2024-10-14 |
NT-LLM: A Novel Node Tokenizer for Integrating Graph Structure into Large Language Models
Graphs are a fundamental data structure for representing relationships in real-world scenarios. 0.541With the success of Large Language Models (LLMs) across various natural language processing (NLP) tasks, there has been growing interest in integrating LLMs for graph learning.However, applying LLMs to graph-related tasks poses significant challenges, as these models are not inherently designed to capture the complex structural information present in graphs.Existing approaches address this challenge through two strategies: the chain of tasks approach, which uses Graph Neural Networks (GNNs) to encode the graph structure so that LLMs are relieved from understanding spatial positions; and Graph-to-Text Conversion, which translates graph structures into semantic text representations that LLMs can process.Despite their progress, these methods often struggle to fully preserve the topological information of graphs or require extensive computational resources, limiting their practical applicability. In this work, we introduce Node Tokenizer for Large Language Models (NT-LLM), a novel framework that efficiently encodes graph structures by selecting key nodes as anchors and representing each node based on its relative distance to these anchors.This position-anchored encoding effectively captures the graph topology, enabling enhanced reasoning capabilities in LLMs over graph data.Additionally, we implement a task-specific tuning procedure to further improve structural understanding within LLMs.Through extensive empirical evaluations, NT-LLM demonstrates significant performance improvements across a variety of graph-related tasks. |
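The anchor-based encoding described above can be sketched as follows: each node is represented by its shortest-path distances to a fixed set of anchor nodes. This is an illustration under assumptions, not the NT-LLM implementation: the anchors are passed in by hand (the abstract does not say how key nodes are selected), distances are unweighted hop counts, and the function names and the `-1` marker for unreachable nodes are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def anchor_encoding(adj, anchors, unreachable=-1):
    """Map each node to its list of distances to the anchors, in anchor order."""
    per_anchor = [bfs_distances(adj, a) for a in anchors]
    return {
        node: [d.get(node, unreachable) for d in per_anchor]
        for node in adj
    }


# Path graph 0-1-2-3 plus an isolated node 4; anchors chosen by hand.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
for node, code in anchor_encoding(graph, anchors=[0, 3]).items():
    print(node, code)
```

Two nodes receive different codes whenever some anchor sits at a different distance from them, which is why a small anchor set can already carry a good deal of positional information.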
2024-10-10 |
Heterogeneous Graph Auto-Encoder for CreditCard Fraud Detection
The digital revolution has significantly impacted financial transactions, leading to a notable increase in credit card usage. However, this convenience comes with a trade-off: a substantial rise in fraudulent activities. Traditional machine learning methods for fraud detection often struggle to capture the inherent interconnectedness within financial data. 0.512 This paper proposes a novel approach for credit card fraud detection that leverages Graph Neural Networks (GNNs) with attention mechanisms applied to heterogeneous graph representations of financial data. Unlike homogeneous graphs, heterogeneous graphs capture intricate relationships between various entities in the financial ecosystem, such as cardholders, merchants, and transactions, providing a richer and more comprehensive data representation for fraud analysis. To address the inherent class imbalance in fraud data, where genuine transactions significantly outnumber fraudulent ones, the proposed approach integrates an autoencoder. This autoencoder, trained on genuine transactions, learns a latent representation and flags deviations during reconstruction as potential fraud. This research investigates two key questions: (1) How effectively can a GNN with an attention mechanism detect and prevent credit card fraud when applied to a heterogeneous graph? (2) How does the efficacy of the autoencoder with attention approach compare to traditional methods? The results are promising, demonstrating that the proposed model outperforms benchmark algorithms such as GraphSAGE and FI-GRL, achieving a superior AUC-PR of 0.89 and an F1-score of 0.81. This research significantly advances fraud detection systems and the overall security of financial transactions by leveraging GNNs with attention mechanisms and addressing class imbalance through an autoencoder. |
Explainability Research |
|
2024-10-15 |
Adversarially Guided Stateful Defense Against Backdoor Attacks in Federated Deep Learning
Recent works have shown that Federated Learning (FL) is vulnerable to backdoor attacks. Existing defenses cluster submitted updates from clients and select the best cluster for aggregation. However, they often rely on unrealistic assumptions regarding client submissions and the sampled client population while choosing the best cluster. We show that in realistic FL settings, state-of-the-art (SOTA) defenses struggle to perform well against backdoor attacks in FL. To address this, we highlight that backdoored submissions are adversarially biased and overconfident compared to clean submissions. We, therefore, propose an Adversarially Guided Stateful Defense (AGSD) against backdoor attacks on Deep Neural Networks (DNNs) in FL scenarios. 0.503 AGSD employs adversarial perturbations to a small held-out dataset to compute a novel metric, called the trust index, that guides the cluster selection without relying on any unrealistic assumptions regarding client submissions. Moreover, AGSD maintains a trust state history of each client that adaptively penalizes backdoored clients and rewards clean clients. In realistic FL settings, where SOTA defenses mostly fail to resist attacks, AGSD mostly outperforms all SOTA defenses with minimal drop in clean accuracy (5% in the worst-case compared to best accuracy) even when (a) given a very small held-out dataset -- typically AGSD assumes 50 samples (<= 0.1% of the training data) and (b) no held-out dataset is available, and out-of-distribution data is used instead. For reproducibility, our code will be openly available at: https://github.com/hassanalikhatim/AGSD. |
2024-10-15 |
Poisson-Dirac Neural Networks for Modeling Coupled Dynamical Systems across Domains
Deep learning has achieved great success in modeling dynamical systems, providing data-driven simulators to predict complex phenomena, even without known governing equations. 0.596However, existing models have two major limitations: their narrow focus on mechanical systems and their tendency to treat systems as monolithic.These limitations reduce their applicability to dynamical systems in other domains, such as electrical and hydraulic systems, and to coupled systems.To address these limitations, we propose Poisson-Dirac Neural Networks (PoDiNNs), a novel framework based on the Dirac structure that unifies the port-Hamiltonian and Poisson formulations from geometric mechanics.This framework enables a unified representation of various dynamical systems across multiple domains as well as their interactions and degeneracies arising from couplings.Our experiments demonstrate that PoDiNNs offer improved accuracy and interpretability in modeling unknown coupled dynamical systems from data. 0.508 |
2024-10-15 |
Towards a Healthy AI Tradition: Lessons from Biology and Biomedical Science
AI is a magnificent field that directly and profoundly touches on numerous disciplines ranging from philosophy, computer science, engineering, mathematics, decision and data science and economics, to cognitive science, neuroscience and more. 0.552 The number of applications and impact of AI is second to none and the potential of AI to broadly impact future science developments is particularly thrilling. 0.527 While attempts to understand knowledge, reasoning, cognition and learning go back centuries, AI remains a relatively new field. In part because it has so many wide-ranging overlaps with other disparate fields, it appears to have trouble developing a robust identity and culture. Here we suggest that contrasting the fast-moving AI culture with the biological and biomedical sciences is both an insightful and a useful way to inaugurate a healthy tradition needed to envision and manage our ascent to AGI and beyond (independent of the AI platforms used). The co-evolution of AI and Biomedical Science offers many benefits to both fields. In a previous perspective, we suggested that biomedical laboratories or centers can usefully embrace logistic traditions in AI labs that will allow them to be highly collaborative, improve the reproducibility of research, reduce risk aversion and produce faster mentorship pathways for PhDs and fellows. This perspective focuses on the benefits to AI of adapting features of biomedical science at higher, primarily cultural levels. |
2024-10-14 |
LG-CAV: Train Any Concept Activation Vector with Language Guidance
Concept activation vector (CAV) has attracted broad research interest in explainable AI, by elegantly attributing model predictions to specific concepts. 0.55 However, the training of CAV often necessitates a large number of high-quality images, which are expensive to curate and thus limited to a predefined set of concepts. To address this issue, we propose Language-Guided CAV (LG-CAV) to harness the abundant concept knowledge within certain pre-trained vision-language models (e.g., CLIP). This method allows training any CAV without labeled data, by utilizing the corresponding concept descriptions as guidance. To bridge the gap between the vision-language model and the target model, we calculate the activation values of concept descriptions on a common pool of images (probe images) with the vision-language model and utilize them as language guidance to train the LG-CAV. Furthermore, after training high-quality LG-CAVs related to all the predicted classes in the target model, we propose activation sample reweighting (ASR), serving as a model correction technique, to improve the performance of the target model in return. Experiments on four datasets across nine architectures demonstrate that LG-CAV achieves significantly superior quality to previous CAV methods given any concept, and our model correction method achieves state-of-the-art performance compared to existing concept-based methods. Our code is available at https://github.com/hqhQAQ/LG-CAV. |
2024-10-14 |
Towards Reliable Verification of Unauthorized Data Usage in Personalized Text-to-Image Diffusion Models
Text-to-image diffusion models are pushing the boundaries of what generative AI can achieve in our lives. Beyond their ability to generate general images, new personalization techniques have been proposed to customize the pre-trained base models for crafting images with specific themes or styles. Such a lightweight solution, enabling AI practitioners and developers to easily build their own personalized models, also poses a new concern regarding whether the personalized models are trained from unauthorized data. 0.523 A promising solution is to proactively enable data traceability in generative models, where data owners embed external coatings (e.g., image watermarks or backdoor triggers) onto the datasets before releasing them. Later, the models trained over such datasets will also learn the coatings and unconsciously reproduce them in the generated mimicries, which can be extracted and used as evidence of data usage. However, we find that existing coatings cannot be effectively learned in personalization tasks, making the corresponding verification less reliable. In this paper, we introduce SIREN, a novel methodology to proactively trace unauthorized data usage in black-box personalized text-to-image diffusion models. Our approach optimizes the coating in a delicate way to be recognized by the model as a feature relevant to the personalization task, thus significantly improving its learnability. We also utilize a human perceptual-aware constraint, a hypersphere classification technique, and a hypothesis-testing-guided verification method to enhance the stealthiness and detection accuracy of the coating. The effectiveness of SIREN is verified through extensive experiments on a diverse set of benchmark datasets, models, and learning algorithms. SIREN is also effective in various real-world scenarios and evaluated against potential countermeasures. Our code is publicly available. |
2024-10-14 |
Transparent Networks for Multivariate Time Series
Transparent models, which are machine learning models that produce inherently interpretable predictions, are receiving significant attention in high-stakes domains. 0.549However, despite much real-world data being collected as time series, there is a lack of studies on transparent time series models.To address this gap, we propose a novel transparent neural network model for time series called Generalized Additive Time Series Model (GATSM).GATSM consists of two parts: 1) independent feature networks to learn feature representations, and 2) a transparent temporal module to learn temporal patterns across different time steps using the feature representations.This structure allows GATSM to effectively capture temporal patterns and handle dynamic-length time series while preserving transparency.Empirical experiments show that GATSM significantly outperforms existing generalized additive models and achieves comparable performance to black-box time series models, such as recurrent neural networks and Transformer.In addition, we demonstrate that GATSM finds interesting patterns in time series.The source code is available at https://github.com/gim4855744/GATSM. |
2024-10-10 |
Explainability of Deep Neural Networks for Brain Tumor Detection
Medical image classification is crucial for supporting healthcare professionals in decision-making and training.While Convolutional Neural Networks (CNNs) have traditionally dominated this field, Transformer-based models are gaining attention.In this study, we apply explainable AI (XAI) techniques to assess the performance of various models on real-world medical data and identify areas for improvement. 0.553We compare CNN models such as VGG-16, ResNet-50, and EfficientNetV2L with a Transformer model: ViT-Base-16.Our results show that data augmentation has little impact, but hyperparameter tuning and advanced modeling improve performance.CNNs, particularly VGG-16 and ResNet-50, outperform ViT-Base-16 and EfficientNetV2L, likely due to underfitting from limited data.XAI methods like LIME and SHAP further reveal that better-performing models visualize tumors more effectively.These findings suggest that CNNs with shallower architectures are more effective for small datasets and can support medical decision-making. |
2024-10-10 |
Provable Privacy Attacks on Trained Shallow Neural Networks
We study what provable privacy attacks can be shown on trained, 2-layer ReLU neural networks. 0.504 We explore two types of attacks: data reconstruction attacks and membership inference attacks. We prove that theoretical results on the implicit bias of 2-layer neural networks can be used to provably reconstruct a set of which at least a constant fraction are training points in a univariate setting, and can also be used to identify with high probability whether a given point was used in the training set in a high dimensional setting. To the best of our knowledge, our work is the first to show provable vulnerabilities in this setting. |
2024-10-10 |
Mechanistic Permutability: Match Features Across Layers
Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition.While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem.In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network.Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales.Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality.We also show that features persist over several layers and that our approach can approximate hidden states across layers.Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies. 0.535 |
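The matching step described above, pairing features across two layers so that the total mean squared error between their parameter vectors is minimal, can be illustrated on a toy scale. This sketch is not SAE Match itself: it omits the parameter folding, it searches all permutations by brute force (workable only for a handful of features, where a real pipeline would presumably use a linear assignment solver), and the function names and vectors are invented for illustration.

```python
from itertools import permutations


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def match_features(feats_a, feats_b):
    """Find the pairing of feats_a[i] with feats_b[perm[i]] of least total MSE.

    Both lists must hold the same number of feature vectors.
    Returns (perm, total_cost).
    """
    n = len(feats_a)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(mse(feats_a[i], feats_b[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


# Hypothetical decoder directions for three features in two adjacent layers;
# layer_b holds the same directions, slightly perturbed and shuffled.
layer_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_b = [[0.1, 0.9], [0.9, 1.1], [1.0, 0.1]]
perm, cost = match_features(layer_a, layer_b)
print(perm, cost)
```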
2024-10-10 |
Explaining Hypergraph Neural Networks: From Local Explanations to Global Concepts
Hypergraph neural networks are a class of powerful models that leverage the message passing paradigm to learn over hypergraphs, a generalization of graphs well-suited to describing relational data with higher-order interactions.However, such models are not naturally interpretable, and their explainability has received very limited attention.We introduce SHypX, the first model-agnostic post-hoc explainer for hypergraph neural networks that provides both local and global explanations. 0.545At the instance-level, it performs input attribution by discretely sampling explanation subhypergraphs optimized to be faithful and concise.At the model-level, it produces global explanation subhypergraphs using unsupervised concept extraction.Extensive experiments across four real-world and four novel, synthetic hypergraph datasets demonstrate that our method finds high-quality explanations which can target a user-specified balance between faithfulness and concision, improving over baselines by 25 percent points in fidelity on average. |
2024-10-09 |
Learning a Neural Solver for Parametric PDE to Enhance Physics-Informed Methods
Physics-informed deep learning often faces optimization challenges due to the complexity of solving partial differential equations (PDEs), which involve exploring large solution spaces, require numerous iterations, and can lead to unstable training. 0.677These challenges arise particularly from the ill-conditioning of the optimization problem, caused by the differential terms in the loss function.To address these issues, we propose learning a solver, i.e., solving PDEs using a physics-informed iterative algorithm trained on data. 0.54Our method learns to condition a gradient descent algorithm that automatically adapts to each PDE instance, significantly accelerating and stabilizing the optimization process and enabling faster convergence of physics-aware models. 0.52Furthermore, while traditional physics-informed methods solve for a single PDE instance, our approach addresses parametric PDEs.Specifically, our method integrates the physical loss gradient with the PDE parameters to solve over a distribution of PDE parameters, including coefficients, initial conditions, or boundary conditions.We demonstrate the effectiveness of our method through empirical experiments on multiple datasets, comparing training and test-time optimization performance. |
2024-10-07 |
From Transparency to Accountability and Back: A Discussion of Access and Evidence in AI Auditing
Artificial intelligence (AI) is increasingly intervening in our lives, raising widespread concern about its unintended and undeclared side effects. These developments have brought attention to the problem of AI auditing: the systematic evaluation and analysis of an AI system, its development, and its behavior relative to a set of predetermined criteria. 0.507 Auditing can take many forms, including pre-deployment risk assessments, ongoing monitoring, and compliance testing. It plays a critical role in providing assurances to various AI stakeholders, from developers to end users. 0.564 Audits may, for instance, be used to verify that an algorithm complies with the law, is consistent with industry standards, and meets the developer's claimed specifications. However, there are many operational challenges to AI auditing that complicate its implementation. In this work, we examine a key operational issue in AI auditing: what type of access to an AI system is needed to perform a meaningful audit? Addressing this question has direct policy relevance, as it can inform AI audit guidelines and requirements. We begin by discussing the factors that auditors balance when determining the appropriate type of access, and unpack the benefits and drawbacks of four types of access. We conclude that, at minimum, black-box access -- providing query access to a model without exposing its internal implementation -- should be granted to auditors, as it balances concerns related to trade secrets, data privacy, audit standardization, and audit efficiency. We then suggest a framework for determining how much further access (in addition to black-box access) to grant auditors. We show that auditing can be cast as a natural hypothesis test, draw parallels between hypothesis testing and legal procedure, and argue that this framing provides clear and interpretable guidance on audit implementation. |
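To make the hypothesis-testing framing above concrete, here is one way a black-box audit could be phrased as a one-sided exact binomial test. This is a sketch under stated assumptions rather than the paper's framework: the null hypothesis is that the system's true failure rate is at most the developer's claimed rate, queries are treated as independent, and the function names, the claimed rate, and the significance level `alpha` are all hypothetical.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def audit(n_queries, n_failures, claimed_rate, alpha=0.05):
    """Test H0: true failure rate <= claimed_rate, using black-box queries.

    Returns (p_value, reject), where reject=True means the observed
    failures are too numerous to be consistent with the claim at level alpha.
    """
    p_value = binomial_tail(n_queries, n_failures, claimed_rate)
    return p_value, p_value < alpha


# Hypothetical audit: the developer claims at most a 5% failure rate.
print(audit(n_queries=100, n_failures=5, claimed_rate=0.05))
print(audit(n_queries=100, n_failures=15, claimed_rate=0.05))
```

The parallel with legal procedure is visible in the structure: the claim stands (the null is retained) unless the evidence gathered through queries is strong enough to reject it.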
2024-10-07 |
CAT: Concept-level backdoor ATtacks for Concept Bottleneck Models
Despite the transformative impact of deep learning across multiple domains, the inherent opacity of these models has driven the development of Explainable Artificial Intelligence (XAI). 0.633 Among these efforts, Concept Bottleneck Models (CBMs) have emerged as a key approach to improve interpretability by leveraging high-level semantic information. However, CBMs, like other machine learning models, are susceptible to security threats, particularly backdoor attacks, which can covertly manipulate model behaviors. Noting that the community has not yet studied concept-level backdoor attacks on CBMs, and following the adage "Better the devil you know than the devil you don't know", we introduce CAT (Concept-level Backdoor ATtacks), a methodology that leverages the conceptual representations within CBMs to embed triggers during training, enabling controlled manipulation of model predictions at inference time. An enhanced attack pattern, CAT+, incorporates a correlation function to systematically select the most effective and stealthy concept triggers, thereby optimizing the attack's impact. Our comprehensive evaluation framework assesses both the attack success rate and stealthiness, demonstrating that CAT and CAT+ maintain high performance on clean data while achieving significant targeted effects on backdoored datasets. This work underscores the potential security risks associated with CBMs and provides a robust testing methodology for future security assessments. |
2024-10-07 |
D-PoSE: Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation
We present D-PoSE (Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation), a one-stage method that estimates human pose and SMPL-X shape parameters from a single RGB image. Recent works use larger models with transformer backbones and decoders to improve the accuracy in human pose and shape (HPS) benchmarks. D-PoSE proposes a vision-based approach that uses the estimated human depth-maps as an intermediate representation for HPS and leverages training with synthetic data and the ground-truth depth-maps provided with them for depth supervision during training. Although trained on synthetic datasets, D-PoSE achieves state-of-the-art performance on the real-world benchmark datasets, EMDB and 3DPW. Despite its simple lightweight design and the CNN backbone, it outperforms ViT-based models that have a number of parameters that is larger by almost an order of magnitude. 0.5 D-PoSE code is available at: https://github.com/nvasilik/D-PoSE |
2024-10-07 |
AI-Enhanced Ethical Hacking: A Linux-Focused Experiment
This technical report investigates the integration of generative AI (GenAI), specifically ChatGPT, into the practice of ethical hacking through a comprehensive experimental study and conceptual analysis. 0.517Conducted in a controlled virtual environment, the study evaluates GenAI's effectiveness across the key stages of penetration testing on Linux-based target machines operating within a virtual local area network (LAN), including reconnaissance, scanning and enumeration, gaining access, maintaining access, and covering tracks.The findings confirm that GenAI can significantly enhance and streamline the ethical hacking process while underscoring the importance of balanced human-AI collaboration rather than the complete replacement of human input.The report also critically examines potential risks such as misuse, data biases, hallucination, and over-reliance on AI.This research contributes to the ongoing discussion on the ethical use of AI in cybersecurity and highlights the need for continued innovation to strengthen security defences. |
2024-10-06 |
Watermarking Decision Tree Ensembles
Protecting the intellectual property of machine learning models is a hot topic and many watermarking schemes for deep neural networks have been proposed in the literature. 0.545Unfortunately, prior work largely neglected the investigation of watermarking techniques for other types of models, including decision tree ensembles, which are a state-of-the-art model for classification tasks on non-perceptual data.In this paper, we present the first watermarking scheme designed for decision tree ensembles, focusing in particular on random forest models.We discuss watermark creation and verification, presenting a thorough security analysis with respect to possible attacks.We finally perform an experimental evaluation of the proposed scheme, showing excellent results in terms of accuracy and security against the most relevant threats. |
2024-10-03 |
Unveiling AI's Blind Spots: An Oracle for In-Domain, Out-of-Domain, and Adversarial Errors
AI models make mistakes when recognizing images, whether in-domain, out-of-domain, or adversarial. Predicting these errors is critical for improving system reliability, reducing costly mistakes, and enabling proactive corrections in real-world applications such as healthcare, finance, and autonomous systems. However, understanding what mistakes AI models make, why they occur, and how to predict them remains an open challenge. 0.595 Here, we conduct comprehensive empirical evaluations using a "mentor" model, a deep neural network designed to predict another model's errors. Our findings show that the mentor model excels at learning from a mentee's mistakes on adversarial images with small perturbations and generalizes effectively to predict in-domain and out-of-domain errors of the mentee. Additionally, transformer-based mentor models excel at predicting errors across various mentee architectures. Subsequently, we draw insights from these observations and develop an "oracle" mentor model, dubbed SuperMentor, that achieves 78% accuracy in predicting errors across different error types. Our error prediction framework paves the way for future research on anticipating and correcting AI model behaviours, ultimately increasing trust in AI systems. 0.576 All code, models, and data will be made publicly available. |
2024-10-03 |
Lie Algebra Canonicalization: Equivariant Neural Operators under arbitrary Lie Groups
The quest for robust and generalizable machine learning models has driven recent interest in exploiting symmetries through equivariant neural networks. In the context of PDE solvers, recent works have shown that Lie point symmetries can be a useful inductive bias for Physics-Informed Neural Networks (PINNs) through data and loss augmentation. 0.573 Despite this, directly enforcing equivariance within the model architecture for these problems remains elusive. This is because many PDEs admit non-compact symmetry groups, oftentimes not studied beyond their infinitesimal generators, making them incompatible with most existing equivariant architectures. In this work, we propose Lie aLgebrA Canonicalization (LieLAC), a novel approach that exploits only the action of infinitesimal generators of the symmetry group, circumventing the need for knowledge of the full group structure. To achieve this, we address existing theoretical issues in the canonicalization literature, establishing connections with frame averaging in the case of continuous non-compact groups. Operating within the framework of canonicalization, LieLAC can easily be integrated with unconstrained pre-trained models, transforming inputs to a canonical form before feeding them into the existing model, effectively aligning the input for model inference according to allowed symmetries. LieLAC utilizes standard Lie group descent schemes, achieving equivariance in pre-trained models. Finally, we showcase LieLAC's efficacy on tasks of invariant image classification and Lie point symmetry equivariant neural PDE solvers using pre-trained models. |
Ethics Research |
|
2024-10-16 |
When researchers pay to publish: Results from a survey on APCs in four countries
This paper provides an empirical overview of the impact and practices of paying Article Processing Charges (APCs) by four nationally categorized groups of researchers in Argentina, Brazil, Mexico, and South Africa. The data was collected from 13,577 researchers through an online questionnaire. The analysis compares the practice of publishing in journals that charge APCs across different dimensions, including country, discipline, gender, and age of the researchers. The paper also focuses on the maximum APC amount paid and the methods and strategies researchers use to cover APC payments, such as waivers, research project funds, payment by coauthors, and the option to publish in closed access, where possible. Different tendencies were identified among the different disciplines and the national systems examined. Findings show that Argentine researchers apply for waivers most frequently and often use personal funds or international coauthors for APCs, with younger researchers less involved in APC payments. In contrast, Brazil, South Africa, and Mexico have more older researchers, yet younger researchers still publish more in APC journals. South African researchers lead in APC publications, likely due to better funding access and read-and-publish agreements. This study lays the groundwork for further analysis of gender asymmetries, funding access, and views on the commercial Open Access model of scientific dissemination. 0.518 |
2024-10-16 |
On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation
Hallucination has been a popular topic in natural language generation (NLG). In real-world applications, unfaithful content can result in bad data quality or loss of trust from end users. 0.551 Thus, it is crucial to fact-check before adopting NLG for production usage, which can be expensive if done manually. In this paper, we investigate automated faithfulness evaluation in guided NLG. We developed a rubric template and used large language models (LLMs) to score the generation on quantifiable scales. We compared popular LLMs as well as the widely adopted natural language inference (NLI) models in scoring quality and sensitivity. In addition, we developed methods to generate synthetic unfaithful data, as well as a heuristic to quantify the percentage of hallucination. Our results on four travel-domain industry datasets show that GPT-4 can provide accurate judgement and explanation on whether a source and a generation are factually consistent. Furthermore, we found that tuning NLI models on synthetic data can improve performance. Lastly, we present insights on latency and cost for deploying such a system. |
2024-10-16 |
AI-Aided Kalman Filters
The Kalman filter (KF) and its variants are among the most celebrated algorithms in signal processing. These methods are used for state estimation of dynamic systems by relying on mathematical representations in the form of simple state-space (SS) models, which may be crude and inaccurate descriptions of the underlying dynamics. Emerging data-centric artificial intelligence (AI) techniques tackle these tasks using deep neural networks (DNNs), which are model-agnostic. Recent developments illustrate the possibility of fusing DNNs with classic Kalman-type filtering, obtaining systems that learn to track in partially known dynamics. This article provides a tutorial-style overview of design approaches for incorporating AI in aiding KF-type algorithms. 0.514 We review both generic and dedicated DNN architectures suitable for state estimation, and provide a systematic presentation of techniques for fusing AI tools with KFs and for leveraging partial SS modeling and data, categorizing design approaches into task-oriented and SS model-oriented. The usefulness of each approach in preserving the individual strengths of model-based KFs and data-driven DNNs is investigated in a qualitative and quantitative study, whose code is publicly available, illustrating the gains of hybrid model-based/data-driven designs. We also discuss existing challenges and future research directions that arise from fusing AI and Kalman-type algorithms. 0.583 |
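For readers unfamiliar with the model-based baseline that the abstract above refers to, the following is a minimal sketch of one predict/update cycle of a scalar Kalman filter. It is not taken from the article: the function name `kalman_step`, the default coefficients, and the measurement values are illustrative assumptions, and the state-space parameters are assumed to be known exactly.

```python
def kalman_step(x, p, z, f=1.0, h=1.0, q=0.01, r=0.1):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: previous state estimate and its variance
    z:    new measurement
    f, h: state-transition and observation coefficients (assumed known)
    q, r: process and measurement noise variances (assumed known)
    """
    # Predict: propagate the estimate through the state-space model.
    x_pred = f * x
    p_pred = f * p * f + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + gain * (z - h * x_pred)
    p_new = (1.0 - gain * h) * p_pred
    return x_new, p_new, gain


# Track a roughly constant signal from noisy readings (made-up values).
x, p = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 1.0]:
    x, p, _ = kalman_step(x, p, z)
```

A DNN-aided variant could, for instance, replace the analytically computed gain with a learned one when `f`, `h`, `q`, or `r` are only partially known; that is an illustration of the general idea, not a summary of the article's taxonomy.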
2024-10-16 |
ARIC: An Activity Recognition Dataset in Classroom Surveillance Images
The application of activity recognition in the "AI + Education" field is gaining increasing attention. 0.512 However, current work mainly focuses on the recognition of activities in manually captured videos and a limited number of activity types, with little attention given to recognizing activities in surveillance images from real classrooms. Activity recognition in classroom surveillance images faces multiple challenges, such as class imbalance and high activity similarity. To address this gap, we constructed a novel multimodal dataset focused on classroom surveillance image activity recognition called ARIC (Activity Recognition In Classroom). The ARIC dataset offers multiple perspectives, 32 activity categories, three modalities, and real-world classroom scenarios. In addition to the general activity recognition tasks, we also provide settings for continual learning and few-shot continual learning. We hope that the ARIC dataset can act as a facilitator for future analysis and research for open teaching scenarios. You can download preliminary data from https://ivipclab.github.io/publication_ARIC/ARIC. |
2024-10-16 |
Yama: Precise Opcode-based Data Flow Analysis for Detecting PHP Applications Vulnerabilities
Web applications encompass various aspects of daily life, including online shopping, e-learning, and internet banking. A single vulnerability can cause severe societal and economic damage. 0.509 Due to its ease of use, PHP has become the preferred server-side programming language for web applications, making PHP applications a primary target for attackers. Data flow analysis is widely used for vulnerability detection before deploying web applications because of its efficiency. However, the high complexity of the PHP language makes it difficult to achieve precise data flow analysis. In this paper, we present Yama, a context-sensitive and path-sensitive interprocedural data flow analysis method for PHP, designed to detect taint-style vulnerabilities in PHP applications. We have found that the precise semantics and clear control flow of PHP opcodes enable data flow analysis to be more precise and efficient. Leveraging this observation, we established parsing rules for PHP opcodes and implemented a precise understanding of PHP program semantics in Yama. We evaluated Yama along three dimensions: basic data flow analysis capabilities, complex semantic analysis capabilities, and the ability to discover vulnerabilities in real-world applications, demonstrating Yama's advancement in vulnerability detection. Specifically, Yama possesses context-sensitive and path-sensitive interprocedural analysis capabilities, achieving a 99.1% true positive rate in complex semantic analysis experiments related to type inference, dynamic features, and built-in functions. It discovered and reported 38 zero-day vulnerabilities across 24 projects on GitHub with over 1,000 stars each, for which 34 new CVE IDs were assigned. We have released the source code of the prototype implementation and the parsing rules for PHP opcodes to facilitate future research. |
2024-10-16 |
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. 0.517 We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and closed-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration. 0.551 |
2024-10-16 |
A Fast Convoluted Story: Scaling Probabilistic Inference for Integer Arithmetic
As illustrated by the success of integer linear programming, linear integer arithmetic is a powerful tool for modelling combinatorial problems. Furthermore, the probabilistic extension of linear programming has been used to formulate problems in neurosymbolic AI. However, two key problems persist that prevent the adoption of neurosymbolic techniques beyond toy problems. 0.501 First, probabilistic inference is inherently hard, #P-hard to be precise. Second, the discrete nature of integers renders the construction of meaningful gradients challenging, which is problematic for learning. In order to mitigate these issues, we formulate linear arithmetic over integer-valued random variables as tensor manipulations that can be implemented in a straightforward fashion using modern deep learning libraries. At the core of our formulation lies the observation that the addition of two integer-valued random variables can be performed by adapting the fast Fourier transform to probabilities in the log-domain. By relying on tensor operations we obtain a differentiable data structure, which unlocks, virtually for free, gradient-based learning. In our experimental validation we show that tensorising probabilistic linear integer arithmetic and leveraging the fast Fourier transform allows us to push the state of the art by several orders of magnitude in terms of inference and learning times. |
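The core observation in the abstract above rests on a standard fact: the probability mass function of the sum of two independent integer-valued random variables is the convolution of their individual mass functions. The sketch below shows that convolution directly, in plain Python. It is deliberately not the paper's method: it uses a quadratic-time loop instead of a fast Fourier transform and works with ordinary probabilities rather than in the log-domain. The function name `add_pmfs` and the dice example are illustrative assumptions.

```python
def add_pmfs(p, q):
    """PMF of X + Y for independent X and Y.

    p[i] is P(X = i) and q[j] is P(Y = j), both supported on 0..len-1.
    The result r satisfies r[k] = sum over i + j = k of p[i] * q[j],
    which is a discrete convolution.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


# A fair six-sided die: index = face value, so index 0 has probability 0.
die = [0.0] + [1.0 / 6.0] * 6
two_dice = add_pmfs(die, die)  # distribution of the sum of two dice
```

An FFT computes the same convolution in quasi-linear time by multiplying the transforms pointwise, which is why the transform is a natural fit for scaling this operation.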
2024-10-16 |
FairGLVQ: Fairness in Partition-Based Classification
Fairness is an important objective throughout society. 0.529 It spans the distribution of limited goods such as education, hiring and payment, as well as taxes, legislation, and jurisprudence. Due to the increasing importance of machine learning approaches in all areas of daily life including those related to health, security, and equity, an increasing amount of research focuses on fair machine learning. In this work, we focus on the fairness of partition- and prototype-based models. The contribution of this work is twofold: 1) we develop a general framework for fair machine learning of partition-based models that does not depend on a specific fairness definition, and 2) we derive a fair version of learning vector quantization (LVQ) as a specific instantiation. We compare the resulting algorithm against other algorithms from the literature on theoretical and real-world data, showing its practical relevance. |
2024-10-16 |
Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation
Machine learning (ML) exhibits promise in the clinical domain. However, it is constrained by data scarcity and ethical considerations, as the generation of clinical trials presents significant challenges due to stringent privacy regulations, high costs, and the extended duration required for conducting studies with human participants. 0.512 Despite the advancements of large language models (LLMs) in general generation tasks, their potential in facilitating the generation of synthetic clinical trials is under-explored. To address this gap, we introduce a novel Retrieval-Reasoning few-shot framework that leverages LLMs to generate artificial yet realistic and diverse clinical trials with binary success/failure labels. Experiments conducted on real clinical trials from the ClinicalTrials.gov database demonstrate that our synthetic data can effectively augment real datasets. Furthermore, by fine-tuning a pre-trained model as a binary classifier on synthetic clinical trial datasets, we demonstrate that this augmentation enhances model training for downstream tasks such as trial outcome prediction. Our findings suggest that LLMs for synthetic clinical trial generation hold promise for accelerating clinical research and upholding ethical standards for patient privacy. The code is publicly available at https://anonymous.4open.science/r/Retrieval_Reasoning_Clinical_Trial_Generation-3EC4. |
2024-10-16 |
Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce
Language is a symbolic capital that affects people's lives in many ways (Bourdieu, 1977, 1991). It is a powerful tool that accounts for identities, cultures, traditions, and societies in general. Hence, data in a given language should be viewed as more than a collection of tokens. Good data collection and labeling practices are key to building more human-centered and socially aware technologies. 0.516 While there has been a rising interest in mid- to low-resource languages within the NLP community, work in this space has to overcome unique challenges such as data scarcity and access to suitable annotators. In this paper, we collect feedback from those directly involved in and impacted by NLP artefacts for mid- to low-resource languages. We conduct a quantitative and qualitative analysis of the responses and highlight the main issues related to (1) data quality such as linguistic and cultural data suitability; and (2) the ethics of common annotation practices such as the misuse of online community services. Based on these findings, we make several recommendations for the creation of high-quality language artefacts that reflect the cultural milieu of their speakers, while simultaneously respecting the dignity and labor of data workers. |
2024-10-15 |
Improving Bias in Facial Attribute Classification: A Combined Impact of KL Divergence induced Loss Function and Dual Attention
Ensuring that AI-based facial recognition systems produce fair predictions and work equally well across all demographic groups is crucial. 0.564 Earlier systems often exhibited demographic bias, particularly in gender and racial classification, with lower accuracy for women and individuals with darker skin tones. To tackle this issue and promote fairness in facial recognition, researchers have introduced several bias-mitigation techniques for gender classification and related algorithms. However, many challenges remain, such as data diversity, balancing fairness with accuracy, disparity, and bias measurement. This paper presents a method using a dual attention mechanism with a pre-trained Inception-ResNet V1 model, enhanced by KL-divergence regularization and a cross-entropy loss function. This approach reduces bias while improving accuracy and computational efficiency through transfer learning. The experimental results show significant improvements in both fairness and classification accuracy, providing promising advances in addressing bias and enhancing the reliability of facial recognition systems. |
2024-10-15 |
Isambard-AI: a leadership class supercomputer optimised specifically for Artificial Intelligence
Isambard-AI is a new, leadership-class supercomputer, designed to support AI-related research. 0.504 Based on the HPE Cray EX4000 system, and housed in a new, energy-efficient Modular Data Centre in Bristol, UK, Isambard-AI employs 5,448 NVIDIA Grace-Hopper GPUs to deliver over 21 ExaFLOP/s of 8-bit floating point performance for LLM training, and over 250 PetaFLOP/s of 64-bit performance, for under 5MW. Isambard-AI integrates two all-flash storage systems: a 20 PiByte Cray ClusterStor and a 3.5 PiByte VAST solution. Combined, these give Isambard-AI flexibility for training, inference and secure data access and sharing. 0.533 But it is the software stack where Isambard-AI will be most different from traditional HPC systems. Isambard-AI is designed to support users who may have been using GPUs in the cloud, and so access will more typically be via Jupyter notebooks, MLOps, or other web-based, interactive interfaces, rather than the approach used on traditional supercomputers of sshing into a system before submitting jobs to a batch scheduler. Its stack is designed to be quickly and regularly upgraded to keep pace with the rapid evolution of AI software, with full support for containers. Phase 1 of Isambard-AI is due online in May/June 2024, with the full system expected in production by the end of the year. |
2024-10-15 |
Self-Supervised Learning For Robust Robotic Grasping In Dynamic Environment
Dynamic environments pose challenges for robotic grasping, including unpredictable object motion and interference with the grasp. 0.519 In such conditions, traditional supervised and reinforcement learning approaches are ill suited because they rely on large amounts of labelled data and a predefined reward signal. In this paper, we introduce a self-supervised learning (SSL) framework that uses RGBD sensor data and proprioceptive data from robot hands to allow robots to learn and improve their grasping strategies in real time. The invariant SSL framework overcomes the deficiencies of fixed labelling by adapting to changes in object behavior and improving performance in dynamic situations. The proposed method was tested in various simulations and real-world trials, improving grasp success rates by 15% over existing methods, especially under dynamic scenarios. Tests of adaptation time also confirmed that the system adapts faster, making it applicable to real-world uses such as industrial automation and service robotics. In future work, the proposed approach will be extended to more complex tasks, such as multi-object manipulation and operation in cluttered environments, in order to cover a broader range of robotic tasks. |
2024-10-15 |
Diffusion-Based Offline RL for Improved Decision-Making in Augmented ARC Task
Effective long-term strategies enable AI systems to navigate complex environments by making sequential decisions over extended horizons. Similarly, reinforcement learning (RL) agents optimize decisions across sequences to maximize rewards, even without immediate feedback. To verify that Latent Diffusion-Constrained Q-learning (LDCQ), a prominent diffusion-based offline RL method, demonstrates strong reasoning abilities in multi-step decision-making, we aimed to evaluate its performance on the Abstraction and Reasoning Corpus (ARC). However, applying offline RL methodologies to enhance strategic reasoning in AI for solving tasks in ARC is challenging due to the lack of sufficient experience data in the ARC training set. To address this limitation, we introduce an augmented offline RL dataset for ARC, called Synthesized Offline Learning Data for Abstraction and Reasoning (SOLAR), along with the SOLAR-Generator, which generates diverse trajectory data based on predefined rules. SOLAR enables the application of offline RL methods by offering sufficient experience data. We synthesized SOLAR for a simple task and used it to train an agent with the LDCQ method. Our experiments demonstrate the effectiveness of the offline RL approach on a simple ARC task, showing the agent's ability to make multi-step sequential decisions and correctly identify answer states. These results highlight the potential of the offline RL approach to enhance AI's strategic reasoning capabilities. 0.54 |
2024-10-15 |
PhysioFormer: Integrating Multimodal Physiological Signals and Symbolic Regression for Explainable Affective State Prediction
Most affective computing tasks still rely heavily on traditional methods, with few deep learning models applied, particularly in multimodal signal processing. Given the importance of stress monitoring for mental health, developing a highly reliable and accurate affective computing model is essential. In this context, we propose a novel model, PhysioFormer, for affective state prediction using physiological signals. The PhysioFormer model integrates individual attributes and multimodal physiological data to address interindividual variability, enhancing its reliability and generalization across different individuals. By incorporating feature embedding and affective representation modules, the PhysioFormer model captures dynamic changes in time-series data and multimodal signal features, significantly improving accuracy. The model also includes an explainability model that uses symbolic regression to extract laws linking physiological signals to affective states, increasing transparency and explainability. 0.516 Experiments conducted on the Wrist and Chest subsets of the WESAD dataset confirmed the model's superior performance, achieving over 99% accuracy and outperforming existing SOTA models. Sensitivity and ablation experiments further demonstrated PhysioFormer's reliability, validating the contribution of its individual components. The integration of symbolic regression not only enhanced model explainability but also highlighted the complex relationships between physiological signals and affective states. Future work will focus on optimizing the model for larger datasets and real-time applications, particularly in more complex environments. Additionally, further exploration of physiological signals and environmental factors will help build a more comprehensive affective computing system, advancing its use in health monitoring and psychological intervention. |
2024-10-15 |
KLay: Accelerating Neurosymbolic AI
A popular approach to neurosymbolic AI involves mapping logic formulas to arithmetic circuits (computation graphs consisting of sums and products) and passing the outputs of a neural network through these circuits. 0.545 This approach enforces symbolic constraints onto a neural network in a principled and end-to-end differentiable way. Unfortunately, arithmetic circuits are challenging to run on modern AI accelerators as they exhibit a high degree of irregular sparsity. To address this limitation, we introduce knowledge layers (KLay), a new data structure to represent arithmetic circuits that can be efficiently parallelized on GPUs. Moreover, we contribute two algorithms used in the translation of traditional circuit representations to KLay and a further algorithm that exploits parallelization opportunities during circuit evaluations. We empirically show that KLay achieves speedups of multiple orders of magnitude over the state of the art, thereby paving the way towards scaling neurosymbolic AI to larger real-world applications. |
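To make the notion of an arithmetic circuit in the abstract above concrete, here is a minimal sketch that evaluates a small sum-product computation graph node by node. It is not KLay's layered data structure: the circuit encoding, the function name `evaluate`, and the example probabilities are all illustrative assumptions. The example circuit corresponds to the formula (a AND b) OR (NOT a AND c), whose probability is P(a)P(b) + (1 - P(a))P(c) when a, b, and c are independent.

```python
def evaluate(circuit, inputs):
    """Evaluate a sum-product circuit given in topological order.

    inputs:  leaf values (for example, probabilities from a neural network)
    circuit: list of (op, children) nodes; children index into the leaves
             followed by previously computed nodes
    Returns the value of the last node.
    """
    values = list(inputs)
    for op, children in circuit:
        child_values = [values[c] for c in children]
        if op == "sum":
            values.append(sum(child_values))
        elif op == "prod":
            result = 1.0
            for v in child_values:
                result *= v
            values.append(result)
        else:
            raise ValueError(f"unknown op: {op}")
    return values[-1]


# Leaves: P(a), P(not a), P(b), P(c) (made-up numbers).
leaves = [0.3, 0.7, 0.5, 0.2]
circuit = [
    ("prod", [0, 2]),  # node 4: P(a) * P(b)
    ("prod", [1, 3]),  # node 5: P(not a) * P(c)
    ("sum", [4, 5]),   # node 6: sum of the two disjoint cases
]
probability = evaluate(circuit, leaves)
```

Each node here gathers a different, irregular set of children, which hints at why such graphs map poorly onto dense accelerator kernels unless they are reorganized.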
2024-10-15 |
Advanced Persistent Threats (APT) Attribution Using Deep Reinforcement Learning
This paper investigates the application of Deep Reinforcement Learning (DRL) for attributing malware to specific Advanced Persistent Threat (APT) groups through detailed behavioural analysis. By analysing over 3500 malware samples from 12 distinct APT groups, the study utilises sophisticated tools like Cuckoo Sandbox to extract behavioural data, providing a deep insight into the operational patterns of malware. The research demonstrates that the DRL model significantly outperforms traditional machine learning approaches such as SGD, SVC, KNN, MLP, and Decision Tree Classifiers, achieving a test accuracy of 89.27%. It highlights the model's capability to adeptly manage complex, variable, and elusive malware attributes. Furthermore, the paper discusses the considerable computational resources and extensive data dependencies required for deploying these advanced AI models in cybersecurity frameworks. 0.621 Future research is directed towards enhancing the efficiency of DRL models, expanding the diversity of the datasets, addressing ethical concerns, and leveraging Large Language Models (LLMs) to refine reward mechanisms and optimise the DRL framework. By showcasing the transformative potential of DRL in malware attribution, this research advocates for a responsible and balanced approach to AI integration, with the goal of advancing cybersecurity through more adaptable, accurate, and robust systems. 0.587 |
2024-10-15 |
Transfer Learning with Foundational Models for Time Series Forecasting using Low-Rank Adaptations
High computational power and the availability of large datasets have supported the development of Foundational Models. They are an emerging technique widely used in Generative Artificial Intelligence, characterized by their scalability and their use in Transfer Learning. The enormous and heterogeneous amounts of data used in their initial training phase, known as pre-training, give them a higher generalization capacity than any other specific model, constituting a solid base that can be adapted or adjusted to a wide range of tasks, increasing their applicability. This study proposes LLIAM, the Llama Lora-Integrated Autorregresive Model. In a step known as the fine-tuning phase, Low-Rank Adaptations are used to enhance the knowledge of the model with diverse time series datasets. To illustrate the capabilities of our proposal, two sets of experiments have been carried out that obtained favorable and promising results with lower training times than other Deep Learning approaches. With this work, we also encourage the use of available resources (such as these pre-trained models) to avoid unnecessary and costly training, narrowing the gap between the goals of traditional Artificial Intelligence and those specified by the definition of Green Artificial Intelligence. 0.533 |
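The abstract above does not spell out the adaptation step. As a minimal sketch of the general low-rank adaptation idea, a frozen weight matrix W is adjusted by the product of two small matrices B and A, so that only B and A need to be trained. The helper names `matmul` and `lora_weight`, the `scale` parameter, and the toy matrices are illustrative assumptions, not LLIAM's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A), the effective weight after adaptation.

    w: frozen pre-trained weights, shape (d, k)
    b: trainable matrix, shape (d, r)
    a: trainable matrix, shape (r, k), with rank r much smaller than d and k
    """
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy example with d = k = 2 and rank r = 1 (made-up values).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[3.0, 4.0]]
adapted = lora_weight(w, b, a)
```

For a d-by-k weight matrix, the trainable parameter count drops from d·k to r·(d + k), which is one plausible source of the shorter training times the abstract reports.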
2024-10-15 |
A Data-Driven Aggressive Autonomous Racing Framework Utilizing Local Trajectory Planning with Velocity Prediction
The development of autonomous driving has boosted the research on autonomous racing. 0.506 However, existing local trajectory planning methods have difficulty planning trajectories with optimal velocity profiles at racetracks with sharp corners, thus weakening the performance of autonomous racing. To address this problem, we propose a local trajectory planning method that integrates Velocity Prediction based on Model Predictive Contour Control (VPMPCC). The optimal parameters of VPMPCC are learned through Bayesian Optimization (BO) based on a proposed novel Objective Function adapted to Racing (OFR). Specifically, VPMPCC achieves velocity prediction by encoding the racetrack as a reference velocity profile and incorporating it into the optimization problem. This method optimizes the velocity profile of local trajectories, especially at corners with significant curvature. The proposed OFR balances racing performance with vehicle safety, ensuring safe and efficient BO training. In the simulation, the number of training iterations for OFR-based BO is reduced by 42.86% compared to the state-of-the-art method. The optimal simulation-trained parameters are then applied to a real-world F1TENTH vehicle without retraining. During prolonged racing on a custom-built racetrack featuring significant sharp corners, the mean velocity of VPMPCC reaches 93.18% of the vehicle's handling limits. The released code is available at https://github.com/zhouhengli/VPMPCC. |
2024-10-15 |
Towards a Healthy AI Tradition: Lessons from Biology and Biomedical Science
AI is a magnificent field that directly and profoundly touches on numerous disciplines ranging from philosophy, computer science, engineering, mathematics, decision and data science and economics, to cognitive science, neuroscience and more. 0.754 The number of applications and the impact of AI are second to none, and the potential of AI to broadly impact future science developments is particularly thrilling. 0.693 While attempts to understand knowledge, reasoning, cognition and learning go back centuries, AI remains a relatively new field. 0.67 In part because it overlaps so widely with other disparate fields, it appears to have trouble developing a robust identity and culture. Here we suggest that contrasting the fast-moving AI culture with biological and biomedical sciences is both an insightful and useful way to inaugurate a healthy tradition needed to envision and manage our ascent to AGI and beyond (independent of the AI Platforms used). 0.621 The co-evolution of AI and Biomedical Science offers many benefits to both fields. 0.621 In a previous perspective, we suggested that biomedical laboratories or centers can usefully embrace logistic traditions in AI labs that will allow them to be highly collaborative, improve the reproducibility of research, reduce risk aversion and produce faster mentorship pathways for PhDs and fellows. 0.527 This perspective focuses on the benefits to AI of adapting features of biomedical science at higher, primarily cultural levels. 0.591 |
2024-10-15 |
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
Recent advancements in Multi-modal Large Language Models (MLLMs) have opened new avenues for applications in Embodied AI. Building on previous work, EgoThink, we introduce VidEgoThink, a comprehensive benchmark for evaluating egocentric video understanding capabilities. To bridge the gap between MLLMs and low-level control in Embodied AI, we design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling. To minimize manual annotation costs, we develop an automatic data generation pipeline based on the Ego4D dataset, leveraging the prior knowledge and multimodal capabilities of GPT-4o. Three human annotators then filter the generated data to ensure diversity and quality, resulting in the VidEgoThink benchmark. We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs. Experimental results indicate that all MLLMs, including GPT-4o, perform poorly across all tasks related to egocentric video understanding. These findings suggest that foundation models still require significant advancements to be effectively applied to first-person scenarios in Embodied AI. 0.581 In conclusion, VidEgoThink reflects a research trend towards employing MLLMs for egocentric vision, akin to human capabilities, enabling active observation and interaction in complex real-world environments. 0.52 |
2024-10-15 |
Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models
Vision-language pre-training (VLP) models, trained on large-scale image-text pairs, have become widely used across a variety of downstream vision-and-language (V+L) tasks. This widespread adoption raises concerns about their vulnerability to adversarial attacks. 0.502 Non-universal adversarial attacks, while effective, are often impractical for real-time online applications due to their high computational demands per data instance. Recently, universal adversarial perturbations (UAPs) have been introduced as a solution, but existing generator-based UAP methods are significantly time-consuming. To overcome the limitation, we propose a direct optimization-based UAP approach, termed DO-UAP, which significantly reduces resource consumption while maintaining high attack performance. Specifically, we explore the necessity of multimodal loss design and introduce a useful data augmentation strategy. Extensive experiments conducted on three benchmark VLP datasets, six popular VLP models, and three classical downstream tasks demonstrate the efficiency and effectiveness of DO-UAP. Specifically, our approach drastically decreases the time consumption by 23-fold while achieving a better attack performance. |
2024-10-14 |
ABBA-VSM: Time Series Classification using Symbolic Representation on the Edge
In recent years, Edge AI has become more prevalent with applications across various industries, from environmental monitoring to smart city management. 0.619 Edge AI facilitates the processing of Internet of Things (IoT) data and provides privacy-enabled and latency-sensitive services to application users using Machine Learning (ML) algorithms, e.g., Time Series Classification (TSC). 0.507 However, existing TSC algorithms require access to full raw data and demand substantial computing resources to train and use them effectively at runtime. This makes them impractical for deployment in resource-constrained Edge environments. To address this, in this paper, we propose an Adaptive Brownian Bridge-based Symbolic Aggregation Vector Space Model (ABBA-VSM). It is a new TSC model designed for classification services on Edge. Here, we first adaptively compress the raw time series into symbolic representations, thus capturing the changing trends of data. Subsequently, we train the classification model directly on these symbols. ABBA-VSM reduces communication data between IoT and Edge devices, as well as computation cycles, in the development of resource-efficient TSC services on Edge. We evaluate our solution with extensive experiments using datasets from the UCR time series classification archive. The results demonstrate that the ABBA-VSM achieves up to 80% compression ratio and 90-100% accuracy for binary classification. For non-binary classification, it achieves an average compression ratio of 60% and accuracy ranging from 60-80%. |
2024-10-14 |
LG-CAV: Train Any Concept Activation Vector with Language Guidance
Concept activation vector (CAV) has attracted broad research interest in explainable AI, by elegantly attributing model predictions to specific concepts. 0.545 However, the training of CAV often necessitates a large number of high-quality images, which are expensive to curate and thus limited to a predefined set of concepts. To address this issue, we propose Language-Guided CAV (LG-CAV) to harness the abundant concept knowledge within certain pre-trained vision-language models (e.g., CLIP). This method allows training any CAV without labeled data, by utilizing the corresponding concept descriptions as guidance. To bridge the gap between the vision-language model and the target model, we calculate the activation values of concept descriptions on a common pool of images (probe images) with the vision-language model and utilize them as language guidance to train the LG-CAV. Furthermore, after training high-quality LG-CAVs related to all the predicted classes in the target model, we propose the activation sample reweighting (ASR), serving as a model correction technique, to improve the performance of the target model in return. Experiments on four datasets across nine architectures demonstrate that LG-CAV achieves significantly superior quality to previous CAV methods given any concept, and our model correction method achieves state-of-the-art performance compared to existing concept-based methods. Our code is available at https://github.com/hqhQAQ/LG-CAV. |
2024-10-14 |
Towards Reliable Verification of Unauthorized Data Usage in Personalized Text-to-Image Diffusion Models
Text-to-image diffusion models are pushing the boundaries of what generative AI can achieve in our lives. Beyond their ability to generate general images, new personalization techniques have been proposed to customize the pre-trained base models for crafting images with specific themes or styles. Such a lightweight solution, enabling AI practitioners and developers to easily build their own personalized models, also poses a new concern regarding whether the personalized models are trained from unauthorized data. 0.67 A promising solution is to proactively enable data traceability in generative models, where data owners embed external coatings (e.g., image watermarks or backdoor triggers) onto the datasets before release. Later the models trained over such datasets will also learn the coatings and unconsciously reproduce them in the generated mimicries, which can be extracted and used as the data usage evidence. However, we find that the existing coatings cannot be effectively learned in personalization tasks, making the corresponding verification less reliable. In this paper, we introduce SIREN, a novel methodology to proactively trace unauthorized data usage in black-box personalized text-to-image diffusion models. Our approach optimizes the coating in a delicate way to be recognized by the model as a feature relevant to the personalization task, thus significantly improving its learnability. We also utilize a human perceptual-aware constraint, a hypersphere classification technique, and a hypothesis-testing-guided verification method to enhance the stealthiness and detection accuracy of the coating. The effectiveness of SIREN is verified through extensive experiments on a diverse set of benchmark datasets, models, and learning algorithms. SIREN is also effective in various real-world scenarios and evaluated against potential countermeasures. Our code is publicly available. |
2024-10-14 |
Generalized Adversarial Code-Suggestions: Exploiting Contexts of LLM-based Code-Completion
While convenient, relying on LLM-powered code assistants in day-to-day work gives rise to severe attacks. For instance, the assistant might introduce subtle flaws and suggest vulnerable code to the user. These adversarial code-suggestions can be introduced via data poisoning and, thus, unknowingly by the model creators. 0.504 In this paper, we provide a generalized formulation of such attacks, spanning and extending related work in this domain. This formulation is defined over two components: First, a trigger pattern occurring in the prompts of a specific user group, and, second, a learnable map in embedding space from the prompt to an adversarial bait. The latter gives rise to novel and more flexible targeted attack-strategies, allowing the adversary to choose the most suitable trigger pattern for a specific user-group arbitrarily, without restrictions on the pattern's tokens. Our directional-map attacks and prompt-indexing attacks increase the stealthiness decisively. We extensively evaluate the effectiveness of these attacks and carefully investigate defensive mechanisms to explore the limits of generalized adversarial code-suggestions. We find that most defenses unfortunately offer only little protection. |
2024-10-14 |
Users' Perception on Appropriateness of Robotic Coaching Assistant's Disclosure Behaviors
Social robots have emerged as valuable contributors to individuals' well-being coaching. 0.612 Notably, their integration into long-term human coaching trials shows particular promise, emphasizing a complementary role alongside human coaches rather than outright replacement. In this context, robots serve as supportive entities during coaching sessions, offering insights based on their knowledge about users' well-being and activity. 0.541 Traditionally, such insights have been gathered through methods like written self-reports or wearable data visualizations. However, the disclosure of people's information by a robot raises concerns regarding privacy, appropriateness, and trust. 0.683 To address this, we conducted an initial study with [n = 22] participants to quantify their perceptions of privacy regarding disclosures made by a robot coaching assistant. 0.625 The study was conducted online, presenting participants with six prerecorded scenarios illustrating various types of information disclosure and the robot's role, ranging from active on-demand to proactive communication conditions. 0.534 |
2024-10-14 |
Traversability-Aware Legged Navigation by Learning from Real-World Visual Data
The enhanced mobility brought by legged locomotion empowers quadrupedal robots to navigate through complex and unstructured environments. However, optimizing agile locomotion while accounting for the varying energy costs of traversing different terrains remains an open challenge. Most previous work focuses on planning trajectories with traversability cost estimation based on human-labeled environmental features. However, this human-centric approach is insufficient because it does not account for the varying capabilities of the robot locomotion controllers over challenging terrains. 0.56 To address this, we develop a novel traversability estimator in a robot-centric manner, based on the value function of the robot's locomotion controller. This estimator is integrated into a new learning-based RGBD navigation framework. The framework develops a planner that guides the robot in avoiding obstacles and hard-to-traverse terrains while reaching its goals. The training of the navigation planner is directly performed in the real world using a sample efficient reinforcement learning method. Through extensive benchmarking, we demonstrate that the proposed framework achieves the best performance in accurate traversability cost estimation and efficient learning from multi-modal data (the robot's color and depth vision, and proprioceptive feedback) for real-world training. Using the proposed method, a quadrupedal robot learns to perform traversability-aware navigation through trial and error in various real-world environments with challenging terrains that are difficult to classify using depth vision alone. |
2024-10-14 |
DR-MPC: Deep Residual Model Predictive Control for Real-world Social Navigation
How can a robot safely navigate around people exhibiting complex motion patterns? 0.54 Reinforcement Learning (RL) or Deep RL (DRL) in simulation holds some promise, although much prior work relies on simulators that fail to precisely capture the nuances of real human motion. To address this gap, we propose Deep Residual Model Predictive Control (DR-MPC), a method to enable robots to quickly and safely perform DRL from real-world crowd navigation data. By blending MPC with model-free DRL, DR-MPC overcomes the traditional DRL challenges of large data requirements and unsafe initial behavior. DR-MPC is initialized with MPC-based path tracking, and gradually learns to interact more effectively with humans. To further accelerate learning, a safety component estimates when the robot encounters out-of-distribution states and guides it away from likely collisions. In simulation, we show that DR-MPC substantially outperforms prior work, including traditional DRL and residual DRL models. Real-world experiments show our approach successfully enables a robot to navigate a variety of crowded situations with few errors using less than 4 hours of training data. |
2024-10-14 |
A Personalized MOOC Learning Group and Course Recommendation Method Based on Graph Neural Network and Social Network Analysis
In order to enhance students' initiative and participation in MOOC learning, this study constructed a multi-level network model based on Social Network Analysis (SNA). The model makes use of data pertaining to nearly 40,000 users and tens of thousands of courses from various higher education MOOC platforms. Furthermore, an AI-based assistant has been developed which utilises the collected data to provide personalised recommendations regarding courses and study groups for students. The objective is to examine the relationship between students' course selection preferences and their academic interest levels. Based on the results of the relationship analysis, the AI assistant employs technologies such as GNN to recommend suitable courses and study groups to students. 0.54 This study offers new insights into the potential of personalised teaching on MOOC platforms, demonstrating the value of data-driven and AI-assisted methods in improving the quality of online learning experiences, increasing student engagement, and enhancing learning outcomes. |
2024-10-10 |
APOLLO: A GPT-based tool to detect phishing emails and generate explanations that warn users
Phishing is one of the most prolific cybercriminal activities, with attacks becoming increasingly sophisticated. It is, therefore, imperative to explore novel technologies to improve user protection across both technical and human dimensions. 0.625 Large Language Models (LLMs) offer significant promise for text processing in various domains, but their use for defense against phishing attacks still remains scarcely explored. In this paper, we present APOLLO, a tool based on OpenAI's GPT-4o to detect phishing emails and generate explanation messages to users about why a specific email is dangerous, thus improving their decision-making capabilities. We have evaluated the performance of APOLLO in classifying phishing emails; the results show that LLMs have exemplary capabilities in classifying phishing emails (97 percent accuracy in the case of GPT-4o) and that this performance can be further improved by integrating data from third-party services, resulting in a near-perfect classification rate (99 percent accuracy). To assess the perception of the explanations generated by this tool, we also conducted a study with 20 participants, comparing four different explanations presented as phishing warnings. We compared the LLM-generated explanations to four baselines: a manually crafted warning, and warnings from Chrome, Firefox, and Edge browsers. The results show that the LLM-generated explanations were not only perceived as high quality, but can also be more understandable, interesting, and trustworthy than the baselines. These findings suggest that using LLMs as a defense against phishing is a very promising approach, with APOLLO representing a proof of concept in this research direction. |
2024-10-10 |
A Target-Aware Analysis of Data Augmentation for Hate Speech Detection
Hate speech is one of the main threats posed by the widespread use of social networks, despite efforts to limit it. 0.531 Although attention has been devoted to this issue, the lack of datasets and case studies centered around scarcely represented phenomena, such as ableism or ageism, can lead to hate speech detection systems that do not perform well on underrepresented identity groups. Given the unprecedented capabilities of LLMs in producing high-quality data, we investigate the possibility of augmenting existing data with generative language models, reducing target imbalance. We experiment with augmenting 1,000 posts from the Measuring Hate Speech corpus, an English dataset annotated with target identity information, adding around 30,000 synthetic examples using both simple data augmentation methods and different types of generative models, comparing autoregressive and sequence-to-sequence approaches. We find traditional DA methods to often be preferable to generative models, but the combination of the two tends to lead to the best results. Indeed, for some hate categories such as origin, religion, and disability, hate speech classification using augmented data for training improves by more than 10% F1 over the no augmentation baseline. This work contributes to the development of systems for hate speech detection that are not only better performing but also fairer and more inclusive towards targets that have been neglected so far. |
2024-10-10 |
Robust AI-Generated Text Detection by Restricted Embeddings
The growing amount and quality of AI-generated texts make detecting such content more difficult. 0.507 In most real-world scenarios, the domain (style and topic) of generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry of the embedding space of Transformer-based text encoders and show that clearing out harmful linear subspaces helps to train a robust classifier, ignoring domain-specific spurious features. We investigate several subspace decomposition and feature selection strategies and achieve significant improvements over state-of-the-art methods in cross-domain and cross-generator transfer. Our best approaches for head-wise and coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% in particular setups for RoBERTa and BERT embeddings respectively. We release our code and data: https://github.com/SilverSolver/RobustATD |
2024-10-10 |
On the Evaluation of Generative Robotic Simulations
Due to the difficulty of acquiring extensive real-world data, robot simulation has become crucial for parallel training and sim-to-real transfer, highlighting the importance of scalable simulated robotic tasks. Foundation models have demonstrated impressive capacities in autonomously generating feasible robotic tasks. However, this new paradigm underscores the challenge of adequately evaluating these autonomously generated tasks. To address this, we propose a comprehensive evaluation framework tailored to generative simulations. Our framework segments evaluation into three core aspects: quality, diversity, and generalization. For single-task quality, we evaluate the realism of the generated task and the completeness of the generated trajectories using large language models and vision-language models. In terms of diversity, we measure both task and data diversity through text similarity of task descriptions and world model loss trained on collected task trajectories. For task-level generalization, we assess the zero-shot generalization ability on unseen tasks of a policy trained with multiple generated tasks. Experiments conducted on three representative task generation pipelines demonstrate that the results from our framework are highly consistent with human evaluations, confirming the feasibility and validity of our approach. The findings reveal that while metrics of quality and diversity can be achieved through certain methods, no single approach excels across all metrics, suggesting a need for greater focus on balancing these different metrics. Additionally, our analysis further highlights the common challenge of low generalization capability faced by current works. Our anonymous website: https://sites.google.com/view/evaltasks. 0.51 |