WorldGPT、Pix2Pix-OnTheFly、StyleDyRF、ManiGaussian、Face SR

本文首发于公众号:机器感知

WorldGPT、Pix2Pix-OnTheFly、StyleDyRF、ManiGaussian、Face SR

HandGCAT: Occlusion-Robust 3D Hand Mesh Reconstruction from Monocular  Images

图片

We propose a robust and accurate method for reconstructing 3D hand mesh from monocular images. This is a very challenging problem, as hands are often severely occluded by objects. Previous works often have disregarded 2D hand pose information, which contains hand prior knowledge that is strongly correlated with occluded regions. Thus, in this work, we propose a novel 3D hand mesh reconstruction network HandGCAT, that can fully exploit hand prior as compensation information to enhance occluded region features. Specifically, we designed the Knowledge-Guided Graph Convolution (KGC) module and the Cross-Attention Transformer (CAT) module. KGC extracts hand prior information from 2D hand pose by graph convolution. CAT fuses hand prior into occluded regions by considering their high correlation.

Text-to-Audio Generation Synchronized with Videos

图片

In recent times, the focus on text-to-audio (TTA) generation has intensified, as researchers strive to synthesize audio from textual descriptions. However, most existing methods, though leveraging latent diffusion models to learn the correlation between audio and text embeddings, fall short when it comes to maintaining a seamless synchronization between the produced audio and its video. This often results in discernible audio-visual mismatches. To bridge this gap, we introduce a groundbreaking benchmark for Text-to-Audio generation that aligns with Videos, named T2AV-Bench. This benchmark distinguishes itself with three novel metrics dedicated to evaluating visual alignment and temporal consistency.

WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text  and Image Inputs

图片

Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, it remains a formidable challenge pertaining to maintaining temporal consistency and ensuring action smoothness throughout the generated sequences. In this paper, we present an innovative video generation AI agent that harnesses the power of Sora-inspired multimodal learning to build skilled world models framework based on textual prompts and accompanying images. The framework includes two parts: prompt enhancer and full video translation.

Pix2Pix-OnTheFly: Leveraging LLMs for Instruction-Guided Image Editing

图片

The combination of language processing and image processing keeps attracting increased interest given recent impressive advances that leverage the combined strengths of both domains of research. Among these advances, the task of editing an image on the basis solely of a natural language instruction stands out as a most challenging endeavour. While recent approaches for this task resort, in one way or other, to some form of preliminary preparation, training or fine-tuning, this paper explores a novel approach: We propose a preparation-free method that permits instruction-guided image editing on the fly. This approach is organized along three steps properly orchestrated that resort to image captioning and DDIM inversion, followed by obtaining the edit direction embedding, followed by image editing proper.

CHAI: Clustered Head Attention for Efficient LLM Inference

图片

Large Language Models (LLMs) with hundreds of billions of parameters have transformed the field of machine learning. However, serving these models at inference time is both compute and memory intensive, where a single request can require multiple GPUs and tens of Gigabytes of memory. Multi-Head Attention is one of the key components of LLMs, which can account for over 50% of LLMs memory and compute requirement. We observe that there is a high amount of redundancy across heads on which tokens they pay attention to. Based on this insight, we propose Clustered Head Attention (CHAI). CHAI combines heads with a high amount of correlation for self-attention at runtime, thus reducing both memory and compute.

Mechanics of Next Token Prediction with Self-Attention

图片

Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: $\textit{What}$ $\textit{does}$ $\textit{a}$ $\textit{single}$ $\textit{self-attention}$ $\textit{layer}$ $\textit{learn}$ $\textit{from}$ $\textit{next-token}$ $\textit{prediction?}$ We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: $\textbf{(1)}$ $\textbf{Hard}$ $\textbf{retrieval:}$ Given input sequence, self-attention precisely selects the $\textit{high-priority}$ $\textit{input}$ $\textit{tokens}$ associated with the last input token. $\textbf{(2)}$ $\textbf{Soft}$ $\textbf{composition:}$ It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window.

Make Me Happier: Evoking Emotions Through Image Diffusion Models

图片

Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. For the first time, we present a novel challenge of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations.

StyleDyRF: Zero-shot 4D Style Transfer for Dynamic Neural Radiance Fields

图片

4D style transfer aims at transferring arbitrary visual style to the synthesized novel views of a dynamic 4D scene with varying viewpoints and times. Existing efforts on 3D style transfer can effectively combine the visual features of style images and neural radiance fields (NeRF) but fail to handle the 4D dynamic scenes limited by the static scene assumption. Consequently, we aim to handle the novel challenging problem of 4D style transfer for the first time, which further requires the consistency of stylized results on dynamic objects. In this paper, we introduce StyleDyRF, a method that represents the 4D feature space by deforming a canonical feature volume and learns a linear style transformation matrix on the feature volume in a data-driven fashion. To obtain the canonical feature volume, the rays at each time step are deformed with the geometric prior of a pre-trained dynamic NeRF to render the feature map under the supervision of pre-trained visual encoders.

ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation

图片

Performing language-conditioned robotic manipulation tasks in unstructured environments is highly demanded for general intelligent robots. Conventional robotic manipulation methods usually learn semantic representation of the observation for action prediction, which ignores the scene-level spatiotemporal dynamics for human goal completion. In this paper, we propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation, which mines scene dynamics via future scene reconstruction. Specifically, we first formulate the dynamic Gaussian Splatting framework that infers the semantics propagation in the Gaussian embedding space, where the semantic representation is leveraged to predict the optimal robot action. Then, we build a Gaussian world model to parameterize the distribution in our dynamic Gaussian Splatting framework, which provides informative supervision in the interactive environment via future scene reconstruction. We evaluate our ManiGaussian on 10 RLBench tasks with 166 variations, and the results demonstrate our framework can outperform the state-of-the-art methods by 13.1\% in average success rate.

Tackling the Singularities at the Endpoints of Time Intervals in  Diffusion Models

图片

Most diffusion models assume that the reverse process adheres to a Gaussian distribution. However, this approximation has not been rigorously validated, especially at singularities, where t=0 and t=1. Improperly dealing with such singularities leads to an average brightness issue in applications, and limits the generation of images with extreme brightness or darkness. We primarily focus on tackling singularities from both theoretical and practical perspectives. Initially, we establish the error bounds for the reverse process approximation, and showcase its Gaussian characteristics at singularity time steps. Based on this theoretical insight, we confirm the singularity at t=1 is conditionally removable while it at t=0 is an inherent property. Upon these significant conclusions, we propose a novel plug-and-play method SingDiffusion to address the initial singular time step sampling, which not only effectively resolves the average brightness issue for a wide range of diffusion models without extra training efforts, but also enhances their generation capability in achieving notable lower FID scores.

Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation

图片

The pre-trained vision-language model, exemplified by CLIP, advances zero-shot semantic segmentation by aligning visual features with class embeddings through a transformer decoder to generate semantic masks. Despite its effectiveness, prevailing methods within this paradigm encounter challenges, including overfitting on seen classes and small fragmentation in masks. To mitigate these issues, we propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.Specifically, we leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings. Moreover, to circumvent noisy alignments from the vision part due to its redundant nature, we introduce route attention into self-attention for finding visual consensus, thereby enhancing semantic consistency within the same object. Equipped with a vision-language prompting strategy, our approach significantly boosts the generalization capacity of segmentation models for unseen classes. Experimental results underscore the effectiveness of our approach, showcasing mIoU gains of 4.5 on the PASCAL VOC 2012 and 3.6 on the COCO-Stuff 164k for unseen classes compared with the state-of-the-art methods.

PFStorer: Personalized Face Restoration and Super-Resolution

图片

Recent developments in face restoration have achieved remarkable results in producing high-quality and lifelike outputs. The stunning results however often fail to be faithful with respect to the identity of the person as the models lack necessary context. In this paper, we explore the potential of personalized face restoration with diffusion models. In our approach a restoration model is personalized using a few images of the identity, leading to tailored restoration with respect to the identity while retaining fine-grained details. By using independent trainable blocks for personalization, the rich prior of a base restoration model can be exploited to its fullest. To avoid the model relying on parts of identity left in the conditioning low-quality images, a generative regularizer is employed. With a learnable parameter, the model learns to balance between the details generated based on the input image and the degree of personalization. Moreover, we improve the training pipeline of face restoration models to enable an alignment-free approach. We showcase the robust capabilities of our approach in several real-world scenarios with multiple identities, demonstrating our method's ability to generate fine-grained details with faithful restoration. In the user study we evaluate the perceptual quality and faithfulness of the genereated details, with our method being voted best 61% of the time compared to the second best with 25% of the votes.

Towards Dense and Accurate Radar Perception Via Efficient Cross-Modal Diffusion Model

图片

Millimeter wave (mmWave) radars have attracted significant attention from both academia and industry due to their capability to operate in extreme weather conditions. However, they face challenges in terms of sparsity and noise interference, which hinder their application in the field of micro aerial vehicle (MAV) autonomous navigation. To this end, this paper proposes a novel approach to dense and accurate mmWave radar point cloud construction via cross-modal learning. Specifically, we introduce diffusion models, which possess state-of-the-art performance in generative modeling, to predict LiDAR-like point clouds from paired raw radar data. We also incorporate the most recent diffusion model inference accelerating techniques to ensure that the proposed method can be implemented on MAVs with limited computing resources.

Gaussian Splatting in Style

图片

Scene stylization extends the work of neural style transfer to three spatial dimensions. A vital challenge in this problem is to maintain the uniformity of the stylized appearance across a multi-view setting. A vast majority of the previous works achieve this by optimizing the scene with a specific style image. In contrast, we propose a novel architecture trained on a collection of style images, that at test time produces high quality stylized novel views. Our work builds up on the framework of 3D Gaussian splatting. For a given scene, we take the pretrained Gaussians and process them using a multi resolution hash grid and a tiny MLP to obtain the conditional stylised views. The explicit nature of 3D Gaussians give us inherent advantages over NeRF-based methods including geometric consistency, along with having a fast training and rendering regime.

OneVOS: Unifying Video Object Segmentation with All-in-One Transformer  Framework

图片

Contemporary Video Object Segmentation (VOS) approaches typically consist stages of feature extraction, matching, memory management, and multiple objects aggregation. Recent advanced models either employ a discrete modeling for these components in a sequential manner, or optimize a combined pipeline through substructure aggregation. However, these existing explicit staged approaches prevent the VOS framework from being optimized as a unified whole, leading to the limited capacity and suboptimal performance in tackling complex videos. In this paper, we propose OneVOS, a novel framework that unifies the core components of VOS with All-in-One Transformer. Specifically, to unify all aforementioned modules into a vision transformer, we model all the features of frames, masks and memory for multiple objects as transformer tokens, and integrally accomplish feature extraction, matching and memory management of multiple objects through the flexible attention mechanism. Furthermore, a Unidirectional Hybrid Attention is proposed through a double decoupling of the original attention operation, to rectify semantic errors and ambiguities of stored tokens in OneVOS framework. Finally, to alleviate the storage burden and expedite inference, we propose the Dynamic Token Selector, which unveils the working mechanism of OneVOS and naturally leads to a more efficient version of OneVOS.

VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

图片

We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3d pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mfbz.cn/a/457098.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

影城管理系统|基于springboot框架+ Mysql+Java+B/S架构的影城管理系统设计与实现(可运行源码+数据库+设计文档+部署说明)

推荐阅读100套最新项目 最新ssmjava项目文档视频演示可运行源码分享 最新jspjava项目文档视频演示可运行源码分享 最新Spring Boot项目文档视频演示可运行源码分享 目录 前台功能效果图 管理员功能登录前台功能效果图 系统功能设计 数据库E-R图设计 lunwen参考 摘要 研究…

MISC-Catflag

前言 开始拿到这道题,以为是要识别文件类型,后面发现不是,kali识别为ascii文本文件。而用010editor打开,又是一堆看不懂的码 后面发现有很多重复内容1B 5B 43等等,再看题目type flag or cat flag可以联想linux的cat命…

Affinity Designer:超越想象,打造独一无二的设计作品!mac/win版

Affinity Designer是一款功能强大的图形设计软件,它拥有广泛的工具和功能,可以满足各种设计需求。无论是平面设计师、UI/UX设计师、插画师还是摄影师,Affinity Designer都能为他们提供所需的工具和支持。 Affinity Designer 软件获取 Affin…

Oracle 配置多个缓冲池(Keep pool Recycle Pool)

默认情况下,Oracle只有一个缓冲池 - Buffer Cache,其可以满足基本数据缓存需求。但某些数据的访问模式可能与普通数据不同,对于访问非常频繁的数据和很少访问的数据(两种极端),Oracle可以支持配置两个独立的…

鸿蒙到底好不好?要不要搞?

相信各位搞安卓的小伙伴多多少少都了解过鸿蒙,有些一知半解而有些已经开始学习起来了。 鸿蒙到底好不好?要不要搞? Android开发反正目前工作感觉也不好找,即便是上海这样的大城市也难搞,人员挺饱和的。而且年前裁员的…

寒假作业Day 12

寒假作业Day 12 一、选择题 队列是先进先出的线性表,既能插入数据,也能删除数据 A:ABCD,一进栈就出栈;B:DCBA,全部进栈之后再出栈 C:ACBD,先进A,然后出&…

“禁止互撕”新规第二天,热搜把#章子怡“怒怼”网友#推上了榜一

3月12日,微博热搜发布公告,对热搜词条处置规则进行了更新。 针对热搜词条长期以来存在的引战互撕、挑唆对立等不良现象,热搜生态秩序亟待改善,微博给出了两大解决方案: 一是更新热搜词条处置规则,当热搜词…

【C++】string的底层剖析以及模拟实现

一、字符串类的认识 C语言中,字符串是以\0结尾的一些字符的集合,为了操作方便,C标准库中提供了一些str系列的库函数, 但是这些库函数与字符串是分离开的,不太符合OOP的思想,而且底层空间需要用户自己管理&a…

[LeetCode][110]平衡二叉树

题目 110.平衡二叉树 给定一个二叉树,判断它是否是平衡二叉树。 示例 1: 输入:root [3,9,20,null,null,15,7] 输出:true 示例 2: 输入:root [1,2,2,3,3,null,null,4,4] 输出:false 示例 3&…

ROS 语音交互(三) tts

目录 一、模型选择 二、流程 三、核心代码展示 一、模型选择 科大讯飞超拟人识别 二、流程 超拟⼈合成协议 | 讯飞开放平台文档中心 (xfyun.cn) 三、核心代码展示 # coding: utf-8 import _thread as thread import os import time import base64import base64 import …

一种基于宏和serde_json实现的rust web中统一返回类

本人rust萌新,写web碰到了这个,基于ChatGPT和文心一言学了宏,强行把这玩意实现出来了,做个学习记录,如果有更好的方法,勿喷。 先看效果,注意不支持嵌套,且kv映射要用>(因为它这个…

【LeetCode热题100】142. 环形链表 II(链表)

一.题目要求 给定一个链表的头节点 head ,返回链表开始入环的第一个节点。 如果链表无环,则返回 null。 如果链表中有某个节点,可以通过连续跟踪 next 指针再次到达,则链表中存在环。 为了表示给定链表中的环,评测系统…

C++初阶:类和对象(三)——拷贝构造函数和运算符重载

目录 一、拷贝构造函数 1.概念 2.特性 二、赋值运算符重载 1.运算符重载 2.赋值运算符重载 (1)注意的点: (2)赋值运算符不允许被重载为全局函数,只能重载为类的成员函数 (3)…

C语言例3-11:使用算术运算符的例子。

代码如下: int main(void) {int a12, b10;float c2.0, d0.5;double e6.5, f13.0;printf("-a %d\n",-a);printf("ab %d\n",ab);printf("a-b %d\n",a-b);printf("a*b %d\n",a*b);printf("a/b %d\n"…

RabbitMq踩坑记录

1、连接报错:Broker not available; cannot force queue declarations during start: java.io.IOException 2.1、原因:端口不对 2.2、解决方案: 检查你的连接配置,很可能是你的yml里面的端口配置的是15672,更改为5672即…

Python实时追踪关键点组成人体模型

项目背景 最近遇到这样一个需求: 1:实时追踪关键点组成人体模型(手臂包括三个点:手腕,肘关节,双肩;腿部包括胯骨,膝盖,脚踝) 2:运用追踪到的关键…

15.WEB渗透测试--Kali Linux(三)

免责声明:内容仅供学习参考,请合法利用知识,禁止进行违法犯罪活动! 内容参考于: 易锦网校会员专享课 上一个内容:14.WEB渗透测试--Kali Linux(二)-CSDN博客 Kali工具使用 3389远…

基于springboot+vue实现电子商务平台管理系统项目【项目源码+论文说明】计算机毕业设计

基于springboot实现电子商务平台管理系统演示 研究的目的和意义 据我国IT行业发布的报告表明,近年来,我国互联网发展呈快速增长趋势,网民的数量已达8700万,逼近世界第一,并且随着宽带的实施及降价,每天约有…

【数据结构】树与堆 (向上/下调整算法和复杂度的分析、堆排序以及topk问题)

文章目录 1.树的概念1.1树的相关概念1.2树的表示 2.二叉树2.1概念2.2特殊二叉树2.3二叉树的存储 3.堆3.1堆的插入(向上调整)3.2堆的删除(向下调整)3.3堆的创建3.3.1使用向上调整3.3.2使用向下调整3.3.3两种建堆方式的比较 3.4堆排…

从0开始做知乎好物

一、什么是知乎好物 是知乎官方推出的一种帮助内容创作者变现的方式,以图文的形式为主,在创作的内容当中插入相关的商品卡片,用户通过点击卡片链接购买后,会有一定比例的佣金给到创作者 是指在知乎社区中推荐或被推荐的一些高品…
最新文章