DreamPolisher, InternLM2, AniArtAvatar, PlainMamba, AniPortrait

This article was first published on the WeChat official account: 机器感知 (Machine Perception).

DreamPolisher: Towards High-Quality Text-to-3D Generation via Geometric Diffusion

We present DreamPolisher, a novel Gaussian Splatting based method with geometric guidance, tailored to learn cross-view consistency and intricate detail from textual descriptions. While recent progress in text-to-3D generation has been promising, prevailing methods often fail to ensure view consistency and textural richness. This problem becomes particularly noticeable for methods that work with text input alone. To address this, we propose a two-stage Gaussian Splatting based approach that enforces geometric consistency among views. Initially, a coarse 3D generation undergoes refinement via geometric optimization. Subsequently, we use a ControlNet-driven refiner coupled with the geometric consistency term to improve both the texture fidelity and overall consistency of the generated 3D asset. Empirical evaluations across diverse textual prompts spanning various object categories demonstrate the efficacy of DreamPolisher in generating consistent and realistic 3D objects that align closely with the semantics of the textual instructions.
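
The abstract leaves the geometric consistency term unspecified. As a rough illustration only, the toy penalty below, assuming hypothetical per-view feature embeddings of the rendered images, scores how far each view drifts from a consensus embedding; the paper's actual formulation may differ substantially.

```python
# A toy cross-view consistency penalty in the spirit of DreamPolisher's
# geometric consistency term (the paper's exact formulation may differ).
import torch
import torch.nn.functional as F

def view_consistency_loss(view_features: torch.Tensor) -> torch.Tensor:
    """view_features: (V, D), one embedding per rendered view."""
    consensus = view_features.mean(dim=0, keepdim=True)
    # Cosine distance of each view from the consensus; 0 = perfectly consistent.
    return (1.0 - F.cosine_similarity(view_features, consensus, dim=-1)).mean()

feats = torch.randn(8, 512)        # stand-in embeddings for 8 rendered views
loss = view_consistency_loss(feats)
```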

InternLM2 Technical Report

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
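
To make the "Needle-in-a-Haystack" probe concrete, here is a minimal sketch of how such a test is typically constructed: a fact is buried at a chosen depth inside long filler text and the model must retrieve it. The `model_answer` call is a hypothetical stand-in for querying InternLM2.

```python
# Minimal "Needle-in-a-Haystack" style probe (illustrative, not the
# report's actual harness).

def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Bury `needle` at a relative `depth` (0 = start, 1 = end) in filler text."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

needle = "The secret passphrase is cobalt-heron-42."
context = build_haystack("The sky stayed a uniform grey all afternoon.",
                         needle, n_sentences=5000, depth=0.35)
prompt = context + "\nQuestion: What is the secret passphrase?"
# passed = "cobalt-heron-42" in model_answer(prompt)  # hypothetical model call
```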

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching

The Transformer architecture has significantly advanced natural language processing (NLP) and has been foundational in developing large language models (LLMs) such as LLaMA and OPT, which have come to dominate a broad range of NLP tasks. Despite their superior accuracy, LLMs present unique challenges in practical inference owing to their compute- and memory-intensive nature. Thanks to the autoregressive characteristic of LLM inference, KV caching for the attention layers in Transformers can effectively accelerate LLM inference by substituting quadratic-complexity computation with linear-complexity memory accesses. Yet, this approach requires increasing memory as demand grows for processing longer sequences. The overhead leads to reduced throughput due to I/O bottlenecks and even out-of-memory errors, particularly on resource-constrained systems like a single commodity GPU. In this paper, we propose ALISA, a novel algorithm-system co-design solution to address the challenges imposed by KV caching. On the algorithm level, ALISA prioritizes tokens that are most important in generating a new token via a Sparse Window Attention (SWA) algorithm. SWA introduces high sparsity in attention layers and reduces the memory footprint of KV caching at negligible accuracy loss. On the system level, ALISA employs three-phase token-level dynamic scheduling and optimizes the trade-off between caching and recomputation, thus maximizing the overall performance in resource-constrained systems. In a single GPU-CPU system, we demonstrate that under varying workloads, ALISA improves the throughput of baseline systems such as FlexGen and vLLM by up to 3X and 1.9X, respectively.
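
The mechanics behind the KV-caching claim can be shown in a few lines: each decoding step appends one key/value pair to the cache and attends the new query against it (a linear pass over memory instead of recomputing quadratic attention), while an eviction policy bounds cache growth. The toy policy below, keeping a few early "anchor" tokens plus a recent window, only gestures at SWA; the paper's importance-based token selection is more sophisticated.

```python
# Toy autoregressive KV cache with a crude sparse eviction policy
# (illustrative; not ALISA's actual SWA algorithm).
import numpy as np

def attend(q, K, V):
    """Single-head attention of one query against cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])        # (T,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                  # (d,)

d, anchors, window = 16, 4, 64
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for step in range(256):
    k, v, q = np.random.randn(3, d)              # stand-ins for projections
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    if len(K_cache) > anchors + window:          # sparse eviction policy
        keep = list(range(anchors)) + list(range(len(K_cache) - window, len(K_cache)))
        K_cache, V_cache = K_cache[keep], V_cache[keep]
    out = attend(q, K_cache, V_cache)            # linear in cache size
```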

AniArtAvatar: Animatable 3D Art Avatar from a Single Image

We present a novel approach for generating animatable 3D-aware art avatars from a single image, with controllable facial expressions, head poses, and shoulder movements. Unlike previous reenactment methods, our approach utilizes a view-conditioned 2D diffusion model to synthesize multi-view images from a single art portrait with a neutral expression. With the generated colors and normals, we synthesize a static avatar using an SDF-based neural surface. For avatar animation, we extract control points, transfer the motion with these points, and deform the implicit canonical space. Firstly, we render the front image of the avatar, extract the 2D landmarks, and project them to the 3D space using a trained SDF network. We extract 3D driving landmarks using 3DMM and transfer the motion to the avatar landmarks. To animate the avatar pose, we manually set the body height and bound the head and torso of an avatar with two cages. The head and torso can be animated by transforming the two cages. Our approach is a one-shot pipeline that can be applied to various styles. Experiments demonstrate that our method can generate high-quality 3D art avatars with desired control over different motions.
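
The two-cage animation step can be pictured with a toy example: points bound to the head cage follow one rigid transform, points bound to the torso cage follow another. The paper deforms an implicit canonical space rather than explicit points, so the numpy sketch below is purely illustrative.

```python
# Toy two-cage animation: head points follow one rigid transform,
# torso points follow another (illustrative only).
import numpy as np

def rigid(points, R, t):
    """Apply a rigid transform (rotation R, translation t) to (N,3) points."""
    return points @ R.T + t

def animate(points, head_mask, head_RT, torso_RT):
    out = np.empty_like(points)
    out[head_mask] = rigid(points[head_mask], *head_RT)
    out[~head_mask] = rigid(points[~head_mask], *torso_RT)
    return out

pts = np.random.rand(1000, 3)
head_mask = pts[:, 1] > 0.7                      # crude split by height
theta = np.deg2rad(10)                           # nod the head by 10 degrees
Rx = np.array([[1, 0, 0],
               [0, np.cos(theta), -np.sin(theta)],
               [0, np.sin(theta),  np.cos(theta)]])
posed = animate(pts, head_mask, (Rx, np.zeros(3)), (np.eye(3), np.zeros(3)))
```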

DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation

Facial Appearance Editing (FAE) aims to modify physical attributes, such as pose, expression and lighting, of human facial images while preserving attributes like identity and background, which is of great importance in photography. Despite great progress in this area, current research generally faces three challenges: low generation fidelity, poor attribute preservation, and inefficient inference. To overcome the above challenges, this paper presents DiffFAE, a one-stage, highly efficient diffusion-based framework tailored for high-fidelity FAE. For high-fidelity query attribute transfer, we adopt Space-sensitive Physical Customization (SPC), which ensures fidelity and generalization ability by utilizing rendering texture derived from a 3D Morphable Model (3DMM). In order to preserve source attributes, we introduce Region-responsive Semantic Composition (RSC). This module is guided to learn decoupled source-related features, thereby better preserving identity and alleviating artifacts from non-facial attributes such as hair, clothes, and background. We further introduce a consistency regularization for our pipeline to enhance editing controllability by leveraging prior knowledge in the attention matrices of the diffusion model. Extensive experiments demonstrate the superiority of DiffFAE over existing methods, achieving state-of-the-art performance in facial appearance editing.

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image. Our methodology is divided into two stages. Initially, we extract 3D intermediate representations from audio and project them into a sequence of 2D facial landmarks. Subsequently, we employ a robust diffusion model, coupled with a motion module, to convert the landmark sequence into photorealistic and temporally consistent portrait animation. Experimental results demonstrate the superiority of AniPortrait in terms of facial naturalness, pose diversity, and visual quality, thereby offering an enhanced perceptual experience. Moreover, our methodology exhibits considerable potential in terms of flexibility and controllability, which can be effectively applied in areas such as facial motion editing or face reenactment. We release code and model weights at https://github.com/scutzzj/AniPortrait
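
Stage one ends by projecting audio-predicted 3D facial keypoints into a sequence of 2D landmark frames. A minimal version of that projection step is sketched below with a pinhole camera; the intrinsics and random keypoints are made-up stand-ins, not values from the paper.

```python
# Sketch of the 3D-to-2D landmark projection in AniPortrait's first stage
# (toy pinhole camera; intrinsics are invented for illustration).
import numpy as np

def project(points3d, f=800.0, cx=256.0, cy=256.0):
    """Pinhole projection of (N,3) camera-space points to (N,2) pixels."""
    x, y, z = points3d[:, 0], points3d[:, 1], points3d[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=1)

# 30 frames of 68 stand-in keypoints, placed ~2 units in front of the camera
frames = np.random.randn(30, 68, 3) * 0.05 + np.array([0.0, 0.0, 2.0])
landmark_seq = np.stack([project(f) for f in frames])   # (30, 68, 2)
```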

PlainMamba: Improving Non-Hierarchical Mamba in Visual Recognition

We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for general visual recognition. The recent Mamba model has shown how SSMs can be highly competitive with other architectures on sequential data and initial attempts have been made to apply it to images. In this paper, we further adapt the selective scanning process of Mamba to the visual domain, enhancing its ability to learn features from two-dimensional images by (i) a continuous 2D scanning process that improves spatial continuity by ensuring adjacency of tokens in the scanning sequence, and (ii) direction-aware updating which enables the model to discern the spatial relations of tokens by encoding directional information. Our architecture is designed to be easy to use and easy to scale, formed by stacking identical PlainMamba blocks, resulting in a model with constant width throughout all layers. The architecture is further simplified by removing the need for special tokens. We evaluate PlainMamba on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves performance gains over previous non-hierarchical models and is competitive with hierarchical alternatives. For tasks requiring high-resolution inputs, in particular, PlainMamba requires much less computing while maintaining high performance. Code and models are available at https://github.com/ChenhongyiYang/PlainMamba
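
The continuous 2D scan is easy to reproduce: visit tokens row by row, reversing direction on alternate rows, so that consecutive tokens in the 1D sequence are always spatially adjacent in the 2D grid. The sketch below shows only this ordering; the paper additionally uses direction-aware updates and multiple scan directions.

```python
# Boustrophedon ("snake") scan order over a token grid, illustrating the
# continuous 2D scanning idea: no spatial jumps between consecutive tokens.
import numpy as np

def continuous_scan_order(h: int, w: int) -> list[tuple[int, int]]:
    order = []
    for row in range(h):
        cols = range(w) if row % 2 == 0 else range(w - 1, -1, -1)
        order.extend((row, c) for c in cols)
    return order

tokens = np.arange(16).reshape(4, 4)             # a 4x4 grid of token ids
seq = [tokens[r, c] for r, c in continuous_scan_order(4, 4)]
# seq == [0,1,2,3, 7,6,5,4, 8,9,10,11, 15,14,13,12]: every consecutive
# pair of tokens is adjacent in the grid, unlike plain row-major raster scan.
```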

Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior samples. Instead of simply applying moving average to the denoised samples at different timesteps, we first map the denoised samples to data space and then perform moving average to avoid distribution shift across timesteps. Since diffusion models recover low-frequency components before high-frequency details, we further decompose the samples into different frequency components and execute moving average separately on each component. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF can be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that our MASF leads to superior performance compared to the baselines, with almost negligible additional complexity cost.
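
A minimal sketch of the two ingredients combined: an exponential moving average over per-step estimates in data space, applied separately per frequency band via an FFT split. The cutoff and decay rates below are arbitrary illustrative choices, not values from the paper.

```python
# Toy frequency-split moving average over per-step data-space estimates,
# illustrating the MASF idea (constants are invented for illustration).
import numpy as np

def split_freq(x, cutoff=4):
    """Split a 2D sample into low/high frequency parts with an FFT mask."""
    spec = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    mask = np.zeros_like(spec)
    mask[h//2 - cutoff:h//2 + cutoff, w//2 - cutoff:w//2 + cutoff] = 1.0
    low = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
    return low, x - low

ema_low = ema_high = None
for step in range(50):                            # stand-in denoising loop
    x0_est = np.random.randn(32, 32)              # per-step data-space estimate
    low, high = split_freq(x0_est)
    if ema_low is None:
        ema_low, ema_high = low, high
    else:                                         # separate decay per band
        ema_low = 0.9 * ema_low + 0.1 * low
        ema_high = 0.5 * ema_high + 0.5 * high
    x0_smoothed = ema_low + ema_high              # ensembled estimate
```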

Superior and Pragmatic Talking Face Generation with Teacher-Student Framework

Talking face generation technology creates talking videos from arbitrary appearance and motion signals; this "arbitrary" input offers ease of use but also introduces challenges in practical applications. Existing methods work well with standard inputs but suffer serious performance degradation with intricate real-world ones. Moreover, efficiency is also an important concern in deployment. To comprehensively address these issues, we introduce SuperFace, a teacher-student framework that balances quality, robustness, cost and editability. We first propose a simple but effective teacher model capable of handling inputs of varying qualities to generate high-quality results. Building on this, we devise an efficient distillation strategy to acquire an identity-specific student model that maintains quality with significantly reduced computational load. Our experiments validate that SuperFace offers a more comprehensive solution than existing methods for the four mentioned objectives, especially in reducing FLOPs by 99% with the student model. SuperFace can be driven by both video and audio and allows for localized facial attribute editing.
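
In its simplest form, the distillation step reads like standard teacher-student training: freeze the teacher and train a much smaller identity-specific student to reproduce its outputs. The architectures and L1 objective below are generic placeholders, not the paper's actual design.

```python
# Generic teacher-student distillation sketch (placeholder networks; not
# SuperFace's actual architecture or losses).
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 3, 3, padding=1)).eval()
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 3, 3, padding=1))   # far fewer FLOPs
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

frames = torch.rand(4, 3, 64, 64)                 # stand-in driving frames
with torch.no_grad():
    target = teacher(frames)                      # frozen teacher's output
opt.zero_grad()
loss = nn.functional.l1_loss(student(frames), target)
loss.backward()
opt.step()
```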

TC4D: Trajectory-Conditioned Text-to-4D Generation

Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate: they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene's bounding box using rigid transformation along a trajectory parameterized by a spline. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study. Video results can be viewed on our website: https://sherwinbahmani.github.io/tc4d.
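
The global-motion factor, a rigid transformation along a spline trajectory, can be sketched with a Catmull-Rom segment through hand-picked control points and a finite-difference tangent; the paper's exact spline parameterization and orientation handling may differ.

```python
# Toy spline-conditioned rigid motion: translate canonical-space points
# along a Catmull-Rom trajectory (illustrative parameterization only).
import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    """Interpolate between p1 and p2 at t in [0, 1]."""
    return 0.5 * ((2 * p1) + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t**2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t**3)

ctrl = np.array([[0, 0, 0], [1, 0, 0], [2, 1, 0], [3, 1, 1]], float)
t = 0.3
pos = catmull_rom(*ctrl, t)
vel = catmull_rom(*ctrl, t + 1e-3) - pos          # finite-difference tangent
heading = vel / np.linalg.norm(vel)               # direction along trajectory
points = np.random.rand(100, 3) - 0.5             # canonical-space scene points
moved = points + pos                              # translate along trajectory
# (a full version would also rotate `points` to align with `heading`)
```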

ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis

Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to generate beat-aligned motions, struggle to generate gestures that are semantically aligned with the utterance. Compared to beat gestures that align naturally with the audio signal, semantically coherent gestures require modeling the complex interactions between the language and human motion, and can be controlled by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures based on multi-modal speech inputs, but can also facilitate controllability in gesture synthesis. Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities (e.g. audio vs. text) as well as to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures. To further advance research on multi-party interactive gestures, the DnD Group Gesture dataset is released, which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate the effectiveness of our method on a variety of tasks. We urge the reader to watch our supplementary video at our website.
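
One plausible reading of the modality-weighting objective is classifier-free-guidance-style mixing, where user-set scales decide how strongly each conditioning signal (audio vs. text) steers a denoising step. The sketch below shows that generic mechanism, not necessarily the paper's exact formulation; the eps arrays are stand-ins for hypothetical denoiser predictions.

```python
# Generic modality-weighted classifier-free guidance (an assumption about
# the mechanism, not ConvoFusion's confirmed formulation).
import numpy as np

def guided_eps(eps_uncond, eps_audio, eps_text, w_audio=1.5, w_text=3.0):
    """Combine unconditional and per-modality conditional predictions."""
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))

e_u, e_a, e_t = np.random.randn(3, 128)           # stand-in eps predictions
eps = guided_eps(e_u, e_a, e_t, w_audio=1.0, w_text=4.0)  # emphasize text
```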

Efficient Video Object Segmentation via Modulated Cross-Attention Memory

Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x, while significantly reducing the GPU memory by 87% with comparable segmentation performance on short and long video datasets. Notably on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.
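
The property MAVOS exploits can be shown in a toy form: cross-attending each frame's queries against a fixed-size memory keeps per-frame cost constant regardless of video length, unlike a memory bank that grows every few frames. The MCA module itself is more involved; this shows only the constant-cost read.

```python
# Toy fixed-size cross-attention memory: per-frame cost is independent of
# how many frames have been processed (illustrative; not the MCA module).
import numpy as np

def cross_attention(Q, K, V):
    """(N,d) queries attend over (M,d) memory keys/values."""
    A = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

d, mem_size = 32, 64
memory_k = np.random.randn(mem_size, d)           # fixed size, never expands
memory_v = np.random.randn(mem_size, d)
for frame in range(1000):                         # cost constant per frame
    q = np.random.randn(16, d)                    # 16 query tokens this frame
    out = cross_attention(q, memory_k, memory_v)
```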
