
October 10, 2024
Camille X
So far, ACT's VAE, diffusion policy's U-Net, and pi0's flow matching all essentially come from the image generation field. Under conditioning, image generation and robot action-sequence prediction have a lot in common: an image is (c, h, w), a robot action chunk is (horizon, action_dim), and both are multi-dimensional continuous-value sequence prediction problems, with c roughly corresponding to action_dim and h*w to horizon.
In the end, you find that the sequences generated by ACT, DP, and pi0 all come from a sample of shape (horizon, action_dim). ACT uses all-zero queries, with encoder_output as key and value; DP and pi0, while both sampling from a standard Gaussian of shape (horizon, action_dim), are also conditioned on state, image, and environment information, which can be read as using the (horizon, action_dim) standard-Gaussian sample as the query and the condition as key and value.
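Seen through that lens, the difference is mostly in what plays the role of the query. Below is a minimal PyTorch sketch of this reading; the sizes, the single attention layer, and the linear projections are hypothetical stand-ins for the real ACT/DP/pi0 decoders, not their actual implementations.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
horizon, action_dim, d_model, n_cond = 16, 7, 256, 64

cond_tokens = torch.randn(1, n_cond, d_model)        # encoder output / observation condition
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
to_action = nn.Linear(d_model, action_dim)

# ACT-style reading: all-zero queries of length horizon, encoder_output as key/value.
act_query = torch.zeros(1, horizon, d_model)
act_out, _ = attn(act_query, cond_tokens, cond_tokens)
act_actions = to_action(act_out)                      # (1, horizon, action_dim)

# DP / pi0-style reading: the "query" starts as a standard-Gaussian sample of the
# action shape and is refined against the condition. Only one conceptual step is
# shown; the real samplers iterate with a noise schedule or a flow-matching ODE.
noise = torch.randn(1, horizon, action_dim)
to_embed = nn.Linear(action_dim, d_model)
dp_out, _ = attn(to_embed(noise), cond_tokens, cond_tokens)
dp_actions = to_action(dp_out)                        # (1, horizon, action_dim)
```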
Introducing a VLA addresses common-sense failures, for example the spurious correlation caused by a piece of yellow tape in the scene. Introducing RL might also resolve part of these problems.
Bringing in environment information helps on two fronts: it mitigates the limitations and uncertainty of what the robot can perceive on its own, and it provides richer context that helps the model understand the task and the environment, improving the quality and robustness of the generated action sequences. A typical example is point clouds: in a factory or home setting, matching the point cloud captured by the robot's own camera against a point cloud of the environment lets the robot understand its position and pose within the environment, rather than relying only on its end-effector position, which helps with viewpoint generalization.
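As a concrete (and purely illustrative) version of the point-cloud idea, a standard registration step such as ICP can align the robot camera's point cloud to a prior map of the environment; the resulting transform is a camera pose in the environment frame that can be fed to the policy as extra conditioning. The file names and parameters below are placeholders, and Open3D's ICP is just one possible choice.

```python
import numpy as np
import open3d as o3d

# Hypothetical inputs: the robot camera's current point cloud and a prior map of the scene.
robot_pcd = o3d.io.read_point_cloud("robot_view.ply")
env_pcd = o3d.io.read_point_cloud("environment_map.ply")

# Coarse initial guess (identity here); in practice it would come from odometry
# or a global registration step before ICP refinement.
init = np.eye(4)
result = o3d.pipelines.registration.registration_icp(
    robot_pcd,                     # source: robot camera view
    env_pcd,                       # target: environment map
    0.05,                          # max correspondence distance (meters)
    init,
    o3d.pipelines.registration.TransformationEstimationPointToPoint(),
)

# result.transformation is the camera pose in the environment frame; it (or features
# derived from it) can be appended to the policy's condition alongside images and state.
print(result.transformation)
```

For reference, the diffusion policy's `conditional_sample` below makes the sampling story explicit: it draws a prior of shape (batch, horizon, action_dim), then iteratively denoises it with `global_cond` carrying the observation features (here with the number of inference steps hardcoded to 10):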
```python
def conditional_sample(
    self, batch_size: int, global_cond: Tensor | None = None, generator: torch.Generator | None = None
) -> Tensor:
    device = get_device_from_parameters(self)
    dtype = get_dtype_from_parameters(self)

    # Sample prior.
    sample = torch.randn(
        size=(batch_size, self.config.horizon, self.config.action_feature.shape[0]),
        dtype=dtype,
        device=device,
        generator=generator,
    )

    # self.noise_scheduler.set_timesteps(self.num_inference_steps)
    self.noise_scheduler.set_timesteps(10)

    for t in self.noise_scheduler.timesteps:
        # Predict model output.
        model_output = self.unet(
            sample,
            torch.full(sample.shape[:1], t, dtype=torch.long, device=sample.device),
            global_cond=global_cond,
        )
        # Compute previous image: x_t -> x_t-1
        sample = self.noise_scheduler.step(model_output, t, sample, generator=generator).prev_sample

    return sample
```
DDIM vs DDPM
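To make the comparison concrete, here is a minimal sketch using the schedulers from the diffusers library, with a toy denoiser standing in for the policy's conditional U-Net; the timestep counts are illustrative. DDPM performs stochastic ancestral sampling and typically steps through all training timesteps, while DDIM is a deterministic sampler (eta=0) that tolerates a much shorter inference schedule, which is presumably what the hardcoded set_timesteps(10) above is experimenting with.

```python
import torch
from diffusers import DDIMScheduler, DDPMScheduler

def toy_denoiser(sample, t):
    # Stand-in for the conditional U-Net; it would predict the noise residual.
    return torch.zeros_like(sample)

sample = torch.randn(1, 16, 7)  # (batch, horizon, action_dim)

# DDPM: stochastic sampling over (close to) all training timesteps.
ddpm = DDPMScheduler(num_train_timesteps=100)
ddpm.set_timesteps(100)
x = sample.clone()
for t in ddpm.timesteps:
    x = ddpm.step(toy_denoiser(x, t), t, x).prev_sample

# DDIM: deterministic sampling with far fewer inference steps.
ddim = DDIMScheduler(num_train_timesteps=100)
ddim.set_timesteps(10)
y = sample.clone()
for t in ddim.timesteps:
    y = ddim.step(toy_denoiser(y, t), t, y).prev_sample
```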