Green Screen Augmentation Enables Scene Generalisation in Robotic Manipulation

Eugene Teoh1,∗, Sumit Patidar1,∗, Xiao Ma1, Stephen James1
1Dyson Robot Learning Lab, ∗Equal Contribution
greenaug.github.io
Abstract

Generalising vision-based manipulation policies to novel environments remains a challenging area with limited exploration. Current practices involve collecting data in one location, training imitation learning or reinforcement learning policies with this data, and deploying the policy in the same location. However, this approach lacks scalability as it necessitates data collection in multiple locations for each task. This paper proposes a novel approach where data is collected in a location predominantly featuring green screens. We introduce Green-screen Augmentation (GreenAug), employing a chroma key algorithm to overlay background textures onto a green screen. Through extensive real-world empirical studies with over 850 training demonstrations and 8.2k evaluation episodes, we demonstrate that GreenAug surpasses no augmentation, standard computer vision augmentation, and prior generative augmentation methods in performance. While no algorithmic novelties are claimed, our paper advocates for a fundamental shift in data collection practices. We propose that real-world demonstrations in future research should utilise green screens, followed by the application of GreenAug. We believe GreenAug unlocks policy generalisation to visually distinct novel locations, addressing the current scene generalisation limitations in robot learning.

Keywords: Green Screen, Data Augmentation, Learning from Demonstration

Figure 1: GreenAug provides a simple visual augmentation to robot policies by first collecting data with a green screen, then augmenting it with different textures. The resulting policy can be transferred to unseen visually distinct novel locations (scenes).

1 Introduction

Recent advancements in robot learning policies [1, 2, 3, 4, 5, 6, 7] have shown significant capabilities in performing complex manipulation tasks. However, generalising these policies to new locations remains a substantial challenge due to the lack of diverse training datasets. Ideally, these datasets should include a wide variety of environments, such as diverse areas of homes. However, gathering real-world data from different scenes is difficult and costly. Here, a scene refers to a visually distinct physical location, such as an oven situated in different kitchens or a toilet placed in various homes. This difficulty necessitates more efficient use of existing datasets.

Generative augmentation approaches [8, 9, 10] have attempted to address this by using generative models [11, 12, 13] to augment robot datasets. However, these methods often require extensive manual tuning and face several challenges. These include text prompt engineering, chaining multiple object detectors, segmenters and generative models, and problems with performance and processing speed. Additionally, they can be inaccurate in robotic settings, particularly in segmentation and inpainting from wrist camera views, potentially introducing noise into robot policies.

In light of these complications, we opt for a simpler yet effective alternative: green screens. The film industry has utilised green screens extensively [14, 15, 16, 17, 18], enabling the addition of virtual backgrounds to live footage. Inspired by these applications, we apply green screen technology to robotics, allowing robots to perform tasks in unfamiliar scenes not part of the training demonstration data.

Figure 2: The GreenAug process begins with acquiring a green screen mask using chroma keying. GreenAug-Rand applies random textures, GreenAug-Gen uses generative models to inpaint the background, and GreenAug-Mask learns a masking network to filter out the background.

In this paper, we introduce Green-screen Augmentation (GreenAug), a simple real-world visual augmentation method that uses green screen and chroma keying to replace backgrounds, applicable to RGB-based robot learning methods. We explore several variants of GreenAug, including the use of random textures (Fig. 1), backgrounds generated by generative models, and a background masking network to obscure the background during inference. By replacing backgrounds with various textures, GreenAug allows robot learning policies to be robust against changes in visual scenes and to focus on crucial features in the image space.

We conducted extensive real-world experiments across eight challenging robotic manipulation tasks and six further studies, amounting to over 850 training demonstrations and 8.2k evaluation episodes. We evaluated the performance of control policies in unseen scenes for head-to-head comparisons on scene generalisation. We compared several variants of GreenAug against approaches with no augmentation, standard computer vision augmentations, and a generative augmentation [8, 9, 10] method. Our results show that GreenAug outperforms no augmentation by 65%, standard computer vision augmentation by 29% and generative augmentation by 21%.

2 Related Work

Visual augmentation in robotics. Visual augmentation is important in robotics for adapting to changing environments. Standard computer vision augmentations like random photometric distortion, cropping, shifting, convolutions and overlays have enhanced performance in imitation learning [19, 20] and reinforcement learning [1, 21, 2, 22, 23, 24]. However, most of these methods only apply simple photometric perturbations. Domain randomisation [25, 26, 27, 28, 29, 30, 31] enhances this by generating synthetic data with varied visual and physical dynamics parameters for simulation-to-reality (Sim2Real) transfer. Alternatively, methods like CACTI [8], GenAug [9], and ROSIE [10] use generative models such as Stable Diffusion [12] to diversify visual data directly on real-world data, bypassing the need for simulation.

Green screen in machine learning and robotics. Green screen has been traditionally used for film and video production [14, 15, 16, 17, 18]. In recent years, its application in machine learning has increased. Smirnov et al. [18] applied machine learning to improve the quality of chroma keying. Xu et al. [32], Sengupta et al. [33], Lin et al. [34, 35] explored machine learning techniques to replace green screens, enabling natural image matting without them. Schülein et al. [36] used green screens and chroma keying to replace backgrounds with clinical scenes to create synthetic data for medical clothing detection. In robotics, the use of green screens remains limited. Coates and Ng [37] employed it to develop a multi-camera object detector with synthetic data from chroma-keyed backgrounds.

3 Green Screen Augmentation

In this section, we provide a detailed introduction to GreenAug. The practical steps for GreenAug are as follows: (1) Green Screen Scene Setup; (2) GreenAug via Chroma Keying; (3) Training Robot Learning Policies. In the following sub-sections, we expand on each of these stages.

3.1 Green Screen Scene Setup

Figure 3: Physical steps for green-screen setup. Scene items can either be moved into the green screen, or the green screen can be brought to the scene.

The act of scene setup consists of obscuring the background (i.e. non-task relevant objects) with a green screen. There are several ways of achieving this, two of which are highlighted in Fig. 3 and described below. Once the scene has been set up, demonstration collection can begin.

Scene to Green Screen, where a permanent green screen area or room is established, and items can be moved into the green screen for data collection. This is the most common use case and includes tasks such as general pick-and-place, opening drawers, sweeping, pushing, etc.

Green Screen to Scene, where the green screen is brought to a fixed, unmovable object. Scenes that usually fall into this category are ones that require manipulating integrated or heavy objects, such as stacking dishwashers, opening ovens, and opening doors.

3.2 GreenAug via Chroma Keying

Chroma keying is a visual effects technique for layering two images or video streams together based on colour hues (chroma range). This technique is commonly used in video production and post-production to composite two frames or images together by removing a background colour (usually green or blue) from the foreground content, making it transparent. This allows for the insertion of a new background or visual element in place of the green or blue background. Many chroma key algorithms exist, but we opt for a simple algorithm proposed by Cannon [38]. Given the generated mask, several options are available for applying GreenAug. We provide three variants of GreenAug: Random (GreenAug-Rand), Generative (GreenAug-Gen) and Mask (GreenAug-Mask), illustrated in Fig. 2 and described in detail below.
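To make the masking step concrete, the following is a minimal sketch of a colour-dominance chroma key. It is not the algorithm of Cannon [38], whose details are not reproduced here; the function name green_screen_mask and the thresholds dominance and min_green are illustrative assumptions that would need tuning for a real camera and lighting setup.

```python
import numpy as np


def green_screen_mask(image, dominance=1.3, min_green=60):
    """Return a boolean (H, W) mask that is True where a pixel looks like green screen.

    image: uint8 array of shape (H, W, 3) in RGB order.
    dominance, min_green: hypothetical thresholds, not values from the paper.
    """
    rgb = image.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Key out a pixel when green is bright enough and clearly dominates red and blue.
    return (g >= min_green) & (g > dominance * r) & (g > dominance * b)


# Toy 2x2 frame: the top row is green screen, the bottom row is a red
# object and a grey gripper pixel.
frame = np.array(
    [[[20, 200, 30], [10, 180, 25]],
     [[200, 30, 30], [120, 120, 120]]],
    dtype=np.uint8,
)
mask = green_screen_mask(frame)
print(mask)
```

A hard threshold like this leaves jagged edges and ignores colour spill, which is one reason dedicated chroma key algorithms exist.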

GreenAug-Rand. This variant applies a fixed set of random textures to the chroma-keyed background. Following research in domain randomisation [25, 26, 27, 28, 29, 30, 31], increasing the variability of these textures helps the policy ignore the background and focus on task-specific items (objects manipulated by the policy).

GreenAug-Gen. This variant uses the chroma-keyed mask to inpaint realistic or imagined backgrounds using generative models like Stable Diffusion. Examples of prompts include: “photorealistic bedroom”, “photorealistic kitchen”, “photorealistic living room”. This method augments the image with semantic backgrounds, aiming to closely resemble real-world scenarios.

GreenAug-Mask. This variant uses a masking (soft segmentation) network trained to predict masks. These predicted masks are then applied to the image observations to obtain blacked-out, dark backgrounds. This simplification of the visual input potentially helps the visuomotor policies to focus on the main elements of interest by eliminating background noise and distractions. During training, the masking network processes images against chroma-keyed backgrounds with random textures (akin to GreenAug-Rand) and learns to predict the masks generated through chroma keying.
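Once a mask is available, GreenAug-Rand and the blacked-out output of GreenAug-Mask both reduce to the same compositing step with a different background. The sketch below illustrates this under stated assumptions: replace_background is a hypothetical helper, the texture bank is random noise standing in for a real texture dataset, and the mask is hand-made rather than produced by chroma keying or a masking network.

```python
import numpy as np


def replace_background(image, mask, background):
    """Keep foreground pixels from `image`; take masked pixels from `background`.

    image, background: uint8 arrays of shape (H, W, 3).
    mask: array of shape (H, W) with values in [0, 1], 1 where green screen was detected.
    """
    alpha = mask.astype(np.float32)[..., None]  # (H, W, 1), broadcasts over RGB
    out = alpha * background.astype(np.float32) + (1.0 - alpha) * image.astype(np.float32)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
frame = np.full((4, 4, 3), 128, dtype=np.uint8)  # stand-in for a camera frame
mask = np.zeros((4, 4), dtype=np.float32)
mask[:, :2] = 1.0  # pretend the left half of the frame is green screen

# GreenAug-Rand style: sample one texture from a (here synthetic) texture bank.
textures = rng.integers(0, 256, size=(5, 4, 4, 3), dtype=np.uint8)
texture = textures[rng.integers(len(textures))]
rand_frame = replace_background(frame, mask, texture)

# GreenAug-Mask style output: the background is blacked out instead of textured.
masked_frame = replace_background(frame, mask, np.zeros_like(frame))
```

Because the mask is treated as a soft alpha value, the same helper also accepts the soft segmentation output of a masking network.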

Table 1: Main experiment results averaged across three novel scenes. Each task-method combination is evaluated with 112 evaluation episodes on average. Full detailed results are provided in the Appendix.
Success Rate (%)
Task                     NoAug  CVAug  Generative Augmentation  GreenAug-Rand  GreenAug-Gen  GreenAug-Mask
Open Drawer              59     65     77                       96             87            79
Place Cube in Drawer     36     69     70                       92             83            37
Take Lid off Saucepan    67     81     77                       88             73            71
Sweep Coffee Beans       66     78     75                       96             81            84
Place Jeans in Basket    71     75     76                       87             77            67
Place Bear in Basket     45     63     61                       95             49            41
Stack Cups               49     59     77                       81             72            55
Slide Book and Pick Up   49     74     89                       93             93            35
Average                  55     70     75                       91             77            58

3.3 Training Robot Learning Policies

GreenAug can be applied to RGB-based robot learning methods. Similar to standard augmentation methods, images can be transformed with GreenAug and fed into policy networks during training, or they can be preprocessed offline and then used for training. Offline preprocessing is more common due to the longer computation time of some GreenAug variants. However, in online settings such as reinforcement learning, online transformations are also effective. GreenAug-Rand and GreenAug-Gen allow each raw frame from the training demonstrations to be augmented with different textures, significantly increasing the amount of preprocessed data. In contrast, GreenAug-Mask only masks the background and provides a single solution. To ensure a fair comparison, we keep the number of preprocessed frames equal to the number of raw frames for all methods.

In our main experiment (Section 4.3), we chose Action Chunking with Transformers (ACT) [4] as our control policy to demonstrate the effectiveness of this augmentation method. We selected ACT because of its recent success in adapting behaviours from a modest number of demonstrations, making it an ideal platform to showcase the benefits of GreenAug. Additionally, in Section 4.4, we demonstrate that GreenAug-Rand is also effective with a reinforcement learning policy.

4 Experiments

In this section, we present the experiments to evaluate the effectiveness of GreenAug on robot learning policies. Prior works [20, 39] have confirmed the effectiveness of background and texture randomisation in simulation. Since GreenAug focuses on real-world data augmentation, our experiments are conducted exclusively in the real world. We aim to study the following: (1) Does GreenAug improve visual generalisation to unseen scenes? (2) Which variant of GreenAug is the most effective, and what are the tradeoffs? (3) Is GreenAug applicable in different data collection settings? (4) Is GreenAug agnostic to robot embodiments and learning methods?

4.1 Baselines

We implement several baselines to compare with GreenAug, as described below.

No augmentation (NoAug). No visual augmentation.

Computer Vision augmentation (CVAug). Random photometric distortions and random shift.

Generative augmentation. Generative augmentation encompasses a broader range of methods such as CACTI [8], GenAug [9], and ROSIE [10]. CACTI uses Stable Diffusion for inpainting but does not detail the method for obtaining object masks. GenAug, on the other hand, is constrained to a tabletop setting. ROSIE relies on proprietary models and does not provide publicly available code. Thus, we have developed our own implementation that closely aligns with these methods. Our implementation is based on Grounding DINO [40] for open vocabulary object detection, Segment Anything [41] for zero-shot segmentation, and Stable Diffusion Turbo [12, 13] for inpainting, integrated with ControlNet [42] and conditioned on DPT-Hybrid [43] (monocular depth estimator) for better generation. Generative augmentation is similar to GreenAug-Gen, but it uses object detection and segmentation for mask creation instead of chroma keying. The pseudocode detailing this implementation is outlined in the Appendix.

4.2 Setup

Figure 4: Visualisations of train and test scenes.

For our main experiment, we designed eight tasks (illustrated in Fig. 7) and structured our experiments for each task as follows.

Data collection. We collected two sets of demonstrations, each consisting of 50 demos. One set was recorded against a green screen (Scene 1), and the other within a standard setting (Scene 2). All data were collected using a leader-follower teleoperation system, similar to ALOHA [4], but with 7-DoF Franka Panda arms and a Robotiq 2F-140 gripper on the follower. We used three RealSense D415 cameras, positioned at the upper wrist, lower wrist and left shoulder. The images are captured at a resolution of 240 (height) × 320 (width) pixels. For the main experiments alone, we collected over 800 demonstrations and conducted more than 6.6k evaluation runs. Additionally, we gathered about 50 more training demonstrations and 1.6k evaluations for the ablation and further studies in Section 4.4.

Training. We trained all baselines and our methods on both sets of data, except for GreenAug, which was excluded from Scene 2 as it relies on the green screen. Each data set corresponds to a separate policy. ACT is used as the control policy for our main experiments.

Evaluation. In addition to Scenes 1 and 2, we evaluated the methods in three novel scenes (Scenes 3–5). Initially, each method was assessed in Scene 1 to establish an upper-bound performance for the task. Subsequently, the methods were evaluated in Scenes 3–5 to test generalisation. For each combination of task, method, train scene (2) and test scene (3), we performed 25 evaluation runs.

Each scene is shown in Fig. 4. To focus on testing visual generalisation across different scenes, we maintained the positions and orientations of the objects (while applying the same degree of randomisation for one-to-one comparison) relative to the robot while moving between scenes.

Table 2: Processing time per RGB frame.

Method                   Time (↓)
Generative Augmentation  2.530 s
GreenAug-Rand            0.009 s
GreenAug-Gen             0.882 s

Table 3: GreenAug-Rand applied to RL.

                               Success Rate (%)
Train Scene   Test Scene       NoAug  GreenAug-Rand
Green Screen  1 Novel Scene    12     64

Table 4: GreenAug-Rand with different texture types averaged across tasks and novel scenes. Entropy signifies the amount of texture randomness.

Texture Type   Entropy (bits)  Success Rate (%)
None           -               48
Solid Colours  0.00            65
Perlin Noise   4.45            66
MIL Textures   6.81            87

Table 5: Object generalisation results. Policies trained on a green cup were tested on other objects. (n) specifies the number of objects tested in the category.

                 Success Rate (%)
Object Category  NoAug  GreenAug-Rand  GreenAug-Gen
Cups (3)         95     83             80
Cans (2)         38     46             40
Cubes (2)        0      32             52
Soft Toy (1)     0      84             72
Average          45     61             62

4.3 Results

Figure 5: Visualisations of raw and preprocessed frames (left shoulder and lower wrist camera views) of generative augmentation, GreenAug-Gen and GreenAug-Mask (during inference). Both generative methods struggle with producing good contextual wrist camera inpainting. In generative augmentation, the gripper is inpainted as part of the background, while GreenAug-Mask shows masking artefacts in novel scenes.

Table 1 presents our experimental findings. GreenAug-Rand outperforms every baseline on all eight tasks. Averaged across tasks, its success rate is approximately 65% higher than NoAug, around 29% higher than CVAug, and about 21% higher than generative augmentation, all measured as relative improvements.
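The quoted percentages are consistent with relative improvements in average success rate. As a sanity check, the sketch below recomputes them from the per-task rows of Table 1; the formula (new − old) / old is our reading of the numbers rather than a definition stated in the text, and the variable names are illustrative.

```python
# Per-task success rates (%) copied from Table 1, in table order.
table1 = {
    "NoAug": [59, 36, 67, 66, 71, 45, 49, 49],
    "CVAug": [65, 69, 81, 78, 75, 63, 59, 74],
    "Generative Augmentation": [77, 70, 77, 75, 76, 61, 77, 89],
    "GreenAug-Rand": [96, 92, 88, 96, 87, 95, 81, 93],
}

# Unrounded average success rate per method.
means = {method: sum(rates) / len(rates) for method, rates in table1.items()}


def relative_improvement(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


gains = {
    method: relative_improvement(means["GreenAug-Rand"], mean)
    for method, mean in means.items()
    if method != "GreenAug-Rand"
}
for method, gain in gains.items():
    print(f"GreenAug-Rand vs {method}: {gain:.1f}%")
# Prints 64.7%, 29.1% and 20.9%, which round to the quoted 65%, 29% and 21%.
```

In absolute terms the same gaps are 36, 21 and 16 percentage points, so it matters which convention a reader assumes.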

Surprisingly, GreenAug-Gen and generative augmentation rank second and third in performance respectively, despite using semantically meaningful backgrounds like living rooms or kitchens. As expected, both methods perform similarly, since they differ only in how they obtain background masks (chroma keying versus object detection and segmentation). This suggests that specific semantic content is not crucial for GreenAug’s success, as the variant using random backgrounds performs even better. This superior performance may have resulted from the wider variety of colours and textures offered by the random backgrounds.

Generative augmentation performs slightly worse than GreenAug-Gen, likely because it struggles to provide good masks in wrist camera views (illustrated in Fig. 5), which are essential for tasks requiring precise and stable visual input. Despite advancements in generative models, segmentation and inpainting from robot camera views remain suboptimal.

GreenAug-Mask is the least effective GreenAug variant, averaging only slightly above NoAug in Table 1 (58% versus 55%). Qualitative evaluations of the masked images reveal frequent failures to completely obscure backgrounds, especially in novel scenes (shown in Fig. 5). This issue stems from two main factors: the inherent imperfections in ground truth masks obtained from chroma keying and the compounding error from the masking network. The network’s imperfect masking further complicates the tasks, pushing the images into out-of-distribution states that challenge the control policy.

4.4 Ablation and Further Studies

Figure 6: (a) GreenAug-Rand performs the best when applied to all frames per trajectory. (b) Visual assessment of applying GreenAug to a single shade of green, in scenarios where multiple objects with varying shades of green are present. Masked objects have their backgrounds blacked out.

Based on the main experiments, we demonstrated that GreenAug-Rand outperforms all other methods. We then conducted the following in-depth analyses.

Benchmarking GreenAug’s speed. We conducted a benchmark to compare the processing speed of various methods, shown in Table 2. CVAug and GreenAug-Mask were excluded because the former is applied on the fly during training, and the latter performs poorly. We show that GreenAug-Rand is significantly faster than the two generative methods.

Applying GreenAug to a different robot with reinforcement learning. We investigated whether GreenAug can be applied to a different robot embodiment and learning method, beyond the Franka Panda and ACT. We set up a similar “take lid off saucepan” task on a UR5. We used a continuous demo-driven DQN variant [44, 45, 46, 47] with actions discretised into bins. The robot was provided with 24 demonstrations and was given a sparse reward of 0 for failure and 1 for success. We trained the robot online with 20 minutes of exploration on a green screen background and evaluated two policies, NoAug and GreenAug-Rand, in one novel scene. The results, shown in Table 3, demonstrate that GreenAug-Rand applied to reinforcement learning with a different robot performs significantly better than NoAug.
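For readers unfamiliar with discretised action spaces, the sketch below shows one common way to map a continuous action dimension to a bin index and back. It is only an illustration: the helper names, the action range of [-1, 1] and the bin count of 4 are hypothetical and do not come from the DQN variant used in this study.

```python
def discretise(value, low, high, num_bins):
    """Map a continuous value in [low, high] to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    # The upper boundary would otherwise fall into a non-existent extra bin.
    return min(index, num_bins - 1)


def bin_centre(index, low, high, num_bins):
    """Recover the continuous value at the centre of a bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


# Hypothetical example: one action dimension in [-1, 1] split into 4 bins.
index = discretise(0.1, low=-1.0, high=1.0, num_bins=4)
recovered = bin_centre(index, low=-1.0, high=1.0, num_bins=4)
print(index, recovered)
```

With such a scheme, a Q-network can output one value per bin for each action dimension, and the executed action is the centre of the highest-valued bin.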

Impact of texture randomness. We investigated how the texture randomness of GreenAug-Rand affects performance. We tested solid colours, Perlin noise (procedurally generated textures) [48], and MIL textures (used in the main experiments). All texture datasets are of the same size (5771). The evaluation was conducted on the “place cube in drawer” and “stack cups” tasks from the main experiment across three novel scenes (Scenes 3–5). Table 4 summarises the results. Consistent with domain randomisation studies [25, 26, 27, 28, 29, 31], greater texture randomness leads to better performance. Examples of each texture type are provided in the Appendix.
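The text does not spell out how texture entropy is computed. One plausible reading, sketched below as an assumption, is the Shannon entropy of a texture's pixel-intensity histogram; this reading at least agrees with the 0.00 bits reported for solid colours, since a single intensity carries no uncertainty. The function name and toy pixel lists are illustrative.

```python
import math
from collections import Counter


def shannon_entropy_bits(pixels):
    """Shannon entropy (in bits) of the empirical distribution of pixel values."""
    counts = Counter(pixels)
    total = len(pixels)
    return sum(
        (count / total) * math.log2(total / count)
        for count in counts.values()
    )


solid = [7] * 16              # a solid-colour texture: one intensity only
two_tone = [0] * 8 + [255] * 8  # two equally likely intensities
uniform = list(range(256))    # every 8-bit intensity exactly once

print(shannon_entropy_bits(solid))     # 0.0
print(shannon_entropy_bits(two_tone))  # 1.0
print(shannon_entropy_bits(uniform))   # 8.0
```

Under this measure an 8-bit greyscale texture is bounded by 8 bits, so the reported 6.81 bits for MIL textures would correspond to a fairly spread-out intensity distribution.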

Generalisation across object category. We assessed if GreenAug can be applied not just to backgrounds but also to different object categories. We set up a simple pick-and-place task. We first trained on a green cup and then tested on other visually different objects. The results, shown in Table 5, indicate that GreenAug-Gen performs best, with only a 1% difference from GreenAug-Rand. Both methods outperform NoAug by more than 35%. NoAug performs well on cups but fails with cubes and soft toys, and occasionally works with cans due to their similar geometric shapes to cups. GreenAug-Rand and GreenAug-Gen show better performance across different object categories, demonstrating some level of generalisation. However, performance with cups suffers slightly, likely due to the strong augmentation causing confusion about geometric shapes.
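The averages in Table 5 match a mean weighted by the number of objects per category, as the short check below shows. The weighting scheme is inferred from the numbers rather than stated in the text, and the variable names are illustrative.

```python
# Table 5 rows: category -> (number of objects, NoAug, GreenAug-Rand, GreenAug-Gen)
table5 = {
    "Cups": (3, 95, 83, 80),
    "Cans": (2, 38, 46, 40),
    "Cubes": (2, 0, 32, 52),
    "Soft Toy": (1, 0, 84, 72),
}
methods = ["NoAug", "GreenAug-Rand", "GreenAug-Gen"]

total_objects = sum(row[0] for row in table5.values())

# Weight each category's success rate by how many objects it contains.
averages = {
    method: sum(row[0] * row[i + 1] for row in table5.values()) / total_objects
    for i, method in enumerate(methods)
}
for method, average in averages.items():
    print(f"{method}: {average:.1f}%")
# Prints 45.1%, 61.1% and 62.0%, matching the reported 45, 61 and 62.

# Relative gain over NoAug, the sense in which "more than 35%" holds.
for method in methods[1:]:
    gain = 100.0 * (averages[method] - averages["NoAug"]) / averages["NoAug"]
    print(f"{method} vs NoAug: {gain:.1f}%")
```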

Green screen coverage. In real-world settings, some frames in the robot data may move away from the green screen during robot servoing. For example, if the green screen is only partially set up in the scene, the robot may observe parts of the scene not covered by the green screen. To emulate this scenario, we applied GreenAug-Rand to varying percentages of frames per episode. This was evaluated on the same tasks as the texture randomness study. The results are summarised in Fig. 6(a). As expected, the success rate increases with green screen coverage.
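A coverage study of this kind needs a rule for choosing which frames of an episode are augmented. The sketch below assumes uniform random selection without replacement, which is our assumption since the selection rule is not described; frames_to_augment and its parameters are hypothetical names.

```python
import random


def frames_to_augment(num_frames, coverage, seed=0):
    """Pick the frame indices of one episode that receive GreenAug.

    coverage: fraction of frames in [0, 1] to augment.
    Selection is uniform at random, which is an assumption of this sketch.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    count = round(coverage * num_frames)
    rng = random.Random(seed)  # seeded so that preprocessing is reproducible
    return sorted(rng.sample(range(num_frames), count))


# Example: augment a quarter of the frames in a 100-frame episode.
selected = frames_to_augment(num_frames=100, coverage=0.25)
print(len(selected), selected[:5])
```

A contiguous block of unaugmented frames would arguably mimic a camera drifting off the green screen more closely, so the choice of selection rule could itself affect the result.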

Presence of multiple green objects. Green screens could affect scenes when there are multiple green objects. We evaluated the sensitivity of chroma keying under these conditions, a challenge also encountered in the film industry. This study questions whether chroma keying can effectively isolate one green object without impacting others. We conducted a visual assessment (shown in Fig. 6(b)) and showed that we can augment only one object at a time while leaving the others unchanged. Alternatively, one can also use a different colour such as blue (along with a green background) for chroma-keying objects.

5 Conclusion and Limitations

Figure 7: Visualisations of real-world tasks. The trajectory sequences are stacked horizontally, starting with the initial positions labelled as #0.

This paper proposes and investigates the efficacy of GreenAug in robotic manipulation across a variety of real-world scenarios. We have demonstrated that GreenAug not only works effectively across different tasks but also surpasses other augmentation methods in performance while maintaining simplicity. GreenAug outperforms NoAug by approximately 65%, CVAug by 29% and generative augmentation by about 21%. Our findings advocate for a paradigm shift in data collection practices for robot learning. We propose the use of green screens for future real-world demonstrations. Implementing GreenAug could significantly improve policy generalisation across novel locations, effectively addressing scene generalisation limitations currently faced in the field.

While GreenAug proves to be useful, several challenges remain that we have outlined for future research. GreenAug is effective for background generalisation and to an extent, object generalisation (as shown in further studies), but it falls short when it comes to adapting to objects with very different geometric shapes. This type of generalisation involves changing the dynamics and trajectories of the demonstrations, such as accommodating different mugs with unique handles that require specific grasping points. Furthermore, GreenAug could be complementary to generative augmentation. This combination could help train world models capable of producing imaginary trajectories that generalise across diverse objects and appliances.

Acknowledgments

Big thanks to the members of the Dyson Robot Learning Lab for discussions and infrastructure help: Nic Backshall, Iain Haughton, Younggyo Seo, Sridhar Sola, Jafar Uruc, Yunfan Lu, Abdi Abdinur, Nikita Chernyadev.

References

  • Yarats et al. [2020] D. Yarats, I. Kostrikov, and R. Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International conference on learning representations, 2020.
  • Yarats et al. [2021] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021.
  • Shridhar et al. [2023] M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.
  • Zhao et al. [2023] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016.
  • Chi et al. [2023] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. C. Burchfiel, and S. Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.026.
  • Ma et al. [2024] X. Ma, S. Patidar, I. Haughton, and S. James. Hierarchical diffusion policy for kinematics-aware multi-task robotic manipulation. arXiv preprint arXiv:2403.03890, 2024.
  • Vosylius et al. [2024] V. Vosylius, Y. Seo, J. Uruç, and S. James. Render and diffuse: Aligning image and action spaces for diffusion-based behaviour cloning. arXiv preprint arXiv:2405.18196, 2024.
  • Mandi et al. [2022] Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning. arXiv preprint arXiv:2212.05711, 2022.
  • Chen et al. [2023] Q. Chen, S. C. Kiami, A. Gupta, and V. Kumar. GenAug: Retargeting behaviors to unseen situations via Generative Augmentation. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.010.
  • Yu et al. [2023] T. Yu, T. Xiao, J. Tompson, A. Stone, S. Wang, A. Brohan, J. Singh, C. Tan, D. M, J. Peralta, K. Hausman, B. Ichter, and F. Xia. Scaling Robot Learning with Semantically Imagined Experience. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.027.
  • Sohl-Dickstein et al. [2015] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • Rombach et al. [2022] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Sauer et al. [2023] A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
  • Smith and Blinn [1996] A. R. Smith and J. F. Blinn. Blue screen matting. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 259–268, 1996.
  • Grundhöfer et al. [2010] A. Grundhöfer, D. Kurz, S. Thiele, and O. Bimber. Color invariant chroma keying and color spill neutralization for dynamic scenes and cameras. The Visual Computer, 26:1167–1176, 2010.
  • Foster [2014] J. Foster. The green screen handbook: real-world production techniques. Routledge, 2014.
  • Aksoy et al. [2016] Y. Aksoy, T. O. Aydin, M. Pollefeys, and A. Smolić. Interactive high-quality green-screen keying via color unmixing. ACM Transactions on Graphics (TOG), 36(4):1, 2016.
  • Smirnov et al. [2023] D. Smirnov, C. LeGendre, X. Yu, and P. Debevec. Magenta green screen: Spectrally multiplexed alpha matting with deep colorization. In Proceedings of the Digital Production Symposium, pages 1–13, 2023.
  • Young et al. [2021] S. Young, D. Gandhi, S. Tulsiani, A. Gupta, P. Abbeel, and L. Pinto. Visual imitation made easy. In Conference on Robot Learning, pages 1992–2005. PMLR, 2021.
  • Xie et al. [2023] A. Xie, L. Lee, T. Xiao, and C. Finn. Decomposing the generalization gap in imitation learning for visual robotic manipulation. arXiv preprint arXiv:2307.03659, 2023.
  • Laskin et al. [2020] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020.
  • Hansen et al. [2021] N. Hansen, H. Su, and X. Wang. Stabilizing deep q-learning with convnets and vision transformers under data augmentation. Advances in neural information processing systems, 34:3680–3693, 2021.
  • Hansen and Wang [2021] N. Hansen and X. Wang. Generalization in reinforcement learning by soft data augmentation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13611–13617. IEEE, 2021.
  • Almuzairee et al. [2024] A. Almuzairee, N. Hansen, and H. I. Christensen. A recipe for unbounded data augmentation in visual reinforcement learning. arXiv preprint arXiv:2405.17416, 2024.
  • Sadeghi and Levine [2017] F. Sadeghi and S. Levine. Cad2rl: Real single-image flight without a single real image. In Proceedings of Robotics: Science and Systems, Cambridge, Massachusetts, July 2017. doi:10.15607/RSS.2017.XIII.034.
  • Tobin et al. [2017] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
  • James et al. [2017] S. James, A. J. Davison, and E. Johns. Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. In Conference on Robot Learning, pages 334–343. PMLR, 2017.
  • Matas et al. [2018] J. Matas, S. James, and A. J. Davison. Sim-to-real reinforcement learning for deformable object manipulation. Conference on Robot Learning, 2018.
  • James et al. [2019] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12627–12637, 2019.
  • Alghonaim and Johns [2021] R. Alghonaim and E. Johns. Benchmarking domain randomisation for visual sim-to-real transfer. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 12802–12808. IEEE, 2021.
  • So et al. [2022] J. So, A. Xie, S. Jung, J. Edlund, R. Thakker, A. Agha-mohammadi, P. Abbeel, and S. James. Sim-to-real via sim-to-seg: End-to-end off-road autonomous driving without real data. In Conference on Robot Learning, 2022.
  • Xu et al. [2017] N. Xu, B. Price, S. Cohen, and T. Huang. Deep image matting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2970–2979, 2017.
  • Sengupta et al. [2020] S. Sengupta, V. Jayaram, B. Curless, S. M. Seitz, and I. Kemelmacher-Shlizerman. Background matting: The world is your green screen. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2291–2300, 2020.
  • Lin et al. [2021] S. Lin, A. Ryabtsev, S. Sengupta, B. L. Curless, S. M. Seitz, and I. Kemelmacher-Shlizerman. Real-time high-resolution background matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8762–8771, 2021.
  • Lin et al. [2022] S. Lin, L. Yang, I. Saleemi, and S. Sengupta. Robust high-resolution video matting with temporal guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 238–247, 2022.
  • Schülein et al. [2023] P. Schülein, H. Teufel, R. Vorpahl, I. Emter, Y. Bukschat, M. Pfister, N. Rathmann, S. Diehl, and M. Vetter. Comparison of synthetic dataset generation methods for medical intervention rooms using medical clothing detection as an example. EURASIP Journal on Image and Video Processing, 2023(1):12, 2023.
  • Coates and Ng [2010] A. Coates and A. Y. Ng. Multi-camera object detection for robotics. In 2010 IEEE International conference on robotics and automation, pages 412–419. IEEE, 2010.
  • Cannon [2011] E. Cannon. Greenscreen code and hints. http://gc-films.com/chromakey.html, 2011. [Accessed 15-01-2024].
  • Pumacay et al. [2024] W. Pumacay, I. Singh, J. Duan, R. Krishna, J. Thomason, and D. Fox. The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191, 2024.
  • Liu et al. [2023] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  • Kirillov et al. [2023] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • Zhang et al. [2023] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • Ranftl et al. [2021] R. Ranftl, A. Bochkovskiy, and V. Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
  • Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Seyde et al. [2022] T. Seyde, P. Werner, W. Schwarting, I. Gilitschenski, M. Riedmiller, D. Rus, and M. Wulfmeier. Solving continuous control via Q-learning. arXiv preprint arXiv:2210.12566, 2022.
  • Ball et al. [2023] P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.
  • Farebrother et al. [2024] J. Farebrother, J. Orbay, Q. Vuong, A. A. Taïga, Y. Chebotar, T. Xiao, A. Irpan, S. Levine, P. S. Castro, A. Faust, et al. Stop regressing: Training value functions via classification for scalable deep RL. arXiv preprint arXiv:2403.03950, 2024.
  • Perlin [1985] K. Perlin. An image synthesizer. ACM Siggraph Computer Graphics, 19(3):287–296, 1985.
  • Finn et al. [2017] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. In Conference on robot learning, pages 357–368. PMLR, 2017.
  • Ronneberger et al. [2015] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • Iakubovskii [2019] P. Iakubovskii. Segmentation models pytorch. https://github.com/qubvel/segmentation_models.pytorch, 2019.
  • Wright [2017] S. Wright. Digital compositing for film and video: Production Workflows and Techniques. Routledge, 2017.
  • Li et al. [2021] H. Li, W. Zhu, H. Jin, and Y. Ma. Automatic, illumination-invariant and real-time green-screen keying using deeply guided linear models. Symmetry, 13(8):1454, 2021.
  • Jin et al. [2022] Y. Jin, Z. Li, D. Zhu, M. Shi, and Z. Wang. Automatic and real-time green screen keying. The Visual Computer, 38(9):3135–3147, 2022.
  • James et al. [2022] S. James, K. Wada, T. Laidlow, and A. J. Davison. Coarse-to-fine Q-attention: Efficient learning for visual robotic manipulation via discretisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13739–13748, 2022.

Appendix A Experiment Setups

In this section, we provide the detailed setups of our real-robot experiments to help reproduce the results.

Robot Setup. The robot setup consists of a 7-DoF Franka Emika Panda arm equipped with a Robotiq 2F-140 gripper. We use three RealSense D415 cameras: two cameras mounted on the end-effector (lower wrist, upper wrist) for a wide field-of-view, and one camera (left shoulder) fixed on the base, as depicted in Fig. 8(a).

Data collection. We gather demonstrations for our tasks utilising a leader-follower setup similar to ALOHA [4]. An expert human demonstrator moves the Leader arm, and the Follower arm mirrors the Leader’s joint positions, as shown in Fig. 8(b). Camera and robot state observations are recorded at 30 FPS.

Tasks. For each task, we collect 50 demonstrations in each of two scenes: the green screen room and the living room. Fig. 11 shows the task definitions, with sketches illustrating the setup, its measurements, and the randomisation used. For all tasks, the initial robot joint positions are [0.0, -0.785, 0.0, -2.356, 0.0, 1.571, 0.0].
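
The initial joint positions are presumably given in radians. Under that assumption, they correspond to multiples of π/4, which the short check below illustrates. The variable names are illustrative and do not come from the paper.

```python
import numpy as np

# Initial joint positions from the text (assumed to be in radians).
HOME_JOINTS = np.array([0.0, -0.785, 0.0, -2.356, 0.0, 1.571, 0.0])

# The same configuration expressed as exact multiples of pi/4.
ideal = np.array([0.0, -np.pi / 4, 0.0, -3 * np.pi / 4, 0.0, np.pi / 2, 0.0])

# The listed values agree with the exact ones to three decimal places.
matches = bool(np.allclose(HOME_JOINTS, ideal, atol=1e-3))

# In degrees: 0, -45, 0, -135, 0, 90, 0.
home_degrees = np.rad2deg(ideal)
```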

(a) Robot setup. A Franka Panda arm is mounted on a Vention base with three RealSense cameras. The robot is 23 cm above the ground. The robot pose is represented in the base frame, $F_R$.
(b) Leader and follower robot setup.
(a) Open Drawer: The robot base ($F_R$) is uniformly randomised inside the 10 cm × 10 cm region. The gripper slides into the small drawer opening and then pulls the drawer open. In total, 50 demonstrations are collected with an average demo length of 169 steps or 13 secs.
(b) Place Cube in Drawer: The robot base ($F_R$) is fixed relative to the drawer. The cube is randomised on the drawer top within the 32 cm × 10 cm region. The robot picks up the cube and places it inside the opened drawer. A total of 50 demonstrations are collected with an average demo length of 250 steps or 19 secs.
(c) Sweep Coffee Beans: The robot base ($F_R$) is fixed relative to the drawer. The drawer used is the same as in previous tasks but rotated by 90°. We stick two strips of black tape on the drawer top. The coffee beans are randomised in the 24 cm × 24 cm region between the two tapes. The sponge position is randomised along tape B (24 cm). The robot grasps the sponge and sweeps the coffee beans to the left of tape A. A total of 50 demonstrations are collected with an average demo length of 314 steps or 24 secs. For evaluations, a trial is considered successful if 90% of the beans are swept. We use 20 beans, so at least 18 beans need to be swept for success.
(a) Take Lid off Saucepan: The robot base ($F_R$) is fixed relative to the drawer. The saucepan is randomised in the L-shaped region. The robot grasps the lid of the saucepan and always places it in the orange area. In total, 50 demonstrations were collected with an average demo length of 857 steps.
(b) Place Jeans in Basket: The robot base ($F_R$) is fixed relative to the laundry basket. The jeans are semi-folded and hanging on the right edge of the chair. The chair position is randomised in the 20 cm × 20 cm region. The robot grasps the jeans from the side and places them in the laundry basket. We collected 50 demonstrations: in half of them the chair position is randomised with the chair in orientation A, and in the other half with the chair in orientation B. The average demo length is 592 steps or 44 secs.
(c) Place Bear in Basket: The robot base ($F_R$) is fixed and the bear (toy) position is randomised in the 46 cm × 30 cm region on the ground. For half of the demos, the randomisation is done in orientation A and for the other half in orientation B. The robot first picks up the toy and then places it in the basket nearby. We collect 50 demonstrations in total with an average demo length of 741 steps or 56 secs.
(a) Stack Cups: The robot base ($F_R$) is fixed relative to the drawer. The orange cup is randomised in the 10 cm × 32 cm region, whereas the blue cup is randomised along the blue line (32 cm). The robot first picks up the orange cup and stacks it on the blue cup. In total, 50 demonstrations are collected with an average demo length of 590 steps or 44 secs.
(b) Slide Book and Pick Up: The robot base ($F_R$) is fixed relative to the black coffee table. The book position is randomised in the 12 cm × 12 cm region. In half of the demonstrations, the book is kept in orientation A, and in the other half, in orientation B. The robot first corrects the book orientation if necessary by pushing on its edge and then slides the book to the edge of the table. It then picks it up and places it in the area depicted in orange (rightmost figure). We collect 50 demonstrations in total with an average demo length of 930 steps or 70 secs.
(c) Place Cup in Drawer (Object Generalisation): The robot base ($F_R$) is fixed and the green cup position is randomised in the 20 cm × 32 cm region on the drawer top. The robot first picks up the cup and then places it in the drawer. We collect 50 demonstrations in total with an average demo length of 597 steps or 45 secs.
Figure 11: Task definitions with randomisation. We illustrate the 8 main tasks and 1 ablation task together with the randomisation used. We use the images from the left shoulder camera (top row) and the lower wrist camera (bottom row) to describe each task sequence. Note that the sketches on the right are not drawn to scale.
Figure 12: Reinforcement learning experiment setup (“Take the lid off saucepan”). The illustration shows train and test scenes and the task sequence. In the train scene, a green screen cloth covers the table. In the test scene, the cloth is removed, and distractors are added around the table. UR5 robots are used for this experiment, with only upper and lower wrist cameras.

Appendix B More Visualisations

Figure 13: Visual observations of GreenAug-Rand with MIL textures [49].
Figure 14: Visual observations of GreenAug-Rand with solid & Perlin textures.
Figure 15: Visual observations of GreenAug-Gen.
Figure 16: Visual observations of GreenAug-Mask during inference.
Figure 17: Visuals of three novel scenes used during evaluations for each task in respective order. These include a subset of kitchen, washing, study, and living rooms.

Appendix C Compute and Hyperparameter Details

We perform the preprocessing and model training using NVIDIA L4 GPUs (24GB VRAM).

ACT. We use the same implementation of ACT as described in the original paper, with the following changes to hyperparameters: action chunking size is set to 20, the number of epochs is 5000, and we sample 16 transitions per epoch. Unlike the original ACT implementation, which samples one transition per episode per epoch, we sample multiple transitions.
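
The sampling change described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the 16 transitions are drawn per episode, and that chunks running past the end of an episode are zero-padded and flagged with a padding mask. The function name, the 8-dimensional dummy actions, and the 169-step dummy episode are placeholders.

```python
import numpy as np

CHUNK_SIZE = 20           # action chunking size from the text
SAMPLES_PER_EPISODE = 16  # assumed reading of "16 transitions per epoch"


def sample_chunks(actions, rng, n_samples=SAMPLES_PER_EPISODE, chunk=CHUNK_SIZE):
    """Draw n_samples start steps from one episode and return action chunks.

    actions: array of shape (T, action_dim) for a single demonstration.
    Returns (starts, chunks, is_pad); chunks has shape
    (n_samples, chunk, action_dim) and is_pad marks steps past the episode end.
    """
    T, dim = actions.shape
    starts = rng.integers(0, T, size=n_samples)
    chunks = np.zeros((n_samples, chunk, dim), dtype=actions.dtype)
    is_pad = np.ones((n_samples, chunk), dtype=bool)
    for i, s in enumerate(starts):
        n_valid = min(chunk, T - s)  # steps available before the episode ends
        chunks[i, :n_valid] = actions[s:s + n_valid]
        is_pad[i, :n_valid] = False
    return starts, chunks, is_pad


rng = np.random.default_rng(0)
demo = np.arange(169 * 8, dtype=np.float32).reshape(169, 8)  # dummy episode
starts, chunks, is_pad = sample_chunks(demo, rng)
```

Drawing several start steps per episode exposes the policy to more of each demonstration per epoch than a single draw would.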

Table 6: Pre-processing hyperparameters for each task. Chroma key parameters are the Key Colour ($K$) as a hexadecimal colour code, tola ($\alpha$), and tolb ($\beta$). The Detection Text Prompt is used for generative augmentation. The Inpaint Text Prompt is used for background generation in both generative augmentation and GreenAug-Gen.

| Task | Key Colour ($K$) | tola ($\alpha$) | tolb ($\beta$) | Detection Text Prompt | Inpaint Text Prompt |
|---|---|---|---|---|---|
| Open Drawer | #439f82 | 30 | 35 | drawer. robot arm. robot gripper. | photorealistic kitchen, study room, washing room, living room, or bedroom |
| Place Cube in Drawer | #25806f | 35 | 40 | red cube. drawer. robot arm. robot gripper. | photorealistic kitchen, study room, washing room, living room, or bedroom |
| Sweep Coffee Beans | #1d6953 | 23 | 30 | sponge. coffee beans. black tapes. drawer. robot arm. robot gripper. | photorealistic kitchen, study room, washing room, living room, or bedroom |
| Take Lid off Saucepan | #348367 | 15 | 25 | saucepan. drawer. robot arm. robot gripper. | photorealistic kitchen, study room, washing room, living room, or bedroom |
| Place Jeans in Basket | #25806f | 30 | 40 | jeans. chair. robot arm. robot gripper. | photorealistic kitchen, study room, washing room, living room, or bedroom |
| Place Bear in Basket | #25806f | 30 | 30 | soft toy. basket. robot arm. robot gripper. | photorealistic kitchen, study room, washing room, living room, or bedroom |
| Stack Cups | #348367 | 15 | 25 | blue cup. orange cup. drawer. robot arm. robot gripper. | photorealistic kitchen, study room, washing room, living room, or bedroom |
| Slide Book and Pick Up | #25806f | 20 | 30 | book. table. robot arm. robot gripper. | photorealistic kitchen, study room, washing room, living room, or bedroom |
| Object Generalisation | #699230 | 30 | 20 | green cup. drawer. robot arm. robot gripper. | colourful cup, bowl, cube, toy, can, bottle or general graspable object |
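
To make the roles of $K$, tola, and tolb concrete, the following is a minimal sketch of a tolerance-based chroma key in the style of the code referenced in [38]. It rests on several assumptions that are not stated in this appendix: the distance to the key colour is measured in the CbCr plane using full-range BT.601 coefficients, pixels closer than tola are treated as green screen, pixels farther than tolb as foreground, and values in between are blended linearly. All function names are illustrative.

```python
import numpy as np


def hex_to_rgb(hex_code):
    """Convert '#rrggbb' to an array of three floats in [0, 255]."""
    h = hex_code.lstrip("#")
    return np.array([int(h[i:i + 2], 16) for i in (0, 2, 4)], dtype=np.float64)


def rgb_to_cbcr(rgb):
    """Full-range BT.601 chroma (Cb, Cr) for an array with a trailing RGB axis."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([cb, cr], axis=-1)


def foreground_mask(image, key_hex, tol_a, tol_b):
    """Soft mask: 0 = green screen, 1 = foreground, linear ramp in between."""
    key = rgb_to_cbcr(hex_to_rgb(key_hex))
    dist = np.linalg.norm(rgb_to_cbcr(image.astype(np.float64)) - key, axis=-1)
    span = (tol_b - tol_a) if tol_b != tol_a else 1.0  # avoid division by zero
    ramp = (dist - tol_a) / span
    return np.where(dist < tol_a, 0.0, np.where(dist < tol_b, ramp, 1.0))


def green_aug(image, texture, key_hex, tol_a, tol_b):
    """Overlay `texture` wherever the chroma key mask marks green screen."""
    m = foreground_mask(image, key_hex, tol_a, tol_b)[..., None]
    out = m * image + (1.0 - m) * texture
    return out.astype(np.uint8)


# Toy 1x2 image: one key-coloured pixel and one red pixel ("Open Drawer" row).
img = np.array([[[0x43, 0x9F, 0x82], [200, 30, 30]]], dtype=np.uint8)
tex = np.full_like(img, 7)
aug = green_aug(img, tex, "#439f82", tol_a=30, tol_b=35)
```

In this toy example the key-coloured pixel is replaced by the texture while the red pixel is left untouched. When tola is not smaller than tolb, as in the Object Generalisation row, the ramp is never used and the sketch reduces to a hard threshold at tola.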

GreenAug-Mask U-Net. We use the original U-Net architecture [50] (implemented by Iakubovskii [51]) for the masking network used in GreenAug-Mask. The model comprises 14.3 million parameters.

Table 7: Masking network hyperparameters for GreenAug-Mask.

| Hyperparameter | Value |
|---|---|
| Model | U-Net |
| Encoder | ResNet18 |
| Encoder Weights | ImageNet |
| Epochs | 100 |
| Batch size | 128 |
| Image size | 224 × 224 |
| Seed | 42 |
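
As a rough guide, the Table 7 settings could map onto the segmentation_models.pytorch library [51] as sketched below. The `in_channels` and `classes` values are assumptions (RGB input, single-channel mask output), and the loss, optimiser, and data pipeline are not specified in this appendix, so they are omitted. The library is imported lazily so that the configuration can be inspected without it installed.

```python
# Keyword arguments mirroring Table 7; in_channels and classes are assumptions
# (RGB input, single-channel mask output).
UNET_KWARGS = dict(
    encoder_name="resnet18",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
)

# Remaining training settings from Table 7.
TRAIN_CONFIG = dict(epochs=100, batch_size=128, image_size=(224, 224), seed=42)


def build_mask_unet():
    """Build the masking network; requires segmentation_models_pytorch."""
    import segmentation_models_pytorch as smp  # imported lazily

    return smp.Unet(**UNET_KWARGS)
```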

Appendix D Detailed Results

This section presents the full unaggregated results.

Table 8: Full experiment results. “Green Screen→Green Screen” roughly represents the upper bound performance (in parentheses) and is not included in the average. Full unaggregated results for each task are in Tables 9 to 16.
Success Rate (%)

| Task | Train Scene | Test Scene | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask |
|---|---|---|---|---|---|---|---|---|
| Open Drawer | Green Screen | Green Screen | (100) | (88) | (96) | (100) | (100) | (100) |
| Open Drawer | Living Room | 3 Novel Scenes | 63 | 51 | 57 | - | - | - |
| Open Drawer | Green Screen | 3 Novel Scenes | 55 | 79 | 96 | 96 | 87 | 79 |
| Place Cube in Drawer | Green Screen | Green Screen | (92) | (96) | (72) | (100) | (84) | (96) |
| Place Cube in Drawer | Living Room | 3 Novel Scenes | 33 | 64 | 68 | - | - | - |
| Place Cube in Drawer | Green Screen | 3 Novel Scenes | 39 | 73 | 72 | 92 | 83 | 37 |
| Sweep Coffee Beans | Green Screen | Green Screen | (100) | (96) | (88) | (96) | (80) | (92) |
| Sweep Coffee Beans | Living Room | 3 Novel Scenes | 55 | 79 | 73 | - | - | - |
| Sweep Coffee Beans | Green Screen | 3 Novel Scenes | 77 | 77 | 77 | 96 | 81 | 84 |
| Take Lid off Saucepan | Green Screen | Green Screen | (96) | (84) | (92) | (80) | (80) | (84) |
| Take Lid off Saucepan | Living Room | 3 Novel Scenes | 61 | 79 | 72 | - | - | - |
| Take Lid off Saucepan | Green Screen | 3 Novel Scenes | 73 | 83 | 83 | 88 | 73 | 71 |
| Place Jeans in Basket | Green Screen | Green Screen | (100) | (100) | (92) | (100) | (100) | (92) |
| Place Jeans in Basket | Living Room | 3 Novel Scenes | 69 | 73 | 75 | - | - | - |
| Place Jeans in Basket | Green Screen | 3 Novel Scenes | 72 | 77 | 77 | 87 | 77 | 67 |
| Place Bear in Basket | Green Screen | Green Screen | (100) | (100) | (100) | (100) | (96) | (100) |
| Place Bear in Basket | Living Room | 3 Novel Scenes | 35 | 45 | 41 | - | - | - |
| Place Bear in Basket | Green Screen | 3 Novel Scenes | 55 | 80 | 81 | 95 | 49 | 41 |
| Stack Cups | Green Screen | Green Screen | (76) | (84) | (84) | (88) | (80) | (80) |
| Stack Cups | Living Room | 3 Novel Scenes | 41 | 57 | 75 | - | - | - |
| Stack Cups | Green Screen | 3 Novel Scenes | 57 | 61 | 80 | 81 | 72 | 55 |
| Slide Book and Pick Up | Green Screen | Green Screen | (100) | (100) | (100) | (100) | (100) | (100) |
| Slide Book and Pick Up | Living Room | 3 Novel Scenes | 43 | 61 | 89 | - | - | - |
| Slide Book and Pick Up | Green Screen | 3 Novel Scenes | 55 | 87 | 89 | 93 | 93 | 35 |
| Average | | | 55 | 70 | 75 | 91 | 77 | 58 |
Table 9: “Open Drawer” task unaggregated results.
Success Rate (%)

| Train Scene | Test Scene | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask |
|---|---|---|---|---|---|---|---|
| Green Screen | Green Screen | (100) | (88) | (96) | (100) | (100) | (100) |
| Living Room | Novel Scene 1 | 60 | 48 | 68 | - | - | - |
| Living Room | Novel Scene 2 | 68 | 64 | 64 | - | - | - |
| Living Room | Novel Scene 3 | 60 | 40 | 40 | - | - | - |
| Green Screen | Novel Scene 1 | 52 | 88 | 88 | 100 | 84 | 76 |
| Green Screen | Novel Scene 2 | 72 | 80 | 100 | 100 | 96 | 80 |
| Green Screen | Novel Scene 3 | 40 | 68 | 100 | 88 | 80 | 80 |
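
The “3 Novel Scenes” entries in Table 8 appear to be the mean of the three per-scene success rates, rounded to the nearest integer. The short check below illustrates this for the Green Screen rows of the “Open Drawer” task. The dictionary layout is illustrative.

```python
# Per-scene success rates (%) from Table 9, Green Screen training scene.
open_drawer = {
    "NoAug": [52, 72, 40],
    "CVAug": [88, 80, 68],
    "Generative Augmentation": [88, 100, 100],
    "GreenAug-Rand": [100, 100, 88],
    "GreenAug-Gen": [84, 96, 80],
    "GreenAug-Mask": [76, 80, 80],
}

# Mean over the three novel scenes, rounded to the nearest integer.
aggregated = {name: round(sum(v) / len(v)) for name, v in open_drawer.items()}
```

The resulting values (55, 79, 96, 96, 87, 79) match the corresponding “Open Drawer” row of Table 8.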
Table 10: “Place Cube in Drawer” task unaggregated results.
Success Rate (%)

| Train Scene | Test Scene | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask |
|---|---|---|---|---|---|---|---|
| Green Screen | Green Screen | (92) | (96) | (72) | (100) | (84) | (96) |
| Living Room | Novel Scene 1 | 68 | 88 | 72 | - | - | - |
| Living Room | Novel Scene 2 | 0 | 32 | 64 | - | - | - |
| Living Room | Novel Scene 3 | 32 | 72 | 68 | - | - | - |
| Green Screen | Novel Scene 1 | 36 | 92 | 76 | 92 | 92 | 48 |
| Green Screen | Novel Scene 2 | 56 | 68 | 64 | 96 | 92 | 48 |
| Green Screen | Novel Scene 3 | 24 | 60 | 76 | 88 | 64 | 16 |
Table 11: “Sweep Coffee Beans” task unaggregated results.
Success Rate (%)

| Train Scene | Test Scene | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask |
|---|---|---|---|---|---|---|---|
| Green Screen | Green Screen | (100) | (96) | (88) | (96) | (80) | (92) |
| Living Room | Novel Scene 1 | 60 | 96 | 88 | - | - | - |
| Living Room | Novel Scene 2 | 52 | 60 | 88 | - | - | - |
| Living Room | Novel Scene 3 | 52 | 80 | 44 | - | - | - |
| Green Screen | Novel Scene 1 | 92 | 80 | 88 | 100 | 92 | 96 |
| Green Screen | Novel Scene 2 | 76 | 80 | 88 | 96 | 80 | 76 |
| Green Screen | Novel Scene 3 | 64 | 72 | 56 | 92 | 72 | 80 |
Table 12: “Take Lid off Saucepan” task unaggregated results.
Success Rate (%)

| Train Scene | Test Scene | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask |
|---|---|---|---|---|---|---|---|
| Green Screen | Green Screen | (96) | (84) | (92) | (80) | (80) | (84) |
| Living Room | Novel Scene 1 | 64 | 76 | 76 | - | - | - |
| Living Room | Novel Scene 2 | 68 | 84 | 68 | - | - | - |
| Living Room | Novel Scene 3 | 52 | 76 | 72 | - | - | - |
| Green Screen | Novel Scene 1 | 64 | 80 | 80 | 88 | 76 | 68 |
| Green Screen | Novel Scene 2 | 96 | 96 | 92 | 84 | 76 | 80 |
| Green Screen | Novel Scene 3 | 60 | 72 | 76 | 92 | 68 | 64 |
Table 13: “Place Jeans in Basket” task unaggregated results.
Success Rate (%)

| Train Scene | Test Scene | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask |
|---|---|---|---|---|---|---|---|
| Green Screen | Green Screen | (100) | (100) | (92) | (100) | (100) | (92) |
| Living Room | Novel Scene 1 | 64 | 64 | 68 | - | - | - |
| Living Room | Novel Scene 2 | 76 | 76 | 76 | - | - | - |
| Living Room | Novel Scene 3 | 68 | 80 | 80 | - | - | - |
| Green Screen | Novel Scene 1 | 68 | 76 | 76 | 84 | 72 | 64 |
| Green Screen | Novel Scene 2 | 72 | 80 | 76 | 80 | 80 | 72 |
| Green Screen | Novel Scene 3 | 76 | 76 | 80 | 96 | 80 | 64 |
Table 14: “Place Bear in Basket” task unaggregated results.
Success Rate (%)

| Train Scene | Test Scene | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask |
|---|---|---|---|---|---|---|---|
| Green Screen | Green Screen | (100) | (100) | (100) | (100) | (96) | (100) |
| Living Room | Novel Scene 1 | 32 | 24 | 32 | - | - | - |
| Living Room | Novel Scene 2 | 36 | 24 | 12 | - | - | - |
| Living Room | Novel Scene 3 | 36 | 88 | 80 | - | - | - |
| Green Screen | Novel Scene 1 | 56 | 92 | 80 | 100 | 44 | 32 |
| Green Screen | Novel Scene 2 | 28 | 52 | 68 | 84 | 12 | 24 |
| Green Screen | Novel Scene 3 | 80 | 96 | 96 | 100 | 92 | 68 |
Table 15: “Stack Cups” task unaggregated results.
Success Rate (%)

| Train Scene | Test Scene | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask |
|---|---|---|---|---|---|---|---|
| Green Screen | Green Screen | (76) | (84) | (84) | (88) | (80) | (80) |
| Living Room | Novel Scene 1 | 44 | 52 | 64 | - | - | - |
| Living Room | Novel Scene 2 | 40 | 60 | 76 | - | - | - |
| Living Room | Novel Scene 3 | 40 | 60 | 84 | - | - | - |
| Green Screen | Novel Scene 1 | 56 | 44 | 76 | 72 | 60 | 24 |
| Green Screen | Novel Scene 2 | 64 | 80 | 88 | 84 | 72 | 80 |
| Green Screen | Novel Scene 3 | 52 | 60 | 76 | 88 | 84 | 60 |
Table 16: “Slide Book and Pick Up” task unaggregated results.
Success Rate (%)

| Train Scene | Test Scene | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask |
|---|---|---|---|---|---|---|---|
| Green Screen | Green Screen | (100) | (100) | (100) | (100) | (100) | (100) |
| Living Room | Novel Scene 1 | 68 | 84 | 88 | - | - | - |
| Living Room | Novel Scene 2 | 28 | 48 | 84 | - | - | - |
| Living Room | Novel Scene 3 | 32 | 52 | 96 | - | - | - |
| Green Screen | Novel Scene 1 | 44 | 96 | 92 | 96 | 96 | 32 |
| Green Screen | Novel Scene 2 | 48 | 72 | 84 | 88 | 88 | 8 |
| Green Screen | Novel Scene 3 | 72 | 92 | 92 | 96 | 96 | 64 |
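Table 17: Texture randomness unaggregated results (GreenAug-Rand). “Green Screen→Green Screen” roughly represents the upper bound performance (in parentheses) and is not included in the average.
Success Rate (%)

| Task | Train Scene | Test Scene | None | Solid Textures | Perlin Textures | MIL Textures |
|---|---|---|---|---|---|---|
| Place Cube in Drawer | Green Screen | Green Screen | (92) | (100) | (100) | (100) |
| Place Cube in Drawer | Green Screen | Novel Scene 1 | 36 | 68 | 64 | 92 |
| Place Cube in Drawer | Green Screen | Novel Scene 2 | 56 | 36 | 56 | 96 |
| Place Cube in Drawer | Green Screen | Novel Scene 3 | 24 | 68 | 68 | 88 |
| Stack Cups | Green Screen | Green Screen | (76) | (76) | (80) | (88) |
| Stack Cups | Green Screen | Novel Scene 1 | 56 | 76 | 68 | 72 |
| Stack Cups | Green Screen | Novel Scene 2 | 64 | 68 | 60 | 84 |
| Stack Cups | Green Screen | Novel Scene 3 | 52 | 76 | 80 | 88 |
| Average | | | 48 | 65 | 66 | 87 |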
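Table 18: Green screen coverage unaggregated results (GreenAug-Rand). “Green Screen→Green Screen” roughly represents the upper bound performance (in parentheses) and is not included in the average.
Success Rate (%)

| Task | Train Scene | Test Scene | 0% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|---|---|---|
| Place Cube in Drawer | Green Screen | Green Screen | (92) | (100) | (100) | (100) | (100) |
| Place Cube in Drawer | Green Screen | Novel Scene 1 | 36 | 80 | 76 | 80 | 92 |
| Place Cube in Drawer | Green Screen | Novel Scene 2 | 56 | 64 | 68 | 72 | 96 |
| Place Cube in Drawer | Green Screen | Novel Scene 3 | 24 | 88 | 84 | 88 | 88 |
| Stack Cups | Green Screen | Green Screen | (76) | (76) | (80) | (76) | (88) |
| Stack Cups | Green Screen | Novel Scene 1 | 56 | 64 | 76 | 80 | 72 |
| Stack Cups | Green Screen | Novel Scene 2 | 64 | 68 | 60 | 68 | 84 |
| Stack Cups | Green Screen | Novel Scene 3 | 52 | 72 | 76 | 76 | 88 |
| Average | | | 48 | 73 | 73 | 77 | 87 |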
Table 19: Object generalisation unaggregated results. Data is collected on the green cup, and policies are then trained and evaluated on various objects (illustrated in Fig. 18).
Success Rate (%)

| Object Type | NoAug | GreenAug-Rand | GreenAug-Gen |
|---|---|---|---|
| Green Cup | 96 | 88 | 80 |
| Blue Cup | 96 | 80 | 80 |
| Orange Cup | 92 | 80 | 80 |
| Red Cube | 0 | 44 | 60 |
| Green Cube | 0 | 20 | 44 |
| Soda Can | 40 | 28 | 36 |
| Soya Can | 36 | 64 | 44 |
| Soft Toy | 0 | 84 | 72 |
| Average | 45 | 61 | 62 |
(a) Demonstrations collected using the green cup.
(b) Test objects: cups, cans, cubes and toy.
Figure 18: Object generalisation across object categories. Policy trained on green cup data using GreenAug-Rand and tested on different objects.

Appendix E Additional Limitations and Future Works

Exploration of better chroma key algorithms. The chroma key algorithm used in this paper [38] is a basic one that performs reasonably well, but it does not produce perfect masks. Some parameter tuning for $K$, $\alpha$, and $\beta$ is still necessary. Despite these imperfections, we demonstrate that GreenAug still significantly outperforms the baselines. In the film industry, extensive manual post-processing is often required to achieve perfect masks [52]. Future research could explore more advanced chroma key algorithms that provide superior green screen masks [17, 53, 54, 18]. This could potentially enhance the performance of GreenAug-Mask, which relies heavily on the green screen mask as ground truth for training.
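
As one possible starting point for tuning the key colour, it could be estimated from an image patch known to contain only green screen, for example by taking the per-channel median. The helper below is hypothetical and not part of the paper's pipeline. The tolerances would still need to be tuned per task.

```python
import numpy as np


def estimate_key_colour(patch):
    """Return '#rrggbb' for the per-channel median of an RGB patch (H, W, 3)."""
    median = np.median(patch.reshape(-1, 3), axis=0)
    r, g, b = (int(round(float(c))) for c in median)
    return f"#{r:02x}{g:02x}{b:02x}"


# Toy patch: uniform green screen colour with a single outlier pixel.
patch = np.zeros((4, 4, 3), dtype=np.uint8)
patch[...] = (0x25, 0x80, 0x6F)
patch[0, 0] = (255, 255, 255)
key = estimate_key_colour(patch)
```

The median is used rather than the mean so that a few stray highlights or shadows in the patch do not shift the estimate.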

Pose generalisation. A major ongoing challenge in robot learning is generalising to 6D poses not present in the training dataset. Current robot learning policies, especially imitation learning-based ones, often fail when objects are relocated to different positions within 3D space.

Application to methods with 3D observations. Currently, GreenAug has only been tested on RGB-based robot learning policies. Recent advances in next-best-pose-based agents [55, 3, 6] have demonstrated that by aligning the observation space with action space, we can obtain strong generalisation in robot learning policies. As a general plug-and-play method, GreenAug could potentially further improve the scene generalisation of the next-best-pose agents, which we leave for future study.