Green Screen Augmentation Enables Scene Generalisation in Robotic Manipulation

Eugene Teoh¹,∗, Sumit Patidar¹,∗, Xiao Ma¹, Stephen James¹
¹Dyson Robot Learning Lab, ∗Equal Contribution
greenaug.github.io

Abstract

Generalising vision-based manipulation policies to novel environments remains a challenging area with limited exploration. Current practices involve collecting data in one location, training imitation learning or reinforcement learning policies with this data, and deploying the policy in the same location. However, this approach lacks scalability as it necessitates data collection in multiple locations for each task. This paper proposes a novel approach where data is collected in a location predominantly featuring green screens. We introduce Green-screen Augmentation (GreenAug), employing a chroma key algorithm to overlay background textures onto a green screen. Through extensive real-world empirical studies with over 850 training demonstrations and 8.2k evaluation episodes, we demonstrate that GreenAug surpasses no augmentation, standard computer vision augmentation, and prior generative augmentation methods in performance. While no algorithmic novelties are claimed, our paper advocates for a fundamental shift in data collection practices. We propose that real-world demonstrations in future research should utilise green screens, followed by the application of GreenAug. We believe GreenAug unlocks policy generalisation to visually distinct novel locations, addressing the current scene generalisation limitations in robot learning.

Keywords: Green Screen, Data Augmentation, Learning from Demonstration

Figure 1: GreenAug provides a simple visual augmentation to robot policies by first collecting data with a green screen, then augmenting it with different textures. The resulting policy can be transferred to unseen, visually distinct locations (scenes).

1 Introduction

Recent advancements in robot learning policies [1, 2, 3, 4, 5, 6, 7] have shown significant capabilities in performing complex manipulation tasks. However, generalising these policies to new locations remains a substantial challenge due to the lack of diverse training datasets. Ideally, these datasets should include a wide variety of environments, such as diverse areas of homes. However, gathering real-world data from different scenes is difficult and costly. These scenes refer to visually distinct physical locations, such as an oven situated in different kitchens or a toilet placed in various homes. The difficulty of collecting diverse data necessitates more efficient use of existing datasets.

Generative augmentation approaches [8, 9, 10] have attempted to address this by using generative models [11, 12, 13] to augment robot datasets. However, these methods often require extensive manual tuning and face several challenges. These include text prompt engineering, chaining multiple object detectors, segmenters and generative models, and problems with performance and processing speed. Additionally, they can be inaccurate in robotic settings, particularly in segmentation and inpainting from wrist camera views, potentially introducing noise into robot policies.

In light of these complications, we opt for a simpler yet effective alternative: green screens. The film industry has utilised green screens extensively [14, 15, 16, 17, 18], enabling the addition of virtual backgrounds to live footage. Inspired by these applications, we apply green screen technology to robotics, allowing robots to perform tasks in unfamiliar scenes not part of the training demonstration data.

In this paper, we introduce Green-screen Augmentation (GreenAug), a simple real-world visual augmentation method that uses green screen and chroma keying to replace backgrounds, applicable to RGB-based robot learning methods. We explore several variants of GreenAug, including the use of random textures (Fig. 1), backgrounds generated by generative models, and a background masking network to obscure the background during inference. By replacing backgrounds with various textures, it allows robot learning policies to be robust against changes in visual scenes and focus on crucial features in the image space.

We conducted extensive real-world experiments across eight challenging robotic manipulation tasks and six further studies, amounting to over 850 training demonstrations and 8.2k evaluation episodes. We evaluated the performance of control policies in unseen scenes for head-to-head comparisons on scene generalisation. We compared several variants of GreenAug against approaches with no augmentation, standard computer vision augmentations, and a generative augmentation [8, 9, 10] method. Our results show that GreenAug outperforms no augmentation by 65%, standard computer vision augmentation by 29% and generative augmentation by 21%.

2 Related Work

Visual augmentation in robotics. Visual augmentation is important in robotics for adapting to changing environments. Standard computer vision augmentations like random photometric distortion, cropping, shifting, convolutions and overlays have enhanced performance in imitation learning [19, 20] and reinforcement learning [1, 21, 2, 22, 23, 24]. However, most of these methods only apply simple photometric perturbations. Domain randomisation [25, 26, 27, 28, 29, 30, 31] enhances this by generating synthetic data with varied visual and physical dynamics parameters for simulation-to-reality (Sim2Real) transfer. Alternatively, methods like CACTI [8], GenAug [9], and ROSIE [10] use generative models such as Stable Diffusion [12] to diversify visual data directly on real-world data, bypassing the need for simulation.

Green screen in machine learning and robotics. Green screen has been traditionally used for film and video production [14, 15, 16, 17, 18]. In recent years, its application in machine learning has increased. Smirnov et al. [18] applied machine learning to improve the quality of chroma keying. Xu et al. [32], Sengupta et al. [33], Lin et al. [34, 35] explored machine learning techniques to replace green screens, enabling natural image matting without them. Schülein et al. [36] used green screens and chroma keying to replace backgrounds with clinical scenes to create synthetic data for medical clothing detection. In robotics, the use of green screens remains limited. Coates and Ng [37] employed it to develop a multi-camera object detector with synthetic data from chroma-keyed backgrounds.

3 Green Screen Augmentation

In this section, we provide a detailed introduction to GreenAug. The practical steps for GreenAug are as follows: (1) Green Screen Scene Setup; (2) GreenAug via Chroma Keying; (3) Training Robot Learning Policies. In the following sub-sections, we expand on each of these stages.

3.1 Green Screen Scene Setup

The act of scene setup consists of obscuring the background (i.e. non-task relevant objects) with a green screen. There are several ways of achieving this, two of which are highlighted in Fig. 3 and described below. Once the scene has been set up, demonstration collection can begin.

Scene to Green Screen, where a permanent green screen area or room is established, and items can be moved into the green screen for data collection. This is the most common use case and includes tasks such as general pick-and-place, opening drawers, sweeping, pushing, etc.

Green Screen to Scene, where the green screen is brought to a fixed, unmovable object. Scenes that usually fall into this category are ones that require manipulating integrated or heavy objects, such as stacking dishwashers, opening ovens, and opening doors.

3.2 GreenAug via Chroma Keying

Chroma keying is a visual effects technique for layering two images or video streams together based on colour hues (chroma range). This technique is commonly used in video production and post-production to composite two frames or images together by removing a background colour (usually green or blue) from the foreground content, making it transparent. This allows for the insertion of a new background or visual element in place of the green or blue background. Many chroma key algorithms exist, but we opt for a simple algorithm proposed by Cannon [38]. Given the generated mask, several options are available for applying GreenAug. We provide three variants of GreenAug: Random (GreenAug-Rand), Generative (GreenAug-Gen) and Mask (GreenAug-Mask), illustrated in Fig. 2 and described in detail below.
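As a rough illustration of the masking step, the sketch below computes a soft foreground mask from how strongly green dominates each pixel. It is not the algorithm of Cannon [38]; the function name `green_dominance_mask`, the dominance measure and the two thresholds are assumptions chosen only for illustration.

```python
import numpy as np


def green_dominance_mask(rgb: np.ndarray, low: float = 0.05, high: float = 0.30) -> np.ndarray:
    """Return a soft foreground mask in [0, 1] for an HxWx3 uint8 RGB image.

    1.0 means "keep this pixel" (foreground), 0.0 means "green screen".
    The thresholds `low` and `high` are illustrative, not tuned values.
    """
    img = rgb.astype(np.float32) / 255.0
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    # How much greener a pixel is than its strongest non-green channel.
    dominance = g - np.maximum(r, b)
    # Linear ramp: dominance <= low is foreground, dominance >= high is background.
    background = np.clip((dominance - low) / (high - low), 0.0, 1.0)
    return 1.0 - background


frame = np.zeros((2, 2, 3), dtype=np.uint8)
frame[0, 0] = (0, 255, 0)      # pure green screen pixel
frame[0, 1] = (200, 30, 30)    # red object pixel
frame[1, 0] = (255, 255, 255)  # white pixel, not green-dominant
frame[1, 1] = (60, 120, 60)    # dull green pixel, only partially keyed
mask = green_dominance_mask(frame)
```

A soft ramp rather than a hard threshold keeps edge pixels partially transparent, which is one common way to reduce visible halos around objects.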

GreenAug-Rand. This variant applies a fixed set of random textures to the chroma-keyed background. Following research in domain randomisation [25, 26, 27, 28, 29, 30, 31], increasing the variability of these textures helps the policy ignore the background and focus on task-specific items (objects manipulated by the policy).

GreenAug-Gen. This variant uses the chroma-keyed mask to inpaint realistic or imagined backgrounds using generative models like Stable Diffusion. Examples of prompts include: “photorealistic bedroom”, “photorealistic kitchen”, “photorealistic living room”. This method augments the image with semantic backgrounds, aiming to closely resemble real-world scenarios.

GreenAug-Mask. This variant uses a masking (soft segmentation) network trained to predict masks. These predicted masks are then applied to the image observations to obtain blacked-out, dark backgrounds. This simplification of the visual input potentially helps the visuomotor policies to focus on the main elements of interest by eliminating background noise and distractions. During training, the masking network processes images against chroma-keyed backgrounds with random textures (akin to GreenAug-Rand) and learns to predict the masks generated through chroma keying.
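The two mask-based operations described above, texture replacement (GreenAug-Rand) and background blackout (GreenAug-Mask), can be sketched as follows. This is a minimal sketch, assuming a foreground mask in [0, 1] is already available (from chroma keying for GreenAug-Rand, from a masking network for GreenAug-Mask). The function names are hypothetical, and the textures are assumed to be pre-resized to the frame size.

```python
import numpy as np


def composite_random_texture(rgb, fg_mask, textures, rng):
    """Blend a randomly chosen texture into the keyed-out background.

    rgb:      HxWx3 uint8 frame recorded against the green screen.
    fg_mask:  HxW float mask in [0, 1]; 1 keeps the original pixel.
    textures: list of HxWx3 uint8 images, already resized to the frame size.
    """
    texture = textures[rng.integers(len(textures))]
    alpha = fg_mask[..., None].astype(np.float32)  # broadcast over channels
    out = alpha * rgb.astype(np.float32) + (1.0 - alpha) * texture.astype(np.float32)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def black_out_background(rgb, predicted_mask):
    """Multiply an HxWx3 uint8 image by a soft HxW mask, darkening the background."""
    alpha = np.clip(predicted_mask, 0.0, 1.0).astype(np.float32)[..., None]
    return np.rint(alpha * rgb.astype(np.float32)).astype(np.uint8)


rng = np.random.default_rng(0)
frame = np.full((2, 2, 3), 10, dtype=np.uint8)
texture_bank = [np.full((2, 2, 3), 200, dtype=np.uint8)]
fg_mask = np.array([[1.0, 0.0], [0.0, 0.5]], dtype=np.float32)

augmented = composite_random_texture(frame, fg_mask, texture_bank, rng)
masked = black_out_background(np.full((1, 3, 3), 200, dtype=np.uint8),
                              np.array([[1.0, 0.0, 0.25]], dtype=np.float32))
```

In this sketch the only difference between the two variants is what fills the background: a sampled texture at training time, or black at both training and inference time.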

Table 1: Main experiment results averaged across three novel scenes. Each task-method combination is evaluated with 112 evaluation episodes on average. Full detailed results are provided in the Appendix.

Success Rate (%)
Task | NoAug | CVAug | Generative Augmentation | GreenAug-Rand | GreenAug-Gen | GreenAug-Mask
Open Drawer
Place Cube in Drawer
Take Lid off Saucepan
Sweep Coffee Beans
Place Jeans in Basket
Place Bear in Basket
Stack Cups
Slide Book and Pick Up
Average

3.3 Training Robot Learning Policies

GreenAug can be applied to RGB-based robot learning methods. Similar to standard augmentation methods, images can be transformed with GreenAug and fed into policy networks during training, or they can be preprocessed offline and then used for training. Offline preprocessing is more common due to the longer computation time of some GreenAug variants. However, in online settings such as reinforcement learning, online transformations are also effective. GreenAug-Rand and GreenAug-Gen allow each raw frame from the training demonstrations to be augmented with different textures, significantly increasing the amount of preprocessed data. In contrast, GreenAug-Mask only masks the background and provides a single solution. To ensure a fair comparison, we keep the number of preprocessed frames equal to the number of raw frames for all methods.
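The offline setting with an equal number of raw and preprocessed frames can be sketched as below. This is an illustrative sketch only: `preprocess_offline` and the `augment` callable are hypothetical names, and sampling one texture uniformly at random per frame is an assumption rather than a detail stated here.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Texture = TypeVar("Texture")


def preprocess_offline(
    frames: Sequence[Frame],
    textures: Sequence[Texture],
    augment: Callable[[Frame, Texture], Frame],
    seed: int = 0,
) -> List[Frame]:
    """Produce exactly one augmented frame per raw frame."""
    rng = random.Random(seed)
    # One texture is drawn per frame, so the dataset size is unchanged.
    return [augment(frame, rng.choice(textures)) for frame in frames]


raw_frames = ["frame0", "frame1", "frame2"]
texture_bank = ["wood", "marble"]
processed = preprocess_offline(raw_frames, texture_bank, lambda f, t: f"{f}+{t}")
```

Strings stand in for images here so that the example stays dependency-free; in practice `augment` would be an image compositing function.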

In our main experiment (Section 4.3), we chose Action Chunking with Transformers (ACT) [4] as the control policy to demonstrate the effectiveness of this augmentation method. We selected ACT because of its recent success in adapting behaviours from a modest number of demonstrations, making it an ideal platform to showcase the benefits of GreenAug. Additionally, in Section 4.4, we demonstrate that GreenAug-Rand is also effective with a reinforcement learning policy.

4 Experiments

In this section, we present the experiments to evaluate the effectiveness of GreenAug on robot learning policies. Prior works [20, 39] have confirmed the effectiveness of background and texture randomisation in simulation. Since GreenAug focuses on real-world data augmentation, our experiments are conducted exclusively in the real world. We aim to study the following: (1) Does GreenAug improve visual generalisation to unseen scenes? (2) Which variant of GreenAug is the most effective, and what are the tradeoffs? (3) Is GreenAug applicable in different data collection settings? (4) Is GreenAug agnostic to robot embodiments and learning methods?

4.1 Baselines

We implement several baselines to compare with GreenAug, as described below.

No augmentation (NoAug). No visual augmentation.

Computer Vision augmentation (CVAug). Random photometric distortions and random shift.

Generative augmentation. Generative augmentation encompasses a broader range of methods such as CACTI [8], GenAug [9], and ROSIE [10]. CACTI uses Stable Diffusion for inpainting but does not detail the method for obtaining object masks. GenAug, on the other hand, is constrained to a tabletop setting. ROSIE relies on proprietary models and does not provide publicly available code. Thus, we have developed our own implementation that closely aligns with these methods. Our implementation is based on Grounding DINO [40] for open vocabulary object detection, Segment Anything [41] for zero-shot segmentation, and Stable Diffusion Turbo [12, 13] for inpainting, integrated with ControlNet [42] and conditioned on DPT-Hybrid [43] (monocular depth estimator) for better generation. Generative augmentation is similar to GreenAug-Gen, but it uses object detection and segmentation for mask creation instead of chroma keying. The pseudocode detailing this implementation is outlined in the Appendix.
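For the CVAug baseline, a random shift and a photometric distortion might look like the sketch below. The pad-and-crop formulation, the brightness-only distortion, and the values `pad=8` and `max_delta=30.0` are assumptions for illustration; the exact CVAug settings are not specified in this section.

```python
import numpy as np


def random_shift(rgb, pad, rng):
    """Shift an HxWxC image by up to `pad` pixels via edge padding and random cropping."""
    h, w = rgb.shape[:2]
    padded = np.pad(rgb, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w]


def random_brightness(rgb, max_delta, rng):
    """Add one random brightness offset in [-max_delta, max_delta] to every pixel."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(rgb.astype(np.float32) + delta, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(240, 320, 3), dtype=np.uint8)
shifted = random_shift(image, pad=8, rng=rng)
brightened = random_brightness(image, max_delta=30.0, rng=rng)
```

Both transforms preserve the image shape, so they can be applied on the fly during training without changing the policy's input size.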

4.2 Setup

For our main experiment, we designed eight tasks (illustrated in Fig. 7) and structured our experiments for each task as follows.

Data collection. We collected two sets of demonstrations, each consisting of 50 demos. One set was recorded against a green screen (Scene 1), and the other within a standard setting (Scene 2). All data were collected using a leader-follower teleoperation system, similar to ALOHA [4], but with 7-DoF Franka Panda arms and a 2F-140 Robotiq gripper on the follower. We used three D415 RealSense cameras, positioned at the upper wrist, lower wrist and left shoulder. The images are captured at a resolution of 240 (height) × 320 (width) pixels. For the main experiments alone, we collected over 800 demonstrations and conducted more than 6.6k evaluation runs. Additionally, we gathered about 50 more training demonstrations and 1.6k evaluations for the ablation and further studies in Section 4.4.

Training. We trained all baselines and our methods on both sets of data, except for GreenAug, which was excluded from Scene 2 as it relies on the green screen. Each data set corresponds to a separate policy. ACT is used as the control policy for our main experiments.

Evaluation. In addition to Scenes 1 and 2, we evaluated the methods in three novel scenes (Scenes 3–5). Initially, each method was assessed in Scene 1 to establish an upper-bound performance for the task. Subsequently, the methods were evaluated in Scenes 3–5 to test generalisation. For each combination of task, method, training scene (two) and novel test scene (three), we performed 25 evaluation runs.
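One reading of this protocol reproduces the figure of roughly 112 evaluation episodes per task-method combination quoted in the caption of Table 1. The short calculation below counts novel-scene runs only and assumes that the three baselines each have policies for two training scenes, while the three GreenAug variants have one (Scene 1 only). The totals it produces are derived here, not stated in the text.

```python
RUNS_PER_COMBINATION = 25
NOVEL_SCENES = 3
TASKS = 8

# Number of training scenes for which each method has a policy.
train_scenes_per_method = {
    "NoAug": 2,
    "CVAug": 2,
    "Generative Augmentation": 2,
    "GreenAug-Rand": 1,
    "GreenAug-Gen": 1,
    "GreenAug-Mask": 1,
}

# Novel-scene episodes for one task and one method.
episodes_per_task_method = {
    method: n_train * NOVEL_SCENES * RUNS_PER_COMBINATION
    for method, n_train in train_scenes_per_method.items()
}

average_per_task_method = sum(episodes_per_task_method.values()) / len(episodes_per_task_method)
novel_scene_episodes = TASKS * sum(episodes_per_task_method.values())
```

Under these assumptions the baselines receive 150 novel-scene episodes per task and the GreenAug variants 75, giving an average of 112.5 and 5,400 novel-scene episodes over the eight tasks.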

Each scene is shown in Fig. 4. To focus on testing visual generalisation across different scenes, we maintained the positions and orientations of the objects (while applying the same degree of randomisation for one-to-one comparison) relative to the robot while moving between scenes.

Table 2: Processing time per RGB frame.

Method | Time per frame (s, lower is better)
Generative Augmentation | 2.530
GreenAug-Rand | 0.009
GreenAug-Gen | 0.882

Table 3: GreenAug-Rand applied to RL.

Success Rate (%)
Train Scene | Test Scene | NoAug | GreenAug-Rand
Green Screen | 1 Novel Scene | 12 | 64

Table 4: GreenAug-Rand with different texture types averaged across tasks and novel scenes. Entropy signifies the amount of texture randomness.

Texture Type | Entropy (bits) | Success Rate (%)
None | - | 48
Solid Colours | 0.00 | 65
Perlin Noise | 4.45 | 66
MIL Textures | 6.81 | 87

Table 5: Object generalisation results. Policies trained on a green cup were tested on other objects. (n) specifies the number of objects tested in the category.

Success Rate (%)
Object Category | NoAug | GreenAug-Rand | GreenAug-Gen
Cups (3) | 95 | 83 | 80
Cans (2) | 38 | 46 | 40
Cubes (2) | 0 | 32 | 52
Soft Toy (1) | 0 | 84 | 72
Average | 45 | 61 | 62

4.3 Results

Table 1 presents our experimental findings. The results demonstrate that GreenAug-Rand surpasses all other baseline methods across all tasks. Specifically, GreenAug-Rand shows approximately a 65% improvement over NoAug, around a 29% improvement compared to CVAug, and about a 21% improvement over generative augmentation.

Surprisingly, GreenAug-Gen and generative augmentation rank second and third in performance respectively, despite using semantically meaningful backgrounds like living rooms or kitchens. As expected, both methods perform similarly, since they differ only in how they obtain background masks (chroma keying versus object detection and segmentation). This suggests that specific semantic content is not crucial for GreenAug’s success, as the variant using random backgrounds performs even better. This superior performance may have resulted from the wider variety of colours and textures offered by the random backgrounds.

Generative augmentation performs slightly worse than GreenAug-Gen, likely because it struggles to provide good masks in wrist camera views (illustrated in Fig. 5), which are essential for tasks requiring precise and stable visual input. Despite advancements in generative models, segmentation and inpainting from robot camera views remain suboptimal.

GreenAug-Mask shows the least effectiveness among all methods tested. Qualitative evaluations of the masked images reveal frequent failures to completely obscure backgrounds, especially in novel scenes (shown in Fig. 5). This issue stems from two main factors: the inherent imperfections in ground truth masks obtained from chroma keying and the compounding error from the masking network. The network’s imperfect masking further complicates the tasks, pushing the images into out-of-distribution states that challenge the control policy.

4.4 Ablation and Further Studies

Based on the main experiments, we demonstrated that GreenAug-Rand outperforms all other methods. We then conducted the following in-depth analyses.

Benchmarking GreenAug’s speed. We conducted a benchmark to compare the processing speed of various methods, shown in Table 2. CVAug and GreenAug-Mask were excluded because the former is applied on the fly during training, and the latter performs poorly. We show that GreenAug-Rand is significantly faster than the other two generative methods.

Applying GreenAug to a different robot with reinforcement learning. We investigated whether GreenAug can be applied to a different robot embodiment and learning method, beyond the Franka Panda and ACT. We set up a similar “take lid off saucepan” task on a UR5. We used a continuous demo-driven DQN variant [44, 45, 46, 47] with actions discretised into bins. The robot was provided with 24 demonstrations and was given a sparse reward of 0 for failure and 1 for success. We trained the robot online with 20 minutes of exploration on a green screen background and evaluated two policies, NoAug and GreenAug-Rand, in one novel scene. The results, shown in Table 3, demonstrate that GreenAug-Rand applied to reinforcement learning with a different robot performs significantly better than NoAug.
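To make the action discretisation and sparse reward concrete, the sketch below maps a continuous action dimension to a bin index and back. Uniform bins, the bin count of 5 and the action range of [-1, 1] in the example are assumptions; the binning scheme actually used is not specified here.

```python
def discretise(value, low, high, n_bins):
    """Map a continuous value in [low, high] to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # value == high falls into the last bin


def bin_centre(index, low, high, n_bins):
    """Map a bin index back to the continuous value at the centre of that bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


def sparse_reward(success):
    """Sparse reward: 1 for success, 0 for failure."""
    return 1.0 if success else 0.0


middle_bin = discretise(0.0, -1.0, 1.0, 5)
recovered = bin_centre(middle_bin, -1.0, 1.0, 5)
```

With a discrete action per dimension, a DQN-style critic can output one value per bin, and the executed action is the centre of the highest-valued bin.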

Impact of texture randomness. We investigated how the texture randomness of GreenAug-Rand affects performance. We tested solid colours, Perlin noise (procedurally generated textures) [48], and MIL textures (used in the main experiments). All texture datasets are of the same size (5771). The evaluation was conducted on the “place cube in drawer” and “stack cups” tasks from the main experiment across three novel scenes (Scenes 3–5). Table 4 summarises the results. Consistent with domain randomisation studies [25, 26, 27, 28, 29, 31], greater texture randomness leads to better performance. Examples of each texture type are provided in the Appendix.
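How the entropy values are computed is not described in this section. One plausible definition, used here purely as an assumption, is the Shannon entropy of the empirical distribution of pixel intensities within a texture. Under that definition a solid colour has 0 bits, matching the entry for solid colours.

```python
import math
from collections import Counter


def shannon_entropy_bits(pixel_values):
    """Shannon entropy (in bits) of the empirical distribution of pixel values."""
    counts = Counter(pixel_values)
    total = sum(counts.values())
    # p * log2(1 / p), summed over the distinct values that occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


solid = [128] * 16               # one intensity everywhere
two_tone = [0] * 8 + [255] * 8   # two equally likely intensities
uniform_16 = list(range(16))     # sixteen equally likely intensities

entropies = [shannon_entropy_bits(t) for t in (solid, two_tone, uniform_16)]
```

For 8-bit intensities this measure is bounded above by 8 bits, reached only when all 256 values are equally frequent.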

Generalisation across object category. We assessed if GreenAug can be applied not just to backgrounds but also to different object categories. We set up a simple pick-and-place task. We first trained on a green cup and then tested on other visually different objects. The results, shown in Table 5, indicate that GreenAug-Gen performs best, with only a 1% difference from GreenAug-Rand. Both methods outperform NoAug by more than 35%. NoAug performs well on cups but fails with cubes and soft toys, and occasionally works with cans due to their similar geometric shapes to cups. GreenAug-Rand and GreenAug-Gen show better performance across different object categories, demonstrating some level of generalisation. However, performance with cups suffers slightly, likely due to the strong augmentation causing confusion about geometric shapes.
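The Average row of Table 5 and the "more than 35%" figure can be checked with a short calculation. It assumes that the average weights each category by its number of tested objects and that the improvement is relative to NoAug; both assumptions reproduce the reported numbers.

```python
# (number of objects, NoAug, GreenAug-Rand, GreenAug-Gen) per category, from Table 5.
table5 = {
    "Cups": (3, 95, 83, 80),
    "Cans": (2, 38, 46, 40),
    "Cubes": (2, 0, 32, 52),
    "Soft Toy": (1, 0, 84, 72),
}

total_objects = sum(row[0] for row in table5.values())


def weighted_average(column):
    """Average success rate, weighting each category by its object count."""
    return sum(row[0] * row[column] for row in table5.values()) / total_objects


no_aug, rand, gen = (weighted_average(c) for c in (1, 2, 3))
rand_gain = (rand - no_aug) / no_aug * 100  # relative improvement over NoAug, in %
gen_gain = (gen - no_aug) / no_aug * 100
```

The weighted averages are 45.125, 61.125 and 62.0, which round to the reported 45, 61 and 62, and both relative gains exceed 35%.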

Green screen coverage. In real-world settings, some frames in the robot data may move away from the green screen during robot servoing. For example, if the green screen is only partially set up in the scene, the robot may observe parts of the scene not covered by the green screen. To emulate this scenario, we applied GreenAug-Rand to varying percentages of frames per episode. This was evaluated on the same tasks as the texture randomness study. The results are summarised in Fig. 6(a). As expected, the success rate increases with green screen coverage.
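Partial coverage can be emulated along the following lines. This is a sketch under assumptions: the frames to augment are chosen uniformly at random within each episode, which may differ from the selection actually used (for instance a contiguous segment), and `augment_fraction_of_frames` is a hypothetical name.

```python
import random


def augment_fraction_of_frames(frames, coverage, augment, seed=0):
    """Apply `augment` to a random `coverage` fraction of an episode's frames."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in [0, 1]")
    rng = random.Random(seed)
    n_augmented = round(coverage * len(frames))
    chosen = set(rng.sample(range(len(frames)), n_augmented))
    # Frames outside `chosen` are left untouched, as if recorded off the green screen.
    return [augment(f) if i in chosen else f for i, f in enumerate(frames)]


episode = list(range(10))  # integers stand in for image frames
half = augment_fraction_of_frames(episode, 0.5, lambda f: -1)
```

Sweeping `coverage` from 0 to 1 then gives one training set per coverage level, each with the same number of frames.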

Presence of multiple green objects. Green screens could affect scenes when there are multiple green objects. We evaluated the sensitivity of chroma keying under these conditions, a challenge also encountered in the film industry. This study questions whether chroma keying can effectively isolate one green object without impacting others. We conducted a visual assessment (shown in Fig. 6(b)) and showed that we can augment only one object at a time while leaving the others unchanged. Alternatively, one can also use a different colour such as blue (along with a green background) for chroma-keying objects.

5 Conclusion and Limitations

This paper proposes and investigates the efficacy of GreenAug in robotic manipulation across a variety of real-world scenarios. We have demonstrated that GreenAug not only works effectively across different tasks but also surpasses other augmentation methods in performance while maintaining simplicity. GreenAug outperforms NoAug by approximately 65%, CVAug by 29% and generative augmentation by about 21%. Our findings advocate for a paradigm shift in data collection practices for robot learning. We propose the use of green screens for future real-world demonstrations. Implementing GreenAug could significantly improve policy generalisation across novel locations, effectively addressing scene generalisation limitations currently faced in the field.

While GreenAug proves to be useful, several challenges remain that we have outlined for future research. GreenAug is effective for background generalisation and to an extent, object generalisation (as shown in further studies), but it falls short when it comes to adapting to objects with very different geometric shapes. This type of generalisation involves changing the dynamics and trajectories of the demonstrations, such as accommodating different mugs with unique handles that require specific grasping points. Furthermore, GreenAug could be complementary to generative augmentation. This combination could help train world models capable of producing imaginary trajectories that generalise across diverse objects and appliances.

Acknowledgments

Big thanks to the members of the Dyson Robot Learning Lab for discussions and infrastructure help: Nic Backshall, Iain Haughton, Younggyo Seo, Sridhar Sola, Jafar Uruc, Yunfan Lu, Abdi Abdinur, Nikita Chernyadev.

References

Yarats et al. [2020] D. Yarats, I. Kostrikov, and R. Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International conference on learning representations, 2020.
Yarats et al. [2021] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021.
Shridhar et al. [2023] M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.
Zhao et al. [2023] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.016.
Chi et al. [2023] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. C. Burchfiel, and S. Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.026.
Ma et al. [2024] X. Ma, S. Patidar, I. Haughton, and S. James. Hierarchical diffusion policy for kinematics-aware multi-task robotic manipulation. arXiv preprint arXiv:2403.03890, 2024.
Vosylius et al. [2024] V. Vosylius, Y. Seo, J. Uruç, and S. James. Render and diffuse: Aligning image and action spaces for diffusion-based behaviour cloning. arXiv preprint arXiv:2405.18196, 2024.
Mandi et al. [2022] Z. Mandi, H. Bharadhwaj, V. Moens, S. Song, A. Rajeswaran, and V. Kumar. Cacti: A framework for scalable multi-task multi-scene visual imitation learning. arXiv preprint arXiv:2212.05711, 2022.
Chen et al. [2023] Q. Chen, S. C. Kiami, A. Gupta, and V. Kumar. GenAug: Retargeting behaviors to unseen situations via Generative Augmentation. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.010.
Yu et al. [2023] T. Yu, T. Xiao, J. Tompson, A. Stone, S. Wang, A. Brohan, J. Singh, C. Tan, D. M, J. Peralta, K. Hausman, B. Ichter, and F. Xia. Scaling Robot Learning with Semantically Imagined Experience. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi:10.15607/RSS.2023.XIX.027.
Sohl-Dickstein et al. [2015] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
Rombach et al. [2022] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Sauer et al. [2023] A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
Smith and Blinn [1996] A. R. Smith and J. F. Blinn. Blue screen matting. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 259–268, 1996.
Grundhöfer et al. [2010] A. Grundhöfer, D. Kurz, S. Thiele, and O. Bimber. Color invariant chroma keying and color spill neutralization for dynamic scenes and cameras. The Visual Computer, 26:1167–1176, 2010.
Foster [2014] J. Foster. The green screen handbook: real-world production techniques. Routledge, 2014.
Aksoy et al. [2016] Y. Aksoy, T. O. Aydin, M. Pollefeys, and A. Smolić. Interactive high-quality green-screen keying via color unmixing. ACM Transactions on Graphics (TOG), 36(4):1, 2016.
Smirnov et al. [2023] D. Smirnov, C. LeGendre, X. Yu, and P. Debevec. Magenta green screen: Spectrally multiplexed alpha matting with deep colorization. In Proceedings of the Digital Production Symposium, pages 1–13, 2023.
Young et al. [2021] S. Young, D. Gandhi, S. Tulsiani, A. Gupta, P. Abbeel, and L. Pinto. Visual imitation made easy. In Conference on Robot Learning, pages 1992–2005. PMLR, 2021.
Xie et al. [2023] A. Xie, L. Lee, T. Xiao, and C. Finn. Decomposing the generalization gap in imitation learning for visual robotic manipulation. arXiv preprint arXiv:2307.03659, 2023.
Laskin et al. [2020] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020.
Hansen et al. [2021] N. Hansen, H. Su, and X. Wang. Stabilizing deep q-learning with convnets and vision transformers under data augmentation. Advances in neural information processing systems, 34:3680–3693, 2021.
Hansen and Wang [2021] N. Hansen and X. Wang. Generalization in reinforcement learning by soft data augmentation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13611–13617. IEEE, 2021.
Almuzairee et al. [2024] A. Almuzairee, N. Hansen, and H. I. Christensen. A recipe for unbounded data augmentation in visual reinforcement learning. arXiv preprint arXiv:2405.17416, 2024.
Sadeghi and Levine [2017] F. Sadeghi and S. Levine. Cad2rl: Real single-image flight without a single real image. In Proceedings of Robotics: Science and Systems, Cambridge, Massachusetts, July 2017. doi:10.15607/RSS.2017.XIII.034.
Tobin et al. [2017] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
James et al. [2017] S. James, A. J. Davison, and E. Johns. Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. In Conference on Robot Learning, pages 334–343. PMLR, 2017.
Matas et al. [2018] J. Matas, S. James, and A. J. Davison. Sim-to-real reinforcement learning for deformable object manipulation. Conference on Robot Learning, 2018.
James et al. [2019] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12627–12637, 2019.
Alghonaim and Johns [2021] R. Alghonaim and E. Johns. Benchmarking domain randomisation for visual sim-to-real transfer. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 12802–12808. IEEE, 2021.
So et al. [2022] J. So, A. Xie, S. Jung, J. Edlund, R. Thakker, A. Agha-mohammadi, P. Abbeel, and S. James. Sim-to-real via sim-to-seg: End-to-end off-road autonomous driving without real data. Conference on Robot Learning, 2022.
Xu et al. [2017] N. Xu, B. Price, S. Cohen, and T. Huang. Deep image matting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2970–2979, 2017.
Sengupta et al. [2020] S. Sengupta, V. Jayaram, B. Curless, S. M. Seitz, and I. Kemelmacher-Shlizerman. Background matting: The world is your green screen. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2291–2300, 2020.
Lin et al. [2021] S. Lin, A. Ryabtsev, S. Sengupta, B. L. Curless, S. M. Seitz, and I. Kemelmacher-Shlizerman. Real-time high-resolution background matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8762–8771, 2021.
Lin et al. [2022] S. Lin, L. Yang, I. Saleemi, and S. Sengupta. Robust high-resolution video matting with temporal guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 238–247, 2022.
Schülein et al. [2023] P. Schülein, H. Teufel, R. Vorpahl, I. Emter, Y. Bukschat, M. Pfister, N. Rathmann, S. Diehl, and M. Vetter. Comparison of synthetic dataset generation methods for medical intervention rooms using medical clothing detection as an example. EURASIP Journal on Image and Video Processing, 2023(1):12, 2023.
Coates and Ng [2010] A. Coates and A. Y. Ng. Multi-camera object detection for robotics. In 2010 IEEE International conference on robotics and automation, pages 412–419. IEEE, 2010.
Cannon [2011] E. Cannon. Greenscreen code and hints. http://gc-films.com/chromakey.html, 2011. [Accessed 15-01-2024].
Pumacay et al. [2024] W. Pumacay, I. Singh, J. Duan, R. Krishna, J. Thomason, and D. Fox. The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191, 2024.
Liu et al. [2023] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
Kirillov et al. [2023] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
Zhang et al. [2023] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
Ranftl et al. [2021] R. Ranftl, A. Bochkovskiy, and V. Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Seyde et al. [2022] T. Seyde, P. Werner, W. Schwarting, I. Gilitschenski, M. Riedmiller, D. Rus, and M. Wulfmeier. Solving continuous control via q-learning. arXiv preprint arXiv:2210.12566, 2022.
Ball et al. [2023] P. J. Ball, L. Smith, I. Kostrikov, and S. Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning, pages 1577–1594. PMLR, 2023.
Farebrother et al. [2024] J. Farebrother, J. Orbay, Q. Vuong, A. A. Taïga, Y. Chebotar, T. Xiao, A. Irpan, S. Levine, P. S. Castro, A. Faust, et al. Stop regressing: Training value functions via classification for scalable deep rl. arXiv preprint arXiv:2403.03950, 2024.
Perlin [1985] K. Perlin. An image synthesizer. ACM Siggraph Computer Graphics, 19(3):287–296, 1985.
Finn et al. [2017] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pages 357–368. PMLR, 2017.
Ronneberger et al. [2015] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
Iakubovskii [2019] P. Iakubovskii. Segmentation models pytorch. https://github.com/qubvel/segmentation_models.pytorch, 2019.
Wright [2017] S. Wright. Digital compositing for film and video: Production Workflows and Techniques. Routledge, 2017.
Li et al. [2021] H. Li, W. Zhu, H. Jin, and Y. Ma. Automatic, illumination-invariant and real-time green-screen keying using deeply guided linear models. Symmetry, 13(8):1454, 2021.
Jin et al. [2022] Y. Jin, Z. Li, D. Zhu, M. Shi, and Z. Wang. Automatic and real-time green screen keying. The Visual Computer, 38(9):3135–3147, 2022.
James et al. [2022] S. James, K. Wada, T. Laidlow, and A. J. Davison. Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13739–13748, 2022.

Appendix A Experiment Setups

In this section, we provide the detailed setups of our real-robot experiments to help reproduce the results.

Robot Setup. The robot setup consists of a 7-DoF Franka Emika Panda arm equipped with a Robotiq 2F-140 gripper. We use three RealSense D415 cameras: two mounted on the end-effector (lower wrist, upper wrist) for a wide field of view, and one (left shoulder) fixed on the base, as depicted in Fig. 8(a).

Data collection. We gather demonstrations for our tasks utilising a leader-follower setup similar to ALOHA [4]. An expert human demonstrator moves the Leader arm, and the Follower arm mirrors the Leader’s joint positions, as shown in Fig. 8(b). Camera and robot state observations are recorded at 30 FPS.

Tasks. For each task, we collect 50 demonstrations in each of two scenes: the green screen room and the living room. Fig. 11 shows the task definitions, with sketches illustrating the setup, measurements, and randomisation. For all tasks, the initial robot joint positions (in radians) are [0.0, -0.785, 0.0, -2.356, 0.0, 1.571, 0.0].

Appendix B More Visualisations

Appendix C Compute and Hyperparameter Details

We perform the preprocessing and model training using NVIDIA L4 GPUs (24GB VRAM).

ACT. We use the same implementation of ACT as described in the original paper, with the following changes to hyperparameters: action chunking size is set to 20, the number of epochs is 5000, and we sample 16 transitions per epoch. Unlike the original ACT implementation, which samples one transition per episode per epoch, we sample multiple transitions.
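
As a rough illustration of the sampling change described above, the sketch below draws several start timesteps from one episode and cuts a fixed-length action chunk at each. It assumes that the 16 transitions are drawn per episode in each epoch, and that chunks running past the end of an episode are padded by repeating the final action and flagged with a padding mask. Neither detail is stated here, and the function and constant names are hypothetical.

```python
import random

CHUNK_SIZE = 20             # action chunking size from the text
TRANSITIONS_PER_EPOCH = 16  # assumed to be per episode, per epoch


def sample_chunks(actions, rng, n=TRANSITIONS_PER_EPOCH, chunk=CHUNK_SIZE):
    """Return n (start, action_chunk, is_pad) tuples from one episode."""
    samples = []
    for _ in range(n):
        start = rng.randrange(len(actions))
        window = list(actions[start:start + chunk])
        n_real = len(window)
        # Assumption: pad short chunks by repeating the last real action.
        window += [window[-1]] * (chunk - n_real)
        is_pad = [False] * n_real + [True] * (chunk - n_real)
        samples.append((start, window, is_pad))
    return samples


# Toy episode: 50 timesteps, each "action" is just its timestep index.
episode = list(range(50))
batch = sample_chunks(episode, random.Random(0))
```

Sampling with one start index per episode (n=1) would correspond to the original scheme; larger n simply exposes more of each demonstration in every epoch.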

Table 6: Pre-processing hyperparameters for each task. Chroma key parameters are represented by Key Colour ($K$) in hexadecimal colour codes, tola ($\alpha$), and tolb ($\beta$). Detection Text Prompt is used for generative augmentation. Inpaint Text Prompt is used for both generative augmentation and GreenAug-Gen for background generation.

Task	Key Colour ($K$)	tola ($\alpha$)	tolb ($\beta$)	Detection Text Prompt	Inpaint Text Prompt
Open Drawer	#439f82	30	35	drawer. robot arm. robot gripper.	photorealistic kitchen, study room, washing room, living room, or bedroom
Place Cube in Drawer	#25806f	35	40	red cube. drawer. robot arm. robot gripper.	photorealistic kitchen, study room, washing room, living room, or bedroom
Sweep Coffee Beans	#1d6953	23	30	sponge. coffee beans. black tapes. drawer. robot arm. robot gripper.	photorealistic kitchen, study room, washing room, living room, or bedroom
Take Lid off Saucepan	#348367	15	25	saucepan. drawer. robot arm. robot gripper.	photorealistic kitchen, study room, washing room, living room, or bedroom
Place Jeans in Basket	#25806f	30	40	jeans. chair. robot arm. robot gripper.	photorealistic kitchen, study room, washing room, living room, or bedroom
Place Bear in Basket	#25806f	30	30	soft toy. basket. robot arm. robot gripper.	photorealistic kitchen, study room, washing room, living room, or bedroom
Stack Cups	#348367	15	25	blue cup. orange cup. drawer. robot arm. robot gripper.	photorealistic kitchen, study room, washing room, living room, or bedroom
Slide Book and Pick Up	#25806f	20	30	book. table. robot arm. robot gripper.	photorealistic kitchen, study room, washing room, living room, or bedroom
Object Generalisation	#699230	30	20	green cup. drawer. robot arm. robot gripper.	colourful cup, bowl, cube, toy, can, bottle or general graspable object
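
To make the roles of $K$, $\alpha$, and $\beta$ concrete, the following is a minimal per-pixel sketch of a tolerance-based chroma key in the style of Cannon [38]. It assumes that colour distance is measured in the CbCr plane after a full-range BT.601 RGB-to-YCbCr conversion, that pixels closer than $\alpha$ to the key colour are treated as pure background, and that the mask ramps linearly to pure foreground at $\beta$. The function names are illustrative, and the exact variant used for the experiments may differ.

```python
import math


def hex_to_rgb(code):
    """'#25806f' -> (37, 128, 111)."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_cbcr(r, g, b):
    """Full-range BT.601 chroma components (luma is ignored)."""
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return cb, cr


def foreground_alpha(pixel, key, tola, tolb):
    """0.0 = background (keyed out), 1.0 = foreground, linear in between."""
    cb, cr = rgb_to_cbcr(*pixel)
    key_cb, key_cr = rgb_to_cbcr(*key)
    dist = math.hypot(cb - key_cb, cr - key_cr)
    if dist < tola:
        return 0.0
    if dist < tolb:
        return (dist - tola) / (tolb - tola)
    return 1.0


def composite(pixel, texture_pixel, key, tola, tolb):
    """Blend a texture pixel into the keyed-out part of one image pixel."""
    a = foreground_alpha(pixel, key, tola, tolb)
    return tuple(round(a * p + (1 - a) * t) for p, t in zip(pixel, texture_pixel))


key = hex_to_rgb("#25806f")  # "Place Cube in Drawer" row of Table 6
print(foreground_alpha(key, key, 35, 40))            # 0.0
print(foreground_alpha((200, 30, 30), key, 35, 40))  # 1.0
```

Because the branches are tested in order, rows of Table 6 where tolb is not larger than tola reduce to a hard threshold at tola in this sketch, with no linear ramp.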

GreenAug-Mask U-Net. We use the original U-Net architecture [50] (implemented by Iakubovskii [51]) for the masking network used in GreenAug-Mask. The model comprises 14.3 million parameters.

Table 7: Masking network hyperparameters for GreenAug-Mask.

Model	Unet
Encoder	ResNet18
Encoder Weights	ImageNet
Epochs	100
Batch size	128
Image size	$224\times 224$
Seed	42
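
The configuration in Table 7 could be instantiated with the segmentation_models.pytorch package [51] roughly as follows. This is a sketch under assumptions: the three-channel RGB input and single-channel mask output are not stated in the table, and the helper and dictionary names are hypothetical. The third-party packages are imported inside the function so that the configuration itself can be inspected without them installed.

```python
MASK_NET_CONFIG = {
    "encoder_name": "resnet18",
    "encoder_weights": "imagenet",
    "epochs": 100,
    "batch_size": 128,
    "image_size": (224, 224),
    "seed": 42,
}


def build_mask_network(config=MASK_NET_CONFIG):
    """Build a U-Net that predicts a one-channel green-screen mask."""
    import segmentation_models_pytorch as smp  # third-party dependency
    import torch

    torch.manual_seed(config["seed"])
    return smp.Unet(
        encoder_name=config["encoder_name"],
        encoder_weights=config["encoder_weights"],  # fetches pretrained weights
        in_channels=3,  # assumption: RGB input
        classes=1,      # assumption: binary foreground/background mask
    )
```

Summing `p.numel()` over `build_mask_network().parameters()` is a quick way to compare a given setup against the parameter count quoted above.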

Appendix D Detailed Results

This section presents the full unaggregated results.

Table 8: Full experiment results. “Green Screen→Green Screen” roughly represents the upper bound performance (in parentheses) and is not included in the average. Full unaggregated results for each task are in Tables 9, 10, 11, 12, 13, 14, 15 and 16. The tables are also hyperlinked in the task text below.

Task	Train Scene	Test Scene	Success Rate (%): NoAug / CVAug / Generative Augmentation / GreenAug-Rand / GreenAug-Gen / GreenAug-Mask
Open Drawer	Green Screen	Green Screen	(100) (88) (96) (100)
Open Drawer	Living Room	3 Novel Scenes
Open Drawer	Green Screen	3 Novel Scenes
Place Cube in Drawer	Green Screen	Green Screen	(92) (96) (72) (100) (84) (96)
Place Cube in Drawer	Living Room	3 Novel Scenes
Place Cube in Drawer	Green Screen	3 Novel Scenes
Sweep Coffee Beans	Green Screen	Green Screen	(100) (96) (88) (96) (80) (92)
Sweep Coffee Beans	Living Room	3 Novel Scenes
Sweep Coffee Beans	Green Screen	3 Novel Scenes
Take Lid off Saucepan	Green Screen	Green Screen	(96) (84) (92) (80) (84)
Take Lid off Saucepan	Living Room	3 Novel Scenes
Take Lid off Saucepan	Green Screen	3 Novel Scenes
Place Jeans in Basket	Green Screen	Green Screen	(100) (92) (100) (92)
Place Jeans in Basket	Living Room	3 Novel Scenes
Place Jeans in Basket	Green Screen	3 Novel Scenes
Place Bear in Basket	Green Screen	Green Screen	(100) 100 (96) (100)
Place Bear in Basket	Living Room	3 Novel Scenes
Place Bear in Basket	Green Screen	3 Novel Scenes
Stack Cups	Green Screen	Green Screen	(76) (84) (88) (80)
Stack Cups	Living Room	3 Novel Scenes
Stack Cups	Green Screen	3 Novel Scenes
Slide Book and Pick Up	Green Screen	Green Screen	(100)
Slide Book and Pick Up	Living Room	3 Novel Scenes
Slide Book and Pick Up	Green Screen	3 Novel Scenes
Average

Table 9: “Open Drawer” task unaggregated results.

Train Scene	Test Scene	Success Rate (%): NoAug / CVAug / Generative Augmentation / GreenAug-Rand / GreenAug-Gen / GreenAug-Mask
Green Screen	Green Screen	(100) (88) (96) (100)
Living Room	Novel Scene 1
Living Room	Novel Scene 2
Living Room	Novel Scene 3
Green Screen	Novel Scene 1	100
Green Screen	Novel Scene 2	100
Green Screen	Novel Scene 3	100

Table 10: “Place Cube in Drawer” task unaggregated results.

Train Scene	Test Scene	Success Rate (%): NoAug / CVAug / Generative Augmentation / GreenAug-Rand / GreenAug-Gen / GreenAug-Mask
Green Screen	Green Screen	(92) (96) (72) (100) (84) (96)
Living Room	Novel Scene 1
Living Room	Novel Scene 2
Living Room	Novel Scene 3
Green Screen	Novel Scene 1
Green Screen	Novel Scene 2
Green Screen	Novel Scene 3

Table 11: “Sweep Coffee Beans” task unaggregated results.

Train Scene	Test Scene	Success Rate (%): NoAug / CVAug / Generative Augmentation / GreenAug-Rand / GreenAug-Gen / GreenAug-Mask
Green Screen	Green Screen	(100) (96) (88) (96) (80) (92)
Living Room	Novel Scene 1
Living Room	Novel Scene 2
Living Room	Novel Scene 3
Green Screen	Novel Scene 1	100
Green Screen	Novel Scene 2
Green Screen	Novel Scene 3

Table 12: “Take Lid off Saucepan” task unaggregated results.

Train Scene	Test Scene	Success Rate (%): NoAug / CVAug / Generative Augmentation / GreenAug-Rand / GreenAug-Gen / GreenAug-Mask
Green Screen	Green Screen	(96) (84) (92) (80) (84)
Living Room	Novel Scene 1
Living Room	Novel Scene 2
Living Room	Novel Scene 3
Green Screen	Novel Scene 1
Green Screen	Novel Scene 2
Green Screen	Novel Scene 3

Table 13: “Place Jeans in Basket” task unaggregated results.

Train Scene	Test Scene	Success Rate (%): NoAug / CVAug / Generative Augmentation / GreenAug-Rand / GreenAug-Gen / GreenAug-Mask
Green Screen	Green Screen	(100) (92) (100) (92)
Living Room	Novel Scene 1
Living Room	Novel Scene 2
Living Room	Novel Scene 3
Green Screen	Novel Scene 1
Green Screen	Novel Scene 2
Green Screen	Novel Scene 3

Table 14: “Place Bear in Basket” task unaggregated results.

Train Scene	Test Scene	Success Rate (%): NoAug / CVAug / Generative Augmentation / GreenAug-Rand / GreenAug-Gen / GreenAug-Mask
Green Screen	Green Screen	(100) (96) (100)
Living Room	Novel Scene 1
Living Room	Novel Scene 2
Living Room	Novel Scene 3
Green Screen	Novel Scene 1	100
Green Screen	Novel Scene 2
Green Screen	Novel Scene 3	100

Table 15: “Stack Cups” task unaggregated results.

Train Scene	Test Scene	Success Rate (%): NoAug / CVAug / Generative Augmentation / GreenAug-Rand / GreenAug-Gen / GreenAug-Mask
Green Screen	Green Screen	(76) (84) (88) (80)
Living Room	Novel Scene 1
Living Room	Novel Scene 2
Living Room	Novel Scene 3
Green Screen	Novel Scene 1
Green Screen	Novel Scene 2
Green Screen	Novel Scene 3

Table 16: “Slide Book and Pick Up” task unaggregated results.

Train Scene	Test Scene	Success Rate (%): NoAug / CVAug / Generative Augmentation / GreenAug-Rand / GreenAug-Gen / GreenAug-Mask
Green Screen	Green Screen	(100)
Living Room	Novel Scene 1
Living Room	Novel Scene 2
Living Room	Novel Scene 3
Green Screen	Novel Scene 1
Green Screen	Novel Scene 2
Green Screen	Novel Scene 3

Table 17: Texture randomness unaggregated results (GreenAug-Rand). “Green Screen→Green Screen” roughly represents the upper bound performance (in parentheses) and is not included in the average.

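Task	Train Scene	Test Scene	Success Rate (%): None / Solid Textures / Perlin Textures / MIL Textures
Place Cube in Drawer	Green Screen	Green Screen	(92) (100)
Place Cube in Drawer	Green Screen	Novel Scene 1
Place Cube in Drawer	Green Screen	Novel Scene 2
Place Cube in Drawer	Green Screen	Novel Scene 3
Stack Cups	Green Screen	Green Screen	(76) (80) (88)
Stack Cups	Green Screen	Novel Scene 1
Stack Cups	Green Screen	Novel Scene 2
Stack Cups	Green Screen	Novel Scene 3
Average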

Table 18: Green screen coverage unaggregated results (GreenAug-Rand). “Green Screen→Green Screen” roughly represents the upper bound performance (in parentheses) and is not included in the average.

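Task	Train Scene	Test Scene	Success Rate (%): 25% / 50% / 75% / 100%
Place Cube in Drawer	Green Screen	Green Screen	(92) (100)
Place Cube in Drawer	Green Screen	Novel Scene 1
Place Cube in Drawer	Green Screen	Novel Scene 2
Place Cube in Drawer	Green Screen	Novel Scene 3
Stack Cups	Green Screen	Green Screen	(76) (80) (76) (88)
Stack Cups	Green Screen	Novel Scene 1
Stack Cups	Green Screen	Novel Scene 2
Stack Cups	Green Screen	Novel Scene 3
Average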

Table 19: Object generalisation unaggregated results. Data is collected on the green cup, and policies are then trained and evaluated on various objects (illustrated in Fig. 18).

	Success Rate (%)
Object Type	NoAug	GreenAug-Rand	GreenAug-Gen
Green Cup	96	88	80
Blue Cup	96	80	80
Orange Cup	92	80	80
Red Cube	0	44	60
Green Cube	0	20	44
Soda Can	40	28	36
Soya Can	36	64	44
Soft Toy	0	84	72
Average	45	61	62
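
The Average row of Table 19 is the unweighted mean over the eight object types. The short check below, with the values copied from the table in row order, reproduces it.

```python
# Success rates (%) copied from Table 19, in row order.
results = {
    "NoAug":         [96, 96, 92, 0, 0, 40, 36, 0],
    "GreenAug-Rand": [88, 80, 80, 44, 20, 28, 64, 84],
    "GreenAug-Gen":  [80, 80, 80, 60, 44, 36, 44, 72],
}
averages = {method: sum(v) / len(v) for method, v in results.items()}
print(averages)
# {'NoAug': 45.0, 'GreenAug-Rand': 61.0, 'GreenAug-Gen': 62.0}
```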

Appendix E Additional Limitations and Future Work

Exploration of better chroma key algorithms. The chroma key algorithm used in this paper [38] is a basic one that performs reasonably well but does not produce perfect masks, and some tuning of the parameters $K$, $\alpha$, and $\beta$ is still necessary. Despite these imperfections, we demonstrate that GreenAug still significantly outperforms the baselines. In the film industry, extensive manual post-processing is often required to achieve perfect masks [52]. Future research could explore more advanced chroma key algorithms that provide superior green screen masks [17, 53, 54, 18]. This could potentially enhance the performance of GreenAug-Mask, which relies heavily on the green screen masks as ground truth for training.

Pose generalisation. A major ongoing challenge in robot learning is generalising to 6D poses not present in the training dataset. Current robot learning policies, especially imitation learning-based ones, often fail when objects are relocated to different positions within 3D space.

Application to methods with 3D observations. Currently, GreenAug has only been tested on RGB-based robot learning policies. Recent advances in next-best-pose-based agents [55, 3, 6] have demonstrated that aligning the observation space with the action space yields strong generalisation in robot learning policies. As a general plug-and-play method, GreenAug could potentially further improve the scene generalisation of next-best-pose agents, which we leave for future study.