Effects of different synthetic training data on real test data for semantic segmentation
(1) St. Andrew's College, Aurora, Ontario, Canada, (2) Robotics Institute, University of Michigan, Ann Arbor, Michigan
https://doi.org/10.59720/22-090
Semantic segmentation, the task of labelling each pixel in an image with a specific class, requires models trained on large amounts of manually collected and labeled data. These datasets are often lacking in many domains, such as specific weather events and climates, and they lack variety: most depict city and street scenes rather than rural ones. Synthetic-to-real domain adaptation can be used to fill those gaps. Although there has been prior work on how to generate virtual data, in this paper we present insight into how features of the training data affect the model and, ultimately, the outcome, as well as into the dataset selection process. We hypothesized that training on different synthetic datasets would affect the outcome of semantic segmentation on real test data. Using Python and deep learning models, we trained two iterations of a model on two separate virtual datasets, evaluated both on real-world data, and compared and analyzed the differences and similarities the datasets yielded for semantic segmentation. Our findings shed light on the dataset selection process, enabling researchers and practitioners to make more informed decisions when choosing the appropriate dataset type for semantic segmentation tasks, and they reveal the limitations of substituting real datasets with virtual counterparts.
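The comparison described above, evaluating two segmentation models trained on different synthetic datasets against the same real test labels, is typically scored with mean intersection-over-union (mIoU). The abstract does not state the paper's exact metric or code, so the following is only a minimal illustrative sketch of an mIoU comparison on toy label maps; the array names and sizes are assumptions, not the authors' data.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union between a predicted and a ground-truth
    label map (H x W integer arrays), a standard segmentation metric."""
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c)
        target_c = (target == c)
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:  # class absent from both maps: skip it
            continue
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 label maps with two classes (hypothetical stand-ins for the
# predictions of models trained on two different virtual datasets).
gt = np.array([[0, 0], [1, 1]])       # real-world ground truth
pred_a = np.array([[0, 0], [1, 1]])   # model trained on synthetic set A
pred_b = np.array([[0, 1], [1, 1]])   # model trained on synthetic set B

print(mean_iou(pred_a, gt, 2))  # -> 1.0 (perfect overlap)
print(mean_iou(pred_b, gt, 2))  # -> ~0.583 (one mislabeled pixel)
```

Reporting mIoU per training dataset on the same real test set makes the two model iterations directly comparable, since the metric is insensitive to class imbalance in a way raw pixel accuracy is not.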