In this work, we presented a model that efficiently balances local representations obtained by convolution blocks and global representations obtained by transformer blocks. The proposed model outperforms the previously standard decoder architecture, DeepLabV3, by at least 1% in Jaccard index while using fewer parameters. In the best case, the improvement reaches 7%. As future work, we plan to experiment with (1) pretraining on the MS COCO dataset and (2) hyperparameter search.
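For readers less familiar with the metric behind these comparisons, the Jaccard index (intersection over union) of a predicted and a ground-truth binary mask can be sketched as below. This is a minimal illustration only: the function name `jaccard_index`, the flat 0/1 mask representation, and the convention for two empty masks are assumptions made here, not details of the evaluation protocol used in this work.

```python
def jaccard_index(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labelled positive in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Pixels labelled positive in at least one mask.
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Hypothetical toy masks, flattened to one dimension.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
score = jaccard_index(pred, target)  # 2 shared pixels / 4 pixels in the union
```

In multi-class segmentation the index is commonly computed per class and then averaged; whether and how that averaging is done is left open in this sketch.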