This paper addresses the semantic labeling of urban remote sensing images into land cover maps. We exploit the prior knowledge that cities are composed of comparable spatial arrangements of urban objects, such as buildings. To do so, we cluster OpenStreetMap (OSM) building footprints into groups with similar local statistics, corresponding to different types of urban zones. We then use the per-cluster expected building fraction, within a Conditional Random Field (CRF), to correct the over- and underrepresentation of classes predicted by a Convolutional Neural Network (CNN). Results indicate a substantial improvement in both the numerical and the visual accuracy of the labeled maps.