Homogenization of daily temperature series is a fundamental step in climatological analyses. In recent decades, several methods have been developed that differ in their statistical and procedural approaches. In this study, four homogenization methods (together with two variants) are tested and compared. The comparison relies on a benchmark dataset constructed by replacing segments of homogeneous series with simultaneous measurements from neighboring homogeneous series. This generates inhomogeneous series (the test set) whose homogeneous version (the benchmark set) is known. Two benchmark datasets are created. The first is based on series from the Czech Republic and features high data quality, high station density, and a large number of reference series. The second uses stations from across Europe and presents more challenges, such as missing segments, low station density, and a scarcity of reference series. The comparison is performed with predefined metrics that measure the statistical distance between the homogenized versions and the benchmark. Almost all homogenization methods perform well on the near-ideal benchmark (maximum relative root mean square error (rRMSE): 1.01), whereas on the European dataset the methods diverge and the rRMSE increases up to 1.87. Analyses of the percentages of non-adjusted inhomogeneous data (up to 39%) and of substantial differences in the trends among the homogenized versions helped identify diverging procedural characteristics of the methods. These results add new elements to the debate about homogenization methods for daily values and motivate the use of realistic and challenging datasets in evaluating their robustness and flexibility.
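The benchmark construction described above can be illustrated with a minimal sketch. It assumes two aligned daily series stored as NumPy arrays and uses synthetic data; the function names `make_test_series` and `rmse`, the segment bounds, and the 1.5 °C offset are hypothetical choices for illustration and are not taken from the study. The plain RMSE shown here is only a stand-in for the study's metrics: the normalization behind the reported rRMSE is not specified in this text, so it is not reproduced.

```python
import numpy as np


def make_test_series(benchmark, neighbor, start, stop):
    """Copy `benchmark` and overwrite days [start, stop) with `neighbor` values.

    The result is an inhomogeneous series whose homogeneous counterpart
    (the unmodified `benchmark`) is known by construction.
    """
    benchmark = np.asarray(benchmark, dtype=float)
    neighbor = np.asarray(neighbor, dtype=float)
    if benchmark.shape != neighbor.shape:
        raise ValueError("both series must cover the same days")
    test = benchmark.copy()
    test[start:stop] = neighbor[start:stop]
    return test


def rmse(candidate, benchmark):
    """Root mean square error between two series, ignoring missing (NaN) days."""
    diff = np.asarray(candidate, dtype=float) - np.asarray(benchmark, dtype=float)
    return float(np.sqrt(np.nanmean(diff ** 2)))


# Synthetic illustration (assumed data, not from the study):
# ten years of daily values with a seasonal cycle and random noise.
rng = np.random.default_rng(0)
days = np.arange(3650)
seasonal = 10.0 + 8.0 * np.sin(2.0 * np.pi * days / 365.25)
benchmark = seasonal + rng.normal(0.0, 1.0, days.size)
# The neighbor shares the seasonal cycle but is offset by an assumed 1.5 degrees.
neighbor = seasonal + 1.5 + rng.normal(0.0, 1.0, days.size)

test = make_test_series(benchmark, neighbor, start=1000, stop=2000)
print("RMSE of test set against benchmark:", rmse(test, benchmark))
```

A homogenized version of `test` produced by any method could be passed to `rmse` in the same way, so that a smaller distance to `benchmark` indicates a more successful adjustment under this simplified measure.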