WEBVTT

00:00:00.000 --> 00:00:04.120
My name is Chuan Yan, and today I will introduce our work,

00:00:04.120 --> 00:00:09.080
Deep Sketch Vectorization via Implicit Surface Extraction.

00:00:09.080 --> 00:00:13.980
This work is a collaboration with Yong Li from South China University of Technology,

00:00:13.980 --> 00:00:19.320
Deepali Aneja and Matthew Fisher from Adobe, Edgar Simo-Serra from Waseda University,

00:00:19.320 --> 00:00:23.880
and Yotam Gingold from George Mason University.

00:00:23.880 --> 00:00:29.960
Accurately extracting vector lines from a raster sketch is beneficial in many respects.

00:00:29.960 --> 00:00:34.000
First, vector lines and junctions can preserve important information

00:00:34.000 --> 00:00:37.200
that is difficult to extract from the bitmap alone,

00:00:37.200 --> 00:00:41.720
for example, the artist's drawing intent, or the 2D or 3D shape.

00:00:41.720 --> 00:00:45.620
Also, it is easy to select only the target vector strokes

00:00:45.620 --> 00:00:48.580
and adjust them without touching their neighbors.

00:00:48.580 --> 00:00:54.560
Many downstream sketch processing applications require vector lines as input.

00:00:54.560 --> 00:00:58.640
And last, vector lines are necessary for much research on human vision abstraction

00:00:58.640 --> 00:01:04.320
and expression, such as sketch retrieval or sketch recognition.

00:01:04.320 --> 00:01:09.280
However, there are still challenges for existing sketch vectorization methods.

00:01:09.280 --> 00:01:13.800
Frame field-based methods and fully end-to-end data-driven methods

00:01:13.800 --> 00:01:17.680
all struggle to extract accurate vector paths in sketch regions

00:01:17.680 --> 00:01:20.600
that contain dense strokes.

00:01:20.600 --> 00:01:23.660
And it is difficult to further reduce the running time

00:01:23.660 --> 00:01:28.840
if the raster sketch contains complex structures and numerous junctions.

00:01:28.840 --> 00:01:35.200
For example, for a 384 by 512 raster sketch,

00:01:35.200 --> 00:01:41.000
current methods take more than 40 seconds to finish.

00:01:41.000 --> 00:01:43.960
We propose a new sketch vectorization method

00:01:43.960 --> 00:01:49.900
that provides much higher vectorization fidelity and faster execution time.

00:01:49.900 --> 00:01:54.480
Additionally, our intermediate vector path representation naturally supports

00:01:54.480 --> 00:01:58.280
further interactive topology refinement, if necessary.

00:01:58.280 --> 00:02:02.240
All the results shown in these slides are fully automatic,

00:02:02.240 --> 00:02:06.640
unless stated otherwise.

00:02:06.640 --> 00:02:11.040
We designed our vectorization framework in three stages:

00:02:11.040 --> 00:02:17.600
centerline encoding, vector line reconstruction, and post-refinement.

00:02:17.600 --> 00:02:20.040
In the first stage, centerline encoding,

00:02:20.040 --> 00:02:24.040
we train a distance field prediction network that extracts information

00:02:24.040 --> 00:02:28.840
such as key points, undersampling maps, and centerlines from the given raster sketch,

00:02:28.840 --> 00:02:33.900
and all of this information is encoded as unsigned distance fields.

00:02:33.900 --> 00:02:37.480
Then we train another network to reconstruct the initial vector paths

00:02:37.480 --> 00:02:41.700
from the encoded centerline, which is predicted in our first stage.
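To make the centerline encoding concrete, here is a minimal Python sketch of how a vector path can be encoded as an unsigned distance field; the grid layout and all names are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def point_segment_distance(p, a, b):
        # Distance from point p to the segment from a to b (2D numpy arrays).
        ab = b - a
        t = np.clip(np.dot(p - a, ab) / max(np.dot(ab, ab), 1e-12), 0.0, 1.0)
        return np.linalg.norm(p - (a + t * ab))

    def unsigned_distance_field(polyline, height, width):
        # polyline: (N, 2) array of vertices in pixel coordinates.
        udf = np.full((height, width), np.inf)
        for y in range(height):
            for x in range(width):
                center = np.array([x + 0.5, y + 0.5])  # grid cell center
                for a, b in zip(polyline[:-1], polyline[1:]):
                    udf[y, x] = min(udf[y, x], point_segment_distance(center, a, b))
        return udf

    # Example: encode a short diagonal stroke on a 16x16 grid.
    stroke = np.array([[2.0, 2.0], [13.0, 11.0]])
    field = unsigned_distance_field(stroke, 16, 16)

In the actual pipeline the field is predicted by a network from the raster sketch rather than computed from known vector paths; this brute-force version only illustrates what the encoding represents.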
00:02:41.700 --> 00:02:44.440
Then, after getting the initial vector lines,

00:02:44.440 --> 00:02:46.640
their junctions will be automatically refined

00:02:46.640 --> 00:02:51.440
based on the predicted key points and undersampling maps.

00:02:51.440 --> 00:02:54.000
Finally, to further refine the output,

00:02:54.000 --> 00:02:57.180
we propose a dual contouring downsampling method

00:02:57.180 --> 00:03:00.820
and a simple line grouping method to turn the vector segments

00:03:00.820 --> 00:03:04.080
into vector strokes with optimized file size.

00:03:04.080 --> 00:03:08.280
Now, let's dive into the technical details.

00:03:08.280 --> 00:03:11.980
One of our core contributions is that we formulate sketch vectorization

00:03:11.980 --> 00:03:15.500
as an implicit surface reconstruction task,

00:03:15.500 --> 00:03:17.500
solved with dual contouring in our method.

00:03:17.500 --> 00:03:22.560
Therefore, we will briefly review its reconstruction logic in 2D first.

00:03:22.560 --> 00:03:24.300
Given a target vector sketch,

00:03:24.300 --> 00:03:29.000
we sample the canvas into evenly distributed grid cells.

00:03:29.000 --> 00:03:32.120
For each cell, we compute the closest distance

00:03:32.120 --> 00:03:36.720
from the cell center to the vector path as an unsigned distance field.

00:03:36.720 --> 00:03:41.000
Since this field implicitly encodes the vector path as a 2D surface,

00:03:41.000 --> 00:03:43.360
we can use dual contouring to reconstruct it

00:03:43.360 --> 00:03:46.400
as vector segments based on this field.

00:03:46.400 --> 00:03:49.640
To be more specific, we can infer the edge flags

00:03:49.640 --> 00:03:52.460
and edge vertices from it.

00:03:52.460 --> 00:03:55.940
An edge flag records the intersection between the target vector sketch

00:03:55.940 --> 00:04:01.580
and a grid edge, and the edge vertex in each cell

00:04:01.580 --> 00:04:09.520
is placed onto the vector path if the cell contains a single vector path.

00:04:09.520 --> 00:04:13.880
Each cell can have only one edge vertex.

00:04:13.880 --> 00:04:18.000
Then the reconstruction simply connects any two neighboring vertices

00:04:18.000 --> 00:04:20.880
if the shared grid edge has a flag.

00:04:20.880 --> 00:04:23.900
However, this reconstruction can be inaccurate

00:04:23.900 --> 00:04:27.560
when our sampling rate in the vector space is insufficient,

00:04:27.560 --> 00:04:29.600
and it is topologically impossible

00:04:29.600 --> 00:04:33.360
to reconstruct high-valence junctions with dual contouring.

00:04:33.360 --> 00:04:36.720
We term those regions undersampled regions and encode them as undersampling maps.

00:04:36.720 --> 00:04:39.760
To solve this problem, we detect those undersampled regions

00:04:39.760 --> 00:04:44.640
and extract the junction key points.

00:04:44.640 --> 00:04:47.860
Then we can make a correction by dropping all the line segments

00:04:47.860 --> 00:04:50.880
in the undersampled regions and rebuilding the junction

00:04:50.880 --> 00:04:54.720
by connecting the key points to all the truncated endpoints.

00:04:54.720 --> 00:04:58.000
To sum up, to get the reconstructed vector sketch,

00:04:58.000 --> 00:05:01.440
we need to derive these five pieces of information

00:05:01.440 --> 00:05:05.780
needed as our training ground truth.

00:05:05.780 --> 00:05:10.640
The upper three are the ground truth for our distance field prediction network,

00:05:10.640 --> 00:05:15.840
and the lower two are the ground truth for our line reconstruction network.
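The connection rule just described is simple enough to sketch directly. Below is a hedged, minimal 2D dual contouring reconstruction in Python (the array layout and all names are assumptions): every cell stores at most one vertex, and two neighboring cell vertices are joined whenever the grid edge their cells share carries a flag.

    import numpy as np

    def dual_contour_2d(vertices, h_flags, v_flags):
        # vertices: (H, W, 2) array; vertices[y, x] is the vertex of cell (y, x).
        # h_flags: (H, W-1) bools; h_flags[y, x] flags the grid edge shared by
        #          cells (y, x) and (y, x+1).
        # v_flags: (H-1, W) bools; v_flags[y, x] flags the grid edge shared by
        #          cells (y, x) and (y+1, x).
        # Returns line segments as ((x0, y0), (x1, y1)) tuples.
        H, W, _ = vertices.shape
        segments = []
        for y in range(H):
            for x in range(W - 1):
                if h_flags[y, x]:
                    segments.append((tuple(vertices[y, x]), tuple(vertices[y, x + 1])))
        for y in range(H - 1):
            for x in range(W):
                if v_flags[y, x]:
                    segments.append((tuple(vertices[y, x]), tuple(vertices[y + 1, x])))
        return segments

In the actual method the flags and vertices are predicted by a neural dual contouring network; this sketch only shows the deterministic connection rule applied afterward.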
00:05:15.840 --> 00:05:20.600
It is worth noting that, although the reconstructed vector sketch

00:05:20.600 --> 00:05:24.020
is not equal to the target vector sketch,

00:05:24.020 --> 00:05:27.120
this design of intermediate vector representation

00:05:27.120 --> 00:05:29.360
greatly reduces our training difficulty

00:05:29.360 --> 00:05:34.800
and naturally supports interactive junction refinement.

00:05:34.800 --> 00:05:40.680
We then create our dataset with 63K vector sketches from public datasets

00:05:40.680 --> 00:05:44.540
and rasterize them with seven different brush styles.

00:05:44.540 --> 00:05:49.600
This ends up with 441K raster sketches as our training data.

00:05:49.600 --> 00:05:52.940
One group of raster sketch examples is shown below.

00:05:52.940 --> 00:05:55.400
We formulate the distance field prediction

00:05:55.400 --> 00:05:57.740
as an image-to-image translation task

00:05:57.740 --> 00:06:02.360
and train a U-Net-like network on this created data.

00:06:02.360 --> 00:06:05.120
We use a masked L1 distance as our training loss to predict

00:06:05.120 --> 00:06:07.640
three types of unsigned distance fields,

00:06:07.640 --> 00:06:10.680
which encode the centerlines, undersampling maps,

00:06:10.680 --> 00:06:13.140
and key points, respectively.

00:06:13.140 --> 00:06:15.940
To reduce the number of undersampled regions

00:06:15.940 --> 00:06:19.080
and improve the smoothness of the reconstructed curves,

00:06:19.080 --> 00:06:23.200
we double the resolution of the output unsigned distance field,

00:06:36.120 --> 00:06:40.360
which is equivalent to doubling the sampling rate in the vector domain.

00:06:40.360 --> 00:06:43.360
This also leads to a resolution increase

00:06:43.360 --> 00:06:47.800
for both the corresponding edge flags and edge vertices.

00:06:47.800 --> 00:06:50.400
Then we adapt the recent neural dual contouring

00:06:50.400 --> 00:06:53.080
and apply a series of modifications

00:06:53.080 --> 00:06:57.620
optimized particularly for the 2D vectorization task.

00:06:57.620 --> 00:07:00.860
Please see our paper for more details.

00:07:00.860 --> 00:07:03.020
Now I'm going to give more details about the logic

00:07:03.020 --> 00:07:05.340
of how we refine the topology.

00:07:05.340 --> 00:07:09.340
Given a target vector sketch sampled on a 4x4 grid,

00:07:09.340 --> 00:07:12.240
we can easily see that the reconstructed vector lines

00:07:12.240 --> 00:07:15.160
cannot preserve the X junction here,

00:07:15.160 --> 00:07:19.740
even when all the edge flags are predicted correctly.

00:07:19.740 --> 00:07:22.820
But we can still fix this by predicting an undersampling map

00:07:22.820 --> 00:07:28.920
along with the X junction key point.

00:07:28.920 --> 00:07:32.540
We remove all the line segments inside that undersampled region

00:07:32.540 --> 00:07:37.260
and then connect all the truncated endpoints to the predicted key point.
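As a hedged illustration of this correction step (the circular region test and all names are simplifying assumptions, not the authors' exact procedure), the sketch below drops segments inside an undersampled region and reconnects the truncated endpoints to the predicted key point.

    import numpy as np

    def refine_junction(segments, keypoint, region_radius):
        # segments: list of ((x0, y0), (x1, y1)) tuples.
        # keypoint: (x, y) predicted junction position.
        # region_radius: radius of the undersampled region (an assumed,
        # simplified circular region test).
        k = np.asarray(keypoint)

        def inside(p):
            return np.linalg.norm(np.asarray(p) - k) < region_radius

        kept, truncated = [], []
        for a, b in segments:
            if inside(a) and inside(b):
                continue                # drop segments fully inside the region
            elif inside(a):
                truncated.append(b)     # b is the truncated endpoint
            elif inside(b):
                truncated.append(a)
            else:
                kept.append((a, b))
        # Rebuild the junction: connect each truncated endpoint to the key point.
        kept.extend((p, tuple(keypoint)) for p in truncated)
        return kept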
00:07:37.260 --> 00:07:40.840
While reconstructing high-valence junctions,

00:07:40.840 --> 00:07:42.560
or star junctions in other words,

00:07:42.560 --> 00:07:45.020
is a historically challenging problem,

00:07:45.020 --> 00:07:50.240
our refinement logic provides one simple solution to it.

00:07:50.240 --> 00:07:53.700
Then, to further reduce the number of unnecessary vector segments,

00:07:53.700 --> 00:07:55.400
we provide an undersampling-map-based

00:07:55.400 --> 00:07:57.880
dual contouring downsampling method.

00:07:57.880 --> 00:08:00.560
To turn the vector segments into strokes,

00:08:00.560 --> 00:08:02.840
we propose a simple line grouping method

00:08:02.840 --> 00:08:06.200
that iteratively finds the shortest path; a code sketch of this idea appears below.

00:08:06.200 --> 00:08:10.880
For more details, please check our paper.

00:08:10.880 --> 00:08:13.120
Then we use our sketch cleanup benchmark

00:08:13.120 --> 00:08:17.720
to create a vectorization test set.

00:08:17.720 --> 00:08:22.720
The test set contains 369 clean raster sketches

00:08:22.720 --> 00:08:25.540
with their clean vector ground truth.

00:08:25.540 --> 00:08:30.640
We compare our method with the other methods

00:08:30.640 --> 00:08:33.880
using metrics such as Chamfer distance and running time,

00:08:33.880 --> 00:08:38.680
under six different input sketch resolution levels.

00:08:38.680 --> 00:08:43.640
Please note that all metrics along the y-axis are log-scaled.

00:08:43.640 --> 00:08:47.480
The experiments show that our method consistently performs better

00:08:47.480 --> 00:08:50.880
and runs faster than all other methods.

00:08:50.880 --> 00:08:54.440
Here we show a comparison of vectorization results.

00:08:54.440 --> 00:08:58.440
As we can see, our method faithfully captures much more detail,

00:08:58.440 --> 00:09:02.620
even in a tiny portion of the raster sketch.

00:09:02.620 --> 00:09:06.920
We also test all methods on another test set,

00:09:06.920 --> 00:09:11.000
which contains 112 rough sketches in the wild.

00:09:11.000 --> 00:09:14.400
This comparison further shows our method's capability

00:09:14.400 --> 00:09:17.260
of capturing stroke details.

00:09:17.260 --> 00:09:21.940
It is also worth mentioning that our method successfully vectorizes

00:09:21.940 --> 00:09:24.120
all the rough sketches at faster speed

00:09:24.120 --> 00:09:27.580
compared to the frame field-based methods.

00:09:27.580 --> 00:09:31.100
Here are two more vectorization results on complex rough sketches

00:09:31.100 --> 00:09:33.660
that only our method could vectorize.

00:09:33.660 --> 00:09:42.120
Please note that the zoomed-in region is only about 1/15 to 1/25 of the full sketch.

00:09:42.120 --> 00:09:46.100
Here is one failure case that shows the limitation of our method.

00:09:46.100 --> 00:09:49.920
Our method will output broken and messy lines when the input sketch

00:09:49.920 --> 00:09:53.700
contains densely repeated strokes or very thick strokes.

00:09:53.700 --> 00:09:56.840
Although this vectorization result could be improved by either

00:09:56.840 --> 00:10:03.040
downsampling the raster input or applying pre-processing such as line extraction,

00:10:03.040 --> 00:10:11.600
we believe the fundamental reason

00:10:11.600 --> 00:10:14.200
is our parallelized training strategy,

00:10:14.200 --> 00:10:17.240
which makes our network struggle when predicting the centerline

00:10:17.240 --> 00:10:20.000
and junctions on such strokes.
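Returning to the line grouping step mentioned earlier, here is a hedged sketch of the idea (assumptions: the segment soup is treated as an undirected graph, and "shortest" means fewest segments via breadth-first search; the authors' actual cost function may differ). Strokes are peeled off by repeatedly extracting a shortest endpoint-to-endpoint path.

    from collections import defaultdict, deque

    def group_segments_into_strokes(segments):
        # segments: list of (point_a, point_b); points must be hashable tuples.
        graph = defaultdict(set)
        for a, b in segments:
            if a == b:
                continue  # skip degenerate segments
            graph[a].add(b)
            graph[b].add(a)
        strokes = []
        while True:
            active = [n for n in graph if graph[n]]
            if not active:
                break
            endpoints = [n for n in active if len(graph[n]) == 1]
            start = endpoints[0] if endpoints else active[0]
            # BFS: shortest (fewest-segment) path to another endpoint.
            parent, queue, last, goal = {start: None}, deque([start]), start, None
            while queue:
                node = queue.popleft()
                last = node
                if node != start and len(graph[node]) == 1:
                    goal = node
                    break
                for nxt in graph[node]:
                    if nxt not in parent:
                        parent[nxt] = node
                        queue.append(nxt)
            if goal is None:
                goal = last  # closed loop: take the farthest node reached
            # Recover the path and consume its edges from the graph.
            path = [goal]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            path.reverse()
            for u, v in zip(path, path[1:]):
                graph[u].discard(v)
                graph[v].discard(u)
            strokes.append(path)
        return strokes

Each extracted path becomes one stroke, and its edges are removed before the next iteration, so the loop terminates once every segment has been assigned.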
00:10:20.000 --> 00:10:25.020
In the future, we could consider an adaptive sampling rate,

00:10:25.020 --> 00:10:30.120
enforce more natural junctions, and vectorize more stroke attributes.

00:10:30.120 --> 00:10:34.580
Lastly, I want to thank all my collaborators,

00:10:34.580 --> 00:10:36.360
and thank you for listening.

00:10:36.360 --> 00:10:40.760
(audience applauds)