@@ -425,4 +425,117 @@ Also note that BERT Large engines, especially using mixed precision with large b
425
425
426
426
# ## Results
427
427
428
- To be published soon.
428
+ The following sections provide details on how we achieved our performance and inference.
429
+
430
+ # ### Inference performance: NVIDIA A100 (40GB)
431
+
432
+ Our results were obtained by running the ` scripts/inference_benchmark.sh --gpu Ampere` script in the container generated by the TensorRT OSS Dockerfile on NVIDIA A100 with (1x A100 40G) GPUs.
433
+
434
+ # #### BERT Base
435
+
436
+ | Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
437
+ | -----------------| ------------| -----------------| -----------------| ---------| -----------------| -----------------| ---------|
438
+ | | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
439
+ | 128 | 1 | 0.33 | 0.97 | 0.58 | 0.75 | 0.75 | 0.72 |
440
+ | 128 | 2 | 0.78 | 0.79 | 0.63 | 0.84 | 1.07 | 0.84 |
441
+ | 128 | 4 | 0.76 | 0.98 | 0.76 | 1.13 | 1.46 | 1.14 |
442
+ | 128 | 8 | 1.08 | 1.08 | 0.98 | 1.66 | 1.81 | 1.66 |
443
+ | 128 | 12 | 1.26 | 1.63 | 1.27 | 2.07 | 2.07 | 2.07 |
444
+ | 128 | 16 | 1.47 | 1.48 | 1.47 | 2.48 | 2.49 | 2.48 |
445
+ | 128 | 24 | 2.13 | 2.13 | 2.13 | 3.47 | 3.49 | 3.46 |
446
+ | 128 | 32 | 2.54 | 2.83 | 2.54 | 4.37 | 4.40 | 4.34 |
447
+ | 128 | 64 | 4.58 | 4.59 | 4.54 | 8.70 | 8.79 | 8.65 |
448
+ | 128 | 128 | 9.04 | 9.06 | 8.97 | 17.05 | 17.07 | 16.90 |
449
+ | 384 | 1 | 1.15 | 1.15 | 1.15 | 1.43 | 1.44 | 1.43 |
450
+ | 384 | 2 | 1.37 | 1.37 | 1.37 | 1.84 | 2.21 | 1.84 |
451
+ | 384 | 4 | 1.73 | 1.74 | 1.73 | 2.47 | 2.48 | 2.47 |
452
+ | 384 | 8 | 2.51 | 2.51 | 2.51 | 3.77 | 3.80 | 3.76 |
453
+ | 384 | 12 | 3.61 | 3.62 | 3.61 | 5.36 | 5.37 | 5.30 |
454
+ | 384 | 16 | 4.39 | 4.40 | 4.38 | 7.32 | 7.32 | 7.24 |
455
+ | 384 | 24 | 6.24 | 6.24 | 6.23 | 10.50 | 10.51 | 10.41 |
456
+ | 384 | 32 | 8.42 | 8.50 | 8.42 | 14.32 | 14.44 | 14.27 |
457
+ | 384 | 64 | 16.48 | 16.52 | 16.36 | 27.51 | 27.54 | 27.33 |
458
+ | 384 | 128 | 31.71 | 31.78 | 31.58 | | | |
459
+
460
+ # #### BERT Large
461
+
462
+ | Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
463
+ | -----------------| ------------| -----------------| -----------------| ---------| -----------------| -----------------| ---------|
464
+ | | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
465
+ | 128 | 1 | 1.24 | 1.56 | 1.24 | 1.73 | 2.11 | 1.73 |
466
+ | 128 | 2 | 1.49 | 1.49 | 1.49 | 2.20 | 2.20 | 2.20 |
467
+ | 128 | 4 | 1.91 | 1.92 | 1.91 | 3.22 | 3.23 | 3.22 |
468
+ | 128 | 8 | 2.94 | 2.94 | 2.93 | 4.84 | 4.84 | 4.83 |
469
+ | 128 | 12 | 3.34 | 3.34 | 3.34 | 5.95 | 5.96 | 5.90 |
470
+ | 128 | 16 | 4.63 | 4.64 | 4.62 | 7.98 | 7.99 | 7.90 |
471
+ | 128 | 24 | 5.87 | 5.88 | 5.87 | 11.05 | 11.08 | 10.94 |
472
+ | 128 | 32 | 7.99 | 7.99 | 7.98 | 14.74 | 14.77 | 14.59 |
473
+ | 128 | 64 | 14.74 | 17.74 | 14.56 | 28.09 | 28.25 | 27.85 |
474
+ | 128 | 128 | 28.32 | 23.38 | 28.03 | 54.38 | 54.40 | 54.12 |
475
+ | 384 | 1 | 2.80 | 2.80 | 2.80 | 3.49 | 3.49 | 3.48 |
476
+ | 384 | 2 | 3.12 | 3.13 | 3.12 | 4.71 | 4.72 | 4.71 |
477
+ | 384 | 4 | 4.27 | 4.27 | 4.27 | 6.70 | 6.71 | 6.70 |
478
+ | 384 | 8 | 7.66 | 7.67 | 7.66 | 12.41 | 12.53 | 12.37 |
479
+ | 384 | 12 | 10.07 | 10.08 | 10.07 | 17.63 | 17.76 | 17.56 |
480
+ | 384 | 16 | 13.34 | 13.34 | 13.33 | 23.40 | 23.46 | 23.19 |
481
+ | 384 | 24 | 19.36 | 19.38 | 19.22 | 34.34 | 34.36 | 34.10 |
482
+ | 384 | 32 | 25.56 | 25.60 | 25.56 | 44.94 | 44.98 | 44.78 |
483
+ | 384 | 64 | 49.84 | 49.92 | 49.60 | 87.26 | 87.56 | 86.77 |
484
+ | 384 | 128 | 97.66 | 97.78 | 97.06 | 170.85 | 171.00 | 170.08 |
485
+
486
+
487
+ # ### Inference performance: NVIDIA T4 (16GB)
488
+
489
+ Our results were obtained by running the ` scripts/inference_benchmark.sh --gpu Turing` script in the container generated by the TensorRT OSS Dockerfile on NVIDIA T4 with (1x T4 16G) GPUs.
490
+
491
+ # #### BERT Base
492
+
493
+ | Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
494
+ | -----------------| ------------| -----------------| -----------------| ---------| -----------------| -----------------| ---------|
495
+ | | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
496
+ | 128 | 1 | 1.55 | 1.57 | 1.33 | 2.00 | 2.06 | 1.93 |
497
+ | 128 | 2 | 1.78 | 2.06 | 1.75 | 2.54 | 2.58 | 2.49 |
498
+ | 128 | 4 | 2.80 | 2.88 | 2.74 | 4.25 | 4.34 | 4.16 |
499
+ | 128 | 8 | 4.48 | 4.56 | 4.42 | 8.13 | 8.74 | 7.88 |
500
+ | 128 | 12 | 6.28 | 6.31 | 6.12 | 11.67 | 12.12 | 11.30 |
501
+ | 128 | 16 | 8.92 | 9.11 | 8.78 | 17.24 | 17.79 | 16.70 |
502
+ | 128 | 24 | 12.70 | 12.84 | 12.53 | 24.48 | 24.85 | 24.90 |
503
+ | 128 | 32 | 17.90 | 18.41 | 17.59 | 33.02 | 33.51 | 32.65 |
504
+ | 128 | 64 | 34.80 | 34.83 | 34.31 | 65.38 | 65.43 | 64.28 |
505
+ | 128 | 128 | 68.16 | 68.46 | 67.05 | 130.77 | 131.01 | 129.19 |
506
+ | 384 | 1 | 2.47 | 2.53 | 2.43 | 3.76 | 3.81 | 3.69 |
507
+ | 384 | 2 | 3.87 | 3.95 | 3.81 | 6.31 | 6.43 | 6.21 |
508
+ | 384 | 4 | 7.15 | 7.18 | 6.97 | 12.16 | 12.22 | 12.03 |
509
+ | 384 | 8 | 14.09 | 12.11 | 13.73 | 25.45 | 25.83 | 24.94 |
510
+ | 384 | 12 | 20.99 | 21.12 | 20.66 | 38.15 | 38.38 | 37.51 |
511
+ | 384 | 16 | 27.49 | 27.65 | 27.08 | 50.90 | 51.36 | 50.04 |
512
+ | 384 | 24 | 41.93 | 42.17 | 41.36 | 77.25 | 78.16 | 76.05 |
513
+ | 384 | 32 | 54.65 | 54.87 | 54.06 | 102.44 | 103.09 | 101.30 |
514
+ | 384 | 64 | 109.78 | 110.42 | 108.24 | 200.58 | 201.20 | 198.68 |
515
+ | 384 | 128 | 227.46 | 228.80 | 223.92 | 401.33 | 402.14 | 399.24 |
516
+
517
+ # #### BERT Large
518
+
519
+ | Sequence Length | Batch Size | INT8 Latency (ms) | | | FP16 Latency (ms) | | |
520
+ | -----------------| ------------| -----------------| -----------------| ---------| -----------------| -----------------| ---------|
521
+ | | | 95th Percentile | 99th Percentile | Average | 95th Percentile | 99th Percentile | Average |
522
+ | 128 | 1 | 3.59 | 3.61 | 3.51 | 5.10 | 5.18 | 5.02 |
523
+ | 128 | 2 | 4.93 | 5.03 | 4.83 | 7.72 | 7.73 | 7.58 |
524
+ | 128 | 4 | 8.15 | 8.19 | 7.93 | 13.67 | 13.85 | 13.56 |
525
+ | 128 | 8 | 14.21 | 14.23 | 13.89 | 26.88 | 27.66 | 26.35 |
526
+ | 128 | 12 | 22.41 | 22.47 | 21.91 | 41.04 | 41.29 | 40.30 |
527
+ | 128 | 16 | 29.30 | 29.83 | 28.82 | 55.04 | 55.27 | 54.05 |
528
+ | 128 | 24 | 44.60 | 44.63 | 43.92 | 81.24 | 82.28 | 79.59 |
529
+ | 128 | 32 | 60.88 | 61.48 | 58.97 | 114.13 | 114.47 | 112.78 |
530
+ | 128 | 64 | 111.78 | 112.02 | 110.77 | 224.24 | 225.02 | 221.97 |
531
+ | 128 | 128 | 223.99 | 224.28 | 222.33 | 417.56 | 418.54 | 415.33 |
532
+ | 384 | 1 | 7.18 | 7.27 | 7.07 | 11.74 | 11.96 | 11.51 |
533
+ | 384 | 2 | 12.22 | 12.25 | 11.92 | 21.47 | 21.61 | 20.97 |
534
+ | 384 | 4 | 35.95 | 36.43 | 35.63 | 42.03 | 42.35 | 41.36 |
535
+ | 384 | 8 | 47.06 | 47.22 | 46.41 | 83.16 | 83.51 | 82.06 |
536
+ | 384 | 12 | 66.04 | 66.04 | 65.89 | 127.10 | 127.99 | 127.46 |
537
+ | 384 | 16 | 87.98 | 88.45 | 87.13 | 164.13 | 165.12 | 161.96 |
538
+ | 384 | 24 | 132.56 | 132.96 | 131.24 | 262.76 | 263.68 | 258.96 |
539
+ | 384 | 32 | 179.44 | 180.61 | 176.66 | 329.99 | 331.67 | 325.59 |
540
+ | 384 | 64 | 352.81 | 353.39 | 350.21 | 684.19 | 686.39 | 674.76 |
541
+ | 384 | 128 | 706.85 | 707.73 | 704.38 | 1318.74 | 1320.22 | 1315.10 |
0 commit comments