[PATCH] download smallest files first but always use one parallel job to download large files concurrently
When downloading small files the download time seems dominated by connection latency rather than bandwidth. Downloading several small files in parallel is an effective way to reduce the impact of latency. However, downloading many small files in parallel does not always saturate bandwidth, which is inefficient. This commit attempts to fully utilize parallelism to download small files while maintaining high bandwidth utilization.

This is accomplished by downloading the smallest files first using all but 1 parallel connection while downloading a large file using the remaining parallel connection. The result seems to maintain a more stable download speed throughout entire transactions. This is in contrast to the usual behavior I observed when downloading many packages, where the download speed progressively declines as smaller packages are downloaded.

When my entire cache is deleted and all packages are redownloaded using ` pacman -Qqn | sudo pacman -Sw -`, the mean download speed is 47.8 MiB/s. After this patch, the mean download speed is 54.0 MiB/s. In terms of time savings, this patch causes a 14.9 GiB download to go from 5 minutes 20 seconds to 4 minutes 43 seconds on my system and network.

Your mileage may vary on different systems, networks, and selections of packages. I expect there to be virtually no effect on selections of packages that are all fairly large.

Signed-off-by: Thomas Reichel <tom.p.reichel@gmail.com>
---
 lib/libalpm/dload.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/lib/libalpm/dload.c b/lib/libalpm/dload.c
index 7c344760..4d8777f9 100644
--- a/lib/libalpm/dload.c
+++ b/lib/libalpm/dload.c
@@ -884,7 +884,7 @@ cleanup:
 }
 
 /*
- * Use to sort payloads by max size in decending order (largest -> smallest)
+ * Use to sort payloads by max size in ascending order (smallest -> largest)
  */
 static int compare_dload_payload_sizes(const void *left_ptr, const void *right_ptr)
 {
@@ -893,7 +893,7 @@ static int compare_dload_payload_sizes(const void *left_ptr, const void *right_p
 	left = (struct dload_payload *) left_ptr;
 	right = (struct dload_payload *) right_ptr;
 
-	return right->max_size - left->max_size;
+	return left->max_size - right->max_size;
 }
 
 /* Returns -1 if an error happened for a required file
@@ -910,18 +910,39 @@ static int curl_download_internal(alpm_handle_t *handle,
 	int updated = 0; /* was a file actually updated */
 	CURLM *curlm = handle->curlm;
 	size_t payloads_size = alpm_list_count(payloads);
+	/* download a single large file along with smaller files */
+	struct dload_payload *large_file = NULL;
+	/* a reverse iterator for payloads */
+	alpm_list_t* payloads_reverse;
 
 	/* Sort payloads by package size */
 	payloads = alpm_list_msort(payloads, payloads_size, &compare_dload_payload_sizes);
 
+	payloads_reverse = alpm_list_last(payloads);
+
 	while(active_downloads_num > 0 || payloads) {
 		CURLMcode mc;
 
 		for(; active_downloads_num < max_streams && payloads; active_downloads_num++) {
-			struct dload_payload *payload = payloads->data;
+			struct dload_payload *payload;
+			if (large_file == NULL) {
+				payload = payloads_reverse->data;
+			} else {
+				payload = payloads->data;
+			}
 
 			if(curl_add_payload(handle, curlm, payload, localpath) == 0) {
-				payloads = payloads->next;
+				if (payloads_reverse == payloads) {
+					/* pointers met in the middle, we finished by advancing either */
+					payloads = NULL;
+					break;
+				}
+				if (large_file == NULL) {
+					large_file = payloads_reverse->data;
+					payloads_reverse = payloads_reverse->prev;
+				} else {
+					payloads = payloads->next;
+				}
 			} else {
 				/* The payload failed to start. Do not start any new downloads.
 				 * Wait until all active downloads complete.
@@ -959,6 +980,16 @@ static int curl_download_internal(alpm_handle_t *handle,
 				payloads = NULL;
 				err = -1;
 			} else if(ret == 0) {
+				struct dload_payload *finished_payload;
+				CURLcode curlerr;
+				/* which file downloaded? */
+				curlerr = curl_easy_getinfo(msg->easy_handle, CURLINFO_PRIVATE, &finished_payload);
+				ASSERT(curlerr == CURLE_OK, RET_ERR(handle, ALPM_ERR_LIBCURL, -1));
+				/* did we just finish the large file that was downloading? */
+				if (large_file != NULL && finished_payload == large_file) {
+					large_file = NULL;
+				}
+
 				updated = 1;
 			}
 		} else {
-- 
2.43.1
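As an illustration of the scheduling idea above (not part of the patch), the following self-contained sketch walks a size-sorted array with two cursors, handing the largest remaining file to a single "large" slot and the smallest remaining files to the other slots until the cursors meet. The package names, sizes, and the point at which the large download is assumed to finish are made up; the real patch operates on libalpm's alpm_list_t and reacts to actual curl completion events rather than this toy simulation.

#include <stdio.h>
#include <stdlib.h>

struct file { const char *name; long size_mib; };

static int cmp_size_asc(const void *a, const void *b)
{
	const struct file *l = a, *r = b;
	return (l->size_mib > r->size_mib) - (l->size_mib < r->size_mib);
}

int main(void)
{
	/* made-up package names and sizes, purely for illustration */
	struct file files[] = {
		{ "tzdata", 1 }, { "iso-codes", 4 }, { "vim", 10 }, { "glibc", 50 },
		{ "gcc", 180 }, { "qt6-base", 300 }, { "linux-firmware", 900 },
	};
	size_t n = sizeof(files) / sizeof(files[0]);
	size_t small = 0, large = n - 1; /* cursors at both ends of the sorted list */
	int large_busy = 0;              /* is the single "large file" slot in use? */
	int small_started = 0;

	qsort(files, n, sizeof(files[0]), cmp_size_asc);

	while(small <= large) {
		if(!large_busy) {
			/* the large-file slot is free: hand it the largest remaining file */
			printf("large slot: %s (%ld MiB)\n", files[large].name, files[large].size_mib);
			large_busy = 1;
			if(large == small) break; /* cursors met: nothing left to start */
			large--;
		} else {
			/* every other slot works through the list from the small end */
			printf("small slot: %s (%ld MiB)\n", files[small].name, files[small].size_mib);
			small_started++;
			if(small == large) break;
			small++;
		}
		/* crude stand-in for a completion event: pretend the large download
		 * finishes after every three small files have been started */
		if(small_started == 3) {
			small_started = 0;
			large_busy = 0;
		}
	}
	return 0;
}

Running the sketch prints the largest package first, then the smallest ones in ascending order, with another large package claimed whenever the large slot is (pretend-)freed, which is the start order the patch aims for.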
Hello Thomas,

On 2024-02-14 04:02, Thomas Reichel wrote:
When downloading small files the download time seems dominated by connection latency rather than bandwidth. Downloading several small files in parallel is an effective way to reduce the impact of latency. However, downloading many small files in parallel does not always saturate bandwidth, which is inefficient. This commit attempts to fully utilize parallelism to download small files while maintaining high bandwidth utilization.
I agree that letting the small files take over the download concurrency, so to speak, degrades the overall performance of parallelized downloading. I've observed that many times.
This is accomplished by downloading the smallest files first using all but 1 parallel connection while downloading a large file using the remaining parallel connection. The result seems to maintain a more stable download speed throughout entire transactions. This is in contrast to the usual behavior I observed when downloading many packages, where the download speed progressively declines as smaller packages are downloaded.
Just to make sure I understood this correctly, first the largest file is selected and downloaded using one connection, while the remaining connections are used to download small files in parallel? The process continues until the small files are all downloaded, at which point the remaining large files are downloaded using all connections in parallel? Could you please let me know whether I'm correct there?
When my entire cache is deleted and all packages are redownloaded using ` pacman -Qqn | sudo pacman -Sw -`, the mean download speed is 47.8 MiB/s. After this patch, the mean download speed is 54.0 MiB/s. In terms of time savings, this patch causes a 14.9 GiB download to go from 5 minutes 20 seconds to 4 minutes 43 seconds on my system and network.
Your mileage may vary on different systems, networks, and selections of packages. I expect there to be virtually no effect on selections of packages that are all fairly large.
Hi Dragan,

Assuming there are `n` available parallel connections: initially, the largest file is selected and downloaded using one connection while the other `n-1` connections download the smallest files in parallel.

As files are downloaded, the `n-1` connections continue to work through the file list from smallest to largest while the one remaining connection continues down the list in the opposite direction -- largest to smallest. Eventually, the two groups of connections meet somewhere in the middle of the file list, having completed all of the larger and smaller files, and that's how we know we're done. Hopefully we run out of small files at some point before the end, so that when the two groups of connections "meet in the middle", all connections are working on moderately sized files -- though that's not guaranteed.

I hope this is more clear.

On 2/13/24 22:36, Dragan Simic wrote:
Hello Thomas,
On 2024-02-14 04:02, Thomas Reichel wrote:
When downloading small files the download time seems dominated by connection latency rather than bandwidth. Downloading several small files in parallel is an effective way to reduce the impact of latency. However, downloading many small files in parallel does not always saturate bandwidth, which is inefficient. This commit attempts to fully utilize parallelism to download small files while maintaining high bandwidth utilization.
I agree that letting the small files take over the download concurrency, so to speak, degrades the overall performance of parallelized downloading. I've observed that many times.
This is accomplished by downloading the smallest files first using all but 1 parallel connection while downloading a large file using the remaining parallel connection. The result seems to maintain a more stable download speed throughout entire transactions. This is in contrast to the usual behavior I observed when downloading many packages, where the download speed progressively declines as smaller packages are downloaded.
Just to make sure I understood this correctly, first the largest file is selected and downloaded using one connection, while the remaining connections are used to download small files in parallel? The process continues until the small files are all downloaded, at which point the remaining large files are downloaded using all connections in parallel?
Could you please let me know whether I'm correct there?
When my entire cache is deleted and all packages are redownloaded using ` pacman -Qqn | sudo pacman -Sw -`, the mean download speed is 47.8 MiB/s. After this patch, the mean download speed is 54.0 MiB/s. In terms of time savings, this patch causes a 14.9 GiB download to go from 5 minutes 20 seconds to 4 minutes 43 seconds on my system and network.
Your mileage may vary on different systems, networks, and selections of packages. I expect there to be virtually no effect on selections of packages that are all fairly large.
On 2024-02-14 07:07, Thomas Reichel wrote:
Assuming there are `n` available parallel connections: initially, the largest file is selected and downloaded using one connection while the other `n-1` connections download the smallest files in parallel.
As files are downloaded, the `n-1` connections continue to work through the file list from smallest to largest while the one remaining connection continues down the list in the opposite direction -- largest to smallest. Eventually, the two groups of connections meet somewhere in the middle of the file list, having completed all of the larger and smaller files, and that's how we know we're done. Hopefully we run out of small files at some point before the end, so that when the two groups of connections "meet in the middle", all connections are working on moderately sized files -- though that's not guaranteed.
I hope this is more clear.
Thanks for the additional clarification! Yes, that's how I understood it initially, but just wanted to make sure we're on the same page. The proposed improvements look perfect to me.
On 2/13/24 22:36, Dragan Simic wrote:
Hello Thomas,
On 2024-02-14 04:02, Thomas Reichel wrote:
When downloading small files the download time seems dominated by connection latency rather than bandwidth. Downloading several small files in parallel is an effective way to reduce the impact of latency. However, downloading many small files in parallel does not always saturate bandwidth, which is inefficient. This commit attempts to fully utilize parallelism to download small files while maintaining high bandwidth utilization.
I agree that letting the small files take over the download concurrency, so to speak, degrades the overall performance of parallelized downloading. I've observed that many times.
This is accomplished by downloading the smallest files first using all but 1 parallel connection while downloading a large file using the remaining parallel connection. The result seems to maintain a more stable download speed throughout entire transactions. This is in contrast to the usual behavior I observed when downloading many packages, where the download speed progressively declines as smaller packages are downloaded.
Just to make sure I understood this correctly, first the largest file is selected and downloaded using one connection, while the remaining connections are used to download small files in parallel? The process continues until the small files are all downloaded, at which point the remaining large files are downloaded using all connections in parallel?
Could you please let me know whether I'm correct there?
When my entire cache is deleted and all packages are redownloaded using ` pacman -Qqn | sudo pacman -Sw -`, the mean download speed is 47.8 MiB/s. After this patch, the mean download speed is 54.0 MiB/s. In terms of time savings, this patch causes a 14.9 GiB download to go from 5 minutes 20 seconds to 4 minutes 43 seconds on my system and network.
Your mileage may vary on different systems, networks, and selections of packages. I expect there to be virtually no effect on selections of packages that are all fairly large.
On 14/2/24 13:02, Thomas Reichel wrote:
When downloading small files the download time seems dominated by connection latency rather than bandwidth. Downloading several small files in parallel is an effective way to reduce the impact of latency. However, downloading many small files in parallel does not always saturate bandwidth, which is inefficient. This commit attempts to fully utilize parallelism to download small files while maintaining high bandwidth utilization.
This is accomplished by downloading the smallest files first using all but 1 parallel connection while downloading a large file using the remaining parallel connection. The result seems to maintain a more stable download speed throughout entire transactions. This is in contrast to the usual behavior I observed when downloading many packages, where the download speed progressively declines as smaller packages are downloaded.
When my entire cache is deleted and all packages are redownloaded using ` pacman -Qqn | sudo pacman -Sw -`, the mean download speed is 47.8 MiB/s. After this patch, the mean download speed is 54.0 MiB/s. In terms of time savings, this patch causes a 14.9 GiB download to go from 5 minutes 20 seconds to 4 minutes 43 seconds on my system and network.
Your mileage may vary on different systems, networks, and selections of packages. I expect there to be virtually no effect on selections of packages that are all fairly large.
I just tested on my current update and it was a 20% deficit upon applying this patch. I guess there is less benefit (or negative benefit...) if your download speed is slower and your latency relatively lower.

The sorting based on size was implemented after a back of the envelope calculation that showed this was a good choice over a range of possible speeds/latency/package size combinations. I won't consider this patch without a more thorough analysis.

Also, patches should go via: https://gitlab.archlinux.org/pacman/pacman

Allan
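This is not necessarily the calculation Allan refers to, but one crude way to put numbers on the latency-versus-bandwidth tradeoff under discussion is to estimate a single stream's transfer time as roughly RTT + size/bandwidth and look at how much of that is latency. The sketch below does exactly that; the RTTs, per-stream bandwidth, and file sizes are made-up figures for illustration only.

/*
 * Toy estimate: for a file of size S on a stream with round-trip time RTT
 * and per-stream bandwidth B, transfer time is roughly RTT + S/B, so the
 * share of time spent on latency is RTT / (RTT + S/B). All figures are
 * invented for illustration.
 */
#include <stdio.h>

int main(void)
{
	const double rtts_s[]   = { 0.02, 0.2 };          /* 20 ms and 200 ms RTT */
	const double stream_bw  = 10.0 * 1024 * 1024;     /* 10 MiB/s per stream */
	const double sizes_b[]  = { 50e3, 5e6, 500e6 };   /* 50 kB, 5 MB, 500 MB */

	for(int r = 0; r < 2; r++) {
		for(int s = 0; s < 3; s++) {
			double xfer = sizes_b[s] / stream_bw;
			double frac = rtts_s[r] / (rtts_s[r] + xfer);
			printf("RTT %5.0f ms, size %8.2f MB: %4.1f%% of the time is latency\n",
					rtts_s[r] * 1000, sizes_b[s] / 1e6, frac * 100);
		}
	}
	return 0;
}

Under these made-up numbers a 50 kB file spends 80-98% of its time on latency while a 500 MB file spends well under 1%, which is the intuition behind sorting by size in the first place; it says nothing about which ordering wins on any particular link.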
Hello Allan,

On 2024-02-14 08:53, Allan McRae wrote:
On 14/2/24 13:02, Thomas Reichel wrote:
When downloading small files the download time seems dominated by connection latency rather than bandwidth. Downloading several small files in parallel is an effective way to reduce the impact of latency. However, downloading many small files in parallel does not always saturate bandwidth, which is inefficient. This commit attempts to fully utilize parallelism to download small files while maintaining high bandwidth utilization.
This is accomplished by downloading the smallest files first using all but 1 parallel connection while downloading a large file using the remaining parallel connection. The result seems to maintain a more stable download speed throughout entire transactions. This is in contrast to the usual behavior I observed when downloading many packages, where the download speed progressively declines as smaller packages are downloaded.
When my entire cache is deleted and all packages are redownloaded using ` pacman -Qqn | sudo pacman -Sw -`, the mean download speed is 47.8 MiB/s. After this patch, the mean download speed is 54.0 MiB/s. In terms of time savings, this patch causes a 14.9 GiB download to go from 5 minutes 20 seconds to 4 minutes 43 seconds on my system and network.
Your mileage may vary on different systems, networks, and selections of packages. I expect there to be virtually no effect on selections of packages that are all fairly large.
I just tested on my current update and it was a 20% deficit upon applying this patch. I guess there is less benefit (or negative benefit...) if your download speed is slower and your latency relatively lower.
The sorting based on size was implemented after a back of the envelope calculation that showed this was a good choice over a range of possible speeds/latency/package size combinations. I won't consider this patch without a more thorough analysis.
Maybe it's, at least partially, down to the effects of bufferbloat at the upstream router, which may let the single large-file download "drown out" the small-file downloads, causing the performance regression. Of course, that just adds another variable into the equation. If that's the case, maybe limiting the download speed of the single large-file download could result in better performance. With the emphasis on "maybe", of course.
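If someone wanted to experiment with that idea, libcurl already provides a per-transfer receive-speed cap that could be applied to just the easy handle carrying the large file. The sketch below only illustrates the mechanism and is not a proposed change; the helper name and cap value are hypothetical, and whether throttling actually helps under bufferbloat would need measuring.

/*
 * Minimal sketch, assuming one would cap only the easy handle used for the
 * large file. Helper name and cap value are hypothetical; not pacman code.
 */
#include <curl/curl.h>

static void cap_large_download(CURL *easy, curl_off_t bytes_per_sec)
{
	/* CURLOPT_MAX_RECV_SPEED_LARGE throttles a single transfer to roughly
	 * the given number of bytes per second (0 disables the limit). */
	curl_easy_setopt(easy, CURLOPT_MAX_RECV_SPEED_LARGE, bytes_per_sec);
}

/* e.g. cap_large_download(easy, (curl_off_t)20 * 1024 * 1024);  -- ~20 MiB/s */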
Also, patches should go via: https://gitlab.archlinux.org/pacman/pacman
Hi Allan,

I ran a fairly comprehensive test over random sets of packages of varying lengths across varying mirrors to simulate different bandwidths and latencies, and my result is that for 10-100 packages the geometric mean ratio of download times between the unpatched and patched versions of pacman is between 0.98 and 1.02 with 95% confidence -- which is to say this has virtually no effect on typical performance.

This patch may have a statistically significant effect on very very large numbers of packages, like my initial test where I downloaded about 3000 packages all at once, but that is hard to get a confidence interval on because it would require repeatedly doing 14+ GB downloads. I got an email from my ISP when I ran the tests with just 10-100 packages, so I won't be running the 1000 package tests any time soon. Not to mention that the typical update does not contain thousands of packages.

Maybe if the number of packages in a typical user's `pacman -Syu` increases dramatically over time we can revisit this, but until then I think the current behavior is ideal.

Best,
Tom

On 2/14/24 01:53, Allan McRae wrote:
On 14/2/24 13:02, Thomas Reichel wrote:
When downloading small files the download time seems dominated by connection latency rather than bandwidth. Downloading several small files in parallel is an effective way to reduce the impact of latency. However, downloading many small files in parallel does not always saturate bandwidth, which is inefficient. This commit attempts to fully utilize parallelism to download small files while maintaining high bandwidth utilization.
This is accomplished by downloading the smallest files first using all but 1 parallel connection while downloading a large file using the remaining parallel connection. The result seems to maintain a more stable download speed throughout entire transactions. This is in contrast to the usual behavior I observed when downloading many packages, where the download speed progressively declines as smaller packages are downloaded.
When my entire cache is deleted and all packages are redownloaded using ` pacman -Qqn | sudo pacman -Sw -`, the mean download speed is 47.8 MiB/s. After this patch, the mean download speed is 54.0 MiB/s. In terms of time savings, this patch causes a 14.9 GiB download to go from 5 minutes 20 seconds to 4 minutes 43 seconds on my system and network.
Your mileage may vary on different systems, networks, and selections of packages. I expect there to be virtually no effect on selections of packages that are all fairly large.
I just tested on my current update and it was a 20% deficit upon applying this patch. I guess there is less benefit (or negative benefit...) if your download speed is slower and your latency relatively lower.
The sorting based on size was implemented after a back of the envelope calculation that showed this was a good choice over a range of possible speeds/latency/package size combinations. I won't consider this patch without a more thorough analysis.
Also, patches should go via: https://gitlab.archlinux.org/pacman/pacman
Allan
On 16/2/24 12:16, Thomas Reichel wrote:
Hi Allan,
I ran a fairly comprehensive test over random sets of packages of varying lengths across varying mirrors to simulate different bandwidths and latencies, and my result is that for 10-100 packages the geometric mean ratio of download times between the unpatched and patched versions of pacman is between 0.98 and 1.02 with 95% confidence -- which is to say this has virtually no effect on typical performance.
This patch may have a statistically significant effect on very very large numbers of packages, like my initial test where I downloaded about 3000 packages all at once, but that is hard to get a confidence interval on because it would require repeatedly doing 14+ GB downloads. I got an email from my ISP when I ran the tests with just 10-100 packages, so I won't be running the 1000 package tests any time soon. Not to mention that the typical update does not contain thousands of packages.
Maybe if the number of packages in a typical user's `pacman -Syu` increases dramatically over time we can revisit this, but until then I think the current behavior is ideal.
Thanks - great to have some benchmarks done! I am really surprised it made near zero difference.

Cheers,
Allan