[pacman-dev] delayed writes and fsync
Hi,

You have probably heard about this:
http://linux.slashdot.org/article.pl?sid=09/03/11/2031231

Here is the link to the bug report that describes the cause of the problem:
https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54

A key part of the bug report comment:

    Applications are expected to use fsync() or fdatasync(), and if that
    impacts their performance too much, to use a single berkdb or other
    binary database file, and not do something stupid with hundreds of
    tiny text files that only hold a few bytes of data in each text file.

Hmmm... does something sound familiar there?

So what should we do? Maybe add some sort of syncing after a db write. But then, after each file or just at the end of the transaction?

Allan
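[For reference, the pattern the bug report comment is asking for looks roughly like the sketch below. The helper name and arguments are illustrative, not pacman code: flush stdio's userspace buffer into the kernel, ask the kernel to commit the data to the device, and only then close.

#include <stdio.h>
#include <unistd.h>	/* fdatasync */

/* Write a small text file and make sure its contents actually reach
 * the disk before we consider the operation complete. */
static int write_and_sync(const char *path, const char *data)
{
	FILE *fp = fopen(path, "w");
	if(fp == NULL) {
		return -1;
	}
	fputs(data, fp);
	if(fflush(fp) != 0 || fdatasync(fileno(fp)) != 0) {
		fclose(fp);
		return -1;
	}
	return fclose(fp);
}

fdatasync() is used rather than fsync() here because only the file's data and size matter, not metadata such as timestamps, which can save a journal commit on some filesystems.]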
On Thu, Mar 12, 2009 at 9:19 AM, Allan McRae <allan@archlinux.org> wrote:
Hi,
You have probably heard about this: http://linux.slashdot.org/article.pl?sid=09/03/11/2031231
Here is the link to the bug report that describes the cause of the problem: https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54
A key part of the bug report comment... Applications are expected to use fsync() or fdatasync(), and if that impacts their performance too much, to use a single berkdb or other binary database file, and not do something stupid with hundreds of tiny text files that only hold a few bytes of data in each text file.
Hmmm... does something sound familiar there? So what should we do? Maybe add some sort of syncing after a db write. But then, after each file or just at the end of the transaction?
If I remember correctly, the code does this:

pkg1: extract files; write db entry
pkg2: extract files; write db entry
etc.

So yeah, it could make sense to sync after each db write to have a consistent database after each package install. Otherwise, go go sqlite :)
On 12/03/2009, at 7:18 PM, Xavier wrote:
Otherwise, go go sqlite :)
Wasn't there a (set of) patch(es) for a "packed" format for the databases (tar-like, iirc)? From what I remember there was a performance improvement.
Sebastian Nowicki wrote:
On 12/03/2009, at 7:18 PM, Xavier wrote:
Otherwise, go go sqlite :)
Wasn't there a (set of) patch(es) for a "packed" format for the databases (tar-like, iirc)? From what I remember there was a performance improvement.
There was talk about moving to a tar-based back-end but I don't remember any patches. There were some patches for an sqlite backend, but given we already need libarchive to extract the packages, a tar-based backend makes more sense to me.

Allan
Sebastian Nowicki wrote:
On 12/03/2009, at 7:18 PM, Xavier wrote:
Otherwise, go go sqlite :)
Wasn't there a (set of) patch(es) for a "packed" format for the databases (tar-like, iirc)? From what I remember there was a performance improvement.
There was talk about moving to a tar-based back-end but I don't remember any patches. There were some patches for an sqlite backend, but given we already need libarchive to extract the packages, a tar-based backend makes more sense to me.
Allan
A tar backend is optimal for read-only (== sync) databases, but I am not sure it would work for the local database. See also: http://bugs.archlinux.org/task/8586

And Dan has a "backend" branch which may take the first steps.

Bye
On 12/03/2009, at 7:25 PM, Allan McRae wrote:
Sebastian Nowicki wrote:
On 12/03/2009, at 7:18 PM, Xavier wrote:
Otherwise, go go sqlite :)
Wasn't there a (set of) patch(es) for a "packed" format for the databases (tar-like, iirc)? From what I remember there was a performance improvement.
There was talk about moving to a tar-based back-end but I don't remember any patches. There were some patches for an sqlite backend, but given we already need libarchive to extract the packages, a tar-based backend makes more sense to me.
Allan
I was referring to the packed format [1], which is yet another backend. Not sure what actually happened with it. Anyway, back on topic ;).

[1] http://www.archlinux.org/pipermail/pacman-dev/2008-December/007805.html
On Thu, Mar 12, 2009 at 5:18 AM, Xavier <shiningxc@gmail.com> wrote:
On Thu, Mar 12, 2009 at 9:19 AM, Allan McRae <allan@archlinux.org> wrote:
Hi,
You have probably heard about this: http://linux.slashdot.org/article.pl?sid=09/03/11/2031231
Here is the link to the bug report that describes the cause of the problem: https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54
A key part of the bug report comment... Applications are expected to use fsync() or fdatasync(), and if that impacts their performance too much, to use a single berkdb or other binary database file, and not do something stupid with hundreds of tiny text files that only hold a few bytes of data in each text file.
Hmmm... does something sound familiar there? So what should we do? Maybe add some sort of syncing after a db write. But then, after each file or just at the end of the transaction?
If I remember correctly, the code does this:

pkg1: extract files; write db entry
pkg2: extract files; write db entry
etc.
So yeah, it could make sense to sync after each db write to have a consistent database after each package install.
Adding an fsync() in the write_db_entry() call would probably make sense.

However, note the funny part here: if we sync our DB entries and then your machine gets powered off, you might end up with a DB entry that got committed while the files in the package never actually got written to disk.

-Dan
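[To make that ordering hazard concrete, here is a sketch of a crash-safe install step. The helper and the file contents are hypothetical (libalpm's real code paths differ); the point is only the order of the syncs: make the payload durable before committing the DB entry, so the DB never claims more than what is actually on disk.

#include <stdio.h>
#include <unistd.h>	/* fsync, fdatasync */

/* Hypothetical sketch: a crash after the first sync but before the
 * second leaves the package looking uninstalled, which is recoverable.
 * The reverse order could leave a DB entry pointing at lost files. */
static int install_step(const char *payload_path, const char *db_entry_path)
{
	FILE *payload, *entry;

	payload = fopen(payload_path, "w");
	if(payload == NULL) {
		return -1;
	}
	fputs("...extracted package contents...", payload);
	fflush(payload);
	fsync(fileno(payload));		/* payload becomes durable first... */
	fclose(payload);

	entry = fopen(db_entry_path, "w");
	if(entry == NULL) {
		return -1;
	}
	fputs("%NAME%\nsomepkg\n", entry);
	fflush(entry);
	fdatasync(fileno(entry));	/* ...only then commit the DB entry */
	fclose(entry);
	return 0;
}]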
On Thu, Mar 12, 2009 at 4:06 PM, Dan McGee <dpmcgee@gmail.com> wrote:
Adding an fsync() in the write_db_entry() call would probably make sense.
However, note the funny part here- if we sync our DB entries, and then your machine gets powered off, you might end up with a DB that got committed but files in the package never actually got written to disk.
Do we know that for sure? It depends on how libarchive is written, right?
Am Donnerstag, 12. März 2009 17:00:46 schrieb Xavier:
On Thu, Mar 12, 2009 at 4:06 PM, Dan McGee <dpmcgee@gmail.com> wrote:
Adding an fsync() in the write_db_entry() call would probably make sense.
However, note the funny part here- if we sync our DB entries, and then your machine gets powered off, you might end up with a DB that got committed but files in the package never actually got written to disk.
Do we know that for sure? It depends on how libarchive is written, right?
This is interesting. I think we just need to make sure that the metadata and pacman.log are written. So, if something fails during a transaction, the user could reinstall the packages manually.

OT: I have read that Sun uses ZFS snapshots to implement transactions in their package manager. Doing this at the FS level seems to be the only sane way imho. It's hard to revert the changes of an install script from pacman's point of view.

--
Pierre Schmitz
Clemens-August-Straße 76
53115 Bonn

Telefon 0228 9716608
Mobil 0160 95269831
Jabber pierre@jabber.archlinux.de
WWW http://www.archlinux.de
We don't make calls to fsync() or fdatasync() when we are attempting to do something transactionally, such as writing out one of our DB entries. Add a call to fdatasync() when we write DB entries, and also ensure we sync our log changes to disc whenever we close it.

Another important thing to ensure is that we commit removals of DB entries. The method isn't necessarily pretty, but it works in _alpm_db_remove().

Signed-off-by: Dan McGee <dan@archlinux.org>
---
 lib/libalpm/be_files.c |   22 +++++++++++++++-------
 lib/libalpm/handle.c   |    6 +++++-
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/lib/libalpm/be_files.c b/lib/libalpm/be_files.c
index 0765b8d..4faa61e 100644
--- a/lib/libalpm/be_files.c
+++ b/lib/libalpm/be_files.c
@@ -27,11 +27,13 @@
 #include <string.h>
 #include <stdint.h> /* uintmax_t, intmax_t */
 #include <sys/stat.h>
+#include <sys/types.h>
 #include <dirent.h>
 #include <ctype.h>
 #include <time.h>
 #include <limits.h> /* PATH_MAX */
 #include <locale.h> /* setlocale */
+#include <fcntl.h> /* open, close */

 /* libalpm */
 #include "db.h"
@@ -776,8 +778,6 @@ int _alpm_db_write(pmdb_t *db, pmpkg_t *info, pmdbinfrq_t inforeq)
 					"%s\n\n", info->md5sum);
 			}
 		}
-		fclose(fp);
-		fp = NULL;
 	}

 	/* FILES */
@@ -804,8 +804,6 @@ int _alpm_db_write(pmdb_t *db, pmpkg_t *info, pmdbinfrq_t inforeq)
 			}
 			fprintf(fp, "\n");
 		}
-		fclose(fp);
-		fp = NULL;
 	}

 	/* DEPENDS */
@@ -848,8 +846,6 @@ int _alpm_db_write(pmdb_t *db, pmpkg_t *info, pmdbinfrq_t inforeq)
 			}
 			fprintf(fp, "\n");
 		}
-		fclose(fp);
-		fp = NULL;
 	}

 	/* INSTALL */
@@ -860,7 +856,10 @@ cleanup:
 	free(pkgpath);

 	if(fp) {
+		fflush(fp);
+		fdatasync(fileno(fp));
 		fclose(fp);
+		fp = NULL;
 	}

 	return(retval);
@@ -868,7 +867,7 @@ cleanup:

 int _alpm_db_remove(pmdb_t *db, pmpkg_t *info)
 {
-	int ret = 0;
+	int fd, ret;
 	char *pkgpath = NULL;

 	ALPM_LOG_FUNC;
@@ -881,6 +880,15 @@ int _alpm_db_remove(pmdb_t *db, pmpkg_t *info)

 	ret = _alpm_rmrf(pkgpath);
 	free(pkgpath);
+	/* by syncing the parent directory, we can be sure the removal is
+	 * committed to disk. */
+	fd = open(db->path, 0);
+	if(fd != -1) {
+		fsync(fd);
+		close(fd);
+	} else {
+		ret = -1;
+	}
 	if(ret != 0) {
 		ret = -1;
 	}

diff --git a/lib/libalpm/handle.c b/lib/libalpm/handle.c
index 813f439..488459e 100644
--- a/lib/libalpm/handle.c
+++ b/lib/libalpm/handle.c
@@ -63,8 +63,10 @@ void _alpm_handle_free(pmhandle_t *handle)

 	/* close logfile */
 	if(handle->logstream) {
+		fflush(handle->logstream);
+		fdatasync(fileno(handle->logstream));
 		fclose(handle->logstream);
-		handle->logstream= NULL;
+		handle->logstream = NULL;
 	}
 	if(handle->usesyslog) {
 		handle->usesyslog = 0;
@@ -419,6 +421,8 @@ int SYMEXPORT alpm_option_set_logfile(const char *logfile)
 		FREE(oldlogfile);
 	}
 	if(handle->logstream) {
+		fflush(handle->logstream);
+		fdatasync(fileno(handle->logstream));
 		fclose(handle->logstream);
 		handle->logstream = NULL;
 	}
--
1.6.2
Dan McGee wrote:
We don't make calls to fsync() or fdatasync() when we are attempting to do something transactionally, such as writing out one of our DB entries. Add a call to fdatasync() when we write DB entries, and also ensure we sync our log changes to disc whenever we close it.
Another important thing to ensure is that we commit removals of DB entries. The method isn't necessarily pretty, but it works in _alpm_db_remove().
This looks good to me. Just wondering how much overhead the fsyncing adds. Have you noticed any speed decrease when running with this patch? Not that there is much choice here...

Allan
On Fri, Mar 13, 2009 at 9:55 PM, Allan McRae <allan@archlinux.org> wrote:
Dan McGee wrote:
We don't make calls to fsync() or fdatasync() when we are attempting to do something transactionally, such as writing out one of our DB entries. Add a call to fdatasync() when we write DB entries, and also ensure we sync our log changes to disc whenever we close it.
Another important thing to ensure is that we commit removals of DB entries. The method isn't necessarily pretty, but it works in _alpm_db_remove().
This looks good to me. Just wondering how much overhead the fsyncing adds. Have you noticed any speed decrease when running with this patch? Not that there is much choice here...
Little if any, actually, as we are syncing only our DB files and not all of the files we lay down on disk. For the most part pacman is already I/O bound, so this doesn't add any noticeable hiccups. -Dan
Dan McGee wrote:
On Fri, Mar 13, 2009 at 9:55 PM, Allan McRae <allan@archlinux.org> wrote:
Dan McGee wrote:
We don't make calls to fsync() or fdatasync() when we are attempting to do something transactionally, such as writing out one of our DB entries. Add a call to fdatasync() when we write DB entries, and also ensure we sync our log changes to disc whenever we close it.
Another important thing to ensure is that we commit removals of DB entries. The method isn't necessarily pretty, but it works in _alpm_db_remove().
This looks good to me. Just wondering how much overhead the fsyncing adds. Have you noticed any speed decrease when running with this patch? Not that there is much choice here...
Little if any, actually, as we are syncing only our DB files and not all of the files we lay down on disk. For the most part pacman is already I/O bound, so this doesn't add any noticeable hiccups.
OK, that sounds good. BTW, won't all files be synced if the partitions are mounted with the "ordered" option? But that is a user's choice, so...

Allan
Dan McGee schrieb:
We don't make calls to fsync() or fdatasync() when we are attempting to do something transactionally, such as writing out one of our DB entries. Add a call to fdatasync() when we write DB entries, and also ensure we sync our log changes to disc whenever we close it.
Another important thing to ensure is that we commit removals of DB entries. The method isn't necessarily pretty, but it works in _alpm_db_remove().
Can we get this into the maint branch as well? I hit this problem several times yesterday when experimenting with 2.6.29-rc8 and some experimental Intel driver stuff that crashed a lot. I eventually had to disable delalloc on my / to make it go away.
@@ -881,6 +880,15 @@ int _alpm_db_remove(pmdb_t *db, pmpkg_t *info)
 	ret = _alpm_rmrf(pkgpath);
 	free(pkgpath);
+	/* by syncing the parent directory, we can be sure the removal is
+	 * committed to disk. */
+	fd = open(db->path, 0);
+	if(fd != -1) {
+		fsync(fd);
+		close(fd);
+	} else {
+		ret = -1;
+	}
 	if(ret != 0) {
 		ret = -1;
 	}
If I understood it correctly, we also have to sync the parent directory when adding new files, not only when removing. Not entirely sure though.
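[For reference, the idiom under discussion (syncing a directory so that the creation or removal of an entry inside it is committed) might look like the sketch below. This is illustrative only, not the patch itself:

#include <fcntl.h>	/* open, O_RDONLY */
#include <unistd.h>	/* fsync, close */

/* After creating or removing a file, fsync() its parent directory so
 * that the change to the directory entry itself is committed to disk. */
static int sync_dir(const char *dirpath)
{
	int ret = 0;
	int fd = open(dirpath, O_RDONLY);
	if(fd == -1) {
		return -1;
	}
	if(fsync(fd) != 0) {
		ret = -1;
	}
	close(fd);
	return ret;
}

The patch above opens the directory with flags 0, which happens to equal O_RDONLY on Linux; spelling it O_RDONLY is clearer.]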
On Sat, Mar 14, 2009 at 1:40 PM, Thomas Bächler <thomas@archlinux.org> wrote:
Dan McGee schrieb:
We don't make calls to fsync() or fdatasync() when we are attempting to do something transactionally, such as writing out one of our DB entries. Add a call to fdatasync() when we write DB entries, and also ensure we sync our log changes to disc whenever we close it.
Another important thing to ensure is that we commit removals of DB entries. The method isn't necessarily pretty, but it works in _alpm_db_remove().
Can we get this into the maint branch as well? I hit this problem several times yesterday when experimenting with 2.6.29-rc8 and some experimental Intel driver stuff that crashed a lot. I eventually had to disable delalloc on my / to make it go away.
From http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-lengt... :

"""
What’s the best path forward? For now, what I would recommend to Ubuntu gamers whose systems crash all the time and who want to use ext4 is to use the nodelalloc mount option. I haven’t quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won’t have to worry about files getting lost as a result of delayed allocation. Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync.
"""

Short term: use nodelalloc if your system is likely to crash (e.g. when using experimental or unstable drivers).
Long term: fix the apps.

I don't think that improving pacman on this side ASAP will solve the general problem. You will still have many other apps affected on your system, some of them also dealing with important files. Or not?

Anyway, I am not opposed to a new quick maint release; this decision is up to Dan anyway. I am just saying I don't see a big need for it. And in any case, we should try to move forward to a 3.3 release :)
On Sat, Mar 14, 2009 at 8:02 AM, Xavier <shiningxc@gmail.com> wrote:
On Sat, Mar 14, 2009 at 1:40 PM, Thomas Bächler <thomas@archlinux.org> wrote:
Dan McGee schrieb:
We don't make calls to fsync() or fdatasync() when we are attempting to do something transactionally, such as writing out one of our DB entries. Add a call to fdatasync() when we write DB entries, and also ensure we sync our log changes to disc whenever we close it.
Another important thing to ensure is that we commit removals of DB entries. The method isn't necessarily pretty, but it works in _alpm_db_remove().
Can we get this into the maint branch as well? I hit this problem several times yesterday when experimenting with 2.6.29-rc8 and some experimental Intel driver stuff that crashed a lot. I eventually had to disable delalloc on my / to make it go away.
From http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-lengt... :

"""
What’s the best path forward? For now, what I would recommend to Ubuntu gamers whose systems crash all the time and who want to use ext4 is to use the nodelalloc mount option. I haven’t quantified what the performance penalty will be for this mode of operation, but the performance will be better than ext3, and at least this way they won’t have to worry about files getting lost as a result of delayed allocation. Long term, application writers who are worried about files getting lost on an unclean shutdown really should use fsync.
"""
Short term: use nodelalloc if your system is likely to crash (e.g. when using experimental or unstable drivers).
Long term: fix the apps.
I don't think that improving pacman on this side ASAP will solve the general problem. You will still have many other apps affected on your system, some of them also dealing with important files. Or not?
Anyway, I am not opposed to a new quick maint release; this decision is up to Dan anyway. I am just saying I don't see a big need for it. And in any case, we should try to move forward to a 3.3 release :)
I never said anything about quick with this stuff at all. Does your system crash all the time? Or ever? Mine sure doesn't; I'm in no rush to get this out. Thomas points to one case where his does, but in that case he is going to have a lot of broken stuff, as he says other apps do it wrong as well.

My main point here is just to do it right. I'm not trying to address a specific problem, just the general concern of "if I write a file and my app exits, can I be sure it actually got written?".

Thomas: I believe you are right, we should sync the db directory when we write to it as well.

-Dan
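[As an aside, the textbook answer to "can I be sure it actually got written?" is the write-to-temp, sync, rename, sync-the-directory idiom. A rough sketch, under the assumption that tmppath and path live inside dirpath; this goes further than the patch above does:

#include <stdio.h>	/* fopen, rename */
#include <fcntl.h>	/* open, O_RDONLY */
#include <unistd.h>	/* fsync, close */

/* Atomically replace `path` with `data`: the old contents stay valid
 * until the new file is durable, and the rename itself is committed by
 * syncing the containing directory. */
static int replace_file(const char *dirpath, const char *tmppath,
		const char *path, const char *data)
{
	int dirfd;
	FILE *fp = fopen(tmppath, "w");

	if(fp == NULL) {
		return -1;
	}
	fputs(data, fp);
	fflush(fp);
	fsync(fileno(fp));	/* new contents are durable... */
	fclose(fp);

	if(rename(tmppath, path) != 0) {	/* ...before they become visible */
		return -1;
	}
	dirfd = open(dirpath, O_RDONLY);
	if(dirfd == -1) {
		return -1;
	}
	fsync(dirfd);	/* commit the rename */
	return close(dirfd);
}]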
Dan McGee schrieb:
Anyway, I am not opposed to a new quick maint release, this decision is up to Dan anyway. I am just saying I don't see a big need for it. And in any cases, we should try to move forward for a 3.3 release :)
I never said anything about quick with this stuff at all. Does your system crash all the time? Or ever? Mine sure doesn't; I'm in no rush to get this out.
I only wanted this patch to be applied to the maint branch; I never said we should release it NOW.
Thomas points to one case where his system does, but in that case he is going to have a lot of broken stuff as he says as other apps also do it wrong as well.
This is actually a VERY common case. You experiment with new applications:

1) Install a new package
2) Try it
3) It crashes your machine (because you are running experimental software with experimental kernel interfaces)
4) You installed the package only 10 seconds before the crash, so now your pacman db is corrupted

My point is, running pacman directly before a system crash might be more common than you think. Or, the other way around: system crashes are more likely to happen directly after a pacman action than at any other time.
My main point here is just to do it right- I'm not trying to address a specific problem, just the general concern that is "if I write a file and my app exits, can I be sure it actually got written?".
Yes, of course we should ensure that our db is in a consistent state when pacman exits; I am glad this has been done so fast.
Thomas- I believe you are right, we should sync the db directory when we write to it as well.
Okay.
Thomas Bächler wrote:
Dan McGee schrieb:
Anyway, I am not opposed to a new quick maint release, this decision is up to Dan anyway. I am just saying I don't see a big need for it. And in any cases, we should try to move forward for a 3.3 release :)
I never said anything about quick with this stuff at all. Does your system crash all the time? Or ever? Mine sure doesn't; I'm in no rush to get this out.
I only wanted this patch to be applied to the maint branch; I never said we should release it NOW.
I should finish everything that needs to be done with makepkg in the next week, so I guess we will be gearing up for the next official release. It could go on maint, but I guess there is little point.

Allan
Allan McRae schrieb:
A key part of the bug report comment... Applications are expected to use fsync() or fdatasync(), and if that impacts their performance too much, to use a single berkdb or other binary database file, and not do something stupid with hundreds of tiny text files that only hold a few bytes of data in each text file.
It looks like fsync() is called with the fd of an open file. What happens if you just call fclose()? Is the file synced automatically then, or do you need to run fsync() before fclose()?
Thomas Bächler schrieb:
Allan McRae schrieb:
A key part of the bug report comment... Applications are expected to use fsync() or fdatasync(), and if that impacts their performance too much, to use a single berkdb or other binary database file, and not do something stupid with hundreds of tiny text files that only hold a few bytes of data in each text file.
It looks like fsync() is called with the fd of an open file. What happens if you just call fclose()? Is the file synced automatically then, or do you need to run fsync() before fclose()?
Okay, I read up on it and apparently fclose() does not guarantee an fsync(). For ext4, one can probably work around that problem by mounting with the "nodelalloc" option, but that will result in performance penalties and more fragmentation.
participants:
- Allan McRae
- Dan McGee
- Nagy Gabor
- Pierre Schmitz
- Sebastian Nowicki
- Thomas Bächler
- Xavier