We're not making exceptions for threading in this discussion yet... however, if you're running under circumstances where you're hitting a hard RAM limit, as I said, the OS will generally have to kill something, and from its own perspective it might just as well be anything, including the process with the failed malloc. Subsequent error reporting and recovery are just as likely to fail due to OOM in modern OSes, really -- especially considering there are other processes competing for the same resources, with the same "rights" not to lose data because of their own respective OOMs.
And at any given time, there is just so much RAM to be shared... tough choices.
j_g: You have uncovered the biggest secret the Open Source Software community would like to keep a lid on. Fact is, Linux is being pushed so hard in the press and projects are so pressured to release new code with new functionality that very little attention is paid to "peer review" and tonnes upon tonnes of "codemonkey"-style code becomes the norm. I wish you all the luck in your attempts to change this "status quo", but I hope you realize that this trend has been snowballing for a long time. I suspect your most effective approach would be to convince someone "high up" on the 'authority ladder' of Debian or Ubuntu to establish quality code standards. Otherwise, in the end, Ubuntu will become nothing but a large dirty snowball that nobody wants.
But I'm not out to change the world. If someone wants to put a poor design into a distro I'm using, I'll either look for a way to disable/remove it, or look for a distro that is more acceptable to me. If things got so bad all around the Linux landscape that it ended up being worse than Windows for me, I'd just go back to Windows. (Actually, I haven't left Windows. Right now, I use Linux and Windows equally because Linux is not yet better than Windows for me. They're both about on a par, with a slight nod to Windows, for my purposes). Brand loyalty? Not for me. If something personally offers me more, I'll drop what I'm using and move to it, with no regrets at all. I've done it numerous times before.
What I'm simply saying here is...
"I know some of you are learning programming. And you're probably looking at code such as this shared lib example that I recently came across. One day, you may decide to write a shared lib too. Don't do this specific thing right here, that I'm showing you with this one example. It's bad, for reasons I'm going to describe now. Don't do it. Just... don't. Endusers, and other programmers who use your lib, will thank you for it."
That's all I'm saying. If it ends up changing the world, great. If it ends up just making one programmer aware of a better way to do things, and I happen to use that guy's stuff, that's just as great as far as I'm concerned.
I'm too much of a pragmatist to ever bother changing the world. I don't need that. I just need programmers to stop doing things like putting exit()'s in shared libs, especially ones that are difficult for me to avoid because, for example, maybe Ubuntu devs put it into the next release as a major component, and set all of the desktop apps to use it by default. Sigh.
There seem to be some misunderstandings about how killing processes when out of memory happens.
1. In Linux, malloc never returns 0 unless the amount asked to allocate is ridiculously large (just too large to fulfill).
2. When a process asks for more memory and malloc is about to fail, that process is NOT killed. Instead, the code in linux/mm/oom_kill.c is used to pick a process (or, in the case of fork bombs, processes) to kill.
Malloc returning 0 on OOM is a ridiculously bad idea on a shared system, because the victim of the OOM is quite likely not the same process that caused it. Because of this, just coding good OOM handling in your own program is not enough to make it OOM-resistant: when you use all the memory up, the next malloc that would fail is about as likely to be something in X or D-Bus or GNOME or the like. None of those projects is interested in writing good OOM handling for themselves, and in some cases (D-Bus) they couldn't do it without data loss even if they wanted to.
So, when free memory hits 0 on Linux, the kernel tries to find something that is
a) using a lot of memory, and
b) not likely to be useful,
and kill it. It'll settle for (a) if it can't find (b).
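The selection above can be sketched as a toy heuristic. To be clear, this is an illustration of the idea, not the actual kernel code: the real scoring lives in linux/mm/oom_kill.c and also weighs runtime, nice level, capabilities, and children's memory usage.

```c
#include <stddef.h>

/* Toy badness score, loosely modeled on the idea behind oom_kill.c:
 * prefer processes that use a lot of memory (a) and that look
 * expendable (b). Essential processes are never picked. */
struct proc_info {
    unsigned long mem_pages;  /* resident memory, in pages */
    int is_essential;         /* e.g. init, kernel threads */
};

static unsigned long badness(const struct proc_info *p)
{
    if (p->is_essential)
        return 0;             /* never a victim */
    return p->mem_pages;      /* bigger hog => better victim */
}

/* Pick the process with the highest badness score; "settle for (a)"
 * means the biggest killable memory user wins even if it looks useful. */
int pick_victim(const struct proc_info *procs, int n)
{
    int victim = -1;
    unsigned long worst = 0;
    for (int i = 0; i < n; i++) {
        unsigned long score = badness(&procs[i]);
        if (score > worst) {
            worst = score;
            victim = i;
        }
    }
    return victim;  /* -1 means nothing killable was found */
}
```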
The code in oom_kill.c is very well commented. From the first comment block:

Code:
 * Since we won't call these routines often (on a well-configured
 * machine) this file will double as a 'coding guide' and a signpost
 * for newbie kernel hackers. It features several pointers to major
 * kernel subsystems and hints as to where to find out what things do.

Go read it before debating more about situations that never happen. (Start at the last function; it is the one called from the outside on OOM.)
The thing that pisses me off most about this thread, and the one before it, is that in well-designed programs OOM doesn't cause data loss, not even when using those badly designed libraries. This is not because they handle malloc returning 0 properly, but because they handle termination signals properly. When Linux decides to kill a task, it gives it a warning signal, some free memory from kernel reserves, and plenty of time to do what it needs to save data.
So, to handle OOM properly without losing data, write a termination-signal handler that saves data and tries to use as little dynamic memory as possible.
Last edited by Tuna-Fish; November 17th, 2007 at 02:22 PM.
We hopefully teach people to do proper error checking, such as checking malloc for a 0 return and trying to recover -- so that the OS doesn't have to resort to the worst-case scenario and start dropping services and apps on its own. If your apps are meticulous about error checking, and don't do braindead things like, let's say, busy-looping around that call to malloc until it succeeds, but rather simply decline whatever operation they were planning to do, then the OS is in a much better position to stay stable under low-memory conditions.

Windows seems pretty good at handling low-memory conditions without abruptly terminating apps, instead giving them every opportunity to try to recover. I've personally experienced it myself. The operating system literally tells the enduser that he's running low on memory, and that he should choose processes/windows/whatever to close down. Windows plans ahead for this, and that's why, if it happens, it can present that notice. Now, I haven't personally run Linux under low-memory conditions, and I'm not looking to even try, because I have heard that it does not handle that situation gracefully. But that is not how it should be, and programmers should strive to write their software so that it doesn't have to be that way.
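In code, "simply decline whatever operation it was planning to do" looks something like this. The function name and the operation are made up for illustration; the point is the shape of the error path, with no busy-loop and no exit().

```c
#include <stdlib.h>
#include <string.h>

/* Attempt an operation that needs a heap buffer. On allocation
 * failure, decline the operation and tell the caller -- do not
 * retry malloc in a loop, and do not exit(). */
int duplicate_text(const char *src, char **out)
{
    size_t n = strlen(src) + 1;
    char *buf = malloc(n);
    if (buf == NULL) {
        /* Graceful recovery: skip the operation, report failure. */
        *out = NULL;
        return -1;
    }
    memcpy(buf, src, n);
    *out = buf;
    return 0;
}
```

The caller checks the return value and can then inform the user, free some caches, or simply abandon that one operation, with all existing data intact.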
Yes, I realize that it's difficult to integrate error handling among components that were written by people who may have never even spoken to each other. Obviously, Windows is written by one entity where the teams are pretty much compelled to interoperate with each other. So they can make sure that, if nearly every byte of RAM has truly been put to use, there is still enough "squirreled away" to make sure the GUI informs the user he's running low on memory and should choose some processes/windows/whatever to close down. The Linux kernel isn't even written with a GUI in mind. That's fully outside the scope of the kernel. So yes, I imagine that it is very difficult for the Gnome desktop, for example, to do something like tell the enduser he's running low on memory if the kernel (ie, the entity doling out memory) doesn't even know what Gnome is really for.
So, if the kernel can't be sure it can advise the enduser of this condition, I suppose it has to make a decision how to recover on its own (rather than letting the enduser choose how he wants to regain some memory -- close a window, unplug some device that may cause mem to be freed, maybe close down some process, etc).
But that's irrelevant. It's a different discussion. It has no bearing on whether an app should check malloc's return for 0 and attempt to recover. (What you're talking about means only that a Win32 app has a better chance of not being terminated without the enduser's consent than a Linux app has). If the app asks pa_xalloc for "a ridiculously huge amount of RAM", and malloc fails, then the app will be terminated. End of story. Finito. Try it. You'll see. Why? Because there's a call to exit() there. And it shouldn't be there. It shouldn't be there in other Linux shared libs either.
That's what we've been telling people in these threads.
Last edited by j_g; November 17th, 2007 at 05:51 PM.
How familiar are you with OSS development? There is only peer review in OSS; in fact, that is how it grows. Coding standards exist -- look at the GNU site. And different projects have different standards.
Code Monkeys is a great show, but I sense you are using the original meaning. I wouldn't call the oldest and most stable software the result of code monkeys.
What is clear to me though is that malloc can return NULL way before the kernel's OOM killer kicks in, which allows applications (well, the ones doing proper error handling) to take the appropriate actions.
Anyway, in either case the decision to exit should not belong to the library; it is the application's responsibility to handle such cases, if only because the author of a library has no clue where exactly his library may end up being used.
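Concretely, a library allocator can report failure and leave the decision to the caller. This is a sketch, not PulseAudio's actual API; the mylib_* names are made up to contrast with pa_xalloc's exit()-on-failure behavior.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Library side: never exit() on behalf of the application.
 * Return NULL and set errno; what failure means is the app's call. */
void *mylib_alloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL)
        errno = ENOMEM;
    return p;
}

/* A library routine built on the allocator: it propagates failure
 * upward instead of deciding the process should die. */
char *mylib_strdup_or_null(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = mylib_alloc(n);
    if (p != NULL)
        memcpy(p, s, n);
    return p;
}
```

An application that embeds this library can then decide for itself whether an allocation failure means "show a dialog", "drop a cache and retry", or "save and quit" -- choices no shared lib can make for it.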
Damn, it's for reasons like these that I'm glad I use C++ (at least when I don't have to deal with misdesigned C libraries). No memory available? Let the exception propagate...
Not even tinfoil can save us now...
All we know for sure is that malloc can indeed return a 0 on Linux (I've had it happen to me before), and if you fall into exit(), then you kill off the process. This is a nasty thing to do to an enduser (because it could result in him losing unsaved work, and is just plain annoying and unreassuring). It's also a nasty thing to do to an app programmer using your shared lib.
OOM issues have nothing to do with calling exit() when malloc fails, or calling exit() for any reason in a shared lib used by numerous apps. (It's true that Linux does allow "over-committing memory", and because of that, an OOM killer exists. I really, really wish it weren't so. I really wish that Linux didn't over-commit, and also informed the enduser when a low-memory condition was upon him. But that's another problem).
Are you absolutely SURE that you are spreading the *correct* advice to programmers who are new to shared library coding?

Yes. Check the return of malloc for 0, and gracefully handle the situation by informing the app of this situation. Do not abort the app.
Follow the advice here http://www.linuxdevcenter.com/pub/a/...ry.html?page=2 (under "Check for NULL Pointer after Memory Allocation").
Last edited by j_g; November 19th, 2007 at 10:19 AM.