PDA

View Full Version : Memory operation



LlmL
December 5th, 2008, 11:31 AM
I'm working on a large project concerning many memory operations.
Our program has dumped a "core.*" file many times. From the core information, I am sure that some invalid memory operation has caused it. But the project is so complicated that reviewing the code costs too much time; besides, this program starts up over 80 threads, and the "core" shows up irregularly.
I am thinking about setting some "hooks" on the system memory operations like memcpy, malloc, etc. How can I do this? Should I rewrite and recompile the GNU libs?

samjh
December 5th, 2008, 12:21 PM
I'm sorry, but I cannot understand what you're trying to say.

What is "core.*"? What do you mean that you have it "many times"? What do you mean by "hooks"?

iponeverything
December 5th, 2008, 12:48 PM
I'm sorry, but I cannot understand what you're trying to say.

What is "core.*"? What do you mean that you have it "many times"? What do you mean by "hooks"?

His program is segfaulting and dumping a core. Do a man core or man signal.

@LlmL It's quite hard to debug an application if you can't reliably reproduce the condition.

thk
December 5th, 2008, 03:20 PM
Take a look at valgrind and similar solutions.

Cracauer
December 6th, 2008, 12:36 AM
I'm working on a large project concerning many memory operations.
Our program has dumped a "core.*" file many times. From the core information, I am sure that some invalid memory operation has caused it. But the project is so complicated that reviewing the code costs too much time; besides, this program starts up over 80 threads, and the "core" shows up irregularly.
I am thinking about setting some "hooks" on the system memory operations like memcpy, malloc, etc. How can I do this? Should I rewrite and recompile the GNU libs?

You are confused.

It's not the memory operations themselves that cause the segfaults that are presumably behind the core dumps. It is regular code messing in memory areas where it shouldn't.


As people have said, try a memory checker that introduces bounds guard VM pages.

LlmL
December 8th, 2008, 04:05 AM
You are confused.

It's not the memory operations themselves that cause the segfaults that are presumably behind the core dumps. It is regular code messing in memory areas where it shouldn't.


As people have said, try a memory checker that introduces bounds guard VM pages.


What is a memory checker, and what are bounds-guard VM pages?
How can I do that?

LlmL
December 8th, 2008, 04:18 AM
Take a look at valgrind and similar solutions.

Using valgrind slows my program down, so it takes even longer and is even harder to reproduce this condition (possibly over one year).

wmcbrine
December 8th, 2008, 04:33 AM
Yes, well, you should've tested it better as you were building it. Now it's harder. But you'll have to do the work, or be stuck with a buggy program. There's no magic shortcut.

Why are you using 80 threads, anyway?

LlmL
December 8th, 2008, 01:20 PM
Yes, well, you should've tested it better as you were building it. Now it's harder. But you'll have to do the work, or be stuck with a buggy program. There's no magic shortcut.

Why are you using 80 threads, anyway?

This program serves as a server for socket connections. Some of the threads manage connections, and some deal with packet events. It has a large client population. We are trying to add memory hooks around these operations to track the problem down.

dwhitney67
December 8th, 2008, 01:45 PM
"Memory Hooks" to solve what? If your application is crashing with a core dump, it is either because it is accessing memory at an illegal address (i.e. an area outside the program space), or because memory that has been allocated has been trampled on by some operation.

Typically it is the latter situation that occurs most often. Here's a C++ example that compiles fine, but generates a segmentation violation when run.



#include <iostream>
#include <cstring>

int main()
{
    char str[10] = {0};
    int* val1 = new int(5);
    int val2 = 10;

    memcpy(str, "Hello World", 12);

    std::cout << "str = " << str << std::endl;
    std::cout << "val1 = " << *val1 << std::endl;
    std::cout << "val2 = " << val2 << std::endl;

    delete val1;

    return 0;
}


Here the problem is the copying of 12 bytes into a 10-byte array; this causes a buffer overflow, which in turn tramples the next variable (val1).

Anyhow, I do not know what effort went into developing your server application, but there is probably one thread to accept client connections, and then other threads to deal with client requests. Each of the client-request-handling threads should be identical, so it should not be impossible to go back and review the code for some silly mistake a programmer may have made when performing a memory operation.

Off the top of my head, these are some of the operations you should look for:
malloc
memcpy
memset
strcpy (note, this function should never be used; use strncpy instead)
recv
recvfrom

Since the application crashes unpredictably, I would focus your attention on the threads that accept client data. The issue probably lies there.

wmcbrine
December 8th, 2008, 07:03 PM
Really you have to look at every pointer dereference. And that includes every array access (a[b] is the same as *(a + b)).

nvteighen
December 8th, 2008, 10:15 PM
Using valgrind slows down my program and it takes longer and is harder to reproduce this condition (possibly over one year).

Well, take it or leave it.

Maybe you could try using a lab test suite? I mean, simulate your program at a smaller scale (reduce those 80 threads to, say, 3 or 4... that won't affect any memory-related bug). First use gdb to check the state of your program just before the error occurs. Then use valgrind to detect where the error occurs and also where the memory was allocated (gdb will only tell you where the segfault occurs).

And draw your conclusions... but I'd always use valgrind along with gdb. Never just valgrind.

Cracauer
December 10th, 2008, 04:23 AM
Overrunning the stack can also cause segfaults.

I haven't used a checker for a while, but I spotted that OpenBSD's malloc(3) has a guard page option. That sounds like what is needed (not that the OP is still around :)).

Using just guard VM pages around malloc'ed blocks doesn't slow down normal program operations at all. But of course it makes malloc itself an order of magnitude (or two) slower.

Of course, even if you have this and you get the segfault, then you still need gdb to debug it. Which is what you needed in the first place. The best course of action probably is to run the whole damn thing in gdb, as-is and wait for the segfault. If you are lucky it's a direct operation without previous corruption.

dribeas
December 10th, 2008, 02:22 PM
See? you scared the poster :P

Now, seriously, since the poster does have cores, you can open the core with gdb and at least get a stack trace at the point where the program died. That helps figure out what is wrong. It is usually much harder to detect what made it go wrong... or when.