Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty results of some probes #212

Closed
alexey-yarmosh opened this issue Oct 5, 2022 · 26 comments
Closed

Empty results of some probes #212

alexey-yarmosh opened this issue Oct 5, 2022 · 26 comments
Assignees

Comments

@alexey-yarmosh
Copy link
Member

alexey-yarmosh commented Oct 5, 2022

Sometimes some of the probes of the finished measurements have result fields. Maybe data is not sent by the probe or API didn't handle it properly. We need to adress it.

Steps to reproduce (may not reproduce every time):

  • send measurement request with the body
{
    "target": "yarmosh.by",
    "type": "ping",
    "measurementOptions": {
        "packets": 16
    },
    "limit": 500,
    "locations": []
}
  • wait until measurement status is finished
  • check all the result fields
    Expected value:
    some valid result data
    Actual value:

image

@alexey-yarmosh alexey-yarmosh changed the title Lost data from a probe Empty results of some probes Oct 5, 2022
@MartinKolarik
Copy link
Member

Isn't this a case of the measuremnt timing out? I believe we don't have any explicit way to indicate per-measurement errors so this is what happens then.

@jimaek
Copy link
Member

jimaek commented Oct 6, 2022

The question is why it timed out. Our current probes should be reliable enough to work. If you give me some probes I can try to pull the local logs.

We probably need a status field to clearly mark what happened in each result. Did the whole probe time out because of bad internet, meaning the target endpoint is not at fault, did it time out because of an exception in which case we need to fix it, did the API fail and we need to fix something, did the target endpoint fail and the user can safely mark it as down

@patrykcieszkowski
Copy link
Contributor

Isn't this a case of the measuremnt timing out? I believe we don't have any explicit way to indicate per-measurement errors so this is what happens then.

no. A measurement timeout is handled by the server, and the server doesn't include any other value than rawOutput.

@alexey-yarmosh
Copy link
Member Author

Such output may occur for several reasons:

  • stdout of the result is empty (code)
  • stdout of the execa error is empty (code)
  • error happened but it is not execa error (code)

Also we are putting the reason of failed measurement (e.g. "Private IP ranges are not allowed") in rawOutput which is used for execa stdout.

I think we should add a status field to the measurement + use rawOutput only for execa output and also add message field for the reason of the failure. E.g.:

  • successful measurement
"result": {
  "status": "success",
  "message": "",
  "rawOutput": "PING ...",
  "resolvedAddress": "...",
  "resolvedHostname": "...",
  "timings": [...],
  "stats": {...}
}
  • stdout of the execa error is empty
"result": {
  "status": "failed",
  "message": "execa error",
  "rawOutput": "",
  "resolvedAddress": null,
  "resolvedHostname": null,
  "timings": [],
  "stats": {...}
}
  • error happened but it is not execa error
"result": {
  "status": "failed",
  "message": "cannot get property a of undefined",
  "rawOutput": "PING ...",
  "resolvedAddress": null,
  "resolvedHostname": null,
  "timings": [],
  "stats": {...}
}
  • private ip
"result": {
  "status": "failed",
  "message": "Private IP ranges are not allowed",
  "rawOutput": "",
  "resolvedAddress": null,
  "resolvedHostname": null,
  "timings": [],
  "stats": {...}
}

@alexey-yarmosh
Copy link
Member Author

@jimaek @MartinKolarik what do you think?

@MartinKolarik
Copy link
Member

Not sure if the message is necessary. Also, in the third case, we shouldn't expose the error message, only log it. Probe-level status maybe makes sense, just need to think how it interacts with the overall measurement status.

@alexey-yarmosh
Copy link
Member Author

Not sure if the message is necessary

Idea is not to mix up reason of status and raw output.

Also, in the third case, we shouldn't expose the error message, only log it

@jimaek, how we are able to get logs from the probe?

Not sure if the message is necessary. Also, in the third case, we shouldn't expose the error message, only log it. Probe-level status maybe makes sense, just need to think how it interacts with the overall measurement status.

I think overall measurement status logic may be left as is. We can just add "in progress" (or "not finished") status for progress messages. So when the overall measurement is finished we understand that all "in progress" measurements mean timeout.

@jimaek
Copy link
Member

jimaek commented Jan 9, 2023

Probe logs are local and if we don't control it we have no way to access them

@patrykcieszkowski
Copy link
Contributor

Probe logs are local and if we don't control it we have no way to access them

maybe that's a potential idea for another feature - save probe logs to file, and make it accessible through the API?

@jimaek
Copy link
Member

jimaek commented Jan 18, 2023

It sounds nice, but I can imagine it would be too complicated. e.g. hardware probes are read-only, there is no way for the user to interact. We could build a system to trigger log upload on adopted probes via the dashboard but that sounds like a huge project to implement correctly and securely

@alexey-yarmosh
Copy link
Member Author

So the new field that we are adding is status: "not-finished" | "success" | "failed" - at the very start all probes are "not-finished" and when global measurement is "finished" we are stopping all the probe updates. So we are able to see that some probes are "success", some are "failed", and some where not finished in 30 sec and still have "not-finished" status - that means "timeout";

Pushing "Private IP ranges are not allowed" to rawOutput makes sense if some globalping-based app shows rawOutput to the end user. if that is true keeping message there seems valid.

According to sending logs, maybe we can use socket.io for that? On probe side it can be simple socket.emit('probe:log:info', someData) or socket.emit('probe:log:error', error) in error handlers. And on API side we will listen and log that messages with special prefix. Or smth like winston-socket.io transport can be used as well.

@MartinKolarik
Copy link
Member

Let's use values in-progress, finished, and failed. When the global measurement timer runs out, change all probes that are in-progress to failed and maybe add a message "The request timed out" to rawOutput. That way the clients can easily differentiate between success/error and also provide details to the users in a similar way like for the "Private IP ranges are not allowed" case.

@alexey-yarmosh
Copy link
Member Author

Seems we can override long running script's rawOutput with the "The request timed out" value, not sure if that is ok.

@MartinKolarik
Copy link
Member

MartinKolarik commented Jan 30, 2023

Assuming the rawOutput is empty when the timeout is reached, it definitely makes sense to me. If there's already some output, maybe add it as the last line?

@alexey-yarmosh
Copy link
Member Author

or timeout may be one of the status values

@MartinKolarik
Copy link
Member

I considered that but don't think we should have a special type for each type of error. From the client's perspective, it's simply a failure and the raw output has the message with additional details that can be shown to the user.

@alexey-yarmosh
Copy link
Member Author

alexey-yarmosh commented Jan 30, 2023

Ok, another thing is an extra load to redis. As I see the only way to that is:

const measurement = this.getMeasurement(id);
const ids = measurement.results.filter(filterInProgress);
await Promise.all(ids.map(id => this.redis.json.set(key, `$.results.${id}.result.status`, 'failed')));

So under high load when timeouts will happen more frequently we will have event more redis writes.

@MartinKolarik
Copy link
Member

MartinKolarik commented Jan 30, 2023

We can do one set on the whole JSON document, can't we? Not sure how exactly redis implements the individual field operations, but I imagine if there are more than a few fields to update, the whole document set might be faster.

@alexey-yarmosh
Copy link
Member Author

Yes, seems whole document set is not a perfect, but a better option.

@alexey-yarmosh
Copy link
Member Author

alexey-yarmosh commented Feb 10, 2023

So, we added statuses and they help with timeouts, but there are still a few cases when we are able to get the same {status: 'failed', rawOutput: ''} result:

  • stdout of the result is empty (code)
  • stdout of the execa error is empty (code)
  • error happened but it is not execa error (code)

We need to differentiate them, possible options are:

Option 1: Add messages to the rawOutput.

  • stdout of the result is empty
    remain as is
  • stdout of the execa error is empty
    add message to the rawOutput field like result.rawOutput += '\n\nExeca error happened.'
  • error happened but it is not execa error
    change the response from {status: 'failed', rawOutput: ''} to {status: 'failed', rawOutput: 'Runtime error happened.'}

Option 2: Use exitCode from execa response.

  • stdout of the result is empty
    add exitCode field to the result (it should be 0 all the time)
  • stdout of the execa error is empty
    add exitCode field to the result (it should be non-0 all the time).
  • error happened but it is not execa error
    change the response from {status: 'failed', exitCode: 0, rawOutput: ''} to {status: 'failed', exitCode: 0, rawOutput: 'Runtime error happened.'}

Option 3: Use other execa fields.

  • Possible fields are failed, timedOut, isCanceled, killed. We can use them in the same way exitCode is used in option 2.

Addition. Here is the example of execa error object:

{
  shortMessage: "Command failed with exit code 68: unbuffer ping -c 3 -i 0.2 asdf.asdf",
  command: "unbuffer ping -c 3 -i 0.2 asdf.asdf",
  escapedCommand: "unbuffer ping -c 3 -i 0.2 asdf.asdf",
  exitCode: 68, // for some reason it is 68 for ping of unknown host
  signal: undefined,
  signalDescription: undefined,
  stdout: "ping: cannot resolve asdf.asdf: Unknown host",
  stderr: "",
  failed: true,
  timedOut: false,
  isCanceled: false,
  killed: false,
}

Here is the example of execa successful object:

{
  command: "unbuffer ping -c 3 -i 0.2 google.com",
  escapedCommand: "unbuffer ping -c 3 -i 0.2 google.com",
  exitCode: 0,
  stdout: "PING google.com (142.250.75.14): 56 data bytes\n64 bytes from 142.250.75.14: icmp_seq=0 ttl=120 time=7.826 ms\n64 bytes from 142.250.75.14: icmp_seq=1 ttl=120 time=6.043 ms\n64 bytes from 142.250.75.14: icmp_seq=2 ttl=120 time=7.785 ms\n\n--- google.com ping statistics ---\n3 packets transmitted, 3 packets received, 0.0% packet loss\nround-trip min/avg/max/stddev = 6.043/7.218/7.826/0.831 ms",
  stderr: "",
  all: undefined,
  failed: false,
  timedOut: false,
  isCanceled: false,
  killed: false,
}

The best option from my perspective is to use exitCode field - we are not polluting rawOutput plus we can clearly see either problem is in JS code or execa command.
@jimaek @MartinKolarik please share your thoughts.

@MartinKolarik
Copy link
Member

MartinKolarik commented Feb 16, 2023

  1. From the user's perspective, what matters is that the test failed and that there is an indicative error message. That means failed status and a message added to rawOutput because that's what the tools will show.
  2. The exit code, on the other hand, is worthless to the user and also bounds us to a specific implementation. What if we later reimplement the DNS command so that it's executed within node and there are no exit codes at all?

Solution 1 makes the most sense to me. The message itself may change depending on the exit code (or depending on the execa fields as in option 3) if you feel it explains the problem better to the user / helps us in debugging.

@jimaek
Copy link
Member

jimaek commented Feb 16, 2023

Maybe we can also locally log with a lot more detail when issues like that happen? To make it easier to debug in case a user complains that a test failed.

@alexey-yarmosh
Copy link
Member Author

Sounds valid, I'll prepare a draft PR for a ping command.

@alexey-yarmosh
Copy link
Member Author

Here it is:
jsdelivr/globalping-probe#114

@alexey-yarmosh
Copy link
Member Author

alexey-yarmosh commented Feb 20, 2023

So, there are only 4 faulty probes, they constantly respond with “Test failed. Please try again.” in case of dns, ping, traceroute or mtr measurements. http type is working fine. Seems like these probes are behind a firewall or smth that blocks everything except http.

Issue is still valid so I am not closing it. But to proceed we should either:

@alexey-yarmosh
Copy link
Member Author

Fixed by #52

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants