Resolved Problem with compressed XML sitemap

Status
Not open for further replies.

costas1

Member
Messages
134
Reaction score
3
Points
18
I have a problem with accessing my site's sitemaps.

I have a sitemap index which points to several other compressed (.gz) XML sitemaps.

When I try to open a compressed sitemap file via HTTP, using the site's URL, I get a corrupted file with the wrong filename. The file on the server is named [filename].xml.gz, but when I open it in the browser I get a corrupted [filename].xml-[no].gz, where [no] is a number that increases by 1 each time I try to open the file.
A similar thing happens when I open the files via DirectAdmin. This time the files open with the wrong name, and the XML inside the .gz also has the [no] at the end of its file name, after the extension. In this case the XML can be opened after renaming it, or by right-clicking it and choosing a program to open it.

If instead of opening them I choose download, then the file is saved correctly and I can open the .gz and then the XML, without an issue.

Any ideas?
 

garrettroyce

Community Support
Community Support
Messages
5,611
Reaction score
249
Points
63
It sounds like you want a rule in your .htaccess that forces the file to download instead of being displayed as content. Maybe something like this:

Code:
# FilesMatch takes a regular expression: match names ending in .xml.gz
<FilesMatch "\.xml\.gz$">
  ForceType application/gzip
  Header set Content-Disposition attachment
</FilesMatch>
 

costas1

I'm not really sure if I need anything, because search engines seem to be able to access the sitemaps without errors.

I still wonder what causes the archives' corruption and their rename. The archives get corrupted when opened with any browser, but if I open them through DirectAdmin they only get renamed and I can open them. The thing is that the file in the archive is also renamed, so obviously this happens server side.

Maybe that's something you can fix.
 

cybersav

Member
Messages
56
Reaction score
3
Points
8
Hello costas1.

This is caused because when you access or rename a file with special characters inside the file or in the name, they will be lost or corrupted and the encoding will mess up.

To prevent this, if you need to edit the file, use an ftp client with the same encoding settings as the file.

I personally and highly recommend Multi Commander to safely and carefully work on your files without disruption.

http://multicommander.com - Free software (Not freemium)
 

costas1

I'm not sure what a file manager has to do with this. Every time I access the files via HTTP, an auto-incrementing number is appended to their names. The files are static and obviously have a standard name. Whatever happens, it happens server side.
 

garrettroyce

Interesting. If anything, it sounds like an issue with DirectAdmin. Maybe it doesn't handle that specific file correctly.
 

costas1

The problem happens not only when opening the files through DA's file manager, but also when accessing them with their URL and choosing open instead of download, with a browser. Has DA anything to do with that too?

By the way, if it's not a problem, visit the compressed sitemaps of my .org domain (they are in the "sitemap" folder) and try to reproduce the problem. I'm curious whether it would happen in your case too, and I guess it would be interesting to explore that behavior.
 

garrettroyce

I don't have access to see what your domains are; only your "primary" domain is visible, which is your .x10.mx or whatever you chose at signup.

I'm wondering if changing the HTTP server has any effect on this.
 

costas1

No, the problem persists to this moment. Since you can't see my domains, I just sent you the URL in a PM.
 

garrettroyce

Replacing the HTTP server program may affect this. Please try it now and see if it's any different. I replied to your PM as well.
 

costas1

Sorry for the delay in my reply. It was a busy day.
I still don't see any difference. I'm trying now from my laptop using Firefox. Same thing as with my desktop: I get a corrupted file. Repeated attempts to open the file lead to the same thing with the auto-incrementing number.

I also tried with Opera on my laptop. Again I get a corrupted .gz. Chrome too downloads a corrupted file.

No difference from trying using my desktop.

Getting the same error using 2 different computers and 3 different browsers is probably not random.

I can only download the files without any issues using FTP. On the other hand, both Google and Bing seem to access the sitemaps without an issue.
 

garrettroyce

The file I'm getting from your PM is not compressed. Have you tried gzipping it first?

You may need a plugin for WordPress, etc., if you're using software that can automate this.
 

costas1

What do you mean it is not compressed? The link I shared leads to a .gz file. What type of file do you download when visiting the URL? Isn't it a gzip?

I personally get a corrupted gzip file, but when accessing it through FTP the file's integrity is fine.
 

garrettroyce

It's not a gzip file; it's a .xml file. You have to use a program to create a gzip, using the xml file as input. It's like creating a .zip file. Renaming file.xml to file.gz doesn't work and it looks corrupt because it's not a real gzip file, it's just a renamed file.

Look for Peazip or 7zip for Windows. Linux has gzip built in. I'm not sure what to use for Mac.
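
One quick way to tell a real gzip file from a renamed one is to look at its first two bytes, since gzip streams always begin with 0x1f 0x8b. A minimal Python sketch of that check (the function name `looks_like_gzip` and the sample XML are made up for illustration):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Hypothetical sample content standing in for a sitemap.
xml = b'<?xml version="1.0" encoding="UTF-8"?><urlset></urlset>'

real_gz = gzip.compress(xml)  # an actual gzip stream
renamed = xml                 # plain XML that merely carries a .gz name

print(looks_like_gzip(real_gz))  # True
print(looks_like_gzip(renamed))  # False
```

To check a file on disk, read it with `open(path, "rb").read(2)` and compare against the same two bytes.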
 

costas1

It's a file automatically generated by Mediawiki's maintenance scripts.
Are you sure it's not a gzip? When I download it with an FTP client, I can open it with 7zip without renaming it. 7zip shows that it contains an XML file, that opens fine with a web browser. I guess if it was not a gzip it wouldn't open the file at all.
 

garrettroyce

Look at these two files. The .xml file is the one I get from your site (but your site calls it .xml.gz, which is not true). The .xml.gz file I created on my computer using

Code:
gzip -c sitemap-Driverspedia-NS_0-0.xml > sitemap-Driverspedia-NS_0-0.xml.gz

The input file is from your site, but renamed .xml (not .xml.gz). The output file I uploaded here.

I had to add .txt to both files to get around the block on the forum. When you download these files, rename them so they don't have a .txt extension.

You can see the .xml file looks like a bunch of normal text. The .xml.gz file has a bunch of binary gibberish in it, so when you open it in notepad, it looks like it's encrypted. The gzip compression makes the file unreadable unless you run gzip to reverse the compression.

Look at the file size. The .xml.gz file is 1/4 of the original size.
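
The same comparison can be reproduced without the attachments. This is only a sketch using Python's standard gzip module, and the sitemap content here is invented (example.org URLs), so the exact sizes will differ from the real files:

```python
import gzip

# Hypothetical sitemap body: repetitive XML, which compresses well.
entries = "".join(
    f"<url><loc>https://example.org/page{i}</loc></url>" for i in range(40)
)
xml = f'<?xml version="1.0" encoding="UTF-8"?><urlset>{entries}</urlset>'.encode("utf-8")

compressed = gzip.compress(xml)         # same idea as `gzip -c file.xml`
restored = gzip.decompress(compressed)  # same idea as `gzip -d`

print(len(xml), len(compressed))  # the compressed size is much smaller
print(restored == xml)            # True: decompression recovers the XML
```

If a file that claims to be .xml.gz is about the same size as the plain XML and reads as text, it has most likely been decompressed somewhere along the way.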
 

Attachments

  • sitemap-Driverspedia-NS_0-0.xml.gz.txt
    451 bytes · Views: 2
  • sitemap-Driverspedia-NS_0-0.xml.txt
    2 KB · Views: 1

costas1

When I download it via FTP, 7zip opens the .gz and I can access the XML in it. The .gz is 1 KB.
When I download the same file with Firefox, its size is 3 KB, but it is corrupted and 7zip gives an error: "The archive is either in unknown format or damaged".

That's what happens for me.
 

garrettroyce

My browser is doing what you're seeing. Apparently, most, if not all, browsers will see a gzipped XML and unzip it automatically. They do the same for CSS, JS, etc. because it saves network traffic. I can't find out how to disable it, but I tried everything on my x10 account to force my browser to stop, and it won't.

The easiest solution is just to rename the file to an extension that doesn't exist, like ".xgz", and your browser handles it just fine. I'm not sure whether search engines will tolerate that.

Here's what I tried:
* Disable Content-Encoding header
* Set Content-Encoding to gzip
* Set content type to application/x-gzip, application/gzip, application/xml, text/xml, application/x-do-not-unzip-this
* Disable Vary header
* Disable Etag header
* Set Content-Disposition to attachment
* Set Content-Disposition filename to .xgz
* Disable gzip for this file (.htaccess <FilesMatch ".xml.gz$">SetEnv no-gzip 1</FilesMatch>)

In every case, it downloaded a 2 KB XML file named [something].xml.gz. It never downloaded a 1/2 KB gzip file.
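
The header experiments come down to one question: does the response still carry Content-Encoding: gzip? If it does, the client treats the gzip layer as transport compression and strips it before saving. Here is a rough Python sketch of that reasoning. It assumes browsers decode any body labelled with that header, which is typical but not guaranteed for every browser; `describe_response` and the sample headers are hypothetical.

```python
import gzip


def describe_response(headers: dict, body: bytes) -> str:
    """Guess what a browser will do with a .xml.gz response.

    Assumption: a client that sees "Content-Encoding: gzip" decodes the
    body before saving it, which is how browsers typically behave.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    encoding = lowered.get("content-encoding", "")
    is_gzip_body = body[:2] == b"\x1f\x8b"

    if encoding in ("gzip", "x-gzip") and is_gzip_body:
        return "decoded in transit: saved file is plain XML under a .gz name"
    if is_gzip_body:
        return "passed through: saved file is a real gzip archive"
    return "body is not gzip data at all"


body = gzip.compress(b"<urlset/>")  # stand-in for the sitemap on disk

with_encoding = describe_response(
    {"Content-Type": "application/x-gzip", "Content-Encoding": "gzip"}, body
)
without_encoding = describe_response({"Content-Type": "application/x-gzip"}, body)

print(with_encoding)
print(without_encoding)
```

To feed it real data, the headers can be read with `curl -I <url>` or with `urllib.request.urlopen(url).headers` in Python.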
 

costas1

I'm still not sure that this behavior has anything to do with the browser.
For example try this link: https://www.verif.com/sitemap/verif_sitemap1662.xml.gz

It's a random compressed sitemap I found online. I can download it and open it without a problem. It does not get corrupted. I'm using the same browsers that download my sitemap as a corrupted file.
 

garrettroyce

I figured out what I did wrong yesterday. I thought I disabled .gz file filtering, but I used the command wrong. After an hour, I was pretty fed up with troubleshooting this :banghead:

Put this in the .htaccess file in the folder where you have your sitemaps:

Code:
<FilesMatch "[.]xml[.]gz$">
    Header set Content-Type "application/x-gzip"
    Header unset Content-Encoding
    SetEnv no-gzip 1
    RemoveOutputFilter gz
</FilesMatch>

Some of these may not be required, but this is what got it working for me, so you can play with the directives if you think you need to. The important one appears to be RemoveOutputFilter.

So, the way this works is:
If a file name matches the pattern "[.]xml[.]gz$", the rules inside the block apply to it. In regular expressions the "$" has a special meaning: it anchors the match to the end, so the file name has to end with .xml.gz. Something like .xml.gz.exe will not match, because it doesn't end with .xml.gz exactly. The periods are inside square brackets because an unbracketed period also has a special meaning (it matches any single character). Inside square brackets, a period is treated as a literal period. I don't know if that's strictly necessary for .htaccess specifically, but it works and the square brackets don't hurt anything.
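
The difference the brackets make can be tried out with any regular expression engine. This sketch uses Python's `re` module rather than Apache itself, on the assumption that the two behave the same for a pattern this simple. The file names are invented.

```python
import re

bracketed = re.compile(r"[.]xml[.]gz$")  # the pattern from the .htaccess block
bare = re.compile(r".xml.gz$")           # periods left as wildcards

names = ["sitemap-0.xml.gz", "sitemap-0.xml.gz.exe", "sitemapxmlzgz"]

# Each entry is (name, matches bracketed pattern, matches bare pattern).
results = [
    (name, bool(bracketed.search(name)), bool(bare.search(name)))
    for name in names
]

for row in results:
    print(row)
# ('sitemap-0.xml.gz', True, True)
# ('sitemap-0.xml.gz.exe', False, False)
# ('sitemapxmlzgz', False, True)
```

The last row is the case the brackets guard against: without them, a name with no periods at all can still match.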
 