Suggestions for best way to build a MySQL table from raw data?

ParallelLogic

Member
Messages
35
Reaction score
0
Points
6
I have a lot of usernames stored in text files stored on my computer. I want to transfer these into a database I have already created. This is not a one time operation however, new usernames will be added to those text files every day, so I need a reliable way to transfer that data into the database.

I have two potential outlines for programs I need, but I don't know how to execute either one:

1. The program that is collecting the usernames and placing them in a file is written in Java (the programming language I know best). The program manually moves the mouse and types on the keyboard to navigate the site in order to collect info (so I can't use my computer over night, but that's not a problem). One solution I have thought of to populate the database would be to crate a PHP page and then have the Java program navigate to the PHP page and manually enter the data into text boxes and then hit an 'acept' button to send the data off to the database. In this instance I don't know how to make the data I'm sending 'safe'. The users have descriptions they can customize and so I'm worried that they have the option to enter harmful commands into their description that may interfere with the database when it is inserted. I also noticed that when I optputed the data from the database, it replaced ' with ' ' which I presume is its way of adding in an escape character (like backspash before a quote in Java). I als notice that if I try to insert a < into the databse, it seems to corrupt the description or something so that when I retrieve that data through PHP, the last thing it returns from the database is the character right before < and then stops executing, so I had to replace all < with the < code (but I'm concerned there may be other special characters or cases I am missing). If there was a built in function in PHP or Java that made text 'safe' for MySQL, that would be very helpful.

2. The other option I considered (but have absolutely no idea how to do) would be to connect directly between Java and PHP and send the data directly to the database. I don't know the necessary code for Java to interact online.

:happysad: Any help is much appreciated, thank you
 

xadrieth

New Member
Messages
62
Reaction score
1
Points
0
Can we see an example of the how your username text file is structured?

Maybe you might not need Java for this, just a PHP script.
 

ParallelLogic

Member
Messages
35
Reaction score
0
Points
6
Well, right now I have two different file types:
I have several files separated by first letter of username (Java can't handle a single file that has 2 million usernames in it apparently) with just one username on each line.

The other file I generated by hand and won't be the final format I will be using in the future. It has a username on one line, followed by the key words on the next line, followed by the description, a blank line, and then the next record.

Honestly, I'm pretty sure I can manipulate the files anyway I need to with Java (since the Java program will be the one collecting the data, I'll just format it however it needs to be done when it's saving the files in the future).
 

misson

Community Paragon
Community Support
Messages
2,572
Reaction score
72
Points
48
The program manually moves the mouse and types on the keyboard to navigate the site in order to collect info (so I can't use my computer over night, but that's not a problem). [...] have the Java program navigate to the PHP page and manually enter the data into text boxes and then hit an 'acept' button to send the data off to the database.

Programmatically controlling the console (mouse & keyboard) is okay when you're testing a UI, but is too brittle for other purposes. Layout changes, even changing the position of a widget by half of a dimension (width, height), will break your app.

Better approach: use java.net.URLConnection to download a webpage, javax.xml.parsers.DocumentBuilder to parse it and org.w3.dom package to extract the data you want. Then use URLConnection to post the data (or the file) in a batch to a form handler. The form handler, written in any server side scripting language you wish, inserts the data into the database. The file becomes unnecessary, but you may want to use it in case your spider crashes before it can post the new data.

If the MySQL server allows remote connections, you can cut out the PHP script and use MySQL Connector/J. Going this route lets you use java.sql.PreparedStatement to prevent SQL injection. Java Samples has a small tutorial on using PreparedStatement, should you want to see how to use it.

In this instance I don't know how to make the data I'm sending 'safe'. The users have descriptions they can customize and so I'm worried that they have the option to enter harmful commands into their description that may interfere with the database when it is inserted.
Use standard techniques to prevent SQL Injection: sanitization (in PHP, use the filter functions or the DB driver's quote function, such as mysqli::real_escape_string or mysql_escape_string) or prepared statements (e.g. mysqli::prepare or PDO::prepare). If you're going to display any of the fields in a web page, you'll need to prevent XSS attacks. Filter the fields with the FILTER_SANITIZE_SPECIAL_CHARS sanitization filter or filter HTML tags and attributes based on a whitelist. For the latter, you'll have to find a 3rd party function or write it yourself.

All sanitization should be handled by whatever inserts the data. Sanitizing in the java applet but inserting in a PHP script isn't secure.

I also noticed that when I optputed the data from the database, it replaced ' with ' ' which I presume is its way of adding in an escape character (like backspash before a quote in Java).
That is indeed one way of escaping a single quote, but you shouldn't worry about how a string should be altered to sanitize it, as you should rely on the DB driver's functions to make the data safe.

I als notice that if I try to insert a < into the databse, it seems to corrupt the description or something so that when I retrieve that data through PHP, the last thing it returns from the database is the character right before < and then stops executing, so I had to replace all < with the < code (but I'm concerned there may be other special characters or cases I am missing).

Are you sure the query stops? Check the page source; the "<" is probably being interpreted as the start of a tag. This demonstrates the need to filter data for HTML tags.

If there was a built in function in PHP or Java that made text 'safe' for MySQL, that would be very helpful.
For Java, use prepared statements. For PHP, use prepared statements or (failing that) the previously mentioned sanitization functions.
 
Last edited:

ParallelLogic

Member
Messages
35
Reaction score
0
Points
6
even changing the position of a widget by half of a dimension (width, height), will break your app.
I actually have an unrelated app that I've been running for over a year (once a week over night) that does just that, and it really is a hassle when the site's design changes :mad:. The problem is that I don't know the post codes (I also didn't know how to post directly either, so thank you for that link as well); I don't know how what varibles to post to...

I actually looked through the HTML code (this is for a different part of the same project, so please bare with me) and found this:
Code:
<label title="Separate with commas or spaces">Recipient(s):</label> &nbsp;

                            <input class="text l_4180" type="text" tabIndex="1" id="0a6312b48a7049f096c0" name="0a6312b48a7049f096c0" value="" size="30" maxlength="250" />
                            <input class="text l_4579" type="text" tabIndex="1" id="ee06772a189329e45768" name="ee06772a189329e45768" value="" size="30" maxlength="250" />
                            <input class="text l_9885" type="text" tabIndex="1" id="11cc16bdbe999b2c9990" name="11cc16bdbe999b2c9990" value="" size="30" maxlength="250" />
                            <input class="text l_3789" type="text" tabIndex="1" id="15505148bae1436d2a2f" name="15505148bae1436d2a2f" value="" size="30" maxlength="250" />
                            <input class="text l_9913" type="text" tabIndex="1" id="05f7ed476be430707109" name="05f7ed476be430707109" value="" size="30" maxlength="250" />
            
                            <select id="friends" name="friends" tabIndex="1" onchange="document.getElementById('05f7ed476be430707109').value = document.getElementById('friends').value;">
                    <option value="">-</option>
                                                <option value="buyer-1, catdevnull, elnino, F-City, F09, fourquestionmarks, hq, notagroup, r09, stinkbean, supergroup">Unsorted</option>

                                                        <option value="buyer-1">&nbsp;&nbsp;&nbsp;buyer-1</option>
                                                    <option value="catdevnull">&nbsp;&nbsp;&nbsp;catdevnull</option>
                                        </select>

                        <br /><br />
            <label for="8311ac1c525b03a71ac5">Subject:</label> &nbsp; &nbsp; &nbsp; &nbsp; <input class="text" type="text" id="8311ac1c525b03a71ac5" name="8311ac1c525b03a71ac5" value="" size="30" tabIndex="1" maxlength="80" />
            <br /><br />
            <span style="float: right;"><a href="http://comments.deviantart.com/emoticons" class="emoticon-link" rel="popup(name:emoticons,width:400,height:600)">Emoticon Legend</a></span>
            <label for="notebody">Message:</label><br />

            <textarea id="notebody" style="width: 100%;" tabIndex="1" name="body" rows="15" cols="40"></textarea>

[... more code ...]

.l_4180{display:none} .l_4579{display:none} .l_9885{display:none} .l_3789{display:none} .l_9913{}
</style>
This code outputs three things on the page in order to send a message from one user to another (I need to automate this process to notify people they have been added to the database):
Recipient(s): <textbox>
Subject: <textbox>
Message: <textbox>
and what I would like to do it use the "Then use URLConnection to post the data (or the file) in a batch to a form handler." that you described. Do you think I could get away with sending:
05f7ed476be430707109="ParallelLogic"
8311ac1c525b03a71ac5="TEST TITLE"
body="MESSAGE"
through the Java method you linked to? In other words can I replace the "string=" with the correct name and use the Java code?
Code:
OutputStreamWriter out = new OutputStreamWriter(
                              connection.getOutputStream());
	out.write("string=" + stringToReverse);
	out.close();
Do you know how I would send multiple variables through the Java method? Would I just use multiple write commands? The example only shows a single variable "string" that is being assigned, when compared to the three variables I'm working with.

Actually, after some experimentation, it seems I can't even connect with the notes system (for sending messages from one user to another -- I wouldn't be able to get the page source code I pasted above)) through Java, so I guess that's a moot point unless there's something I'm missing (I'm guessing Java can't handle the necessary cookies to be a virtual logged-in user?)

Then use URLConnection to post the data (or the file)
Alright, so back to the database communication, I would need to design a fairly simple PHP script that would take in the data Java sends to it (I'd prefer not to do MySQL to Java direct communication since I'm just starting out with PHP and MySQL). One thing I've been somewhat curious about is when we were talking on another page http://forums.x10hosting.com/programming-help/98997-php-mysql-communication-basics-2.html#post554241 about privacy and protection, it was mentioned that setting a PHP page to protection 0600 would keep average users from being able to access it to get the password -- but there was still some odd way to access the password potentially. I'm wondering if it might be possible to have the Java app send the PHP file the password so that way the password has to be known before being able to sign on as a user of the MySQL database.

Better approach: use java.net.URLConnection to download a webpage, javax.xml.parsers.DocumentBuilder to parse it and org.w3.dom package to extract the data you want.
Ah, thank you very much for the links. This kinda comes back to the issue I mentioned earlier however: I have to log in as a user in order to access the data, so I can't do this (at least not through Java) without some kind of cookie support (to show I'm [the Java program] is a valid virtual-user).

Sanitizing in the java applet but inserting in a PHP script isn't secure.
Ah, doubly noted.

Are you sure the query stops?
Well, what happens is I send in the query and have the PHP code echo all the results. The results stop getting sent back after the bad character.

For PHP, use prepared statements or
You mentioned FILTER_SANITIZE_SPECIAL_CHARS but that doesn't remove 'and " , I'm guessing I should also use FILTER_FLAG_ENCODE_HIGH ?

Thank you for the links, they have been very informative :happysad:
 
Last edited:

misson

Community Paragon
Community Support
Messages
2,572
Reaction score
72
Points
48
Do you think I could get away with sending:
05f7ed476be430707109="ParallelLogic"
8311ac1c525b03a71ac5="TEST TITLE"
body="MESSAGE"
through the Java method you linked to?
From the form source, the message should be stored in "notebody", not "body". As for the other fields, they might be dynamically generated when the page is, in which case the above probably won't work. What your applet would need to do is request the page, parse it and extract the field names, then use that info to construct a message for the form handler. Depending on how you design your classes, it's not as bad as it sounds.

Do you know how I would send multiple variables through the Java method? Would I just use multiple write commands? The example only shows a single variable "string" that is being assigned, when compared to the three variables I'm working with.
It depends on what content type you're using. I believe the default is "application/x-www-form-urlencoded" (if not, you can explicitly set the Content-type header), which requires sending the data as a URL-encoded query string:
Code:
OutputStreamWriter out = new OutputStreamWriter(
                              connection.getOutputStream());
HashMap<String,String> vars = new HashMap<String,String>;

out.write("05f7ed476be430707109 =" + URLEncode.encode(recipients, "UTF-8")
	+"&8311ac1c525b03a71ac5=" + URLEncode.encode(subject, "UTF-8")
	+"&notebody=" + URLEncode.encode(message, "UTF-8"));
For binary data and files, there's the multipart/form-data type.

Of course, you'll want to have a class that handles POST-ing form data to abstract away the HTTP details. Something like the following, which stores the form variables & values in a HashMap.
Code:
class HTTPForm {
    protected HashMap<String, String> vars;
    protected string url;
    ...
    HTTPForm(URL url) {
        this.url = url;
        vars = new HashMap<String, String>();
    }
    HTTPForm(String url) {
        this.(new URL(url));
    }
    HTTPForm(URL url, HashMap<String, String> vars) {
        this.url = url;
        this.vars = vars.clone();
    }

    public setVar(name, val) {
        vars.put(name, val);
    }
    public setURL(URL url) {
        this.url = url;
    }
    public static encode(HashMap<String, String> vars) {
        StringList varPairs = new StringList();
        vairPairs.ensureCapacity(vars.size());
        for (Map.Entry<String,String> var: vars.entrySet()) {
            varPairs.add(URLEncoder.encode(var.getKey() + '=' + var.getValue(), "UTF-8"));
        }
        return vairPairs.join("&");
    }

    public boolean send() {
        send(url, vars);
    }

    public static send(URL url, HashMap<String, String> vars) {
        String data = encode(vars);
        URLConnection connection = url.openConnection();
	connection.setDoOutput(true);
	OutputStreamWriter out = new OutputStreamWriter(
                              connection.getOutputStream());
	out.write(data);
	out.close();
        // now get & parse the response
        // ...
    }
}

....

// send like this:
msgHandlerURL=new URL("http://foospace.com/user/"+uid+"/send");

vars = new HashMap<String, String>();
vars.put('05f7ed476be430707109', recipients);
vars.put('8311ac1c525b03a71ac5', subject);
vars.put('notebody', message);
HTTPForm.send(msgHandlerURL, vars);

// or this:
HTTPForm msgForm = new HTTPForm(msgHandlerURL);
msgForm.setVar('05f7ed476be430707109', recipients);
msgForm.setVar('8311ac1c525b03a71ac5', subject);
msgForm.setVar('notebody', message);
msgForm.send();
You can easily overload the methods that take a URL to take a String, like the constructor HTTPForm(String).

If you want to examine interface designs, check out some of the interfaces for client-side SOAP and XML-RPC, though nothing beats Python's xmlrpclib, which turns native method calls into XML-RPC calls. It's pretty damn sweet.

Actually, after some experimentation, it seems I can't even connect with the notes system (for sending messages from one user to another -- I wouldn't be able to get the page source code I pasted above)) through Java, so I guess that's a moot point unless there's something I'm missing (I'm guessing Java can't handle the necessary cookies to be a virtual logged-in user?)
You have to take care of logging in and setting cookies yourself. See java.net.CookieHandler and a Java-and-Cookies how-to or two. Again, with a good design, it can all be quite simple to use and fairly simple to write.

One thing I've been somewhat curious about is when we were talking on another page http://forums.x10hosting.com/programming-help/98997-php-mysql-communication-basics-2.html#post554241 about privacy and protection, it was mentioned that setting a PHP page to protection 0600 would keep average users from being able to access it to get the password -- but there was still some odd way to access the password potentially. I'm wondering if it might be possible to have the Java app send the PHP file the password so that way the password has to be known before being able to sign on as a user of the MySQL database.
Mode 0600 keeps local users from reading the source code. For the most part, permissions won't apply in securing the file from remote users, as they won't be able to read the source directly. Note that another local user's website might have a security hole that allows reading arbitrary files, in which case mode 0600 will protect your script from that hole.

As for odd ways of accessing the password, imagine somehow a local user couldn't read the script but found a way to include it from another script (they couldn't, but imagine they could), or imagine that a visitor could somehow inject PHP code and get it interpreted (possible, but very unlikely), or suppose there's an exploit you haven't even thought of that lets a user get programmatic access to the script. A global variable or constant would be accessible in any of those cases. Take a page from OOP, the one that covers encapsulation and information hiding, to restrict programmatic access. You can use scope (storing the password as a local function variable) or visibility (storing the password in a private instance variable) to achieve information hiding. Of course, we still haven't addressed memory dumps, though only the superuser should be able to access the memory of other users' processes.

Are you sure the query stops?

Well, what happens is I send in the query and have the PHP code echo all the results. The results stop getting sent back after the bad character.
Be sure to check the page source or have the PHP script output plain text rather than HTML (after setting the Content-type header). Does the DB driver error function (eg mysql_error, mysqli::error) report anything? A "<" shouldn't matter to MySQL, but it will matter to an HTML processor. Also check the size of the description field in the table. If you only checked one entry, it's possible the occurrence of the "<" is a coincidence and the value is being clipped due to size.

You mentioned FILTER_SANITIZE_SPECIAL_CHARS but that doesn't remove 'and " , I'm guessing I should also use FILTER_FLAG_ENCODE_HIGH ?
Use something like filter_var() with the FILTER_SANITIZE_SPECIAL_CHARS filter to handle HTML and prevent XSS, but also use prepared statements or a DB string escape function to prevent SQL injection.
 
Top