curl_close() is similar in many ways to mysql_close(): curl_close() is for when you no longer need to fetch resources using a curl session, just as mysql_close() is for when you're done with a MySQL connection. The counterpart to curl_close() is curl_init() (like mysql_close()/mysql_connect()), and the number of calls to curl_close() must be no greater than the number of calls to curl_init(). You can reuse a curl session (as you can reuse MySQL connections) by simply calling curl_setopt() to set a new URL (along with any other curl options), followed by curl_exec(). Also, when a curl session is garbage collected, it gets closed. As a consequence, curl_close() doesn't necessarily gain you much. If your script has stages where it doesn't need to fetch anything, closing the curl session early may help; if you only have one curl session or need it throughout the script, there isn't much benefit.
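A sketch of that reuse pattern (the file:// URLs to two temp files are stand-ins for real http:// pages):

```php
<?php
// Stand-ins for two remote pages.
$pageA = tempnam(sys_get_temp_dir(), 'crawl');
$pageB = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($pageA, 'first page');
file_put_contents($pageB, 'second page');

// Open one curl session and reuse it for several fetches.
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body rather than printing it

curl_setopt($ch, CURLOPT_URL, 'file://' . $pageA);
$first = curl_exec($ch);

// Reuse the same session: just point it at a new URL and exec again.
curl_setopt($ch, CURLOPT_URL, 'file://' . $pageB);
$second = curl_exec($ch);

curl_close($ch); // only needed once, when you're done fetching
echo $first, "\n", $second, "\n";
```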
The World Wide Web forms a graph: resources (anything with a URL) are nodes, and links are edges. Resources that can't contain anchors (such as images) are leaves. You want to traverse a portion of this graph, which leads to two classic algorithms: breadth-first search (BFS) and depth-first search (DFS). In the former, you process all nodes at a given distance from the starting node before processing nodes further out; in the latter, you fully process one branch before moving on to the next. The two are very similar. Here's an outline that covers both:
- Put the root node in the list N
- While there's a node left in list N:
  - remove the next node and store it as the current node
  - (preorder) process the current node
  - add each child of the current node to the list N
  - (postorder) process the current node
The main difference between a BFS and a DFS is the data structure used to hold the list of nodes to process: a BFS uses a queue (first in, first out) and a DFS uses a stack (last in, first out). PHP doesn't have dedicated queue and stack types in its core (SPL provides SplQueue and SplStack), but plain arrays work fine: array_push() paired with array_shift() gives a queue, and array_push() paired with array_pop() gives a stack. You can also implement DFS recursively, in which case the node list is the call stack. This gives you a fresh copy of every local variable for each node, which may or may not be desirable; those per-call copies make the call stack the main source of memory usage in a recursive DFS. To reduce memory usage, avoid the call stack and reuse whichever resources you can (such as the DOMDocument and curl session).
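A minimal sketch of both traversals over a made-up array-based tree, where the only difference is array_shift() (queue) versus array_pop() (stack):

```php
<?php
// Tree as adjacency list: node => children. Purely illustrative data.
$tree = [
    'root' => ['a', 'b'],
    'a'    => ['a1', 'a2'],
    'b'    => ['b1'],
    'a1'   => [], 'a2' => [], 'b1' => [],
];

function traverse(array $tree, string $root, bool $bfs): array {
    $order = [];
    $nodes = [$root];                      // the list N from the outline
    while (count($nodes) > 0) {
        // Queue (shift) gives BFS; stack (pop) gives DFS.
        $current = $bfs ? array_shift($nodes) : array_pop($nodes);
        $order[] = $current;               // (preorder) process current node
        foreach ($tree[$current] as $child) {
            array_push($nodes, $child);
        }
    }
    return $order;
}

echo implode(' ', traverse($tree, 'root', true)), "\n";   // BFS: root a b a1 a2 b1
echo implode(' ', traverse($tree, 'root', false)), "\n";  // DFS: root b b1 a a2 a1
```

Note that the stack version visits each node's children in reverse order; that's still a depth-first traversal.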
Another axis along which traversal algorithms differ is the point at which additional node processing (i.e., anything beyond adding its children to the list) is performed: before, during, or after adding the children, also called pre-, in-, and post-order traversal. Pre- and post-order are marked in the outline above. In your case, extracting image URLs and adding them to the DB counts as additional processing. You can reuse the curl session and DOMDocument with either pre- or post-order traversal.
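That additional-processing step, extracting image URLs from a fetched page, might look like this (the inline HTML string stands in for a body returned by curl_exec(), and the echo stands in for the DB insert):

```php
<?php
// Stand-in for a page body fetched via curl_exec().
$html = '<html><body><img src="/a.png"><p><img src="/b.jpg"></p></body></html>';

$doc = new DOMDocument();
// @ suppresses warnings about the imperfect markup found on real pages.
@$doc->loadHTML($html);

$urls = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $urls[] = $img->getAttribute('src');
}
// Here you'd INSERT the URLs into your DB; echoing stands in for that.
echo implode("\n", $urls), "\n";
```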
The linked articles have more specific information about tree traversal.
If you can rely on allow_url_fopen being enabled, you don't need curl at all: simply pass the URL to DOMDocument::load().
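For instance (a local temp-file path stands in for a remote URL here; with allow_url_fopen on, an http:// URL works the same way — though note that DOMDocument::load() expects XML, so for real-world HTML pages you'd use DOMDocument::loadHTMLFile() instead):

```php
<?php
// A local XML file standing in for a remote resource.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, '<?xml version="1.0"?><page><img src="/a.png"/></page>');

$doc = new DOMDocument();
$doc->load($path); // with allow_url_fopen, this could be an http:// URL

echo $doc->getElementsByTagName('img')->item(0)->getAttribute('src'), "\n";
```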
One other issue is that BFS and DFS are designed to work on trees, which are connected graphs without cycles. Since the web most decidedly has cycles, you'll have to break them somehow; recording visited URLs in a set does the job. Here's the pseudocoded algorithm, updated to handle cycles:
Code:
set Seen to {}
add root to Seen and Nodes
while size(Nodes) > 0:
    remove next element of Nodes and store it as current
    [additional (preorder) processing of current]
    for each child of current:
        if child is not in Seen:
            add child to Seen and Nodes
    [additional (postorder) processing of current]
In PHP, you can use an associative array as a set of URLs. Mapping set operations to array operations:
- $item is in $Set := isset($Set[$item])
- add $item to $Set := $Set[$item] = true
- remove $item from $Set := unset($Set[$item])
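Putting the pseudocode and the array-as-set mapping together, a sketch over a made-up link graph that contains a cycle (b links back to root):

```php
<?php
// Link graph with a cycle. Illustrative data only.
$links = [
    'root' => ['a', 'b'],
    'a'    => ['b'],
    'b'    => ['root', 'c'],
    'c'    => [],
];

$seen    = ['root' => true];  // associative array used as the Seen set
$nodes   = ['root'];
$visited = [];

while (count($nodes) > 0) {
    $current = array_shift($nodes);       // queue -> BFS
    $visited[] = $current;                // [additional processing]
    foreach ($links[$current] as $child) {
        if (!isset($seen[$child])) {      // "$child is in $Seen" test
            $seen[$child] = true;         // add to Seen...
            array_push($nodes, $child);   // ...and to Nodes
        }
    }
}
echo implode(' ', $visited), "\n";        // each node visited exactly once
```

Without the Seen check, the root -> b -> root cycle would keep the loop running forever.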